metadata

language:
  - en
license: apache-2.0
datasets:
  - Open-Orca/SlimOrca
base_model: mistralai/Mistral-7B-v0.1
model-index:
  - name: mistral-11b-slimorca
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 64.25
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/mistral-11b-slimorca
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 83.81
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/mistral-11b-slimorca
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 63.66
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/mistral-11b-slimorca
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 54.66
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/mistral-11b-slimorca
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 77.98
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/mistral-11b-slimorca
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 52.39
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/mistral-11b-slimorca
          name: Open LLM Leaderboard

Full weight fine tuned on two epochs of SlimOrca. Uses Mistral Instruct's prompt format.

The base model for this came from a variation on Undi's Mistral 11B recipe. The o_proj and down_proj tensors were set to zero in the added layers, making the output exactly identical to Mistral 7B before training.

~~Benchmarks look good locally but still evaluating actual usefulness.~~ Update: this turned out great! 10/10 would recommend as a training approach.

Reproducing

This mergekit config was used to produce the base model:

slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 24]
  - sources: # add middle layers with residuals scaled to zero
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [8, 24]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [24, 32]
merge_method: passthrough
dtype: bfloat16

The axolotl config for fine tuning is available here.

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	66.12
AI2 Reasoning Challenge (25-Shot)	64.25
HellaSwag (10-Shot)	83.81
MMLU (5-Shot)	63.66
TruthfulQA (0-shot)	54.66
Winogrande (5-shot)	77.98
GSM8k (5-shot)	52.39