---
language:
  - en
license: mit
library_name: adapter-transformers
datasets:
  - argilla/distilabel-intel-orca-dpo-pairs
  - jondurbin/truthy-dpo-v0.1
  - argilla/distilabel-math-preference-dpo
  - argilla/distilabel-capybara-dpo-7k-binarized
base_model: Technoculture/MT7Bi-sft
model-index:
  - name: MedMerge-6-7b-alpha-dpo
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 54.27
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MedMerge-6-7b-alpha-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 75.6
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MedMerge-6-7b-alpha-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 52.65
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MedMerge-6-7b-alpha-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 43.94
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MedMerge-6-7b-alpha-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 71.03
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MedMerge-6-7b-alpha-dpo
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 26.16
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MedMerge-6-7b-alpha-dpo
          name: Open LLM Leaderboard
---

# Technoculture/MedMerge-6-7b-alpha-dpo

[Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=Technoculture/MedMerge-6-7b-alpha-dpo)

*(figure: benchmark comparison plot)*

| Model Name | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---|---|---|---|---|---|
| Orca-2-7b | 78.4 | 76.1 | 53.7 | 52.4 | 74.2 | 47.2 |
| LLAMA-2-7b | 43.2 | 77.1 | 44.4 | 38.7 | 69.5 | 16 |
| MT7Bi-sft | 54.1 | 75.11 | - | 43.08 | 72.14 | 15.54 |
| MedMerge-6-7b | 29.52 | 41.04 | - | 37.53 | 59.35 | 0.91 |
| MedMerge-6-7b-alpha-dpo | 54.27 | 75.6 | 52.65 | 43.94 | 71.03 | 26.16 |
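The leaderboard "Avg." reported below is simply the arithmetic mean of these six benchmark scores; a minimal sketch reproducing it from the MedMerge-6-7b-alpha-dpo row above:

```python
# Open LLM Leaderboard scores for MedMerge-6-7b-alpha-dpo (copied from the table above)
scores = {
    "ARC": 54.27,
    "HellaSwag": 75.6,
    "MMLU": 52.65,
    "TruthfulQA": 43.94,
    "Winogrande": 71.03,
    "GSM8K": 26.16,
}

# The leaderboard "Avg." is the plain arithmetic mean of the six benchmarks
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 53.94
```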

## Training Details

- **GPU**: Nvidia A100 Tensor Core GPU
- **Total Batches**: 4266
- **Epochs**: 3
- **Duration**: 3 hours, 57 minutes
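A rough throughput figure can be back-computed from these numbers; a quick sketch, assuming "Total Batches" counts every optimizer step across the full 3 h 57 m run (the card does not state this explicitly):

```python
# Reported training figures (see the bullets above)
total_batches = 4266
duration_s = 3 * 3600 + 57 * 60  # 3 h 57 m = 14220 s

# Assumption: "Total Batches" spans all 3 epochs of the run
batches_per_second = total_batches / duration_s
print(f"{batches_per_second:.2f} batches/s")  # 0.30 batches/s on the A100
```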

## DPO Training Dataset Mixture

| Dataset Name | Original Size (Rows) | Ratio | Size After Ratio (Rows) |
|---|---|---|---|
| argilla/distilabel-math-preference-dpo | 2.4k | 1.0 | 2.4k |
| argilla/distilabel-intel-orca-dpo-pairs | 12.9k | 0.5 | 6.45k |
| jondurbin/truthy-dpo-v0.1 | 1.04k | 1.0 | 1.04k |
| argilla/distilabel-capybara-dpo-7k-binarized | 7.5k | 0.2 | 1.5k |

**Total Size:** 11.38k
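The subsampling in the mixture table can be expressed directly; a minimal sketch using the row counts and ratios above, with sizes in thousands of rows (the computed total comes out to about 11.39k, versus the reported 11.38k, a small rounding difference in the per-dataset sizes):

```python
# (dataset, rows_in_thousands, sampling_ratio) taken from the mixture table
mixture = [
    ("argilla/distilabel-math-preference-dpo", 2.4, 1.0),
    ("argilla/distilabel-intel-orca-dpo-pairs", 12.9, 0.5),
    ("jondurbin/truthy-dpo-v0.1", 1.04, 1.0),
    ("argilla/distilabel-capybara-dpo-7k-binarized", 7.5, 0.2),
]

# Size after ratio = original size * sampling ratio, per dataset
sampled = {name: rows * ratio for name, rows, ratio in mixture}
total = sum(sampled.values())
print(f"total: {total:.2f}k rows")
```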

## Training Loss Plot

*(figure: training loss)*

## Training Loss Smoothed Plot

*(figure: smoothed training loss)*
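For reference, the loss tracked in these plots is the standard DPO objective (Rafailov et al.); the value of $\beta$ and the choice of reference policy $\pi_{\mathrm{ref}}$ come from the training configuration in the notebook and are not stated on this card:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the chosen and rejected responses in each preference pair and $\beta$ controls how far the policy may drift from the reference (SFT) model.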

For full details of this DPO training, please read our notebook.

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_Technoculture__MedMerge-6-7b-alpha-dpo).

| Metric | Value |
|---|---|
| Avg. | 53.94 |
| AI2 Reasoning Challenge (25-Shot) | 54.27 |
| HellaSwag (10-Shot) | 75.60 |
| MMLU (5-Shot) | 52.65 |
| TruthfulQA (0-shot) | 43.94 |
| Winogrande (5-shot) | 71.03 |
| GSM8k (5-shot) | 26.16 |