---
license: cc
library_name: transformers
datasets:
- jondurbin/truthy-dpo-v0.1
model-index:
- name: MBX-7B-v3-DPO
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 73.55
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/MBX-7B-v3-DPO
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 89.11
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/MBX-7B-v3-DPO
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 64.91
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/MBX-7B-v3-DPO
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 74.0
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/MBX-7B-v3-DPO
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 85.56
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/MBX-7B-v3-DPO
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 69.67
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/MBX-7B-v3-DPO
      name: Open LLM Leaderboard
---

	# MBX-7B-v3-DPO

This model is a fine-tune of [flemmingmiguel/MBX-7B-v3](https://huggingface.co/flemmingmiguel/MBX-7B-v3) using the [jondurbin/truthy-dpo-v0.1](https://huggingface.co/datasets/jondurbin/truthy-dpo-v0.1) dataset.

	![MBX-v3-orca](MBX-v3-orca.png)

	## Code Example

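```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("macadeliccc/MBX-7B-v3-DPO")
model = AutoModelForCausalLM.from_pretrained("macadeliccc/MBX-7B-v3-DPO")

messages = [
    {"role": "system", "content": "Respond to the user's request like a pirate"},
    {"role": "user", "content": "Can you write me a quicksort algorithm?"},
]
# Render the conversation with the tokenizer's chat template and tokenize it
gen_input = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

# Generate a reply (256 new tokens is an arbitrary cap) and decode only the new tokens
output = model.generate(gen_input, max_new_tokens=256)
print(tokenizer.decode(output[0][gen_input.shape[-1]:], skip_special_tokens=True))
```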

	## Example Output

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/6455cc8d679315e4ef16fbec/g5_PTJhGJAcG88wmZz1IO.png)

	## GGUF

GGUF quantizations are available [here](https://huggingface.co/macadeliccc/MBX-7B-v3-DPO-GGUF/tree/main).

	## Exllamav2

ExLlamaV2 quants are available from bartowski [here](https://huggingface.co/bartowski/MBX-7B-v3-DPO-exl2).

Download the size you want from the branches below. VRAM figures are estimates.

| Branch | Bits | lm_head bits | VRAM (4k) | VRAM (16k) | VRAM (32k) | Description |
| ----- | ---- | ------- | ------ | ------ | ------ | ------------ |
| [8_0](https://huggingface.co/bartowski/MBX-7B-v3-DPO-exl2/tree/8_0) | 8.0 | 8.0 | 8.4 GB | 9.8 GB | 11.8 GB | Maximum quality that ExLlamaV2 can produce, near unquantized performance. |
| [6_5](https://huggingface.co/bartowski/MBX-7B-v3-DPO-exl2/tree/6_5) | 6.5 | 8.0 | 7.2 GB | 8.6 GB | 10.6 GB | Very similar to 8.0, good tradeoff of size vs performance, recommended. |
| [5_0](https://huggingface.co/bartowski/MBX-7B-v3-DPO-exl2/tree/5_0) | 5.0 | 6.0 | 6.0 GB | 7.4 GB | 9.4 GB | Slightly lower quality vs 6.5, but usable on 8GB cards. |
| [4_25](https://huggingface.co/bartowski/MBX-7B-v3-DPO-exl2/tree/4_25) | 4.25 | 6.0 | 5.3 GB | 6.7 GB | 8.7 GB | GPTQ equivalent bits per weight, slightly higher quality. |
| [3_5](https://huggingface.co/bartowski/MBX-7B-v3-DPO-exl2/tree/3_5) | 3.5 | 6.0 | 4.7 GB | 6.1 GB | 8.1 GB | Lower quality, only use if you have to. |
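
As a rough aid for choosing a branch, the sketch below looks up the largest quant whose estimated VRAM fits a given budget. The numbers are copied from the table above; the helper name `pick_branch` is hypothetical, and the estimates leave no headroom for other processes on the GPU.

```python
# VRAM estimates (GB) copied from the table above, keyed by branch name.
# Each tuple is (4k, 16k, 32k) context.
VRAM_GB = {
    "8_0": (8.4, 9.8, 11.8),
    "6_5": (7.2, 8.6, 10.6),
    "5_0": (6.0, 7.4, 9.4),
    "4_25": (5.3, 6.7, 8.7),
    "3_5": (4.7, 6.1, 8.1),
}
CONTEXTS = (4096, 16384, 32768)


def pick_branch(vram_gb, context=4096):
    """Return the highest-bitrate branch whose estimate fits, or None."""
    col = CONTEXTS.index(context)
    # The dict is ordered from highest to lowest bits per weight.
    for branch, estimates in VRAM_GB.items():
        if estimates[col] <= vram_gb:
            return branch
    return None


print(pick_branch(8.0))          # 6_5
print(pick_branch(8.0, 32768))   # None
print(pick_branch(12.0, 32768))  # 8_0
```

The returned branch name can then be passed as the `revision` argument when downloading, for example with `huggingface_hub.snapshot_download`.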

	## Evaluations

	## EQ-Bench Comparison

	<pre>----Benchmark Complete----
	2024-01-30 15:22:18
	Time taken: 145.9 mins
	Prompt Format: ChatML
	Model: macadeliccc/MBX-7B-v3-DPO
	Score (v2): 74.32
	Parseable: 166.0
	---------------
	Batch completed
	Time taken: 145.9 mins
	---------------
	</pre>
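
The run above reports `Prompt Format: ChatML`. For reference, here is a minimal sketch of how a ChatML prompt is laid out. The helper `to_chatml` is hypothetical, and whether this model's bundled chat template produces exactly this string is an assumption not confirmed by this card.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = to_chatml([
    {"role": "system", "content": "Respond to the user's request like a pirate"},
    {"role": "user", "content": "Can you write me a quicksort algorithm?"},
])
print(prompt)
```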

	### Original Model
	<pre>----Benchmark Complete----
	2024-01-31 01:26:26
	Time taken: 89.1 mins
	Prompt Format: Mistral
	Model: flemmingmiguel/MBX-7B-v3
	Score (v2): 73.87
	Parseable: 168.0
	---------------
	Batch completed
	Time taken: 89.1 mins
	---------------
	</pre>
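
To compare the two runs, the sketch below parses the `Key: value` lines of each summary and prints the score difference. The parser `parse_eq_bench` is a hypothetical helper, not part of EQ-Bench. Note that the two runs used different prompt formats (ChatML vs Mistral), so the comparison is not strictly like for like.

```python
def parse_eq_bench(log):
    """Collect 'Key: value' lines from an EQ-Bench summary into a dict."""
    fields = {}
    for line in log.splitlines():
        key, sep, value = line.partition(": ")
        key = key.strip()
        if sep and key not in fields:  # keep the first 'Time taken' entry
            fields[key] = value.strip()
    return fields


dpo = parse_eq_bench("""Prompt Format: ChatML
Model: macadeliccc/MBX-7B-v3-DPO
Score (v2): 74.32
Parseable: 166.0""")
base = parse_eq_bench("""Prompt Format: Mistral
Model: flemmingmiguel/MBX-7B-v3
Score (v2): 73.87
Parseable: 168.0""")

delta = float(dpo["Score (v2)"]) - float(base["Score (v2)"])
print(f"{delta:+.2f}")  # +0.45
```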

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|------:|------:|---------:|-------:|------:|
| [MBX-7B-v3-DPO](https://huggingface.co/macadeliccc/MBX-7B-v3-DPO) | 45.16 | 77.73 | 74.62 | 48.83 | 61.58 |

	### AGIEval
| Task | Version | Metric | Value | | Stderr |
|------------------------------|------:|--------|----:|---|-----:|
| agieval_aqua_rat | 0 | acc | 27.95 | ± | 2.82 |
| | | acc_norm | 26.77 | ± | 2.78 |
| agieval_logiqa_en | 0 | acc | 41.01 | ± | 1.93 |
| | | acc_norm | 40.55 | ± | 1.93 |
| agieval_lsat_ar | 0 | acc | 25.65 | ± | 2.89 |
| | | acc_norm | 23.91 | ± | 2.82 |
| agieval_lsat_lr | 0 | acc | 50.78 | ± | 2.22 |
| | | acc_norm | 52.94 | ± | 2.21 |
| agieval_lsat_rc | 0 | acc | 66.54 | ± | 2.88 |
| | | acc_norm | 65.80 | ± | 2.90 |
| agieval_sat_en | 0 | acc | 77.67 | ± | 2.91 |
| | | acc_norm | 77.67 | ± | 2.91 |
| agieval_sat_en_without_passage | 0 | acc | 43.20 | ± | 3.46 |
| | | acc_norm | 43.20 | ± | 3.46 |
| agieval_sat_math | 0 | acc | 32.27 | ± | 3.16 |
| | | acc_norm | 30.45 | ± | 3.11 |

	Average: 45.16%

	### GPT4All
| Task | Version | Metric | Value | | Stderr |
|-------------|------:|--------|----:|---|-----:|
| arc_challenge | 0 | acc | 68.43 | ± | 1.36 |
| | | acc_norm | 68.34 | ± | 1.36 |
| arc_easy | 0 | acc | 87.54 | ± | 0.68 |
| | | acc_norm | 82.11 | ± | 0.79 |
| boolq | 1 | acc | 88.20 | ± | 0.56 |
| hellaswag | 0 | acc | 69.76 | ± | 0.46 |
| | | acc_norm | 87.40 | ± | 0.33 |
| openbookqa | 0 | acc | 40.20 | ± | 2.19 |
| | | acc_norm | 49.60 | ± | 2.24 |
| piqa | 0 | acc | 83.68 | ± | 0.86 |
| | | acc_norm | 85.36 | ± | 0.82 |
| winogrande | 0 | acc | 83.11 | ± | 1.05 |

	Average: 77.73%
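
The card does not say how this suite average is computed. One rule that reproduces the reported figure is to take `acc_norm` for each task where it is reported and fall back to `acc` otherwise; the sketch below assumes that rule, with values copied from the GPT4All table. The function name `suite_average` is hypothetical.

```python
# (acc, acc_norm) per task, copied from the GPT4All table; None = not reported.
GPT4ALL = {
    "arc_challenge": (68.43, 68.34),
    "arc_easy": (87.54, 82.11),
    "boolq": (88.20, None),
    "hellaswag": (69.76, 87.40),
    "openbookqa": (40.20, 49.60),
    "piqa": (83.68, 85.36),
    "winogrande": (83.11, None),
}


def suite_average(results):
    """Mean over tasks, preferring acc_norm and falling back to acc."""
    scores = [
        acc if acc_norm is None else acc_norm
        for acc, acc_norm in results.values()
    ]
    return sum(scores) / len(scores)


print(round(suite_average(GPT4ALL), 2))  # 77.73
```

Applied to the `acc_norm` rows of the AGIEval table, the same rule gives 45.16, which also matches the reported average.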

	### TruthfulQA
| Task | Version | Metric | Value | | Stderr |
|-------------|------:|------|----:|---|-----:|
| truthfulqa_mc | 1 | mc1 | 58.87 | ± | 1.72 |
| | | mc2 | 74.62 | ± | 1.44 |

	Average: 74.62%
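
The `Stderr` column in these tables can be turned into an approximate confidence interval. The sketch below assumes the column is a standard error in percentage points and uses a normal approximation (z = 1.96 for roughly 95%); the helper `confidence_interval` is hypothetical.

```python
def confidence_interval(value, stderr, z=1.96):
    """Approximate 95% interval under a normal approximation."""
    half_width = z * stderr
    return value - half_width, value + half_width


# mc2 score and standard error from the TruthfulQA table
low, high = confidence_interval(74.62, 1.44)
print(f"{low:.2f} to {high:.2f}")  # 71.80 to 77.44
```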

	### Bigbench
| Task | Version | Metric | Value | | Stderr |
|------------------------------------------------|------:|---------------------|----:|---|-----:|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 60.00 | ± | 3.56 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 63.14 | ± | 2.51 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 47.67 | ± | 3.12 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 22.56 | ± | 2.21 |
| | | exact_str_match | 0.84 | ± | 0.48 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 33.20 | ± | 2.11 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 23.00 | ± | 1.59 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 59.67 | ± | 2.84 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 47.40 | ± | 2.24 |
| bigbench_navigate | 0 | multiple_choice_grade | 56.10 | ± | 1.57 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 71.25 | ± | 1.01 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 56.47 | ± | 2.35 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 35.27 | ± | 1.51 |
| bigbench_snarks | 0 | multiple_choice_grade | 73.48 | ± | 3.29 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 75.46 | ± | 1.37 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 52.10 | ± | 1.58 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 22.64 | ± | 1.18 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 19.83 | ± | 0.95 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 59.67 | ± | 2.84 |

	Average: 48.83%

	Average score: 61.58%

Elapsed time: 02:37:39

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_macadeliccc__MBX-7B-v3-DPO).

| Metric | Value |
|---------------------------------|----:|
| Avg. | 76.13 |
| AI2 Reasoning Challenge (25-Shot) | 73.55 |
| HellaSwag (10-Shot) | 89.11 |
| MMLU (5-Shot) | 64.91 |
| TruthfulQA (0-shot) | 74.00 |
| Winogrande (5-shot) | 85.56 |
| GSM8k (5-shot) | 69.67 |
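
As a quick sanity check, the `Avg.` row is consistent with an unweighted mean of the six benchmark scores listed in the table. The snippet below is a small sketch of that arithmetic using the table's values.

```python
# Scores copied from the Open LLM Leaderboard table.
leaderboard = {
    "AI2 Reasoning Challenge (25-Shot)": 73.55,
    "HellaSwag (10-Shot)": 89.11,
    "MMLU (5-Shot)": 64.91,
    "TruthfulQA (0-shot)": 74.00,
    "Winogrande (5-shot)": 85.56,
    "GSM8k (5-shot)": 69.67,
}

# Unweighted mean across the six benchmarks
average = sum(leaderboard.values()) / len(leaderboard)
print(round(average, 2))  # 76.13
```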