Adding Evaluation Results (#3)

310d540 verified 5 months ago

2.2 kB

	CAMEL-13B-Combined-Data is a chat large language model obtained by finetuning LLaMA-13B model on a total of 229K conversations collected through our [CAMEL](https://arxiv.org/abs/2303.17760) framework, 100K English public conversations from ShareGPT that can be found [here](https://github.com/lm-sys/FastChat/issues/90#issuecomment-1493250773), and 52K instructions from Alpaca dataset that can be found [here](https://github.com/tatsu-lab/stanford_alpaca/blob/761dc5bfbdeeffa89b8bff5d038781a4055f796a/alpaca_data.json). We evaluate our model offline using EleutherAI's language model evaluation harness used by Huggingface's Open LLM Benchmark. CAMEL<sup>*</sup>-13B scores an average of 58.9.

	\| Model \| size \| ARC-C (25 shots, acc_norm) \| HellaSwag (10 shots, acc_norm) \| MMLU (5 shots, acc_norm) \| TruthfulQA (0 shot, mc2) \| Average \| Delta \|
	\|-------------\|:----:\|:---------------------------:\|:-------------------------------:\|:-------------------------:\|:-------------------------:\|:-------:\|-------\|
	\| LLaMA \| 13B \| 56.3 \| 80.9 \| 46.7 \| 39.9 \| 56.0 \| - \|
	\| Vicuna \| 13B \| 52.8 \| 80.1 \| 50.5 \| 51.8 \| 58.8 \| 2.8 \|
	\| CAMEL<sup>*</sup> \| 13B \| 56.1 \| 79.9 \| 50.5 \| 49.0 \| 58.9 \| 2.9 \|

	---
	license: cc-by-nc-4.0
	---

	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_camel-ai__CAMEL-13B-Combined-Data)

	\| Metric \| Value \|
	\|-----------------------\|---------------------------\|
	\| Avg. \| 46.07 \|
	\| ARC (25-shot) \| 55.63 \|
	\| HellaSwag (10-shot) \| 79.25 \|
	\| MMLU (5-shot) \| 49.74 \|
	\| TruthfulQA (0-shot) \| 47.42 \|
	\| Winogrande (5-shot) \| 75.45 \|
	\| GSM8K (5-shot) \| 7.13 \|
	\| DROP (3-shot) \| 7.86 \|

	CAMEL-13B-Combined-Data is a chat large language model obtained by finetuning LLaMA-13B model on a total of 229K conversations collected through our [CAMEL](https://arxiv.org/abs/2303.17760) framework, 100K English public conversations from ShareGPT that can be found [here](https://github.com/lm-sys/FastChat/issues/90#issuecomment-1493250773), and 52K instructions from Alpaca dataset that can be found [here](https://github.com/tatsu-lab/stanford_alpaca/blob/761dc5bfbdeeffa89b8bff5d038781a4055f796a/alpaca_data.json). We evaluate our model offline using EleutherAI's language model evaluation harness used by Huggingface's Open LLM Benchmark. CAMEL<sup>*</sup>-13B scores an average of 58.9.

	\| Model \| size \| ARC-C (25 shots, acc_norm) \| HellaSwag (10 shots, acc_norm) \| MMLU (5 shots, acc_norm) \| TruthfulQA (0 shot, mc2) \| Average \| Delta \|
	\|-------------\|:----:\|:---------------------------:\|:-------------------------------:\|:-------------------------:\|:-------------------------:\|:-------:\|-------\|
	\| LLaMA \| 13B \| 56.3 \| 80.9 \| 46.7 \| 39.9 \| 56.0 \| - \|
	\| Vicuna \| 13B \| 52.8 \| 80.1 \| 50.5 \| 51.8 \| 58.8 \| 2.8 \|
	\| CAMEL<sup>*</sup> \| 13B \| 56.1 \| 79.9 \| 50.5 \| 49.0 \| 58.9 \| 2.9 \|

	---
	license: cc-by-nc-4.0
	---

	# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
	Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_camel-ai__CAMEL-13B-Combined-Data)

	\| Metric \| Value \|
	\|-----------------------\|---------------------------\|
	\| Avg. \| 46.07 \|
	\| ARC (25-shot) \| 55.63 \|
	\| HellaSwag (10-shot) \| 79.25 \|
	\| MMLU (5-shot) \| 49.74 \|
	\| TruthfulQA (0-shot) \| 47.42 \|
	\| Winogrande (5-shot) \| 75.45 \|
	\| GSM8K (5-shot) \| 7.13 \|
	\| DROP (3-shot) \| 7.86 \|