Evaluation of Phi-3.5 on the long-context BABILong benchmark

#12 · opened by yurakuratov

Hi! I want to share some promising results of the new Phi-3.5 models on the BABILong benchmark, which evaluates a model's ability to reason over multiple facts distributed across long contexts.

The new Phi-3.5 models show substantially better results on BABILong than their predecessors and than larger competing models. Phi-3.5-mini-instruct improves markedly at context lengths up to 32K, outperforming Phi-3-mini-128k-instruct. Phi-3.5-MoE-instruct matches the performance of Phi-3-medium-128k-instruct with far fewer active parameters (6.6B vs 14B), and comes very close to Meta-Llama-3.1-8B-Instruct.

Here are the results for models that support 128K contexts. Each column is a context length and the last column is the average across lengths; the full BABILong leaderboard is available here.

| Model (128K) | Params | 0K | 1K | 2K | 4K | 8K | 16K | 32K | 64K | 128K | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Phi-3-mini-128k-instruct | 3.8B | 64 | 57 | 55 | 51 | 50 | 46 | 42 | 37 | 7 | 45.4 |
| ai21labs/Jamba-v0.1 | 12B (52B) | 65 | 53 | 50 | 48 | 46 | 45 | 41 | 40 | 34 | 46.9 |
| Phi-3.5-mini-instruct | 3.8B | 70 | 70 | 62 | 59 | 58 | 53 | 43 | 38 | 10 | 51.4 |
| c4ai-command-r-v01 | 35B | 64 | 64 | 63 | 61 | 59 | 52 | 51 | 46 | 38 | 55.4 |
| Phi-3.5-MoE-instruct | 6.6B (16x3.8B) | 77 | 71 | 65 | 61 | 59 | 52 | 50 | 43 | 37 | 57.2 |
| Phi-3-medium-128k-instruct | 14B | 72 | 70 | 67 | 62 | 60 | 57 | 53 | 45 | 30 | 57.5 |
| Meta-Llama-3.1-8B-Instruct | 8B | 67 | 68 | 66 | 66 | 62 | 60 | 56 | 49 | 39 | 59.2 |
| gpt-4o-mini-2024-07-18 | - | 74 | 72 | 71 | 65 | 62 | 60 | 54 | 45 | 43 | 60.7 |
| GPT-4 (gpt-4-0125-preview) | - | 87 | 81 | 77 | 74 | 71 | 64 | 53 | 43 | 36 | 65.1 |
| Meta-Llama-3.1-70B-Instruct | 70B | 85 | 81 | 78 | 74 | 70 | 65 | 59 | 53 | 45 | 67.8 |
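For anyone who wants to try reproducing a cell of this table, below is a minimal sketch of scoring one BABILong split with Phi-3.5-mini-instruct via transformers. The dataset id `RMT-team/babilong`, the config name `"4k"`, the split name `"qa1"`, and the fields `input`, `question`, and `target` are assumptions based on the public BABILong release; check the dataset card for the exact schema before running.

```python
# Minimal sketch: scoring one BABILong split with Phi-3.5-mini-instruct.
# Dataset id, config/split names, and field names are assumptions; verify
# them against the BABILong dataset card before running.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-mini-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # recommended for Phi-3 family checkpoints
)

# qa1 (single supporting fact) with facts hidden in ~4K tokens of filler.
data = load_dataset("RMT-team/babilong", "4k", split="qa1")

correct = 0
for sample in data:
    prompt = (
        f"{sample['input']}\n\n"
        f"Question: {sample['question']}\n"
        "Answer with a single word."
    )
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(inputs, max_new_tokens=16, do_sample=False)
    # Decode only the newly generated tokens.
    answer = tokenizer.decode(out[0, inputs.shape[1]:], skip_special_tokens=True)
    correct += int(sample["target"].lower() in answer.lower())

print(f"qa1 @ 4K accuracy: {correct / len(data):.3f}")
```

Note that the substring check above is only a rough proxy; matching the leaderboard numbers exactly would require the prompts and scoring used by the official BABILong evaluation code.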
Microsoft org

Thank you @yurakuratov for benchmarking the Phi-3 and Phi-3.5 models with BABILong; these results are very helpful!
