Model is overaligned, unusable, and gamed for the leaderboard
I have heard from multiple people on good authority that this model is barely usable and was created just to game the Open LLM Leaderboard benchmarks. I am not sure whether this was done by cheating or simply through careless model creation.
Here are some quantitative examples of what I mean:
Smaug-72B-v0.1 has an average score of 80.48 on the leaderboard, averaged across ARC, HellaSwag, MMLU, TruthfulQA, Winogrande and GSM8K. Outside of these benchmarks, however, Smaug does not come close to holding that position.
Some examples:
MT-Bench: 7.75 points (https://huggingface.co/abacusai/Smaug-72B-v0.1/discussions/12)
EQ-Bench: 79.75 points (updated from 70.96) (https://eqbench.com)
Edit: the quants used for the EQ-Bench measurement were wrong, but the concerns still apply regardless.
For comparison, miqu (152334H/miqu-1-70b-sf) gets 8.4 points on MT-Bench (https://twitter.com/abacaj/status/1752583301284393006) and 82.91 on EQ-Bench. That is a very significant difference between the models, which makes no sense for a model that is at the top of the Open LLM Leaderboard.
The point I am making is that in the mission to be the #1 scoring model on the leaderboard, and the desperate race to be the best, you have sacrificed making an actually valuable model. Smaug's capabilities are clearly not replicated on benchmarks outside the leaderboard, and it is also clearly subpar in subjective use.
We hope you can use your funds to make a model that is truly better, not one just made for publicity.
Thanks.
@huyn1608 how did you query the model? We don't see this issue - is it possible you were using a quantized model?
@distantquant The point behind Smaug was to explore this new finetuning technique: https://arxiv.org/abs/2402.13228 and use it to improve reasoning in LLMs. We haven't focused as much on conversational ability, as seen by the much higher first-turn MT-Bench score (8.18). We haven't looked at EQ-Bench, but many good models like Mixtral-8x7B-Instruct and Claude-2.1 also seem to do poorly on it.
@arvindabacus Thanks for replying. I use the model as is, without quantization or conversion to other formats. The chat template that I use is the Llama one.
In the image above, the "gray" text is the input, and the "white" text is the model response. For some reason, it includes the `[INST] ... on the DNA template` paragraph, which is not part of my input. Pretty weird.
For some other questions, the model does not include weird text like that, but it includes the text `CONTEXT INFORMATION` at the beginning of the response and a single double quote at the end.
Please let me know your thoughts.
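(Aside, for anyone hitting the same formatting issue: below is a minimal sketch of querying the unquantized model through the tokenizer's bundled chat template instead of hand-writing the Llama `[INST]` format. The model id, generation settings, and prompt are illustrative, it assumes the repo actually ships a chat template, and it is not Abacus's reference code; exact loading options may differ for this checkpoint.)

```python
# Minimal sketch: query the model via transformers, letting the tokenizer's
# own chat template build the prompt instead of hand-writing [INST] tags.
# Model id, dtype, and generation settings are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "abacusai/Smaug-72B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain how DNA transcription works."}]
# apply_chat_template inserts whatever template the repo ships, so a mismatch
# between the prompt format and the training format is less likely.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```

If the repo does not ship a chat template, fall back to whatever prompt format the model card recommends.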
@distantquant Also, your updated EQ-Bench score (79.75, very near the top among open-source models) undermines your original point. And the update clearly negates claims like "Smaug-72B is also surpassed in EQ-Bench by WestLake-7B-v2, which has a score of 78.7 points.", which is simply incorrect.
I forgot to remove that section, but MT-Bench is still 0.65 points lower than miqu 70b, and the other concerns around overalignment still stand.
Miqu is not open, and I couldn't find any ARC or MMLU results for it that would indicate it does poorly on the Open LLM Leaderboard but well on EQ-Bench. Overall, the point of having different benchmarks is to measure different things, and I don't see much point in this whole issue other than a sensational rant.
Miqu is as open as your model is, since model weights can't be copyrighted, and you are missing my point.
Rather than making over-DPO'd models to game the leaderboard, the focus should be on creating actually intelligent models.
Otherwise, you are wasting your own time and everybody else's.
Why do you have test datasets on your profile?
https://huggingface.co/datasets/abacusai/HellaSwag_DPO_FewShot
https://huggingface.co/datasets/abacusai/ARC_DPO_FewShot
Makes it seem like you trained it on the test dataset
Please read this paper for full details on the datasets and procedure used for training: https://arxiv.org/pdf/2402.13228.pdf
Those datasets are preprocessed versions of the train and validation splits of HellaSwag and ARC, converted for use with the DPOP technique that is the basis of the training procedure. Please also look at the extensive contamination analysis on the model card, which we used to verify that there is no test-set contamination in the final model.
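For context, the core idea of DPOP (DPO-Positive) in that paper is to add a penalty to the standard DPO objective whenever the policy assigns the chosen completion less probability than the reference model does, which discourages DPO's failure mode of lowering the likelihood of the preferred answer. Below is a rough sketch of the per-example loss based on one reading of the paper; the hyperparameter values are illustrative and this is not the authors' training code.

```python
# Rough sketch of the DPOP (DPO-Positive) loss described in
# https://arxiv.org/abs/2402.13228, based on one reading of the paper.
# Not the authors' actual training code; beta/lambda values are illustrative.
import torch
import torch.nn.functional as F

def dpop_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps,
              beta=0.3, lambda_dpop=50.0):
    """All inputs are summed log-probs of full completions, shape (batch,)."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps

    # Standard DPO margin between the chosen and rejected completions.
    margin = chosen_ratio - rejected_ratio

    # DPOP penalty: fires only when the policy gives the chosen completion
    # LESS probability than the reference model did.
    penalty = torch.clamp(ref_chosen_logps - policy_chosen_logps, min=0.0)

    return -F.logsigmoid(beta * (margin - lambda_dpop * penalty)).mean()
```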
Yeah, I'm necroing an old post. This thread was titled in the form of an accusation; what is its resolution?
From what I am reading here, I feel this thread should have been posed as a question rather than an accusation. The paper explains the method used. What's the bother? Abacus produced useful, novel work, and it proves itself through industrial use as well, even if its responses are formatted strangely.