so what's the hypothesis?
why?
because an increased TQA score was also noticed when TQA was decontaminated from the dataset used.
so this actually shows the opposite result, but in other metrics. It's unlikely that those samples hold any "magic". How big was the portion? The score is clearly off the mark.
Will be looking forward to your paper. In the interim, I would suggest you raise a discussion to get it flagged before the community gets outraged....
I am new here and have no clue what "flagged" means or how to raise a discussion about it.
I simply took 1200 Winograd schemata, modified them slightly*, and used them to DPO the model which dominated the leaderboard (rough sketch of such a run below).
Surprisingly - or maybe not so, if one realizes what attention layers are about and what Winograd schemata are about - in addition to Winogrande, three other metrics apparently went up.
Gonna dive deeper into the thing when I have time, putting a small speculative paper on Researchgate, Researchub or even here in a git repo.
* maybe the magic lies in the way that "slight" modification was done
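For reference, here is a minimal sketch of what such a DPO run could look like with Hugging Face TRL. This is an illustrative reconstruction, not my exact recipe: the model name is a placeholder, the column names (prompt/chosen/rejected) are assumed, and argument names shift between TRL releases.

```python
# Illustrative sketch only: DPO-tuning a causal LM on a preference dataset with TRL.
# Model name and dataset column names (prompt/chosen/rejected) are placeholders/assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "some-leaderboard-model"              # placeholder, not the actual model
dataset = load_dataset("hromi/winograd_dpo", split="train")

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO pushes the policy to prefer "chosen" over "rejected" completions relative to
# a frozen reference model (TRL builds the reference copy itself when ref_model=None).
config = DPOConfig(
    output_dir="winograd-dpo",
    beta=0.1,
    num_train_epochs=1,
    per_device_train_batch_size=2,
)
trainer = DPOTrainer(
    model=model,
    ref_model=None,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,   # `tokenizer=tokenizer` on older TRL releases
)
trainer.train()
```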
Contaminated models are not part of the leaderboard, as they are contaminated. I believe yours is not the first model that was contaminated for academic purposes, but I believe they are not ranked among the non-contaminated models. At the end of the day, the model you uploaded has in its corpus the answers it is being asked for.
IMO, due to the age of those datasets, it is possible that there is some overlap, or maybe it is just reinforcing the format (ARC, TQA, etc. follow a very dry format). To be honest, the most surprising one is ARC itself. Does this mean that reinforcing some type of conversation can affect model performance? <= very probable.
Just uploaded the dataset with which I executed the DPO: hromi/winograd_dpo
As you may notice, the dataset has slightly special properties not to be found in the wild... Can one, in such a case, still speak of contamination, and if yes, to what degree? Any tool to quantify that?
It's a fairly short dataset and I didn't use anything else. I doubt there would be explicit overlaps with TQA, ARC or so, but this is definitely to be explored.
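I'm not aware of a standard tool for this exact case, but a crude way to quantify it is a token n-gram overlap check between the DPO pairs and a benchmark test set, e.g. something like the sketch below (my own quick hack, not an official decontamination script; the DPO dataset's column names, whitespace tokenization, and the choice of 8-grams are all assumptions):

```python
# Rough n-gram overlap check between the DPO set and a benchmark test set.
# Column names ("prompt", "chosen", "rejected") and the 8-gram size are assumptions.
from datasets import load_dataset

def ngrams(text, n=8):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

dpo = load_dataset("hromi/winograd_dpo", split="train")
arc = load_dataset("ai2_arc", "ARC-Challenge", split="test")

# Pool every n-gram occurring anywhere in the DPO prompts/completions.
dpo_grams = set()
for row in dpo:
    for col in ("prompt", "chosen", "rejected"):
        if col in row and isinstance(row[col], str):
            dpo_grams |= ngrams(row[col])

# A benchmark question counts as a "hit" if it shares at least one n-gram with the DPO data.
hits = sum(1 for q in arc if ngrams(q["question"]) & dpo_grams)
print(f"{hits}/{len(arc)} ARC-Challenge questions share an 8-gram with the DPO set")
```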
My intuition would be that the combination of Winograd, DPO, and repetition either
- does something interesting to attention layers, transformers etc. in the model
- exploits the way diverse tasks are evaluated by lm-eval-harness (see the sketch below)
The first option sounds cool, the second one less so. Still, I look forward to diving deeper into this riddle and would be grateful for any insights / feedback / mutual collaboration in writing a paper about this oddity.
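One way to poke at the second option would be to re-score the DPO'd checkpoint on the affected tasks with the harness itself. A hedged sketch using the lm-eval-harness Python API (v0.4-style; task names and the API differ between harness versions, and the model path is a placeholder):

```python
# Sketch: re-scoring the DPO'd model on the affected benchmarks with lm-eval-harness.
# Model path is a placeholder; task names follow the v0.4 harness and may differ elsewhere.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/winograd-dpo-model",
    tasks=["winogrande", "arc_challenge", "truthfulqa_mc2", "hellaswag"],
    batch_size=8,
)
print(results["results"])
```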
I find this very interesting. The Winogrande questions were carefully chosen to challenge the language skills of LLMs. So it makes perfect sense that they would also be the ideal questions to DPO train them with.
And once said training generalizes, it should help on ARC in particular, because many LLMs get easy questions wrong only because they fail to comprehend what is being asked. So a boost in comprehension, afforded by improved language skills from deliberate contamination with Winogrande questions designed to exploit common blind spots in LLMs, should help.
What would be interesting is if someone trained the exact same way with an equal number of equivalent but different Winogrande-style questions (e.g. the same type and complexity of each question, just with different subject matter, so it's not contamination), and then saw whether other tests like ARC get the same boost.
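To make that control concrete, one could write items from scratch in the same schema style but with fresh subject matter, then cast them into the same DPO preference format. A purely illustrative example (my own wording, not taken from Winogrande, and the field names are assumed):

```python
# Illustrative control item: Winograd-style ambiguity with fresh subject matter,
# formatted as a DPO preference pair (field names assumed to match the training setup).
control_item = {
    "prompt": "The crate would not fit through the hatch because it was too wide. What was too wide?",
    "chosen": "The crate was too wide.",
    "rejected": "The hatch was too wide.",
}
```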
There was a case like that before: a coding model was contaminated and the authors deleted it and also their account,
but people were saying it had very good performance in practice even though it was contaminated, so TheBloke left the model in place,
see here https://huggingface.co/TheBloke/NewHope-GGML/discussions/4
My intuition is that if you take olympiad tasks and train a student on them, they will become proficient in the school curriculum too.
IMO that's the important point. But why? It's not that those small test sets hold anything super magical... and how can we holistically evaluate them further?
What happened to the dataset? It went 404.