@lbourdois on Hugging Face: "Let me introduce you LLE: Leaks, leaks everywhere! A quick experiment I've…"


lbourdois

posted an update Feb 1


Let me introduce you to LLE: Leaks, leaks everywhere!

It's a quick experiment I carried out on around 600 datasets from the HF Hub. The results are stored in lbourdois/LLE, and the methodology is described in
https://huggingface.co/blog/lbourdois/lle

tomaarsen

Feb 1

I did not expect that many datasets to have such notable issues! Very interesting, thanks for sharing.
I would also be interested in the data quality bot that you describe at the end - I think that would be quite useful.

lbourdois

Feb 1

It's the exchanges I've had with you that have led me to question the quality of the data 🤗

On which desk in the Paris office should I leave a post-it note asking for the creation of the bot?

dhuynh95

Feb 1

Pretty cool stuff! Maybe you should do a leaderboard of major datasets and their leakage score

lastrosade

Feb 7 (edited Feb 7)

A little glossary would be nice, I'm not even sure what NER is or what a "leak" means.

lbourdois

Feb 9

For NER (Named Entity Recognition), you can consult https://huggingface.co/tasks/token-classification.
A leak is when data from the train split also appears in the test split, which biases the results and benchmarks.
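
To make that definition concrete, here is a minimal sketch of a leak check. It is only an illustration, not the methodology from the blog post: it assumes each split is a plain list of strings and counts exact matches after trimming whitespace. The function name `leak_rate` and the toy sentences are made up for the example.

```python
def leak_rate(train_texts, test_texts):
    """Return the fraction of test examples whose text also appears in train.

    Assumption for this sketch: a "leak" is an exact string match after
    stripping leading/trailing whitespace. Near-duplicates are not detected.
    """
    if not test_texts:
        return 0.0
    # Set lookup keeps the check linear in the size of the two splits.
    train_set = {text.strip() for text in train_texts}
    leaked = sum(1 for text in test_texts if text.strip() in train_set)
    return leaked / len(test_texts)


# Toy data (hypothetical): one of the two test sentences is also in train.
train = ["Paris is in France.", "Hugging Face hosts datasets.", "NER tags entities."]
test = ["NER tags entities.", "Berlin is in Germany."]

rate = leak_rate(train, test)
print(f"{rate:.0%} of test examples also appear in train")
```

A model evaluated on this toy test split has already seen half of it during training, so its score would look better than it really is.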
