Space: CONDA-Workshop/Data-Contamination-Database

Iker committed on Apr 13

Commit f77074b • 1 Parent(s): 6738f41

update urls

Files changed (1)

markdown.py +2 -2

@@ -6,7 +6,7 @@ The Data Contamination Database is a community-driven project and we welcome con
 We are organizing a community effort on centralized data contamination evidence collection. While the problem of data contamination is prevalent and serious, the breadth and depth of this contamination are still largely unknown. The concrete evidence of contamination is scattered across papers, blog posts, and social media, and it is suspected that the true scope of data contamination in NLP is significantly larger than reported. With this shared task we aim to provide a structured, centralized platform for contamination evidence collection to help the community understand the extent of the problem and to help researchers avoid repeating the same mistakes.
-If you wish to contribute to the project by reporting a data contamination case, please open a pull request in the [✋Community Tab](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report/discussions). Your [pull request](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report/discussions?new_pr=true) should edit the [contamination_report.csv](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report/blob/main/contamination_report.csv) file and add a new row with the details of the contamination case, or evidence of lack of contamination. Please edit the following template with the details of the contamination case. Pull Requests that do not follow the template won't be accepted.
+If you wish to contribute to the project by reporting a data contamination case, please open a pull request in the [✋Community Tab](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Database/discussions). Your [pull request](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Database/discussions?new_pr=true) should edit the [contamination_report.csv](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Database/blob/main/contamination_report.csv) file and add a new row with the details of the contamination case, or evidence of lack of contamination. Please edit the following template with the details of the contamination case. Pull Requests that do not follow the template won't be accepted.
 As a companion to the contamination evidence platform, we will produce a paper that will provide a summary and overview of the evidence collected in the shared task. The participants who contribute to the shared task will be listed as co-authors in the paper. If you have any questions, please contact us at [email protected] or open a discussion in the space itself.
@@ -58,7 +58,7 @@ Citation: `@inproceedings{...`
 ### How to update the contamination_report.csv file
-The [contamination_report.csv](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Report/blob/main/contamination_report.csv) file is a csv filed with `;` delimiters. You will need to update the following columns:
+The [contamination_report.csv](https://huggingface.co/spaces/CONDA-Workshop/Data-Contamination-Database/blob/main/contamination_report.csv) file is a csv filed with `;` delimiters. You will need to update the following columns:
 - **Evaluation Dataset**: Name of the evaluation dataset that has has (not) been compromised. If available in the HuggingFace Hub please write the path  (e.g. `uonlp/CulturaX`), otherwise  proviede the name of the dataset.
 - **Subset**: Many HuggingFace datasets have different subsets or splits on a single dataset. This field is to define a particular subset of a given dataset. For example, `qnli` subset of `glue`.
 - **Contaminated Source**: Name of the model that has been trained with the evaluation dataset or name of the pre-training copora that contains the evaluation datset. If available in the HuggingFace Hub please write the path  (e.g. `allenai/OLMo-7B`), otherwise proviede the name of the model/dataset.
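
For reviewers who want to see what a contribution looks like in practice, here is a minimal sketch of appending one row to a `;`-delimited CSV such as `contamination_report.csv`. It is not part of this commit. Only the three column names visible in the hunk above are used; the real file may define more columns, which this sketch would fill with empty strings. The function name `append_row` and the row values are hypothetical placeholders, not a real contamination report.

```python
import csv
import io

# Assumption: only these three column names are visible in the diff above.
# The real contamination_report.csv may contain additional columns.
COLUMNS = ["Evaluation Dataset", "Subset", "Contaminated Source"]


def append_row(csv_text: str, row: dict) -> str:
    """Return csv_text with one extra row, keeping the ';' delimiter.

    Columns missing from `row` are written as empty strings.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    # Reuse the header found in the file; fall back to COLUMNS if it is empty.
    fieldnames = reader.fieldnames or COLUMNS
    rows = list(reader)
    rows.append({name: row.get(name, "") for name in fieldnames})

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, delimiter=";", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Placeholder values for illustration only (not a real report).
existing = "Evaluation Dataset;Subset;Contaminated Source\n"
updated = append_row(
    existing,
    {
        "Evaluation Dataset": "glue",
        "Subset": "qnli",
        "Contaminated Source": "example-org/example-model",
    },
)
print(updated)
```

Going through the `csv` module with `delimiter=";"` rather than joining strings by hand means fields that themselves contain a `;` are quoted correctly.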