Add data from "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus"

#6

What are you reporting:

  • Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
  • Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): LAMA (T-REx), LAMA (Google-RE), XSum, TIFU-short, TIFU-long, WikiBio, AMR-to-text, GLUE (BoolQ, CoLA, MNLI, MRPC, QNLI, RTE, SST-2, STS-B, WNLI)

Contaminated model(s): NA

Contaminated corpora: allenai/c4

Contaminated split(s): All test splits

Briefly describe your method to detect data contamination

  • Data-based approach
  • Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):
The approach is simple: exact matching of evaluation examples against the corpus, after normalizing capitalization and punctuation. For more details, see Section 4.2 of https://arxiv.org/abs/2104.08758.
For evidence of contamination, see the original paper.
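
Below is a minimal sketch of what a normalized exact-match check could look like. It is illustrative only: the corpus slice, the `normalize` helper, and the toy evaluation targets are assumptions for this example, not the pipeline used in the paper (see Section 4.2 for the actual procedure and scale).

```python
# Minimal sketch of a normalized exact-match contamination check.
# Assumptions (not from the paper): we stream allenai/c4 via the
# `datasets` library and compare against a small, hypothetical set of
# evaluation targets; the paper's pipeline runs over the full corpus.
import re
import string

from datasets import load_dataset


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical evaluation targets (e.g. reference summaries or test inputs).
eval_targets = {normalize(t) for t in [
    "Example target sentence one.",
    "Example target sentence two.",
]}

# Stream a small slice of C4 and flag documents containing any normalized target.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, doc in enumerate(c4):
    text = normalize(doc["text"])
    hits = [t for t in eval_targets if t in text]
    if hits:
        print(f"doc {i}: contaminated ({len(hits)} matching targets)")
    if i >= 1000:  # small cap for illustration
        break
```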

Citation

Yes, here is the link:
URL: https://arxiv.org/pdf/2104.08758.pdf
Citation:



@article{dodge2021documenting,
  title={Documenting large webtext corpora: A case study on the colossal clean crawled corpus},
  author={Dodge, Jesse and Sap, Maarten and Marasovi{\'c}, Ana and Agnew, William and Ilharco, Gabriel and Groeneveld, Dirk and Mitchell, Margaret and Gardner, Matt},
  journal={arXiv preprint arXiv:2104.08758},
  year={2021}
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

  • Full name: Vishaal Udandarao
  • Institution: University of Tuebingen, University of Cambridge
  • Email: vu214@cam.ac.uk
Workshop on Data Contamination org

@vishaal27 Thank you! Merged :D

Iker changed pull request status to merged
