Superglue/RealNews Contamination based on "Noise-Robust De-Duplication at Scale"

#15
by emilys - opened

What are you reporting:

  • Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
  • Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): SuperGLUE

Contaminated corpora: allenai/c4; we only look at the realnewslike variant.

Contaminated split(s):

| Subset | Contamination |
|---|---|
| super_glue (boolq) | 0.6% |
| super_glue (cb) | 0.0% |
| super_glue (copa) | 0.0% |
| super_glue (multirc) | 1.2% |
| super_glue (record) | 7.3% |
| super_glue (rte) | 1.1% |
| super_glue (wic) | 0.0% |
| super_glue (wsc) | 0.0% |

Briefly describe your method to detect data contamination

  • Data-based approach
  • Model-based approach

We contrastively train a bi-encoder on noisy duplicates and find that this neural approach surfaces many duplicates that rule-based approaches such as hashing miss.
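The detection step can be sketched as follows: embed each evaluation example and each corpus passage with the bi-encoder, then flag an evaluation example as contaminated if any corpus passage exceeds a cosine-similarity threshold. The `embed` function below is a hashed character-trigram stand-in for the trained bi-encoder, and the `0.9` threshold is illustrative; both are assumptions for this sketch, not the paper's actual model or cutoff.

```python
import numpy as np

def embed(texts):
    # Stand-in for a trained bi-encoder: hashed character-trigram
    # bag-of-features, L2-normalized. Purely illustrative.
    dim = 1024
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        t = t.lower()
        for j in range(len(t) - 2):
            vecs[i, hash(t[j:j + 3]) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-9, None)

def flag_duplicates(eval_texts, corpus_texts, threshold=0.9):
    # Cosine similarity between every eval example and every corpus
    # passage; an eval example counts as contaminated if any passage
    # scores at or above the threshold.
    sims = embed(eval_texts) @ embed(corpus_texts).T
    return sims.max(axis=1) >= threshold
```

The per-subset contamination rate is then just the fraction of flagged evaluation examples. In practice one would use an approximate-nearest-neighbor index rather than the full similarity matrix, since the corpus side is millions of passages.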


Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URLs: https://openreview.net/forum?id=bAz2DBS35i, https://arxiv.org/abs/2210.04261
Citation:

@inproceedings{silcock-etal-2023-noise,
  title = "Noise-Robust De-Duplication at Scale",
  author = "Silcock, Emily and D'Amico-Wong, Luca and Yang, Jinglin and Dell, Melissa",
  booktitle = "International Conference on Learning Representations (ICLR)",
  year = "2023",
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

Workshop on Data Contamination org

Hi @emilys !

Thank you for your contribution! Merging to main.

Best,
Oscar

OSainz changed pull request status to merged
