Superglue/RealNews Contamination based on "Noise-Robust De-Duplication at Scale"
#15 by emilys - opened
What are you reporting:
- Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)
Evaluation dataset(s): superglue
Contaminated corpora: allenai/c4
- Only the realnewslike variant was examined.
Contaminated split(s):
Subset | Contamination |
---|---|
super_glue (boolq) | 0.6% |
super_glue (cb) | 0.0% |
super_glue (copa) | 0.0% |
super_glue (multirc) | 1.2% |
super_glue (record) | 7.3% |
super_glue (rte) | 1.1% |
super_glue (wic) | 0.0% |
super_glue (wsc) | 0.0% |
Briefly describe your method to detect data contamination
- Data-based approach
- Model-based approach
We contrastively train a bi-encoder on noisy duplicates. This neural approach finds many duplicates that rule-based approaches such as hashing miss.
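To illustrate why similarity search can catch duplicates that hashing misses, here is a minimal sketch. It is not the implementation from the paper: `toy_embed` is a hypothetical stand-in (character trigram counts) for the trained bi-encoder, the `0.8` threshold is arbitrary, and `contamination_rate` assumes contamination is measured as the share of evaluation examples with at least one near-duplicate in the corpus.

```python
import hashlib
import math
from collections import Counter


def exact_hash(text: str) -> str:
    """Rule-based baseline: hashes match only for byte-identical text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def toy_embed(text: str, n: int = 3) -> Counter:
    """Hypothetical stand-in for the bi-encoder: character n-gram counts."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def contamination_rate(eval_texts, corpus_texts, threshold=0.8):
    """Share of eval texts with a corpus neighbour at or above the threshold."""
    if not eval_texts:
        return 0.0
    corpus_vecs = [toy_embed(t) for t in corpus_texts]
    flagged = 0
    for text in eval_texts:
        vec = toy_embed(text)
        if any(cosine(vec, other) >= threshold for other in corpus_vecs):
            flagged += 1
    return flagged / len(eval_texts)


# Made-up example texts; the first eval text is a noisy copy of the corpus text.
corpus = ["The city council voted on Tuesday to approve the new budget."]
evals = [
    "The city counci1 voted on Tuesday to aprove the new budget.",
    "Penguins are flightless birds found in the Southern Hemisphere.",
]

hash_hits = sum(exact_hash(e) in {exact_hash(c) for c in corpus} for e in evals)
rate = contamination_rate(evals, corpus)
print(f"exact-hash matches: {hash_hits}, similarity-based rate: {rate:.1%}")
```

In a real pipeline the embedding function would be the trained bi-encoder and the pairwise loop would be replaced by an approximate nearest-neighbour index, since comparing every evaluation example against every corpus document does not scale.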
Citation
Is there a paper that reports the data contamination or describes the method used to detect data contamination?
URLs: https://openreview.net/forum?id=bAz2DBS35i, https://arxiv.org/abs/2210.04261
Citation:
@inproceedings{silcock-etal-2020-noise,
title = "Noise-Robust De-Duplication at Scale",
author = "Silcock, Emily and D'Amico-Wong, Luca and Yang, Jinglin and Dell, Melissa",
booktitle = "International Conference on Learning Representations (ICLR)",
year = "2023",
}
Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
- Full names: Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell
- Institution: Harvard University
- Email: emilysilcock@fas.harvard.edu, ldamicowong@college.harvard.edu, melissadell@fas.harvard.edu
OSainz changed pull request status to merged