Superglue/RealNews Contamination based on "Noise-Robust De-Duplication at Scale"
## What are you reporting:
- [X] Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
- [ ] Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)
**Evaluation dataset(s)**: `superglue`
**Contaminated corpora**: `allenai/c4` (only the `realnewslike` variant was examined)
**Contaminated split(s)**:
| Subset | Contamination |
| -------- | ------- |
| `super_glue (boolq)` | 0.6% |
| `super_glue (cb)` | 0.0% |
| `super_glue (copa)` | 0.0% |
| `super_glue (multirc)` | 1.2% |
| `super_glue (record)` | 7.3% |
| `super_glue (rte)` | 1.1% |
| `super_glue (wic)` | 0.0% |
| `super_glue (wsc)` | 0.0% |
## Briefly describe your method to detect data contamination
- [X] Data-based approach
- [ ] Model-based approach
We contrastively train a bi-encoder on noisy duplicates. This neural approach finds many duplicates that rule-based approaches such as hashing miss.
![image.png](https://cdn-uploads.huggingface.co/production/uploads/61654589b5ec555e8e9c203a/c6bY4_HtU5scdcDeVL3jT.png)
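To make the reported percentages concrete, here is a minimal sketch of how a contamination rate could be computed once texts have been embedded. It is not the authors' pipeline: the toy vectors stand in for bi-encoder outputs, and the function names and the `threshold=0.9` cut-off are hypothetical choices for illustration. The actual matching procedure is described in the cited paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contamination_rate(eval_vecs, corpus_vecs, threshold=0.9):
    """Percentage of evaluation items with at least one corpus item
    whose cosine similarity reaches the (hypothetical) threshold."""
    if not eval_vecs:
        return 0.0
    flagged = sum(
        1
        for e in eval_vecs
        if any(cosine(e, c) >= threshold for c in corpus_vecs)
    )
    return 100.0 * flagged / len(eval_vecs)


# Toy 2-d vectors standing in for embeddings (assumption: a real run
# would embed evaluation examples and corpus documents with the model).
eval_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
corpus_vecs = [[0.99, 0.05], [0.5, -0.5]]

rate = contamination_rate(eval_vecs, corpus_vecs, threshold=0.9)
print(f"{rate:.1f}% of evaluation items flagged")
```

In this toy setup only the first evaluation vector has a close corpus neighbour. A brute-force comparison like this would not scale to a corpus the size of C4, so a real run would presumably rely on an approximate nearest-neighbour index.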
## Citation
Is there a paper that reports the data contamination or describes the method used to detect data contamination?
URLs: https://openreview.net/forum?id=bAz2DBS35i, https://arxiv.org/abs/2210.04261
Citation:
```
@inproceedings{silcock-etal-2020-noise,
title = "Noise-Robust De-Duplication at Scale",
author = "Silcock, Emily and D'Amico-Wong, Luca and Yang, Jinglin and Dell, Melissa",
booktitle = "International Conference on Learning Representations (ICLR)",
year = "2023",
}
```
*Important!* If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.
- Full names: Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell
- Institution: Harvard University
- Email: emilysilcock@fas.harvard.edu, ldamicowong@college.harvard.edu, melissadell@fas.harvard.edu
- contamination_report.csv +9 -0
@@ -597,3 +597,12 @@ ibragim-bad/arc_challenge;;FLAN;model;;15.6;;data-based;https://arxiv.org/abs/21
 facebook/anli;dev_r3;FLAN;model;;40.2;;data-based;https://arxiv.org/abs/2109.01652;13
 facebook/anli;dev_r2;FLAN;model;;97.9;;data-based;https://arxiv.org/abs/2109.01652;13
 facebook/anli;dev_r1;FLAN;model;;98.6;;data-based;https://arxiv.org/abs/2109.01652;13
+
+super_glue;boolq;allenai/c4 (realnewslike);corpus;;;0.6;data-based;https://arxiv.org/abs/2210.04261;15
+super_glue;cb;allenai/c4 (realnewslike);corpus;;;0.0;data-based;https://arxiv.org/abs/2210.04261;15
+super_glue;copa;allenai/c4 (realnewslike);corpus;;;0.0;data-based;https://arxiv.org/abs/2210.04261;15
+super_glue;multirc;allenai/c4 (realnewslike);corpus;;;1.2;data-based;https://arxiv.org/abs/2210.04261;15
+super_glue;record;allenai/c4 (realnewslike);corpus;;;7.3;data-based;https://arxiv.org/abs/2210.04261;15
+super_glue;rte;allenai/c4 (realnewslike);corpus;;;1.1;data-based;https://arxiv.org/abs/2210.04261;15
+super_glue;wic;allenai/c4 (realnewslike);corpus;;;0.0;data-based;https://arxiv.org/abs/2210.04261;15
+super_glue;wsc;allenai/c4 (realnewslike);corpus;;;0.0;data-based;https://arxiv.org/abs/2210.04261;15
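The rows added to `contamination_report.csv` are semicolon-delimited with ten fields each. As a rough sketch, they could be read and sanity-checked as below. The header row is not shown in this diff, so the column names in `ASSUMED_COLUMNS` are hypothetical, and in particular the reading of the three percentage fields as train, dev, and test splits is an assumption.

```python
import csv
import io

# Hypothetical column names: the diff does not include the header row.
ASSUMED_COLUMNS = [
    "dataset", "subset", "source", "source_type",
    "train_pct", "dev_pct", "test_pct",
    "approach", "reference", "pr",
]

# Two rows copied from the diff for illustration.
SAMPLE = (
    "super_glue;record;allenai/c4 (realnewslike);corpus;;;7.3;"
    "data-based;https://arxiv.org/abs/2210.04261;15\n"
    "super_glue;wsc;allenai/c4 (realnewslike);corpus;;;0.0;"
    "data-based;https://arxiv.org/abs/2210.04261;15\n"
)


def parse_rows(text):
    """Parse semicolon-delimited report rows into dictionaries.

    Empty percentage fields become None; a row with an unexpected
    number of fields raises ValueError.
    """
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter=";"):
        if not fields:  # skip blank lines
            continue
        if len(fields) != len(ASSUMED_COLUMNS):
            raise ValueError(f"expected {len(ASSUMED_COLUMNS)} fields, got {len(fields)}")
        row = dict(zip(ASSUMED_COLUMNS, fields))
        for key in ("train_pct", "dev_pct", "test_pct"):
            row[key] = float(row[key]) if row[key] else None
        rows.append(row)
    return rows


rows = parse_rows(SAMPLE)
for row in rows:
    print(row["subset"], row["test_pct"])
```

A field-count check of this kind catches a missing or extra `;`, which is easy to introduce when editing such rows by hand.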