Add data from WIMBD paper

#2
by OSainz - opened
Workshop on Data Contamination org
edited Mar 24

What are you reporting:

  • Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
  • Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s):

  • UCLNLP/adversarial_qa
  • aeslc
  • amazon_reviews_multi
  • billsum
  • cosmos_qa
  • crows_pairs
  • ibm/duorc
  • esnli
  • gigaword
  • glue
  • head_qa
  • health_fact
  • hlgd
  • liar
  • math_dataset
  • math_qa
  • mc_taco
  • mocha
  • openai_humaneval
  • paws-x
  • paws
  • piqa
  • race
  • allenai/ropes
  • samsum
  • scan
  • allenai/scicite
  • scitail
  • sem_eval_2014_task_1
  • sick
  • snli
  • squadshifts
  • stsb_multi_mt
  • subjqa
  • super_glue
  • swag
  • tab_fact
  • wiki_qa
  • winograd_wsc
  • winogrande
  • xnli
  • xsum
  • zest

Contaminated model(s): None

Contaminated corpora:

  • allenai/c4
  • oscar-corpus/OSCAR-2301
  • EleutherAI/pile
  • togethercomputer/RedPajama-Data-V2

Contaminated split(s): Test splits

Briefly describe your method to detect data contamination

  • Data-based approach
  • Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):

Contamination was detected with a data-based approach. Specifically, we used the WIMBD tool to identify contamination. See Section 4.1.1 of the paper.
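For readers unfamiliar with data-based detection, here is a minimal sketch of what such a check can look like. It assumes an exact-match criterion in which a test example counts as contaminated when all of its text fields occur in a single corpus document. The function names (`normalize`, `is_contaminated`, `contamination_rate`) are hypothetical and are not part of the WIMBD API; the procedure actually used for this report is the one described in the paper.

```python
import re
from typing import Iterable, Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def is_contaminated(fields: Sequence[str], documents: Iterable[str]) -> bool:
    """Return True if a single document contains every field of the example.

    Assumption: exact substring match after normalization. This is an
    illustrative criterion, not the WIMBD implementation.
    """
    needles = [normalize(f) for f in fields]
    if not needles:
        # An example with no text fields cannot be matched meaningfully.
        return False
    for doc in documents:
        haystack = normalize(doc)
        if all(needle in haystack for needle in needles):
            return True
    return False


def contamination_rate(examples: Sequence[Sequence[str]], documents: Iterable[str]) -> float:
    """Fraction of examples whose fields all appear together in one document."""
    docs = list(documents)  # materialize so the corpus can be scanned once per example
    if not examples:
        return 0.0
    hits = sum(is_contaminated(example, docs) for example in examples)
    return hits / len(examples)


if __name__ == "__main__":
    # Toy corpus and toy test split, both invented for illustration.
    corpus = ["The cat sat on the mat.   It was warm.", "Unrelated text"]
    test_split = [
        ("the cat sat on the mat", "it was warm"),
        ("the dog barked", "it was warm"),
    ]
    print(contamination_rate(test_split, corpus))
```

A linear scan like this only works on toy inputs. Corpora the size of C4 or The Pile need an index or a distributed search, which is the kind of infrastructure a tool such as WIMBD provides.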

Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URL: https://aclanthology.org/2023.findings-emnlp.722/

@inproceedings{
elazar2024whats,
title={What's In My Big Data?},
author={Yanai Elazar and Akshita Bhagia and Ian Helgi Magnusson and Abhilasha Ravichander and Dustin Schwenk and Alane Suhr and Evan Pete Walsh and Dirk Groeneveld and Luca Soldaini and Sameer Singh and Hannaneh Hajishirzi and Noah A. Smith and Jesse Dodge},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=RvfPnOkPV4}
}
OSainz changed pull request status to merged
