Contamination results based on "Data Contamination Quiz"

#9

What are you reporting:

  • Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
  • Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path (e.g. uonlp/CulturaX), otherwise provide a link to a paper, GitHub or dataset-card.

  • imdb
  • ag_news
  • yelp_review_full
  • nyu-mll/glue (rte)
  • nyu-mll/glue (wnli)
  • samsum
  • EdinburghNLP/xsum
  • openai_humaneval
  • ucinlp/drop
  • gsm8k

Contaminated model(s): Name of the model(s) (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please list the corresponding paths (e.g. allenai/OLMo-7B).

  • GPT-4
  • GPT-3.5

Contaminated corpora: N/A

Contaminated split(s): If the dataset has Train, Development and/or Test splits please report the contaminated split(s). You can report a percentage of the dataset contaminated; if the entire dataset is compromised, report 100%.

  • imdb/test: {GPT-4: 82.00, GPT-3.5: 55.00}
  • ag_news/test: {GPT-4: 91.00, GPT-3.5: 82.00}
  • yelp_review_full/test: {GPT-4: 80.00, GPT-3.5: 13.00}
  • nyu-mll/glue (rte)/validation: {GPT-4: 60.00, GPT-3.5: 71.00}
  • nyu-mll/glue (wnli)/validation: {GPT-4: 50.70, GPT-3.5: 12.68}
  • samsum/test: {GPT-4: 77.00, GPT-3.5: 74.00}
  • EdinburghNLP/xsum/test: {GPT-4: 95.00, GPT-3.5: 79.00}
  • openai_humaneval/test: {GPT-4: 56.71}
  • ucinlp/drop/validation: {GPT-4: 44.00}
  • gsm8k/train: {GPT-4: 79.00}

You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%.

Briefly describe your method to detect data contamination

  • Data-based approach
  • Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):

The results are based on the findings of Golchin and Surdeanu (2023), who assessed contamination levels using a quiz-based method. The quiz directs an LLM to pick the option containing the original dataset instance from among three word-level perturbations of it. If the LLM identifies the original instance, this indicates the LLM's prior exposure to that data. The quiz accuracy therefore represents the estimated level of contamination within a specific dataset partition for the LLM that took the quiz.
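For illustration, the quiz procedure can be sketched as a minimal mock. Everything here is illustrative: `choose_option` stands in for the actual LLM call, the word-level perturbations are assumed to be generated beforehand, and the helper names are not from the paper.

```python
import random

def build_quiz(original, perturbations, seed=0):
    """Shuffle the original instance in among its perturbed variants;
    return the option list and the index of the original."""
    options = [original] + list(perturbations)
    random.Random(seed).shuffle(options)
    return options, options.index(original)

def estimate_contamination(instances, choose_option):
    """Contamination estimate = percentage of quizzes in which the model
    picks the original instance. `choose_option` is a stand-in for the
    LLM being tested (hypothetical, for illustration only)."""
    correct = 0
    for i, (original, perturbed) in enumerate(instances):
        options, answer = build_quiz(original, perturbed, seed=i)
        if choose_option(options) == answer:
            correct += 1
    return 100.0 * correct / len(instances)

# Toy demo: a "model" that has memorized the originals scores 100%.
originals = {"the cat sat on the mat", "a quick brown fox jumps"}
instances = [
    ("the cat sat on the mat",
     ["the cat sat on a mat", "a cat sat on the mat", "the cat lay on the mat"]),
    ("a quick brown fox jumps",
     ["a fast brown fox jumps", "a quick red fox jumps", "a quick brown fox leaps"]),
]
memorized = lambda options: next(i for i, o in enumerate(options) if o in originals)
print(estimate_contamination(instances, memorized))  # 100.0
```

A model with no exposure to the data should score near chance (25% with four options), so the reported percentages above are read as exposure estimates rather than binary flags.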

Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URL: https://arxiv.org/abs/2311.06233
Citation:



@article{DBLP:journals/corr/abs-2311-06233,
  author       = {Shahriar Golchin and
                  Mihai Surdeanu},
  title        = {Data Contamination Quiz: {A} Tool to Detect and Estimate Contamination
                  in Large Language Models},
  journal      = {CoRR},
  volume       = {abs/2311.06233},
  year         = {2023},
  url          = {https://doi.org/10.48550/arXiv.2311.06233},
  doi          = {10.48550/ARXIV.2311.06233},
  eprinttype    = {arXiv},
  eprint       = {2311.06233},
  timestamp    = {Wed, 15 Nov 2023 16:23:10 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2311-06233.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

shahriargolchin changed pull request title from Contamination results updated based on ``https://arxiv.org/abs/2311.06233`` to Contamination results based on ``https://arxiv.org/abs/2311.06233``
Workshop on Data Contamination org

Hi @shahriargolchin ,

Seems that some of the evidence overlaps with previous evidence already in the database. Particularly these:

  • imdb (line 449-450)
  • ag_news (line 452-453)
  • yelp_review_full (line 455-456)
  • rte (line 458-459)
  • wnli (line 461-462)
  • samsum (line 464-465)
  • xsum (line 467-468)

Could you please remove the outdated evidence? I think I added them from your paper "Time Travel in LLMs", but there were no specific numbers iirc.

Best,
Oscar

OSainz changed pull request title from Contamination results based on ``https://arxiv.org/abs/2311.06233`` to Contamination results based on "Data Contamination Quiz"

Hi @OSainz ,

Thanks for your reply. To clarify, the results from the Time Travel paper represent contamination at the partition level in a binary manner, i.e., they show whether a dataset partition (e.g., the IMDB train set) is contaminated or not. However, the results from the Data Contamination Quiz are estimates of contamination levels. Given this, the nature of detection differs between the two methods. With that in mind, I was wondering whether you would like to collect this information separately or replace the existing data with the new information obtained from the Data Contamination Quiz.

Thank you,
Shahriar

Workshop on Data Contamination org
edited Apr 25

Okay, we can have duplicates if they are computed by different methods (and/or reported by different sources).

(Comment edited because I changed my mind)

Workshop on Data Contamination org

Thanks again @shahriargolchin for your contribution. We are merging to main.

OSainz changed pull request status to merged

Perfect! Thank you, @OSainz
