Contamination results based on "Data Contamination Quiz"

#9

What are you reporting:

  • Evaluation dataset(s) found in a pre-training corpus. (e.g. COPA found in ThePile)
  • Evaluation dataset(s) found in a pre-trained model. (e.g. FLAN T5 has been trained on ANLI)

Evaluation dataset(s): Name(s) of the evaluation dataset(s). If available in the HuggingFace Hub please write the path (e.g. uonlp/CulturaX), otherwise provide a link to a paper, GitHub or dataset-card.

  • imdb
  • ag_news
  • yelp_review_full
  • nyu-mll/glue (rte)
  • nyu-mll/glue (wnli)
  • samsum
  • EdinburghNLP/xsum
  • openai_humaneval
  • ucinlp/drop
  • gsm8k

Contaminated model(s): Name of the model(s) (if any) that have been contaminated with the evaluation dataset. If available in the HuggingFace Hub please list the corresponding paths (e.g. allenai/OLMo-7B).

  • GPT-4
  • GPT-3.5

Contaminated corpora: N/A

Contaminated split(s): If the dataset has Train, Development and/or Test splits please report the contaminated split(s). You can report a percentage of the dataset contaminated; if the entire dataset is compromised, report 100%.

  • imdb/test: {GPT-4: 82.00, GPT-3.5: 55.00}
  • ag_news/test: {GPT-4: 91.00, GPT-3.5: 82.00}
  • yelp_review_full/test: {GPT-4: 80.00, GPT-3.5: 13.00}
  • nyu-mll/glue (rte)/validation: {GPT-4: 60.00, GPT-3.5: 71.00}
  • nyu-mll/glue (wnli)/validation: {GPT-4: 50.70, GPT-3.5: 12.68}
  • samsum/test: {GPT-4: 77.00, GPT-3.5: 74.00}
  • EdinburghNLP/xsum/test: {GPT-4: 95.00, GPT-3.5: 79.00}
  • openai_humaneval/test: {GPT-4: 56.71}
  • ucinlp/drop/validation: {GPT-4: 44.00}
  • gsm8k/train: {GPT-4: 79.00}

You may also report instances where there is no contamination. In such cases, follow the previous instructions but report a contamination level of 0%.

Briefly describe your method to detect data contamination

  • Data-based approach
  • Model-based approach

Description of your method, 3-4 sentences. Evidence of data contamination (Read below):

The results are based on the findings of Golchin and Surdeanu (2023), who assessed contamination levels using a quiz-based method. The quiz directs an LLM to pick the option containing the original dataset instance from among three word-level perturbations of it. If the LLM identifies the original instance, this indicates the LLM's prior exposure to that data. The quiz accuracy therefore represents the estimated level of contamination within a specific dataset partition for the LLM that took the quiz.
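For illustration, the quiz procedure can be sketched as a minimal mock. Everything here is illustrative: `choose_option` stands in for the actual LLM call, the word-level perturbations are assumed to be generated beforehand, and the helper names are not from the paper.

```python
import random

def build_quiz(original, perturbations, seed=0):
    """Shuffle the original instance in among its perturbed variants;
    return the option list and the index of the original."""
    options = [original] + list(perturbations)
    random.Random(seed).shuffle(options)
    return options, options.index(original)

def estimate_contamination(instances, choose_option):
    """Contamination estimate = percentage of quizzes in which the model
    picks the original instance. `choose_option` is a stand-in for the
    LLM being tested (hypothetical, for illustration only)."""
    correct = 0
    for i, (original, perturbed) in enumerate(instances):
        options, answer = build_quiz(original, perturbed, seed=i)
        if choose_option(options) == answer:
            correct += 1
    return 100.0 * correct / len(instances)

# Toy demo: a "model" that has memorized the originals scores 100%.
originals = {"the cat sat on the mat", "a quick brown fox jumps"}
instances = [
    ("the cat sat on the mat",
     ["the cat sat on a mat", "a cat sat on the mat", "the cat lay on the mat"]),
    ("a quick brown fox jumps",
     ["a fast brown fox jumps", "a quick red fox jumps", "a quick brown fox leaps"]),
]
memorized = lambda options: next(i for i, o in enumerate(options) if o in originals)
print(estimate_contamination(instances, memorized))  # 100.0
```

A model with no exposure to the data should score near chance (25% with four options), so the reported percentages above are read as exposure estimates rather than binary flags.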

Citation

Is there a paper that reports the data contamination or describes the method used to detect data contamination?

URL: https://arxiv.org/abs/2311.06233
Citation:



@article{DBLP:journals/corr/abs-2311-06233,
  author       = {Shahriar Golchin and
                  Mihai Surdeanu},
  title        = {Data Contamination Quiz: {A} Tool to Detect and Estimate Contamination
                  in Large Language Models},
  journal      = {CoRR},
  volume       = {abs/2311.06233},
  year         = {2023},
  url          = {https://doi.org/10.48550/arXiv.2311.06233},
  doi          = {10.48550/ARXIV.2311.06233},
  eprinttype    = {arXiv},
  eprint       = {2311.06233},
  timestamp    = {Wed, 15 Nov 2023 16:23:10 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2311-06233.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

Important! If you wish to be listed as an author in the final report, please complete this information for all the authors of this Pull Request.

shahriargolchin changed pull request title from Contamination results updated based on ``https://arxiv.org/abs/2311.06233`` to Contamination results based on ``https://arxiv.org/abs/2311.06233``
Workshop on Data Contamination org

Hi @shahriargolchin ,

Seems that some of the evidence overlaps with previous evidence already in the database. Particularly these:

  • imdb (line 449-450)
  • ag_news (line 452-453)
  • yelp_review_full (line 455-456)
  • rte (line 458-459)
  • wnli (line 461-462)
  • samsum (line 464-465)
  • xsum (line 467-468)

Could you please remove the outdated evidence? I think I added them from your paper "Time Travel in LLMs", but there were no specific numbers iirc.

Best,
Oscar

OSainz changed pull request title from Contamination results based on ``https://arxiv.org/abs/2311.06233`` to Contamination results based on "Data Contamination Quiz"

Hi @OSainz ,

Thanks for your reply. To clarify, the results from the Time Travel paper represent contamination at the partition level in a binary manner, i.e., they show whether a dataset partition (e.g., the IMDB train set) is contaminated or not. However, the results from the Data Contamination Quiz are estimates of contamination levels. Given this, the nature of detection differs between the two methods. With that in mind, I was wondering whether you would like to collect this information separately or replace the existing data with the new information obtained from the Data Contamination Quiz.

Thank you,
Shahriar

Workshop on Data Contamination org
edited Apr 25

Okay, we can have duplicates if they are computed by different methods (and/or reported by different sources).

(Comment edited because I changed my mind)

Workshop on Data Contamination org

Thanks again @shahriargolchin for your contribution. We are merging to main.

OSainz changed pull request status to merged

Perfect! Thank you, @OSainz
