Is data leakage of MMLU possible?

#12

by tomyoung - opened Mar 7, 2023

Mar 7, 2023

I noticed that the questions in the MMLU dataset were collected from online sources.

From the MMLU paper: “The questions in the dataset were manually collected by graduate and undergraduate students from freely available sources online. These include practice questions for tests such as the Graduate Record Examination and the United States Medical Licensing Examination.”

Is it possible that those questions leaked into C4 (which is based on Common Crawl)?

andychase

May 5, 2023

Here is a sample question:

"Of the following compounds, which is LEAST likely to behave as a Lewis acid"

I searched Google for that and found many results. There is a "contaminated links" file that model builders could, in theory, use to exclude URLs, but I checked one link, a Hugging Face one, and it wasn't in that file. In other words, even if they used the file, it would not be enough.

I think it's safe to assume the questions have been leaked.
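
For anyone who wants to repeat the link check on more than one URL, here is a rough sketch. It assumes the "contaminated links" file is plain text with one URL per line, which I have not verified, and the names `normalize_url`, `load_blocklist`, and `is_listed` are made up for this example. It only tests whether a URL appears in the list after light normalization; it says nothing about whether the page's text ended up in a training corpus.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to host + path so trivial variants compare equal.

    Drops the scheme, a leading 'www.', the query string, and any
    trailing slash. This is a hypothetical normalization, not one
    defined by the dataset.
    """
    url = url.strip()
    if "://" not in url:
        # urlsplit only fills in netloc when a scheme is present.
        url = "http://" + url
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def load_blocklist(lines):
    """Build a set of normalized URLs from an iterable of text lines."""
    return {normalize_url(line) for line in lines if line.strip()}


def is_listed(url: str, blocklist) -> bool:
    """Return True if the URL (after normalization) is in the blocklist."""
    return normalize_url(url) in blocklist


if __name__ == "__main__":
    # Inline stand-in for the real file; replace with open(path) as needed.
    sample_lines = ["https://www.example.com/quiz/", "", "http://example.org/a"]
    blocklist = load_blocklist(sample_lines)
    for candidate in ["http://example.com/quiz", "https://example.net/other"]:
        print(candidate, is_listed(candidate, blocklist))
```

An exact-match lookup like this is the weakest possible filter: any mirror or repost of a question under a different URL passes straight through, which is the gap described above.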
