flan-ul2

Is data leakage of MMLU possible?

#12
by tomyoung - opened

I noticed that the questions in the MMLU dataset were collected from online sources.

From MMLU paper: “The questions in the dataset were manually collected by graduate and undergraduate students from freely available sources online. These include practice questions for tests such as the Graduate Record Examination and the United States Medical Licensing Examination.”

Is it possible that those questions leaked into C4 (which is based on Common Crawl)?

Here is a sample question:

"Of the following compounds, which is LEAST likely to behave as a Lewis acid"

I searched Google for that sentence and found many results. There is a "contaminated links" file that model builders could in theory use to exclude URLs, but I checked one of the results, a Hugging Face link, and it wasn't in that file. In other words, even if they used the file, it would not be enough.
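For anyone who wants to check this more systematically than a web search, here is a minimal sketch of the two checks described above: whether a URL appears in an exclusion list, and whether a benchmark question overlaps with a page's text. The function names, the 8-token window, and the example.com URLs are my own assumptions for illustration; they are not taken from the MMLU release or from the C4 pipeline.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of all contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shares_ngram(question: str, document: str, n: int = 8) -> bool:
    """True if the question and the document share at least one n-gram.

    n=8 is an arbitrary choice for this sketch, not a published threshold.
    """
    return bool(ngrams(normalize(question), n) & ngrams(normalize(document), n))


def url_is_excluded(url: str, blocklist: set[str]) -> bool:
    """True if the URL (ignoring a trailing slash) is in the exclusion list."""
    def clean(u: str) -> str:
        return u.strip().rstrip("/")
    return clean(url) in {clean(u) for u in blocklist}


question = ("Of the following compounds, which is LEAST likely "
            "to behave as a Lewis acid")

# Hypothetical crawled page text and exclusion list, for illustration only.
page_text = ("Practice quiz. Of the following compounds, which is least "
             "likely to behave as a Lewis acid? Choose one answer.")
blocklist = {"https://example.com/listed-page"}

print(shares_ngram(question, page_text))
print(url_is_excluded("https://example.com/unlisted-page", blocklist))
```

A page that overlaps with the question but whose URL is not in the exclusion list is exactly the case I ran into: URL filtering alone would let it through.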

I think it's safe to assume the questions have been leaked.
