ABOUT_TEXT = """# Background
Model contamination is an obstacle that many model creators face, and it has become a growing issue among the top scorers on the [🤗 Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). This work is an implementation of [Detecting Pretraining Data from Large Language Models](https://huggingface.co/papers/2310.16789), following the template provided by [this GitHub repo](https://github.com/swj0419/detect-pretrain-code-contamination/tree/master). I'm aware the Hugging Face team is working on their own implementation, working directly with the authors of the paper mentioned above. Until that's ready, I hope this serves as a metric for evaluating model contamination in open-source LLMs.
# How does this work?
Training on benchmark data leaves a mark on the probability distribution over the tokens a model predicts when it is shown the same sample again.
We can compare this distribution to that of a 'ground truth', or reference, model and obtain a percentage that we can interpret as how likely it is that the model has 'seen' the data before.
According to the authors: "The output of the script provides a metric for dataset contamination. If #the result < 0.1# with a percentage greater than 0.85, it is highly likely that the dataset has been trained."
The higher the score on a given dataset, the more likely it is that the model has been trained on that dataset. At the moment, I wouldn't jump to any conclusions based on the scores obtained, as this method is still very new. I'd only be wary of models that score over 0.95 on any given benchmark.
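
As a rough illustration of the idea (a minimal sketch, not the code this space actually runs), the paper linked above scores a sample by averaging the log-probabilities of the tokens the model found least likely (Min-K% Prob). The helper name `min_k_prob`, the value of `k` and the per-token log-probabilities below are all made up for the example. How the gap between the two models gets turned into the percentage shown in this space is not covered here.

```python
def min_k_prob(token_logprobs, k=0.2):
    # Average log-probability of the k fraction of tokens with the lowest
    # log-probability. Values closer to 0 mean even the "hardest" tokens
    # were predicted confidently.
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Hypothetical per-token log-probabilities for the same benchmark sample,
# as scored by the model under test and by a reference model.
target_logprobs = [-0.1, -0.3, -0.2, -0.05, -0.4]
reference_logprobs = [-2.0, -0.5, -3.1, -0.2, -1.7]

target_score = min_k_prob(target_logprobs, k=0.4)
reference_score = min_k_prob(reference_logprobs, k=0.4)

# A large positive gap means the tested model is far more confident than
# the reference on the least likely tokens, which is the kind of mark that
# training on the sample would leave.
gap = target_score - reference_score
print(target_score, reference_score, gap)
```

In this toy case the tested model is much more confident than the reference model on its two least likely tokens, so the gap comes out clearly positive. On its own that would be a reason to look closer, not proof of contamination.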
# Disclaimer
This space should NOT be used to flag or accuse models of cheating or being contaminated. Instead, it should form part of a holistic assessment by the parties involved. The main goal of this space is to provide more transparency about the contents of the datasets used to train models. Take whatever is shown in the evaluation tab with a grain of salt and draw your own conclusions from the data.
As a final note, I've outlined my main concerns with this implementation in a pinned discussion under the community tab. Any type of help would be greatly appreciated :)"""
SUBMISSION_TEXT = """
<h1 align="center">
Submitting models for evaluation.
</h1>
This space is still highly experimental. Please don't submit GGUF, GPTQ or AWQ models; submit their raw fp16/bf16 versions instead. Also, try not to submit any model above 13B parameters in size, as this space is not equipped with the hardware to handle that.
"""
SUBMISSION_TEXT_2 = """
If you encounter any issues while submitting, please make a community post about it or message me directly on Discord: yeyito777!
"""