contaminated model!!! SHOULD BE FLAGGED
This model was trained on a dataset contaminated with test data in order to show high accuracy on benchmarks. It is just a merge, and the people releasing it should be ashamed of exploiting the Open LLM Leaderboard for profit.
Do you have any proof of that?
Hello, I am the developer of this model. Currently, the leaderboard as a whole is overfit. This is inevitable because, unlike Kaggle, where private scoring is followed by the end of the competition, here the scores remain continuously open.
Even among my own models, the ranking on internal data evaluations was mncai/agiin-13.6B-v0.1 > mncai/agiin-11.1B-v0.1 > mncai/mistral-7b-dpo-v6. However, on the leaderboard, mncai/mistral-7b-dpo-v6 has the highest score.
When choosing a model from the Open LLM Leaderboard, it is best to evaluate the candidates on your own private dataset that has never been made public.
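The private-evaluation idea above can be sketched in a few lines. This is a minimal illustration, not anyone's actual evaluation harness: the function names, the exact-match metric, and the toy model are all hypothetical, and a real setup would load held-out examples from a file that is never uploaded anywhere.

```python
def evaluate_private_set(generate, examples):
    """Score a model callable on held-out (prompt, answer) pairs by exact match.

    `generate` is any function mapping a prompt string to an answer string;
    `examples` is a list of dicts with "prompt" and "answer" keys, kept private.
    """
    correct = sum(
        1 for ex in examples
        if generate(ex["prompt"]).strip() == ex["answer"].strip()
    )
    return correct / len(examples)

# Toy stand-in for a real model's generate function (hypothetical).
def toy_model(prompt):
    return "4" if "2+2" in prompt else "unknown"

# In practice these would come from a private, unpublished file.
examples = [
    {"prompt": "What is 2+2?", "answer": "4"},
    {"prompt": "Capital of France?", "answer": "Paris"},
]

print(evaluate_private_set(toy_model, examples))  # 0.5
```

Because the examples never appear in any public dataset, a model cannot have memorized them, so the score reflects generalization rather than contamination.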
Hi, can you tell us what base model this was fine-tuned from?
I assume the base model was already fine-tuned with SFT, since the dataset listed for training this model is only for DPO.
I just added detailed information and an overfitting warning to the model readme.
Thanks for the additional info.
Please note that the models (AIDC-ai-business/Marcoroni-7B-v3, viethq188/LeoScorpius-7B-Chat-DPO, GreenNode/GreenNodeLM-7B-v1olet) you used are in the process of being flagged for potential data contamination (see https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474#657e6a221e3e9c41a4a8ae23 and https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/471), which by extension would result in this model also being flagged.
In my opinion, the term "overfitting" is not really appropriate here; "data contamination" should be used instead.