I have proof that shows the evals shouldn't be trusted
Lol.. Also, ehartford/WizardLM-30B-Uncensored scores about 60 points while WizardLM/WizardLM-30B-V1.0 scores only 30. I know the two models are different, but such a huge performance difference seems hardly possible.
Several weeks ago the HF team said they were rewriting code to fix previous eval errors or something.
Hi @zmcmcc and @breadlicker45 !
We have re-run all models to use the new fixed MMLU evals from the Harness, and are currently re-running some Llama scores.
Did you know that you can actually reproduce our results by running the commands in the About section? If there are models you feel unsure about, feel free to double-check them by re-running the evals and sharing your results with us!
Sure, just run musepy and judge the output yourself. Use your own judgment, not the evals, and you can see the score is wrong.
The backend is https://github.com/EleutherAI/lm-evaluation-harness (confirmed by Stella).
Hi!
Just checked your model's results (BreadAI/PM_model_V2) in more depth, and you actually get random scores on all evals (around 25%, the random baseline) except TruthfulQA, which has a slightly unbalanced answer distribution; I suspect your random generator got lucky there!
A good way of pointing out that evals, especially for scores so low, do not tell the whole story :)
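The ~25% random baseline mentioned above can be illustrated with a quick simulation. This is a standalone sketch, not leaderboard code; `random_guess_accuracy` is a hypothetical helper, and the 4-choice format mirrors evals like MMLU:

```python
import random

random.seed(0)

def random_guess_accuracy(n_questions, n_choices=4, trials=1000):
    """Estimate the accuracy of uniform random guessing on a
    multiple-choice eval with n_choices options per question."""
    hits = 0
    for _ in range(trials):
        for _ in range(n_questions):
            # With a uniform guess, any fixed "correct" choice is hit
            # with probability 1 / n_choices.
            if random.randrange(n_choices) == 0:
                hits += 1
    return hits / (trials * n_questions)

# Converges to ~0.25 for a 4-way multiple-choice eval.
print(random_guess_accuracy(100))
```

On a benchmark whose answer distribution is unbalanced (as noted for TruthfulQA), a random or degenerate guesser can land noticeably above or below that baseline by chance, which is why near-baseline scores say little about a model.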