Failed evaluation for model

#865
by Pretergeek - opened

Hello,

It seems that one of the last four models I uploaded failed its evaluation, which is odd since they should all behave about the same. If possible, I would like to know the reason. I apologize in advance in case the failure was caused by something on my end. Here is the link to the request file:
https://huggingface.co/datasets/open-llm-leaderboard/requests/resolve/main/Pretergeek/OpenChat-3.5-0106_BlockExpansion-48Layers-End_eval_request_False_bfloat16_Original.json

Thank you.

Open LLM Leaderboard org

Hi,

Thanks for the issue!
It failed with the error ValueError: Can't find a checkpoint index (.../models--Pretergeek--OpenChat-3.5-0106_BlockExpansion-48Layers-End/snapshots/04d4eecf8df97e96f965dbd6ea0534467e21ca46/model.safetensors.index.json) in Pretergeek/OpenChat-3.5-0106_BlockExpansion-48Layers-End.

There seems to be a problem in your model configuration. Did you follow the steps in the submit tab and make sure your model could be loaded with AutoModel?
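For anyone hitting this later, a quick local check along those lines might look like the sketch below. This is a rough example, not the leaderboard's exact loading code; it assumes recent transformers and huggingface_hub installs and uses the repo id from the request above, and the index check only matters for sharded safetensors checkpoints:

```python
from huggingface_hub import list_repo_files
from transformers import AutoConfig, AutoModel, AutoTokenizer

repo_id = "Pretergeek/OpenChat-3.5-0106_BlockExpansion-48Layers-End"

# A sharded safetensors checkpoint needs model.safetensors.index.json on the Hub;
# its absence is what the ValueError above is complaining about.
files = list_repo_files(repo_id)
print("index present:", "model.safetensors.index.json" in files)

# The model should also load through the Auto* classes.
config = AutoConfig.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id, torch_dtype="auto")
```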

Thank you. I uploaded 4 models that are nearly identical, differing only in the number of layers. I had loaded the other three with AutoModel locally, but admittedly not this last one, because it is the largest and I lack the VRAM to load it on my home computer. I assumed that since the other three were loading fine, this one would too. I truly apologize for that. I will check the model configuration and maybe test whether I can at least load it in 8-bit. I'd better set it to private for the moment so it doesn't cause problems for anyone who might download it.
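In case it is useful to anyone else with limited VRAM, an 8-bit load test could look roughly like this. It is only a sketch, assuming a CUDA GPU with bitsandbytes and accelerate installed; AutoModelForCausalLM is used here, but plain AutoModel should behave the same way:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

repo_id = "Pretergeek/OpenChat-3.5-0106_BlockExpansion-48Layers-End"

# Load the weights in 8-bit to fit the 48-layer model into limited VRAM;
# device_map="auto" spills remaining layers to CPU RAM if the GPU is too small.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    quantization_config=quant_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# A tiny forward pass confirms the checkpoint actually loads and runs.
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
with torch.no_grad():
    print(model(**inputs).logits.shape)
```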

Pretergeek changed discussion status to closed

I was able to save some memory and load the model locally with AutoModel, and it worked just as well as the others. Nonetheless, I re-uploaded all of the files, just in case there was something wrong with the previous ones (that doesn't seem to be the case; git didn't change the date on the files, so I am guessing there was no difference). I then decided to re-submit it for evaluation, thinking that maybe I had done something wrong during submission. That is when I got an error just below the "Submit Eval" button and remembered I had gotten the same error when I uploaded it last time (but since it showed up in the requests dataset as pending, I ignored it and forgot about it).

(Screenshot: Open LLM Leaderboard 2 Space showing the error below the "Submit Eval" button, 31/07/2024)

Pretergeek changed discussion status to open
Open LLM Leaderboard org

Hi @Pretergeek ,
As indicated in the FAQ, you should NOT resubmit a model that has already been submitted. Please tell me which commit you want me to use, and I'll set your request file back to pending.

The latest commit; its full hash is 1091b30480f4cc91f26cb1bd7579e527f490f8d2.

Thank you and I apologize for all the trouble.

Hey @clefourrier , I think I'm in the same boat. About a month ago, I uploaded models that had NaNs due to a pointer miscalculation. I fixed this and reran them, but the evaluation probably reused the corrupted models. You can find the request files here. Would you also be able to rerun mine?

The correct commit hashes are:

| model | hash |
| --- | --- |
| awnr/Mistral-7B-v0.1-signtensors-1-over-2 | 98d8ea1dedcbd1f0406d229e45f983a0673b01f4 |
| awnr/Mistral-7B-v0.1-signtensors-7-over-16 | 084bbc5b3d021c08c00031dc2b9830d41cae068d |
| awnr/Mistral-7B-v0.1-signtensors-3-over-8 | bb888e45945f39e6eb7d23f31ebbff2e38b6c4f2 |
| awnr/Mistral-7B-v0.1-signtensors-5-over-16 | 5ea13b3d0723237889e1512bc70dae72f71884d1 |
| awnr/Mistral-7B-v0.1-signtensors-1-over-4 | 0a90af3d9032740d4c23f0ddb405f65a2f48f0d4 |
Open LLM Leaderboard org

Hi @awnr ,
When I look at the request files, it seems that they are correct and that your models finished recently (except one, which I edited and relaunched).
Please open your own issue next time, it's easier for follow-up!

clefourrier changed discussion status to closed
