Failed evaluation for model
Hello,
It seems that one of the last four models I uploaded failed its evaluation, which is odd since they should all perform about the same. If possible, I would like to know the reason. I apologize in advance in case the failure was caused by something on my end. Here is the link to the request file:
https://huggingface.co/datasets/open-llm-leaderboard/requests/resolve/main/Pretergeek/OpenChat-3.5-0106_BlockExpansion-48Layers-End_eval_request_False_bfloat16_Original.json
Thank you.
Hi,
Thanks for the issue!
It failed with the error `ValueError: Can't find a checkpoint index (.../models--Pretergeek--OpenChat-3.5-0106_BlockExpansion-48Layers-End/snapshots/04d4eecf8df97e96f965dbd6ea0534467e21ca46/model.safetensors.index.json) in Pretergeek/OpenChat-3.5-0106_BlockExpansion-48Layers-End`.
There seems to be a problem with your model configuration. Did you follow the steps in the submit tab and make sure your model could be loaded with AutoModel?
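For what it's worth, the specific error here is about a missing shard index. A minimal local check could look like the sketch below (`find_checkpoint_problem` is a hypothetical helper name, and the assumed input is a downloaded snapshot directory):

```python
import os

def find_checkpoint_problem(model_dir: str):
    """Return a description of a checkpoint problem, or None if it looks fine."""
    files = set(os.listdir(model_dir))
    # A single-file checkpoint needs no index.
    if "model.safetensors" in files or "pytorch_model.bin" in files:
        return None
    # A sharded safetensors checkpoint must ship an index mapping tensors to shards.
    shards = [f for f in files if f.startswith("model-") and f.endswith(".safetensors")]
    if shards and "model.safetensors.index.json" not in files:
        return "sharded checkpoint is missing model.safetensors.index.json"
    if not shards:
        return "no model weights found in this directory"
    return None
```

That index file is exactly what the loader was looking for in the traceback above.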
Thank you. I uploaded four models that are nearly identical, differing only in the number of layers. I had loaded the other three with AutoModel locally, but admittedly not this last one, because it is the largest and I lack the VRAM to load it on my home computer. I assumed that since the other three were loading fine, this one would too. I truly apologize for that. I will check the model configuration and maybe test whether I can at least load it in 8-bit. I had better set it to private for the moment so it doesn't cause problems for anyone who might download it.
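For reference, that kind of 8-bit smoke test can be sketched roughly like this (an untested configuration sketch, assuming the `bitsandbytes` package and a CUDA GPU are available; the repo id is the one from this thread):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Quantize weights to 8-bit on load, roughly halving the memory footprint
# compared to bfloat16 (requires bitsandbytes and a CUDA GPU).
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "Pretergeek/OpenChat-3.5-0106_BlockExpansion-48Layers-End",
    quantization_config=quant_config,
    device_map="auto",  # spread layers across the available devices
)
```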
I was able to save some memory and load the model locally with AutoModel; it worked just as well as the others. Nonetheless, I re-uploaded all of the files in case there was something wrong with the previous ones (that doesn't seem to be the case: git didn't change the date on the files, so I am guessing there was no difference). I then decided to re-submit for evaluation, thinking that maybe I had done something wrong during submission. That is when I got an error just below the "Submit Eval" button and remembered I had gotten the same error when I submitted it last time (but since it showed up in the requests dataset as pending, I ignored it and forgot about it).
Hi @Pretergeek,
As indicated in the FAQ, you should NOT resubmit a model which has already been submitted. Please tell me which commit you want me to use, and I'll set your request file back to pending.
The latest commit, full hash is 1091b30480f4cc91f26cb1bd7579e527f490f8d2.
Thank you and I apologize for all the trouble.
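As an aside, the full 40-character hash can be read from a local clone of the model repo; a small sketch (the helper name is ours):

```python
import subprocess

def current_commit(repo_dir: str) -> str:
    """Return the full 40-character hash of HEAD in the given git repository."""
    out = subprocess.check_output(["git", "rev-parse", "HEAD"], cwd=repo_dir)
    return out.decode().strip()
```

For a repo on the Hub, `huggingface_hub.list_repo_commits` should also be able to list the same hashes without a local clone.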
Hey @clefourrier, I think I'm in the same boat. About a month ago, I uploaded models that had NaNs due to a pointer miscalculation. I fixed this and re-ran them, but the evaluation probably reused the corrupted models. You can find the request files here. Would you also be able to re-run mine?
The correct commit hashes are:
| model | hash |
|---|---|
| awnr/Mistral-7B-v0.1-signtensors-1-over-2 | 98d8ea1dedcbd1f0406d229e45f983a0673b01f4 |
| awnr/Mistral-7B-v0.1-signtensors-7-over-16 | 084bbc5b3d021c08c00031dc2b9830d41cae068d |
| awnr/Mistral-7B-v0.1-signtensors-3-over-8 | bb888e45945f39e6eb7d23f31ebbff2e38b6c4f2 |
| awnr/Mistral-7B-v0.1-signtensors-5-over-16 | 5ea13b3d0723237889e1512bc70dae72f71884d1 |
| awnr/Mistral-7B-v0.1-signtensors-1-over-4 | 0a90af3d9032740d4c23f0ddb405f65a2f48f0d4 |
Hi @awnr,
When I look at the request files, it seems they are correct and that your models finished recently (except one, which I edited and relaunched).
Please open your own issue next time, it makes follow-up easier!