Unable to Open tokenizer.model File
When attempting to run the Llama-3.1-70B model using fairseq2, an error occurs while loading the tokenizer model file. Specifically, the error message is:
/Llama-3.1-70B-Instruct/original/tokenizer.model' cannot be opened.
third-party/sentencepiece/src/sentencepiece_processor.cc(1101)
[model_proto->ParseFromArray(serialized.data(), serialized.size())]
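A similar parse failure can be reproduced with SentencePiece alone (a minimal sketch; the path is an assumption matching my local download):

```python
# Minimal sketch: reproduce a similar parse failure with SentencePiece alone.
# The path is an assumption; point it at the downloaded tokenizer.model.
import sentencepiece as spm

# Raises a RuntimeError from sentencepiece_processor.cc if the file is not
# a serialized SentencePiece ModelProto.
sp = spm.SentencePieceProcessor(
    model_file="Llama-3.1-70B-Instruct/original/tokenizer.model"
)
```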
Steps Taken:
• Verified the path to the file.
• Checked file permissions.
• Confirmed file size.
Despite these checks, the error persists.
I noticed that the tokenizer.model file is plain ASCII text, which may not be what the loader expects. It is possible that the file is expected to be in a binary format, which could be causing the error.
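A quick way to check this (a sketch; the path is an assumption based on the error above) is to inspect the first bytes of the file:

```python
# Sketch: inspect the first bytes of tokenizer.model to see whether it is
# plain text or binary. The path is an assumption; adjust to your download.
from pathlib import Path

head = Path("Llama-3.1-70B-Instruct/original/tokenizer.model").read_bytes()[:512]

if all(32 <= b < 127 or b in (9, 10, 13) for b in head):
    print("Looks like plain printable text, not a binary protobuf.")
else:
    print("Contains binary data, as a SentencePiece model would.")
```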
Same problem here. Did you find a solution?
I haven't found a solution yet.
Use src/transformers/models/llama/convert_llama_weights_to_hf.py from the Hugging Face transformers GitHub repository to convert the checkpoint to HF format so you can proceed.
See https://github.com/abetlen/llama-cpp-python/discussions/696#discussioncomment-8085453
@Bechir24
Thank you for your help. I downloaded the model files from Hugging Face, and my understanding is that the safetensors files, along with config.json and other files, are designed for libraries like Transformers, whereas fairseq2 uses the .pth files and tokenizer.model from the original directory (as indicated by the asset files in the fairseq2 documentation, reference: https://facebookresearch.github.io/fairseq2/nightly/tutorials/end_to_end_fine_tuning.html#model). You suggested the following command to convert from .pth to Hugging Face format:
python -m transformers.models.llama.convert_llama_weights_to_hf --model_size 7B --input_dir llama-2-7b-chat/ --output_dir llama-2-7b-chat-hf/
However, the issue seems to be related to the tokenizer.model.
I have tried the command. It simply converted the original folder of meta-llama/Llama-3.1-70B-Instruct into safetensors and related files, which doesn't seem to address the issue I'm facing. Nevertheless, thank you!
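For completeness, one more check that might help pin down the format question is trying to read the file as a tiktoken-style BPE ranks file, which is what the Llama 3 reference code uses (a sketch; assumes tiktoken is installed and the same original-directory path):

```python
# Sketch: attempt to read tokenizer.model as a tiktoken BPE ranks file.
# Assumes tiktoken is installed; the path is an assumption.
from tiktoken.load import load_tiktoken_bpe

ranks = load_tiktoken_bpe("Llama-3.1-70B-Instruct/original/tokenizer.model")
print(f"Loaded {len(ranks)} BPE merge ranks")
```

If this succeeds, the file is a tiktoken ranks file rather than a SentencePiece model, which would explain why the SentencePiece loader rejects it.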