unknown pre-tokenizer type: 'smaug-bpe'

#1
by kep359 - opened

Love your models! Having an issue with the Llama-3-70B-Instruct-abliterated v3.5 GGUFs that I didn't have with the v3 variants. I'm using oobabooga.

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'smaug-bpe'

Any thoughts on what the issue may be?

I might have answered my own question. It seems Smaug support was added to llama.cpp three days ago, in release b3001: https://github.com/ggerganov/llama.cpp/releases/tag/b3001

llama-cpp-python is just a few days behind. Guess I'll wait for the oobabooga update.
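For anyone hitting this in the meantime, a quick way to check which llama-cpp-python build you have (the b3001 cutoff comes from the release linked above; which wheel first vendors it is something you'd have to confirm against the release notes):

import llama_cpp
# smaug-bpe support needs a wheel that vendors llama.cpp b3001 or newer
print(llama_cpp.__version__)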

But I guess it shouldn't be smaug-bpe for plain Llama-3.

Owner

How on earth did it get smaug bpe?


llama.cpp/gguf-py/scripts$ python gguf-dump.py Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf --json
"tokenizer.ggml.pre": {
    "index": 17,
    "offset": 619,
    "type": "STRING",
    "value": "smaug-bpe"
},
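For reference, the same field can also be read programmatically with the gguf-py package; a minimal sketch (get_field/parts layout as in gguf_reader.py):

from gguf import GGUFReader

reader = GGUFReader("Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf")
field = reader.get_field("tokenizer.ggml.pre")
# for a STRING field, the last data index points at the raw value bytes
print(bytes(field.parts[field.data[-1]]).decode("utf-8"))  # prints: smaug-bpe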

So I fixed it in a super hacky way:

In /gguf-py/scripts:

python gguf-new-metadata.py --remove-metadata tokenizer.ggml.pre Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5-ptf.gguf
 
nano gguf-new-metadata.py

Add the following in between the if statements that come after the argument definitions, with its whitespace lined up with the if above and below it:

new_metadata[gguf.Keys.Tokenizer.PRE] = MetadataDetails(gguf.GGUFValueType.STRING, "llama-bpe")

Then save it, delete the original model, and run:

  python gguf-new-metadata.py --verbose Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5-ptf.gguf Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf
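An even hackier in-place variant of the same fix (a sketch, not a supported path: it only works because "smaug-bpe" and "llama-bpe" are both 9 bytes, so no offsets move; back up the file first):

import numpy as np
from gguf import GGUFReader

# open the memory-mapped GGUF read-write and overwrite the value bytes directly
reader = GGUFReader("Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf", "r+")
field = reader.get_field("tokenizer.ggml.pre")
value = field.parts[field.data[-1]]  # uint8 view into the mapped file
assert bytes(value) == b"smaug-bpe"  # bail out if the layout isn't what we expect
value[:] = np.frombuffer(b"llama-bpe", dtype=np.uint8)  # same length, offsets stay valid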

@failspy Before I try my hand at that hacky fix, do you have any plans to upload fixed GGUFs? "v4.0"

Owner

@Jobaar That's a wonderful hack! I'll do that, re-upload, and credit you for the fix. Saves me from having to requant everything.

@failspy are you sure this is the right model and not Smaug? As far as I can see, the convert script checks hashes of a tokenized string: https://github.com/ggerganov/llama.cpp/blob/975ec63ff26cdf96156d1126d86f75a395fdc43a/convert-hf-to-gguf.py#L476 so the only way I can see it coming out as smaug-bpe is if the model was indeed Smaug :)

The difference between those two tokenizers is that the original Llama has "ignore_merges": true while Smaug has "ignore_merges": false. Your model has no such setting at all: https://huggingface.co/failspy/Meta-Llama-3-70B-Instruct-abliterated-v3.5/raw/main/tokenizer.json so it probably defaults to false, which is why the convert script recognizes it as smaug-bpe. But it is defined in the previous version: https://huggingface.co/failspy/Smaug-Llama-3-70B-Instruct-abliterated-v3/raw/main/tokenizer.json So it looks like something is wrong with your safetensors model.
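For anyone who wants to check their own copy, that flag lives under the "model" section of tokenizer.json; a quick sketch:

import json

with open("tokenizer.json") as f:
    tok = json.load(f)
# original Llama-3 (llama-bpe): True; None/False here tokenizes like Smaug,
# which is what flips the convert script's hash-based detection
print(tok["model"].get("ignore_merges"))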

Owner

Goes to show me for trying to fix the tokenizer manually. Dammit, Meta-Llama, for not giving me access to the original repo. This is based on the correct model. Thanks for doing the detailed investigation @kurnevsky.


Sorry, but I think I got confused somewhere in the middle -- can you tell me whether I should redownload this or requantize it, or is the hack good enough? Thanks again for the wonderful work.

Owner

The hack is good enough!

They added a new arg to fix tokenizers: https://github.com/ggerganov/llama.cpp/pull/7627
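Assuming that PR's new argument is the --pre-tokenizer flag on gguf-new-metadata.py (I haven't re-checked the diff, so treat the flag name as an assumption; the output filename is arbitrary), the whole hack collapses to one command:

python gguf-new-metadata.py --pre-tokenizer llama-bpe Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5.gguf Meta-Llama-3-70B-Instruct-abliterated-v3.5_q5-fixed.gguf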

@failspy will you fix the safetensors model as well? It will also have wrong tokenization with those missing tokenizer configs.

Owner

Okay, well, I finally managed to get my hands on the fixed tokenizer_config.json as it appears in the meta-llama repo. I've published it to the safetensors repos and fixed Llama-3-8B-Instruct-abliterated-v3.

Meta-Llama-3-70B-Instruct-abliterated-v3.5-GGUF is presently being uploaded. Sorry about this y'all. Thanks for your patience.

But the problem is not with tokenizer_config.json, it's with tokenizer.json.

Owner

@kurnevsky You're right. By "fixed tokenizer_config.json" I meant the config that fixed the original EOS-token issues many people faced with Llama-3.
I've uploaded an updated tokenizer.json to address the ignore_merges issue.

failspy changed discussion status to closed
