dummy32000
Can you explain how you modified the files in the original repo in order to allow llama.cpp/convert.py to export to GGUF?
I noticed when loading the model that the init log mentions an EOS token of `<dummy32000>`. After doing some research on the ChatML format, I figured your goal is to not worry about the added tokens and to rely on the model's ability to figure it out itself (which is acceptable per the spec), though it is mentioned that there could be some benefit to using those tokens. I tried to make the most comprehensive changes I could to the original repo's configs, making sure the added tokens for `<|im_start|>` and `<|im_end|>` made it into the GGUF's tokenizer.
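Concretely, the kind of config change I mean looks roughly like this. It's a minimal sketch rather than the exact files in the repo: I'm assuming the usual 32000-entry Mistral SentencePiece vocab, so the two ChatML markers take the next free ids, and convert.py should pick up an `added_tokens.json` sitting next to `tokenizer.model` and append those entries to the GGUF vocabulary.

```python
import json

# Sketch only: register the ChatML markers as added tokens so that
# llama.cpp's convert.py merges them into the GGUF vocabulary.
# The ids assume a 32000-entry base vocab; check the actual tokenizer
# size before writing this file.
added_tokens = {
    "<|im_start|>": 32000,
    "<|im_end|>": 32001,
}

with open("added_tokens.json", "w") as f:
    json.dump(added_tokens, f, indent=2)
```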
After some experimenting, I found that `<|im_end|>` was being generated and caught by llama.cpp correctly, but the greedy nature of the tokenizer meant that `<|im_start|>` in the prompt was always ripped apart into ordinary pieces. I'm starting to see that this is more an issue of llama.cpp/main not giving the user control to inject a specific token into the prompt, whereas more sophisticated APIs built on the GGUF format do allow it. I also have to ask whether any of it really matters at the end of the day, since I haven't seen good numbers on how much of a detriment it is for the model not to use the added tokens.
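To illustrate the kind of control I mean, here is a sketch using llama-cpp-python as one example of such an API. It assumes a recent version where `Llama.tokenize` exposes a `special` flag, and the model path is a placeholder:

```python
from llama_cpp import Llama

llm = Llama(model_path="mistral-chatml.gguf")  # placeholder path

prompt = b"<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"

# With special=False the markers are greedily split into ordinary pieces;
# with special=True they are matched against the added vocabulary and each
# comes back as a single token id.
plain_ids = llm.tokenize(prompt, add_bos=True, special=False)
special_ids = llm.tokenize(prompt, add_bos=True, special=True)

print(len(plain_ids), len(special_ids))  # the special variant is shorter
```

If the second call returns fewer ids and includes the added-token ids, the tokens at least survived the conversion, even if main itself won't let you feed them in directly.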
Never mind, I should've pulled the latest version of the Mistral repo. GGUF conversion seems to work nearly ootb now.
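For anyone landing here later, the out-of-the-box path is roughly the following; paths are placeholders and I'm assuming the current convert.py with its `--outfile`/`--outtype` options:

```python
import subprocess

# Rough sketch of the conversion step, run from the llama.cpp checkout.
# --outtype f16 produces an unquantized GGUF that can be quantized later
# with the quantize tool.
subprocess.run(
    [
        "python", "convert.py",
        "path/to/mistral-finetune",   # local HF-style model directory
        "--outfile", "model-f16.gguf",
        "--outtype", "f16",
    ],
    check=True,
)
```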