Ollama modelfile

#1
by pesonen - opened

The model seems to output [/INST] at the beginning of the response when the GGUF file is loaded into Ollama with a minimal Modelfile. The output also seems quite random at times. Would it be possible to get some pointers or instructions on how to create Ollama Modelfiles for these smaller models?

I am facing this same issue. Also the prompt format would be useful to know.

I can't help with ollama, and for prompt format, these questions should probably go to the original model. However, from looking at the chat template, the prompt format should be llama 2 (which also explains that [/INST]).

@pesonen , did you manage to find out what was wrong? I am facing the same issue even with a valid llama2 template applied. I've tried many variations of the formatting and none of them has worked. [INST] or [/INST] is almost always present in the output.

You should be able to use the model directly from this repo, and it should have the template set correctly based on the tokenizer.chat_template property:
https://huggingface.co/docs/hub/ollama

But it might still be wrong somehow at least based on my fast testing:

[screenshot: image.png]

Here is the Ollama-based way of defining the template (created with o1-preview based on the Ollama documentation and our tokenizer chat template, so it might contain errors):
'''
{{- $bos_token := "" }}
{{- $eos_token := "</s>" }}
<<SYS>>
{{- if .System }}
{{ .System }}
{{- else if and (gt (len .Messages) 0) (eq ((index .Messages 0).Role) "system") }}
{{ (index .Messages 0).Content }}
{{- else }}
Olet tekoälyavustaja. Vastaat aina mahdollisimman avuliaasti. Vastauksesi eivät saa sisältää mitään haitallista, epäeettistä, rasistista, seksististä, vaarallista tai laitonta sisältöä. Jos kysymyksessä ei ole mitään järkeä tai se ei ole asiasisällöltään johdonmukainen, selitä miksi sen sijaan, että vastaisit jotain väärin. Jos et tiedä vastausta kysymykseen, älä kerro väärää tietoa.
{{- end }}
<</SYS>>
{{- range $index, $message := .Messages }}
{{- if and (eq $index 0) (eq $message.Role "system") }}
{{- /* Skip the system message already processed */ }}
{{- else }}
{{- if eq $message.Role "user" }}
{{- if and (eq $index 1) (eq ((index .Messages 0).Role) "system") }}
{{- $content := printf "<<SYS>>\n%s\n<</SYS>>\n\n%s" ((index .Messages 0).Content) $message.Content }}
{{ printf "%s [INST] %s [/INST]" $bos_token $content }}
{{- else }}
{{ printf "%s [INST] %s [/INST]" $bos_token $message.Content }}
{{- end }}
{{- else if eq $message.Role "assistant" }}
{{ printf " %s%s" $message.Content $eos_token }}
{{- else }}
{{ error "Conversation roles must alternate between 'user' and 'assistant'." }}
{{- end }}
{{- end }}
{{- end }}
'''

I also tried running the model directly from the repo and thought that Ollama was missing the template for it, but as you said, the GGUF should already contain it in the tokenizer.

But thanks! I will check out your template in the evening 👍

With the new Ollama support for Hugging Face GGUF files, the [INST] tags have disappeared, but the answers are otherwise not great. The model (original or quantized) is not usable for us.

Ollama show ... --modelfile

TEMPLATE "{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

I might later pay more attention to these Ollama inference issues, but for now my focus is on the Ahma-7B-Instruct finetunes. Let me know if someone finds a solution. Our original repo shows how inference works with transformers, and you can use tokenizer.apply_chat_template(messages, tokenize=False) to see how it generates the prompt for inference.

I stumbled on this problem as well. I'm trying to get Ahma-3B-Instruct running on my Raspberry Pi 5 8GB. That means I can't use HF transformers, because I don't have a real GPU and transformers would probably require more memory than I have and still be very slow. I know the model works with reasonable performance on a RPi, just like other similarly sized models like Phi-3-mini, but I can't get the chat interaction to work. The weird output with extra INST tags happens not just with Ollama, but also llamafile and llama.cpp. And that's not really surprising because both Ollama and llamafile are based on llama.cpp, which handles the inference of GGUF models.

While I haven't solved the problem, I've done a lot of research trying to understand why it occurs, so I'm reporting my findings in case they are useful to others. (It's also a great way to learn how these things work and how to debug them when they don't.)

General findings

The tokenizer_config.json for Ahma-3B-Instruct specifies a chat_template that is similar to the llama2 template, though there is at least one important difference: in the original llama2 template the system prompt is optional and may be omitted completely, while the Ahma prompt template includes a fallback system prompt ("Olet tekoälyavustaja. Vastaat aina...") which is used if the user doesn't provide a system prompt. Thus, the model seems to require a system prompt; my understanding is that without a system prompt, it doesn't "know" that it's expected to work as a chatbot.

The tokenizer_config.json also defines a few special tokens, among them [INST], [/INST], <<SYS>> and <</SYS>> which are used in the chat template to delimit instructions and the system prompt.

I used this HF transformers code to check how the prompt should be tokenized, so that I have something to compare against:

Tokenization code

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Finnish-NLP/Ahma-3B-Instruct")

messages = [
    { "role": "system", "content": "Olet tekoälyavustaja."},
    { "role": "user", "content": "Kerro vitsi." },
]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)

for tid in inputs[0]:
    print(f"{tid}\t{repr(tokenizer.decode(tid))}")

Tokenization output

1	'<s>'
64157	''
3	'[INST]'
64157	''
5	'<<SYS>>'
17	'\n'
19709	'Olet'
24080	'tekoäl'
6100	'ya'
40713	'vustaja'
64176	'.'
17	'\n'
6	'<</SYS>>'
17	'\n'
17	'\n'
62707	'Kerro'
20433	'vitsi'
64176	'.'
64157	''
4	'[/INST]'

Note: The special tokens mentioned above (e.g. [INST]) are handled as single tokens. This is important, as we will see later!
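For comparison with other runtimes, the token listing above can be read back into a plain prompt string. The following is a minimal sketch, assuming the llama2-style layout implied by that listing and that token 64157 stands for a space; `build_prompt` and `FALLBACK_SYSTEM` are hypothetical names, and the fallback text is only a short placeholder for the longer default prompt in the chat template.

```python
# Hypothetical helper: rebuilds the single-turn prompt string implied by the
# token listing above. It is not taken from the model repo.
FALLBACK_SYSTEM = "Olet tekoälyavustaja."  # placeholder; the real fallback prompt is longer


def build_prompt(messages, fallback_system=FALLBACK_SYSTEM):
    """Render an optional system message plus one user turn, llama2 style."""
    system = fallback_system
    if messages and messages[0]["role"] == "system":
        system = messages[0]["content"]
        messages = messages[1:]
    if len(messages) != 1 or messages[0]["role"] != "user":
        raise ValueError("this sketch only handles a single user turn")
    user = messages[0]["content"]
    # Assumed layout: <s> [INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]
    return f"<s> [INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


prompt = build_prompt([
    {"role": "system", "content": "Olet tekoälyavustaja."},
    {"role": "user", "content": "Kerro vitsi."},
])
print(repr(prompt))
```

A matching string is necessary but not sufficient: the runtime's tokenizer still has to turn [INST], <<SYS>> and the other tags into single token ids.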

Ollama prompt template

I tested how the prompt template and tokenization work on Ollama by running OLLAMA_DEBUG=1 ollama serve which enables additional debugging, then tried running the model like this:

$ bin/ollama run hf.co/mradermacher/Ahma-3B-Instruct-GGUF:Q4_K_M
>>> /set system Olet tekoälyavustaja.
Set system message.
>>> Kerro vitsi.

The model thinks for a while and then responds with empty output. Looking at the debug output I see this:

time=2024-12-01T21:11:45.900+02:00 level=DEBUG source=routes.go:1466 msg="chat request" images=0 prompt="<|im_start|>system\nOlet tekoälyavustaja.<|im_end|>\n<|im_start|>user\nKerro vitsi.<|im_end|>\n<|im_start|>assistant\n"

It looks like Ollama is using the wrong chat template! This looks like ChatML, not at all like the llama2 style that Ahma-3B-Instruct expects.

To fix that, I created a new file ahma_llama2.Modelfile with the following content. I copied the template from the llama2 model for Ollama. (This is a really simplistic single-turn template, but enough for this purpose.)

FROM hf.co/mradermacher/Ahma-3B-Instruct-GGUF:Q4_K_M

TEMPLATE "[INST] <<SYS>>{{ .System }}<</SYS>>

{{ .Prompt }} [/INST]
"

Then I added this customized model using ollama create ahma_llama2 --file ahma_llama2.Modelfile and ran it using ollama run ahma_llama2. Now it works slightly better, but still produces a lot of [INST] and [/INST] tags. I think there's another problem, but it's hard to debug using ollama (I don't know how to make it show token IDs), so I switched to plain llama.cpp instead, which ollama uses for running GGUF models.
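Because the model's chat template falls back to a default system prompt, a Modelfile could also set that prompt explicitly and stop generation at the instruction tags. Here is a sketch that renders such a file from Python. `render_modelfile` is a hypothetical helper, the system prompt shown is abbreviated, the newlines around `{{ .System }}` are my assumption based on the HF tokenization, and the stop parameters are an untested mitigation rather than a fix for the underlying problem.

```python
# Hypothetical helper that renders a single-turn llama2-style Modelfile.
def render_modelfile(base, system_prompt, stops=("[INST]", "[/INST]")):
    """Return Modelfile text with FROM, TEMPLATE, SYSTEM and stop parameters."""
    template = (
        "[INST] <<SYS>>\n{{ .System }}\n<</SYS>>\n\n"
        "{{ .Prompt }} [/INST]"
    )
    lines = [
        f"FROM {base}",
        f'TEMPLATE """{template}"""',
        f'SYSTEM """{system_prompt}"""',
    ]
    # Stop generation if the model starts emitting instruction tags as text.
    lines += [f'PARAMETER stop "{stop}"' for stop in stops]
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    "hf.co/mradermacher/Ahma-3B-Instruct-GGUF:Q4_K_M",
    "Olet tekoälyavustaja.",  # abbreviated; use the full fallback prompt in practice
)
print(modelfile_text)
```

The printed text can be saved as ahma_llama2.Modelfile and loaded with the same ollama create command as above.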

Llama.cpp tokenization of system prompt

As noted above, the Ahma model expects the prompt to always start with something like <s>[INST] <<SYS>>...<</SYS>>.... And in this sequence, [INST], <<SYS>> and <</SYS>> are special tokens that should be represented by a single token id. I tested how llama.cpp tokenizes the prompt by starting it up in conversation (chat) mode and also including -v for additional debugging: ./llama-cli -m ~/Ahma-3B-Instruct.Q4_K_M.gguf -p "Olet tekoälyavustaja." -cnv -v

This is the end of the output:

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

embd_inp.size(): 28, n_consumed: 0
 [INST] <<SYS>>
Olet tekoälyavustaja.
<</SYS>>

eval: [ '<s>':1, ' [':2022, 'IN':1892, 'ST':2094, ']':64241, ' ':64157, '<':67, '<':67, 'S':64182, 'YS':8611, '>':69, '>':69, '':17, 'Olet':19709, ' tekoly':26831, 'avus':51008, 'taja':1675, '.':64176, '':17, '<':67, '<':67, '/':64211, 'S':64182, 'YS':8611, '>':69, '>':69, '':17, '':17 ]
n_past = 28
embd_inp.size(): 28, n_consumed: 28
waiting for user input

It seems to be using the correct chat template! However, the tokenization looks very different from what HF transformers did above. The [INST] and <<SYS>> tags are represented as multiple tokens. So although to a human they may look the same, to the model they are completely different things. Because it doesn't receive the special tokens, the model never properly enters the chat/instruct "mindset" (pardon the anthropomorphism) and thus responds in a confused way. It has just seen some mysterious [INST] and [/INST] tags, so it thinks those are normal things, and happily generates some more of them! At least that's my interpretation.
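This comparison can be done mechanically instead of by eye. The sketch below parses an eval line like the one above and reports which special tokens did not arrive as a single token with the id that HF transformers assigned. The function names are hypothetical, the expected ids are copied from the HF tokenization output earlier, and the regex assumes pieces never contain a single quote.

```python
import re

# Matches entries such as '<s>':1 in the llama-cli -v "eval:" line.
ENTRY = re.compile(r"'([^']*)':(\d+)")


def parse_eval(line):
    """Return (piece, token_id) pairs from an eval debug line."""
    return [(piece, int(token_id)) for piece, token_id in ENTRY.findall(line)]


def missing_specials(line, expected):
    """List expected special tokens that are absent as single tokens."""
    seen = set(parse_eval(line))
    return [tok for tok, token_id in expected.items() if (tok, token_id) not in seen]


# Ids taken from the HF tokenization output.
EXPECTED = {"[INST]": 3, "<<SYS>>": 5, "<</SYS>>": 6}

# Shortened copy of the eval line above.
split_line = (
    "eval: [ '<s>':1, ' [':2022, 'IN':1892, 'ST':2094, ']':64241, "
    "' ':64157, '<':67, '<':67, 'S':64182, 'YS':8611, '>':69, '>':69 ]"
)
print(missing_specials(split_line, EXPECTED))
```

An empty list would mean that every expected special token showed up as one token.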

Now the question is: why does the llama.cpp tokenizer handle this input differently than the original tokenizer? I don't have an answer yet, but while researching, I found interesting differences between llama2 (which appears not to have special tokens for e.g. [INST] and <<SYS>>) and for example Phi-3-mini, which does have similar special tokens (e.g. <|system|>), and llama.cpp is able to tokenize them properly.

Llama.cpp has its own tokenizer implementation, and tries to map the hf tokenizer to its own system at gguf conversion time. Depending on how the model tokenizer differs from what llama.cpp expects, it is quite likely that llama.cpp doesn't get it right, and the result is a different "valid" tokenization that confuses the model, exactly as you describe.

arguably, llama.cpp is always wrong if it differs from what hf does, so this would be a bug or missing support in llama.cpp. it can be as simple as transformers writing out multiple vocabularies and picking the right one when it runs itself, while llama.cpp picks the wrong one, which might be broken.

you could try reporting it to llama.cpp upstream, and you might get helpful insights into what goes wrong and how things might have to be fixed, but you would have to be prepared to fix some of it yourself, as the llama.cpp devs are severely overloaded.

if anything comes up, I'd be happy to requantize, of course.

Thanks a lot @mradermacher for confirming my suspicions and suggesting possible ways to investigate.

I don't want to bother the llama.cpp team, at least not yet. I'll try to understand the problem a bit better. Perhaps I'll try quantizing this model myself - I've done that for some other models that I've fine-tuned.

It seems to me that this interaction between tokenizer configurations (incl. special tokens), prompt templates, different tokenizer implementations etc. is a very fiddly business with lots of potential for things to go wrong. I've had similar issues when fine-tuning models: models becoming confused and going off the rails because some special token is not where they expect it, or being unable to stop because they don't know how to use end tokens. Nowadays I just try to pick a base model that supports ChatML because everything else seems to be hit-and-miss. I hope that in the near future these things will become consolidated so there are fewer variations to consider for a project like llama.cpp that has to support all the "innovation" that happens in the HF ecosystem.

It is a very fiddly business. And I frankly only understand some of it. One problem is that llama.cpp doesn't want a dependency on transformers, because it would otherwise have to pull in python and the whole of transformers etc.

That causes issues, but is a valid trade-off - one of the main strengths of llama.cpp is that it is self-contained.

Some other decisions (not applicable in this case) make less sense, such as implementing their own pretokenizers because they didn't want to pull in a regex library on windows. But you gotta respect their choices, because they do all the hard work :)

And lastly, it could potentially also be a problem with the finetuning, e.g. unintended changes to the vocabulary, which might affect llama.cpp more than transformers.

I don't think things become more consolidated, but I'll be happy to be wrong :)

The fiddliness is there, e.g. when fine-tuning models, even without including llama.cpp in the picture.

I didn't intend to criticize llama.cpp devs, they are doing an awesome job in a very challenging ecosystem; I also respect their choices and the way they've managed to keep the software self-contained. And I think it's only healthy to have multiple independent implementations of tokenizers, inference engines etc. even if it causes short-term headaches.

I tried quantizing the model myself using latest llama.cpp and the convert_hf_to_gguf.py script, but got the same result as before.

But I did some more investigation, and I think I found the cause! The problem is that in the GGUF metadata, the tokens [INST], [/INST], <<SYS>> and <</SYS>> are not marked as type CONTROL, instead they are type NORMAL. I found an explanation here: https://github.com/ggerganov/llama.cpp/discussions/9379
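To illustrate why the token type matters, here is a toy sketch (not llama.cpp code) in which only tokens typed CONTROL are matched as whole units before ordinary tokenization. The numeric values follow the gguf token type enum (NORMAL = 1, CONTROL = 3, other members omitted); `split_on_control` and the tiny vocabularies are hypothetical and much simpler than what a real tokenizer does.

```python
import re
from enum import IntEnum


class TokenType(IntEnum):
    # Subset of the gguf token types; the remaining members are omitted here.
    NORMAL = 1
    CONTROL = 3


def split_on_control(text, vocab):
    """Split text so that CONTROL tokens become atomic pieces.

    vocab maps token text to TokenType. All other text is returned as raw
    fragments that an ordinary tokenizer would break up further.
    """
    control = [tok for tok, kind in vocab.items() if kind == TokenType.CONTROL]
    if not control:
        return [text]
    # Try longer tokens first so that a shorter tag cannot shadow a longer one.
    control.sort(key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(tok) for tok in control) + ")"
    return [part for part in re.split(pattern, text) if part]


TAGS = ["[INST]", "[/INST]", "<<SYS>>", "<</SYS>>"]
as_converted = {tag: TokenType.NORMAL for tag in TAGS}  # state found in the GGUF
as_edited = {tag: TokenType.CONTROL for tag in TAGS}    # state after retyping

text = "[INST] <<SYS>>\nOlet tekoälyavustaja.\n<</SYS>>"
print(split_on_control(text, as_converted))
print(split_on_control(text, as_edited))
```

With the NORMAL typing the whole string is passed on as ordinary text, so the tags get chopped into pieces like ' [', 'IN', 'ST'; with the CONTROL typing each tag survives as one unit.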

As suggested in the above discussion, I loaded up one of the GGUF files (Q4_K_M) into the GGUF editor here: https://huggingface.co/spaces/CISCai/gguf-editor

Then I edited the above mentioned four tokens and changed their type to CONTROL. It looks like this:

[screenshot: kuva.png]

Finally I downloaded the edited GGUF and ran it using this llama.cpp command: $ ./llama-cli -m ~/Downloads/Ahma-3B-Instruct.Q4_K_M.gguf -p "Olet tekoälyavustaja." -cnv -v

End of output now looks like this:

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

embd_inp.size(): 17, n_consumed: 0
   
Olet tekoälyavustaja.
 

eval: [ '<s>':1, '[INST]':3, ' ':64157, ' ':64157, '<<SYS>>':5, ' ':64157, '':17, 'Olet':19709, ' tekoly':26831, 'avus':51008, 'taja':1675, '.':64176, '':17, '<</SYS>>':6, ' ':64157, '':17, '':17 ]
n_past = 17
embd_inp.size(): 17, n_consumed: 17
waiting for user input

Now the special tokens are decoded as single tokens! And it's possible to chat with the model; it's no longer confused or spewing [INST] tags!
Here is an actual discussion asking it to tell some jokes (same command but without the -v flag):

> Kerro vitsi.
 Toki, tässä on vitsi:

Kysymys: Mitä lehmä tekee silmät kiinni?

Vastaa: Hän valmistaa silkkiä!

Toivottavasti tämä vastaa odotuksiasi. Kerro minulle, jos tarvitset lisää vitsin selityksiä tai jos on muita kysymyksiä.

> Kerro toinen vitsi.
 Toki, tässä on toinen vitsi:

Kysymys: Mitä lehmä tekee silmät kiinni?

Vastaa: Hän menee ulos kävelylle!

Tämä vitsi noudattaa samaa rakennetta kuin ensimmäinen, mutta lisää ylimääräisen aiheen (lehmä), joka on yleinen viittaus ensimmäiseen vitsiin. Se on kevyt ja hauska vitsi, joka saa sinut nauramaan. Kerro minulle, jos tarvitset lisää vitsejä tai jos on muita kysymyksiä.

The jokes are awful, but that is expected from a 3B model.

Now I'm not sure why the special tokens get assigned the wrong type. They do seem to have the "special": true attribute in tokenizer_config.json.

I also checked the convert_hf_to_gguf.py code. The way I read it, a token is marked as CONTROL if either one of the following is true:

  1. It is marked as "special": true in tokenizer_config.json
  2. does_token_look_special(token) returns True

In my understanding, 1 should be True for these tokens, while 2 is not (they don't match <| or similar patterns). Probably I'm missing something here. But hey, now I have a working GGUF model that I can chat with using llama.cpp!

convert_hf_to_gguf.py might use the tokenizer.model file, which unfortunately is in a rotten data format that is hard to process (ok, it's not rotten, just very inconvenient for this use case :). if the tokenizer.model disagrees with the tokenizer.json file that would strictly be a bug in transformers - but it wouldn't matter, because it only writes the files, not reads them.

(that analysis is just out of my ass btw., i didn't actually verify this, these are just my thoughts on possible causes. hopefully they are helpful)

There's no tokenizer.model file in the original HF model: https://huggingface.co/Finnish-NLP/Ahma-3B-Instruct/tree/main

I traced what happens in convert_hf_to_gguf.py for this model. The special token logic I mentioned above doesn't apply to this model architecture (LlamaForCausalLM). The tokenizer part of the conversion is handled by the LlamaHfVocab class in the gguf-py/gguf/vocab.py module. That code initializes the HF tokenizer and reads the special tokens from tokenizer.all_special_tokens. But for some as-yet-mysterious reason, this doesn't include the [INST], <<SYS>> etc. tokens. Here is some test code:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Finnish-NLP/Ahma-3B-Instruct")
print(tokenizer)
print(tokenizer.all_special_tokens)

Output:

PreTrainedTokenizerFast(name_or_path='Finnish-NLP/Ahma-3B-Instruct', vocab_size=64256, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<PAD>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
    0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    3: AddedToken("[INST]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    4: AddedToken("[/INST]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    5: AddedToken("<<SYS>>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    6: AddedToken("<</SYS>>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
    64256: AddedToken("<PAD>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
['<s>', '</s>', '<unk>', '<PAD>']

This looks confusing. When I print the tokenizer, it (or rather the added_tokens_decoder) seems to be aware of all the added/special tokens. But accessing .all_special_tokens only returns some of them. If the HF tokenizer doesn't (consistently) know about these special tokens, that explains why convert_hf_to_gguf.py can't convert them properly into CONTROL tokens.

Some possible explanations:

  1. There is something wrong with the tokenizer configuration for this model, and that causes the HF tokenizer to be confused about the special tokens.
  2. The HF tokenizer is buggy.
  3. The way convert_hf_to_gguf.py queries for the special tokens (using tokenizer.all_special_tokens) is wrong, some other method would work better.
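The mismatch between added_tokens_decoder and all_special_tokens can be checked with a small helper. This is a sketch only: `unlisted_special_tokens` is a hypothetical name, and the stub below stands in for the real tokenizer by copying the values from the printed output above, so that the snippet runs without downloading anything.

```python
from types import SimpleNamespace


def unlisted_special_tokens(tokenizer):
    """Return added tokens flagged special that all_special_tokens omits."""
    listed = set(tokenizer.all_special_tokens)
    return [
        (token_id, added.content)
        for token_id, added in sorted(tokenizer.added_tokens_decoder.items())
        if added.special and added.content not in listed
    ]


# Stub mirroring the tokenizer printout above (not the real tokenizer object).
contents = {0: "<unk>", 1: "<s>", 2: "</s>", 3: "[INST]", 4: "[/INST]",
            5: "<<SYS>>", 6: "<</SYS>>", 64256: "<PAD>"}
stub = SimpleNamespace(
    all_special_tokens=["<s>", "</s>", "<unk>", "<PAD>"],
    added_tokens_decoder={
        token_id: SimpleNamespace(content=content, special=True)
        for token_id, content in contents.items()
    },
)
print(unlisted_special_tokens(stub))
```

The same function should accept the object returned by AutoTokenizer.from_pretrained("Finnish-NLP/Ahma-3B-Instruct"), since it only reads the two attributes shown in the test code above; I have not verified the result against the real tokenizer.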

Great work @osma and @mradermacher

This work truly helps us in understanding the small nitty gritty details that go into making these work with Ollama/llama.cpp for which we haven't had enough time.
One thing to add here: for the model to work properly, you would probably need to always use our fallback system prompt, since in our finetuning we have not used any system prompt other than the fallback (default) one. This will continue once I get Ahma-7B-Instruct out, which shouldn't take long anymore.
