Tokenizer adds space between sentence start and instruction start
#74 · opened by ldavid
Is there a way to reconcile the space that is added between the <s> and [INST] tokens? A simple example:
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

chat = [
    {
        "role": "user",
        "content": "You are my Python programming assistant. Write a program that generates the first 10 fibonacci numbers in Python.",
    }
]

# render the template to a string, then round-trip it through the tokenizer
prompt = tokenizer.apply_chat_template(chat, tokenize=False)
print(prompt)
print(tokenizer.decode(tokenizer(prompt, add_special_tokens=False)["input_ids"]))
This prints:
<s>[INST] You are my Python programming assistant. Write a program that generates the first 10 fibonacci numbers in Python. [/INST]
<s> [INST] You are my Python programming assistant. Write a program that generates the first 10 fibonacci numbers in Python. [/INST]
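As far as I can tell, the token ids themselves agree: apply_chat_template with its default tokenize=True appears to tokenize the rendered string with add_special_tokens=False, so the extra space would only show up on decode. A quick check, continuing the snippet above:

ids_from_template = tokenizer.apply_chat_template(chat)  # tokenize=True is the default
ids_from_string = tokenizer(prompt, add_special_tokens=False)["input_ids"]
print(ids_from_template == ids_from_string)  # I expect True: the space is a decoding artifact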
I'm trying to get only the completion by stripping my prompt string from the start of the decoded model output, and this extra space breaks the exact prefix match.
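One workaround I'm considering is to skip string matching entirely and slice the generated ids by the prompt length instead; a minimal sketch, with the model-loading details (dtype, device_map) as placeholder assumptions:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

chat = [
    {
        "role": "user",
        "content": "Write a program that generates the first 10 fibonacci numbers in Python.",
    }
]

# tokenize the template directly instead of round-tripping through a string
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)

# everything after the prompt tokens is the completion, no string subtraction needed
completion_ids = output_ids[0, input_ids.shape[-1]:]
print(tokenizer.decode(completion_ids, skip_special_tokens=True))

Since this works on token ids rather than decoded strings, the space that decode inserts after <s> never enters the comparison at all.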