Is the BOS token id of 128000 hardcoded into the llama 3.2 tokenizer?
#17 opened by rasyosef
I trained the llama 3.2 tokenizer using an Amharic language corpus and a vocab size of 28k, but when I use it to tokenize text, the BOS token id is still 128000.
Here are the first few lines of the tokenizer_config.json file of the newly trained tokenizer:
{
"added_tokens_decoder": {
"0": {
"content": "<|begin_of_text|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<|end_of_text|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
And here's a tokenization of an example text. As can be seen, the first token id is 128000 when it should have been 0.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("rasyosef/llama-3.2-amharic-tokenizer-28k")
text = "ሁሉም ነገር"
inputs = tokenizer(text, return_tensors="pt")
print(inputs["input_ids"])
Output:
tensor([[128000, 1704, 802]])
I'm having the same problem.
@John985623 Meta support is non-existent ☹️
The issue is with the tokenizer's post-processor: 128000 was hardcoded into the TemplateProcessing step.
print(tokenizer._tokenizer.post_processor)
Output:
Sequence(processors=[ByteLevel(add_prefix_space=True, trim_offsets=False, use_regex=True), TemplateProcessing(single=[SpecialToken(id="<|begin_of_text|>", type_id=0), Sequence(id=A, type_id=0)], pair=[SpecialToken(id="<|begin_of_text|>", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="<|begin_of_text|>", type_id=1), Sequence(id=B, type_id=1)], special_tokens={"<|begin_of_text|>":SpecialToken(id="<|begin_of_text|>", ids=[128000], tokens=["<|begin_of_text|>"])})])
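Note the `ids=[128000]` in that `special_tokens` map: TemplateProcessing emits whatever id is listed there, regardless of what id the token has in the new vocab. A minimal sketch reproducing the behavior with a toy WordLevel tokenizer (placeholder vocab, not the actual Llama tokenizer):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# In this toy vocab, <|begin_of_text|> is id 0...
vocab = {"<|begin_of_text|>": 0, "hi": 1}
tok = Tokenizer(WordLevel(vocab, unk_token="hi"))
tok.pre_tokenizer = Whitespace()
# ...but the template hardcodes a stale id (128000), mimicking the bug
tok.post_processor = TemplateProcessing(
    single="<|begin_of_text|> $0",
    special_tokens=[("<|begin_of_text|>", 128000)],
)
print(tok.encode("hi").ids)  # -> [128000, 1], the stale id wins
```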
But here's the fix: you have to edit your tokenizer's post-processor.
# Edit the post-processor so BOS maps to its id in the new vocab
from tokenizers.processors import Sequence, ByteLevel, TemplateProcessing

tokenizer._tokenizer.post_processor = Sequence([
    # Keep the original byte-level step unchanged
    ByteLevel(add_prefix_space=True, trim_offsets=False, use_regex=True),
    TemplateProcessing(
        single="<|begin_of_text|> $0",
        pair="<|begin_of_text|> $A <|begin_of_text|> $B:1",
        # Map <|begin_of_text|> to id 0 instead of the hardcoded 128000
        special_tokens=[("<|begin_of_text|>", 0)],
    ),
])
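You can sanity-check the mechanism without downloading the actual tokenizer. A sketch with a toy WordLevel vocab (stand-in for the retrained 28k vocab) where the `special_tokens` mapping points at the correct id:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Toy vocab standing in for the retrained tokenizer: BOS is id 0
vocab = {"<|begin_of_text|>": 0, "<|end_of_text|>": 1, "hi": 2}
tok = Tokenizer(WordLevel(vocab, unk_token="<|end_of_text|>"))
tok.pre_tokenizer = Whitespace()
tok.post_processor = TemplateProcessing(
    single="<|begin_of_text|> $0",
    pair="<|begin_of_text|> $A <|begin_of_text|> $B:1",
    special_tokens=[("<|begin_of_text|>", 0)],  # correct id from the vocab
)
print(tok.encode("hi").ids)  # -> [0, 2], BOS is now 0
```

After patching the real tokenizer, call `tokenizer.save_pretrained(...)` so the fixed post-processor is serialized into tokenizer.json and survives a reload.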
Please refer to this issue for more info: https://github.com/huggingface/transformers/issues/33998#issuecomment-2396191976