Is the BOS token id of 128000 hardcoded into the llama 3.2 tokenizer?
#17 opened by rasyosef
I trained the llama 3.2 tokenizer using an Amharic language corpus and a vocab size of 28k, but when I use it to tokenize text, the BOS token id is still 128000.
Here are the first few lines of the tokenizer_config.json file of the newly trained tokenizer:
{
"added_tokens_decoder": {
"0": {
"content": "<|begin_of_text|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<|end_of_text|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
And here's a tokenization of an example text. As can be seen, the first token id is 128000 when it should have been 0.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("rasyosef/llama-3.2-amharic-tokenizer-28k")
text = "ሁሉም ነገር"
inputs = tokenizer(text, return_tensors="pt")
print(inputs["input_ids"])
Output:
tensor([[128000, 1704, 802]])
I'm having the same problem.
@John985623 Meta support is non-existent ☹️
The issue is with the tokenizer's post-processor: 128000 was hardcoded into the TemplateProcessing step.
print(tokenizer._tokenizer.post_processor)
Output:
Sequence(processors=[ByteLevel(add_prefix_space=True, trim_offsets=False, use_regex=True), TemplateProcessing(single=[SpecialToken(id="<|begin_of_text|>", type_id=0), Sequence(id=A, type_id=0)], pair=[SpecialToken(id="<|begin_of_text|>", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="<|begin_of_text|>", type_id=1), Sequence(id=B, type_id=1)], special_tokens={"<|begin_of_text|>":SpecialToken(id="<|begin_of_text|>", ids=[128000], tokens=["<|begin_of_text|>"])})])
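Note the `ids=[128000]` in that `special_tokens` map: TemplateProcessing emits whatever id is listed there, regardless of what id the token has in the new vocab. A minimal sketch reproducing the behavior with a toy WordLevel tokenizer (placeholder vocab, not the actual Llama tokenizer):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# In this toy vocab, <|begin_of_text|> is id 0...
vocab = {"<|begin_of_text|>": 0, "hi": 1}
tok = Tokenizer(WordLevel(vocab, unk_token="hi"))
tok.pre_tokenizer = Whitespace()
# ...but the template hardcodes a stale id (128000), mimicking the bug
tok.post_processor = TemplateProcessing(
    single="<|begin_of_text|> $0",
    special_tokens=[("<|begin_of_text|>", 128000)],
)
print(tok.encode("hi").ids)  # -> [128000, 1], the stale id wins
```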
But here's the fix: you have to edit your tokenizer's post-processor.
# Edit the post-processor so BOS maps to its id in the new vocab
from tokenizers.processors import Sequence, ByteLevel, TemplateProcessing

tokenizer._tokenizer.post_processor = Sequence([
    # Keep the original byte-level step unchanged
    ByteLevel(add_prefix_space=True, trim_offsets=False, use_regex=True),
    TemplateProcessing(
        single="<|begin_of_text|> $0",
        pair="<|begin_of_text|> $A <|begin_of_text|> $B:1",
        # Map <|begin_of_text|> to id 0 instead of the hardcoded 128000
        special_tokens=[("<|begin_of_text|>", 0)],
    ),
])
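You can sanity-check the mechanism without downloading the actual tokenizer. A sketch with a toy WordLevel vocab (stand-in for the retrained 28k vocab) where the `special_tokens` mapping points at the correct id:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Toy vocab standing in for the retrained tokenizer: BOS is id 0
vocab = {"<|begin_of_text|>": 0, "<|end_of_text|>": 1, "hi": 2}
tok = Tokenizer(WordLevel(vocab, unk_token="<|end_of_text|>"))
tok.pre_tokenizer = Whitespace()
tok.post_processor = TemplateProcessing(
    single="<|begin_of_text|> $0",
    pair="<|begin_of_text|> $A <|begin_of_text|> $B:1",
    special_tokens=[("<|begin_of_text|>", 0)],  # correct id from the vocab
)
print(tok.encode("hi").ids)  # -> [0, 2], BOS is now 0
```

After patching the real tokenizer, call `tokenizer.save_pretrained(...)` so the fixed post-processor is serialized into tokenizer.json and survives a reload.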
Please refer to this issue for more info: https://github.com/huggingface/transformers/issues/33998#issuecomment-2396191976