Tokenizer change has strange order plus format

#21
by Qubitium - opened

@abhi-db The new tokenizer is better than before, but I noticed some oddities in how it is ordered and formatted:

  1. Why is 100256 not used as <|pad|> so that the 16 extra tokens are consecutive? It shouldn't matter which unused token serves as pad, but having the extra tokens skip around is awkward for trainers that reuse extra tokens by incrementing an id from an offset.
  2. Why do the extra tokens deviate from the usual special token format, <||_unused_N_||> vs. <|unused_N|>? Extra care was taken here to prevent conflicts, but if that were a real issue, then wouldn't <|pad|> and <|endoftext|> have gotten the same treatment?
  3. Why is there a new "100276": <|endofprompt|> token for the base model? Was it ever used in training the base model? This token did not exist in the previous tokenizer.

Sorry about the token order, I know it's a bit funky:

  1. The pad_token_id is 100277 rather than 100256 as an artifact of our training process for dbrx-instruct, where the new tokens <|pad|>, <|im_start|>, <|im_end|> were added explicitly at the end: 100277, 100278, 100279. I wanted to add a pad token to the base model for safety but minimize any tokenizer differences between base and instruct, so that's why <|pad|> is still located at 100277. This also simplifies the translation from dbrx-base to dbrx-tokenizer; all you have to do is:
base_tokenizer.add_tokens(['<|im_start|>', '<|im_end|>'], special_tokens=True)
  2. I originally wanted to fill in the unused tokens with <|extra_N|> style, but I got guidance from our team and HF folks that because this <|...|> style is becoming more common, we might have an accidental collision with real text. And if we have a collision, the model will use a vocab embedding that has never been trained. So they suggested I use the <||_unused_N_||> format instead.

  3. The <|endofprompt|> token was not used in training but comes from the tiktoken package (see links below). I tried my best to make our tokenizer exactly 1-1 with tiktoken, so if you'd like to use any of the special tokens like <|fim_prefix|> or <|endofprompt|>, they will have the exact same ids.

see here for tiktoken special tokens: https://github.com/openai/tiktoken/blob/1b9faf2779855124f05174adf1383e53689ed94b/tiktoken_ext/openai_public.py#L3-L7
and here for expected gaps in tiktoken: https://github.com/openai/tiktoken/issues/47
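
For anyone who wants to see where the gaps come from, here is a rough sketch of the id layout described above. It is only an illustration, not the actual tokenizer build script: the ids for <|pad|>, <|im_start|>, <|im_end|> and <|endofprompt|> are the ones quoted in this thread, the remaining special-token ids are assumed to match tiktoken's cl100k_base definition, and the helper name `unused_ids` is made up for this example.

```python
# Assumption: ordinary BPE tokens occupy ids 0..100255, as in tiktoken's cl100k_base.
N_BPE = 100256

# Special tokens as defined for cl100k_base in tiktoken (assumed to be mirrored 1-1).
TIKTOKEN_SPECIALS = {
    "<|endoftext|>": 100257,
    "<|fim_prefix|>": 100258,
    "<|fim_middle|>": 100259,
    "<|fim_suffix|>": 100260,
    "<|endofprompt|>": 100276,
}

# Tokens appended at the end for dbrx-instruct, ids as quoted in this thread.
ADDED_TOKENS = {
    "<|pad|>": 100277,
    "<|im_start|>": 100278,
    "<|im_end|>": 100279,
}


def unused_ids(n_bpe, *token_maps):
    """Return every id from n_bpe up to the highest assigned id that has no token."""
    used = set()
    for token_map in token_maps:
        used.update(token_map.values())
    return [i for i in range(n_bpe, max(used) + 1) if i not in used]


gap_ids = unused_ids(N_BPE, TIKTOKEN_SPECIALS, ADDED_TOKENS)
print(len(gap_ids), gap_ids[0], gap_ids[1], gap_ids[-1])
# prints: 16 100256 100261 100275
```

So the 16 filler tokens land on 100256 and then 100261 through 100275, which is why they are not one consecutive run.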

Hope this helps!

@abhi-db Cleared everything up. Thanks!

I originally wanted to fill in the unused tokens with <|extra_N|> style, but I got guidance from our team and HF folks that because this <|...|> style is becoming more common, we might have an accidental collision with real text. And if we have a collision, the model will use a vocab embedding that has never been trained. So they suggested I use the <||_unused_N_||> format instead.

Chicken and egg, and now the snake enters the picture. The next web crawl will index this thread and we are back to square one: <||_unused_N_||> is no longer safe. =) Interesting problem: how do you publish special-token strings that will not end up conflicting with themselves once a web crawl or data collection picks them up?
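
One partial mitigation, sketched below, is to scan raw text for special-token literals before it reaches the tokenizer, so that crawled copies of these strings can be escaped or dropped instead of being mapped to untrained embeddings. This is a minimal sketch and not something the thread's authors describe doing: the function name `find_special_literals`, the example token list, and the specific `<||_unused_3_||>` string are hypothetical.

```python
import re

# Hypothetical list of special-token strings to guard against; a real pipeline
# would take these from the tokenizer's own special-token table.
SPECIALS = ["<|pad|>", "<|endoftext|>", "<||_unused_3_||>"]


def find_special_literals(text, specials):
    """Return (offset, literal) pairs for every special-token string found in text."""
    if not specials:
        return []
    # Try longer literals first so a shorter one cannot shadow a longer match.
    ordered = sorted(specials, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    return [(m.start(), m.group()) for m in pattern.finditer(text)]


crawled = "A forum post that mentions <||_unused_3_||> and <|pad|> verbatim."
hits = find_special_literals(crawled, SPECIALS)
print([literal for _, literal in hits])
# prints: ['<||_unused_3_||>', '<|pad|>']
```

tiktoken takes a similar stance by default: its `encode` raises an error when the input contains special-token text unless the caller opts in via `allowed_special` or relaxes `disallowed_special`.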
