InternLM2 tokenizer (llamaified version)

Official repo: https://github.com/InternLM/InternLM

Note

This repo converts the InternLM2 tokenizer to LlamaTokenizerFast.

It also replaces token 354 (\u0000) with an emoji so that the tokenizer can be converted by llama.cpp.
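
For reference, here is a quick sanity check (not part of the original card) that the converted vocabulary no longer contains a \u0000 entry, using only the standard transformers API:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("RangiLyu/InternLM2-tokenizer-llama")

# After the replacement described above, no vocabulary entry should contain \u0000.
null_tokens = [tok for tok in tokenizer.get_vocab() if "\u0000" in tok]
print("entries containing \\u0000:", len(null_tokens))  # expected: 0

# Token 354 should now decode to the replacement emoji instead of \u0000.
print("token 354:", tokenizer.convert_ids_to_tokens(354))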

How to use

  • Load
from transformers import AutoTokenizer

llama_tokenizer = AutoTokenizer.from_pretrained("RangiLyu/InternLM2-tokenizer-llama")
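
As a quick illustrative check (not from the original card), the loaded tokenizer behaves like any other LlamaTokenizerFast:

text = "Hello! What's your name?"
ids = llama_tokenizer(text).input_ids                         # encode (BOS is added by default)
print(llama_tokenizer.convert_ids_to_tokens(ids))             # inspect the tokens
print(llama_tokenizer.decode(ids, skip_special_tokens=True))  # round-trip back to text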
  • Apply the ChatML template
chat = [{"role": "user", "content": "Hello! What's your name?"},
        {"role": "assistant", "content": "My name is InternLM2!"},
        {"role": "user", "content": "Nice to meet you InternLM2!"},]

chat_ids = llama_tokenizer.apply_chat_template(chat)
print("ids: ", chat_ids)
print("tokens: ", llama_tokenizer.convert_ids_to_tokens(chat_ids))

# convert the chat history to a string for generation
chat_str = llama_tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print("chat string: ", chat_str)
ids:  [1, 92543, 1008, 364, 9843, 346, 3716, 725, 829, 963, 345, 92542, 364, 92543, 525, 11353, 364, 5211, 963, 505, 4576, 11146, 314, 346, 92542, 364, 92543, 1008, 364, 44501, 442, 3531, 629, 4576, 11146, 314, 346, 92542, 364]
tokens:  ['<s>', '<|im_start|>', 'user', '\n', 'Hello', '!', '▁What', "'s", '▁your', '▁name', '?', '<|im_end|>', '\n', '<|im_start|>', 'ass', 'istant', '\n', 'My', '▁name', '▁is', '▁Intern', 'LM', '2', '!', '<|im_end|>', '\n', '<|im_start|>', 'user', '\n', 'Nice', '▁to', '▁meet', '▁you', '▁Intern', 'LM', '2', '!', '<|im_end|>', '\n']
chat string:  <s><|im_start|>user
Hello! What's your name?<|im_end|>
<|im_start|>assistant
My name is InternLM2!<|im_end|>
<|im_start|>user
Nice to meet you InternLM2!<|im_end|>
<|im_start|>assistant
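
For actual generation, apply_chat_template can also return tensors that are ready to pass to a model. The sketch below is illustrative only: the checkpoint name is a placeholder assumption, not a model shipped with this repo.

from transformers import AutoModelForCausalLM

# Placeholder checkpoint: substitute any causal LM that was trained with this tokenizer.
model = AutoModelForCausalLM.from_pretrained("your/llamaified-internlm2-model")

# Build the prompt tensor with the generation prompt appended, then generate a reply.
input_ids = llama_tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=64)
print(llama_tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))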