metadata
license: other
InternLM2 tokenizer (llamaified version)
Official repo: https://github.com/InternLM/InternLM
Note
This repo converts the InternLM2 tokenizer to LlamaTokenizerFast.
It also replaces token 354 (\u0000) with an emoji so that the tokenizer can be converted by llama.cpp.
How to use
- Load
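from transformers import AutoTokenizer
# The repo id must be passed as a string; the variable name matches the snippets that follow.
llama_tokenizer = AutoTokenizer.from_pretrained("RangiLyu/InternLM2-tokenizer-llama")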
- Apply the ChatML template
chat = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "My name is InternLM2!"},
    {"role": "user", "content": "Nice to meet you InternLM2!"},
]
chat_ids = llama_tokenizer.apply_chat_template(chat)
print("ids: ", chat_ids)
print("tokens: ", llama_tokenizer.convert_ids_to_tokens(chat_ids))
# convert the chat history to a string for generation
chat_str = llama_tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print("chat string: ", chat_str)
Output:
ids: [1, 92543, 1008, 364, 9843, 346, 3716, 725, 829, 963, 345, 92542, 364, 92543, 525, 11353, 364, 5211, 963, 505, 4576, 11146, 314, 346, 92542, 364, 92543, 1008, 364, 44501, 442, 3531, 629, 4576, 11146, 314, 346, 92542, 364]
tokens: ['<s>', '<|im_start|>', 'user', '\n', 'Hello', '!', '▁What', "'s", '▁your', '▁name', '?', '<|im_end|>', '\n', '<|im_start|>', 'ass', 'istant', '\n', 'My', '▁name', '▁is', '▁Intern', 'LM', '2', '!', '<|im_end|>', '\n', '<|im_start|>', 'user', '\n', 'Nice', '▁to', '▁meet', '▁you', '▁Intern', 'LM', '2', '!', '<|im_end|>', '\n']
chat string: <s><|im_start|>user
Hello! What's your name?<|im_end|>
<|im_start|>assistant
My name is InternLM2!<|im_end|>
<|im_start|>user
Nice to meet you InternLM2!<|im_end|>
<|im_start|>assistant
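
To make the layout of the chat string above easier to follow, here is a minimal pure-Python sketch that rebuilds the same text without loading the tokenizer. It is inferred only from the output shown in this README, not taken from the tokenizer's actual chat template; `format_chatml` is a hypothetical helper name, and the `<s>` default for `bos_token` is an assumption based on the first token printed above.

```python
# Hypothetical helper: mimics the ChatML layout seen in the README output.
# It is NOT the tokenizer's real chat template, only a sketch inferred from it.
def format_chatml(messages, add_generation_prompt=False, bos_token="<s>"):
    parts = [bos_token]  # assumption: the string starts with the BOS token
    for message in messages:
        # Each turn: <|im_start|>role, newline, content, <|im_end|>, newline
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Hello! What's your name?"},
    {"role": "assistant", "content": "My name is InternLM2!"},
    {"role": "user", "content": "Nice to meet you InternLM2!"},
]

chat_str = format_chatml(chat, add_generation_prompt=True)
print(chat_str)
```

For real use, prefer `apply_chat_template` on the loaded tokenizer as shown earlier, since it applies whatever template ships with the repo.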