---
library_name: transformers
tags: ["gemma","chatml"]
---
# ChatML Tokenizer for Gemma
This repository provides a fast tokenizer for [google/gemma-7b](https://huggingface.co/google/gemma-7b) that uses the ChatML format. The tokenizer was created by replacing the string values of the original tokens with IDs `106` (`<start_of_turn>`) and `107` (`<end_of_turn>`) with the ChatML tokens `<|im_start|>` and `<|im_end|>`.
No new tokens were added in the process, so the original model's embedding matrix does not need to be modified.
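The swap itself can be pictured as a plain string substitution in the serialized vocabulary. The snippet below is a minimal sketch of that idea with a toy dict-shaped vocab; the real `tokenizer.json` layout is more involved, but the key point is the same: only the token *strings* at IDs 106/107 change, the IDs (and therefore the embeddings) stay put.

```python
# Toy vocabulary standing in for the relevant part of tokenizer.json
# (illustrative only -- the real file has a different structure).
vocab = {"<start_of_turn>": 106, "<end_of_turn>": 107, "hello": 108}

# Replace the token strings while leaving every ID untouched.
replacements = {"<start_of_turn>": "<|im_start|>", "<end_of_turn>": "<|im_end|>"}
new_vocab = {replacements.get(tok, tok): idx for tok, idx in vocab.items()}

print(new_vocab)  # IDs 106 and 107 are unchanged, only the strings moved
```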
```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")
messages = [
{"role": "system", "content": "You are Gemma."},
{"role": "user", "content": "Hello, how are you?"},
{"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
# <|im_start|>system
# You are Gemma.<|im_end|>
# <|im_start|>user
# Hello, how are you?<|im_end|>
# <|im_start|>assistant
# I'm doing great. How can I help you today?<|im_end|>
```
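For generation you would normally call `apply_chat_template` with `add_generation_prompt=True`, which appends an opening assistant turn for the model to continue from. As a rough illustration of what the ChatML template produces, here is a self-contained pure-Python sketch (an assumed rendering, not the tokenizer's actual Jinja template):

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render messages in ChatML: each turn wrapped in <|im_start|>role ... <|im_end|>."""
    out = ""
    for m in messages:
        out += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model generates the reply from here.
        out += "<|im_start|>assistant\n"
    return out

messages = [{"role": "user", "content": "Hello, how are you?"}]
print(render_chatml(messages, add_generation_prompt=True))
```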
## Test
```python
tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")
original_tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
# get special tokens
print(tokenizer.special_tokens_map)
print(original_tokenizer.special_tokens_map)
# check length of vocab
assert len(tokenizer) == len(original_tokenizer), "tokenizers do not have the same length"
# tokenize messages
messages = [
{"role": "user", "content": "Hello, how are you?"},
{"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]
chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
google_format = original_tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(f"ChatML: \n{chatml}\n-------------------\nGoogle: \n{google_format}")
```