---
library_name: transformers
tags: ["gemma","chatml"]
---

# ChatML Tokenizer for Gemma

This repository includes a fast tokenizer for [google/gemma-7b](https://huggingface.co/google/gemma-7b) that uses the ChatML format. The tokenizer was created by replacing the string values of the original tokens with ids `106` (`<start_of_turn>`) and `107` (`<end_of_turn>`) with the ChatML tokens `<|im_start|>` and `<|im_end|>`.

No new tokens were added in the process, so the original model's embedding matrix does not need to be modified.
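The token swap described above can be sketched as a simple string replacement in the tokenizer's `tokenizer.json` file. The snippet below uses a minimal mock of the relevant `added_tokens` section to illustrate the idea; the field names follow the `tokenizers` JSON layout, and the actual conversion may have involved additional files (e.g. `special_tokens_map.json`):

```python
import json

# Minimal mock of the relevant part of a tokenizer.json file.
# In the real Gemma tokenizer, ids 106 and 107 map to
# <start_of_turn> and <end_of_turn>.
tokenizer_json = {
    "added_tokens": [
        {"id": 106, "content": "<start_of_turn>", "special": True},
        {"id": 107, "content": "<end_of_turn>", "special": True},
    ]
}

# Replace the string values in place. The ids (and therefore the
# model's embedding rows) stay the same, so no new tokens are added.
replacements = {"<start_of_turn>": "<|im_start|>", "<end_of_turn>": "<|im_end|>"}
for token in tokenizer_json["added_tokens"]:
    if token["content"] in replacements:
        token["content"] = replacements[token["content"]]

print([t["content"] for t in tokenizer_json["added_tokens"]])
# ['<|im_start|>', '<|im_end|>']
```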

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")

messages = [
  {"role": "system", "content": "You are Gemma."},
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(chatml)

# <|im_start|>system
# You are Gemma.<|im_end|>
# <|im_start|>user
# Hello, how are you?<|im_end|>
# <|im_start|>assistant
# I'm doing great. How can I help you today?<|im_end|>

```


## Test

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("philschmid/gemma-tokenizer-chatml")
original_tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")

# get special tokens
print(tokenizer.special_tokens_map)
print(original_tokenizer.special_tokens_map)

# check that both vocabularies have the same size
assert len(tokenizer) == len(original_tokenizer), "tokenizers do not have the same vocabulary size"

# tokenize messages 
messages = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
]

chatml = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
google_format = original_tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)

print(f"ChatML: \n{chatml}\n-------------------\nGoogle: \n{google_format}")

```