jeduardogruiz committed on
Commit f9b51a0 · verified · 1 Parent(s): 4692ce0

Create README.md

Files changed (1)
  1. README.md +180 -92
README.md CHANGED
@@ -1,130 +1,218 @@
- # ⏳ tiktoken
-
- tiktoken is a fast [BPE](https://en.wikipedia.org/wiki/Byte_pair_encoding) tokeniser for use with
- OpenAI's models.
-
- ```python
- import tiktoken
- enc = tiktoken.get_encoding("cl100k_base")
- assert enc.decode(enc.encode("hello world")) == "hello world"
-
- # To get the tokeniser corresponding to a specific model in the OpenAI API:
- enc = tiktoken.encoding_for_model("gpt-4")
- ```
-
- The open source version of `tiktoken` can be installed from PyPI:
  ```
- pip install tiktoken
  ```
-
- The tokeniser API is documented in `tiktoken/core.py`.
-
- Example code using `tiktoken` can be found in the
- [OpenAI Cookbook](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb).
-
- ## Performance
-
- `tiktoken` is between 3-6x faster than a comparable open source tokeniser:
-
- ![image](https://raw.githubusercontent.com/openai/tiktoken/main/perf.svg)
-
- Performance measured on 1GB of text using the GPT-2 tokeniser, using `GPT2TokenizerFast` from
- `tokenizers==0.13.2`, `transformers==4.24.0` and `tiktoken==0.2.0`.
-
- ## Getting help
-
- Please post questions in the [issue tracker](https://github.com/openai/tiktoken/issues).
-
- If you work at OpenAI, make sure to check the internal documentation or feel free to contact
- @shantanu.
-
- ## What is BPE anyway?
-
- Language models don't see text like you and I, instead they see a sequence of numbers (known as tokens).
- Byte pair encoding (BPE) is a way of converting text into tokens. It has a couple desirable
- properties:
- 1) It's reversible and lossless, so you can convert tokens back into the original text
- 2) It works on arbitrary text, even text that is not in the tokeniser's training data
- 3) It compresses the text: the token sequence is shorter than the bytes corresponding to the
-    original text. On average, in practice, each token corresponds to about 4 bytes.
- 4) It attempts to let the model see common subwords. For instance, "ing" is a common subword in
-    English, so BPE encodings will often split "encoding" into tokens like "encod" and "ing"
-    (instead of e.g. "enc" and "oding"). Because the model will then see the "ing" token again and
-    again in different contexts, it helps models generalise and better understand grammar.
-
- `tiktoken` contains an educational submodule that is friendlier if you want to learn more about
- the details of BPE, including code that helps visualise the BPE procedure:
- ```python
- from tiktoken._educational import *
-
- # Train a BPE tokeniser on a small amount of text
- enc = train_simple_encoding()
-
- # Visualise how the GPT-4 encoder encodes text
- enc = SimpleBytePairEncoding.from_tiktoken("cl100k_base")
- enc.encode("hello world e")
  ```
 
-
- ## Extending tiktoken
-
- You may wish to extend `tiktoken` to support new encodings. There are two ways to do this.
-
- **Create your `Encoding` object exactly the way you want and simply pass it around.**
-
- ```python
- cl100k_base = tiktoken.get_encoding("cl100k_base")
-
- # In production, load the arguments directly instead of accessing private attributes
- # See openai_public.py for examples of arguments for specific encodings
- enc = tiktoken.Encoding(
-     # If you're changing the set of special tokens, make sure to use a different name
-     # It should be clear from the name what behaviour to expect.
-     name="cl100k_im",
-     pat_str=cl100k_base._pat_str,
-     mergeable_ranks=cl100k_base._mergeable_ranks,
-     special_tokens={
-         **cl100k_base._special_tokens,
-         "<|im_start|>": 100264,
-         "<|im_end|>": 100265,
-     }
- )
  ```
-
- **Use the `tiktoken_ext` plugin mechanism to register your `Encoding` objects with `tiktoken`.**
-
- This is only useful if you need `tiktoken.get_encoding` to find your encoding, otherwise prefer
- option 1.
-
- To do this, you'll need to create a namespace package under `tiktoken_ext`.
-
- Layout your project like this, making sure to omit the `tiktoken_ext/__init__.py` file:
- ```
- my_tiktoken_extension
- ├── tiktoken_ext
- │   └── my_encodings.py
- └── setup.py
- ```
-
- `my_encodings.py` should be a module that contains a variable named `ENCODING_CONSTRUCTORS`.
- This is a dictionary from an encoding name to a function that takes no arguments and returns
- arguments that can be passed to `tiktoken.Encoding` to construct that encoding. For an example, see
- `tiktoken_ext/openai_public.py`. For precise details, see `tiktoken/registry.py`.
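For illustration, a minimal `my_encodings.py` might look like the sketch below. The encoding name, the pattern string, and the rank-file URL are hypothetical placeholders; only `load_tiktoken_bpe` and the keyword arguments of `tiktoken.Encoding` come from tiktoken itself.

```python
# my_encodings.py -- hypothetical sketch of an ENCODING_CONSTRUCTORS module
from tiktoken.load import load_tiktoken_bpe


def my_encoding():
    # Load BPE merge ranks from a .tiktoken rank file (placeholder URL).
    mergeable_ranks = load_tiktoken_bpe("https://example.com/my_encoding.tiktoken")
    # Return the keyword arguments that tiktoken.Encoding expects.
    return {
        "name": "my_encoding",
        "pat_str": r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
        "mergeable_ranks": mergeable_ranks,
        "special_tokens": {"<|endoftext|>": len(mergeable_ranks)},
    }


ENCODING_CONSTRUCTORS = {"my_encoding": my_encoding}
```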
-
- Your `setup.py` should look something like this:
- ```python
- from setuptools import setup, find_namespace_packages
-
- setup(
-     name="my_tiktoken_extension",
-     packages=find_namespace_packages(include=['tiktoken_ext*']),
-     install_requires=["tiktoken"],
-     ...
- )
  ```
-
- Then simply `pip install ./my_tiktoken_extension` and you should be able to use your
- custom encodings! Make sure **not** to use an editable install.
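At that point, an encoding registered this way should be discoverable by name; the name below is the hypothetical one from the sketch above:

```python
import tiktoken

# "my_encoding" is the hypothetical name exposed via ENCODING_CONSTRUCTORS above
enc = tiktoken.get_encoding("my_encoding")
```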
 
+ ---
+ language:
+ - fr
+ - it
+ - de
+ - es
+ - en
+ license: apache-2.0
+ inference:
+   parameters:
+     temperature: 0.5
+ widget:
+ - messages:
+   - role: user
+     content: What is your favorite condiment?
+ ---
+ # Model Card for Mixtral-8x7B
+
+ ### Tokenization with `mistral-common`
+
+ ```py
+ from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
+ from mistral_common.protocol.instruct.messages import UserMessage
+ from mistral_common.protocol.instruct.request import ChatCompletionRequest
+
+ mistral_models_path = "MISTRAL_MODELS_PATH"
+
+ tokenizer = MistralTokenizer.v1()
+
+ completion_request = ChatCompletionRequest(messages=[UserMessage(content="Explain Machine Learning to me in a nutshell.")])
+
+ tokens = tokenizer.encode_chat_completion(completion_request).tokens
+ ```
+
+ ## Inference with `mistral_inference`
+
+ ```py
+ from mistral_inference.model import Transformer
+ from mistral_inference.generate import generate
+
+ model = Transformer.from_folder(mistral_models_path)
+ out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
+
+ result = tokenizer.decode(out_tokens[0])
+
+ print(result)
+ ```
+
+ ## Inference with Hugging Face `transformers`
+
+ ```py
+ import torch
+ from transformers import AutoModelForCausalLM
+
+ model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")
+ model.to("cuda")
+
+ # wrap the mistral-common token ids in a batch tensor on the model's device
+ input_ids = torch.tensor([tokens], device="cuda")
+ generated_ids = model.generate(input_ids, max_new_tokens=1000, do_sample=True)
+
+ # decode with the mistral tokenizer
+ result = tokenizer.decode(generated_ids[0].tolist())
+ print(result)
+ ```
 
+ > [!TIP]
+ > PRs to correct the transformers tokenizer so that it gives 1-to-1 the same results as the mistral-common reference implementation are very welcome!
+
+ ---
+ The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts. The Mixtral-8x7B outperforms Llama 2 70B on most benchmarks we tested.
+
+ For full details of this model please read our [release blog post](https://mistral.ai/news/mixtral-of-experts/).
+
+ ## Warning
+ This repo contains weights that are compatible with [vLLM](https://github.com/vllm-project/vllm) serving of the model as well as Hugging Face [transformers](https://github.com/huggingface/transformers) library. It is based on the original Mixtral [torrent release](magnet:?xt=urn:btih:5546272da9065eddeb6fcd7ffddeef5b75be79a7&dn=mixtral-8x7b-32kseqlen&tr=udp%3A%2F%2Fopentracker.i2p.rocks%3A6969%2Fannounce&tr=http%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce), but the file format and parameter names are different. Please note that the model cannot (yet) be instantiated with HF.
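For instance, serving with vLLM can look roughly like the sketch below (illustrative only: the prompt, sampling settings, and `tensor_parallel_size` are assumptions that depend on your hardware and use case):

```python
from vllm import LLM, SamplingParams

# Mixtral-8x7B usually needs to be sharded across several GPUs; 2 is an assumption.
llm = LLM(model="mistralai/Mixtral-8x7B-Instruct-v0.1", tensor_parallel_size=2)

sampling_params = SamplingParams(temperature=0.5, max_tokens=64)
outputs = llm.generate(["[INST] What is your favorite condiment? [/INST]"], sampling_params)
print(outputs[0].outputs[0].text)
```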
+
+ ## Instruction format
+
+ This format must be strictly respected, otherwise the model will generate sub-optimal outputs.
+
+ The template used to build a prompt for the Instruct model is defined as follows:
  ```
+ <s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]
  ```
+ Note that `<s>` and `</s>` are special tokens for beginning of string (BOS) and end of string (EOS) while `[INST]` and `[/INST]` are regular strings.
+
+ As reference, here is the pseudo-code used to tokenize instructions during fine-tuning:
+ ```python
+ def tokenize(text):
+     return tok.encode(text, add_special_tokens=False)
+
+ [BOS_ID] +
+ tokenize("[INST]") + tokenize(USER_MESSAGE_1) + tokenize("[/INST]") +
+ tokenize(BOT_MESSAGE_1) + [EOS_ID] +
+
+ tokenize("[INST]") + tokenize(USER_MESSAGE_N) + tokenize("[/INST]") +
+ tokenize(BOT_MESSAGE_N) + [EOS_ID]
+ ```
+
+ In the pseudo-code above, note that the `tokenize` method should not add a BOS or EOS token automatically, but should add a prefix space.
 
+ In the Transformers library, one can use [chat templates](https://huggingface.co/docs/transformers/main/en/chat_templating) which make sure the right format is applied.
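As a quick illustration (using the example conversation from this card), `apply_chat_template` with `tokenize=False` returns the formatted prompt string, which should follow the `[INST]` template above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "What is your favourite condiment?"},
    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice."},
    {"role": "user", "content": "Do you have mayonnaise recipes?"},
]

# tokenize=False returns the prompt as a string instead of token ids
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)
```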
+
+ ## Run the model
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+
+ messages = [
+     {"role": "user", "content": "What is your favourite condiment?"},
+     {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
+     {"role": "user", "content": "Do you have mayonnaise recipes?"}
+ ]
+
+ inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
+
+ outputs = model.generate(inputs, max_new_tokens=20)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+
+ By default, transformers will load the model in full precision. Therefore, you might be interested in further reducing the memory requirements to run the model through the optimizations offered in the HF ecosystem:
+
+ ### In half-precision
+
+ Note that `float16` precision only works on GPU devices.
+
+ <details>
+ <summary> Click to expand </summary>
+
+ ```diff
+ + import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ + model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
+
+ messages = [
+     {"role": "user", "content": "What is your favourite condiment?"},
+     {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
+     {"role": "user", "content": "Do you have mayonnaise recipes?"}
+ ]
+
+ input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
+
+ outputs = model.generate(input_ids, max_new_tokens=20)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```
+ </details>
156
+ ### Lower precision (8-bit & 4-bit) using `bitsandbytes`
+
+ <details>
+ <summary> Click to expand </summary>
+
+ ```diff
+ + import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ + model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")
+
+ messages = [
+     {"role": "user", "content": "What is your favourite condiment?"},
+     {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
+     {"role": "user", "content": "Do you have mayonnaise recipes?"}
+ ]
+
+ input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
+
+ outputs = model.generate(input_ids, max_new_tokens=20)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```
+ </details>
183
 
184
+ ### Load the model with Flash Attention 2
185
 
186
+ <details>
187
+ <summary> Click to expand </summary>
188
 
189
+ ```diff
190
+ + import torch
191
+ from transformers import AutoModelForCausalLM, AutoTokenizer
192
 
193
+ model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
194
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
 
 
 
 
 
195
 
196
+ + model = AutoModelForCausalLM.from_pretrained(model_id, use_flash_attention_2=True, device_map="auto")
 
 
 
197
 
198
+ messages = [
199
+ {"role": "user", "content": "What is your favourite condiment?"},
200
+ {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
201
+ {"role": "user", "content": "Do you have mayonnaise recipes?"}
202
+ ]
203
+
204
+ input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
205
+
206
+ outputs = model.generate(input_ids, max_new_tokens=20)
207
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
208
  ```
209
+ </details>
+
+ ## Limitations
+
+ The Mixtral-8x7B Instruct model is a quick demonstration that the base model can be easily fine-tuned to achieve compelling performance.
+ It does not have any moderation mechanisms. We're looking forward to engaging with the community on ways to
+ make the model finely respect guardrails, allowing for deployment in environments requiring moderated outputs.
+
+ # The Mistral AI Team
+ Eduardo Ruiz