megrisdal and reach-vb (HF staff) committed
Commit 6c1058d
1 Parent(s): 1937c70

Update README.md (#35)


- Update README.md (aab9b2b21dda7698508d9f571593044845cb62e5)
- Update README.md (bd8be0b392d5d5df896719a73568103069925868)


Co-authored-by: Vaibhav Srivastav <reach-vb@users.noreply.huggingface.co>

Files changed (1): README.md (+126 −21)
README.md CHANGED
@@ -45,11 +45,37 @@ state of the art AI models and helping foster innovation for everyone.

### Usage

- Below we share some code snippets on how to get quickly started with running the model. First make sure to `pip install -U transformers`, then copy the snippet from the section that is relevant for your usecase.
+ Below we share some code snippets on how to get quickly started with running the model. First, install the Transformers library with:
+ ```sh
+ pip install -U transformers
+ ```
+
+ Then, copy the snippet from the section that is relevant for your usecase.

- #### Running the model on a single / multi GPU
+ #### Running with the `pipeline` API
+
+ ```python
+ import torch
+ from transformers import pipeline
+
+ pipe = pipeline(
+     "text-generation",
+     model="google/gemma-2-9b-it",
+     model_kwargs={"torch_dtype": torch.bfloat16},
+     device="cuda",  # replace with "mps" to run on a Mac device
+ )
+
+ messages = [
+     {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
+ ]
+
+ outputs = pipe(messages, max_new_tokens=256)
+ assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
+ print(assistant_response)
+ # Ahoy, matey! I be Gemma, a digital scallywag, a language-slingin' parrot of the digital seas. I be here to help ye with yer wordy woes, answer yer questions, and spin ye yarns of the digital world. So, what be yer pleasure, eh? 🦜
+ ```
+
+ #### Running the model on a single / multi GPU

```python
# pip install accelerate
@@ -60,13 +86,24 @@ tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    device_map="auto",
-     torch_dtype=torch.bfloat16
+     torch_dtype=torch.bfloat16,
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

- outputs = model.generate(**input_ids)
+ outputs = model.generate(**input_ids, max_new_tokens=32)
+ print(tokenizer.decode(outputs[0]))
+ ```
+
+ You can ensure the correct chat template is applied by using `tokenizer.apply_chat_template` as follows:
+ ```python
+ messages = [
+     {"role": "user", "content": "Write me a poem about Machine Learning."},
+ ]
+ input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True).to("cuda")
+
+ outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```
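
As an aside to the hunk above: the `apply_chat_template` call it adds does not pass `add_generation_prompt`, so the rendered prompt may not end with the assistant-turn prefix that instruction-tuned models expect before generating. A minimal sketch of the same call with the generation prompt requested explicitly, reusing the `tokenizer` and `model` from the snippet (an illustration, not a line from this commit):

```python
# Illustrative variant of the README snippet above (not part of the commit).
# Assumes `tokenizer` and `model` are already loaded as shown in the diff.
messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant-turn prefix defined by the template
    return_tensors="pt",
    return_dict=True,
).to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```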
 
@@ -86,18 +123,32 @@ from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
-     device_map="auto")
+     device_map="auto",
+ )

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

- outputs = model.generate(**input_ids)
+ outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```

+ #### Running the model through a CLI
+
+ The [local-gemma](https://github.com/huggingface/local-gemma) repository contains a lightweight wrapper around Transformers
+ for running Gemma 2 through a command line interface, or CLI. Follow the [installation instructions](https://github.com/huggingface/local-gemma#cli-usage)
+ for getting started, then launch the CLI through the following command:
+
+ ```shell
+ local-gemma --model 9b --preset speed
+ ```
+
#### Quantized Versions through `bitsandbytes`

- * _Using 8-bit precision (int8)_
+ <details>
+ <summary>
+ Using 8-bit precision (int8)
+ </summary>

```python
# pip install bitsandbytes accelerate
@@ -108,16 +159,21 @@ quantization_config = BitsAndBytesConfig(load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
-     quantization_config=quantization_config)
+     quantization_config=quantization_config,
+ )

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

- outputs = model.generate(**input_ids)
+ outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```
+ </details>

- * _Using 4-bit precision_
+ <details>
+ <summary>
+ Using 4-bit precision
+ </summary>

```python
# pip install bitsandbytes accelerate
@@ -128,30 +184,79 @@ quantization_config = BitsAndBytesConfig(load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
-     quantization_config=quantization_config)
+     quantization_config=quantization_config,
+ )

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

- outputs = model.generate(**input_ids)
+ outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```
+ </details>

- #### Other optimizations
+ #### Advanced Usage

- * _Flash Attention 2_
+ <details>
+ <summary>
+ Torch compile
+ </summary>

- First make sure to install `flash-attn` in your environment `pip install flash-attn`
+ [Torch compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) is a method for speeding-up the
+ inference of PyTorch modules. The Gemma-2 model can be run up to 6x faster by leveraging torch compile.

- ```diff
- model = AutoModelForCausalLM.from_pretrained(
-     model_id,
-     torch_dtype=torch.float16,
- + attn_implementation="flash_attention_2"
- ).to(0)
+ Note that two warm-up steps are required before the full inference speed is realised:
+
+ ```python
+ import os
+ os.environ["TOKENIZERS_PARALLELISM"] = "false"
+
+ from transformers import AutoTokenizer, Gemma2ForCausalLM
+ from transformers.cache_utils import HybridCache
+ import torch
+
+ torch.set_float32_matmul_precision("high")
+
+ # load the model + tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
+ model = Gemma2ForCausalLM.from_pretrained("google/gemma-2-9b-it", torch_dtype=torch.bfloat16)
+ model.to("cuda")
+
+ # apply the torch compile transformation
+ model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
+
+ # pre-process inputs
+ input_text = "The theory of special relativity states "
+ model_inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
+ prompt_length = model_inputs.input_ids.shape[1]
+
+ # set-up k/v cache
+ past_key_values = HybridCache(
+     config=model.config,
+     max_batch_size=1,
+     max_cache_len=model.config.max_position_embeddings,
+     device=model.device,
+     dtype=model.dtype
+ )
+
+ # enable passing kv cache to generate
+ model._supports_cache_class = True
+ model.generation_config.cache_implementation = None
+
+ # two warm-up steps
+ for idx in range(2):
+     outputs = model.generate(**model_inputs, past_key_values=past_key_values, do_sample=True, temperature=1.0, max_new_tokens=128)
+     past_key_values.reset()
+
+ # fast run
+ outputs = model.generate(**model_inputs, past_key_values=past_key_values, do_sample=True, temperature=1.0, max_new_tokens=128)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

+ For more details, refer to the [Transformers documentation](https://huggingface.co/docs/transformers/main/en/llm_optims?static-kv=basic+usage%3A+generation_config).
+
+ </details>
+
### Chat Template

The instruction-tuned models use a chat template that must be adhered to for conversational use.
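
To make the template concrete, here is a small illustrative sketch (not taken from this commit) that renders a conversation with the tokenizer's bundled chat template and prints the resulting prompt string; Gemma 2's template delimits turns with `<start_of_turn>` and `<end_of_turn>` markers.

```python
# Sketch: inspect the prompt produced by the tokenizer's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")

messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]

# tokenize=False returns the formatted string instead of token ids, which is
# handy for checking how the conversation is laid out before generating.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```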
 
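As a closing illustration that ties the quantization and chat-template pieces of the diff together, the sketch below loads the model in 4-bit and generates from a templated prompt. The `bnb_4bit_quant_type` and `bnb_4bit_compute_dtype` settings are illustrative extras, not options this commit adds to the README.

```python
# Illustrative end-to-end sketch combining the bitsandbytes and chat-template
# snippets above; the bnb_4bit_* options go beyond what the README shows.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run compute in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    quantization_config=quantization_config,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```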