lucyknada and Delta-Vector committed
Commit 2688680
1 Parent(s): 99914b4

out with the old in with the new (#2)


- out with the old in with the new (992f7460f327f2bdb9f42ec7da2f545316efc3ab)


Co-authored-by: Joseph Seed <Delta-Vector@users.noreply.huggingface.co>

Files changed (2)
  1. README.md +95 -74
  2. tokenizer.json +2 -2
README.md CHANGED
@@ -43,39 +43,32 @@ state of the art AI models and helping foster innovation for everyone.
 
 ### Usage
 
-Below we share some code snippets on how to get quickly started with running the model. First make sure to `pip install -U transformers`, then copy the snippet from the section that is relevant for your usecase.
-
+Below we share some code snippets on how to get quickly started with running the model. First, install the Transformers library with:
+```sh
+pip install -U transformers
+```
 
-#### Running the model on a single / multi GPU
+Then, copy the snippet from the section that is relevant for your usecase.
 
+#### Running with the `pipeline` API
 
 ```python
-# pip install accelerate
-from transformers import AutoTokenizer, AutoModelForCausalLM
 import torch
+from transformers import pipeline
 
-tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
-model = AutoModelForCausalLM.from_pretrained(
-    "google/gemma-2-9b",
-    device_map="auto",
-    torch_dtype=torch.bfloat16
+pipe = pipeline(
+    "text-generation",
+    model="google/gemma-2-9b",
+    device="cuda",  # replace with "mps" to run on a Mac device
 )
 
-input_text = "Write me a poem about Machine Learning."
-input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-
-outputs = model.generate(**input_ids)
-print(tokenizer.decode(outputs[0]))
+text = "Once upon a time,"
+outputs = pipe(text, max_new_tokens=256)
+response = outputs[0]["generated_text"]
+print(response)
 ```
 
-<a name="precisions"></a>
-#### Running the model on a GPU using different precisions
-
-The native weights of this model were exported in `bfloat16` precision. You can use `float16`, which may be faster on certain hardware, indicating the `torch_dtype` when loading the model. For convenience, the `float16` revision of the repo contains a copy of the weights already converted to that precision.
-
-You can also use `float32` if you skip the dtype, but no precision increase will occur (model weights will just be upcasted to `float32`). See examples below.
-
-* _Using `torch.float16`_
+#### Running the model on a single / multi GPU
 
 ```python
 # pip install accelerate
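For context on the `pipeline` snippet added in the hunk above: the same call also accepts a `torch_dtype` argument so the weights are loaded in `bfloat16` rather than the default `float32`. A minimal sketch, not part of this commit, assuming a CUDA device and the same `google/gemma-2-9b` checkpoint used throughout the card:

```python
# Sketch only (not in the diff): the pipeline call with an explicit bfloat16 dtype.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="google/gemma-2-9b",
    device="cuda",               # or "mps" on a Mac, as the added snippet notes
    torch_dtype=torch.bfloat16,  # assumption: roughly halves memory versus float32 loading
)

outputs = pipe("Once upon a time,", max_new_tokens=32)
print(outputs[0]["generated_text"])
```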
@@ -86,57 +79,31 @@ tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
 model = AutoModelForCausalLM.from_pretrained(
     "google/gemma-2-9b",
     device_map="auto",
-    torch_dtype=torch.float16,
-    revision="float16",
 )
 
 input_text = "Write me a poem about Machine Learning."
 input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
 
-outputs = model.generate(**input_ids)
+outputs = model.generate(**input_ids, max_new_tokens=32)
 print(tokenizer.decode(outputs[0]))
 ```
 
-* _Using `torch.bfloat16`_
+#### Running the model through a CLI
 
-```python
-# pip install accelerate
-from transformers import AutoTokenizer, AutoModelForCausalLM
+The [local-gemma](https://github.com/huggingface/local-gemma) repository contains a lightweight wrapper around Transformers
+for running Gemma 2 through a command line interface, or CLI. Follow the [installation instructions](https://github.com/huggingface/local-gemma#cli-usage)
+for getting started, then launch the CLI through the following command:
 
-tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
-model = AutoModelForCausalLM.from_pretrained(
-    "google/gemma-2-9b",
-    device_map="auto",
-    torch_dtype=torch.bfloat16)
-
-input_text = "Write me a poem about Machine Learning."
-input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-
-outputs = model.generate(**input_ids)
-print(tokenizer.decode(outputs[0]))
-```
-
-* _Upcasting to `torch.float32`_
-
-```python
-# pip install accelerate
-from transformers import AutoTokenizer, AutoModelForCausalLM
-
-tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
-model = AutoModelForCausalLM.from_pretrained(
-    "google/gemma-2-9b",
-    device_map="auto")
-
-input_text = "Write me a poem about Machine Learning."
-input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-
-outputs = model.generate(**input_ids)
-print(tokenizer.decode(outputs[0]))
+```shell
+local-gemma --model "google/gemma-2-9b" --prompt "What is the capital of Mexico?"
 ```
 
 #### Quantized Versions through `bitsandbytes`
 
-* _Using 8-bit precision (int8)_
+<details>
+<summary>
+Using 8-bit precision (int8)
+</summary>
 
 ```python
 # pip install bitsandbytes accelerate
@@ -147,16 +114,21 @@ quantization_config = BitsAndBytesConfig(load_in_8bit=True)
 tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
 model = AutoModelForCausalLM.from_pretrained(
     "google/gemma-2-9b",
-    quantization_config=quantization_config)
+    quantization_config=quantization_config,
+)
 
 input_text = "Write me a poem about Machine Learning."
 input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
 
-outputs = model.generate(**input_ids)
+outputs = model.generate(**input_ids, max_new_tokens=32)
 print(tokenizer.decode(outputs[0]))
 ```
+</details>
 
-* _Using 4-bit precision_
+<details>
+<summary>
+Using 4-bit precision
+</summary>
 
 ```python
 # pip install bitsandbytes accelerate
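Because the hunks only show changed lines, the full 8-bit example as it reads after this commit is easy to lose track of. An assembled sketch follows, reconstructed from the hunk context above; the `BitsAndBytesConfig` import is an assumption, since the import line itself sits outside the displayed hunks:

```python
# Assembled post-commit 8-bit example (reconstructed from hunk context, not a new snippet).
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b",
    quantization_config=quantization_config,
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```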
@@ -167,30 +139,79 @@ quantization_config = BitsAndBytesConfig(load_in_4bit=True)
 tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
 model = AutoModelForCausalLM.from_pretrained(
     "google/gemma-2-9b",
-    quantization_config=quantization_config)
+    quantization_config=quantization_config,
+)
 
 input_text = "Write me a poem about Machine Learning."
 input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
 
-outputs = model.generate(**input_ids)
+outputs = model.generate(**input_ids, max_new_tokens=32)
 print(tokenizer.decode(outputs[0]))
 ```
+</details>
 
+#### Advanced Usage
 
-#### Other optimizations
+<details>
+<summary>
+Torch compile
+</summary>
 
-* _Flash Attention 2_
+[Torch compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) is a method for speeding-up the
+inference of PyTorch modules. The Gemma-2 model can be run up to 6x faster by leveraging torch compile.
 
-First make sure to install `flash-attn` in your environment `pip install flash-attn`
+Note that two warm-up steps are required before the full inference speed is realised:
 
-```diff
-model = AutoModelForCausalLM.from_pretrained(
-    model_id,
-    torch_dtype=torch.float16,
-+   attn_implementation="flash_attention_2"
-).to(0)
+```python
+import os
+os.environ["TOKENIZERS_PARALLELISM"] = "false"
+
+from transformers import AutoTokenizer, Gemma2ForCausalLM
+from transformers.cache_utils import HybridCache
+import torch
+
+torch.set_float32_matmul_precision("high")
+
+# load the model + tokenizer
+tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
+model = Gemma2ForCausalLM.from_pretrained("google/gemma-2-9b", torch_dtype=torch.bfloat16)
+model.to("cuda")
+
+# apply the torch compile transformation
+model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
+
+# pre-process inputs
+input_text = "The theory of special relativity states "
+model_inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
+prompt_length = model_inputs.input_ids.shape[1]
+
+# set-up k/v cache
+past_key_values = HybridCache(
+    config=model.config,
+    max_batch_size=1,
+    max_cache_len=model.config.max_position_embeddings,
+    device=model.device,
+    dtype=model.dtype
+)
+
+# enable passing kv cache to generate
+model._supports_cache_class = True
+model.generation_config.cache_implementation = None
+
+# two warm-up steps
+for idx in range(2):
+    outputs = model.generate(**model_inputs, past_key_values=past_key_values, do_sample=True, temperature=1.0, max_new_tokens=128)
+    past_key_values.reset()
+
+# fast run
+outputs = model.generate(**model_inputs, past_key_values=past_key_values, do_sample=True, temperature=1.0, max_new_tokens=128)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
 
+For more details, refer to the [Transformers documentation](https://huggingface.co/docs/transformers/main/en/llm_optims?static-kv=basic+usage%3A+generation_config).
+
+</details>
+
 ### Inputs and outputs
 
 * **Input:** Text string, such as a question, a prompt, or a document to be
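One note on the torch compile snippet added above: it computes `prompt_length` but never uses it. A plausible continuation, assuming the variables from that snippet are still in scope, is to strip the prompt tokens before decoding:

```python
# Continuation sketch (not part of the commit): assumes `outputs`, `prompt_length`,
# and `tokenizer` from the torch compile snippet above are in scope.
completion = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(completion)
```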
 
tokenizer.json CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7da53ca29fb16f6b2489482fc0bc6a394162cdab14d12764a1755ebc583fea79
-size 17518525
+oid sha256:3f289bc05132635a8bc7aca7aa21255efd5e18f3710f43e3cdb96bcd41be4922
+size 17525357
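The `tokenizer.json` change only swaps the Git LFS pointer: a new object hash and a slightly larger file. A minimal sketch for confirming that a locally downloaded `tokenizer.json` matches the new pointer (the local path is illustrative, not part of the commit):

```python
# Verify a downloaded tokenizer.json against the LFS pointer recorded in this commit.
import hashlib
from pathlib import Path

path = Path("tokenizer.json")  # illustrative local path; adjust as needed
data = path.read_bytes()

print("size  :", len(data))                         # expected: 17525357
print("sha256:", hashlib.sha256(data).hexdigest())
# expected: 3f289bc05132635a8bc7aca7aa21255efd5e18f3710f43e3cdb96bcd41be4922
```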