manvinder and reach-vb (HF staff) committed
Commit 2d74922
1 Parent(s): 8a03e86

Update README.md (#17)


- Update README.md (7d02209c9a1860ada0889eefde5a29d1776ee3f4)
- Update README.md (5f50772b94e6463052b999eda50bea4ade595e03)


Co-authored-by: Vaibhav Srivastav <reach-vb@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +6 -42
README.md CHANGED

@@ -48,6 +48,8 @@ Below we share some code snippets on how to get quickly started with running the

#### Running the model on a single / multi GPU

+ > [!IMPORTANT]
+ > Given the model instabilities with SDPA/ FA2, by default, the model inference would utilise `eager` attention.

```python
# pip install accelerate

@@ -71,51 +73,10 @@ print(tokenizer.decode(outputs[0]))

<a name="precisions"></a>
#### Running the model on a GPU using different precisions

- The native weights of this model were exported in `bfloat16` precision. You can use `float16`, which may be faster on certain hardware, indicating the `torch_dtype` when loading the model. For convenience, the `float16` revision of the repo contains a copy of the weights already converted to that precision.
+ The native weights of this model were exported in `bfloat16` precision.

You can also use `float32` if you skip the dtype, but no precision increase will occur (model weights will just be upcasted to `float32`). See examples below.

- * _Using `torch.float16`_
-
- ```python
- # pip install accelerate
- from transformers import AutoTokenizer, AutoModelForCausalLM
- import torch
-
- tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
- model = AutoModelForCausalLM.from_pretrained(
-     "google/gemma-2-27b-it",
-     device_map="auto",
-     torch_dtype=torch.float16,
-     revision="float16",
- )
-
- input_text = "Write me a poem about Machine Learning."
- input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-
- outputs = model.generate(**input_ids)
- print(tokenizer.decode(outputs[0]))
- ```
-
- * _Using `torch.bfloat16`_
-
- ```python
- # pip install accelerate
- from transformers import AutoTokenizer, AutoModelForCausalLM
-
- tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
- model = AutoModelForCausalLM.from_pretrained(
-     "google/gemma-2-27b-it",
-     device_map="auto",
-     torch_dtype=torch.bfloat16)
-
- input_text = "Write me a poem about Machine Learning."
- input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-
- outputs = model.generate(**input_ids)
- print(tokenizer.decode(outputs[0]))
- ```
-
* _Upcasting to `torch.float32`_

```python

@@ -182,6 +143,9 @@ print(tokenizer.decode(outputs[0]))

* _Flash Attention 2_

+ > [!WARNING]
+ > Gemma 2 is currently incompatible with Flash Attention/ SDPA, using it might result in unreliable generations. Use at your own risk.
+
First make sure to install `flash-attn` in your environment `pip install flash-attn`

```diff
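
For readers landing on this commit: the new `[!IMPORTANT]` note states that inference now defaults to `eager` attention because of instabilities with SDPA / Flash Attention 2. The snippet below is a minimal sketch, not part of the commit, of how to request that implementation explicitly. It assumes a transformers release with Gemma 2 support (roughly 4.42 or later); `attn_implementation` is the standard `from_pretrained` argument rather than anything introduced here.

```python
# pip install accelerate
# Sketch: load Gemma 2 with the eager attention implementation made explicit.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-2-27b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,   # native export precision, per the README
    attn_implementation="eager",  # avoid SDPA / Flash Attention 2, per the note in the diff
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```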
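
The precision hunk trims the README down to two statements: the native weights are `bfloat16`, and skipping the dtype only upcasts them to `float32`. A small sketch of those two remaining options, again an illustration rather than text from the commit:

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "google/gemma-2-27b-it"

# Keep the checkpoint in its native bfloat16: half the memory of float32, no precision lost.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)

# Omitting torch_dtype upcasts the bfloat16 weights to float32: more memory, no extra precision.
model_fp32 = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```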