manvinder and reach-vb (HF staff) committed
Commit 2d74922
1 Parent(s): 8a03e86

Update README.md (#17)


- Update README.md (7d02209c9a1860ada0889eefde5a29d1776ee3f4)
- Update README.md (5f50772b94e6463052b999eda50bea4ade595e03)


Co-authored-by: Vaibhav Srivastav <reach-vb@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +6 -42
README.md CHANGED

@@ -48,6 +48,8 @@ Below we share some code snippets on how to get quickly started with running the

#### Running the model on a single / multi GPU

+ > [!IMPORTANT]
+ > Given the model instabilities with SDPA/ FA2, by default, the model inference would utilise `eager` attention.

```python
# pip install accelerate

@@ -71,51 +73,10 @@ print(tokenizer.decode(outputs[0]))

<a name="precisions"></a>
#### Running the model on a GPU using different precisions

- The native weights of this model were exported in `bfloat16` precision. You can use `float16`, which may be faster on certain hardware, indicating the `torch_dtype` when loading the model. For convenience, the `float16` revision of the repo contains a copy of the weights already converted to that precision.
+ The native weights of this model were exported in `bfloat16` precision.

You can also use `float32` if you skip the dtype, but no precision increase will occur (model weights will just be upcasted to `float32`). See examples below.

- * _Using `torch.float16`_
-
- ```python
- # pip install accelerate
- from transformers import AutoTokenizer, AutoModelForCausalLM
- import torch
-
- tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
- model = AutoModelForCausalLM.from_pretrained(
-     "google/gemma-2-27b-it",
-     device_map="auto",
-     torch_dtype=torch.float16,
-     revision="float16",
- )
-
- input_text = "Write me a poem about Machine Learning."
- input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-
- outputs = model.generate(**input_ids)
- print(tokenizer.decode(outputs[0]))
- ```
-
- * _Using `torch.bfloat16`_
-
- ```python
- # pip install accelerate
- from transformers import AutoTokenizer, AutoModelForCausalLM
-
- tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
- model = AutoModelForCausalLM.from_pretrained(
-     "google/gemma-2-27b-it",
-     device_map="auto",
-     torch_dtype=torch.bfloat16)
-
- input_text = "Write me a poem about Machine Learning."
- input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-
- outputs = model.generate(**input_ids)
- print(tokenizer.decode(outputs[0]))
- ```
-
* _Upcasting to `torch.float32`_

```python

@@ -182,6 +143,9 @@ print(tokenizer.decode(outputs[0]))

* _Flash Attention 2_

+ > [!WARNING]
+ > Gemma 2 is currently incompatible with Flash Attention/ SDPA, using it might result in unreliable generations. Use at your own risk.
+
First make sure to install `flash-attn` in your environment `pip install flash-attn`

```diff
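
For readers landing on this commit: the new `[!IMPORTANT]` note states that inference now defaults to `eager` attention because of instabilities with SDPA / Flash Attention 2. The snippet below is a minimal sketch, not part of the commit, of how to request that implementation explicitly. It assumes a transformers release with Gemma 2 support (roughly 4.42 or later); `attn_implementation` is the standard `from_pretrained` argument rather than anything introduced here.

```python
# pip install accelerate
# Sketch: load Gemma 2 with the eager attention implementation made explicit.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-2-27b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,   # native export precision, per the README
    attn_implementation="eager",  # avoid SDPA / Flash Attention 2, per the note in the diff
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```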
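
The precision hunk trims the README down to two statements: the native weights are `bfloat16`, and skipping the dtype only upcasts them to `float32`. A small sketch of those two remaining options, again an illustration rather than text from the commit:

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "google/gemma-2-27b-it"

# Keep the checkpoint in its native bfloat16: half the memory of float32, no precision lost.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
)

# Omitting torch_dtype upcasts the bfloat16 weights to float32: more memory, no extra precision.
model_fp32 = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
```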