nielsr (HF staff) committed
Commit df77f4c
1 Parent(s): 51405cf

Update README.md

Files changed (1): README.md (+29, -0)
README.md CHANGED
@@ -56,6 +56,35 @@ output = model.generate(**inputs, max_new_tokens=100)
 
 print(processor.decode(output[0], skip_special_tokens=True))
 ```
+
+### Model optimization
+
+#### 4-bit quantization through `bitsandbytes` library
+
+First make sure to install `bitsandbytes` (`pip install bitsandbytes`) and that you have access to a CUDA-compatible GPU device. Then simply change the snippet above as follows:
+
+```diff
+model = LlavaNextForConditionalGeneration.from_pretrained(
+    model_id,
+    torch_dtype=torch.float16,
+    low_cpu_mem_usage=True,
++   load_in_4bit=True
+)
+```
+
+#### Use Flash-Attention 2 to further speed up generation
+
+First make sure to install `flash-attn`; refer to the [original repository of Flash Attention](https://github.com/Dao-AILab/flash-attention) for installation instructions. Then simply change the snippet above as follows:
+
+```diff
+model = LlavaNextForConditionalGeneration.from_pretrained(
+    model_id,
+    torch_dtype=torch.float16,
+    low_cpu_mem_usage=True,
++   use_flash_attention_2=True
+).to(0)
+```
+
 ### BibTeX entry and citation info
 
 ```bibtex
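
Beyond the diff itself, here is a minimal end-to-end sketch of the 4-bit path for context. The `model_id`, prompt template, and image URL below are illustrative placeholders, not part of this commit; substitute this repository's checkpoint and its prompt format.

```python
# Minimal sketch: loading in 4-bit and running the generation snippet above.
# NOTE: model_id, the prompt template, and the image URL are placeholders, not part of this commit.
import requests
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # placeholder; use this repo's checkpoint

processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    load_in_4bit=True,  # bitsandbytes quantizes the weights at load time and places them on GPU
)

url = "https://www.ilankelman.org/stopsigns/australia.jpg"  # placeholder image
image = Image.open(requests.get(url, stream=True).raw)
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"  # Mistral-style template; checkpoint-specific

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

Note that with `load_in_4bit=True`, `bitsandbytes` handles device placement itself, which is why no explicit `.to(...)` call appears in the 4-bit variant of the diff.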
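
Similarly, a sketch of the Flash-Attention 2 variant under the same placeholder assumptions; it additionally requires a CUDA GPU supported by the `flash-attn` kernels and half-precision weights.

```python
# Sketch: loading with Flash-Attention 2 enabled.
# NOTE: model_id is a placeholder; flash-attn must be installed and a supported CUDA GPU available.
import torch
from transformers import LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # placeholder; use this repo's checkpoint

model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # FA2 kernels require fp16 or bf16 weights
    low_cpu_mem_usage=True,
    use_flash_attention_2=True,
).to(0)  # move to GPU 0; Flash-Attention 2 runs on CUDA only
```

Newer `transformers` releases express the same switch as `attn_implementation="flash_attention_2"`, keeping `use_flash_attention_2` as a deprecated alias.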