VictorSanh
committed on
Commit
·
1fa1cbd
1
Parent(s):
a44e7e3
gpu memory / inference speed - tradeoffs of quantization
README.md
CHANGED
@@ -217,11 +217,20 @@ print(generated_texts)
|
|
217 |
|
218 |
# Model optimizations
|
219 |
|
220 |
**Vision encoder efficiency**
|
221 |
|
222 |
Given the high resolution supported, the vision part of the model can be memory hungry depending on your configuration. If you are GPU-memory-constrained, you can:
|
223 |
- **deactivate the image splitting.** To do so, add `do_image_splitting=False` when initializing the processor (`AutoProcessor.from_pretrained`). There are no changes required on the model side. Note that only the sft model has been trained with image splitting.
|
224 |
-
- **decrease the maximum image resolution.** To do so, add `size= {"longest_edge": 448, "shortest_edge": 378}` when initializing the processor (`AutoProcessor.from_pretrained`). In particular, the `longest_edge` value can be adapted to fit the need. We recommend using values that are multiples of 14. There are no changes required on the model side.
|
225 |
|
226 |
`do_image_splitting=True` is especially needed to boost performance on OCR tasks where a very large image is used as input. For the regular VQA or captioning tasks, this argument can be safely set to `False` with minimal impact on performance (see the evaluation table above).
|
227 |
|
@@ -234,7 +243,7 @@ First, make sure to install `flash-attn`. Refer to the [original repository of F
|
|
234 |
```diff
|
235 |
model = AutoModelForVision2Seq.from_pretrained(
|
236 |
"HuggingFaceM4/idefics2-8b",
|
237 |
-
+ torch_dtype=torch.
|
238 |
+ _attn_implementation="flash_attention_2",
|
239 |
).to(DEVICE)
|
240 |
```
|
@@ -243,7 +252,7 @@ Flash attention 2 support is available both for `idefics2-8b-base` and `idefics2
|
|
243 |
|
244 |
</details>
|
245 |
|
246 |
-
**4 bit quantization
|
247 |
|
248 |
<details><summary>Click to expand.</summary>
|
249 |
|
@@ -268,12 +277,63 @@ Flash attention 2 support is available both for `idefics2-8b-base` and `idefics2
|
|
268 |
model = AutoModelForVision2Seq.from_pretrained(
|
269 |
- "HuggingFaceM4/idefics2-8b",
|
270 |
+ "HuggingFaceM4/idefics2-8b-AWQ",
|
271 |
+ quantization_config=quantization_config,
|
272 |
).to(DEVICE)
|
273 |
```
|
274 |
|
275 |
</details>
|
276 |
|
277 |
# Bias, Risks, and Limitations
|
278 |
|
279 |
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
|
|
|
217 |
|
218 |
# Model optimizations
|
219 |
|
220 |
+
If your GPU allows, we first recommend loading (and running inference) in half precision (`torch.float16` or `torch.bfloat16`).
|
221 |
+
|
222 |
+
```diff
|
223 |
+
model = AutoModelForVision2Seq.from_pretrained(
|
224 |
+
"HuggingFaceM4/idefics2-8b",
|
225 |
+
+ torch_dtype=torch.float16,
|
226 |
+
).to(DEVICE)
|
227 |
+
```
|
228 |
+
|
229 |
**Vision encoder efficiency**
|
230 |
|
231 |
Given the high resolution supported, the vision part of the model can be memory hungry depending on your configuration. If you are GPU-memory-constrained, you can:
|
232 |
- **deactivate the image splitting.** To do so, add `do_image_splitting=False` when initializing the processor (`AutoProcessor.from_pretrained`). There are no changes required on the model side. Note that only the sft model has been trained with image splitting.
|
233 |
+
- **decrease the maximum image resolution.** To do so, add `size= {"longest_edge": 448, "shortest_edge": 378}` when initializing the processor (`AutoProcessor.from_pretrained`). In particular, the `longest_edge` value can be adapted to fit the need (the default value is `980`). We recommend using values that are multiples of 14. There are no changes required on the model side.
|
234 |
|
235 |
`do_image_splitting=True` is especially needed to boost performance on OCR tasks where a very large image is used as input. For the regular VQA or captioning tasks, this argument can be safely set to `False` with minimal impact on performance (see the evaluation table above).
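The two processor-side options above can be sketched as a small helper. This is a minimal illustration, not part of the original README: the helper names (`round_down_to_multiple`, `build_processor_kwargs`, `load_processor`) are hypothetical, and rounding down to a multiple of 14 is just one way to follow the recommendation above. Only `do_image_splitting`, `size`, and `AutoProcessor.from_pretrained` come from the text.

```python
PATCH_MULTIPLE = 14  # assumption: taken from the "multiples of 14" recommendation above


def round_down_to_multiple(value: int, base: int = PATCH_MULTIPLE) -> int:
    """Round `value` down to the nearest positive multiple of `base`."""
    if value < base:
        raise ValueError(f"value must be at least {base}, got {value}")
    return (value // base) * base


def build_processor_kwargs(
    longest_edge: int = 448,
    shortest_edge: int = 378,
    do_image_splitting: bool = False,
) -> dict:
    """Build keyword arguments for the processor (hypothetical helper)."""
    longest = round_down_to_multiple(longest_edge)
    shortest = round_down_to_multiple(shortest_edge)
    if shortest > longest:
        raise ValueError("shortest_edge must not exceed longest_edge")
    return {
        "do_image_splitting": do_image_splitting,
        "size": {"longest_edge": longest, "shortest_edge": shortest},
    }


def load_processor(model_id: str = "HuggingFaceM4/idefics2-8b", **options):
    """Load the processor with the memory-saving options applied."""
    # Imported lazily so the helpers above can be used without transformers installed.
    from transformers import AutoProcessor

    return AutoProcessor.from_pretrained(model_id, **build_processor_kwargs(**options))
```

Calling `load_processor(longest_edge=500)` would request a `longest_edge` of 490, the closest multiple of 14 below 500, with image splitting turned off. No change is needed on the model side.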
|
236 |
|
|
|
243 |
```diff
|
244 |
model = AutoModelForVision2Seq.from_pretrained(
|
245 |
"HuggingFaceM4/idefics2-8b",
|
246 |
+
+ torch_dtype=torch.float16,
|
247 |
+ _attn_implementation="flash_attention_2",
|
248 |
).to(DEVICE)
|
249 |
```
|
|
|
252 |
|
253 |
</details>
|
254 |
|
255 |
+
**4 bit quantization with AWQ**
|
256 |
|
257 |
<details><summary>Click to expand.</summary>
|
258 |
|
|
|
277 |
model = AutoModelForVision2Seq.from_pretrained(
|
278 |
- "HuggingFaceM4/idefics2-8b",
|
279 |
+ "HuggingFaceM4/idefics2-8b-AWQ",
|
280 |
+
+ torch_dtype=torch.float16,
|
281 |
+
+ quantization_config=quantization_config,
|
282 |
+
).to(DEVICE)
|
283 |
+
```
|
284 |
+
|
285 |
+
Fusing can be deactivated by removing `quantization_config` from the call to `from_pretrained`.
|
286 |
+
</details>
|
287 |
+
|
288 |
+
**4 bit quantization with bitsandbytes**
|
289 |
+
|
290 |
+
<details><summary>Click to expand.</summary>
|
291 |
+
It is also possible to load Idefics2 in 4 bits with `bitsandbytes`. To do so, make sure that you have `accelerate` and `bitsandbytes` installed.
|
292 |
+
|
293 |
+
```diff
|
294 |
+
+ from transformers import BitsAndBytesConfig
|
295 |
+
|
296 |
+
quantization_config = BitsAndBytesConfig(
|
297 |
+
load_in_4bit=True,
|
298 |
+
bnb_4bit_quant_type="nf4",
|
299 |
+
bnb_4bit_use_double_quant=True,
|
300 |
+
bnb_4bit_compute_dtype=torch.float16
|
301 |
+
)
|
302 |
+
model = AutoModelForVision2Seq.from_pretrained(
|
303 |
+
"HuggingFaceM4/idefics2-8b",
|
304 |
+
+ torch_dtype=torch.float16,
|
305 |
+ quantization_config=quantization_config,
|
306 |
).to(DEVICE)
|
307 |
```
|
308 |
|
309 |
</details>
|
310 |
|
311 |
+
These optimizations can be combined to trade off GPU memory, inference speed, and performance. We provide the following comparison as anchor points to guide the choice of optimizations. All of these benchmarks were computed with the example code snippet described above on an H100 (see [colab](https://colab.research.google.com/drive/1USsnssoFm1UTYuwUOw0XiGeBspLHzvso?usp=sharing)). As the table shows, there are a few setups that require less than 24GB of GPU memory.
|
312 |
+
|
313 |
+
| Flash attention 2 | Image splitting | Float type | 4-bit quantization          | Peak GPU memory (GB) | Time for 20 generations (s)    |
|
314 |
+
|-------------------|-----------------|------------|-----------------------------|----------------------|--------------------------------|
|
315 |
+
| No | Yes | fp32 | No | 54.9 | 55.6 |
|
316 |
+
| No | Yes | bf16 | No | 41.3 | 34.3 |
|
317 |
+
| No | Yes | fp16 | No | 36.7 | 33.3 |
|
318 |
+
| Yes | Yes | fp16 | No | 21.0 | 13.3 |
|
319 |
+
| Yes | Yes | fp16 | bitsandbytes (entire model) | 8.9 | 19.9 |
|
320 |
+
| No | Yes | fp16 | bitsandbytes (entire model) | 24.7 | 40.4 |
|
321 |
+
| No | Yes | fp16 | AWQ (LLM only) | 26.4 | 37.1 |
|
322 |
+
| Yes | Yes | fp16 | AWQ (LLM only) | 10.7 | 16.3 |
|
323 |
+
| No | Yes | fp16 | AWQ + fusing (LLM only) | 26.0 | 38.4 |
|
324 |
+
| | | | | | |
|
325 |
+
| No | No | fp32 | No | 38.8 | 17.5 |
|
326 |
+
| No | No | bf16 | No | 22.2 | 14.4 |
|
327 |
+
| No | No | fp16 | No | 21.3 | 13.9 |
|
328 |
+
| Yes | No | fp16 | No | 18.1 | 10.4 |
|
329 |
+
| Yes | No | fp16 | bitsandbytes (entire model) | 6.0 | 17.3 |
|
330 |
+
| No | No | fp16 | bitsandbytes (entire model) | 9.2 | 20.9 |
|
331 |
+
| No | No | fp16 | AWQ (LLM only) | 10.9 | 15.9 |
|
332 |
+
| Yes | No | fp16 | AWQ (LLM only) | 7.8 | 12.3 |
|
333 |
+
| No | No | fp16 | AWQ + fusing (LLM only) | 10.5 | 19.5 |
|
334 |
+
|
335 |
+
To learn more about quantization schemes and fusing, refer to the [documentation](https://huggingface.co/docs/transformers/quantization).
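As a rough aid for reading the comparison table above, the sketch below copies its rows into a list and filters them by a GPU memory budget. The `Setup` type and the `setups_within_budget` function are hypothetical names introduced here for illustration. The numbers are the H100 measurements from the table and may not transfer to other hardware.

```python
from typing import List, NamedTuple, Optional


class Setup(NamedTuple):
    flash_attention_2: bool
    image_splitting: bool
    float_type: str
    quantization: str
    peak_gpu_gb: float
    seconds_for_20_generations: float


# Rows copied from the comparison table (H100 measurements).
BENCHMARKS = [
    Setup(False, True, "fp32", "No", 54.9, 55.6),
    Setup(False, True, "bf16", "No", 41.3, 34.3),
    Setup(False, True, "fp16", "No", 36.7, 33.3),
    Setup(True, True, "fp16", "No", 21.0, 13.3),
    Setup(True, True, "fp16", "bitsandbytes (entire model)", 8.9, 19.9),
    Setup(False, True, "fp16", "bitsandbytes (entire model)", 24.7, 40.4),
    Setup(False, True, "fp16", "AWQ (LLM only)", 26.4, 37.1),
    Setup(True, True, "fp16", "AWQ (LLM only)", 10.7, 16.3),
    Setup(False, True, "fp16", "AWQ + fusing (LLM only)", 26.0, 38.4),
    Setup(False, False, "fp32", "No", 38.8, 17.5),
    Setup(False, False, "bf16", "No", 22.2, 14.4),
    Setup(False, False, "fp16", "No", 21.3, 13.9),
    Setup(True, False, "fp16", "No", 18.1, 10.4),
    Setup(True, False, "fp16", "bitsandbytes (entire model)", 6.0, 17.3),
    Setup(False, False, "fp16", "bitsandbytes (entire model)", 9.2, 20.9),
    Setup(False, False, "fp16", "AWQ (LLM only)", 10.9, 15.9),
    Setup(True, False, "fp16", "AWQ (LLM only)", 7.8, 12.3),
    Setup(False, False, "fp16", "AWQ + fusing (LLM only)", 10.5, 19.5),
]


def setups_within_budget(
    budget_gb: float, image_splitting: Optional[bool] = None
) -> List[Setup]:
    """Return setups whose peak memory is below `budget_gb`, fastest first."""
    rows = [
        s
        for s in BENCHMARKS
        if s.peak_gpu_gb < budget_gb
        and (image_splitting is None or s.image_splitting == image_splitting)
    ]
    return sorted(rows, key=lambda s: s.seconds_for_20_generations)
```

For example, `setups_within_budget(24.0, image_splitting=True)` keeps only the image-splitting rows that stay under 24GB, which is useful when OCR quality matters and image splitting should stay on.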
|
336 |
+
|
337 |
# Bias, Risks, and Limitations
|
338 |
|
339 |
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
|