VictorSanh committed
Commit 1fa1cbd
1 Parent(s): a44e7e3

gpu memory / inference speed - tradeoffs of quantization

Files changed (1)
  1. README.md (+63 −3)
README.md CHANGED
 
@@ -217,11 +217,20 @@ print(generated_texts)
  
  # Model optimizations
  
+ If your GPU allows, we first recommend loading (and running inference) in half precision (`torch.float16` or `torch.bfloat16`).
+ 
+ ```diff
+ model = AutoModelForVision2Seq.from_pretrained(
+     "HuggingFaceM4/idefics2-8b",
+ +    torch_dtype=torch.float16,
+ ).to(DEVICE)
+ ```
+ 
  **Vision encoder efficiency**
  
  Given the high resolution supported, the vision part of the model can be memory hungry depending on your configuration. If you are GPU-memory-constrained, you can:
  - **deactivate the image splitting.** To do so, add `do_image_splitting=False` when initializing the processor (`AutoProcessor.from_pretrained`). There are no changes required on the model side. Note that only the sft model has been trained with image splitting.
- - **decrease the maximum image resolution.** To do so, add `size= {"longest_edge": 448, "shortest_edge": 378}` when initializing the processor (`AutoProcessor.from_pretrained`). In particular, the `longest_edge` value can be adapted to fit the need. We recommend using values that are multiples of 14. There are no changes required on the model side.
+ - **decrease the maximum image resolution.** To do so, add `size= {"longest_edge": 448, "shortest_edge": 378}` when initializing the processor (`AutoProcessor.from_pretrained`). In particular, the `longest_edge` value can be adapted to fit the need (the default value is `980`). We recommend using values that are multiples of 14. There are no changes required on the model side.
  
  `do_image_splitting=True` is especially needed to boost performance on OCR tasks where a very large image is used as input. For the regular VQA or captioning tasks, this argument can be safely set to `False` with minimal impact on performance (see the evaluation table above).
  
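As an illustration of the two processor-side options described in this hunk, here is a minimal sketch; the two arguments are independent, and the checkpoint name and values mirror the text above:

```python
from transformers import AutoProcessor

# Reduce vision-encoder memory purely on the processor side; the model is unchanged.
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False,                          # skip splitting large images into sub-images
    size={"longest_edge": 448, "shortest_edge": 378},  # default longest_edge is 980; prefer multiples of 14
)
```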
@@ -234,7 +243,7 @@ First, make sure to install `flash-attn`. Refer to the [original repository of F
  ```diff
  model = AutoModelForVision2Seq.from_pretrained(
      "HuggingFaceM4/idefics2-8b",
- +    torch_dtype=torch.bfloat16,
+ +    torch_dtype=torch.float16,
  +    _attn_implementation="flash_attention_2",
  ).to(DEVICE)
  ```
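For convenience, here is the snippet above expanded into a self-contained load; the imports and the `DEVICE` value are assumptions that mirror the card's earlier examples, and `flash-attn` must be installed:

```python
import torch
from transformers import AutoModelForVision2Seq

DEVICE = "cuda:0"  # assumption: a single-GPU setup

# Half precision + Flash Attention 2, as recommended above.
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    _attn_implementation="flash_attention_2",
).to(DEVICE)
```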
 
@@ -243,7 +252,7 @@ Flash attention 2 support is available both for `idefics2-8b-base` and `idefics2
  
  </details>
  
- **4 bit quantization and module fusing**
+ **4 bit quantization with AWQ**
  
  <details><summary>Click to expand.</summary>
  
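The next hunk switches this section to the pre-quantized `HuggingFaceM4/idefics2-8b-AWQ` checkpoint and notes that module fusing can be dropped. As a rough sketch, the no-fusing variant mentioned there reduces to the following (assuming the `autoawq` package is installed; `DEVICE` as in the earlier examples):

```python
import torch
from transformers import AutoModelForVision2Seq

DEVICE = "cuda:0"  # assumption: a single-GPU setup

# Load the pre-quantized AWQ checkpoint. Omitting `quantization_config`
# disables module fusing, as described in the hunk below.
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b-AWQ",
    torch_dtype=torch.float16,
).to(DEVICE)
```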
@@ -268,12 +277,63 @@ Flash attention 2 support is available both for `idefics2-8b-base` and `idefics2
  model = AutoModelForVision2Seq.from_pretrained(
  -    "HuggingFaceM4/idefics2-8b",
  +    "HuggingFaceM4/idefics2-8b-AWQ",
+ +    torch_dtype=torch.float16,
+ +    quantization_config=quantization_config,
+ ).to(DEVICE)
+ ```
+ 
+ Fusing can be deactivated by removing `quantization_config` in the call to `from_pretrained`.
+ </details>
+ 
+ **4 bit quantization with bitsandbytes**
+ 
+ <details><summary>Click to expand.</summary>
+ It is also possible to load Idefics2 in 4 bits with `bitsandbytes`. To do so, make sure that you have `accelerate` and `bitsandbytes` installed.
+ 
+ ```diff
+ + from transformers import BitsAndBytesConfig
+ 
+ quantization_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_use_double_quant=True,
+     bnb_4bit_compute_dtype=torch.float16
+ )
+ model = AutoModelForVision2Seq.from_pretrained(
+     "HuggingFaceM4/idefics2-8b",
+ +    torch_dtype=torch.float16,
  +    quantization_config=quantization_config,
  ).to(DEVICE)
  ```
  
  </details>
  
+ These optimizations can be combined to suit variable trade-offs between GPU memory, inference speed and performance. We provide the following comparison as anchor points to guide the user in choosing necessary optimizations. All of these benchmarks were computed with the example code snippet described above on an H100 (see [colab](https://colab.research.google.com/drive/1USsnssoFm1UTYuwUOw0XiGeBspLHzvso?usp=sharing)). As one can see, there are a few setups that require less than 24GB of GPU memory.
+ 
+ | Flash attention 2 | Image splitting | Float type | 4 bits quantization | Peak GPU memory (GB) | Time for 20 generations (secs) |
+ |-------------------|-----------------|------------|-----------------------------|----------------------|--------------------------------|
+ | No | Yes | fp32 | No | 54.9 | 55.6 |
+ | No | Yes | bf16 | No | 41.3 | 34.3 |
+ | No | Yes | fp16 | No | 36.7 | 33.3 |
+ | Yes | Yes | fp16 | No | 21.0 | 13.3 |
+ | Yes | Yes | fp16 | bitsandbytes (entire model) | 8.9 | 19.9 |
+ | No | Yes | fp16 | bitsandbytes (entire model) | 24.7 | 40.4 |
+ | No | Yes | fp16 | AWQ (LLM only) | 26.4 | 37.1 |
+ | Yes | Yes | fp16 | AWQ (LLM only) | 10.7 | 16.3 |
+ | No | Yes | fp16 | AWQ + fusing (LLM only) | 26.0 | 38.4 |
+ | | | | | | |
+ | No | No | fp32 | No | 38.8 | 17.5 |
+ | No | No | bf16 | No | 22.2 | 14.4 |
+ | No | No | fp16 | No | 21.3 | 13.9 |
+ | Yes | No | fp16 | No | 18.1 | 10.4 |
+ | Yes | No | fp16 | bitsandbytes (entire model) | 6.0 | 17.3 |
+ | No | No | fp16 | bitsandbytes (entire model) | 9.2 | 20.9 |
+ | No | No | fp16 | AWQ (LLM only) | 10.9 | 15.9 |
+ | Yes | No | fp16 | AWQ (LLM only) | 7.8 | 12.3 |
+ | No | No | fp16 | AWQ + fusing (LLM only) | 10.5 | 19.5 |
+ 
+ To learn more about quantization schemes and fusing, we refer to the [documentation](https://huggingface.co/docs/transformers/quantization).
+ 
  # Bias, Risks, and Limitations
  
  Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
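To make the trade-offs in the table above concrete, here is a sketch of one of the low-memory configurations: Flash Attention 2, fp16 compute and 4-bit bitsandbytes quantization of the entire model (the 8.9 GB row with image splitting, 6.0 GB without). The `device_map` argument and the processor call are assumptions added here for completeness:

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

DEVICE = "cuda:0"  # assumption: a single-GPU setup

# 4-bit NF4 quantization of the entire model, as in the bitsandbytes snippet above.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# Combine with Flash Attention 2 and fp16 compute.
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    _attn_implementation="flash_attention_2",
    quantization_config=quantization_config,
    device_map=DEVICE,  # assumption: place the quantized weights directly on the GPU
)

# Turning image splitting off corresponds to the lower-memory rows of the table;
# keep the default (True) for OCR-heavy inputs.
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
```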