---
license: apache-2.0
datasets:
- HuggingFaceM4/OBELICS
- wikipedia
- pixparse/pdfa-eng-wds
- wendlerc/RenderedText
- HuggingFaceM4/the_cauldron
- teknium/OpenHermes-2.5
- databricks/databricks-dolly-15k
- Lin-Chen/ShareGPT4V
- jxu124/llava_conversation_58k
- pixparse/docvqa-single-page-questions
- flaviagiammarino/path-vqa
- flaviagiammarino/vqa-rad
language:
- en
tags:
- multimodal
- vision
- image-text-to-text
---

<p align="center">
    <img src="https://huggingface.co/HuggingFaceM4/idefics-80b/resolve/main/assets/IDEFICS.png" alt="Idefics-Obelics logo" width="200" height="100">
</p>

# Idefics2

Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. The model can answer questions about images, describe visual content, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs. 

## Model Information
- Base Model: [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b)
- Dataset Used: [DocVQA dataset](https://huggingface.co/datasets/pixparse/docvqa-single-page-questions)
  - Introduced in Mathew et al. (2021)
  - Consists of 50,000 questions defined on 12,000+ document images
  - For further information, visit the [challenge page](https://rrc.cvc.uab.es/?ch=17) and [paper](https://arxiv.org/abs/2007.00398)

## Training Details
- The training process took approximately 38hours on an A100 80GB GPU, and model was fine-tuned using QLoRA.
- Trained with 39.5k train dataset from [DocVQA single page questions](https://huggingface.co/datasets/pixparse/docvqa-single-page-questions)
- Training Log:
  
| Epoch | Loss  | Grad Norm | Learning Rate |
|-------|-------|-----------|---------------|
| 0.01  | 2.3776| 10.40     | 4.8e-05       |
| 0.25  | 0.5029| 6.10      | 9.5412e-05    |
| 0.50  | 0.434 | 5.74      | 7.5973e-05    |
| 0.75  | 0.4608| 7.46      | 7.3925e-05    |
| 1.0   | 0.3846| 4.77      | 5.0369e-05    |
| 1.25  | 0.3226| 3.63      | 4.9857e-05    |
| 1.5   | 0.3175| 5.03      | 2.5277e-05    |
| 1.75  | 0.2918| 5.63      | 2.5789e-05    |

{'train_runtime': 141781.6786, 'train_samples_per_second': 0.557, 'train_steps_per_second': 0.035, 'train_loss': 0.3973848872424526, 'epoch': 2.0}

# Technical summary

Idefics2 exhibits strong performance for a model of its size (8B parameters) when compared to other open multimodal models and is often competitive with closed-source systems. As such, it serves as a strong foundation for various use-case specific fine-tunings.

<details><summary>For more details, expand the result table.</summary>

| <nobr>Model</nobr>        | <nobr>Open <br>weights</nobr> | <nobr>Size</nobr> | <nobr># tokens <br>per image</nobr> | <nobr>MMMU <br>(val/test)</nobr>   | <nobr>MathVista <br>(testmini)</nobr> | <nobr>TextVQA <br>(val)</nobr> | <nobr>MMBench <br>(test)</nobr> | <nobr>VQAv2 <br>(test-dev)</nobr> | <nobr>DocVQA <br>(test)</nobr> |
|--------------|-------------|------|--------------------|-----------|-----------|---------|---------|---------|---------|
| [DeepSeek-VL](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)  | ✅ |  7B   | 576                | 36.6/-   | 36.1      | 64.4       | 73.2    |  -     |   49.6   |
| [LLaVa-NeXT-Mistral-7B](https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b)   | ✅ | 7B  | 2880               | 35.3/-   | 37.7      | 65.7    | 68.7  | 82.2	 |   -   |
| [LLaVa-NeXT-13B](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-13b)   | ✅ | 13B  | 2880               | 36.2/-   | 35.3      | 67.1    | 70.0  | 82.8 |   -   |
| [LLaVa-NeXT-34B](https://huggingface.co/liuhaotian/llava-v1.6-34b) | ✅ |  34B    | 2880                  | 51.1/44.7 | 46.5  | 69.5  | 79.3    | 83.7    |   -   |   -   |
| MM1-Chat-7B  | ❌ | 7B   | 720                | 37.0/35.6 | 35.9      | 72.8    | 72.3    |   -   |    -   |
| MM1-Chat-30B | ❌ | 30B    | 720                  | 44.7/40.3 | 39.4  | 73.5  | 75.1    |    83.7   |       |
| Gemini 1.0 Pro | ❌ | 🤷‍♂️ |  🤷‍♂️  |  47.9/-  |   45.2   |    74.6    |   -    | 71.2 |  88.1  |
| Gemini 1.5 Pro | ❌ | 🤷‍♂️ |  🤷‍♂️  |  58.5/-  |   52.1   |    73.5    |   -    | 73.2 |  86.5  |
| Claude 3 Haiku |  ❌ | 🤷‍♂️ |  🤷‍♂️  |  50.2/-  |   46.4   |    -    |   -    | - |  88.8  |
|      |    |                  |  |       |    |     |
| [Idefics1 instruct](https://huggingface.co/HuggingFaceM4/idefics-80b-instruct) (32-shots) | ✅ |  80B |  -  |  -  |   -   |    39.3    |   -    | 68.8 |  -  |
|      |    |                  |  |       |    |     |
| **Idefics2** (w/o im. split) | ✅ |  8B   | 64                 | 43.5/37.9 | 51.6      | 70.4    | 76.8    | 80.8 | 67.3 |
| **Idefics2** (w/ im. split) | ✅ |  8B   | 320                | 43.0/37.7 | 51.4      | 73.0    | 76.7    | 81.2 | 74.0 |
| **Idefics2 DocVQA Finetuned** (w/ im. split) | ✅ |  8B   | 320                | 43.0/37.7 | 52.5      | 72.0    | 77.7    | 81.1 | 72.5 |
</details>

Idefics2 is trained in 2 stages for maximum efficiency. In a first stage, images are fed to the model at SigLIP's native resolution (squares of 384 x 384). In the second stage, images are fed to the model at their native resolution (with a maximum of 980 and a minimum of 378) and native aspect ratio.

We use DoRA to train the parameters initialized from pre-trained backbones and full fine-tuning for newly initialized parameters (modality connector), as we find this strategy to be more stable as well as more computationally efficient.

# Vision Encoder Efficiency

Given the high resolution supported, the vision part of the model can be memory hungry depending on your configuration. If you are GPU-memory-constrained, you can:

1. **Deactivate image splitting**: To do so, add `do_image_splitting=False` when initializing the processor (`AutoProcessor.from_pretrained`). There are no changes required on the model side. Note that only the SFT model has been trained with image splitting.

2. **Decrease maximum image resolution**: To do so, add `size={"longest_edge": 448, "shortest_edge": 378}` when initializing the processor (`AutoProcessor.from_pretrained`). In particular, the `longest_edge` value can be adapted to fit the need (the default value is 980). We recommend using values that are multiples of 14. There are no changes required on the model side.

`do_image_splitting=True` is especially needed to boost performance on OCR tasks where a very large image is used as input. For regular VQA or captioning tasks, this argument can be safely set to `False` with minimal impact on performance (see the evaluation table above).


# How to Get Started

This section shows snippets of code for generation for `idefics2-8b-DocVQA-finetuned`. Let's first define some common imports and inputs.

```python
import requests
import torch
from PIL import Image
from io import BytesIO

from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda:0"

# Note that passing the image urls (instead of the actual pil images) to the processor is also possible
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")
```

**For `idefics2-8b-base`**

<details><summary>Click to expand.</summary>

```python
processor = AutoProcessor.from_pretrained("idefics2-8b-DocVQA-finetuned")
model = AutoModelForVision2Seq.from_pretrained(
    "Reverb/idefics2-8b-DocVQA-finetuned",
).to(DEVICE)

# Create inputs
prompts = [
  "<image>In this image, we can see the city of New York, and more specifically the Statue of Liberty.<image>In this image,",
  "In which city is that bridge located?<image>",
]
images = [[image1, image2], [image3]]
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}


# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)

print(generated_texts)
# ['In this image, we can see the city of New York, and more specifically the Statue of Liberty. In this image, we can see the city of Chicago, and more specifically the skyscrapers of the city.', 'In which city is that bridge located? The Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and']
```

</details>

**Text generation inference**

Idefics2 is integrated into [TGI](https://github.com/huggingface/text-generation-inference).

Multiple images can be passed on with the markdown syntax (`![](IMAGE_URL)`) and no spaces are required before and after. The dialogue utterances can be separated with `<end_of_utterance>\n` followed by `User:` or `Assistant:`. `User:` is followed by a space if the following characters are real text (no space if followed by an image).

# Model optimizations

If your GPU allows, we first recommend loading (and running inference) in half precision (`torch.float16` or `torch.bfloat16`).

```diff
model = AutoModelForVision2Seq.from_pretrained(
    "Reverb/idefics2-8b-DocVQA-finetuned",
+    torch_dtype=torch.float16,    
).to(DEVICE)
```

**Vision encoder efficiency**

Given the high resolution supported, the vision part of the model can be memory hungry depending on your configuration. If you are GPU-memory-constrained, you can:
- **deactivate the image splitting.** To do so, add `do_image_splitting=False` when initializing the processor (`AutoProcessor.from_pretrained`). There are no changes required on the model side. Note that only the sft model has been trained with image splitting.
- **decrease the maximum image resolution.** To do so, add `size= {"longest_edge": 448, "shortest_edge": 378}` when initializing the processor (`AutoProcessor.from_pretrained`). In particular, the `longest_edge` value can be adapted to fit the need (the default value is `980`). We recommend using values that are multiples of 14. There are no changes required on the model side.

`do_image_splitting=True` is especially needed to boost performance on OCR tasks where a very large image is used as input. For the regular VQA or captioning tasks, this argument can be safely set to `False` with minimal impact on performance (see the evaluation table above).

**Using Flash-attention 2 to speed up generation**

<details><summary>Click to expand.</summary>

First, make sure to install `flash-attn`. Refer to the [original repository of Flash Attention](https://github.com/Dao-AILab/flash-attention) for the package installation. Simply change the snippet above with: 

```diff
model = AutoModelForVision2Seq.from_pretrained(
    "Reverb/idefics2-8b-DocVQA-finetuned",
+    torch_dtype=torch.float16,    
+    _attn_implementation="flash_attention_2",
).to(DEVICE)
```

</details>

These optimizations can be combined to suit variable trade-offs between GPU memory, inference speed and performance. We provide the following comparison as anchor points to guide the user in choosing necessary optimizations. All of these benchmarks were computed with the example code snippet described above on a H100 (see [colab](https://colab.research.google.com/drive/1USsnssoFm1UTYuwUOw0XiGeBspLHzvso?usp=sharing)). As one can see, the are a few setups that require less than 24GB of GPU memory.

| Flash attention 2 | Image splitting | Float type | 4 bits quantization         | Peak GPU memory (GB) | Time for 20 generations (secs) |
|-------------------|-----------------|------------|-----------------------------|----------------------|--------------------------------|
| No                | Yes             | fp32       | No                          |                 54.9 |                           55.6 |
| No                | Yes             | bf16       | No                          |                 41.3 |                           34.3 |
| No                | Yes             | fp16       | No                          |                 36.7 |                           33.3 |
| Yes               | Yes             | fp16       | No                          |                 21.0 |                           13.3 |
| Yes               | Yes             | fp16       | bitsandbytes (entire model) |                  8.9 |                           19.9 |
| No                | Yes             | fp16       | bitsandbytes (entire model) |                 24.7 |                           40.4 |
| No                | Yes             | fp16       | AWQ (LLM only)              |                 26.4 |                           37.1 |
| Yes               | Yes             | fp16       | AWQ (LLM only)              |                 10.7 |                           16.3 |
| No                | Yes             | fp16       | AWQ + fusing (LLM only)     |                 26.0 |                           38.4 |
|                   |                 |            |                             |                      |                                |
| No                | No              | fp32       | No                          |                 38.8 |                           17.5 |
| No                | No              | bf16       | No                          |                 22.2 |                           14.4 |
| No                | No              | fp16       | No                          |                 21.3 |                           13.9 |
| Yes               | No              | fp16       | No                          |                 18.1 |                           10.4 |
| Yes               | No              | fp16       | bitsandbytes (entire model) |                  6.0 |                           17.3 |
| No                | No              | fp16       | bitsandbytes (entire model) |                  9.2 |                           20.9 |
| No                | No              | fp16       | AWQ (LLM only)              |                 10.9 |                           15.9 |
| Yes               | No              | fp16       | AWQ (LLM only)              |                  7.8 |                           12.3 |
| No                | No              | fp16       | AWQ + fusing (LLM only)     |                 10.5 |                           19.5 |

To learn more quantization schemes and fusing, we refer to the [documentation](https://huggingface.co/docs/transformers/quantization).

# Bias, Risks, and Limitations

- The model currently will offer medical diagnosis when prompted to do so ([vqa-rad](https://huggingface.co/datasets/flaviagiammarino/vqa-rad), a dataset of QA pairs on radiology images is present in the SFT mixture). For example, the prompt `Does this X-ray show any medical problems?` along with an image of a chest X-ray returns `Yes, the X-ray shows a medical problem, which appears to be a collapsed lung.`. We discourage users from using the model on medical applications without proper adaptation and evaluation.
- Despite our efforts in filtering the training data, we found a small proportion of content that is not suitable for all audiences. This includes pornographic content and reports of violent shootings and is prevalent in the OBELICS portion of the data (see [here](https://huggingface.co/datasets/HuggingFaceM4/OBELICS#content-warnings) for more details). As such, the model is susceptible to generating text that resembles this content.
- We note that we know relatively little about the composition of the pre-trained LM backbone, which makes it difficult to link inherited limitations or problematic behaviors to their data.

# Misuse and Out-of-scope use

Using the model in [high-stakes](https://huggingface.co/bigscience/bloom/blob/main/README.md#glossary-and-calculations) settings is out of scope for this model. The model is not designed for [critical decisions](https://huggingface.co/bigscience/bloom/blob/main/README.md#glossary-and-calculations) nor uses with any material consequences on an individual's livelihood or wellbeing. The model outputs content that appears factual but may not be correct. Out-of-scope uses include:
- Usage for evaluating or scoring individuals, such as for employment, education, or credit
- Applying the model for critical automatic decisions, generating factual content, creating reliable summaries, or generating predictions that must be correct

Intentionally using the model for harm, violating [human rights](https://huggingface.co/bigscience/bloom/blob/main/README.md#glossary-and-calculations), or other kinds of malicious activities, is a misuse of this model. This includes:
- Spam generation
- Disinformation and influence operations
- Disparagement and defamation
- Harassment and abuse
- [Deception](https://huggingface.co/bigscience/bloom/blob/main/README.md#glossary-and-calculations)
- Unconsented impersonation and imitation
- Unconsented surveillance

# Citation

**BibTeX:**

```bibtex
@misc{laurencon2023obelics,
      title={OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents},
      author={Hugo Laurençon and Lucile Saulnier and Léo Tronchon and Stas Bekman and Amanpreet Singh and Anton Lozhkov and Thomas Wang and Siddharth Karamcheti and Alexander M. Rush and Douwe Kiela and Matthieu Cord and Victor Sanh},
      year={2023},
      eprint={2306.16527},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

@misc{laurençon2024matters,
      title={What matters when building vision-language models?}, 
      author={Hugo Laurençon and Léo Tronchon and Matthieu Cord and Victor Sanh},
      year={2024},
      eprint={2405.02246},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```