---
license: apache-2.0
base_model: HuggingFaceM4/idefics2-8b
tags:
- multimodal
- lmm
- vlm
- llava
- siglip
- llama3
- mantis
model-index:
- name: mantis-8b-idefics2_8192
  results: []
datasets:
- TIGER-Lab/Mantis-Instruct
language:
- en
---

# 🔥 Mantis

[Paper](https://arxiv.org/abs/2405.01483) | [Website](https://tiger-ai-lab.github.io/Mantis/) | [Github](https://github.com/TIGER-AI-Lab/Mantis) | [Models](https://huggingface.co/collections/TIGER-Lab/mantis-6619b0834594c878cdb1d6e4) | [Demo](https://huggingface.co/spaces/TIGER-Lab/Mantis)

![Mantis](https://tiger-ai-lab.github.io/Mantis/images/radar_chart.png)

**Excited to announce Mantis-Idefics2, with enhanced ability in multi-image scenarios!**
It's fine-tuned on [Mantis-Instruct](https://huggingface.co/datasets/TIGER-Lab/Mantis-Instruct) from [Idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b)

## Summary

- Mantis-Idefics2 is LMM with **interleaved text and image as inputs**, trained on Mantis-Instruct under academic-level resources (i.e. 36 hours on 16xA100-40G).
- Mantis is trained to have multi-image skills including co-reference, reasoning, comparing, temporal understanding.
- Mantis reaches the state-of-the-art performance on five multi-image benchmarks (NLVR2, Q-Bench, BLINK, MVBench, Mantis-Eval), and also maintain a strong single-image performance on par with CogVLM and Emu2.

## Multi-Image Performance

| Models             | Size |  Format  | NLVR2 | Q-Bench | Mantis-Eval | BLINK | MVBench |  Avg |
|--------------------|:----:|:--------:|:-----:|:-------:|:-----------:|:-----:|:-------:|:----:|
| GPT-4V             |   -  | sequence | 88.80 |  76.52  |    62.67    | 51.14 |  43.50  | 64.5 |
| Open Source Models |      |          |       |         |             |       |         |      |
| Random             |   -  |     -    | 48.93 |  40.20  |    23.04    | 38.09 |  27.30  | 35.5 |
| Kosmos2            | 1.6B |   merge  | 49.00 |  35.10  |    30.41    | 37.50 |  21.62  | 34.7 |
| LLaVA-v1.5         |  7B  |   merge  | 53.88 |  49.32  |    31.34    | 37.13 |  36.00  | 41.5 |
| LLava-V1.6         |  7B  |   merge  | 58.88 |  54.80  |    45.62    | 39.55 |  40.90  | 48.0 |
| Qwen-VL-Chat       |  7B  |   merge  | 58.72 |  45.90  |    39.17    | 31.17 |  42.15  | 43.4 |
| Fuyu               |  8B  |   merge  | 51.10 |  49.15  |    27.19    | 36.59 |  30.20  | 38.8 |
| BLIP-2             |  13B |   merge  | 59.42 |  51.20  |    49.77    | 39.45 |  31.40  | 46.2 |
| InstructBLIP       |  13B |   merge  | 60.26 |  44.30  |    45.62    | 42.24 |  32.50  | 45.0 |
| CogVLM             |  17B |   merge  | 58.58 |  53.20  |    45.16    | 41.54 |  37.30  | 47.2 |
| OpenFlamingo       |  9B  | sequence | 36.41 |  19.60  |    12.44    | 39.18 |   7.90  | 23.1 |
| Otter-Image        |  9B  | sequence | 49.15 |  17.50  |    14.29    | 36.26 |  15.30  | 26.5 |
| Idefics1           |  9B  | sequence | 54.63 |  30.60  |    28.11    | 24.69 |  26.42  | 32.9 |
| VideoLLaVA         |  7B  | sequence | 56.48 |  45.70  |    35.94    | 38.92 |  44.30  | 44.3 |
| Emu2-Chat          |  37B | sequence | 58.16 |  50.05  |    37.79    | 36.20 |  39.72  | 44.4 |
| Vila               |  8B  | sequence | 76.45 |  45.70  |    51.15    | 39.30 |  49.40  | 52.4 |
| Idefics2           |  8B  | sequence | 86.87 |  57.00  |    48.85    | 45.18 |  29.68  | 53.5 |
| Mantis-CLIP        |  8B  | sequence | 84.66 |  66.00  |    55.76    | 47.06 |  48.30  | 60.4 |
| Mantis-SIGLIP      |  8B  | sequence | 87.43 |  69.90  |    **59.45**    | 46.35 |  50.15  | 62.7 |
| Mantis-Flamingo    |  9B  | sequence | 52.96 |  46.80  |    32.72    | 38.00 |  40.83  | 42.3 |
| Mantis-Idefics2    |  8B  | sequence | **89.71** |  **75.20**  |    57.14    | **49.05** |  **51.38**  | **64.5** |
| $\Delta$ over SOTA |   -  |     -    |  +2.84 |  +18.20  |     +8.30    |  +3.87 |   +1.98  | +11.0 |

## Single-Image Performance

| Model           | Size | TextVQA |  VQA |  MMB | MMMU | OKVQA |  SQA | MathVista |  Avg |
|-----------------|:----:|:-------:|:----:|:----:|:----:|:-----:|:----:|:---------:|:----:|
| OpenFlamingo    |  9B  |   46.3  | 58.0 | 32.4 | 28.7 |  51.4 | 45.7 |    18.6   | 40.2 |
| Idefics1        |  9B  |   39.3  | 68.8 | 45.3 | 32.5 |  50.4 | 51.6 |    21.1   | 44.1 |
| InstructBLIP    |  7B  |   33.6  | 75.2 | 38.3 | 30.6 |  45.2 | 70.6 |    24.4   | 45.4 |
| Yi-VL           |  6B  |   44.8  | 72.5 | 68.4 | 39.1 |  51.3 | 71.7 |    29.7   | 53.9 |
| Qwen-VL-Chat    |  7B  |   63.8  | 78.2 | 61.8 | 35.9 |  56.6 | 68.2 |    15.5   | 54.3 |
| LLaVA-1.5       |  7B  |   58.2  | 76.6 | 64.8 | 35.3 |  53.4 | 70.4 |    25.6   | 54.9 |
| Emu2-Chat       |  37B |   __66.6__  | **84.9** | 63.6 | 36.3 |  **64.8** | 65.3 |    30.7   | 58.9 |
| CogVLM          |  17B |   **70.4**  | __82.3__ | 65.8 | 32.1 |  __64.8__ | 65.6 |    35.0   | 59.4 |
| Idefics2        |  8B  |   70.4  | 79.1 | __75.7__ | **43.0** |  53.5 | **86.5** |    **51.4**   | **65.7** |
| Mantis-CLIP     |  8B  |   56.4  | 73.0 | 66.0 | 38.1 |  53.0 | 73.8 |    31.7   | 56.0 |
| Mantis-SigLIP   |  8B  |   59.2  | 74.9 | 68.7 | 40.1 |  55.4 | 74.9 |    34.4   | 58.2 |
| Mantis-Idefics2 |  8B  |   63.5  | 77.6 | 75.7 | __41.1__ |  52.6 | __81.3__ |    __40.4__   | __61.7__ |

## How to use

### Run example inference:
```python
```

### Training
See [mantis/train](https://github.com/TIGER-AI-Lab/Mantis/tree/main/mantis/train) for details

### Evaluation
See [mantis/benchmark](https://github.com/TIGER-AI-Lab/Mantis/tree/main/mantis/benchmark) for details

**Please cite our paper or give a star to out Github repo if you find this model useful**

## Citation
```
@inproceedings{Jiang2024MANTISIM,
  title={MANTIS: Interleaved Multi-Image Instruction Tuning},
  author={Dongfu Jiang and Xuan He and Huaye Zeng and Cong Wei and Max W.F. Ku and Qian Liu and Wenhu Chen},
  publisher={arXiv2405.01483}
  year={2024},
}
```