---
base_model: HuggingFaceM4/idefics2-8b
datasets:
- TIGER-Lab/Mantis-Instruct
language:
- en
license: apache-2.0
tags:
- multimodal
- lmm
- vlm
- llava
- siglip
- llama3
- mantis
model-index:
- name: mantis-8b-idefics2_8192
  results: []
---

# 🔥 Mantis (TMLR 2024)

[Paper](https://arxiv.org/abs/2405.01483) | 
[Website](https://tiger-ai-lab.github.io/Mantis/) | 
[GitHub](https://github.com/TIGER-AI-Lab/Mantis) | 
[Models](https://huggingface.co/collections/TIGER-Lab/mantis-6619b0834594c878cdb1d6e4) | 
[Demo](https://huggingface.co/spaces/TIGER-Lab/Mantis) |
[Wandb](https://api.wandb.ai/links/dongfu/lnkrl3af)

![Mantis](https://tiger-ai-lab.github.io/Mantis/images/radar_chart.png)

**Excited to announce Mantis-Idefics2, with enhanced ability in multi-image scenarios!**
It is fine-tuned from [Idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b) on [Mantis-Instruct](https://huggingface.co/datasets/TIGER-Lab/Mantis-Instruct).

## Summary

- Mantis-Idefics2 is a large multimodal model (LMM) that takes **interleaved text and images as input**, trained on Mantis-Instruct with academic-level resources (36 hours on 16xA100-40G).
- Mantis is trained for multi-image skills, including co-reference, reasoning, comparison, and temporal understanding.
- Mantis achieves state-of-the-art performance on five multi-image benchmarks (NLVR2, Q-Bench, BLINK, MVBench, Mantis-Eval) while maintaining strong single-image performance, on par with CogVLM and Emu2.

## Multi-Image Performance

| Models             | Size |  Format  | NLVR2 | Q-Bench | Mantis-Eval | BLINK | MVBench |  Avg |
|--------------------|:----:|:--------:|:-----:|:-------:|:-----------:|:-----:|:-------:|:----:|
| GPT-4V             |   -  | sequence | 88.80 |  76.52  |    62.67    | 51.14 |  43.50  | 64.5 |
| Open Source Models |      |          |       |         |             |       |         |      |
| Random             |   -  |     -    | 48.93 |  40.20  |    23.04    | 38.09 |  27.30  | 35.5 |
| Kosmos2            | 1.6B |   merge  | 49.00 |  35.10  |    30.41    | 37.50 |  21.62  | 34.7 |
| LLaVA-v1.5         |  7B  |   merge  | 53.88 |  49.32  |    31.34    | 37.13 |  36.00  | 41.5 |
| LLaVA-v1.6         |  7B  |   merge  | 58.88 |  54.80  |    45.62    | 39.55 |  40.90  | 48.0 |
| Qwen-VL-Chat       |  7B  |   merge  | 58.72 |  45.90  |    39.17    | 31.17 |  42.15  | 43.4 |
| Fuyu               |  8B  |   merge  | 51.10 |  49.15  |    27.19    | 36.59 |  30.20  | 38.8 |
| BLIP-2             |  13B |   merge  | 59.42 |  51.20  |    49.77    | 39.45 |  31.40  | 46.2 |
| InstructBLIP       |  13B |   merge  | 60.26 |  44.30  |    45.62    | 42.24 |  32.50  | 45.0 |
| CogVLM             |  17B |   merge  | 58.58 |  53.20  |    45.16    | 41.54 |  37.30  | 47.2 |
| OpenFlamingo       |  9B  | sequence | 36.41 |  19.60  |    12.44    | 39.18 |   7.90  | 23.1 |
| Otter-Image        |  9B  | sequence | 49.15 |  17.50  |    14.29    | 36.26 |  15.30  | 26.5 |
| Idefics1           |  9B  | sequence | 54.63 |  30.60  |    28.11    | 24.69 |  26.42  | 32.9 |
| VideoLLaVA         |  7B  | sequence | 56.48 |  45.70  |    35.94    | 38.92 |  44.30  | 44.3 |
| Emu2-Chat          |  37B | sequence | 58.16 |  50.05  |    37.79    | 36.20 |  39.72  | 44.4 |
| Vila               |  8B  | sequence | 76.45 |  45.70  |    51.15    | 39.30 |  49.40  | 52.4 |
| Idefics2           |  8B  | sequence | 86.87 |  57.00  |    48.85    | 45.18 |  29.68  | 53.5 |
| Mantis-CLIP        |  8B  | sequence | 84.66 |  66.00  |    55.76    | 47.06 |  48.30  | 60.4 |
| Mantis-SigLIP      |  8B  | sequence | 87.43 |  69.90  |    **59.45**    | 46.35 |  50.15  | 62.7 |
| Mantis-Flamingo    |  9B  | sequence | 52.96 |  46.80  |    32.72    | 38.00 |  40.83  | 42.3 |
| Mantis-Idefics2    |  8B  | sequence | **89.71** |  **75.20**  |    57.14    | **49.05** |  **51.38**  | **64.5** |
| $\Delta$ over SOTA |   -  |     -    |  +2.84 |  +18.20  |     +8.30    |  +3.87 |   +1.98  | +11.0 |
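
The table does not define the Avg column or the $\Delta$ over SOTA row. The sketch below reproduces both under one assumed reading: Avg is the unweighted mean of the five benchmark scores, and each $\Delta$ is the best Mantis score in a column minus the best open-source baseline score in that column. Only Vila and Idefics2 are listed as baselines because, in the table above, one of the two holds the top non-Mantis open-source score in every column. The variable and function names are illustrative.

```python
# Scores copied from the table above, in column order:
# NLVR2, Q-Bench, Mantis-Eval, BLINK, MVBench.
baselines = {
    "Vila":     [76.45, 45.70, 51.15, 39.30, 49.40],
    "Idefics2": [86.87, 57.00, 48.85, 45.18, 29.68],
}
mantis = {
    "Mantis-CLIP":     [84.66, 66.00, 55.76, 47.06, 48.30],
    "Mantis-SigLIP":   [87.43, 69.90, 59.45, 46.35, 50.15],
    "Mantis-Flamingo": [52.96, 46.80, 32.72, 38.00, 40.83],
    "Mantis-Idefics2": [89.71, 75.20, 57.14, 49.05, 51.38],
}


def average(scores):
    """Unweighted mean over the five benchmarks, rounded as in the table."""
    return round(sum(scores) / len(scores), 1)


def best_per_column(rows):
    """Highest score in each benchmark column across the given models."""
    return [max(column) for column in zip(*rows.values())]


averages = {name: average(s) for name, s in {**baselines, **mantis}.items()}

# Assumed definition: best Mantis variant minus best open-source baseline.
deltas = [
    round(m - b, 2)
    for m, b in zip(best_per_column(mantis), best_per_column(baselines))
]
avg_delta = round(
    max(averages[n] for n in mantis) - max(averages[n] for n in baselines), 1
)

print(averages["Mantis-Idefics2"], deltas, avg_delta)
```

Under this reading the Mantis-Eval gain comes from Mantis-SigLIP (59.45 vs. Vila's 51.15), not from Mantis-Idefics2, so the $\Delta$ row describes the Mantis family as a whole.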

## Single-Image Performance

| Model           | Size | TextVQA |  VQA |  MMB | MMMU | OKVQA |  SQA | MathVista |  Avg |
|-----------------|:----:|:-------:|:----:|:----:|:----:|:-----:|:----:|:---------:|:----:|
| OpenFlamingo    |  9B  |   46.3  | 58.0 | 32.4 | 28.7 |  51.4 | 45.7 |    18.6   | 40.2 |
| Idefics1        |  9B  |   39.3  | 68.8 | 45.3 | 32.5 |  50.4 | 51.6 |    21.1   | 44.1 |
| InstructBLIP    |  7B  |   33.6  | 75.2 | 38.3 | 30.6 |  45.2 | 70.6 |    24.4   | 45.4 |
| Yi-VL           |  6B  |   44.8  | 72.5 | 68.4 | 39.1 |  51.3 | 71.7 |    29.7   | 53.9 |
| Qwen-VL-Chat    |  7B  |   63.8  | 78.2 | 61.8 | 35.9 |  56.6 | 68.2 |    15.5   | 54.3 |
| LLaVA-1.5       |  7B  |   58.2  | 76.6 | 64.8 | 35.3 |  53.4 | 70.4 |    25.6   | 54.9 |
| Emu2-Chat       |  37B |   <u>66.6</u>  | **84.9** | 63.6 | 36.3 |  **64.8** | 65.3 |    30.7   | 58.9 |
| CogVLM          |  17B |   **70.4**  | <u>82.3</u> | 65.8 | 32.1 |  <u>64.8</u> | 65.6 |    35.0   | 59.4 |
| Idefics2        |  8B  |   70.4  | 79.1 | <u>75.7</u> | **43.0** |  53.5 | **86.5** |    **51.4**   | **65.7** |
| Mantis-CLIP     |  8B  |   56.4  | 73.0 | 66.0 | 38.1 |  53.0 | 73.8 |    31.7   | 56.0 |
| Mantis-SigLIP   |  8B  |   59.2  | 74.9 | 68.7 | 40.1 |  55.4 | 74.9 |    34.4   | 58.2 |
| Mantis-Idefics2 |  8B  |   63.5  | 77.6 | 75.7 | <u>41.1</u> |  52.6 | <u>81.3</u> |    <u>40.4</u>   | <u>61.7</u> |

## How to use

### Run example inference:
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image


processor = AutoProcessor.from_pretrained("TIGER-Lab/Mantis-8B-Idefics2") # do_image_splitting is False by default
model = AutoModelForVision2Seq.from_pretrained(
    "TIGER-Lab/Mantis-8B-Idefics2",
    device_map="auto"
)
generation_kwargs = {
    "max_new_tokens": 1024,
    "num_beams": 1,
    "do_sample": False
}

# Note: the processor also accepts image URLs in place of the loaded PIL images
image1 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image2 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")
images = [image1, image2, image3]


query1 = "What cities image 1, image 2, and image 3 belong to respectively? Answer me in order."
query2 = "Which one do you recommend for a visit? and why?"
query3 = "Which picture has most cars in it?"

### Chat
### Round 1
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": query1},
        ]
    }    
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=images, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate
generated_ids = model.generate(**inputs, **generation_kwargs)
response = processor.batch_decode(generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print("User: ", query1)
print("ASSISTANT: ", response[0])

### Round 2
messages.append(
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": response[0]},
        ]
    }
)
messages.append(
    {
        "role": "user",
        "content": [
            {"type": "text", "text": query2},
        ]
    }
)
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=images, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
generated_ids = model.generate(**inputs, **generation_kwargs)
response = processor.batch_decode(generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print("User: ", query2)
print("ASSISTANT: ", response[0])

### Round 3
messages.append(
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": response[0]},
        ]
    }
)
messages.append(
    {
        "role": "user",
        "content": [
            {"type": "text", "text": query3},
        ]
    }
)

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=images, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
generated_ids = model.generate(**inputs, **generation_kwargs)
response = processor.batch_decode(generated_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print("User: ", query3)
print("ASSISTANT: ", response[0])


"""
User:  What cities image 1, image 2, and image 3 belong to respectively? Answer me in order.
ASSISTANT:  Chicago, New York, San Francisco
User:  Which one do you recommend for a visit? and why?
ASSISTANT:  New York - because it's a bustling metropolis with iconic landmarks like the Statue of Liberty and the Empire State Building.
User:  Which picture has most cars in it?
ASSISTANT:  Image 3
"""
```
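
The three rounds above repeat the same steps: append a user turn, render the prompt, generate, decode, and append the assistant reply. A minimal sketch of how that loop could be factored out follows. The helpers `user_turn`, `assistant_turn`, and `chat` are hypothetical names introduced here, not part of the Mantis or `transformers` API; they only reuse the message format and processor/model calls shown in the example.

```python
def user_turn(text, num_images=0):
    """Build one user message: image placeholders first, then the text."""
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


def assistant_turn(text):
    """Build one assistant message holding a previous model reply."""
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}


def chat(model, processor, images, queries, **generation_kwargs):
    """Yield (query, reply) pairs for a multi-round chat (hypothetical helper).

    As in the example above, all image placeholders go into the first user
    turn and the same image list is passed to the processor every round.
    """
    messages = []
    for round_index, query in enumerate(queries):
        num_images = len(images) if round_index == 0 else 0
        messages.append(user_turn(query, num_images=num_images))
        prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
        inputs = processor(text=prompt, images=images, return_tensors="pt")
        inputs = {k: v.to(model.device) for k, v in inputs.items()}
        generated_ids = model.generate(**inputs, **generation_kwargs)
        # Keep only the newly generated tokens, dropping the echoed prompt.
        new_tokens = generated_ids[:, inputs["input_ids"].shape[1]:]
        reply = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
        messages.append(assistant_turn(reply))
        yield query, reply
```

With the objects defined in the example, `for q, a in chat(model, processor, images, [query1, query2, query3], **generation_kwargs): print("User: ", q); print("ASSISTANT: ", a)` should walk through the same three rounds.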

### Training
See [mantis/train](https://github.com/TIGER-AI-Lab/Mantis/tree/main/mantis/train) for details.

### Evaluation
See [mantis/benchmark](https://github.com/TIGER-AI-Lab/Mantis/tree/main/mantis/benchmark) for details.

**Please cite our paper or star our GitHub repo if you find this model useful.**

## Citation
```
@article{Jiang2024MANTISIM,
  title={MANTIS: Interleaved Multi-Image Instruction Tuning},
  author={Dongfu Jiang and Xuan He and Huaye Zeng and Cong Wei and Max W.F. Ku and Qian Liu and Wenhu Chen},
  journal={Transactions on Machine Learning Research},
  year={2024},
  volume={2024},
  url={https://arxiv.org/abs/2405.01483}
}
```