Update README.md

<div align=center>
<img src="assets/teaser.jpg" width=600 >
</div>

# 1. Introduction 📚
**TL;DR: ChatRex is a MLLM skilled in perception that can respond to questions while simultaneously grounding its answers to the referenced objects.**

ChatRex is a Multimodal Large Language Model (MLLM) designed to seamlessly integrate fine-grained object perception and robust language understanding. By adopting a decoupled architecture with a retrieval-based approach for object detection and leveraging high-resolution visual inputs, ChatRex addresses key challenges in perception tasks. It is powered by the Rexverse-2M dataset with diverse image-region-text annotations. ChatRex can be applied to various scenarios requiring fine-grained perception, such as object detection, grounded conversation, grounded image captioning and region
understanding.

<div align=center>
<img src="assets/capability_overview.jpg" width=800 >
</div>

----

# 2. Installation 🛠️
```bash
conda install -n chatrex python=3.9
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121
pip install -v -e .
# install deformable attention for universal proposal network
cd chatrex/upn/ops
pip install -v -e .
```

## 2.1 Download Pre-trained Models
We provide model checkpoints for both the ***Universal Proposal Network (UPN)*** and the ***ChatRex model***. You can download the pre-trained models from the following links:
- [UPN Checkpoint](https://drive.google)
- [ChatRex-7B Checkpoint](https://huggingface.co/IDEA-Research/ChatRex-7B)

Or you can also using the following command to download the pre-trained models:
```bash
mkdir checkpoints
mkdir checkpoints/upn
# download UPN checkpoint
wget -O checkpoints/upn/upn_large.pth https://drive.google.com/file/d/
# download ChatRex checkpoint from huggingface IDEA-Research/ChatRex-7B
# Download ChatRex checkpoint from Hugging Face
git lfs install
git clone https://huggingface.co/IDEA-Research/ChatRex-7B checkpoints/chatrex
```

## 2.2 Verify Installation
To verify the ***installation of the Universal Proposal Network (UPN)***, run the following command:
```bash
python tests/test_upn_install.py
```

If the installation is successful, you will get two visualization images of both fine-grained proposal and coarse-grained proposal in `tests` folder.

To verify the ***installation of the ChatRex model***, run the following command:
```bash
python tests/test_chatrex_install.py
```

If the installation is successful, you will get an output like this:
```text
prediction: <obj0> shows a brown dog lying on a bed. The dog is resting comfortably, possibly sleeping, and is positioned on the left side of the bed
```

# 3. Usage 🚀
## 3.1 Use UPN for Object Proposal Generation

Universal Proposal Network (UPN) is a robust object proposal model designed as part of ChatRex to enable comprehensive and accurate object detection across diverse granularities and domains. Built upon T-Rex2, UPN is a DETR-based model with a dual-granularity prompt tuning strategy, combining fine-grained (e.g., part-level) and coarse-grained (e.g., instance-level) detection.

<div align=center>
<img src="assets/upn_res.jpg" width=600 >
</div>

----

<details close>
<summary><strong>Example Code for UPN</strong></summary>

```python
import torch
from PIL import Image
from tools.visualize import plot_boxes_to_image
from chatrex.upn import UPNWrapper

ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth"
test_image_path = "tests/images/test_upn.jpeg"

model = UPNWrapper(ckpt_path)
# fine-grained prompt
fine_grained_proposals = model.inference(
test_image_path, prompt_type="fine_grained_prompt"
)
# filter by score (default: 0.3) and nms (default: 0.8)
fine_grained_filtered_proposals = model.filter(
fine_grained_proposals, min_score=0.3, nms_value=0.8
)
## output is a dict with keys: "original_xyxy_boxes", "scores"
## - "original_xyxy_boxes": list of boxes in xyxy format in shape (B, N, 4)
## - "scores": list of scores for each box in shape (B, N)

# coarse-grained prompt
coarse_grained_proposals = model.inference(
test_image_path, prompt_type="coarse_grained_prompt"
)
coarse_grained_filtered_proposals = model.filter(
coarse_grained_proposals, min_score=0.3, nms_value=0.8
)

## output is a dict with keys: "original_xyxy_boxes", "scores"
## - "original_xyxy_boxes": list of boxes in xyxy format in shape (B, N, 4)
## - "scores": list of scores for each box in shape (B, N)
```

</details>

We also provide a visualization tool to visualize the object proposals generated by UPN. You can use the following code to visualize the object proposals:

<details close>
<summary><strong>Example Code for UPN Visualization</strong></summary>

```python

from chatrex.tools.visualize import plot_boxes_to_image
image = Image.open(test_image_path)
fine_grained_vis_image, _ = plot_boxes_to_image(
image.copy(),
fine_grained_filtered_proposals["original_xyxy_boxes"][0],
fine_grained_filtered_proposals["scores"][0],
)
fine_grained_vis_image.save("tests/test_image_fine_grained.jpeg")
print(f"fine-grained proposal is saved at tests/test_image_fine_grained.jpeg")

coarse_grained_vis_image, _ = plot_boxes_to_image(
image.copy(),
coarse_grained_filtered_proposals["original_xyxy_boxes"][0],
coarse_grained_filtered_proposals["scores"][0],
)
coarse_grained_vis_image.save("tests/test_image_coarse_grained.jpeg")
print(f"coarse-grained proposal is saved at tests/test_image_coarse_grained.jpeg")

```
</details>

## 3.2 Usage of ChatRex

ChatRex takes three inputs: image, text prompt, and box input. For the box input, you can either use the object proposals generated by UPN or provide your own box input (user drawn boxes). We have wrapped the ChatRex model to huggingface transformers format for easy usage. ChatRex can be used for various tasks and we provide example code for each task below.

### 3.2.1 ChatRex for Object Detection & Grounding & Referring

Example Prompt for detection, grounding, referring tasks:
```text
# Single Object Detection
Please detect dog in this image. Answer the question with object indexes.
Please detect the man in yellow shirt in this image. Answer the question with object indexes.

# multiple object detection, use ; to separate the objects
Please detect person; pigeon in this image. Answer the question with object indexes.
Please detect person in the car; cat below the table in this image. Answer the question with object indexes.
```

<details close>
<summary><strong>Example Code</strong></summary>

- [Example Code in python file](tests/test_chatrex_detection.py)

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

from chatrex.tools.visualize import visualize_chatrex_output
from chatrex.upn import UPNWrapper

if __name__ == "__main__":
# load the processor
processor = AutoProcessor.from_pretrained(
"checkpoints/chatrex7b",
trust_remote_code=True,
device_map="cuda",
)

print(f"loading chatrex model...")
# load chatrex model
model = AutoModelForCausalLM.from_pretrained(
"checkpoints/chatrex7b",
trust_remote_code=True,
use_safetensors=True,
).to("cuda")

# load upn model
print(f"loading upn model...")
ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth"
model_upn = UPNWrapper(ckpt_path)
test_image_path = "tests/images/test_chatrex_detection.jpg"

# get upn predictions
fine_grained_proposals = model_upn.inference(
test_image_path, prompt_type="fine_grained_prompt"
)
fine_grained_filtered_proposals = model_upn.filter(
fine_grained_proposals, min_score=0.3, nms_value=0.8
)

inputs = processor.process(
image=Image.open(test_image_path),
question="Please detect person; pigeon in this image. Answer the question with object indexes.",
bbox=fine_grained_filtered_proposals["original_xyxy_boxes"][
0
], # box in xyxy format
)

inputs = {k: v.to("cuda") for k, v in inputs.items()}

# perform inference
gen_config = GenerationConfig(
max_new_tokens=512,
do_sample=False,
eos_token_id=processor.tokenizer.eos_token_id,
pad_token_id=(
processor.tokenizer.pad_token_id
if processor.tokenizer.pad_token_id is not None
else processor.tokenizer.eos_token_id
),
)
with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
prediction = model.generate(
inputs, gen_config=gen_config, tokenizer=processor.tokenizer
)
print(f"prediction:", prediction)

# visualize the prediction
vis_image = visualize_chatrex_output(
Image.open(test_image_path),
fine_grained_filtered_proposals["original_xyxy_boxes"][0],
prediction,
font_size=15,
draw_width=5,
)
vis_image.save("tests/test_chatrex_detection.jpeg")
print(f"prediction is saved at tests/test_chatrex_detection.jpeg")
```

The output from LLM is like:
```text
<ground>person</ground><objects><obj10><obj14><obj15><obj27><obj28><obj32><obj33><obj35><obj38><obj47><obj50></objects>
<ground>pigeon</ground><objects><obj0><obj1><obj2><obj3><obj4><obj5><obj6><obj7><obj8><obj9><obj11><obj12><obj13><obj16><obj17><obj18><obj19><obj20><obj21><obj22><obj23><obj24><obj25><obj26><obj29><obj31><obj37><obj39><obj40><obj41><obj44><obj49></objects>
```

The visualization of the output is like:

<div align=center>
<img src="assets/vis_output/test_chatrex_detection.jpeg" width=600 >
</div>

</details>

----

### 3.2.2 ChatRex for Region Caption
Example Prompt for Region Caption tasks:

```text
# Single Object Detection
## caption in category name
What is the category name of <obji>? Answer the question with its category name in free format.

## caption in short phrase
Can you provide me with a short phrase to describe <obji>? Answer the question with a short phrase.

## caption in referring style
Can you provide me with a brief description of <obji>? Answer the question with brief description.

## caption in one sentence
Can you provide me with

Files changed (1) hide show

README.md +13 -3

README.md CHANGED Viewed

@@ -1,3 +1,13 @@
----
-{}
----

+---
+pipeline_tag: image-text-to-text
+language:
+- en
+base_model:
+- lmsys/vicuna-7b-v1.5
+- openai/clip-vit-large-patch14
+- laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg
+tags:
+- chatrex
+---
+# title here
+test