--- language: - en base_model: - lmsys/vicuna-7b-v1.5 - openai/clip-vit-large-patch14 - laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg pipeline_tag: image-text-to-text tags: - chatrex - upn --- arxiv.org/abs/2411.18363
---- # 1. Introduction 📚 **TL;DR: ChatRex is an MLLM skilled in perception that can respond to questions while simultaneously grounding its answers to the referenced objects.** ChatRex is a Multimodal Large Language Model (MLLM) designed to seamlessly integrate fine-grained object perception and robust language understanding. By adopting a decoupled architecture with a retrieval-based approach for object detection and leveraging high-resolution visual inputs, ChatRex addresses key challenges in perception tasks. It is powered by the Rexverse-2M dataset with diverse image-region-text annotations. ChatRex can be applied to various scenarios requiring fine-grained perception, such as object detection, grounded conversation, grounded image captioning and region understanding.
---- # 2. Installation 🛠️ ```bash conda install -n chatrex python=3.9 pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121 git clone https://github.com/IDEA-Research/ChatRex.git cd ChatRex pip install -v -e . # install deformable attention for universal proposal network cd chatrex/upn/ops pip install -v -e . ``` ## 2.1 Download Pre-trained UPN Models We provide model checkpoints for both the ***Universal Proposal Network (UPN)*** and the ***ChatRex model***. You can download the pre-trained models from the following links: - [UPN Checkpoint](https://github.com/IDEA-Research/ChatRex/releases/download/upn-large/upn_large.pth) - [ChatRex-7B Checkpoint](https://huggingface.co/IDEA-Research/ChatRex-7B) Or you can also using the following command to download the pre-trained models: ```bash mkdir checkpoints mkdir checkpoints/upn # download UPN checkpoint wget -O checkpoints/upn/upn_large.pth https://github.com/IDEA-Research/ChatRex/releases/download/upn-large/upn_large.pth ``` ## 2.2 Verify Installation To verify the ***installation of the Universal Proposal Network (UPN)***, run the following command: ```bash python tests/test_upn_install.py ``` If the installation is successful, you will get two visualization images of both fine-grained proposal and coarse-grained proposal in `tests` folder. To verify the ***installation of the ChatRex model***, run the following command: ```bash python tests/test_chatrex_install.py ``` If the installation is successful, you will get an output like this: ```text prediction: shows a brown dog lying on a bed. The dog is resting comfortably, possibly sleeping, and is positioned on the left side of the bed ``` # 3. Usage 🚀 ## 3.1 Use UPN for Object Proposal Generation Universal Proposal Network (UPN) is a robust object proposal model designed as part of ChatRex to enable comprehensive and accurate object detection across diverse granularities and domains. Built upon T-Rex2, UPN is a DETR-based model with a dual-granularity prompt tuning strategy, combining fine-grained (e.g., part-level) and coarse-grained (e.g., instance-level) detection.
----
Example Code for UPN ```python import torch from PIL import Image from tools.visualize import plot_boxes_to_image from chatrex.upn import UPNWrapper ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth" test_image_path = "tests/images/test_upn.jpeg" model = UPNWrapper(ckpt_path) # fine-grained prompt fine_grained_proposals = model.inference( test_image_path, prompt_type="fine_grained_prompt" ) # filter by score (default: 0.3) and nms (default: 0.8) fine_grained_filtered_proposals = model.filter( fine_grained_proposals, min_score=0.3, nms_value=0.8 ) ## output is a dict with keys: "original_xyxy_boxes", "scores" ## - "original_xyxy_boxes": list of boxes in xyxy format in shape (B, N, 4) ## - "scores": list of scores for each box in shape (B, N) # coarse-grained prompt coarse_grained_proposals = model.inference( test_image_path, prompt_type="coarse_grained_prompt" ) coarse_grained_filtered_proposals = model.filter( coarse_grained_proposals, min_score=0.3, nms_value=0.8 ) ## output is a dict with keys: "original_xyxy_boxes", "scores" ## - "original_xyxy_boxes": list of boxes in xyxy format in shape (B, N, 4) ## - "scores": list of scores for each box in shape (B, N) ```
We also provide a visualization tool to visualize the object proposals generated by UPN. You can use the following code to visualize the object proposals:
Example Code for UPN Visualization ```python from chatrex.tools.visualize import plot_boxes_to_image image = Image.open(test_image_path) fine_grained_vis_image, _ = plot_boxes_to_image( image.copy(), fine_grained_filtered_proposals["original_xyxy_boxes"][0], fine_grained_filtered_proposals["scores"][0], ) fine_grained_vis_image.save("tests/test_image_fine_grained.jpeg") print(f"fine-grained proposal is saved at tests/test_image_fine_grained.jpeg") coarse_grained_vis_image, _ = plot_boxes_to_image( image.copy(), coarse_grained_filtered_proposals["original_xyxy_boxes"][0], coarse_grained_filtered_proposals["scores"][0], ) coarse_grained_vis_image.save("tests/test_image_coarse_grained.jpeg") print(f"coarse-grained proposal is saved at tests/test_image_coarse_grained.jpeg") ```
## 3.2 Usage of ChatRex ChatRex takes three inputs: image, text prompt, and box input. For the box input, you can either use the object proposals generated by UPN or provide your own box input (user drawn boxes). We have wrapped the ChatRex model to huggingface transformers format for easy usage. ChatRex can be used for various tasks and we provide example code for each task below. ### 3.2.1 ChatRex for Object Detection & Grounding & Referring Example Prompt for detection, grounding, referring tasks: ```text # Single Object Detection Please detect dog in this image. Answer the question with object indexes. Please detect the man in yellow shirt in this image. Answer the question with object indexes. # multiple object detection, use ; to separate the objects Please detect person; pigeon in this image. Answer the question with object indexes. Please detect person in the car; cat below the table in this image. Answer the question with object indexes. ```
Example Code ```python import torch from PIL import Image from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig from chatrex.tools.visualize import visualize_chatrex_output from chatrex.upn import UPNWrapper if __name__ == "__main__": # load the processor processor = AutoProcessor.from_pretrained( "IDEA-Research/ChatRex-7B", trust_remote_code=True, device_map="cuda", ) print(f"loading chatrex model...") # load chatrex model model = AutoModelForCausalLM.from_pretrained( "IDEA-Research/ChatRex-7B", trust_remote_code=True, use_safetensors=True, ).to("cuda") # load upn model print(f"loading upn model...") ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth" model_upn = UPNWrapper(ckpt_path) test_image_path = "tests/images/test_chatrex_detection.jpg" # get upn predictions fine_grained_proposals = model_upn.inference( test_image_path, prompt_type="fine_grained_prompt" ) fine_grained_filtered_proposals = model_upn.filter( fine_grained_proposals, min_score=0.3, nms_value=0.8 ) inputs = processor.process( image=Image.open(test_image_path), question="Please detect person; pigeon in this image. Answer the question with object indexes.", bbox=fine_grained_filtered_proposals["original_xyxy_boxes"][ 0 ], # box in xyxy format ) inputs = {k: v.to("cuda") for k, v in inputs.items()} # perform inference gen_config = GenerationConfig( max_new_tokens=512, do_sample=False, eos_token_id=processor.tokenizer.eos_token_id, pad_token_id=( processor.tokenizer.pad_token_id if processor.tokenizer.pad_token_id is not None else processor.tokenizer.eos_token_id ), ) with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16): prediction = model.generate( inputs, gen_config=gen_config, tokenizer=processor.tokenizer ) print(f"prediction:", prediction) # visualize the prediction vis_image = visualize_chatrex_output( Image.open(test_image_path), fine_grained_filtered_proposals["original_xyxy_boxes"][0], prediction, font_size=15, draw_width=5, ) vis_image.save("tests/test_chatrex_detection.jpeg") print(f"prediction is saved at tests/test_chatrex_detection.jpeg") ``` The output from LLM is like: ```text person pigeon ``` The visualization of the output is like:
---- ### 3.2.2 ChatRex for Region Caption Example Prompt for Region Caption tasks: ```text # Single Object Detection ## caption in category name What is the category name of ? Answer the question with its category name in free format. ## caption in short phrase Can you provide me with a short phrase to describe ? Answer the question with a short phrase. ## caption in referring style Can you provide me with a brief description of ? Answer the question with brief description. ## caption in one sentence Can you provide me with a one sentence of ? Answer the question with one sentence description. # multiple object detection, use ; to separate the objects ```
Example Code ```python import torch from PIL import Image from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig from chatrex.tools.visualize import visualize_chatrex_output from chatrex.upn import UPNWrapper if __name__ == "__main__": # load the processor processor = AutoProcessor.from_pretrained( "IDEA-Research/ChatRex-7B", trust_remote_code=True, device_map="cuda", ) print(f"loading chatrex model...") # load chatrex model model = AutoModelForCausalLM.from_pretrained( "IDEA-Research/ChatRex-7B", trust_remote_code=True, use_safetensors=True, ).to("cuda") test_image_path = "tests/images/test_chatrex_install.jpg" inputs = processor.process( image=Image.open(test_image_path), question="Can you provide a one sentence description of in the image? Answer the question with a one sentence description.", bbox=[[73.88417, 56.62228, 227.69223, 216.34338]], ) inputs = {k: v.to("cuda") for k, v in inputs.items()} # perform inference gen_config = GenerationConfig( max_new_tokens=512, do_sample=False, eos_token_id=processor.tokenizer.eos_token_id, pad_token_id=( processor.tokenizer.pad_token_id if processor.tokenizer.pad_token_id is not None else processor.tokenizer.eos_token_id ), ) with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16): prediction = model.generate( inputs, gen_config=gen_config, tokenizer=processor.tokenizer ) print(f"prediction:", prediction) # visualize the prediction vis_image = visualize_chatrex_output( Image.open(test_image_path), [[73.88417, 56.62228, 227.69223, 216.34338]], prediction, font_size=15, draw_width=5, ) vis_image.save("tests/test_chatrex_region_caption.jpeg") print(f"prediction is saved at tests/test_chatrex_region_caption.jpeg") ``` The output from LLM is like: ```text A brown dog is lying on a bed, appearing relaxed and comfortable ``` The visualization of the output is like:
---- ### 3.2.3 ChatRex for Grounded Image Captioning Example Prompt for Region Caption tasks: ```text # Brief Grounded Imager Caption Please breifly describe this image in one sentence and detect all the mentioned objects. Answer the question with grounded answer. # Detailed Grounded Image Caption Please provide a detailed description of the image and detect all the mentioned objects. Answer the question with grounded object indexes. ```
Example Code ```python import torch from PIL import Image from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig from chatrex.tools.visualize import visualize_chatrex_output from chatrex.upn import UPNWrapper if __name__ == "__main__": # load the processor processor = AutoProcessor.from_pretrained( "IDEA-Research/ChatRex-7B", trust_remote_code=True, device_map="cuda", ) print(f"loading chatrex model...") # load chatrex model model = AutoModelForCausalLM.from_pretrained( "IDEA-Research/ChatRex-7B", trust_remote_code=True, use_safetensors=True, ).to("cuda") # load upn model print(f"loading upn model...") ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth" model_upn = UPNWrapper(ckpt_path) test_image_path = "tests/images/test_chatrex_grounded_caption.jpg" # get upn predictions fine_grained_proposals = model_upn.inference( test_image_path, prompt_type="fine_grained_prompt" ) fine_grained_filtered_proposals = model_upn.filter( fine_grained_proposals, min_score=0.3, nms_value=0.8 ) inputs = processor.process( image=Image.open(test_image_path), question="Please breifly describe this image in one sentence and detect all the mentioned objects. Answer the question with grounded answer.", bbox=fine_grained_filtered_proposals["original_xyxy_boxes"][ 0 ], # box in xyxy format ) inputs = {k: v.to("cuda") for k, v in inputs.items()} # perform inference gen_config = GenerationConfig( max_new_tokens=512, do_sample=False, eos_token_id=processor.tokenizer.eos_token_id, pad_token_id=( processor.tokenizer.pad_token_id if processor.tokenizer.pad_token_id is not None else processor.tokenizer.eos_token_id ), ) with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16): prediction = model.generate( inputs, gen_config=gen_config, tokenizer=processor.tokenizer ) print(f"prediction:", prediction) # visualize the prediction vis_image = visualize_chatrex_output( Image.open(test_image_path), fine_grained_filtered_proposals["original_xyxy_boxes"][0], prediction, font_size=15, draw_width=5, ) vis_image.save("tests/test_chatrex_grounded_image_caption.jpeg") print(f"prediction is saved at tests/test_chatrex_grounded_image_caption.jpeg") ``` The output from LLM is like: ```text The image depicts a cozy living room with a plaid couch, a wooden TV standholding a black television, a red armchair, and a whiteboardwith writing on the wall, accompanied by a framed posterof a couple. ``` The visualization of the output is like:
---- ### 3.2.4 ChatRex for Grounded Conversation Example Prompt for Region Caption tasks: ```text Answer the question in Grounded format. Question ```
Example Code ```python import torch from PIL import Image from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig from chatrex.tools.visualize import visualize_chatrex_output from chatrex.upn import UPNWrapper if __name__ == "__main__": # load the processor processor = AutoProcessor.from_pretrained( "IDEA-Research/ChatRex-7B", trust_remote_code=True, device_map="cuda", ) print(f"loading chatrex model...") # load chatrex model model = AutoModelForCausalLM.from_pretrained( "IDEA-Research/ChatRex-7B", trust_remote_code=True, use_safetensors=True, ).to("cuda") # load upn model print(f"loading upn model...") ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth" model_upn = UPNWrapper(ckpt_path) test_image_path = "tests/images/test_grounded_conversation.jpg" # get upn predictions fine_grained_proposals = model_upn.inference( test_image_path, prompt_type="coarse_grained_prompt" ) fine_grained_filtered_proposals = model_upn.filter( fine_grained_proposals, min_score=0.3, nms_value=0.8 ) inputs = processor.process( image=Image.open(test_image_path), question="Answer the question in grounded format. This is a photo of my room, and can you tell me what kind of person I am? ", bbox=fine_grained_filtered_proposals["original_xyxy_boxes"][ 0 ], # box in xyxy format ) inputs = {k: v.to("cuda") for k, v in inputs.items()} # perform inference gen_config = GenerationConfig( max_new_tokens=512, do_sample=False, eos_token_id=processor.tokenizer.eos_token_id, pad_token_id=( processor.tokenizer.pad_token_id if processor.tokenizer.pad_token_id is not None else processor.tokenizer.eos_token_id ), ) with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16): prediction = model.generate( inputs, gen_config=gen_config, tokenizer=processor.tokenizer ) print(f"prediction:", prediction) # visualize the prediction vis_image = visualize_chatrex_output( Image.open(test_image_path), fine_grained_filtered_proposals["original_xyxy_boxes"][0], prediction, font_size=30, draw_width=10, ) vis_image.save("tests/test_chatrex_grounded_conversation.jpeg") print(f"prediction is saved at tests/test_chatrex_grounded_conversation.jpeg") ``` The output from LLM is like: ```text Based on the items in the image, it can be inferred that the person who owns this room has an interest in fitness and possibly enjoys reading. The presence of the dumbbell suggests a commitment to physical activity, while the book indicates a liking for literature or reading. The sneakers and the plush toy add a personal touch, suggesting that the person might also value comfort and perhaps has a playful or nostalgic side. However, without more context, it is not possible to accurately determine the individual's specific traits or personality. ``` The visualization of the output is like:
---- # 5. LICENSE ChatRex is licensed under the IDEA License 1.0, Copyright (c) IDEA. All Rights Reserved. Note that this project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses including but not limited to the: - [OpenAI Terms of Use](https://openai.com/policies/terms-of-use) for the dataset. - For the LLM used in this project, the model is [lmsys/vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5/tree/main), which is licensed under [Llama 2 Community License Agreement](https://huggingface.co/lmsys/vicuna-7b-v1.5). - For the high resolution vision encoder, we are using [laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg](https://huggingface.co/laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg) which is licensed under [MIT LICENSE](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md). - For the low resolution vision encoder, we are using [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) which is licensed under [MIT LICENSE](https://github.com/openai/CLIP/blob/main/LICENSE) # BibTeX 📚 ``` @misc{jiang2024trex2, title={T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy}, author={Qing Jiang and Feng Li and Zhaoyang Zeng and Tianhe Ren and Shilong Liu and Lei Zhang}, year={2024}, eprint={2403.14610}, archivePrefix={arXiv}, primaryClass={cs.CV} } ```