
Contents
- 1. Introduction
- 2. Installation
- 3. Usage
- 4. Gradio Demos
- 5. HumanRef Benchmark
- 6. LICENSE
- BibTeX
1. Introduction
RexSeek is a Multimodal Large Language Model (MLLM) designed to detect people or objects in images based on natural language descriptions. Unlike traditional referring models that focus on single-instance detection, RexSeek excels at multi-instance referring tasks - identifying multiple people or objects that match a given description.
Key Features
- Multi-Instance Detection: Can identify multiple matching instances in a single image
- Robust Perception: Powered by state-of-the-art person detection models
- Strong Language Understanding: Leverages advanced LLM capabilities for complex description comprehension
The HumanRef Benchmark
We also introduce the HumanRef Benchmark, a comprehensive benchmark for human-centric referring tasks containing:
- 6000 referring expressions
- Average of 2.2 instances per expression
- Covers 6 key aspects of human referring:
  - Attributes (gender, age, clothing, etc.)
  - Position (spatial relationships)
  - Interaction (human-to-human, human-to-object)
  - Reasoning (multi-step inference)
  - Celebrity Recognition
  - Rejection (hallucination detection)
2. Installation
conda create -n rexseek python=3.9
conda activate rexseek
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121
pip install -v -e .
2.1 Download Pre-trained Models
We provide model checkpoints for RexSeek-3B. You can download the pre-trained models from the following links:
Alternatively, use the following commands to download the pre-trained model:
# Download the RexSeek-3B checkpoint from Hugging Face
git lfs install
git clone https://huggingface.co/IDEA-Research/RexSeek-3B IDEA-Research/RexSeek-3B
2.2 Verify Installation
To verify the installation, run the following command:
python tests/test_local_load.py
If the installation is successful, a visualization image is saved in the tests/images folder.
3. Usage
3.1 Model Architecture

TL;DR: RexSeek needs a detection model to propose object boxes first, and then uses the LLM to pick the objects that match the description.
RexSeek consists of three key components:
- Vision Encoders: Dual-resolution feature extraction (CLIP + ConvNeXt)
- Person Detector: DINO-X for generating high-quality object proposals
- Language Model: Qwen2.5 for understanding complex referring expressions
Inputs:
- Image: The source image containing people/objects
- Text: Natural language description of target objects
- Boxes: Object proposals from DINO-X detector (can be replaced with custom boxes)
Outputs:
- Object indices corresponding to the referring expression in format:
<ground>referring text</ground><objects><obj1><obj2>...</objects>
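The output format above can be decoded with a small parser. The following is a minimal sketch, not the project's own post-processing code: the function name `parse_rexseek_output` is hypothetical, and it assumes each `<objN>` token carries the integer index N of a box in the input proposal list. Whether indexing starts at 0 or 1 is not stated here, so the integers are returned unchanged.

```python
import re

# Matches one "<ground>...</ground><objects>...</objects>" group.
GROUP_RE = re.compile(r"<ground>(.*?)</ground>\s*<objects>(.*?)</objects>", re.DOTALL)
# Matches a single object token such as "<obj3>".
OBJ_RE = re.compile(r"<obj(\d+)>")


def parse_rexseek_output(text):
    """Return a list of (referring_text, [box indices]) pairs (hypothetical helper)."""
    results = []
    for phrase, objects in GROUP_RE.findall(text):
        indices = [int(n) for n in OBJ_RE.findall(objects)]
        results.append((phrase.strip(), indices))
    return results


answer = "<ground>person in red</ground><objects><obj0><obj3></objects>"
print(parse_rexseek_output(answer))
```

The returned indices can then be used to look up the corresponding boxes in the proposal list that was passed to the model.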
3.2 Combine RexSeek with GroundingDINO
In this example, we will use GroundingDINO to generate object proposals, and then use RexSeek to detect the objects.
3.2.1 Install GroundingDINO
cd demos/
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
pip install -v -e .
mkdir weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth -P weights
cd ../../
3.2.2 Run the Demo
python demos/rexseek_grounding_dino.py \
--image demos/demo_images/demo1.jpg \
--output demos/demo_images/demo1_result.jpg \
--referring "person that is giving a proposal" \
--objects "person" \
--text-threshold 0.25 \
--box-threshold 0.25
3.3 Combine RexSeek with GroundingDINO and spaCy
In the previous example, we had to explicitly specify object categories (like "person") for GroundingDINO to detect. We can automate this step by using spaCy to extract nouns from the question and use them as detection targets.
3.3.1 Install Dependencies
pip install spacy
python -m spacy download en_core_web_sm
3.3.2 Run the Demo
python demos/rexseek_grounding_dino_spacy.py \
--image demos/demo_images/demo1.jpg \
--output demos/demo_images/demo1_result.jpg \
--referring "person that is giving a proposal" \
--text-threshold 0.25 \
--box-threshold 0.25
In this enhanced version:
- No need to specify the --objects parameter: spaCy automatically extracts nouns (e.g., "people", "shirts", "dogs", "park") from the question
- GroundingDINO uses these extracted nouns as detection targets
- More flexible and natural interaction through questions
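The noun-extraction step can be sketched as follows. This is an illustrative sketch rather than the demo script's actual code: the helper names `extract_nouns` and `build_detection_prompt` are hypothetical, and keeping tokens tagged `NOUN` or `PROPN` is an assumption about what counts as a detection target. The " . " separator follows the category prompt convention described in the GroundingDINO README.

```python
def extract_nouns(question, model_name="en_core_web_sm"):
    """Return lower-cased noun lemmas found in the question, in order."""
    import spacy  # imported lazily so the helper below works without spaCy

    nlp = spacy.load(model_name)
    return [tok.lemma_.lower() for tok in nlp(question) if tok.pos_ in ("NOUN", "PROPN")]


def build_detection_prompt(nouns):
    """Deduplicate nouns and join them as 'a . b . c .' for GroundingDINO."""
    unique = list(dict.fromkeys(n.strip().lower() for n in nouns if n.strip()))
    return " . ".join(unique) + " ." if unique else ""


# Example with a hand-written noun list (no spaCy model needed):
print(build_detection_prompt(["person", "Dog", "person"]))
```

With spaCy and `en_core_web_sm` installed, `build_detection_prompt(extract_nouns("person that is giving a proposal"))` would produce the text prompt in one step.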
3.4 Combine RexSeek with GroundingDINO, spaCy and SAM
In this example, we use spaCy to extract nouns from the question as detection targets, GroundingDINO to generate object proposals for them, RexSeek to select the referred objects, and finally SAM to segment those objects.
3.4.1 Install Dependencies
cd demos/
git clone https://github.com/IDEA-Research/SAM.git
cd SAM
pip install -v -e .
mkdir weights
wget -q https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth -P weights
cd ../../
3.4.2 Run the Demo
python demos/rexseek_grounding_dino_spacy_sam.py \
--image demos/demo_images/demo1.jpg \
--output demos/demo_images/demo1_result.jpg \
--referring "person that is giving a proposal" \
--text-threshold 0.25 \
--box-threshold 0.25
4. Gradio Demos
4.1 Gradio Demo for RexSeek + GroundingDINO + SAM
We provide a Gradio demo for RexSeek + GroundingDINO + SAM. Run the following command to start it:
python demos/gradio_demo.py \
--rexseek-path "IDEA-Research/RexSeek-3B" \
--gdino-config "demos/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py" \
--gdino-weights "demos/GroundingDINO/weights/groundingdino_swint_ogc.pth" \
--sam-weights "demos/segment-anything/weights/sam_vit_h_4b8939.pth"

5. HumanRef Benchmark

HumanRef is a large-scale human-centric referring expression dataset designed for multi-instance human referring in natural scenes. Unlike traditional referring datasets that focus on one-to-one object referring, HumanRef supports referring to multiple individuals simultaneously through natural language descriptions.
Key features of HumanRef include:
- Multi-Instance Referring: A single referring expression can correspond to multiple individuals, better reflecting real-world scenarios
- Diverse Referring Types: Covers 6 major types of referring expressions:
  - Attribute-based (e.g., gender, age, clothing)
  - Position-based (relative positions between humans or with environment)
  - Interaction-based (human-human or human-environment interactions)
  - Reasoning-based (complex logical combinations)
  - Celebrity Recognition
  - Rejection Cases (non-existent references)
- High-Quality Data:
  - 34,806 high-resolution images (>1000×1000 pixels)
  - 103,028 referring expressions in training set
  - 6,000 carefully curated expressions in benchmark set
  - Average 8.6 persons per image
  - Average 2.2 target boxes per referring expression
The dataset aims to advance research in human-centric visual understanding and referring expression comprehension in complex, multi-person scenarios.
5.1 Download
You can download the HumanRef Benchmark at https://huggingface.co/datasets/IDEA-Research/HumanRef.
5.2 Visualization
The HumanRef Benchmark contains 6 domains, and each domain may have multiple sub-domains.
Domain | Subdomain | Num Referrings |
---|---|---|
attribute | 1000_attribute_retranslated_with_mask | 1000 |
position | 500_inner_position_data_with_mask | 500 |
position | 500_outer_position_data_with_mask | 500 |
celebrity | 1000_celebrity_data_with_mask | 1000 |
interaction | 500_inner_interaction_data_with_mask | 500 |
interaction | 500_outer_interaction_data_with_mask | 500 |
reasoning | 229_outer_position_two_stage_with_mask | 229 |
reasoning | 271_positive_then_negative_reasoning_with_mask | 271 |
reasoning | 500_inner_position_two_stage_with_mask | 500 |
rejection | 1000_rejection_referring_with_mask | 1000 |
To visualize the dataset, you can run the following command:
# --domain_anme options: attribute, position, interaction, reasoning, celebrity, rejection
# --sub_domain_anme options: 1000_attribute_retranslated_with_mask, 500_inner_position_data_with_mask, 500_outer_position_data_with_mask, 1000_celebrity_data_with_mask, 500_inner_interaction_data_with_mask, 500_outer_interaction_data_with_mask, 229_outer_position_two_stage_with_mask, 271_positive_then_negative_reasoning_with_mask, 500_inner_position_two_stage_with_mask, 1000_rejection_referring_with_mask
# --vis_mask options: True, False
python rexseek/tools/visualize_humanref.py \
--anno_path "IDEA-Research/HumanRef/annotations.jsonl" \
--image_root_dir "IDEA-Research/HumanRef/images" \
--domain_anme "attribute" \
--sub_domain_anme "1000_attribute_retranslated_with_mask" \
--vis_path "IDEA-Research/HumanRef/visualize" \
--num_images 50 \
--vis_mask True
5.3 Evaluation
5.3.1 Metrics
We evaluate the referring task using three main metrics: Precision, Recall, and DensityF1 Score.
Basic Metrics
Precision & Recall: For each referring expression, a predicted bounding box is considered correct if its IoU with any ground truth box exceeds a threshold. Following COCO evaluation protocol, we report average performance across IoU thresholds from 0.5 to 0.95 in steps of 0.05.
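The IoU-based protocol can be illustrated with a short sketch. It is not the repository's evaluation script: the function names are hypothetical, and the recall rule (a ground-truth box counts as found if any prediction overlaps it above the threshold) is an assumption, since the text only spells out when a prediction is correct.

```python
def box_iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(pred_boxes, gt_boxes, thresholds=None):
    """Average precision and recall over IoU thresholds 0.50, 0.55, ..., 0.95."""
    if thresholds is None:
        thresholds = [0.5 + 0.05 * k for k in range(10)]
    precisions, recalls = [], []
    for t in thresholds:
        # A prediction is correct if it overlaps any ground-truth box above t.
        correct = sum(any(box_iou(p, g) > t for g in gt_boxes) for p in pred_boxes)
        # Assumption: a ground-truth box is found if any prediction overlaps it above t.
        found = sum(any(box_iou(p, g) > t for p in pred_boxes) for g in gt_boxes)
        precisions.append(correct / len(pred_boxes) if pred_boxes else 0.0)
        recalls.append(found / len(gt_boxes) if gt_boxes else 0.0)
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)
```

For example, one exact match plus one stray box against a single ground-truth box gives a precision of 0.5 and a recall of 1.0 at every threshold.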
Point-based Evaluation: For models that only output points (e.g., Molmo), a prediction is considered correct if the predicted point falls within the mask of the corresponding instance. Note that this is less strict than IoU-based metrics.
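A point-in-mask check might look like the sketch below. It assumes the instance mask has already been decoded into a 2D grid of 0/1 values indexed as `mask[y][x]`; how masks are stored in the annotation file is not covered here, and the function name `point_in_mask` is hypothetical.

```python
import math


def point_in_mask(point, mask):
    """True if (x, y) lands on a foreground pixel of a binary mask (rows of 0/1)."""
    # Pixel (col, row) is taken to cover the half-open square [col, col+1) x [row, row+1).
    x, y = math.floor(point[0]), math.floor(point[1])
    if y < 0 or y >= len(mask) or x < 0 or x >= len(mask[y]):
        return False  # points outside the image never hit the mask
    return bool(mask[y][x])


mask = [
    [0, 0, 0],
    [0, 1, 1],
    [0, 1, 1],
]
print(point_in_mask((1.5, 2.2), mask))
```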
Rejection Accuracy: For the rejection subset, we calculate:
Rejection Accuracy = Number of correctly rejected expressions / Total number of expressions
where a correct rejection means the model predicts no boxes for a non-existent reference.
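As a small illustration of this ratio, the sketch below counts a prediction as a rejection when it contains no boxes. Treating both `[]` and `[[]]` as "no boxes" mirrors the prediction format described in the evaluation script section; the function names are hypothetical and this is not the repository's implementation.

```python
def is_empty_prediction(boxes):
    """Treat [] and [[]] (a list holding only empty boxes) as 'no detection'."""
    return all(len(box) == 0 for box in boxes)


def rejection_accuracy(predictions):
    """predictions: one list of boxes per expression in the rejection subset."""
    if not predictions:
        return 0.0
    rejected = sum(is_empty_prediction(boxes) for boxes in predictions)
    return rejected / len(predictions)


# Three of the four expressions are correctly rejected.
print(rejection_accuracy([[], [[]], [[1, 2, 3, 4]], []]))
```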
DensityF1 Score
To penalize over-detection (predicting too many boxes), we introduce the DensityF1 Score:
DensityF1 = (1/N) * Σ_{i=1}^{N} [2 * (Precision_i * Recall_i) / (Precision_i + Recall_i) * D_i]
where D_i is the density penalty factor:
D_i = min(1.0, GT_Count_i / Predicted_Count_i)
where:
- N is the number of referring expressions
- GT_Count_i is the total number of persons in the image associated with referring expression i
- Predicted_Count_i is the number of predicted boxes for referring expression i
This penalty factor reduces the score when models predict significantly more boxes than the actual number of people in the image, discouraging over-detection strategies.
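The formula can be worked through in a few lines of Python. This is a sketch of the definition above, not the repository's `recall_precision_densityf1` implementation; the dictionary keys are hypothetical, and two edge cases are assumptions: F1 is taken as 0 when precision and recall are both 0, and D_i is taken as 1 when no boxes are predicted.

```python
def density_f1(samples):
    """samples: dicts with precision, recall, gt_count, and pred_count per expression."""
    if not samples:
        return 0.0
    total = 0.0
    for s in samples:
        p, r = s["precision"], s["recall"]
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        # Assumption: with zero predicted boxes no penalty is applied (D_i = 1).
        d = min(1.0, s["gt_count"] / s["pred_count"]) if s["pred_count"] > 0 else 1.0
        total += f1 * d
    return total / len(samples)


samples = [
    # 4 people in the image, 2 boxes predicted: no penalty (D = 1).
    {"precision": 1.0, "recall": 1.0, "gt_count": 4, "pred_count": 2},
    # 2 people in the image, 4 boxes predicted: D = 0.5 halves the F1 of 2/3.
    {"precision": 0.5, "recall": 1.0, "gt_count": 2, "pred_count": 4},
]
print(density_f1(samples))
```

In the second sample the model covers every target, but predicting twice as many boxes as there are people in the image cuts its contribution from 2/3 to 1/3.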
5.3.2 Evaluation Script
Prediction Format
Before running the evaluation, you need to prepare your model's predictions in the correct format. Each prediction should be a JSON line in a JSONL file with the following structure:
{
"id": "image_id",
"extracted_predictions": [[x1, y1, x2, y2], [x1, y1, x2, y2], ...]
}
Where:
- id: The image identifier matching the ground truth data
- extracted_predictions: A list of bounding boxes in [x1, y1, x2, y2] format or points in [x, y] format
For rejection cases (where no humans should be detected), you should either:
- Include an empty list: "extracted_predictions": []
- Include a list with an empty box: "extracted_predictions": [[]]
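Producing a file in this format takes only the standard library. The sketch below is one possible way to serialize predictions; the helper name `to_jsonl` and the sample ids are hypothetical, and only the two keys shown above are written.

```python
import json


def to_jsonl(predictions):
    """predictions: iterable of (id, boxes) pairs; returns JSONL text, one record per line."""
    lines = []
    for sample_id, boxes in predictions:
        record = {"id": sample_id, "extracted_predictions": [list(b) for b in boxes]}
        lines.append(json.dumps(record))
    return "".join(line + "\n" for line in lines)


text = to_jsonl([
    ("img_1", [[10, 20, 110, 220]]),  # one predicted box
    ("img_2", []),                    # rejection case: empty list
])
print(text, end="")
```

Writing the returned string to a path such as `path/to/your/predictions.jsonl` yields a file that can be passed to the evaluation script via `--pred_path`.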
Running the Evaluation
You can run the evaluation script using the following command:
python rexseek/metric/recall_precision_densityf1.py \
--gt_path IDEA-Research/HumanRef/annotations.jsonl \
--pred_path path/to/your/predictions.jsonl \
--pred_names "Your Model Name" \
--dump_path IDEA-Research/HumanRef/evaluation_results/your_model_results
Parameters:
- --gt_path: Path to the ground truth annotations file
- --pred_path: Path to your prediction file(s). You can provide multiple paths to compare different models
- --pred_names: Names for your models (for display in the results)
- --dump_path: Directory to save the evaluation results in markdown and JSON formats
Evaluating Multiple Models:
To compare multiple models, provide multiple prediction files:
python rexseek/metric/recall_precision_densityf1.py \
--gt_path IDEA-Research/HumanRef/annotations.jsonl \
--pred_path model1_results.jsonl model2_results.jsonl model3_results.jsonl \
--pred_names "Model 1" "Model 2" "Model 3" \
--dump_path IDEA-Research/HumanRef/evaluation_results/comparison
Programmatic Usage
from rexseek.metric.recall_precision_densityf1 import recall_precision_densityf1
recall_precision_densityf1(
gt_path="IDEA-Research/HumanRef/annotations.jsonl",
pred_path=["path/to/your/predictions.jsonl"],
dump_path="IDEA-Research/HumanRef/evaluation_results/your_model_results"
)
5.3.3 Evaluate RexSeek
First we need to run the following command to generate the predictions:
python rexseek/evaluation/evaluate_rexseek.py \
--model_path IDEA-Research/RexSeek-3B \
--image_folder IDEA-Research/HumanRef/images \
--question_file IDEA-Research/HumanRef/annotations.jsonl \
--answers_file IDEA-Research/HumanRef/evaluation_results/eval_rexseek/RexSeek-3B_results.jsonl
Then we can run the following command to evaluate the RexSeek model:
python rexseek/metric/recall_precision_densityf1.py \
--gt_path IDEA-Research/HumanRef/annotations.jsonl \
--pred_path IDEA-Research/HumanRef/evaluation_results/eval_rexseek/RexSeek-3B_results.jsonl \
--pred_names "RexSeek-3B" \
--dump_path IDEA-Research/HumanRef/evaluation_results/comparison
6. LICENSE
RexSeek is licensed under the IDEA License 1.0, Copyright (c) IDEA. All Rights Reserved. Note that this project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to:
- OpenAI Terms of Use for the dataset.
- For the LLM used in this project, the model is Qwen/Qwen2.5-3B-Instruct, which is licensed under Qwen RESEARCH LICENSE AGREEMENT.
- For the high resolution vision encoder, we are using laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg which is licensed under MIT LICENSE.
- For the low resolution vision encoder, we are using openai/clip-vit-large-patch14 which is licensed under MIT LICENSE.
BibTeX
@misc{jiang2025referringperson,
title={Referring to Any Person},
author={Qing Jiang and Lin Wu and Zhaoyang Zeng and Tianhe Ren and Yuda Xiong and Yihao Chen and Qin Liu and Lei Zhang},
year={2025},
eprint={2503.08507},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.08507},
}