Contents
1. Introduction 📚
2. Installation 🛠️
- 2.1 Download Pre-trained Models
- 2.2 Verify Installation
3. Usage 🚀
4. Gradio Demos 🎨
- 4.1 Gradio Demo for RexSeek + GroundingDINO + SAM
5. HumanRef Benchmark
6. LICENSE
BibTeX 📚

1. Introduction 📚

RexSeek is a Multimodal Large Language Model (MLLM) designed to detect people or objects in images based on natural language descriptions. Unlike traditional referring models that focus on single-instance detection, RexSeek excels at multi-instance referring tasks - identifying multiple people or objects that match a given description.

Key Features

Multi-Instance Detection: Can identify multiple matching instances in a single image
Robust Perception: Powered by state-of-the-art person detection models
Strong Language Understanding: Leverages advanced LLM capabilities for complex description comprehension

The HumanRef Benchmark

We aslo introduce HumanRef Benchmark, a comprehensive benchmark for human-centric referring tasks containing:

6000 referring expressions
Average of 2.2 instances per expression
Covers 6 key aspects of human referring:
- Attributes (gender, age, clothing, etc.)
- Position (spatial relationships)
- Interaction (human-to-human, human-to-object)
- Reasoning (multi-step inference)
- Celebrity Recognition
- Rejection (hallucination detection)

2. Installation 🛠️

conda install -n rexseek python=3.9
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121
pip install -v -e .

2.1 Download Pre-trained Models

We provide model checkpoints for RexSeek-3B. You can download the pre-trained models from the following links:

ChatRex-3B Checkpoint

Or you can also using the following command to download the pre-trained models:

# Download ChatRex checkpoint from Hugging Face
git lfs install
git clone https://huggingface.co/IDEA-Research/RexSeek-3B IDEA-Research/RexSeek-3B

2.2 Verify Installation

To verify the installation, run the following command:

python tests/test_local_load.py

If the installation is successful, you will get a visualization image in tests/images folder.

3. Usage 🚀

3.1 Model Architecture

TL;DR: RexSeek needs model to propose object boxes first, then use the LLM to detect the objects.

RexSeek consists of three key components:

Vision Encoders: Dual-resolution feature extraction (CLIP + ConvNeXt)
Person Detector: DINO-X for generating high-quality object proposals
Language Model: Qwen2.5 for understanding complex referring expressions

Inputs:
- Image: The source image containing people/objects
- Text: Natural language description of target objects
- Boxes: Object proposals from DINO-X detector (can be replaced with custom boxes)
Outputs:
- Object indices corresponding to the referring expression in format:
```
<ground>referring text</ground><objects><obj1><obj2>...</objects>
```

3.2 Combine RexSeek with GroundingDINO

In this example, we will use GroundingDINO to generate object proposals, and then use RexSeek to detect the objects.

3.2.1 Install GroundingDINO

cd demos/
git clone https://github.com/IDEA-Research/GroundingDINO.git
cd GroundingDINO
pip install -v -e .
mkdir weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth -P weights
cd ../../../

3.2.2 Run the Demo

python demos/rexseek_grounding_dino.py \
    --image demos/demo_images/demo1.jpg \
    --output demos/demo_images/demo1_result.jpg \
    --referring "person that is giving a proposal" \
    --objects "person" \
    --text-threshold 0.25 \
    --box-threshold 0.25

3.3 Combine RexSeek with GroundingDINO and Spacy

In previous example, we need to explicitly specify object categories (like "person") for GroundingDINO to detect. However, we can make this process more automatic by using Spacy to extract nouns from the question as detection targets.

3.3.1 Install Dependencies

pip install spacy
python -m spacy download en_core_web_sm

3.3.2 Run the Demo

python demos/rexseek_grounding_dino_spacy.py \
    --image demos/demo_images/demo1.jpg \
    --output demos/demo_images/demo1_result.jpg \
    --referring "person that is giving a proposal" \
    --text-threshold 0.25 \
    --box-threshold 0.25

In this enhanced version:

No need to specify --objects parameter
Spacy automatically extracts nouns ("people", "shirts", "dogs", "park") from the question
GroundingDINO uses these extracted nouns as detection targets
More flexible and natural interaction through questions

3.4 Combine RexSeek with GroundingDINO, Spacy and SAM

In this example, we will use GroundingDINO to generate object proposals, then use Spacy to extract nouns from the question as detection targets, and finally use SAM to segment the objects.

3.4.1 Install Dependencies

cd demos/
git clone https://github.com/IDEA-Research/SAM.git  
cd SAM
pip install -v -e .
mkdir weights
wget -q https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth -P weights
cd ../../../

3.4.2 Run the Demo

python demos/rexseek_grounding_dino_spacy_sam.py \
    --image demos/demo_images/demo1.jpg \
    --output demos/demo_images/demo1_result.jpg \
    --referring "person that is giving a proposal" \
    --text-threshold 0.25 \
    --box-threshold 0.25

4. Gradio Demos 🎨

4.1 Gradio Demo for RexSeek + GroundingDINO + SAM

We provide a gradio demo for RexSeek + GroundingDINO + SAM. You can run the following command to start the gradio demo:

python demos/gradio_demo.py \
    --rexseek-path "IDEA-Research/RexSeek-3B" \
    --gdino-config "demos/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py" \
    --gdino-weights "demos/GroundingDINO/weights/groundingdino_swint_ogc.pth" \
    --sam-weights "demos/segment-anything/weights/sam_vit_h_4b8939.pth"

5. HumanRef Benchmark

HumanRef is a large-scale human-centric referring expression dataset designed for multi-instance human referring in natural scenes. Unlike traditional referring datasets that focus on one-to-one object referring, HumanRef supports referring to multiple individuals simultaneously through natural language descriptions.

Key features of HumanRef include:

Multi-Instance Referring: A single referring expression can correspond to multiple individuals, better reflecting real-world scenarios
Diverse Referring Types: Covers 6 major types of referring expressions:
- Attribute-based (e.g., gender, age, clothing)
- Position-based (relative positions between humans or with environment)
- Interaction-based (human-human or human-environment interactions)
- Reasoning-based (complex logical combinations)
- Celebrity Recognition
- Rejection Cases (non-existent references)
High-Quality Data:
- 34,806 high-resolution images (>1000×1000 pixels)
- 103,028 referring expressions in training set
- 6,000 carefully curated expressions in benchmark set
- Average 8.6 persons per image
- Average 2.2 target boxes per referring expression

The dataset aims to advance research in human-centric visual understanding and referring expression comprehension in complex, multi-person scenarios.

5.1 Download

You can download the HumanRef Benchmark at https://huggingface.co/datasets/IDEA-Research/HumanRef.

5.2 Visualization

HumanRef Benchmark contains 6 domains, each domain may have multiple sub-domains.

Domain	Subdomain	Num Referrings
attribute	1000_attribute_retranslated_with_mask	1000
position	500_inner_position_data_with_mask	500
position	500_outer_position_data_with_mask	500
celebrity	1000_celebrity_data_with_mask	1000
interaction	500_inner_interaction_data_with_mask	500
interaction	500_outer_interaction_data_with_mask	500
reasoning	229_outer_position_two_stage_with_mask	229
reasoning	271_positive_then_negative_reasoning_with_mask	271
reasoning	500_inner_position_two_stage_with_mask	500
rejection	1000_rejection_referring_with_mask	1000

To visualize the dataset, you can run the following command:

python rexseek/tools/visualize_humanref.py \
    --anno_path "IDEA-Research/HumanRef/annotations.jsonl" \
    --image_root_dir "IDEA-Research/HumanRef/images" \
    --domain_anme "attribute" \ # attribute, position, interaction, reasoning, celebrity, rejection
    --sub_domain_anme "1000_attribute_retranslated_with_mask" \ # 1000_attribute_retranslated_with_mask, 500_inner_position_data_with_mask, 500_outer_position_data_with_mask, 1000_celebrity_data_with_mask, 500_inner_interaction_data_with_mask, 500_outer_interaction_data_with_mask, 229_outer_position_two_stage_with_mask, 271_positive_then_negative_reasoning_with_mask, 500_inner_position_two_stage_with_mask, 1000_rejection_referring_with_mask
    --vis_path "IDEA-Research/HumanRef/visualize" \
    --num_images 50 \
    --vis_mask True # True, False

5.3 Evaluation

5.3.1 Metrics

We evaluate the referring task using three main metrics: Precision, Recall, and DensityF1 Score.

Basic Metrics

Precision & Recall: For each referring expression, a predicted bounding box is considered correct if its IoU with any ground truth box exceeds a threshold. Following COCO evaluation protocol, we report average performance across IoU thresholds from 0.5 to 0.95 in steps of 0.05.
Point-based Evaluation: For models that only output points (e.g., Molmo), a prediction is considered correct if the predicted point falls within the mask of the corresponding instance. Note that this is less strict than IoU-based metrics.
Rejection Accuracy: For the rejection subset, we calculate:
```
Rejection Accuracy = Number of correctly rejected expressions / Total number of expressions
```
where a correct rejection means the model predicts no boxes for a non-existent reference.

DensityF1 Score

To penalize over-detection (predicting too many boxes), we introduce the DensityF1 Score:

DensityF1 = (1/N) * Σ [2 * (Precision_i * Recall_i)/(Precision_i + Recall_i) * D_i]

where D_i is the density penalty factor:

D_i = min(1.0, GT_Count_i / Predicted_Count_i)

where:

N is the number of referring expressions
GT_Count_i is the total number of persons in image i
Predicted_Count_i is the number of predicted boxes for referring expression i

This penalty factor reduces the score when models predict significantly more boxes than the actual number of people in the image, discouraging over-detection strategies.

5.3.2 Evaluation Script

Prediction Format

Before running the evaluation, you need to prepare your model's predictions in the correct format. Each prediction should be a JSON line in a JSONL file with the following structure:

{
  "id": "image_id",
  "extracted_predictions": [[x1, y1, x2, y2], [x1, y1, x2, y2], ...]
}

Where:

id: The image identifier matching the ground truth data
extracted_predictions: A list of bounding boxes in [x1, y1, x2, y2] format or points in [x, y] format

For rejection cases (where no humans should be detected), you should either:

Include an empty list: "extracted_predictions": []
Include a list with an empty box: "extracted_predictions": [[]]

Running the Evaluation

You can run the evaluation script using the following command:

python rexseek/metric/recall_precision_densityf1.py \
  --gt_path IDEA-Research/HumanRef/annotations.jsonl \
  --pred_path path/to/your/predictions.jsonl \
  --pred_names "Your Model Name" \
  --dump_path IDEA-Research/HumanRef/evaluation_results/your_model_results

Parameters:

--gt_path: Path to the ground truth annotations file
--pred_path: Path to your prediction file(s). You can provide multiple paths to compare different models
--pred_names: Names for your models (for display in the results)
--dump_path: Directory to save the evaluation results in markdown and JSON formats

Evaluating Multiple Models:

To compare multiple models, provide multiple prediction files:

python rexseek/metric/recall_precision_densityf1.py \
  --gt_path IDEA-Research/HumanRef/annotations.jsonl \
  --pred_path model1_results.jsonl model2_results.jsonl model3_results.jsonl \
  --pred_names "Model 1" "Model 2" "Model 3" \
  --dump_path IDEA-Research/HumanRef/evaluation_results/comparison

Programmatic Usage

from rexseek.metric.recall_precision_densityf1 import recall_precision_densityf1

recall_precision_densityf1(
    gt_path="IDEA-Research/HumanRef/annotations.jsonl",
    pred_path=["path/to/your/predictions.jsonl"],
    dump_path="IDEA-Research/HumanRef/evaluation_results/your_model_results"
)

5.3.3 Evaluate RexSeek

First we need to run the following command to generate the predictions:

python rexseek/evaluation/evaluate_rexseek.py \
    --model_path IDEA-Research/RexSeek-3B \
    --image_folder IDEA-Research/HumanRef/images \
    --question_file IDEA-Research/HumanRef/annotations.jsonl \
    --answers_file IDEA-Research/HumanRef/evaluation_results/eval_rexseek/RexSeek-3B_results.jsonl \

Then we can run the following command to evaluate the RexSeek model:

python rexseek/metric/recall_precision_densityf1.py \
  --gt_path IDEA-Research/HumanRef/annotations.jsonl \
  --pred_path  IDEA-Research/HumanRef/evaluation_results/eval_rexseek/RexSeek-3B_results.jsonl\
  --pred_names "RexSeek-3B" \
  --dump_path IDEA-Research/HumanRef/evaluation_results/comparison

6. LICENSE

ChatRex is licensed under the IDEA License 1.0, Copyright (c) IDEA. All Rights Reserved. Note that this project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses including but not limited to the:

OpenAI Terms of Use for the dataset.
For the LLM used in this project, the model is Qwen/Qwen2.5-3B-Instruct, which is licensed under Qwen RESEARCH LICENSE AGREEMENT.
For the high resolution vision encoder, we are using laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg which is licensed under MIT LICENSE.
For the low resolution vision encoder, we are using openai/clip-vit-large-patch14 which is licensed under MIT LICENSE

BibTeX 📚

@misc{jiang2025referringperson,
      title={Referring to Any Person}, 
      author={Qing Jiang and Lin Wu and Zhaoyang Zeng and Tianhe Ren and Yuda Xiong and Yihao Chen and Qin Liu and Lei Zhang},
      year={2025},
      eprint={2503.08507},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.08507}, 
}

Contents