Mountchicken committed on
Commit 5a7a3e1
1 Parent(s): 4d629ce

Update README.md

Files changed (1): README.md (+622, -13)

---
pipeline_tag: image-text-to-text
language:
- en
base_model:
- lmsys/vicuna-7b-v1.5
- openai/clip-vit-large-patch14
- laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg
tags:
- chatrex
---
<div align=center>
<img src="assets/teaser.jpg" width=600 >
</div>

# 1. Introduction 📚
**TL;DR: ChatRex is an MLLM skilled in perception that can respond to questions while simultaneously grounding its answers to the referenced objects.**

ChatRex is a Multimodal Large Language Model (MLLM) designed to seamlessly integrate fine-grained object perception and robust language understanding. By adopting a decoupled architecture with a retrieval-based approach to object detection and leveraging high-resolution visual inputs, ChatRex addresses key challenges in perception tasks. It is powered by the Rexverse-2M dataset, which provides diverse image-region-text annotations. ChatRex can be applied to various scenarios requiring fine-grained perception, such as object detection, grounded conversation, grounded image captioning, and region understanding.

<div align=center>
<img src="assets/capability_overview.jpg" width=800 >
</div>

----

# 2. Installation 🛠️
```bash
conda create -n chatrex python=3.9
conda activate chatrex
pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu121
pip install -v -e .
# install deformable attention for the Universal Proposal Network
cd chatrex/upn/ops
pip install -v -e .
```
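
After installation, a quick sanity check (not part of the repository's test scripts) can confirm that the CUDA-enabled PyTorch build is active before you download any checkpoints:

```python
import torch

# Expect 2.1.2 and True on a machine with the CUDA 12.1 wheels installed.
print(torch.__version__)
print(torch.cuda.is_available())
```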

## 2.1 Download Pre-trained Models
We provide model checkpoints for both the ***Universal Proposal Network (UPN)*** and the ***ChatRex model***. You can download the pre-trained models from the following links:
- [UPN Checkpoint](https://drive.google)
- [ChatRex-7B Checkpoint](https://huggingface.co/IDEA-Research/ChatRex-7B)

Alternatively, you can use the following commands to download the pre-trained models (the target paths match those used in the example code below):
```bash
mkdir -p checkpoints/upn_checkpoints
# download the UPN checkpoint
wget -O checkpoints/upn_checkpoints/upn_large.pth https://drive.google.com/file/d/
# download the ChatRex checkpoint from Hugging Face (IDEA-Research/ChatRex-7B)
git lfs install
git clone https://huggingface.co/IDEA-Research/ChatRex-7B checkpoints/chatrex7b
```
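
If you prefer not to use `git lfs`, the ChatRex checkpoint can also be fetched with the `huggingface_hub` Python API. This is an optional alternative (assuming `huggingface_hub` is installed), writing to the same `checkpoints/chatrex7b` directory expected by the examples below:

```python
from huggingface_hub import snapshot_download

# Download the full ChatRex-7B repository into the directory the examples expect.
snapshot_download(
    repo_id="IDEA-Research/ChatRex-7B",
    local_dir="checkpoints/chatrex7b",
)
```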

## 2.2 Verify Installation
To verify the ***installation of the Universal Proposal Network (UPN)***, run the following command:
```bash
python tests/test_upn_install.py
```

If the installation is successful, you will get two visualization images, one for fine-grained proposals and one for coarse-grained proposals, in the `tests` folder.

To verify the ***installation of the ChatRex model***, run the following command:
```bash
python tests/test_chatrex_install.py
```

If the installation is successful, you will get an output like this:
```text
prediction: <obj0> shows a brown dog lying on a bed. The dog is resting comfortably, possibly sleeping, and is positioned on the left side of the bed
```

# 3. Usage 🚀
## 3.1 Use UPN for Object Proposal Generation

The Universal Proposal Network (UPN) is a robust object proposal model, designed as part of ChatRex to enable comprehensive and accurate object detection across diverse granularities and domains. Built upon T-Rex2, UPN is a DETR-based model with a dual-granularity prompt tuning strategy that combines fine-grained (e.g., part-level) and coarse-grained (e.g., instance-level) detection.

<div align=center>
<img src="assets/upn_res.jpg" width=600 >
</div>

----

<details close>
<summary><strong>Example Code for UPN</strong></summary>

```python
from PIL import Image

from chatrex.tools.visualize import plot_boxes_to_image
from chatrex.upn import UPNWrapper

ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth"
test_image_path = "tests/images/test_upn.jpeg"

model = UPNWrapper(ckpt_path)
# fine-grained prompt
fine_grained_proposals = model.inference(
    test_image_path, prompt_type="fine_grained_prompt"
)
# filter by score (default: 0.3) and nms (default: 0.8)
fine_grained_filtered_proposals = model.filter(
    fine_grained_proposals, min_score=0.3, nms_value=0.8
)
## output is a dict with keys: "original_xyxy_boxes", "scores"
## - "original_xyxy_boxes": list of boxes in xyxy format in shape (B, N, 4)
## - "scores": list of scores for each box in shape (B, N)

# coarse-grained prompt
coarse_grained_proposals = model.inference(
    test_image_path, prompt_type="coarse_grained_prompt"
)
coarse_grained_filtered_proposals = model.filter(
    coarse_grained_proposals, min_score=0.3, nms_value=0.8
)
## the output has the same format as above
```

</details>
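
The filtered proposal dictionary can be consumed directly. As a small illustrative sketch (reusing the variable names from the block above, not an official utility), this keeps the ten highest-scoring fine-grained boxes:

```python
# "original_xyxy_boxes" has shape (B, N, 4) and "scores" has shape (B, N);
# index [0] selects the first (and here only) image in the batch.
boxes = fine_grained_filtered_proposals["original_xyxy_boxes"][0]
scores = fine_grained_filtered_proposals["scores"][0]

top10 = sorted(zip(boxes, scores), key=lambda pair: pair[1], reverse=True)[:10]
for box, score in top10:
    print([round(float(v), 1) for v in box], round(float(score), 3))
```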

We also provide a tool to visualize the object proposals generated by UPN:

<details close>
<summary><strong>Example Code for UPN Visualization</strong></summary>

```python
from PIL import Image

from chatrex.tools.visualize import plot_boxes_to_image

# continues from the proposal-generation example above
image = Image.open(test_image_path)
fine_grained_vis_image, _ = plot_boxes_to_image(
    image.copy(),
    fine_grained_filtered_proposals["original_xyxy_boxes"][0],
    fine_grained_filtered_proposals["scores"][0],
)
fine_grained_vis_image.save("tests/test_image_fine_grained.jpeg")
print("fine-grained proposals are saved at tests/test_image_fine_grained.jpeg")

coarse_grained_vis_image, _ = plot_boxes_to_image(
    image.copy(),
    coarse_grained_filtered_proposals["original_xyxy_boxes"][0],
    coarse_grained_filtered_proposals["scores"][0],
)
coarse_grained_vis_image.save("tests/test_image_coarse_grained.jpeg")
print("coarse-grained proposals are saved at tests/test_image_coarse_grained.jpeg")
```
</details>

## 3.2 Usage of ChatRex

ChatRex takes three inputs: an image, a text prompt, and a box input. For the box input, you can either use the object proposals generated by UPN or provide your own boxes (e.g., user-drawn boxes), as sketched below. We have wrapped the ChatRex model in the Hugging Face Transformers format for easy usage. ChatRex can be used for various tasks, and we provide example code for each task below.
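
For reference, here is a minimal sketch of passing your own boxes instead of UPN proposals. It assumes the `processor` has been loaded as in the examples below, and the box coordinates are made up for illustration:

```python
from PIL import Image

# Hypothetical user-drawn boxes in xyxy pixel coordinates.
user_boxes = [[50, 60, 220, 300], [300, 80, 480, 310]]

inputs = processor.process(
    image=Image.open("tests/images/test_chatrex_detection.jpg"),
    question="Please detect person in this image. Answer the question with object indexes.",
    bbox=user_boxes,  # same xyxy format as the UPN output
)
```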

### 3.2.1 ChatRex for Object Detection & Grounding & Referring

Example prompts for detection, grounding, and referring tasks:
```text
# Single object detection
Please detect dog in this image. Answer the question with object indexes.
Please detect the man in yellow shirt in this image. Answer the question with object indexes.

# Multiple object detection: use ; to separate the objects
Please detect person; pigeon in this image. Answer the question with object indexes.
Please detect person in the car; cat below the table in this image. Answer the question with object indexes.
```

<details close>
<summary><strong>Example Code</strong></summary>

- [Example Code in python file](tests/test_chatrex_detection.py)

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

from chatrex.tools.visualize import visualize_chatrex_output
from chatrex.upn import UPNWrapper

if __name__ == "__main__":
    # load the processor
    processor = AutoProcessor.from_pretrained(
        "checkpoints/chatrex7b",
        trust_remote_code=True,
        device_map="cuda",
    )

    print("loading chatrex model...")
    # load the chatrex model
    model = AutoModelForCausalLM.from_pretrained(
        "checkpoints/chatrex7b",
        trust_remote_code=True,
        use_safetensors=True,
    ).to("cuda")

    # load the upn model
    print("loading upn model...")
    ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth"
    model_upn = UPNWrapper(ckpt_path)
    test_image_path = "tests/images/test_chatrex_detection.jpg"

    # get upn proposals
    fine_grained_proposals = model_upn.inference(
        test_image_path, prompt_type="fine_grained_prompt"
    )
    fine_grained_filtered_proposals = model_upn.filter(
        fine_grained_proposals, min_score=0.3, nms_value=0.8
    )

    inputs = processor.process(
        image=Image.open(test_image_path),
        question="Please detect person; pigeon in this image. Answer the question with object indexes.",
        bbox=fine_grained_filtered_proposals["original_xyxy_boxes"][0],  # boxes in xyxy format
    )

    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    # perform inference
    gen_config = GenerationConfig(
        max_new_tokens=512,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=(
            processor.tokenizer.pad_token_id
            if processor.tokenizer.pad_token_id is not None
            else processor.tokenizer.eos_token_id
        ),
    )
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        prediction = model.generate(
            inputs, gen_config=gen_config, tokenizer=processor.tokenizer
        )
    print("prediction:", prediction)

    # visualize the prediction
    vis_image = visualize_chatrex_output(
        Image.open(test_image_path),
        fine_grained_filtered_proposals["original_xyxy_boxes"][0],
        prediction,
        font_size=15,
        draw_width=5,
    )
    vis_image.save("tests/test_chatrex_detection.jpeg")
    print("prediction is saved at tests/test_chatrex_detection.jpeg")
```

The output from the LLM looks like this:
```text
<ground>person</ground><objects><obj10><obj14><obj15><obj27><obj28><obj32><obj33><obj35><obj38><obj47><obj50></objects>
<ground>pigeon</ground><objects><obj0><obj1><obj2><obj3><obj4><obj5><obj6><obj7><obj8><obj9><obj11><obj12><obj13><obj16><obj17><obj18><obj19><obj20><obj21><obj22><obj23><obj24><obj25><obj26><obj29><obj31><obj37><obj39><obj40><obj41><obj44><obj49></objects>
```
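
Each `<ground>phrase</ground><objects>...</objects>` pair is plain text, so it can be parsed with a regular expression if you need the grounded boxes programmatically. A small sketch (a hypothetical helper, not part of the released toolkit) that maps each grounded phrase to its object indexes:

```python
import re

output = (
    "<ground>person</ground><objects><obj10><obj14></objects>"
    "<ground>pigeon</ground><objects><obj0><obj1><obj2></objects>"
)

# Each match is (phrase, "<objN><objM>..."); the indexes point into the bbox list
# that was passed to processor.process.
for phrase, objs in re.findall(r"<ground>(.*?)</ground><objects>(.*?)</objects>", output):
    indexes = [int(i) for i in re.findall(r"<obj(\d+)>", objs)]
    print(phrase, indexes)
```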

The visualization of the output looks like this:

<div align=center>
<img src="assets/vis_output/test_chatrex_detection.jpeg" width=600 >
</div>

</details>

----

### 3.2.2 ChatRex for Region Caption
Example prompts for region caption tasks:

```text
# Single region caption
## caption in category name
What is the category name of <obji>? Answer the question with its category name in free format.

## caption in short phrase
Can you provide me with a short phrase to describe <obji>? Answer the question with a short phrase.

## caption in referring style
Can you provide me with a brief description of <obji>? Answer the question with a brief description.

## caption in one sentence
Can you provide me with a one sentence description of <obji>? Answer the question with a one sentence description.

# For multiple regions, use ; to separate the <obji> tokens
```

<details close>
<summary><strong>Example Code</strong></summary>

- [Example Code in python file](tests/test_chatrex_region_caption.py)

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

from chatrex.tools.visualize import visualize_chatrex_output

if __name__ == "__main__":
    # load the processor
    processor = AutoProcessor.from_pretrained(
        "checkpoints/chatrex7b",
        trust_remote_code=True,
        device_map="cuda",
    )

    print("loading chatrex model...")
    # load the chatrex model
    model = AutoModelForCausalLM.from_pretrained(
        "checkpoints/chatrex7b",
        trust_remote_code=True,
        use_safetensors=True,
    ).to("cuda")

    test_image_path = "tests/images/test_chatrex_install.jpg"

    inputs = processor.process(
        image=Image.open(test_image_path),
        question="Can you provide a one sentence description of <obj0> in the image? Answer the question with a one sentence description.",
        bbox=[[73.88417, 56.62228, 227.69223, 216.34338]],
    )

    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    # perform inference
    gen_config = GenerationConfig(
        max_new_tokens=512,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=(
            processor.tokenizer.pad_token_id
            if processor.tokenizer.pad_token_id is not None
            else processor.tokenizer.eos_token_id
        ),
    )
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        prediction = model.generate(
            inputs, gen_config=gen_config, tokenizer=processor.tokenizer
        )
    print("prediction:", prediction)

    # visualize the prediction
    vis_image = visualize_chatrex_output(
        Image.open(test_image_path),
        [[73.88417, 56.62228, 227.69223, 216.34338]],
        prediction,
        font_size=15,
        draw_width=5,
    )
    vis_image.save("tests/test_chatrex_region_caption.jpeg")
    print("prediction is saved at tests/test_chatrex_region_caption.jpeg")
```

The output from the LLM looks like this:
```text
<ground>A brown dog is lying on a bed, appearing relaxed and comfortable</ground><objects><obj0></objects>
```

The visualization of the output looks like this:

<div align=center>
<img src="assets/vis_output/test_chatrex_region_caption.jpeg" width=600 >
</div>

</details>

----

### 3.2.3 ChatRex for Grounded Image Captioning
Example prompts for grounded image captioning tasks:

```text
# Brief Grounded Image Caption
Please briefly describe this image in one sentence and detect all the mentioned objects. Answer the question with grounded answer.

# Detailed Grounded Image Caption
Please provide a detailed description of the image and detect all the mentioned objects. Answer the question with grounded object indexes.
```

<details close>
<summary><strong>Example Code</strong></summary>

- [Example Code in python file](tests/test_chatrex_grounded_image_caption.py)

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

from chatrex.tools.visualize import visualize_chatrex_output
from chatrex.upn import UPNWrapper

if __name__ == "__main__":
    # load the processor
    processor = AutoProcessor.from_pretrained(
        "checkpoints/chatrex7b",
        trust_remote_code=True,
        device_map="cuda",
    )

    print("loading chatrex model...")
    # load the chatrex model
    model = AutoModelForCausalLM.from_pretrained(
        "checkpoints/chatrex7b",
        trust_remote_code=True,
        use_safetensors=True,
    ).to("cuda")

    # load the upn model
    print("loading upn model...")
    ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth"
    model_upn = UPNWrapper(ckpt_path)
    test_image_path = "tests/images/test_chatrex_grounded_caption.jpg"

    # get upn proposals
    fine_grained_proposals = model_upn.inference(
        test_image_path, prompt_type="fine_grained_prompt"
    )
    fine_grained_filtered_proposals = model_upn.filter(
        fine_grained_proposals, min_score=0.3, nms_value=0.8
    )

    inputs = processor.process(
        image=Image.open(test_image_path),
        question="Please briefly describe this image in one sentence and detect all the mentioned objects. Answer the question with grounded answer.",
        bbox=fine_grained_filtered_proposals["original_xyxy_boxes"][0],  # boxes in xyxy format
    )

    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    # perform inference
    gen_config = GenerationConfig(
        max_new_tokens=512,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=(
            processor.tokenizer.pad_token_id
            if processor.tokenizer.pad_token_id is not None
            else processor.tokenizer.eos_token_id
        ),
    )
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        prediction = model.generate(
            inputs, gen_config=gen_config, tokenizer=processor.tokenizer
        )
    print("prediction:", prediction)

    # visualize the prediction
    vis_image = visualize_chatrex_output(
        Image.open(test_image_path),
        fine_grained_filtered_proposals["original_xyxy_boxes"][0],
        prediction,
        font_size=15,
        draw_width=5,
    )
    vis_image.save("tests/test_chatrex_grounded_image_caption.jpeg")
    print("prediction is saved at tests/test_chatrex_grounded_image_caption.jpeg")
```

The output from the LLM looks like this:
```text
The image depicts a cozy living room with a <ground>plaid couch,</ground><objects><obj2></objects> a <ground>wooden TV stand</ground><objects><obj3></objects>holding a <ground>black television,</ground><objects><obj1></objects> a <ground>red armchair,</ground><objects><obj4></objects> and a <ground>whiteboard</ground><objects><obj0></objects>with writing on the wall, accompanied by a <ground>framed poster</ground><objects><obj6></objects>of a <ground>couple.</ground><objects><obj9><obj11></objects>
```
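
If you only need the plain caption text, the grounding markup can be stripped with two substitutions. A short sketch (not an official utility):

```python
import re

grounded = "The image depicts a cozy living room with a <ground>plaid couch,</ground><objects><obj2></objects> and a <ground>whiteboard</ground><objects><obj0></objects>."

# Drop the <objects>...</objects> blocks, then unwrap the <ground> tags.
plain = re.sub(r"<objects>.*?</objects>", "", grounded)
plain = re.sub(r"</?ground>", "", plain)
print(plain)
```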

The visualization of the output looks like this:

<div align=center>
<img src="assets/vis_output/test_chatrex_grounded_image_caption.jpeg" width=600 >
</div>

</details>

----

### 3.2.4 ChatRex for Grounded Conversation
Example prompt for grounded conversation tasks:

```text
Answer the question in grounded format. <your question>
```

<details close>
<summary><strong>Example Code</strong></summary>

- [Example Code in python file](tests/test_chatrex_grounded_conversation.py)

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

from chatrex.tools.visualize import visualize_chatrex_output
from chatrex.upn import UPNWrapper

if __name__ == "__main__":
    # load the processor
    processor = AutoProcessor.from_pretrained(
        "checkpoints/chatrex7b",
        trust_remote_code=True,
        device_map="cuda",
    )

    print("loading chatrex model...")
    # load the chatrex model
    model = AutoModelForCausalLM.from_pretrained(
        "checkpoints/chatrex7b",
        trust_remote_code=True,
        use_safetensors=True,
    ).to("cuda")

    # load the upn model
    print("loading upn model...")
    ckpt_path = "checkpoints/upn_checkpoints/upn_large.pth"
    model_upn = UPNWrapper(ckpt_path)
    test_image_path = "tests/images/test_grounded_conversation.jpg"

    # get coarse-grained upn proposals
    coarse_grained_proposals = model_upn.inference(
        test_image_path, prompt_type="coarse_grained_prompt"
    )
    coarse_grained_filtered_proposals = model_upn.filter(
        coarse_grained_proposals, min_score=0.3, nms_value=0.8
    )

    inputs = processor.process(
        image=Image.open(test_image_path),
        question="Answer the question in grounded format. This is a photo of my room, and can you tell me what kind of person I am?",
        bbox=coarse_grained_filtered_proposals["original_xyxy_boxes"][0],  # boxes in xyxy format
    )

    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    # perform inference
    gen_config = GenerationConfig(
        max_new_tokens=512,
        do_sample=False,
        eos_token_id=processor.tokenizer.eos_token_id,
        pad_token_id=(
            processor.tokenizer.pad_token_id
            if processor.tokenizer.pad_token_id is not None
            else processor.tokenizer.eos_token_id
        ),
    )
    with torch.autocast(device_type="cuda", enabled=True, dtype=torch.bfloat16):
        prediction = model.generate(
            inputs, gen_config=gen_config, tokenizer=processor.tokenizer
        )
    print("prediction:", prediction)

    # visualize the prediction
    vis_image = visualize_chatrex_output(
        Image.open(test_image_path),
        coarse_grained_filtered_proposals["original_xyxy_boxes"][0],
        prediction,
        font_size=30,
        draw_width=10,
    )
    vis_image.save("tests/test_chatrex_grounded_conversation.jpeg")
    print("prediction is saved at tests/test_chatrex_grounded_conversation.jpeg")
```

The output from the LLM looks like this:
```text
Based on the items in the image, it can be inferred that the <ground>person</ground><objects><obj1></objects> who owns this room has an interest in fitness and possibly enjoys reading. The presence of the <ground>dumbbell</ground><objects><obj2></objects> suggests a commitment to physical activity, while the <ground>book</ground><objects><obj3></objects> indicates a liking for literature or reading. The <ground>sneaker</ground><objects><obj0></objects>s and the <ground>plush toy</ground><objects><obj1></objects> add a personal touch, suggesting that the <ground>person</ground><objects><obj1></objects> might also value comfort and perhaps has a playful or nostalgic side. However, without more context, it is not possible to accurately determine the individual's specific traits or <ground>person</ground><objects><obj1></objects>ality.
```

The visualization of the output looks like this:

<div align=center>
<img src="assets/test_chatrex_grounded_conversation.jpeg" width=600 >
</div>

</details>

----

# 4. Gradio Demos 🎨
## 4.1 Gradio Demo for UPN
We provide a Gradio demo for UPN to visualize the object proposals it generates. Run the following command to start the demo:
```bash
python gradio_demos/upn_demo.py
# if there is a permission error, run the following commands instead
mkdir tmp
TMPDIR='./tmp' python gradio_demos/upn_demo.py
```

<div align=center>
<img src="assets/upn_gradio.jpg" width=600 >
</div>


## 4.2 Gradio Demo for ChatRex
We also provide a Gradio demo for ChatRex.
```bash
python gradio_demos/chatrex_demo.py
# if there is a permission error, run the following commands instead
mkdir tmp
TMPDIR='./tmp' python gradio_demos/chatrex_demo.py
```

<div align=center>
<img src="assets/chatrex_gradio.jpg" width=600 >
</div>

# 5. LICENSE

ChatRex is licensed under the IDEA License 1.0, Copyright (c) IDEA. All Rights Reserved. Note that this project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to:
- the [OpenAI Terms of Use](https://openai.com/policies/terms-of-use) for the dataset;
- for the LLM used in this project, [lmsys/vicuna-7b-v1.5](https://huggingface.co/lmsys/vicuna-7b-v1.5/tree/main), which is licensed under the [Llama 2 Community License Agreement](https://huggingface.co/lmsys/vicuna-7b-v1.5);
- for the high-resolution vision encoder, [laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg](https://huggingface.co/laion/CLIP-convnext_large_d.laion2B-s26B-b102K-augreg), which is licensed under the [MIT License](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md);
- for the low-resolution vision encoder, [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14), which is licensed under the [MIT License](https://github.com/openai/CLIP/blob/main/LICENSE).

# BibTeX 📚
```bibtex
@misc{jiang2024trex2,
      title={T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy},
      author={Qing Jiang and Feng Li and Zhaoyang Zeng and Tianhe Ren and Shilong Liu and Lei Zhang},
      year={2024},
      eprint={2403.14610},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```