---
title: Dense_Captioning_-_GRiT
app_file: app.py
sdk: gradio
sdk_version: 3.42.0
---
# GRiT: A Generative Region-to-text Transformer for Object Understanding
GRiT is a general and open-set object understanding framework that localizes objects and
describes them with free-form text in whatever style it was trained on, e.g., class names or
descriptive sentences (covering object attributes, actions, counts, and more).
> [**GRiT: A Generative Region-to-text Transformer for Object Understanding**](https://arxiv.org/abs/2212.00280) \
> Jialian Wu, Jianfeng Wang, Zhengyuan Yang, Zhe Gan, Zicheng Liu, Junsong Yuan, Lijuan Wang \
> State University of New York at Buffalo, Microsoft \
> *arXiv technical report* ([PDF](https://arxiv.org/pdf/2212.00280.pdf))
<p align="center"> <img src='docs/grit.png' align="center" height="400px"> </p>
## Installation
Please follow the [installation instructions](docs/INSTALL.md).
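After installing, it may help to confirm that the core dependencies import correctly before running any demos. A minimal sanity check, assuming a standard PyTorch + detectron2 setup as described in the installation instructions:
~~~
# Quick environment check (illustrative only; docs/INSTALL.md is authoritative).
import torch

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
try:
    import detectron2  # GRiT builds on detectron2
    print("detectron2:", detectron2.__version__)
except ImportError:
    print("detectron2 not found -- please follow docs/INSTALL.md")
~~~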
## ChatGPT with GRiT
We feed ChatGPT GRiT's dense captioning outputs (object locations and descriptions) and ask it to
describe the scene or even write poetry. Given our dense captioning outputs, ChatGPT can generate
impressive scene descriptions. An example is shown below: :star_struck::star_struck::star_struck:
<p align="center"> <img src='docs/chatgpt.png' align="center"> </p>
## Object Understanding Demo - One Model, Two Tasks
[Download the GRiT model](https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap_objectdet.pth) or use the following command to download it:
~~~
mkdir models && cd models
wget https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap_objectdet.pth && cd ..
~~~
The downloaded GRiT model was jointly trained on the dense captioning
and object detection tasks. With the same trained weights, it can
output either rich descriptive sentences or short class names, depending on
the `--test-task` flag. Try it as follows! :star_struck:
### *Output for Dense Captioning (rich descriptive sentences)*
~~~
python demo.py --test-task DenseCap --config-file configs/GRiT_B_DenseCap_ObjectDet.yaml --input demo_images --output visualization --opts MODEL.WEIGHTS models/grit_b_densecap_objectdet.pth
~~~
### *Output for Object Detection (short class names)*
~~~
python demo.py --test-task ObjectDet --config-file configs/GRiT_B_DenseCap_ObjectDet.yaml --input demo_images --output visualization --opts MODEL.WEIGHTS models/grit_b_densecap_objectdet.pth
~~~
Output images will be saved under the `visualization` folder and look like this:
<p align="center"> <img src='docs/demo.png' align="center"> </p>
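If you want both styles of output for the same images, a small wrapper script (a hypothetical convenience helper, not part of the repo) can run the demo in both modes and keep the visualizations in separate folders:
~~~
# Run the demo in both modes; the output folder names here are illustrative.
import subprocess

for task in ("DenseCap", "ObjectDet"):
    subprocess.run(
        [
            "python", "demo.py",
            "--test-task", task,
            "--config-file", "configs/GRiT_B_DenseCap_ObjectDet.yaml",
            "--input", "demo_images",
            "--output", f"visualization_{task.lower()}",
            "--opts", "MODEL.WEIGHTS", "models/grit_b_densecap_objectdet.pth",
        ],
        check=True,
    )
~~~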
You can also try the Colab demo provided by the [TWC team](https://github.com/taskswithcode): [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/taskswithcode/GriT/blob/master/TWCGRiT.ipynb)
## Benchmark Inference and Evaluation
Please follow the [dataset preparation instructions](datasets/DATASETS.md) to download the datasets.
Download our trained models and put them in `models/` for evaluation.
### *Object Detection on COCO 2017 Dataset*
| Model | val AP | test-dev AP | Download |
|-------|--------|-------------|----------|
|[GRiT (ViT-B)](configs/GRiT_B_ObjectDet.yaml)|53.7|53.8| [model](https://datarelease.blob.core.windows.net/grit/models/grit_b_objectdet.pth) |
|[GRiT (ViT-L)](configs/GRiT_L_ObjectDet.yaml)|56.4|56.6| [model](https://datarelease.blob.core.windows.net/grit/models/grit_l_objectdet.pth) |
|[GRiT (ViT-H)](configs/GRiT_H_ObjectDet.yaml)|60.4|60.4| [model](https://datarelease.blob.core.windows.net/grit/models/grit_h_objectdet.pth) |
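To fetch all of the checkpoints in this section into `models/`, a short download sketch (the URLs are the ones listed in the tables; remove any model you don't need):
~~~
# Download the evaluation checkpoints into models/ (standard library only).
import os
import urllib.request

CHECKPOINTS = [
    "https://datarelease.blob.core.windows.net/grit/models/grit_b_objectdet.pth",
    "https://datarelease.blob.core.windows.net/grit/models/grit_l_objectdet.pth",
    "https://datarelease.blob.core.windows.net/grit/models/grit_h_objectdet.pth",
    "https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap.pth",  # dense captioning model (next table)
]

os.makedirs("models", exist_ok=True)
for url in CHECKPOINTS:
    dest = os.path.join("models", os.path.basename(url))
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
~~~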
To evaluate the trained GRiT models on COCO 2017 val, run:
~~~
# GRiT (ViT-B)
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet --eval-only MODEL.WEIGHTS models/grit_b_objectdet.pth
# GRiT (ViT-L)
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_L_ObjectDet.yaml --output-dir-name ./output/grit_l_objectdet --eval-only MODEL.WEIGHTS models/grit_l_objectdet.pth
# GRiT (ViT-H)
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_H_ObjectDet.yaml --output-dir-name ./output/grit_h_objectdet --eval-only MODEL.WEIGHTS models/grit_h_objectdet.pth
~~~
### *Dense Captioning on VG Dataset*
| Model | mAP | Download |
|-------|-----|----------|
|[GRiT (ViT-B)](configs/GRiT_B_DenseCap.yaml)|15.5| [model](https://datarelease.blob.core.windows.net/grit/models/grit_b_densecap.pth) |
To test on the VG test set, run:
~~~
python train_net.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_DenseCap.yaml --output-dir-name ./output/grit_b_densecap --eval-only MODEL.WEIGHTS models/grit_b_densecap.pth
~~~
This saves the inference results to `output/grit_b_densecap/vg_instances_results.json`.
We report results using the VG dense captioning [official evaluation codebase](https://github.com/jcjohnson/densecap).
We didn't integrate that evaluation code into our project because it is written in Lua.
To evaluate on VG, please follow the original codebase's instructions and run its evaluation on our results.
We're happy to discuss any issues you encounter with their code in our issues section.
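Before converting the predictions for the Lua evaluation codebase, it can be useful to peek at the saved file. A minimal sketch, assuming the usual COCO-style layout of a list of per-instance predictions (adjust if the format differs):
~~~
# Inspect the saved dense captioning predictions.
import json

with open("output/grit_b_densecap/vg_instances_results.json") as f:
    results = json.load(f)

print("number of predictions:", len(results))
if results:
    print("fields in first prediction:", sorted(results[0].keys()))
~~~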
## Training
To save training memory, we use [DeepSpeed](https://github.com/microsoft/DeepSpeed), which works well with
[activation checkpointing](https://pytorch.org/docs/stable/checkpoint.html) in distributed training.
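For readers unfamiliar with activation checkpointing, the idea is to discard intermediate activations in the forward pass and recompute them during the backward pass, trading compute for memory. A generic PyTorch sketch of the mechanism (an illustration only, not GRiT's actual training code):
~~~
# Generic activation checkpointing example with torch.utils.checkpoint.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed during backward.
y = checkpoint(block, x)
y.sum().backward()
~~~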
To train on a single machine, run:
~~~
python train_deepspeed.py --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet
~~~
To train on multiple machines (4 nodes in this example), run:
~~~
python train_deepspeed.py --num-machines 4 --num-gpus-per-machine 8 --config-file configs/GRiT_B_ObjectDet.yaml --output-dir-name ./output/grit_b_objectdet
~~~
## Acknowledgement
Our code is based in part on [Detic](https://github.com/facebookresearch/Detic),
[CenterNet2](https://github.com/xingyizhou/CenterNet2),
[detectron2](https://github.com/facebookresearch/detectron2),
[GIT](https://github.com/microsoft/GenerativeImage2Text), and
[transformers](https://github.com/huggingface/transformers).
We thank the authors for their great work!
## Citation
If you find our work interesting and would like to cite it, please use the following BibTeX entry.
~~~
@article{wu2022grit,
  title={GRiT: A Generative Region-to-text Transformer for Object Understanding},
  author={Wu, Jialian and Wang, Jianfeng and Yang, Zhengyuan and Gan, Zhe and Liu, Zicheng and Yuan, Junsong and Wang, Lijuan},
  journal={arXiv preprint arXiv:2212.00280},
  year={2022}
}
~~~