# Model Card for GiT
GiT: Towards Generalist Vision Transformer through Universal Language Interface
This repository includes GiT checkpoints, training logs, and the pre-trained files used.
## Model Details
### Model Description
In this project, we introduce GiT (Generalist Vision Transformer). GiT has the following characteristics:
- Minimalist architecture design similar to LLMs: GiT consists solely of a single transformer, without any additional vision encoder or adapter.
- Covering all types of visual understanding tasks: GiT addresses a spectrum of visual tasks, including object-level tasks (e.g., object detection), pixel-level tasks (e.g., semantic segmentation), and vision-language tasks (e.g., image captioning).
- Achieving task synergy through a unified language interface: Similar to LLMs, GiT exhibits a task synergy effect in multi-task training.
- Strong performance on zero-shot and few-shot benchmarks: GiT scales well with model size and data, demonstrating remarkable generalizability across diverse scenarios after being trained on 27 datasets.
- Developed by: Haiyang Wang (wanghaiyang6@stu.pku.edu.cn), Hao Tang (tanghao@stu.pku.edu.cn)
- License: Apache License 2.0
### Model Sources
- Repository: https://github.com/Haiyang-W/GiT
- Paper: https://arxiv.org/abs/2403.09394
## Uses
Please refer to the GitHub repository above for more details about usage.
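As a quick start, the minimal sketch below shows one way to fetch the checkpoint and log files from the Hugging Face Hub with the `huggingface_hub` library. The repository id is a placeholder (this card does not state it), so substitute the actual id of this model repo before running.

```python
# Minimal sketch: download the GiT checkpoint files hosted in this repo.
# Assumption: "<namespace>/GiT" is a placeholder repo id; replace it with
# the actual Hugging Face repository id of this model card.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="<namespace>/GiT",      # placeholder, not the confirmed repo id
    local_dir="./git_checkpoints",  # where the files will be stored locally
)
print(f"Checkpoints and logs downloaded to: {local_path}")
```

After downloading, follow the instructions in the GitHub repository to load the checkpoints for training or evaluation.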