---
license: mit
language:
- en
base_model:
- facebook/dinov2-small
---

# ViTGaze 👀

**Gaze Following with Interaction Features in Vision Transformers**

[Yuehao Song](https://scholar.google.com/citations?user=7sqkA-MAAAAJ)<sup>1</sup>, [Xinggang Wang](https://xwcv.github.io)<sup>1,✉️</sup>, [Jingfeng Yao](https://scholar.google.com/citations?user=4qc1qJ0AAAAJ)<sup>1</sup>, [Wenyu Liu](http://eic.hust.edu.cn/professor/liuwenyu/)<sup>1</sup>, Jinglin Zhang<sup>2</sup>, Xiangmin Xu<sup>3</sup>

<sup>1</sup> Huazhong University of Science and Technology, <sup>2</sup> Shandong University, <sup>3</sup> South China University of Technology (✉️ corresponding author)

Accepted by Visual Intelligence ([Paper](https://link.springer.com/article/10.1007/s44267-024-00064-9))

[![arXiv paper](https://img.shields.io/badge/arXiv-Preprint-red)](https://arxiv.org/abs/2403.12778) [![GitHub](https://img.shields.io/badge/GitHub-Code-green)](https://github.com/hustvl/ViTGaze) [![Papers with Code](https://img.shields.io/badge/Paperswithcode-blue)](https://paperswithcode.com/paper/vitgaze-gaze-following-with-interaction) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vitgaze-gaze-following-with-interaction/gaze-target-estimation-on-gazefollow)](https://paperswithcode.com/sota/gaze-target-estimation-on-gazefollow?p=vitgaze-gaze-following-with-interaction) [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/vitgaze-gaze-following-with-interaction/gaze-target-estimation-on)](https://paperswithcode.com/sota/gaze-target-estimation-on?p=vitgaze-gaze-following-with-interaction)
![Demo0](assets/demo0.gif) ![Demo1](assets/demo1.gif)

### News

* **`Nov. 21st, 2024`:** ViTGaze is accepted by Visual Intelligence! 🎉
* **`Mar. 25th, 2024`:** We released an initial version of ViTGaze.
* **`Mar. 19th, 2024`:** We released our paper on arXiv. Code/models are coming soon. Please stay tuned! ☕️

## Introduction
A plain Vision Transformer can also perform gaze following, with the simple ViTGaze framework!
![framework](assets/pipeline.png "framework")

Inspired by the remarkable success of pre-trained plain Vision Transformers (ViTs), we introduce **ViTGaze**, a novel single-modality gaze following framework. In contrast to previous methods, it is built mainly on a powerful encoder: the decoder accounts for less than 1% of the parameters. Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. ViTGaze achieves state-of-the-art (SOTA) performance among all single-modality methods (a 3.4% improvement in AUC and a 5.1% improvement in AP) and performance comparable to multi-modality methods with 59% fewer parameters.
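The gist of this insight can be illustrated with a small sketch: take the query/key projections of a ViT block, compute the attention that the token covering the person's head pays to every scene token, and reshape it into a 2D interaction heatmap. The code below is a minimal, self-contained illustration, not the repository's implementation; the function name, random weights, and ViT-S-like dimensions are all assumptions.

```python
# Minimal sketch: turning ViT self-attention into person-scene interaction maps.
# NOT the repository's code; shapes and names are illustrative only.
import torch

def interaction_map(tokens, w_q, w_k, num_heads, head_token_idx, grid_hw):
    """Attention from one "head" (person-head patch) token to all scene tokens.

    tokens:  (N, C) patch tokens from a ViT block
    w_q/w_k: (C, C) query/key projection weights of that block
    returns: (num_heads, H, W) per-head interaction heatmaps
    """
    n, c = tokens.shape
    d = c // num_heads
    q = (tokens @ w_q).reshape(n, num_heads, d).transpose(0, 1)  # (heads, N, d)
    k = (tokens @ w_k).reshape(n, num_heads, d).transpose(0, 1)  # (heads, N, d)
    attn = (q @ k.transpose(-2, -1)) / d**0.5                    # (heads, N, N)
    attn = attn.softmax(dim=-1)
    h, w = grid_hw
    # Row of the head token: how much it attends to every scene token.
    return attn[:, head_token_idx].reshape(num_heads, h, w)

# Toy example with ViT-S-like dims (384 channels, 6 heads) on a 16x16 grid.
torch.manual_seed(0)
C, HEADS, H, W = 384, 6, 16, 16
tokens = torch.randn(H * W, C)
w_q, w_k = torch.randn(C, C), torch.randn(C, C)
maps = interaction_map(tokens, w_q, w_k, HEADS, head_token_idx=0, grid_hw=(H, W))
print(maps.shape)  # torch.Size([6, 16, 16])
```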
## Results

> Results from the [ViTGaze paper](https://link.springer.com/article/10.1007/s44267-024-00064-9)

![comparison](assets/comparion.png "comparison")

**Results on GazeFollow**

| AUC | Avg. Dist. | Min. Dist. |
|:---:|:----------:|:----------:|
| 0.949 | 0.105 | 0.047 |

**Results on VideoAttentionTarget**

| AUC | Dist. | AP |
|:---:|:-----:|:--:|
| 0.938 | 0.102 | 0.905 |
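For reference, the distance metrics above are L2 distances between the predicted gaze point and the annotated gaze points in image-normalized coordinates; GazeFollow reports both the average and the minimum over its multiple annotators. Below is a hedged sketch of that computation, not the official evaluation code; the function name and the heatmap-argmax decoding are assumptions.

```python
# Sketch of the distance metrics (not the official evaluation script):
# L2 distances between the argmax of a predicted gaze heatmap and the
# annotators' gaze points, all in [0, 1] image-normalized coordinates.
import numpy as np

def gaze_distances(heatmap, gt_points):
    """heatmap:   (H, W) predicted gaze heatmap.
    gt_points: (K, 2) annotator gaze points as normalized (x, y).
    Returns (avg_dist, min_dist)."""
    h, w = heatmap.shape
    iy, ix = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    pred = np.array([ix / w, iy / h])  # normalized (x, y) of heatmap peak
    d = np.linalg.norm(np.asarray(gt_points) - pred, axis=1)
    return d.mean(), d.min()

# Toy example: a 64x64 heatmap peaked near (x=0.5, y=0.25).
hm = np.zeros((64, 64))
hm[16, 32] = 1.0
print(gaze_distances(hm, [(0.50, 0.26), (0.48, 0.30)]))
```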
Corresponding checkpoints are released:

- GazeFollow: [GoogleDrive](https://drive.google.com/file/d/164c4woGCmUI8UrM7GEKQrV1FbA3vGwP4/view?usp=drive_link)
- VideoAttentionTarget: [GoogleDrive](https://drive.google.com/file/d/11_O4Jm5wsvQ8qfLLgTlrudqSNvvepsV0/view?usp=drive_link)

## Getting Started

- [Installation](docs/install.md)
- [Train](docs/train.md)
- [Eval](docs/eval.md)

## Acknowledgements

ViTGaze is based on [detectron2](https://github.com/facebookresearch/detectron2). We use the efficient multi-head attention implemented in the [xFormers](https://github.com/facebookresearch/xformers) library.

## Citation

If you find ViTGaze useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

```bibtex
@article{song2024vitgaze,
  title   = {ViTGaze: Gaze Following with Interaction Features in Vision Transformers},
  author  = {Song, Yuehao and Wang, Xinggang and Yao, Jingfeng and Liu, Wenyu and Zhang, Jinglin and Xu, Xiangmin},
  journal = {Visual Intelligence},
  volume  = {2},
  number  = {31},
  year    = {2024},
  url     = {https://doi.org/10.1007/s44267-024-00064-9}
}
```