---
license: mit
language:
- en
base_model:
- facebook/dinov2-small
---
# ViTGaze 👀

### Gaze Following with Interaction Features in Vision Transformers

[Yuehao Song](https://scholar.google.com/citations?user=7sqkA-MAAAAJ)<sup>1</sup>, [Xinggang Wang](https://xwcv.github.io)<sup>1,✉️</sup>, [Jingfeng Yao](https://scholar.google.com/citations?user=4qc1qJ0AAAAJ)<sup>1</sup>, [Wenyu Liu](http://eic.hust.edu.cn/professor/liuwenyu/)<sup>1</sup>, Jinglin Zhang<sup>2</sup>, Xiangmin Xu<sup>3</sup>

<sup>1</sup> Huazhong University of Science and Technology, <sup>2</sup> Shandong University, <sup>3</sup> South China University of Technology

(✉️ corresponding author)

Accepted by Visual Intelligence ([Paper](https://link.springer.com/article/10.1007/s44267-024-00064-9))

[arXiv](https://arxiv.org/abs/2403.12778) · [Code](https://github.com/hustvl/ViTGaze) · [Papers with Code](https://paperswithcode.com/paper/vitgaze-gaze-following-with-interaction) · [SOTA: Gaze Target Estimation on GazeFollow](https://paperswithcode.com/sota/gaze-target-estimation-on-gazefollow?p=vitgaze-gaze-following-with-interaction) · [SOTA: Gaze Target Estimation](https://paperswithcode.com/sota/gaze-target-estimation-on?p=vitgaze-gaze-following-with-interaction)


### News
* **`Nov. 21st, 2024`:** ViTGaze is accepted by Visual Intelligence! 🎉
* **`Mar. 25th, 2024`:** We released an initial version of ViTGaze.
* **`Mar. 19th, 2024`:** We released our paper on arXiv. Code/Models are coming soon. Please stay tuned! ☕️
## Introduction
A plain Vision Transformer can also do gaze following with the simple ViTGaze framework!

Inspired by the remarkable success of pre-trained plain Vision Transformers (ViTs), we introduce a novel single-modality gaze following framework, **ViTGaze**. In contrast to previous methods, it builds a brand-new gaze following framework based mainly on a powerful encoder (the decoder accounts for less than 1% of the parameters). Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Our method achieves state-of-the-art (SOTA) performance among all single-modality methods (a 3.4% improvement in AUC and a 5.1% improvement in AP) and highly comparable performance against multi-modality methods with 59% fewer parameters.
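
To make the idea concrete, here is a rough, self-contained sketch (not the official ViTGaze implementation): it pulls the self-attention maps of a pre-trained DINOv2-small encoder and reads the attention from a hypothetical "head" patch token to all scene patches as an interaction feature, then upsamples it into a toy gaze heatmap. The token choice and the decoder below are illustrative placeholders.

```python
# A minimal sketch of the core idea, NOT the official ViTGaze code.
# Assumptions: facebook/dinov2-small as the encoder (patch size 14), a
# hypothetical "head token" index picked from the head region, and bilinear
# upsampling standing in for the paper's lightweight decoder.
import torch
import torch.nn.functional as F
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-small")
encoder = AutoModel.from_pretrained("facebook/dinov2-small")
encoder.eval()

image = torch.rand(3, 224, 224)  # placeholder scene image in [0, 1]
inputs = processor(images=image, return_tensors="pt", do_rescale=False)

with torch.no_grad():
    outputs = encoder(**inputs, output_attentions=True)

# Attention maps from the last block: (batch, heads, tokens, tokens).
attn = outputs.attentions[-1]
num_patches = attn.shape[-1] - 1          # drop the [CLS] token
side = int(num_patches ** 0.5)            # e.g. 224 / 14 = 16

# Hypothetical choice: the patch token covering the person's head
# (here simply the center patch, for illustration).
head_token = 1 + side * (side // 2) + side // 2

# Head-averaged attention from the head token to every scene patch:
# this is the "interaction feature" intuition behind ViTGaze.
interaction = attn[:, :, head_token, 1:].mean(dim=1)       # (batch, num_patches)
interaction_map = interaction.reshape(-1, 1, side, side)

# Toy decoder: upsample the interaction map into a gaze heatmap.
heatmap = F.interpolate(interaction_map, size=(64, 64),
                        mode="bilinear", align_corners=False)
print(heatmap.shape)  # torch.Size([1, 1, 64, 64])
```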
## Results
> Results from the [ViTGaze paper](https://link.springer.com/article/10.1007/s44267-024-00064-9)

Corresponding checkpoints are released; a small snippet for inspecting them follows the list:
- GazeFollow: [GoogleDrive](https://drive.google.com/file/d/164c4woGCmUI8UrM7GEKQrV1FbA3vGwP4/view?usp=drive_link)
- VideoAttentionTarget: [GoogleDrive](https://drive.google.com/file/d/11_O4Jm5wsvQ8qfLLgTlrudqSNvvepsV0/view?usp=drive_link)
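
The released files are expected to be standard PyTorch checkpoints (detectron2-style); the `"model"` key and the file name below are assumptions, so adjust them to whatever `torch.load` actually returns.

```python
# Hedged sketch: inspect a downloaded ViTGaze checkpoint, assuming it is a
# regular PyTorch checkpoint file. Key names here are illustrative only.
import torch

ckpt = torch.load("gazefollow_checkpoint.pth", map_location="cpu")

# Detectron2-style checkpoints usually nest weights under a "model" key;
# fall back to the loaded object itself otherwise.
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```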
## Getting Started
- [Installation](docs/install.md)
- [Train](docs/train.md)
- [Eval](docs/eval.md)
## Acknowledgements
ViTGaze is based on [detectron2](https://github.com/facebookresearch/detectron2). We use the efficient multi-head attention implemented in the [xFormers](https://github.com/facebookresearch/xformers) library.
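
For reference, the xFormers memory-efficient attention op is invoked roughly as follows (a generic usage sketch with made-up shapes, not code from the ViTGaze repository):

```python
# Generic xFormers usage sketch; requires a CUDA device.
# Tensors are laid out as (batch, sequence, heads, head_dim).
import torch
import xformers.ops as xops

q = torch.randn(1, 257, 6, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 257, 6, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 257, 6, 64, device="cuda", dtype=torch.float16)

out = xops.memory_efficient_attention(q, k, v)  # same shape as q
print(out.shape)
```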
## Citation
If you find ViTGaze useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.
```bibtex
@article{song2024vitgaze,
  title   = {ViTGaze: Gaze Following with Interaction Features in Vision Transformers},
  author  = {Song, Yuehao and Wang, Xinggang and Yao, Jingfeng and Liu, Wenyu and Zhang, Jinglin and Xu, Xiangmin},
  journal = {Visual Intelligence},
  volume  = {2},
  number  = {31},
  year    = {2024},
  url     = {https://doi.org/10.1007/s44267-024-00064-9}
}
```