EVA: An Open Billion-Scale Vision Foundation Model

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Yuxin Fang²,¹, Wen Wang³,¹, Binhui Xie⁴,¹, Quan Sun¹, Ledell Wu¹, Xinggang Wang², Tiejun Huang¹, Xinlong Wang¹, Yue Cao¹

¹BAAI, ²HUST, ³ZJU, ⁴BIT

We launch EVA, a vision-centric foundation model to Explore the limits of Visual representation at scAle using only publicly accessible data and academic resources. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features (i.e., CLIP features) conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks.
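
The pretext task can be illustrated with a small sketch. This is not the training code from this repo: it assumes the loss is one minus cosine similarity between predicted and target features, averaged over masked patches only, and it uses plain Python lists instead of tensors. The function names `cosine_similarity` and `masked_feature_loss` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def masked_feature_loss(predicted, target, mask):
    """Average (1 - cosine similarity) over masked patches only.

    predicted: per-patch features predicted by the student (EVA's role)
    target:    per-patch features from a frozen teacher (CLIP's role)
    mask:      True where the patch was hidden from the student
    Assumption: the cosine-based loss is chosen for illustration.
    """
    losses = [
        1.0 - cosine_similarity(p, t)
        for p, t, m in zip(predicted, target, mask)
        if m
    ]
    return sum(losses) / len(losses)

# Toy example: 4 patches with 3-dimensional features, 2 of them masked.
target = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
predicted = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 1.0, 0.0], [5.0, 5.0, 5.0]]
mask = [False, True, True, False]

loss = masked_feature_loss(predicted, target, mask)
```

Only the two masked patches contribute: one prediction points in the same direction as its target and the other is orthogonal to it, so the average loss is 0.5. Visible patches are ignored even when their predictions are poor.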

EVA is the first open-sourced billion-scale vision foundation model that achieves state-of-the-art performance on a broad range of downstream tasks.

Table of Contents

Image Classification
- Summary of EVA's image classification performance
Video Classification
Object Detection & Instance Segmentation
- COCO 2017 (single-scale evaluation on val set)
- LVIS v1.0 (single-scale evaluation on val set)
Semantic Segmentation
- COCO-Stuff-164K
- ADE20K
EVA-CLIP
Citation
License
Contact

Image Classification

We provide all pre-trained & fine-tuned EVAs for the community. The following table summarizes the basic statistics of MIM pre-trained EVA and image classification EVA.

model name	#param.	pre-training epochs on merged-30M	intermediate fine-tuning epochs on IN-21K	fine-tuning epochs on IN-1K	IN-1K top-1 acc.	weight
`eva_psz14`	1.0B	150	-	-	-	🤗 HF link (`2GB`)
`eva_psz14to16`	1.0B	150	-	-	-	🤗 HF link (`2GB`)
`eva_21k_224px_psz14`	1.0B	150	60	-	-	🤗 HF link (`2GB`)
`eva_21k_1k_336px_psz14_ema`	1.0B	150	60	10	89.6	🤗 HF link (`4GB`)
`eva_21k_1k_560px_psz14_ema`	1.0B	150	60	15	89.7	🤗 HF link (`4GB`)

The eva_psz14to16 model interpolates the kernel size of patch_embed from 14x14 to 16x16. This is useful for object detection, instance segmentation, semantic segmentation, etc. See interpolate_patch_14to16.py for implementation details.
For MIM pre-trained EVA and EVA-CLIP, we use the DeepSpeed fp16 format. IN-1K fine-tuned EVA weights are larger (4GB vs. 2GB) because EMA updates the model in fp32 format: about 1.0B parameters at 4 bytes each, versus 2 bytes each in fp16. The weights for other downstream tasks are also in fp32 format.
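
To make the patch_embed note concrete, here is a minimal sketch of resizing a 14x14 convolution kernel to 16x16. It is an illustration only: interpolate_patch_14to16.py remains the reference, and this sketch assumes bilinear interpolation and nested Python lists shaped `[out_channels][in_channels][k][k]` rather than real checkpoint tensors. The names `resize_bilinear` and `interpolate_patch_embed` are hypothetical.

```python
def resize_bilinear(kernel, new_size):
    """Resize a square 2D kernel (list of rows) to new_size x new_size."""
    old_size = len(kernel)
    scale = old_size / new_size
    out = []
    for i in range(new_size):
        # Map the centre of output cell i back into input coordinates,
        # clamped so that border cells reuse the nearest input value.
        y = min(max((i + 0.5) * scale - 0.5, 0.0), old_size - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, old_size - 1)
        wy = y - y0
        row = []
        for j in range(new_size):
            x = min(max((j + 0.5) * scale - 0.5, 0.0), old_size - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, old_size - 1)
            wx = x - x0
            top = kernel[y0][x0] * (1 - wx) + kernel[y0][x1] * wx
            bottom = kernel[y1][x0] * (1 - wx) + kernel[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out

def interpolate_patch_embed(weight, new_size=16):
    """Resize every [k][k] kernel in a [out][in][k][k] weight."""
    return [[resize_bilinear(k, new_size) for k in per_out] for per_out in weight]

# Toy weight: 2 output channels, 3 input channels, 14x14 kernels.
weight = [[[[float(r + c) for c in range(14)] for r in range(14)]
           for _ in range(3)] for _ in range(2)]
resized = interpolate_patch_embed(weight)
```

Only the two spatial dimensions change; the channel dimensions are untouched, which is why the rest of the ViT can be reused with 16x16 patches.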

Summary of EVA's image classification performance

model	IN-1K	IN-V2	IN-ReaL	IN-Adv.	IN-Ren.	IN-Ske.	ObjectNet
EVA	89.6	81.6	90.8	86.2	88.3	67.7	60.9

Video Classification

dataset	model name	init. weight	acc@1	config	weight	logs
Kinetics722	`eva_video_k722`	`eva_psz14`	-	config	🤗 HF link (`4.8GB`)	ft_k722
Kinetics400	`eva_video_k400`	`eva_video_k722`	89.7	config	🤗 HF link (`4.8GB`)	ft_k400
Kinetics600	`eva_video_k600`	`eva_video_k722`	89.8	config	🤗 HF link (`4.8GB`)	ft_k600
Kinetics700	`eva_video_k700`	`eva_video_k722`	82.9	config	🤗 HF link (`4.8GB`)	ft_k700

Object Detection & Instance Segmentation

model name	#param.	pre-training iterations on Objects365	weight
`eva_o365`	1.1B	380k	🤗 HF link (`4GB`)

COCO 2017 (single-scale evaluation on `val` set)

init. model weight	batch size	iter	AP box	AP mask	config	model weight
`eva_o365`	64	35k	64.2	53.9	config	🤗 HF link (`4GB`)
`eva_o365`	64	45k	63.9	55.0	config	🤗 HF link (`4GB`)

LVIS v1.0 (single-scale evaluation on `val` set)

init. model weight	batch size	iter	AP box	AP mask	config	model weight
`eva_o365`	64	75k	62.2	55.0	config	🤗 HF link (`4GB`)

Semantic Segmentation

COCO-Stuff-164K

init. model weight	batch size	iter	crop size	mIoU (ss)	config	seg model weight	logs
`eva_psz14to16`	32	60k	896	53.4	config	🤗 HF link	training | evaluation

ADE20K

init. model weight	batch size	iter	crop size	mIoU	config	seg model weight	logs
`eva_sem_seg_coco`	64	20k	896	61.5 (ss) | 62.3 (ms)	config	🤗 HF link	training | evaluation

EVA-CLIP

model name	#param.	precision	data	batch size	IN-1K zero-shot top-1	weight
`eva_clip_psz14`	1.3B	`fp16`	LAION-400M	41K	78.5	🤗 HF link (`2GB`)

The ImageNet-1K zero-shot classification performance is higher than reported in our paper (78.5 vs. 78.2) because of longer training.

We choose to train a 1.3B CLIP model, not because it is easy, but because it is hard. Please refer to this note for a glimpse of the challenges in training very large CLIP models.

To our knowledge, EVA-CLIP is the largest performant open-sourced CLIP model, as evaluated by zero-shot classification performance. We will update the results in our paper soon. For more details of EVA-CLIP, please refer to Section 2.3.5 of our paper.
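
For readers new to the metric, zero-shot classification with a CLIP-style model can be sketched as follows. This is a generic illustration, not EVA-CLIP's evaluation code: the 3-dimensional embeddings are made up, and `zero_shot_classify` is a hypothetical helper. In practice the image embedding would come from the vision tower and each class embedding from the text tower applied to a prompt for that class.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_classify(image_embedding, class_text_embeddings):
    """Pick the class whose text embedding is most similar to the image.

    Returns the predicted class name and the cosine score of every class.
    """
    image = normalize(image_embedding)
    scores = {
        name: sum(a * b for a, b in zip(image, normalize(text)))
        for name, text in class_text_embeddings.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical embeddings, chosen by hand for illustration only.
class_text_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
image_embedding = [0.2, 1.0, 0.1]

predicted, scores = zero_shot_classify(image_embedding, class_text_embeddings)
```

No classifier is trained on the target labels; the class set is defined purely by the text embeddings. Top-1 accuracy is then the fraction of images whose predicted class matches the ground-truth label.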

We hope open-sourcing EVA-CLIP can facilitate future research in multi-modal learning, representation learning, AIGC, etc.

Citation

If you find our work helpful, please star this repo and cite the related articles. Thanks for your support!

@article{EVA,
  title={EVA: Exploring the Limits of Masked Visual Representation Learning at Scale},
  author={Fang, Yuxin and Wang, Wen and Xie, Binhui and Sun, Quan and Wu, Ledell and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
  journal={arXiv preprint arXiv:2211.07636},
  year={2022}
}

License

The content of this project itself is licensed under the MIT License.

Contact

For help or issues using EVA, please open a GitHub issue.

We are hiring at all levels at the BAAI Vision Team, including full-time researchers, engineers and interns. If you are interested in working with us on foundation models, self-supervised learning and multimodal learning, please contact Yue Cao (caoyue@baai.ac.cn) and Xinlong Wang (wangxinlong@baai.ac.cn).