EVA: An Open Billion-Scale Vision Foundation Model
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
Yuxin Fang2,1, Wen Wang3,1, Binhui Xie4,1, Quan Sun1, Ledell Wu1, Xinggang Wang2, Tiejun Huang1, Xinlong Wang1, Yue Cao1
We launch EVA, a vision-centric foundation model to Explore the limits of Visual representation at scAle using only publicly accessible data and academic resources. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text aligned vision features (i.e., CLIP features) conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters and set new records on a broad range of representative vision downstream tasks.
EVA is the first open-sourced billion-scale vision foundation model that achieves state-of-the-art performance on a broad range of downstream tasks.
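As a rough illustration of the pretext task described above, the sketch below computes a reconstruction loss only on masked patch positions. It assumes a negative cosine similarity objective between predicted and target features; the text above states only that masked CLIP features are reconstructed from visible patches. The function names and the toy feature vectors are hypothetical, not part of the EVA codebase.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def masked_feature_loss(predicted, target, mask):
    """Mean negative cosine similarity over masked patch positions only.

    predicted, target: one feature vector per image patch.
    mask: True where the patch was masked out and must be reconstructed.
    Assumption: the objective is negative cosine similarity.
    """
    masked = [i for i, is_masked in enumerate(mask) if is_masked]
    total = sum(-cosine_similarity(predicted[i], target[i]) for i in masked)
    return total / len(masked)


# Toy example: 4 patches with 3-dimensional features (hypothetical values).
target = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
predicted = [[9.0, 9.0, 9.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [5.0, 5.0, 5.0]]
mask = [False, True, True, False]  # only patches 1 and 2 contribute

loss = masked_feature_loss(predicted, target, mask)
print(loss)
```

Visible patches (where `mask` is `False`) do not contribute to the loss, so predictions at those positions can be arbitrary.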
Table of Contents
- Image Classification
- Video Classification
- Object Detection & Instance Segmentation
- Semantic Segmentation
- EVA-CLIP
- Citation
- License
- Contact
Image Classification
We provide all pre-trained & fine-tuned EVAs for the community. The following table summarizes the basic statistics of MIM pre-trained EVA and image classification EVA.
| model name | #param. | pre-training epochs on merged-30M | intermediate fine-tuning epochs on IN-21K | fine-tuning epochs on IN-1K | IN-1K top-1 acc. | weight |
|---|---|---|---|---|---|---|
| `eva_psz14` | 1.0B | 150 | - | - | - | 🤗 HF link (`2GB`) |
| `eva_psz14to16` | 1.0B | 150 | - | - | - | 🤗 HF link (`2GB`) |
| `eva_21k_224px_psz14` | 1.0B | 150 | 60 | - | - | 🤗 HF link (`2GB`) |
| `eva_21k_1k_336px_psz14_ema` | 1.0B | 150 | 60 | 10 | 89.6 | 🤗 HF link (`4GB`) |
| `eva_21k_1k_560px_psz14_ema` | 1.0B | 150 | 60 | 15 | 89.7 | 🤗 HF link (`4GB`) |
- The `eva_psz14to16` model interpolates the kernel size of `patch_embed` from `14x14` to `16x16`. This is useful for object detection, instance segmentation & semantic segmentation, etc. See `interpolate_patch_14to16.py` for implementation details.
- For MIM pre-trained EVA and EVA-CLIP, we use `deepspeed` `fp16` format. IN-1K fine-tuned EVA weights are larger (`4GB` vs. `2GB`) because EMA updates the model in `fp32` format (4 bytes per parameter instead of 2). The weights for other downstream tasks are also in `fp32` format.
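The kernel interpolation mentioned in the first note can be sketched as follows. This is a minimal plain-Python illustration that assumes bilinear interpolation with aligned corners; `interpolate_patch_14to16.py` may use a different interpolation mode or a tensor library, so treat `resize_kernel_bilinear` as a hypothetical helper rather than the repository's implementation. A real `patch_embed` weight holds one such 2D kernel per (output channel, input channel) pair, and the same resize would be applied to each.

```python
def resize_kernel_bilinear(kernel, new_size):
    """Resize a square 2D kernel (list of rows) to new_size x new_size.

    Assumption: bilinear interpolation with aligned corners, so the four
    corner values of the kernel are preserved.
    """
    old_size = len(kernel)
    scale = (old_size - 1) / (new_size - 1)
    resized = []
    for i in range(new_size):
        y = i * scale
        y0 = min(int(y), old_size - 2)  # upper neighbouring row
        wy = y - y0                     # weight of the lower row
        row = []
        for j in range(new_size):
            x = j * scale
            x0 = min(int(x), old_size - 2)  # left neighbouring column
            wx = x - x0                     # weight of the right column
            top = kernel[y0][x0] * (1 - wx) + kernel[y0][x0 + 1] * wx
            bottom = kernel[y0 + 1][x0] * (1 - wx) + kernel[y0 + 1][x0 + 1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        resized.append(row)
    return resized


# Toy stand-in for one 14x14 slice of a patch_embed filter.
kernel_14 = [[float(r * 14 + c) for c in range(14)] for r in range(14)]
kernel_16 = resize_kernel_bilinear(kernel_14, 16)
print(len(kernel_16), len(kernel_16[0]))
```

With a `16x16` patch size, input resolutions that are multiples of 16 or 32 divide evenly into patches, which is the usual requirement of detection and segmentation frameworks.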
Summary of EVA's image classification performance
Video Classification
| dataset | model name | init. weight | acc@1 | config | weight | logs |
|---|---|---|---|---|---|---|
| Kinetics722 | `eva_video_k722` | `eva_psz14` | - | config | 🤗 HF link (`4.8GB`) | ft_k722 |
| Kinetics400 | `eva_video_k400` | `eva_video_k722` | 89.7 | config | 🤗 HF link (`4.8GB`) | ft_k400 |
| Kinetics600 | `eva_video_k600` | `eva_video_k722` | 89.8 | config | 🤗 HF link (`4.8GB`) | ft_k600 |
| Kinetics700 | `eva_video_k700` | `eva_video_k722` | 82.9 | config | 🤗 HF link (`4.8GB`) | ft_k700 |
Object Detection & Instance Segmentation
| model name | #param. | pre-training iterations on Objects365 | weight |
|---|---|---|---|
| `eva_o365` | 1.1B | 380k | 🤗 HF link (`4GB`) |
COCO 2017 (single-scale evaluation on `val` set)

| init. model weight | batch size | iter | AP box | AP mask | config | model weight |
|---|---|---|---|---|---|---|
| `eva_o365` | 64 | 35k | 64.2 | 53.9 | config | 🤗 HF link (`4GB`) |
| `eva_o365` | 64 | 45k | 63.9 | 55.0 | config | 🤗 HF link (`4GB`) |
LVIS v1.0 (single-scale evaluation on `val` set)

| init. model weight | batch size | iter | AP box | AP mask | config | model weight |
|---|---|---|---|---|---|---|
| `eva_o365` | 64 | 75k | 62.2 | 55.0 | config | 🤗 HF link (`4GB`) |
Semantic Segmentation
COCO-Stuff-164K
| init. model weight | batch size | iter | crop size | mIoU (ss) | config | seg model weight | logs |
|---|---|---|---|---|---|---|---|
| `eva_psz14to16` | 32 | 60k | 896 | 53.4 | config | 🤗 HF link | training \| evaluation |
ADE20K
| init. model weight | batch size | iter | crop size | mIoU | config | seg model weight | logs |
|---|---|---|---|---|---|---|---|
| `eva_sem_seg_coco` | 64 | 20k | 896 | 61.5 (ss) \| 62.3 (ms) | config | 🤗 HF link | training \| evaluation |
EVA-CLIP
| model name | #param. | precision | data | batch size | IN-1K zero-shot top-1 | weight |
|---|---|---|---|---|---|---|
| `eva_clip_psz14` | 1.3B | `fp16` | LAION-400M | 41K | 78.5 | 🤗 HF link (`2GB`) |
The ImageNet-1K zero-shot classification performance is higher than in our paper (`78.5` vs. `78.2`) because of longer training.
We choose to train a 1.3B CLIP model, not because it is easy, but because it is hard. Please refer to this note for a glimpse of the challenges in training very large CLIP models.
To our knowledge, EVA-CLIP is the largest performant open-sourced CLIP model evaluated via zero-shot classification performance. We will update the results in our paper soon. For more details of EVA-CLIP, please refer to Section 2.3.5 of our paper.
We hope open-sourcing EVA-CLIP can facilitate future research in multi-modal learning, representation learning, AIGC, etc.
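For readers unfamiliar with the zero-shot classification metric reported above, the sketch below shows the general idea for a CLIP-style model: an image is assigned the class whose text embedding is most similar to the image embedding. The embeddings here are hypothetical toy vectors and `zero_shot_classify` is an illustrative helper, not the EVA-CLIP API; in practice both embeddings come from the model's image and text encoders.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_classify(image_embedding, class_text_embeddings):
    """Return the best-matching class name and all cosine similarity scores."""
    image = normalize(image_embedding)
    scores = {}
    for name, embedding in class_text_embeddings.items():
        text = normalize(embedding)
        scores[name] = sum(a * b for a, b in zip(image, text))
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical 3-dimensional embeddings for three class prompts.
class_text_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_embedding, class_text_embeddings)
print(label)
```

Zero-shot top-1 accuracy is then the fraction of evaluation images whose predicted label matches the ground-truth class, with no fine-tuning on the target dataset.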
Citation
If you find our work helpful, please star this repo and cite the related articles. Thanks for your support!
@article{EVA,
title={EVA: Exploring the Limits of Masked Visual Representation Learning at Scale},
author={Fang, Yuxin and Wang, Wen and Xie, Binhui and Sun, Quan and Wu, Ledell and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
journal={arXiv preprint arXiv:2211.07636},
year={2022}
}
License
The content of this project itself is licensed under the MIT License.
Contact
For help or issues using EVA, please open a GitHub issue.
We are hiring at all levels at BAAI Vision Team, including full-time researchers, engineers and interns.
If you are interested in working with us on foundation models, self-supervised learning and multimodal learning, please contact Yue Cao (`caoyue@baai.ac.cn`) and Xinlong Wang (`wangxinlong@baai.ac.cn`).