EVA: An Open Billion-Scale Vision Foundation Model

EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Yuxin Fang2,1, Wen Wang3,1, Binhui Xie4,1, Quan Sun1, Ledell Wu1, Xinggang Wang2, Tiejun Huang1, Xinlong Wang1, Yue Cao1

1BAAI, 2HUST, 3ZJU, 4BIT

We launch EVA, a vision-centric foundation model to Explore the limits of Visual representation at scAle, using only publicly accessible data and academic resources. EVA is a vanilla ViT pre-trained to reconstruct the masked-out, image-text aligned vision features (i.e., CLIP features) conditioned on visible image patches. Via this pretext task, we can efficiently scale EVA up to one billion parameters, and it sets new records on a broad range of representative vision downstream tasks.
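As a rough illustration of this pretext task (not the repository's actual training code), the sketch below regresses the ViT's outputs at masked patch positions onto frozen CLIP vision features with a negative cosine-similarity loss. Every interface name in it is a placeholder.

```python
import torch
import torch.nn.functional as F

def mim_clip_loss(student_vit, clip_vision_tower, images, mask):
    """Sketch of EVA's pretext task: regress the student's outputs at
    masked patch positions onto frozen CLIP vision features.

    All interfaces here are placeholders:
      student_vit(images, bool_masked_pos=...) -> (B, N, C) patch features
      clip_vision_tower(images)                -> (B, N, C) target features
      mask: (B, N) boolean tensor, True where a patch is masked out
    """
    with torch.no_grad():
        # Frozen image-text aligned targets (CLIP vision features).
        target = clip_vision_tower(images)                # (B, N, C)

    # The student sees only the visible patches (mask tokens fill the rest)
    # and predicts features for every patch position.
    pred = student_vit(images, bool_masked_pos=mask)      # (B, N, C)

    # Negative cosine similarity, computed on the masked positions only.
    pred = F.normalize(pred[mask], dim=-1)                # (M, C)
    target = F.normalize(target[mask], dim=-1)            # (M, C)
    return -(pred * target).sum(dim=-1).mean()
```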

EVA is the first open-sourced billion-scale vision foundation model that achieves state-of-the-art performance on a broad range of downstream tasks.

Table of Contents

  • Image Classification
  • Video Classification
  • Object Detection & Instance Segmentation
  • Semantic Segmentation
  • EVA-CLIP
  • Citation
  • License
  • Contact

Image Classification

We provide all pre-trained & fine-tuned EVAs for the community. The following table summarizes the basic statistics of MIM pre-trained EVA and image classification EVA.

| model name | #param. | pre-training epochs on merged-30M | intermediate fine-tuning epochs on IN-21K | fine-tuning epochs on IN-1K | IN-1K top-1 acc. | weight |
|---|---|---|---|---|---|---|
| eva_psz14 | 1.0B | 150 | - | - | - | πŸ€— HF link (2GB) |
| eva_psz14to16 | 1.0B | 150 | - | - | - | πŸ€— HF link (2GB) |
| eva_21k_224px_psz14 | 1.0B | 150 | 60 | - | - | πŸ€— HF link (2GB) |
| eva_21k_1k_336px_psz14_ema | 1.0B | 150 | 60 | 10 | 89.6 | πŸ€— HF link (4GB) |
| eva_21k_1k_560px_psz14_ema | 1.0B | 150 | 60 | 15 | 89.7 | πŸ€— HF link (4GB) |
  • The eva_psz14to16 model interpolates the patch_embed kernel from 14x14 to 16x16. This is useful for object detection, instance segmentation & semantic segmentation, etc. See interpolate_patch_14to16.py for implementation details; a rough sketch of the idea is also given after this list.
  • MIM pre-trained EVA and EVA-CLIP checkpoints are stored in DeepSpeed fp16 format. The IN-1K fine-tuned EVA weights are larger (4GB vs. 2GB) because EMA updates the model in fp32. The weights for the other downstream tasks are also stored in fp32.
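The following is a minimal sketch of that 14β†’16 kernel interpolation, assuming a checkpoint that stores the patch-embedding convolution under patch_embed.proj.weight; the key name, file name, and checkpoint layout are assumptions, and interpolate_patch_14to16.py in the repository is the authoritative implementation.

```python
import torch
import torch.nn.functional as F

def interpolate_patch_embed(state_dict, key="patch_embed.proj.weight",
                            new_size=(16, 16)):
    """Resize a ViT patch-embedding conv kernel, e.g. 14x14 -> 16x16.

    The key name and checkpoint layout are assumptions; see
    interpolate_patch_14to16.py in the repo for the actual script.
    """
    weight = state_dict[key]                      # (embed_dim, 3, 14, 14)
    weight = F.interpolate(weight.float(), size=new_size,
                           mode="bicubic", align_corners=False)
    state_dict[key] = weight
    return state_dict

# usage sketch (file name is a placeholder):
# ckpt = torch.load("eva_psz14.pt", map_location="cpu")
# ckpt = interpolate_patch_embed(ckpt)
```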

Summary of EVA's image classification performance

| model | IN-1K | IN-V2 | IN-ReaL | IN-Adv. | IN-Ren. | IN-Ske. | ObjectNet |
|---|---|---|---|---|---|---|---|
| EVA | 89.6 | 81.6 | 90.8 | 86.2 | 88.3 | 67.7 | 60.9 |
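To fetch one of the released checkpoints programmatically, a minimal huggingface_hub sketch is given below; the repo id and filename are placeholders, so take the real paths from the πŸ€— HF links in the tables above.

```python
import torch
from huggingface_hub import hf_hub_download

# Placeholder repo id / filename -- use the actual paths from the
# πŸ€— HF links in the tables above.
ckpt_path = hf_hub_download(repo_id="BAAI/EVA", filename="eva_psz14.pt")

state_dict = torch.load(ckpt_path, map_location="cpu")
print(sorted(state_dict.keys())[:10])   # inspect the checkpoint layout
```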

Video Classification

| dataset | model name | init. weight | acc@1 | config | weight | logs |
|---|---|---|---|---|---|---|
| Kinetics-722 | eva_video_k722 | eva_psz14 | - | config | πŸ€— HF link (4.8GB) | ft_k722 |
| Kinetics-400 | eva_video_k400 | eva_video_k722 | 89.7 | config | πŸ€— HF link (4.8GB) | ft_k400 |
| Kinetics-600 | eva_video_k600 | eva_video_k722 | 89.8 | config | πŸ€— HF link (4.8GB) | ft_k600 |
| Kinetics-700 | eva_video_k700 | eva_video_k722 | 82.9 | config | πŸ€— HF link (4.8GB) | ft_k700 |

Object Detection & Instance Segmentation

| model name | #param. | pre-training iterations on Objects365 | weight |
|---|---|---|---|
| eva_o365 | 1.1B | 380k | πŸ€— HF link (4GB) |

COCO 2017 (single-scale evaluation on val set)

| init. model weight | batch size | iter | AP box | AP mask | config | model weight |
|---|---|---|---|---|---|---|
| eva_o365 | 64 | 35k | 64.2 | 53.9 | config | πŸ€— HF link (4GB) |
| eva_o365 | 64 | 45k | 63.9 | 55.0 | config | πŸ€— HF link (4GB) |

LVIS v1.0 (single-scale evaluation on val set)

| init. model weight | batch size | iter | AP box | AP mask | config | model weight |
|---|---|---|---|---|---|---|
| eva_o365 | 64 | 75k | 62.2 | 55.0 | config | πŸ€— HF link (4GB) |

Semantic Segmentation

COCO-Stuff-164K

| init. model weight | batch size | iter | crop size | mIoU (ss) | config | seg model weight | logs |
|---|---|---|---|---|---|---|---|
| eva_psz14to16 | 32 | 60k | 896 | 53.4 | config | πŸ€— HF link | training \| evaluation |

ADE20K

| init. model weight | batch size | iter | crop size | mIoU (ss) | mIoU (ms) | config | seg model weight | logs |
|---|---|---|---|---|---|---|---|---|
| eva_sem_seg_coco | 64 | 20k | 896 | 61.5 | 62.3 | config | πŸ€— HF link | training \| evaluation |

EVA-CLIP

| model name | #param. | precision | data | batch size | IN-1K zero-shot top-1 | weight |
|---|---|---|---|---|---|---|
| eva_clip_psz14 | 1.3B | fp16 | LAION-400M | 41K | 78.5 | πŸ€— HF link (2GB) |

The ImageNet-1K zero-shot classification performance is higher than reported in our paper (78.5 vs. 78.2) because of longer training.
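For context, zero-shot classification with a CLIP-style model embeds a prompt per class with the text tower and picks the class whose embedding best matches the image embedding. The sketch below assumes a generic encode_image / encode_text interface and a tokenizer callable; it is not the EVA-CLIP evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(model, tokenizer, images, class_names,
                       template="a photo of a {}."):
    """Zero-shot classification with a CLIP-style model (generic sketch).

    `model.encode_image`, `model.encode_text`, and `tokenizer` are assumed
    placeholder interfaces, not the EVA-CLIP API.
    """
    # One text embedding per class, built from a simple prompt template.
    prompts = tokenizer([template.format(c) for c in class_names])
    text_feat = F.normalize(model.encode_text(prompts), dim=-1)   # (K, D)

    # Embed the images and pick the most similar class per image.
    img_feat = F.normalize(model.encode_image(images), dim=-1)    # (B, D)
    logits = img_feat @ text_feat.t()                             # (B, K)
    return logits.argmax(dim=-1)
```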

We choose to train a 1.3B CLIP model, not because it is easy, but because it is hard. Please refer to this note for a glance at the challenges of training very large CLIP models.

To our knowledge, EVA-CLIP is the largest performant open-sourced CLIP model, as measured by zero-shot classification performance. We will update the results in our paper soon. For more details of EVA-CLIP, please refer to Section 2.3.5 of our paper.

We hope open-sourcing EVA-CLIP can facilitate future research in multi-modal learning, representation learning, AIGC, etc.

Citation

If you find our work helpful, please star this repo and cite the related articles. Thanks for your support!

@article{EVA,
  title={EVA: Exploring the Limits of Masked Visual Representation Learning at Scale},
  author={Fang, Yuxin and Wang, Wen and Xie, Binhui and Sun, Quan and Wu, Ledell and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
  journal={arXiv preprint arXiv:2211.07636},
  year={2022}
}

License

The content of this project itself is licensed under the MIT License.

Contact

For help or issues using EVA, please open a GitHub issue.

We are hiring at all levels at BAAI Vision Team, including full-time researchers, engineers and interns. If you are interested in working with us on foundation models, self-supervised learning and multi-modal learning, please contact Yue Cao (caoyue@baai.ac.cn) and Xinlong Wang (wangxinlong@baai.ac.cn).
