---
license: apache-2.0
---

# Vision Transformer

Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at a resolution of 224x224. It was introduced in the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". The weights were converted from the ViT-L_16.npz file hosted in the Google Cloud Storage buckets referenced in the original repository.
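As a rough illustration of the "16x16 words" in the paper title, the sketch below shows the patch-embedding arithmetic for a 224x224 input: the image is split into 16x16 patches, each flattened into a token. The function name and layout here are illustrative assumptions, not code from the original repository.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (num_patches, P*P*C)
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, patch_size * patch_size * c)

image = np.zeros((224, 224, 3), dtype=np.float32)
tokens = patchify(image)
print(tokens.shape)  # (196, 768): 14x14 patches, each flattened to 16*16*3
```

ViT then linearly projects each flattened patch to the model dimension and prepends a learnable class token, giving a sequence length of 197 at this resolution.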