Vision Transformer Mirror

The Vision Transformer (ViT) is a transformer encoder model pretrained with supervised learning on a large image collection, namely ImageNet-21k, at a resolution of 224x224 pixels. Images are presented to the model as a sequence of fixed-size patches (16x16 resolution), which are linearly embedded. A [CLS] token is also added to the beginning of the sequence for use in classification tasks. Absolute position embeddings are added before the sequence is fed into the layers of the Transformer encoder. Note that this model does not provide any fine-tuned heads, as these were zeroed by Google researchers. It does, however, include the pretrained pooler, which can be used for downstream tasks such as image classification. Through pretraining, the model learns an internal representation of images that can then be used to extract features for downstream tasks: for example, if you have a dataset of labeled images, you can train a standard classifier by placing a linear layer on top of the pretrained encoder. The linear layer is typically placed on top of the [CLS] token, because the last hidden state of that token can be seen as a representation of the entire image.
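
As a minimal sketch of this last point, the snippet below attaches a linear head to the [CLS] hidden state. It assumes the checkpoint can be loaded with the Hugging Face transformers ViTModel class; the identifier "google/vit-base-patch16-224-in21k" and num_labels=10 are illustrative placeholders, not values specified by this repository.

import torch
from transformers import ViTModel

class ViTClassifier(torch.nn.Module):
    """Illustrative sketch: a linear classifier on top of the pretrained ViT encoder."""
    def __init__(self, num_labels=10):  # num_labels is a placeholder
        super().__init__()
        # Placeholder checkpoint identifier; substitute your own local path if needed.
        self.encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, pixel_values):
        outputs = self.encoder(pixel_values=pixel_values)
        cls_state = outputs.last_hidden_state[:, 0]  # last hidden state of the [CLS] token
        return self.head(cls_state)                  # logits over num_labels classes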

Usage

from modelscope import snapshot_download

# Download the model snapshot from ModelScope; model_dir is the local cache path.
model_dir = snapshot_download('Genius-Society/ViT')
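
Assuming the downloaded snapshot is a standard transformers-format ViT checkpoint (the hosting page does not declare a library, so this is an assumption), it can be used for feature extraction roughly as follows; "example.jpg" is a placeholder path.

from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Continue from the model_dir returned by snapshot_download above.
processor = ViTImageProcessor.from_pretrained(model_dir)
model = ViTModel.from_pretrained(model_dir)

image = Image.open("example.jpg").convert("RGB")        # placeholder image path
inputs = processor(images=image, return_tensors="pt")   # resizes to 224x224 and normalizes
outputs = model(**inputs)
cls_features = outputs.last_hidden_state[:, 0]          # image-level feature from the [CLS] token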

Maintenance

# Clone the Hugging Face repository over SSH to work on the mirror
git clone git@hf.co:Genius-Society/ViT
cd ViT

Reference

[1] Dosovitskiy et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021.
