Vision Transformer (base-sized model)
Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale by Dosovitskiy et al. and first released in this repository. However, the weights were converted from the timm repository by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him.
Disclaimer: The team releasing ViT did not write a model card for this model so this model card has been written by the Hugging Face team.
Dataset
The dataset spans six classes: glass, paper, cardboard, plastic, metal, and trash. It currently consists of 2,527 images, distributed as follows (summarized in code after the list):
- 501 glass
- 594 paper
- 403 cardboard
- 482 plastic
- 410 metal
- 137 trash
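For reference, the per-class counts above can be gathered into a simple mapping; this is a small convenience snippet (the variable name is illustrative), not code from the original notebook:

```python
# Per-class image counts as listed above; variable name is illustrative.
class_counts = {"glass": 501, "paper": 594, "cardboard": 403,
                "plastic": 482, "metal": 410, "trash": 137}
assert sum(class_counts.values()) == 2527  # total dataset size
```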
Fine-tuning Notebook
This notebook outlines the steps from preparing the data in a ViT-compatible format to training the model: Notebook
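The full workflow lives in the linked notebook. As a rough orientation only, a minimal fine-tuning sketch along these lines could look as follows; the base checkpoint, dataset folder layout, and hyperparameters shown here are assumptions for illustration, not the notebook's exact values:

```python
# A minimal fine-tuning sketch under assumed settings; NOT the notebook's exact code.
import torch
from datasets import load_dataset
from transformers import (AutoFeatureExtractor, ViTForImageClassification,
                          TrainingArguments, Trainer)

# Six classes listed above; imagefolder assigns labels from sorted folder names,
# so this alphabetical order is an assumption about the local dataset layout.
labels = ["cardboard", "glass", "metal", "paper", "plastic", "trash"]
id2label = {i: l for i, l in enumerate(labels)}
label2id = {l: i for i, l in enumerate(labels)}

feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)

# Hypothetical local folder with one subdirectory per class (imagefolder layout).
dataset = load_dataset("imagefolder", data_dir="trash_images")

def transform(batch):
    # Convert PIL images to the pixel_values tensors ViT expects.
    inputs = feature_extractor([img.convert("RGB") for img in batch["image"]],
                               return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

prepared = dataset.with_transform(transform)

def collate_fn(batch):
    # Stack per-example tensors into a training batch.
    return {
        "pixel_values": torch.stack([x["pixel_values"] for x in batch]),
        "labels": torch.tensor([x["labels"] for x in batch]),
    }

args = TrainingArguments(
    output_dir="vit-trash-classification",   # illustrative output path
    per_device_train_batch_size=16,          # illustrative hyperparameters
    num_train_epochs=4,
    learning_rate=2e-4,
    remove_unused_columns=False,  # keep the raw "image" column for the transform
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=prepared["train"],
    data_collator=collate_fn,
    tokenizer=feature_extractor,
)
trainer.train()
```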
How to use
Just copy the lines below:
```python
from transformers import AutoFeatureExtractor, AutoModelForImageClassification
from PIL import Image
import requests

# Load an example image from the web
url = 'https://www.estal.com/FitxersWeb/331958/estal_carroussel_wg_spirits_5.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the feature extractor and the fine-tuned model from the Hub
feature_extractor = AutoFeatureExtractor.from_pretrained("Aalaa/Fine_tuned_Vit_trash_classification")
model = AutoModelForImageClassification.from_pretrained("Aalaa/Fine_tuned_Vit_trash_classification")

# Preprocess the image and run a forward pass
inputs = feature_extractor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits

# The model predicts one of the six trash classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```
For more code examples, we refer to the documentation.
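Alternatively (not part of the original card), the same checkpoint can be queried through the high-level pipeline API, which handles preprocessing and label mapping internally:

```python
from transformers import pipeline

# Image-classification pipeline wrapping the fine-tuned checkpoint
classifier = pipeline("image-classification", model="Aalaa/Fine_tuned_Vit_trash_classification")

# The pipeline accepts a URL, a local path, or a PIL image
predictions = classifier("https://www.estal.com/FitxersWeb/331958/estal_carroussel_wg_spirits_5.jpg")
print(predictions)  # list of {"label": ..., "score": ...} dicts, highest score first
```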