---
license: cc-by-nc-4.0
library_name: transformers
tags:
- vision
- pretraining
- racing
- formula1
---

# Vision Transformer pre-trained with MAE on a Formula 1 racing dataset

Base-sized Vision Transformer (ViT-Base) feature model. Pre-trained with the [Masked Autoencoder (MAE) self-supervised approach](https://arxiv.org/abs/2111.06377) on a custom Formula 1 racing dataset from [Constructor SportsTech](https://constructor.tech/solutions/sports-tech), it extracts features that are more effective for computer vision tasks in the racing and Formula 1 domain than features pre-trained on standard ImageNet-1K.

This ViT model is ready for use with the [Transformers implementation of MAE](https://huggingface.co/facebook/vit-mae-base).

## Model Details

- Model type: feature backbone
- Image size: 224 x 224
- Original MAE repo: https://github.com/facebookresearch/mae
- Original paper: [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)

## Training Procedure

F1 ViT-base MAE was pre-trained on a custom dataset of more than 1 million Formula 1 images from the 2021, 2022, and 2023 seasons, covering both racing and non-racing scenes. Training was performed on a cluster of 8 A100 80GB GPUs provided by [Nebius](https://nebius.com/), who invited us to the technical preview of their platform.

### Training Hyperparameters

- Masking proportion during pre-training: 75%
- Normalized pixel targets during pre-training: False
- Epochs: 500
- Batch size: 4096
- Learning rate: 3e-3
- Warmup: 40 epochs
- Optimizer: AdamW

## Comparison with ViT-base MAE pre-trained on ImageNet-1K

Comparison of F1 ViT-base MAE with the [original ViT-base MAE pre-trained on ImageNet-1K](https://huggingface.co/facebook/vit-mae-base) by reconstruction quality on images from the Formula 1 domain. Top: F1 ViT-base MAE reconstruction output; bottom: original ViT-base MAE.
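Reconstructions like the ones above can be produced with the ViTMAE API in Transformers. Below is a minimal sketch; the example image is the one from the usage section further down, and stitching visible input patches together with reconstructed masked patches is one illustrative way to render the output:

```python
import numpy as np
import requests
from PIL import Image
from transformers import AutoImageProcessor, ViTMAEForPreTraining

processor = AutoImageProcessor.from_pretrained('andrewbo29/vit-mae-base-formula1')
model = ViTMAEForPreTraining.from_pretrained('andrewbo29/vit-mae-base-formula1')

url = 'https://huggingface.co/andrewbo29/vit-mae-base-formula1/resolve/main/racing_example.jpg'
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = processor(images=image, return_tensors="pt").pixel_values

outputs = model(pixel_values)

# The decoder predicts one vector of patch pixels per patch; fold them back into an image.
reconstruction = model.unpatchify(outputs.logits)

# Expand the per-patch mask (1 = masked, 0 = visible) to pixel resolution.
mask = outputs.mask.unsqueeze(-1).repeat(1, 1, model.config.patch_size ** 2 * 3)
mask = model.unpatchify(mask)

# Keep visible patches from the input, take masked patches from the reconstruction.
composite = pixel_values * (1 - mask) + reconstruction * mask

# Undo the processor normalization for display.
mean = np.array(processor.image_mean)
std = np.array(processor.image_std)
img = composite[0].permute(1, 2, 0).detach().numpy() * std + mean
Image.fromarray((img.clip(0, 1) * 255).astype(np.uint8)).show()
```

Because this checkpoint was pre-trained without normalized pixel targets (see the hyperparameters above), the decoder output lives directly in normalized pixel space and can be rendered this way without per-patch de-normalization.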
## How to use

Usage is the same as for the [Transformers implementation of MAE](https://huggingface.co/facebook/vit-mae-base).

```python
from transformers import AutoImageProcessor, ViTMAEForPreTraining
from PIL import Image
import requests

# Use a /resolve/ URL so the raw image file is downloaded, not the HTML page.
url = 'https://huggingface.co/andrewbo29/vit-mae-base-formula1/resolve/main/racing_example.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained('andrewbo29/vit-mae-base-formula1')
model = ViTMAEForPreTraining.from_pretrained('andrewbo29/vit-mae-base-formula1')

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

loss = outputs.loss                # reconstruction loss on masked patches
mask = outputs.mask                # which patches were masked (1) vs. visible (0)
ids_restore = outputs.ids_restore  # permutation restoring the original patch order
```
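To use the checkpoint as a feature backbone, the encoder can be loaded on its own via `ViTMAEModel`. The sketch below is a minimal example; setting `mask_ratio=0.0` so the encoder sees every patch and mean-pooling the patch tokens are illustrative choices, not prescribed by this model:

```python
import torch
from transformers import AutoImageProcessor, ViTMAEModel
from PIL import Image
import requests

processor = AutoImageProcessor.from_pretrained('andrewbo29/vit-mae-base-formula1')
# mask_ratio=0.0 disables random masking so that all patches are encoded.
model = ViTMAEModel.from_pretrained('andrewbo29/vit-mae-base-formula1', mask_ratio=0.0)

url = 'https://huggingface.co/andrewbo29/vit-mae-base-formula1/resolve/main/racing_example.jpg'
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: (batch, 1 + num_patches, 768); token 0 is [CLS].
# The encoder emits patch tokens in a shuffled order, so we pool with an
# order-invariant mean over the patch tokens.
features = outputs.last_hidden_state[:, 1:, :].mean(dim=1)  # (batch, 768)
```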
## BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2111-06377,
  author     = {Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'{a}}r and Ross B. Girshick},
  title      = {Masked Autoencoders Are Scalable Vision Learners},
  journal    = {CoRR},
  volume     = {abs/2111.06377},
  year       = {2021},
  url        = {https://arxiv.org/abs/2111.06377},
  eprinttype = {arXiv},
  eprint     = {2111.06377},
  timestamp  = {Tue, 16 Nov 2021 12:12:31 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2111-06377.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```