---
license: cc-by-nc-4.0
library_name: transformers
tags:
- vision
- pretraining
- racing
- formula1
---

# Vision Transformer pre-trained with MAE on a Formula 1 racing dataset

Base-sized Vision Transformer (ViT-Base) feature model. Pre-trained with the [Masked Autoencoder (MAE) self-supervised approach](https://arxiv.org/abs/2111.06377) on a custom Formula 1 racing dataset from [Constructor SportsTech](https://constructor.tech/solutions/sports-tech), it extracts features that are more effective for computer vision tasks in the racing and Formula 1 domain than features pre-trained on standard ImageNet-1K.

This ViT model is ready for use with the [Transformers implementation of MAE](https://huggingface.co/facebook/vit-mae-base).

## Model Details

- Model type: feature backbone
- Image size: 224 x 224
- Original MAE repo: https://github.com/facebookresearch/mae
- Original paper: [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377)

## Training Procedure

F1 ViT-base MAE was pre-trained on a custom dataset of more than 1 million Formula 1 images from the 2021, 2022, and 2023 seasons, covering both racing and non-racing scenes. Training was performed on a cluster of 8 A100 80GB GPUs provided by [Nebius](https://nebius.com/), who invited us to the technical preview of their platform.

### Training Hyperparameters

- Masking proportion during pre-training: 75%
- Normalized pixel targets during pre-training: False
- Epochs: 500
- Batch size: 4096
- Learning rate: 3e-3
- Warmup: 40 epochs
- Optimizer: AdamW

## Comparison with ViT-base MAE pre-trained on ImageNet-1K

Comparison of F1 ViT-base MAE with the [original ViT-base MAE pre-trained on ImageNet-1K](https://huggingface.co/facebook/vit-mae-base) by reconstruction quality on images from the Formula 1 domain. Top: F1 ViT-base MAE reconstruction output; bottom: original ViT-base MAE.
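Reconstructions like the ones above can be produced with the ViTMAE API in Transformers. Below is a minimal sketch; the example image is the one from the usage section further down, and stitching visible input patches together with reconstructed masked patches is one illustrative way to render the output:

```python
import numpy as np
import requests
from PIL import Image
from transformers import AutoImageProcessor, ViTMAEForPreTraining

processor = AutoImageProcessor.from_pretrained('andrewbo29/vit-mae-base-formula1')
model = ViTMAEForPreTraining.from_pretrained('andrewbo29/vit-mae-base-formula1')

url = 'https://huggingface.co/andrewbo29/vit-mae-base-formula1/resolve/main/racing_example.jpg'
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = processor(images=image, return_tensors="pt").pixel_values

outputs = model(pixel_values)

# The decoder predicts one vector of patch pixels per patch; fold them back into an image.
reconstruction = model.unpatchify(outputs.logits)

# Expand the per-patch mask (1 = masked, 0 = visible) to pixel resolution.
mask = outputs.mask.unsqueeze(-1).repeat(1, 1, model.config.patch_size ** 2 * 3)
mask = model.unpatchify(mask)

# Keep visible patches from the input, take masked patches from the reconstruction.
composite = pixel_values * (1 - mask) + reconstruction * mask

# Undo the processor normalization for display.
mean = np.array(processor.image_mean)
std = np.array(processor.image_std)
img = composite[0].permute(1, 2, 0).detach().numpy() * std + mean
Image.fromarray((img.clip(0, 1) * 255).astype(np.uint8)).show()
```

Because this checkpoint was pre-trained without normalized pixel targets (see the hyperparameters above), the decoder output lives directly in normalized pixel space and can be rendered this way without per-patch de-normalization.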
## How to use

Usage is the same as for the [Transformers implementation of MAE](https://huggingface.co/facebook/vit-mae-base).

```python
from transformers import AutoImageProcessor, ViTMAEForPreTraining
from PIL import Image
import requests

# Use a /resolve/ URL so the raw image file is downloaded, not the HTML page.
url = 'https://huggingface.co/andrewbo29/vit-mae-base-formula1/resolve/main/racing_example.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained('andrewbo29/vit-mae-base-formula1')
model = ViTMAEForPreTraining.from_pretrained('andrewbo29/vit-mae-base-formula1')

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)

loss = outputs.loss                # reconstruction loss on masked patches
mask = outputs.mask                # which patches were masked (1) vs. visible (0)
ids_restore = outputs.ids_restore  # permutation restoring the original patch order
```
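To use the checkpoint as a feature backbone, the encoder can be loaded on its own via `ViTMAEModel`. The sketch below is a minimal example; setting `mask_ratio=0.0` so the encoder sees every patch and mean-pooling the patch tokens are illustrative choices, not prescribed by this model:

```python
import torch
from transformers import AutoImageProcessor, ViTMAEModel
from PIL import Image
import requests

processor = AutoImageProcessor.from_pretrained('andrewbo29/vit-mae-base-formula1')
# mask_ratio=0.0 disables random masking so that all patches are encoded.
model = ViTMAEModel.from_pretrained('andrewbo29/vit-mae-base-formula1', mask_ratio=0.0)

url = 'https://huggingface.co/andrewbo29/vit-mae-base-formula1/resolve/main/racing_example.jpg'
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: (batch, 1 + num_patches, 768); token 0 is [CLS].
# The encoder emits patch tokens in a shuffled order, so we pool with an
# order-invariant mean over the patch tokens.
features = outputs.last_hidden_state[:, 1:, :].mean(dim=1)  # (batch, 768)
```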
## BibTeX entry and citation info

```bibtex
@article{DBLP:journals/corr/abs-2111-06377,
  author     = {Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'{a}}r and Ross B. Girshick},
  title      = {Masked Autoencoders Are Scalable Vision Learners},
  journal    = {CoRR},
  volume     = {abs/2111.06377},
  year       = {2021},
  url        = {https://arxiv.org/abs/2111.06377},
  eprinttype = {arXiv},
  eprint     = {2111.06377},
  timestamp  = {Tue, 16 Nov 2021 12:12:31 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2111-06377.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```