---
license: other
license_name: stem.ai.mtl
license_link: LICENSE
tags:
- vision
- image-classification
- STEM-AI-mtl/City_map
- Google
- ViT
- STEM-AI-mtl
datasets:
- STEM-AI-mtl/City_map
widget:
- image: https://cdn.britannica.com/50/69550-050-B9DA3DCA/Central-New-York-City-borough-Manhattan-Park.jpg
output:
text: NYC
metrics:
- accuracy
---
# The fine-tuned ViT model that beats [Google's state-of-the-art model](https://huggingface.co/google/vit-base-patch16-224) and OpenAI's famous GPT-4
An image-classification model fine-tuned to identify which city's map is depicted in an input image.
The Vision Transformer (ViT) base model is a transformer encoder model (BERT-like) pretrained on a large collection of images in a supervised fashion, namely ImageNet-21k, at a resolution of 224x224 pixels. Next, the model was fine-tuned on ImageNet (also referred to as ILSVRC2012), a dataset comprising 1 million images and 1,000 classes, also at resolution 224x224.
## How to use
[Inference script](https://github.com/STEM-ai/Vision/raw/7d92c8daa388eb74e8c336f2d0d3942722fec3c6/ViT_inference.py)
For more code examples, we refer to the [documentation](https://huggingface.co/transformers/model_doc/vit.html#).
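Below is a minimal inference sketch using the `transformers` library. The `model_id` is a placeholder and should be replaced with this repository's actual name on the Hub; the example image is the widget image from this card.

```python
# Minimal inference sketch for the fine-tuned city-map classifier.
from PIL import Image
import requests
import torch
from transformers import ViTImageProcessor, ViTForImageClassification

model_id = "STEM-AI-mtl/City_map-vit"  # hypothetical id -- replace with this repository's name

processor = ViTImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)

# Widget example image from this model card (a map of Manhattan).
url = "https://cdn.britannica.com/50/69550-050-B9DA3DCA/Central-New-York-City-borough-Manhattan-Park.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])  # expected: "NYC"
```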
## Training data
This model is [Google's ViT-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) fine-tuned for city identification on the [STEM-AI-mtl/City_map dataset](https://huggingface.co/datasets/STEM-AI-mtl/City_map), containing over 600 images of 45 different city maps from around the world.
## Training procedure
Fine-tuning was performed on [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) using a 4 GB Nvidia GTX 1650 GPU.
[Training notebook](https://github.com/STEM-ai/Vision/raw/7d92c8daa388eb74e8c336f2d0d3942722fec3c6/Trainer_ViT.ipynb)
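As a rough sketch of what such a fine-tuning run can look like with the Hugging Face `Trainer`: the dataset column names (`image`, `label`), batch size, and number of epochs below are assumptions, and only the 1e-3 learning rate is taken from the results reported in the next section. Refer to the training notebook above for the actual procedure.

```python
# Fine-tuning sketch: google/vit-base-patch16-224 on STEM-AI-mtl/City_map.
import torch
from datasets import load_dataset
from transformers import (ViTImageProcessor, ViTForImageClassification,
                          TrainingArguments, Trainer)

dataset = load_dataset("STEM-AI-mtl/City_map")
labels = dataset["train"].features["label"].names  # assumes a ClassLabel column named "label"

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=len(labels),
    id2label={i: l for i, l in enumerate(labels)},
    label2id={l: i for i, l in enumerate(labels)},
    ignore_mismatched_sizes=True,  # replace the 1,000-class ImageNet head
)

def transform(batch):
    # Convert PIL images to 224x224 pixel tensors expected by ViT.
    inputs = processor([img.convert("RGB") for img in batch["image"]], return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

dataset = dataset.with_transform(transform)

def collate(batch):
    return {
        "pixel_values": torch.stack([x["pixel_values"] for x in batch]),
        "labels": torch.tensor([x["labels"] for x in batch]),
    }

args = TrainingArguments(
    output_dir="vit-city-map",
    learning_rate=1e-3,              # best-performing rate reported below
    per_device_train_batch_size=8,   # small batch to fit a 4 GB GPU
    num_train_epochs=3,
    remove_unused_columns=False,     # keep the "image" column for the transform
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], data_collator=collate)
trainer.train()
```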
## Training evaluation results
The most accurate model was obtained with a learning rate of 1e-3. The quality of the training was evaluated on the training dataset and resulted in the following metrics:
- eval_loss: 1.3691096305847168
- eval_accuracy: 0.6666666666666666
- eval_runtime: 13.0277
- eval_samples_per_second: 4.606
- eval_steps_per_second: 0.154
- epoch: 2.82
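For reference, a sketch of how metrics like these can be produced, reusing the hypothetical `model`, `args`, `dataset`, and `collate` objects from the training sketch above:

```python
# Evaluation sketch: compute loss and accuracy with the Hugging Face Trainer.
import numpy as np
import evaluate
from transformers import Trainer

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=args,
    eval_dataset=dataset["train"],  # the card reports evaluation on the training split
    data_collator=collate,
    compute_metrics=compute_metrics,
)
print(trainer.evaluate())  # eval_loss, eval_accuracy, eval_runtime, ...
```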
## Model Card Authors
STEM.AI: stem.ai.mtl@gmail.com\
[William Harbec](https://www.linkedin.com/in/william-harbec-56a262248/)