add model card
README.md
ADDED
@@ -0,0 +1,86 @@
---
license: other
tags:
- vision
- image-segmentation
datasets:
- pascal-voc
widget:
- src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg
  example_title: Tiger
---

# MobileViT + DeepLabV3 (small-sized model)

MobileViT model with a DeepLabV3 head, fine-tuned on PASCAL VOC at a resolution of 512x512. It was introduced in [MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer](https://arxiv.org/abs/2110.02178) by Sachin Mehta and Mohammad Rastegari, and first released in [this repository](https://github.com/apple/ml-cvnets). The license used is the [Apple sample code license](https://github.com/apple/ml-cvnets/blob/main/LICENSE).

Disclaimer: The team releasing MobileViT did not write a model card for this model, so this model card has been written by the Hugging Face team.

## Model description

MobileViT is a light-weight, low-latency convolutional neural network that combines MobileNetV2-style layers with a new block that replaces local processing in convolutions with global processing using transformers. As with ViT (Vision Transformer), the image data is converted into flattened patches before it is processed by the transformer layers. Afterwards, however, the patches are "unflattened" back into feature maps. This allows the MobileViT block to be placed anywhere inside a CNN. MobileViT does not require any positional embeddings.

The model in this repo adds a [DeepLabV3](https://arxiv.org/abs/1706.05587) head to the MobileViT backbone for semantic segmentation.
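
The flatten/unflatten step described above can be illustrated with plain array reshapes. The following is a minimal sketch, not the library's implementation: the function names `unfold_patches` and `fold_patches`, the 2x2 patch size, and the array shapes are assumptions chosen for illustration, and NumPy stands in for the actual tensor library.

```python
import numpy as np


def unfold_patches(feature_map: np.ndarray, patch: int = 2) -> np.ndarray:
    """Turn a (B, C, H, W) feature map into (B * patch * patch, N, C) tokens.

    N is the number of patches. Each pixel position inside a patch gets its
    own sequence, so a transformer layer would attend across patches.
    """
    b, c, h, w = feature_map.shape
    if h % patch or w % patch:
        raise ValueError("height and width must be divisible by the patch size")
    nh, nw = h // patch, w // patch
    x = feature_map.reshape(b, c, nh, patch, nw, patch)
    # Reorder to (B, patch_row, patch_col, nh, nw, C).
    x = x.transpose(0, 3, 5, 2, 4, 1)
    return x.reshape(b * patch * patch, nh * nw, c)


def fold_patches(tokens: np.ndarray, batch: int, height: int, width: int,
                 patch: int = 2) -> np.ndarray:
    """Inverse of unfold_patches: restore the (B, C, H, W) feature map."""
    channels = tokens.shape[-1]
    nh, nw = height // patch, width // patch
    x = tokens.reshape(batch, patch, patch, nh, nw, channels)
    # Reorder back to (B, C, nh, patch_row, nw, patch_col).
    x = x.transpose(0, 5, 3, 1, 4, 2)
    return x.reshape(batch, channels, height, width)


features = np.arange(1 * 3 * 4 * 4, dtype=np.float32).reshape(1, 3, 4, 4)
tokens = unfold_patches(features)  # transformer layers would process these
restored = fold_patches(tokens, batch=1, height=4, width=4)
```

Because the round trip restores the original layout exactly, the output of such a block is again an ordinary feature map that the following convolutional layers can consume.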

## Intended uses & limitations

You can use the raw model for semantic segmentation. See the [model hub](https://huggingface.co/models?search=mobilevit) to look for fine-tuned versions on a task that interests you.

### How to use

Here is how to use this model:

```python
from transformers import MobileViTFeatureExtractor, MobileViTForSemanticSegmentation
from PIL import Image
import requests

# Download a sample image from the COCO validation set.
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

# Load the preprocessing pipeline and the segmentation model from the Hub.
feature_extractor = MobileViTFeatureExtractor.from_pretrained('Matthijs/deeplabv3-mobilevit-small')
model = MobileViTForSemanticSegmentation.from_pretrained('Matthijs/deeplabv3-mobilevit-small')

# Preprocess the image and return PyTorch tensors.
inputs = feature_extractor(images=image, return_tensors="pt")

outputs = model(**inputs)
logits = outputs.logits  # shape: (batch_size, num_labels, height, width)

# Pick the highest-scoring class for every pixel and drop the batch dimension.
predicted_mask = logits.argmax(1).squeeze(0)
```

Currently, both the feature extractor and model support PyTorch.
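
The logits returned by a segmentation head may have a lower spatial resolution than the input image, so the mask often needs to be resized before it can be overlaid on the original picture. The sketch below shows one common way to do that with bilinear interpolation. The helper name `logits_to_mask`, the 32x32 logits grid, and the 480x640 target size are assumptions for illustration; a random tensor stands in for `outputs.logits`.

```python
import torch
import torch.nn.functional as F


def logits_to_mask(logits: torch.Tensor, target_size: tuple) -> torch.Tensor:
    """Upsample (B, num_labels, h, w) logits to target_size=(H, W), then argmax."""
    upsampled = F.interpolate(
        logits, size=target_size, mode="bilinear", align_corners=False
    )
    # Highest-scoring class index per pixel, shape (B, H, W).
    return upsampled.argmax(dim=1)


# Stand-in for `outputs.logits`: 21 PASCAL VOC classes on an assumed 32x32 grid.
dummy_logits = torch.randn(1, 21, 32, 32)
mask = logits_to_mask(dummy_logits, target_size=(480, 640))
```

With a PIL image, `image.size` is `(width, height)`, so the target size would be `image.size[::-1]`.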

## Training data

The MobileViT + DeepLabV3 model was pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k), a dataset consisting of about 1.28 million training images and 1,000 classes, and then fine-tuned on the [PASCAL VOC2012](http://host.robots.ox.ac.uk/pascal/VOC/) dataset.

## Training procedure

### Preprocessing

At inference time, images are center-cropped to 512x512. Pixel values are scaled to the range [0, 1]. Images are expected to be in BGR channel order, not RGB.
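
The feature extractor in the usage example is meant to take care of these steps. For readers who want to see what they amount to, here is a minimal NumPy sketch under stated assumptions: the function name `preprocess` is hypothetical, the input is assumed to be an RGB `uint8` array at least 512 pixels on each side, and the channels-first output layout is an assumption rather than something stated in this card.

```python
import numpy as np


def preprocess(image_rgb: np.ndarray, crop: int = 512) -> np.ndarray:
    """Center-crop an (H, W, 3) uint8 RGB image, scale to [0, 1], flip to BGR.

    Returns a float32 array of shape (3, crop, crop).
    """
    h, w, _ = image_rgb.shape
    if h < crop or w < crop:
        raise ValueError("image is smaller than the crop size; resize it first")
    top, left = (h - crop) // 2, (w - crop) // 2
    cropped = image_rgb[top:top + crop, left:left + crop]
    scaled = cropped.astype(np.float32) / 255.0  # pixel values in [0, 1]
    bgr = scaled[..., ::-1]                      # RGB -> BGR
    return np.ascontiguousarray(bgr.transpose(2, 0, 1))


# A pure-red test image: after the flip, red ends up in the last channel.
red = np.zeros((600, 800, 3), dtype=np.uint8)
red[..., 0] = 255
pixel_values = preprocess(red)
```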

### Pretraining

The MobileViT networks are trained from scratch for 300 epochs on ImageNet-1k on 8 NVIDIA GPUs with an effective batch size of 1024. The learning rate is warmed up for 3k steps and then follows cosine annealing. Training also uses label-smoothing cross-entropy loss and L2 weight decay. The training resolution varies from 160x160 to 320x320, using multi-scale sampling.

To obtain the DeepLabV3 model, MobileViT was fine-tuned on the PASCAL VOC dataset using 4 NVIDIA A100 GPUs.
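
The warmup-then-cosine learning rate schedule mentioned for pretraining can be sketched in a few lines. Only the 3k warmup steps come from this card. The linear shape of the warmup, the peak and minimum learning rates, and the total step count below are placeholder assumptions, and `learning_rate` is a hypothetical helper rather than the original training code.

```python
import math


def learning_rate(step: int, total_steps: int, warmup_steps: int = 3000,
                  base_lr: float = 2e-3, min_lr: float = 2e-4) -> float:
    """Linear warmup to base_lr, then cosine annealing down to min_lr.

    base_lr, min_lr and total_steps are illustrative values, not taken
    from this model card.
    """
    if step < warmup_steps:
        return min_lr + (base_lr - min_lr) * step / warmup_steps
    # Fraction of the annealing phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points of a hypothetical 10,000-step run.
schedule = [learning_rate(s, total_steps=10_000) for s in (0, 1500, 3000, 6500, 10_000)]
```

The rate rises during the first 3,000 steps, peaks at `base_lr`, and then decays smoothly to `min_lr` by the final step.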

## Evaluation results

| Model            | PASCAL VOC mIoU | # params  | URL                                                          |
|------------------|-----------------|-----------|--------------------------------------------------------------|
| MobileViT-XXS    | 73.6            | 1.9 M     | https://huggingface.co/Matthijs/deeplabv3-mobilevit-xx-small |
| MobileViT-XS     | 77.1            | 2.9 M     | https://huggingface.co/Matthijs/deeplabv3-mobilevit-x-small  |
| **MobileViT-S**  | **79.1**        | **6.4 M** | https://huggingface.co/Matthijs/deeplabv3-mobilevit-small    |
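
For reference, mean intersection-over-union (mIoU) averages the per-class ratio of correctly labeled pixels to the union of predicted and ground-truth pixels. The sketch below computes it from a confusion matrix. It is an illustration, not the evaluation script behind the table: the function name `mean_iou` is hypothetical, and treating label 255 as an ignored "void" label and skipping classes that never occur are assumptions.

```python
import numpy as np


def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int,
             ignore_index: int = 255) -> float:
    """Mean IoU over classes that occur in the prediction or the ground truth."""
    valid = target != ignore_index
    pred = pred[valid].astype(np.int64)
    target = target[valid].astype(np.int64)
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    cm = np.bincount(target * num_classes + pred, minlength=num_classes ** 2)
    cm = cm.reshape(num_classes, num_classes)
    intersection = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    present = union > 0
    return float((intersection[present] / union[present]).mean())


target = np.array([[0, 0, 1], [1, 2, 255]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
score = mean_iou(pred, target, num_classes=3)  # per-class IoU: 1/2, 2/3, 1
```

The table reports this quantity as a percentage, so a score of 0.791 would correspond to 79.1.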

### BibTeX entry and citation info

```bibtex
@inproceedings{vision-transformer,
  title = {MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer},
  author = {Sachin Mehta and Mohammad Rastegari},
  year = {2022},
  URL = {https://arxiv.org/abs/2110.02178}
}
```