Update README.md
README.md
CHANGED
@@ -21,7 +21,7 @@ SmolVLM is a compact open multimodal model that accepts arbitrary sequences of i
 - **Model type:** Multi-modal model (image+text)
 - **Language(s) (NLP):** English
 - **License:** Apache 2.0
-- **Architecture:** Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see
+- **Architecture:** Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see technical summary)
 
 ## Resources
 
@@ -160,15 +160,19 @@ We release the SmolVLM checkpoints under the Apache 2.0 license.
 
 ### Training Data
 
-
+The training data comes from [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) and [Docmatix](https://huggingface.co/datasets/HuggingFaceM4/Docmatix) datasets, with emphasis on document understanding (25%) and image captioning (18%), while maintaining balanced coverage across other crucial capabilities like visual reasoning, chart comprehension, and general instruction following.<img src="https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct/resolve/main/mixture_the_cauldron.png" alt="Example Image" style="width:70%;" />
 
-The training data is: ![Training data](smolvlm-data.pdf)
 
 
-#### Speeds, Sizes, Times [optional]
-
-TODO
 
 ## Evaluation
 
-
+| Model | MMMU (val) | MathVista (testmini) | MMStar (val) | DocVQA (test) | TextVQA (val) | Min GPU RAM required (GB) |
+|-------------------|------------|----------------------|--------------|---------------|---------------|---------------------------|
+| SmolVLM | 38.8 | 44.6 | 42.1 | 81.6 | 72.7 | 5.02 |
+| Qwen-VL 2B | 41.1 | 47.8 | 47.5 | 90.1 | 79.7 | 13.70 |
+| InternVL2 2B | 34.3 | 46.3 | 49.8 | 86.9 | 73.4 | 10.52 |
+| PaliGemma 3B 448px| 34.9 | 28.7 | 48.3 | 32.2 | 56.0 | 6.72 |
+| moondream2 | 32.4 | 24.3 | 40.3 | 70.5 | 65.2 | 3.87 |
+| MiniCPM-V-2 | 38.2 | 39.8 | 39.1 | 71.9 | 74.1 | 7.88 |
+| MM1.5 1B | 35.8 | 37.2 | 0.0 | 81.0 | 72.5 | NaN |