Upload 2 files

- README.md +17 -0
- config.json +0 -1
README.md
CHANGED
@@ -277,6 +277,23 @@ extra_gated_button_content: Submit
 extra_gated_eu_disallowed: true
 ---

+## Meta-Llama-3.2-8B-Instruct
+
+This model is derived from the [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) multimodal model by removing the vision layers, which turns it into a **text-only model**.
+
+## Conversion Details
+
+1. Load the "meta-llama/Llama-3.2-11B-Vision-Instruct" model with `transformers.MllamaForConditionalGeneration`.
+2. Inspect the model structure, which consists mainly of a `language_model` and a `vision_model`, and remove the `vision_model`.
+3. From the config's `text_config`, read the positions of the cross-attention layers (`cross_attention_layers`) and remove those layers from the `language_model`. For the 11B model, `cross_attention_layers` is [3, 8, 13, 18, 23, 28, 33, 38].
+4. Rename the model structure from `language_model.model` to `model`.
+5. The `embed_tokens` of Mllama has `num_embeddings` set to `vocab_size + 8`, so trim it back to `vocab_size`.
+6. With the `vision_model` and cross-attention layers removed, the model becomes text-only and its size drops from 11B to roughly 8B parameters.
+
+___
+# Below is the original README content.
+___
+
 ## Model Information

 The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text \+ images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.
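For reference, the conversion steps in the added README section could look roughly like the sketch below, using the Mllama implementation in transformers 4.45.x (the version pinned in config.json). This is a minimal, illustrative sketch rather than the author's actual script: the source repo ID, the `OUT_DIR` output path, and all variable names are assumptions, and access to the gated source repository is required.

```python
# Minimal sketch of the conversion described above (not the author's exact script).
# Assumes transformers 4.45.x; repo ID and output path are illustrative.
import re

import torch
from transformers import (
    AutoTokenizer,
    LlamaConfig,
    LlamaForCausalLM,
    MllamaForConditionalGeneration,
)

SRC = "meta-llama/Llama-3.2-11B-Vision-Instruct"     # gated repo, requires access
OUT_DIR = "Meta-Llama-3.2-8B-Instruct"               # illustrative output directory

mllama = MllamaForConditionalGeneration.from_pretrained(SRC, torch_dtype=torch.bfloat16)
text_cfg = mllama.config.text_config
cross_layers = set(text_cfg.cross_attention_layers)  # [3, 8, 13, 18, 23, 28, 33, 38] for 11B
vocab_size = text_cfg.vocab_size                     # 128256

# Steps 2-5: drop the vision tower, drop the cross-attention layers, renumber the
# surviving decoder layers, rename language_model.model.* -> model.*, and trim the
# 8 extra embedding rows so num_embeddings == vocab_size.
keep = [i for i in range(text_cfg.num_hidden_layers) if i not in cross_layers]
new_index = {old: new for new, old in enumerate(keep)}

new_state = {}
for name, tensor in mllama.state_dict().items():
    if name.startswith(("vision_model.", "multi_modal_projector.")):
        continue                                     # step 2: remove the vision model
    m = re.match(r"language_model\.model\.layers\.(\d+)\.(.+)", name)
    if m:
        old = int(m.group(1))
        if old in cross_layers:
            continue                                 # step 3: remove cross-attention layers
        name = f"model.layers.{new_index[old]}.{m.group(2)}"
    else:
        # step 4: language_model.model.* -> model.*, language_model.lm_head -> lm_head
        name = name.replace("language_model.model.", "model.").replace("language_model.", "")
    if name == "model.embed_tokens.weight":
        tensor = tensor[:vocab_size]                 # step 5: 128264 -> 128256 rows
    new_state[name] = tensor

# Step 6: materialize a plain Llama text model with the surviving 32 layers.
llama_cfg = LlamaConfig(
    vocab_size=vocab_size,
    hidden_size=text_cfg.hidden_size,
    intermediate_size=text_cfg.intermediate_size,
    num_hidden_layers=len(keep),                     # 40 - 8 = 32
    num_attention_heads=text_cfg.num_attention_heads,
    num_key_value_heads=text_cfg.num_key_value_heads,
    rms_norm_eps=text_cfg.rms_norm_eps,
    rope_theta=text_cfg.rope_theta,
    rope_scaling=text_cfg.rope_scaling,
    max_position_embeddings=text_cfg.max_position_embeddings,
    hidden_act=text_cfg.hidden_act,
    bos_token_id=text_cfg.bos_token_id,
    eos_token_id=text_cfg.eos_token_id,
    pad_token_id=text_cfg.pad_token_id,
    architectures=["LlamaForCausalLM"],
    torch_dtype="bfloat16",
)
llama = LlamaForCausalLM(llama_cfg).to(torch.bfloat16)
missing, unexpected = llama.load_state_dict(new_state, strict=False)
print("missing:", missing, "unexpected:", unexpected)  # both should be empty

llama.save_pretrained(OUT_DIR)
AutoTokenizer.from_pretrained(SRC).save_pretrained(OUT_DIR)
```

Working on the flat state dict keeps the surviving self-attention weights untouched; only the layer indices are renumbered so they stay contiguous after the cross-attention layers are dropped.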
config.json
CHANGED
@@ -32,7 +32,6 @@
 "transformers_version": "4.45.2",
 "use_cache": true,
 "vocab_size": 128256,
-"architectures": null,
 "bad_words_ids": null,
 "begin_suppress_tokens": null,
 "chunk_size_feed_forward": 0,
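If the converted checkpoint is saved as in the sketch above, a quick smoke test might look like the following; the directory name, prompt, and generation settings are illustrative, and `device_map="auto"` assumes `accelerate` is installed.

```python
# Hypothetical smoke test of the converted text-only checkpoint.
# MODEL_DIR is the illustrative output path from the conversion sketch above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "Meta-Llama-3.2-8B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR, torch_dtype=torch.bfloat16, device_map="auto"
)

# Values that should line up with config.json: 128256 tokens, 32 decoder layers.
print(model.config.vocab_size, model.config.num_hidden_layers)

prompt = "Briefly explain what a cross-attention layer does in a vision-language model."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```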