avans06 committed on
Commit b022a4d · verified · 1 parent: 72c39a3

Upload 2 files

Files changed (2)
  1. README.md +17 -0
  2. config.json +0 -1
README.md CHANGED
@@ -277,6 +277,23 @@ extra_gated_button_content: Submit
  extra_gated_eu_disallowed: true
  ---

+ ## Meta-Llama-3.2-8B-Instruct
+
+ This model is derived from the [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) multimodal model by removing the vision layers, which turns it into a **text-only model**.
+
+ ## Conversion Details
+
+ 1. Use `transformers.MllamaForConditionalGeneration` to load the "meta-llama/Llama-3.2-11B-Vision-Instruct" model.
+ 2. Inspect the model structure, which consists mainly of `language_model` and `vision_model`, and remove the `vision_model`.
+ 3. From the config's `text_config`, identify the positions of the cross-attention layers (`cross_attention_layers`) and remove those layers from the `language_model`. For the 11B model, `cross_attention_layers` is [3, 8, 13, 18, 23, 28, 33, 38].
+ 4. Rename the model structure from `language_model.model` to `model`.
+ 5. The `embed_tokens` of Mllama has `num_embeddings` set to `vocab_size + 8`, so trim the embedding matrix to `vocab_size` rows.
+ 6. After removing the `vision_model` and the cross-attention layers, the model is text-only and its size drops from 11B to 8B parameters.
+
+ ___
+ # Below is the original README content.
+ ___
+
  ## Model Information

  The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text \+ images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.
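
The conversion steps listed in the new README section above map onto a fairly small script. The sketch below is a rough reconstruction under stated assumptions (the Mllama module layout in transformers 4.45, with `language_model`, `vision_model`, and `text_config.cross_attention_layers`); it is not the exact script behind this commit, and `OUTPUT_DIR` is just a placeholder directory name.

```python
# Minimal sketch of the 11B-Vision -> 8B text-only conversion described above.
# Not the exact script used for this commit; assumes the transformers 4.45 Mllama layout.
import torch
from transformers import (
    AutoTokenizer,
    LlamaConfig,
    LlamaForCausalLM,
    MllamaForConditionalGeneration,
)

SOURCE_ID = "meta-llama/Llama-3.2-11B-Vision-Instruct"  # gated repo, access required
OUTPUT_DIR = "Meta-Llama-3.2-8B-Instruct"               # placeholder local output directory

# Step 1: load the multimodal model (language_model + vision_model).
mllama = MllamaForConditionalGeneration.from_pretrained(SOURCE_ID, torch_dtype=torch.bfloat16)
text_cfg = mllama.config.text_config
cross_attn = set(text_cfg.cross_attention_layers)  # [3, 8, 13, 18, 23, 28, 33, 38] for 11B

# Steps 2-3: keep only the self-attention decoder layers of the language model;
# the vision tower and the cross-attention layers are dropped.
kept_layers = [
    layer
    for idx, layer in enumerate(mllama.language_model.model.layers)
    if idx not in cross_attn
]

# Step 4: build a plain text-only Llama model with a matching shape.
llama_cfg = LlamaConfig(
    vocab_size=text_cfg.vocab_size,
    hidden_size=text_cfg.hidden_size,
    intermediate_size=text_cfg.intermediate_size,
    num_hidden_layers=len(kept_layers),
    num_attention_heads=text_cfg.num_attention_heads,
    num_key_value_heads=text_cfg.num_key_value_heads,
    hidden_act=text_cfg.hidden_act,
    max_position_embeddings=text_cfg.max_position_embeddings,
    rms_norm_eps=text_cfg.rms_norm_eps,
    rope_theta=text_cfg.rope_theta,
    rope_scaling=text_cfg.rope_scaling,
    tie_word_embeddings=text_cfg.tie_word_embeddings,
    bos_token_id=text_cfg.bos_token_id,
    eos_token_id=text_cfg.eos_token_id,
    pad_token_id=text_cfg.pad_token_id,
)
llama = LlamaForCausalLM(llama_cfg).to(torch.bfloat16)

# Step 5: copy weights. Mllama's embed_tokens has vocab_size + 8 rows
# (extra image tokens), so only the first vocab_size rows are kept.
src = mllama.language_model
llama.model.embed_tokens.weight.data.copy_(
    src.model.embed_tokens.weight.data[: text_cfg.vocab_size]
)
llama.model.norm.load_state_dict(src.model.norm.state_dict())
llama.lm_head.load_state_dict(src.lm_head.state_dict())
for dst_layer, src_layer in zip(llama.model.layers, kept_layers):
    # Parameter names (q/k/v/o_proj, gate/up/down_proj, the two layer norms)
    # match between Mllama's self-attention decoder layers and LlamaDecoderLayer.
    dst_layer.load_state_dict(src_layer.state_dict(), strict=False)

# Step 6: save the resulting ~8B text-only model together with the tokenizer.
llama.save_pretrained(OUTPUT_DIR)
AutoTokenizer.from_pretrained(SOURCE_ID).save_pretrained(OUTPUT_DIR)
```

Once saved, the converted checkpoint loads like any other text-only Llama model, e.g. with `AutoModelForCausalLM.from_pretrained(OUTPUT_DIR)`.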
config.json CHANGED
@@ -32,7 +32,6 @@
  "transformers_version": "4.45.2",
  "use_cache": true,
  "vocab_size": 128256,
- "architectures": null,
  "bad_words_ids": null,
  "begin_suppress_tokens": null,
  "chunk_size_feed_forward": 0,