Upload 2 files

- README.md +17 -0
- config.json +0 -1
README.md
CHANGED
@@ -277,6 +277,23 @@ extra_gated_button_content: Submit
 extra_gated_eu_disallowed: true
 ---

+## Meta-Llama-3.2-8B-Instruct
+
+This model is derived from the [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) multimodal model by removing the vision layers, which turns it into a **text-only model**.
+
+## Conversion Details
+
+1. Load the "meta-llama/Llama-3.2-11B-Vision-Instruct" model with `transformers.MllamaForConditionalGeneration`.
+2. Inspect the model structure, which consists mainly of a `language_model` and a `vision_model`, and remove the `vision_model`.
+3. From the config's `text_config`, read the positions of the cross-attention layers (`cross_attention_layers`) and remove those layers from the `language_model`. For the 11B model, `cross_attention_layers` is [3, 8, 13, 18, 23, 28, 33, 38].
+4. Rename the model structure from `language_model.model` to `model`.
+5. The `embed_tokens` of Mllama has `num_embeddings` set to `vocab_size + 8`, so trim it back to `vocab_size`.
+6. With the `vision_model` and cross-attention layers removed, the model becomes text-only and its size drops from 11B to roughly 8B parameters.
+
+___
+# Below is the original README content.
+___
+
 ## Model Information

 The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text \+ images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.
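For reference, the conversion steps in the added README section could look roughly like the sketch below, using the Mllama implementation in transformers 4.45.x (the version pinned in config.json). This is a minimal, illustrative sketch rather than the author's actual script: the source repo ID, the `OUT_DIR` output path, and all variable names are assumptions, and access to the gated source repository is required.

```python
# Minimal sketch of the conversion described above (not the author's exact script).
# Assumes transformers 4.45.x; repo ID and output path are illustrative.
import re

import torch
from transformers import (
    AutoTokenizer,
    LlamaConfig,
    LlamaForCausalLM,
    MllamaForConditionalGeneration,
)

SRC = "meta-llama/Llama-3.2-11B-Vision-Instruct"     # gated repo, requires access
OUT_DIR = "Meta-Llama-3.2-8B-Instruct"               # illustrative output directory

mllama = MllamaForConditionalGeneration.from_pretrained(SRC, torch_dtype=torch.bfloat16)
text_cfg = mllama.config.text_config
cross_layers = set(text_cfg.cross_attention_layers)  # [3, 8, 13, 18, 23, 28, 33, 38] for 11B
vocab_size = text_cfg.vocab_size                     # 128256

# Steps 2-5: drop the vision tower, drop the cross-attention layers, renumber the
# surviving decoder layers, rename language_model.model.* -> model.*, and trim the
# 8 extra embedding rows so num_embeddings == vocab_size.
keep = [i for i in range(text_cfg.num_hidden_layers) if i not in cross_layers]
new_index = {old: new for new, old in enumerate(keep)}

new_state = {}
for name, tensor in mllama.state_dict().items():
    if name.startswith(("vision_model.", "multi_modal_projector.")):
        continue                                     # step 2: remove the vision model
    m = re.match(r"language_model\.model\.layers\.(\d+)\.(.+)", name)
    if m:
        old = int(m.group(1))
        if old in cross_layers:
            continue                                 # step 3: remove cross-attention layers
        name = f"model.layers.{new_index[old]}.{m.group(2)}"
    else:
        # step 4: language_model.model.* -> model.*, language_model.lm_head -> lm_head
        name = name.replace("language_model.model.", "model.").replace("language_model.", "")
    if name == "model.embed_tokens.weight":
        tensor = tensor[:vocab_size]                 # step 5: 128264 -> 128256 rows
    new_state[name] = tensor

# Step 6: materialize a plain Llama text model with the surviving 32 layers.
llama_cfg = LlamaConfig(
    vocab_size=vocab_size,
    hidden_size=text_cfg.hidden_size,
    intermediate_size=text_cfg.intermediate_size,
    num_hidden_layers=len(keep),                     # 40 - 8 = 32
    num_attention_heads=text_cfg.num_attention_heads,
    num_key_value_heads=text_cfg.num_key_value_heads,
    rms_norm_eps=text_cfg.rms_norm_eps,
    rope_theta=text_cfg.rope_theta,
    rope_scaling=text_cfg.rope_scaling,
    max_position_embeddings=text_cfg.max_position_embeddings,
    hidden_act=text_cfg.hidden_act,
    bos_token_id=text_cfg.bos_token_id,
    eos_token_id=text_cfg.eos_token_id,
    pad_token_id=text_cfg.pad_token_id,
    architectures=["LlamaForCausalLM"],
    torch_dtype="bfloat16",
)
llama = LlamaForCausalLM(llama_cfg).to(torch.bfloat16)
missing, unexpected = llama.load_state_dict(new_state, strict=False)
print("missing:", missing, "unexpected:", unexpected)  # both should be empty

llama.save_pretrained(OUT_DIR)
AutoTokenizer.from_pretrained(SRC).save_pretrained(OUT_DIR)
```

Working on the flat state dict keeps the surviving self-attention weights untouched; only the layer indices are renumbered so they stay contiguous after the cross-attention layers are dropped.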
config.json
CHANGED
@@ -32,7 +32,6 @@
 "transformers_version": "4.45.2",
 "use_cache": true,
 "vocab_size": 128256,
-"architectures": null,
 "bad_words_ids": null,
 "begin_suppress_tokens": null,
 "chunk_size_feed_forward": 0,
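If the converted checkpoint is saved as in the sketch above, a quick smoke test might look like the following; the directory name, prompt, and generation settings are illustrative, and `device_map="auto"` assumes `accelerate` is installed.

```python
# Hypothetical smoke test of the converted text-only checkpoint.
# MODEL_DIR is the illustrative output path from the conversion sketch above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_DIR = "Meta-Llama-3.2-8B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_DIR, torch_dtype=torch.bfloat16, device_map="auto"
)

# Values that should line up with config.json: 128256 tokens, 32 decoder layers.
print(model.config.vocab_size, model.config.num_hidden_layers)

prompt = "Briefly explain what a cross-attention layer does in a vision-language model."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```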