ritabratamaiti
committed
Update README.md
README.md
CHANGED
@@ -13,6 +13,7 @@ tags:
 - vlm
 - vision
 - multimodal
+- AnyModal
 ---
 # AnyModal/Image-Captioning-Llama-3.2-1B
 
@@ -117,7 +118,7 @@ Refer to the project repository for further implementation details and customiza
 ## Project Details
 
 - **Vision Encoder**: Pre-trained Vision Transformer (ViT) model for visual feature extraction.
-- **Projector Network**: Projects visual features into a token space compatible with Llama 3.2-1B.
+- **Projector Network**: Projects visual features into a token space compatible with Llama 3.2-1B using a dense network.
 - **Language Model**: Llama 3.2-1B, a pre-trained causal language model for text generation.
 
 This implementation serves as a proof of concept, combining a ViT-based image encoder and a small language model. Future iterations could achieve improved performance by incorporating text-conditioned image encoders and larger-scale language models.
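For context on the "dense network" wording added in this change, below is a minimal PyTorch sketch of what such a projector might look like. It is an illustrative assumption, not AnyModal's actual implementation: the class name `DenseProjector`, the layer layout, and the dimensions (768 for a ViT-Base encoder, 2048 for Llama 3.2-1B's hidden size) are all hypothetical choices made for the example.

```python
# Illustrative sketch of a dense projector (not the AnyModal implementation):
# maps ViT patch features into the token-embedding space of Llama 3.2-1B.
import torch
import torch.nn as nn

class DenseProjector(nn.Module):
    def __init__(self, vit_dim: int = 768, llm_dim: int = 2048):
        super().__init__()
        # A small MLP; the real layer count and activation may differ.
        self.net = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vit_features: torch.Tensor) -> torch.Tensor:
        # vit_features: (batch, num_patches, vit_dim) from the ViT encoder.
        # Returns (batch, num_patches, llm_dim) "visual tokens" that can be
        # prepended to text embeddings before the causal language model.
        return self.net(vit_features)

# Example: project 196 ViT-Base patch embeddings into Llama's embedding space.
projector = DenseProjector()
visual_tokens = projector(torch.randn(1, 196, 768))
print(visual_tokens.shape)  # torch.Size([1, 196, 2048])
```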