ritabratamaiti committed on
Commit 7697704 · verified · 1 Parent(s): d9893c6

Update README.md

Files changed (1): README.md (+2 -1)
README.md CHANGED
@@ -13,6 +13,7 @@ tags:
  - vlm
  - vision
  - multimodal
+ - AnyModal
  ---
  # AnyModal/Image-Captioning-Llama-3.2-1B
 
@@ -117,7 +118,7 @@ Refer to the project repository for further implementation details and customiza
  ## Project Details
 
  - **Vision Encoder**: Pre-trained Vision Transformer (ViT) model for visual feature extraction.
- - **Projector Network**: Projects visual features into a token space compatible with Llama 3.2-1B.
+ - **Projector Network**: Projects visual features into a token space compatible with Llama 3.2-1B using a dense network.
  - **Language Model**: Llama 3.2-1B, a pre-trained causal language model for text generation.
 
  This implementation serves as a proof of concept, combining a ViT-based image encoder and a small language model. Future iterations could achieve improved performance by incorporating text-conditioned image encoders and larger-scale language models.
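The updated **Projector Network** bullet describes a dense network mapping ViT features into Llama 3.2-1B's token-embedding space. A minimal sketch of that idea, assuming ViT-Base patch features (768-d) and Llama 3.2-1B's 2048-d embeddings; the hidden width, activation, and the `Projector` class name are illustrative assumptions, not the repository's exact implementation:

```python
# Illustrative sketch only: dimensions and layer layout are assumptions,
# not the AnyModal repository's exact code.
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Dense network projecting ViT features into the LLM's embedding space."""

    def __init__(self, vit_dim: int = 768, llm_dim: int = 2048, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vit_dim, hidden_dim),  # lift ViT patch features
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),  # match Llama 3.2-1B's embedding width
        )

    def forward(self, vit_features: torch.Tensor) -> torch.Tensor:
        # vit_features: (batch, num_patches, vit_dim)
        # returns:      (batch, num_patches, llm_dim), usable as soft tokens
        # prepended to the language model's input embeddings
        return self.net(vit_features)

# Example: 196 patch embeddings from a ViT-Base (224px input, 16px patches)
features = torch.randn(1, 196, 768)
projected = Projector()(features)  # -> torch.Size([1, 196, 2048])
```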