ritabratamaiti
committed
Update README.md
README.md
CHANGED
@@ -13,6 +13,7 @@ tags:
 - vlm
 - vision
 - multimodal
+- AnyModal
 ---
 # AnyModal/Image-Captioning-Llama-3.2-1B
 
@@ -117,7 +118,7 @@ Refer to the project repository for further implementation details and customiza
 ## Project Details
 
 - **Vision Encoder**: Pre-trained Vision Transformer (ViT) model for visual feature extraction.
-- **Projector Network**: Projects visual features into a token space compatible with Llama 3.2-1B.
+- **Projector Network**: Projects visual features into a token space compatible with Llama 3.2-1B using a dense network.
 - **Language Model**: Llama 3.2-1B, a pre-trained causal language model for text generation.
 
 This implementation serves as a proof of concept, combining a ViT-based image encoder and a small language model. Future iterations could achieve improved performance by incorporating text-conditioned image encoders and larger-scale language models.
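For context on the "dense network" wording added in this change, below is a minimal PyTorch sketch of what such a projector might look like. It is an illustrative assumption, not AnyModal's actual implementation: the class name `DenseProjector`, the layer layout, and the dimensions (768 for a ViT-Base encoder, 2048 for Llama 3.2-1B's hidden size) are all hypothetical choices made for the example.

```python
# Illustrative sketch of a dense projector (not the AnyModal implementation):
# maps ViT patch features into the token-embedding space of Llama 3.2-1B.
import torch
import torch.nn as nn

class DenseProjector(nn.Module):
    def __init__(self, vit_dim: int = 768, llm_dim: int = 2048):
        super().__init__()
        # A small MLP; the real layer count and activation may differ.
        self.net = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vit_features: torch.Tensor) -> torch.Tensor:
        # vit_features: (batch, num_patches, vit_dim) from the ViT encoder.
        # Returns (batch, num_patches, llm_dim) "visual tokens" that can be
        # prepended to text embeddings before the causal language model.
        return self.net(vit_features)

# Example: project 196 ViT-Base patch embeddings into Llama's embedding space.
projector = DenseProjector()
visual_tokens = projector(torch.randn(1, 196, 768))
print(visual_tokens.shape)  # torch.Size([1, 196, 2048])
```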