Update README.md
README.md
CHANGED
@@ -26,9 +26,9 @@ The fundamental concept behind HelpingAI-Vision is to generate one token embeddi
For every crop of the image, an embedding is generated using the full SigLIP encoder (size [1, 1152]). Subsequently, all N embeddings undergo processing through the LLaVA adapter, resulting in a token embedding of size [N, 2560]. Currently, these tokens lack explicit information about their position in the original image, with plans to incorporate positional information in a later update.

- HelpingAI-Vision was fine-tuned from
+ HelpingAI-Vision was fine-tuned from MC-LLaVA-3b.

- The model adopts the ChatML prompt format, suggesting its potential application in chat-based scenarios. If you have specific queries or would like further details, feel free
+ The model adopts the ChatML prompt format, suggesting its potential application in chat-based scenarios. If you have specific queries or would like further details, feel free to ask.

```
<|im_start|>system
You are Vortex, a helpful AI assistant.<|im_end|>
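
The context paragraph in the hunk above describes the image path: each crop is encoded by SigLIP into a [1, 1152] vector, and the stacked N crop embeddings are projected by the LLaVA adapter into N tokens of size 2560. Below is a minimal PyTorch sketch of that shape flow; the module names, the two-layer MLP adapter, and the random stand-in for the SigLIP encoder are illustrative assumptions, not the actual HelpingAI-Vision / MC-LLaVA-3b implementation.

```python
import torch
import torch.nn as nn

SIGLIP_DIM = 1152   # per-crop output size of the SigLIP encoder: [1, 1152]
LLM_DIM = 2560      # token embedding size expected by the language model

class LlavaAdapter(nn.Module):
    """Projects per-crop SigLIP embeddings into the LLM token space (assumed 2-layer MLP)."""
    def __init__(self, in_dim: int = SIGLIP_DIM, out_dim: int = LLM_DIM):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, crop_embeddings: torch.Tensor) -> torch.Tensor:
        # crop_embeddings: [N, 1152] -> image tokens: [N, 2560]
        return self.proj(crop_embeddings)

def encode_crops(crops: list[torch.Tensor]) -> torch.Tensor:
    # Placeholder for the full SigLIP encoder: one [1, 1152] embedding per crop.
    return torch.randn(len(crops), SIGLIP_DIM)

crops = [torch.empty(3, 384, 384) for _ in range(6)]   # N = 6 image crops
image_tokens = LlavaAdapter()(encode_crops(crops))
print(image_tokens.shape)  # torch.Size([6, 2560]) -- one token per crop, no positional info yet
```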
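
The hunk cuts off after the system line, so the full prompt template is not visible here. The sketch below assembles a prompt using the standard ChatML turn layout; the user/assistant turns are an assumption about how the remaining template looks, not the model card's exact snippet.

```python
# Generic ChatML prompt assembly. The system text comes from the README snippet above;
# the user/assistant turn structure is the standard ChatML layout and is assumed,
# since the diff hunk ends before the rest of the template.
def build_chatml_prompt(system: str, user: str) -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(build_chatml_prompt("You are Vortex, a helpful AI assistant.", "Describe this image."))
```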