Abhaykoul committed on
Commit
f986d1d
1 Parent(s): 48f1c5c

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -22,7 +22,7 @@ widget:
22
 
23
  ## Model details
24
 
25
- The fundamental concept behind HelpingAI-Vision is to generate one token embedding per N parts of an image, as opposed to producing N visual token embeddings for the entire image. This approach, based on the Dolphin 2.6 Phi model and incorporating the LLaVA adapter, aims to enhance scene understanding by capturing more detailed information.
26
 
27
  For every crop of the image, an embedding is generated using the full SigLIP encoder (size [1, 1152]). Subsequently, all N embeddings undergo processing through the LLaVA adapter, resulting in a token embedding of size [N, 2560]. Currently, these tokens lack explicit information about their position in the original image, with plans to incorporate positional information in a later update.
28
 
@@ -31,7 +31,7 @@ HelpingAI-Vision was fine-tuned from Dolphin 2.6 Phi, leveraging the vision towe
31
  The model adopts the ChatML prompt format, suggesting its potential application in chat-based scenarios. If you have specific queries or would like further details, feel free
32
  ```
33
  <|im_start|>system
34
- You are Dolphin, a helpful AI assistant.<|im_end|>
35
  <|im_start|>user
36
  {prompt}<|im_end|>
37
  <|im_start|>assistant
 
22
 
23
  ## Model details
24
 
25
+ The fundamental concept behind HelpingAI-Vision is to generate one token embedding per N parts of an image, as opposed to producing N visual token embeddings for the entire image. This approach, based on the HelpingAI-Lite and incorporating the LLaVA adapter, aims to enhance scene understanding by capturing more detailed information.
26
 
27
  For every crop of the image, an embedding is generated using the full SigLIP encoder (size [1, 1152]). Subsequently, all N embeddings undergo processing through the LLaVA adapter, resulting in a token embedding of size [N, 2560]. Currently, these tokens lack explicit information about their position in the original image, with plans to incorporate positional information in a later update.
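
The shape bookkeeping described in this paragraph can be sketched as follows. This is a minimal illustration and not the model's actual code: the grid-based cropping, the helper names (`split_into_crops`, `fake_encode_crop`, `fake_adapter`, `image_to_tokens`), and the placeholder encoder and adapter are all assumptions made for the example. Only the sizes 1152 and 2560 come from the README.

```python
import random

SIGLIP_DIM = 1152  # per-crop embedding size stated in the README
LLM_DIM = 2560     # adapter output size stated in the README


def split_into_crops(width, height, rows, cols):
    """Tile the image into a rows x cols grid of (left, top, right, bottom) boxes.
    The grid layout is an assumption; the README only says "N parts"."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((
                c * width // cols,
                r * height // rows,
                (c + 1) * width // cols,
                (r + 1) * height // rows,
            ))
    return boxes


def fake_encode_crop(box, rng):
    """Placeholder for the SigLIP encoder: one vector of size 1152 per crop."""
    return [rng.uniform(-1.0, 1.0) for _ in range(SIGLIP_DIM)]


def fake_adapter(embedding):
    """Placeholder for the LLaVA adapter: maps 1152 values to 2560 values.
    Values are recycled cyclically here; the real adapter is learned."""
    return [embedding[i % SIGLIP_DIM] for i in range(LLM_DIM)]


def image_to_tokens(width, height, rows, cols, seed=0):
    """Return one token embedding per crop, i.e. a list of shape [N, 2560]."""
    rng = random.Random(seed)
    boxes = split_into_crops(width, height, rows, cols)
    crop_embeddings = [fake_encode_crop(box, rng) for box in boxes]  # N x [1152]
    return [fake_adapter(e) for e in crop_embeddings]                # N x [2560]


tokens = image_to_tokens(width=768, height=768, rows=2, cols=2)
print(len(tokens), len(tokens[0]))  # 4 2560
```

With a 2 x 2 grid, N is 4, so the language model receives four visual tokens, each summarizing one crop. As the README notes, nothing in these tokens encodes where each crop sat in the original image.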
28
 
 
31
  The model adopts the ChatML prompt format, suggesting its potential application in chat-based scenarios. If you have specific queries or would like further details, feel free
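
As a rough illustration of the ChatML layout shown in this diff, a prompt string could be assembled like this. The function name `build_chatml_prompt` is hypothetical, the trailing newline after the assistant tag is an assumption, and the sketch covers text only, since the diff does not show how images are passed to the model.

```python
def build_chatml_prompt(user_prompt,
                        system_message="You are Vortex, a helpful AI assistant."):
    """Assemble a ChatML prompt that ends with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"  # trailing newline is an assumption
    )


print(build_chatml_prompt("Describe this image."))
```

Leaving the assistant turn open lets the model's generated text become the assistant's reply.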
32
  ```
33
  <|im_start|>system
34
+ You are Vortex, a helpful AI assistant.<|im_end|>
35
  <|im_start|>user
36
  {prompt}<|im_end|>
37
  <|im_start|>assistant