Abhaykoul committed on
Commit bfdfb84
1 Parent(s): e6f3920

Update README.md

Files changed (1)
  1. README.md +4 -10
README.md CHANGED
@@ -24,19 +24,13 @@ widget:

## Model details

- The core idea behind multi-crop LLaVA (MC-LLaVA) is that instead of N visual token embeddings per image, I generate one token embedding per N parts of the image.
- Having high-quality embeddings for smaller parts of the image helps to extract more details and understand the scene better.
+ The fundamental concept behind HelpingAI-Vision is to generate one token embedding per N parts of an image, rather than N visual token embeddings for the entire image. This approach, built on the Dolphin 2.6 Phi model and the LLaVA adapter, aims to enhance scene understanding by capturing more detailed information from each part.

- For every crop of the image, I generate an embedding from the full SigLIP encoder (size [1, 1152]) and then push all N embeddings through the LLaVA adapter, which
- gives the token embedding of size [N, 2560]. Right now, the tokens do not contain explicit information about their position in the original image. I plan to add it later.
+ For every crop of the image, an embedding is generated using the full SigLIP encoder (size [1, 1152]). All N embeddings are then passed through the LLaVA adapter, producing a token embedding of size [N, 2560]. Currently, these tokens carry no explicit information about their position in the original image; positional information is planned for a later update.

- MC-LLaVA-3b was fine-tuned from [Dolphin 2.6 Phi](https://huggingface.co/cognitivecomputations/dolphin-2_6-phi-2) using vision tower from
- [SigLIP 400M](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384).
-
- The context length during training was 1200 tokens, as the L4 GPUs I used didn't allow me to get more.
-
- As Dolphin 2.6 Phi, LLaVA-3b uses ChatML prompt format:
+ HelpingAI-Vision was fine-tuned from Dolphin 2.6 Phi, using the vision tower from SigLIP 400M. The context length during training was 1200 tokens, limited by the L4 GPUs used.

+ Like Dolphin 2.6 Phi, HelpingAI-Vision uses the ChatML prompt format:
```
<|im_start|>system
You are Dolphin, a helpful AI assistant.<|im_end|>
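
To make the tensor shapes in the updated model details easier to follow, here is a minimal, self-contained PyTorch sketch of the multi-crop flow. The `StandInSigLIP` and `StandInAdapter` modules are hypothetical placeholders with random weights, not the actual SigLIP SO400M vision tower or LLaVA adapter; only the crop-and-project structure and the [1, 1152] per-crop → [N, 2560] shapes follow the README.

```python
# Minimal sketch of the multi-crop embedding flow described in the README.
# The encoder and adapter below are stand-ins; only the shapes mirror the
# description: each crop is encoded to [1, 1152] and the stacked crops are
# projected to visual tokens of shape [N, 2560].
import torch
import torch.nn as nn


class StandInSigLIP(nn.Module):
    """Placeholder for the SigLIP SO400M vision tower (output dim 1152)."""

    def __init__(self, embed_dim: int = 1152):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapse spatial dims
        self.proj = nn.Linear(3, embed_dim)   # toy projection from per-channel means

    def forward(self, crop: torch.Tensor) -> torch.Tensor:
        # crop: [1, 3, H, W] -> per-channel means -> [1, 1152]
        pooled = self.pool(crop).flatten(1)
        return self.proj(pooled)


class StandInAdapter(nn.Module):
    """Placeholder for the LLaVA adapter (MLP from 1152 to the LM width 2560)."""

    def __init__(self, in_dim: int = 1152, out_dim: int = 2560):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, crop_embeds: torch.Tensor) -> torch.Tensor:
        # crop_embeds: [N, 1152] -> visual tokens [N, 2560]
        return self.mlp(crop_embeds)


def crop_grid(image: torch.Tensor, grid: int = 2) -> list[torch.Tensor]:
    """Split a [3, H, W] image into grid x grid crops, each [1, 3, h, w]."""
    _, h, w = image.shape
    ch, cw = h // grid, w // grid
    return [
        image[:, r * ch:(r + 1) * ch, c * cw:(c + 1) * cw].unsqueeze(0)
        for r in range(grid) for c in range(grid)
    ]


if __name__ == "__main__":
    encoder, adapter = StandInSigLIP(), StandInAdapter()
    image = torch.rand(3, 384, 384)                    # dummy RGB image
    crops = crop_grid(image, grid=2)                   # N = 4 crops
    per_crop = torch.cat([encoder(c) for c in crops])  # [N, 1152]
    visual_tokens = adapter(per_crop)                  # [N, 2560]
    print(visual_tokens.shape)                         # torch.Size([4, 2560])
```

The `crop_grid` helper mimics the design choice the README highlights: one embedding per crop of the image, rather than N patch tokens computed for the whole image at once.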