Update README.md
README.md CHANGED
@@ -36,7 +36,7 @@ It is _**the largest open-source vision/vision-language foundation model (14B)**
  - Note: In this stage, we load the pretrained weights of InternViT-6B-224px and interpolate its position embedding to the size corresponding to 448 x 448 pixels. Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
  - SFT Stage
  - Learnable Component: MLP + LLM
- - Data: A comprehensive collection of open-source SFT datasets, along with their Chinese translation versions, totaling approximately
+ - Data: A comprehensive collection of open-source SFT datasets, along with their Chinese translation versions, totaling approximately 6M samples.

 ## Model Usage
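The note in the diff above packs in two separate tricks: position-embedding interpolation (so a ViT pretrained at 224px can consume 448px inputs) and a pixel shuffle that folds each 2 x 2 block of visual tokens into one wider token, cutting 1024 tokens (a 32 x 32 grid) down to 256 (16 x 16). Below is a minimal PyTorch sketch of both, assuming a patch size of 14, a square token grid, and no class token; the function names, hidden width, and bicubic resize mode are illustrative assumptions, not the actual InternViT implementation.

```python
import torch
import torch.nn.functional as F


def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize a (1, old_grid**2, C) position embedding to a new grid size.

    Illustrative only: assumes any class-token embedding was split off first.
    """
    _, n, c = pos_embed.shape
    old_grid = int(n ** 0.5)  # e.g. 224 / 14 = 16 patches per side
    x = pos_embed.reshape(1, old_grid, old_grid, c).permute(0, 3, 1, 2)
    x = F.interpolate(x, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return x.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, c)


def pixel_shuffle_tokens(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Fold each scale x scale token block into one token with scale**2 x channels.

    (B, H*W, C) -> (B, (H//scale) * (W//scale), C * scale**2), square grid assumed.
    """
    b, n, c = x.shape
    h = w = int(n ** 0.5)
    x = x.reshape(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()  # group each scale x scale neighborhood
    return x.reshape(b, (h // scale) * (w // scale), c * scale * scale)


pe_224 = torch.randn(1, 16 * 16, 3200)       # hidden width 3200 chosen for illustration
pe_448 = interpolate_pos_embed(pe_224, new_grid=32)
tokens = torch.randn(2, 32 * 32, 3200)       # 1024 visual tokens from a 448px image
print(pe_448.shape)                          # torch.Size([1, 1024, 3200])
print(pixel_shuffle_tokens(tokens).shape)    # torch.Size([2, 256, 12800])
```

The shuffled tokens come out 4x fewer but 4x wider; the MLP projector named in the SFT stage would then map these wider tokens into the LLM's embedding space.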