Update README.md
README.md
@@ -25,7 +25,7 @@ It is _**the largest open-source vision/vision-language foundation model (14B)**
 - **Model Type:** multimodal chatbot
 - **Model Stats:**
   - Architecture: InternViT-6B + MLP + LLaMA2-13B
-  - Params
+  - Params: 19B
   - Image size: 448 x 448
   - Number of visual tokens: 256
 
@@ -33,7 +33,7 @@ It is _**the largest open-source vision/vision-language foundation model (14B)**
   - Pretraining Stage
     - Learnable Component: InternViT-6B
     - Data: Trained on 72M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
-    - Note: In this stage, we load the pretrained weights of InternViT-6B-224px and interpolate its position embedding to the size corresponding to
+    - Note: In this stage, we load the pretrained weights of InternViT-6B-224px and interpolate its position embedding to the size corresponding to 448 x 448 pixels. Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
   - SFT Stage
     - Learnable Component: MLP + LLM
     - Data: A comprehensive collection of open-source SFT datasets, along with their Chinese translation versions, totaling approximately 10M.
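The note added in this commit packs two mechanisms into one sentence: interpolating the 224px position embeddings up to the 448px input resolution, and a pixel shuffle that folds visual tokens together before they reach the MLP projector. Below is a minimal PyTorch sketch of both, for illustration only; it assumes a 14-pixel patch size and a leading class token (so 224px gives a 16 x 16 patch grid and 448px a 32 x 32 grid of 1024 patches) and uses InternViT-6B's 3200-dim width purely for the shape check. It is not the repository's implementation.

```python
import torch
import torch.nn.functional as F

# Illustration only -- not the repository's code. Assumes a 14 px patch size and a
# leading class token, so 224 px -> a 16 x 16 patch grid and 448 px -> a 32 x 32 grid.

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize ViT position embeddings of shape (1, 1 + g*g, dim) to a new grid size."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    old_grid = int(patch_pos.shape[1] ** 0.5)
    # Reshape to an image-like layout and resample with bicubic interpolation.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

def pixel_shuffle_tokens(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Fold each scale x scale block of tokens into one wider token:
    (B, H*W, C) -> (B, H*W // scale**2, C * scale**2)."""
    b, n, c = x.shape
    h = w = int(n ** 0.5)
    # Split the token grid into scale x scale blocks, then merge each block's
    # tokens along the channel dimension.
    x = x.reshape(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.reshape(b, (h // scale) * (w // scale), c * scale * scale)

# Shape check (3200 is InternViT-6B's hidden width):
pos_224 = torch.randn(1, 1 + 16 * 16, 3200)
pos_448 = interpolate_pos_embed(pos_224, new_grid=32)  # (1, 1 + 1024, 3200)

vis = torch.randn(2, 1024, 3200)            # 32 x 32 patch tokens at 448 px
vis = pixel_shuffle_tokens(vis, scale=2)    # (2, 256, 12800): 1024 -> 256 tokens
```

A 2 x 2 shuffle quarters the token count (1024 -> 256, matching the "Number of visual tokens: 256" stat) while quadrupling the channel width; presumably the MLP in "InternViT-6B + MLP + LLaMA2-13B" then projects these wider visual tokens into the LLaMA2-13B embedding space.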