Update README.md
README.md
@@ -25,7 +25,7 @@ It is _**the largest open-source vision/vision-language foundation model (14B)**
 - **Model Type:** multimodal chatbot
 - **Model Stats:**
   - Architecture: InternViT-6B + MLP + LLaMA2-13B
-  - Params
+  - Params: 19B
   - Image size: 448 x 448
   - Number of visual tokens: 256
 
@@ -33,7 +33,7 @@ It is _**the largest open-source vision/vision-language foundation model (14B)**
   - Pretraining Stage
     - Learnable Component: InternViT-6B
     - Data: Trained on 72M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
-    - Note: In this stage, we load the pretrained weights of InternViT-6B-224px and interpolate its position embedding to the size corresponding to
+    - Note: In this stage, we load the pretrained weights of InternViT-6B-224px and interpolate its position embedding to the size corresponding to 448 x 448 pixels. Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
   - SFT Stage
     - Learnable Component: MLP + LLM
     - Data: A comprehensive collection of open-source SFT datasets, along with their Chinese translation versions, totaling approximately 10M.
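The note added in this commit packs two mechanisms into one sentence: interpolating the 224px position embeddings up to the 448px input resolution, and a pixel shuffle that folds visual tokens together before they reach the MLP projector. Below is a minimal PyTorch sketch of both, for illustration only; it assumes a 14-pixel patch size and a leading class token (so 224px gives a 16 x 16 patch grid and 448px a 32 x 32 grid of 1024 patches) and uses InternViT-6B's 3200-dim width purely for the shape check. It is not the repository's implementation.

```python
import torch
import torch.nn.functional as F

# Illustration only -- not the repository's code. Assumes a 14 px patch size and a
# leading class token, so 224 px -> a 16 x 16 patch grid and 448 px -> a 32 x 32 grid.

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Resize ViT position embeddings of shape (1, 1 + g*g, dim) to a new grid size."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    old_grid = int(patch_pos.shape[1] ** 0.5)
    # Reshape to an image-like layout and resample with bicubic interpolation.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

def pixel_shuffle_tokens(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Fold each scale x scale block of tokens into one wider token:
    (B, H*W, C) -> (B, H*W // scale**2, C * scale**2)."""
    b, n, c = x.shape
    h = w = int(n ** 0.5)
    # Split the token grid into scale x scale blocks, then merge each block's
    # tokens along the channel dimension.
    x = x.reshape(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.reshape(b, (h // scale) * (w // scale), c * scale * scale)

# Shape check (3200 is InternViT-6B's hidden width):
pos_224 = torch.randn(1, 1 + 16 * 16, 3200)
pos_448 = interpolate_pos_embed(pos_224, new_grid=32)  # (1, 1 + 1024, 3200)

vis = torch.randn(2, 1024, 3200)            # 32 x 32 patch tokens at 448 px
vis = pixel_shuffle_tokens(vis, scale=2)    # (2, 256, 12800): 1024 -> 256 tokens
```

A 2 x 2 shuffle quarters the token count (1024 -> 256, matching the "Number of visual tokens: 256" stat) while quadrupling the channel width; presumably the MLP in "InternViT-6B + MLP + LLaMA2-13B" then projects these wider visual tokens into the LLaMA2-13B embedding space.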