czczup committed
Commit 0f3cf67 · verified · 1 Parent(s): 784af06

Update README.md

Files changed (1)
  1. README.md +9 -29
README.md CHANGED
@@ -18,26 +18,22 @@ pipeline_tag: visual-question-answering
 
  InternVL scales up the ViT to _**6B parameters**_ and aligns it with LLM.
 
- It is _**the largest open-source vision/vision-language foundation model (14B)**_ to date, achieving _**32 state-of-the-art**_ performances on a wide range of tasks such as visual perception, cross-modal retrieval, multimodal dialogue, etc.
-
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/4SynvLt2qH8JXFQVI_fmv.png)
-
  ## Model Details
- - **Model Type:** multimodal chatbot
+ - **Model Type:** multimodal large language model (MLLM)
  - **Model Stats:**
-   - Architecture: InternViT-6B + MLP + LLaMA2-13B
+   - Architecture: [InternViT-6B-448px](https://huggingface.co/OpenGVLab/InternViT-6B-448px) + MLP + LLaMA2-13B (One of our internal SFT versions)
    - Params: 19B
    - Image size: 448 x 448
    - Number of visual tokens: 256
 
  - **Training Strategy:**
    - Pretraining Stage
-     - Learnable Component: InternViT-6B + MLP
-     - Data: Trained on 72M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
-     - Note: In this stage, we load the pretrained weights of InternViT-6B-224px and interpolate its position embedding to the size corresponding to 448 x 448 pixels. Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
-   - SFT Stage
-     - Learnable Component: MLP + LLM
-     - Data: A comprehensive collection of open-source SFT datasets, along with their Chinese translation versions, totaling approximately 6M samples.
+     - Learnable Component: InternViT-6B + LLaMA2-13B
+     - Data: Trained on 72M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR-related datasets.
+     - Note: In this stage, we load the pretrained weights of the original [InternViT-6B-224px](https://huggingface.co/OpenGVLab/InternViT-6B-224px) and interpolate its position embedding to the size corresponding to 448 x 448 pixels. Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle operation to reduce 1024 tokens to 256 tokens.
+   - Supervised Finetuning Stage
+     - Learnable Component: MLP + LLaMA2-13B
+     - Data: A comprehensive collection of open-source datasets, along with their Chinese translation versions, totaling approximately 6M samples.
 
 
  ## Model Usage
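The Note added in the hunk above involves two mechanics: the 224px position-embedding table of InternViT-6B is interpolated up to the 448 x 448 patch grid, and a pixel-shuffle step folds the resulting 1024 visual tokens into 256 before the MLP projector hands them to LLaMA2-13B. The following is a minimal PyTorch sketch of those two steps, not the repository's actual code: the patch size of 14 (so 448 x 448 gives a 32 x 32 = 1024-token grid), the widths 3200 and 5120, and the helper names `interpolate_pos_embed` and `pixel_shuffle_tokens` are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    """Resize ViT position embeddings from an old_grid x old_grid layout to
    new_grid x new_grid, e.g. 16 -> 32 when moving from 224px to 448px inputs."""
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]        # keep the [CLS] slot untouched
    dim = patch_pos.shape[-1]
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)

def pixel_shuffle_tokens(x: torch.Tensor, downscale: int = 2) -> torch.Tensor:
    """Fold each 2x2 neighborhood of visual tokens into one token with 4x the
    channels: a 32x32 grid (1024 tokens) becomes a 16x16 grid (256 tokens)."""
    n, num_tokens, c = x.shape
    side = int(num_tokens ** 0.5)                                  # 32 for 1024 tokens
    x = x.reshape(n, side, side, c).permute(0, 3, 1, 2)            # (N, C, 32, 32)
    x = F.pixel_unshuffle(x, downscale_factor=downscale)           # (N, 4C, 16, 16)
    return x.flatten(2).transpose(1, 2)                            # (N, 256, 4C)

# Illustrative widths only: 3200 for the ViT side, 5120 for the LLaMA2-13B side.
vit_dim, llm_dim = 3200, 5120

pos_224 = torch.randn(1, 1 + 16 * 16, vit_dim)                     # pretrained 224px table (16x16 grid + CLS)
pos_448 = interpolate_pos_embed(pos_224, old_grid=16, new_grid=32) # -> (1, 1 + 1024, 3200)

vit_tokens = torch.randn(1, 1024, vit_dim)                         # 448/14 = 32 -> 32*32 patch tokens
visual_tokens = pixel_shuffle_tokens(vit_tokens)                   # (1, 256, 12800)
projector = nn.Sequential(nn.Linear(4 * vit_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))
llm_inputs = projector(visual_tokens)                              # (1, 256, 5120), prepended to the text embeddings
print(llm_inputs.shape)
```

The design trade is spatial resolution for channel width: each 2 x 2 block of tokens becomes a single token with four times the channels, which quarters the visual sequence length the language model attends over while keeping the underlying features.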
@@ -113,23 +109,7 @@ This model can also conduct an in-depth analysis of AAAI's official website and
 
  ## Evaluation
 
- **MultiModal Benchmark**
-
- \* Training set observed.
-
- | MathVista<br>(testmini) | MMB<br>(dev/test) | MMB-CN<br>(dev/test) | MMMU<br>(val/test) | CMMMU<br>(val/test) | MMVP | MME | POPE | Tiny LVLM | SEEDv1<br>(image) | LLaVA Wild | MM-Vet |
- | ----------------------- | --------------------- | --------------------- | ---------------------- | --------------------- | ---- | ------------------------ | ---- | --------- | ----------------- | ---------- | ------ |
- | 34.5 | 76.7&nbsp;/&nbsp;75.4 | 71.9&nbsp;/&nbsp;70.3 | 39.1&nbsp;/&nbsp;35.3 | 34.8&nbsp;/&nbsp;34.0 | 44.7 | 1675.1&nbsp;/&nbsp;348.6 | 87.1 | 343.2 | 73.2 | 73.2 | 46.7 |
-
- **Image Captioning & Visual Question Answering**
-
- \* Training set observed.
-
- | COCO<br>(test) | Flickr30K<br>(test) | NoCaps<br>(val) | VQAv2<br>(testdev) | OKVQA<br>(val) | TextVQA<br>(val) | VizWiz<br>(val/test) | AI2D<br>(test) | GQA<br>(test) | ScienceQA<br>(image) |
- | -------------- | ------------------- | --------------- | ------------------ | -------------- | ---------------- | --------------------- | -------------- | ------------- | -------------------- |
- | 142.2\* | 85.3 | 120.8 | 80.9\* | 64.1\* | 65.9 | 59.0&nbsp;/&nbsp;57.3 | 72.2\* | 62.5\* | 90.1\* |
-
- - We found that incorrect images were used for training and testing in `AI2D`, meaning that for problems where `abcLabel` is True, `abc_images` were not utilized. We have now corrected the images used for testing, but the results may still be somewhat lower as a consequence.
+ See [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#-evaluation) for detailed evaluation results.
 
  ## Citation
 
 