Update README.md

README.md CHANGED

@@ -18,26 +18,22 @@ pipeline_tag: visual-question-answering
 
 InternVL scales up the ViT to _**6B parameters**_ and aligns it with LLM.
 
-It is _**the largest open-source vision/vision-language foundation model (14B)**_ to date, achieving _**32 state-of-the-art**_ performances on a wide range of tasks such as visual perception, cross-modal retrieval, multimodal dialogue, etc.
-
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/4SynvLt2qH8JXFQVI_fmv.png)
-
 ## Model Details
-- **Model Type:** multimodal
+- **Model Type:** multimodal large language model (MLLM)
 - **Model Stats:**
-  - Architecture: InternViT-6B + MLP + LLaMA2-13B
+  - Architecture: [InternViT-6B-448px](https://huggingface.co/OpenGVLab/InternViT-6B-448px) + MLP + LLaMA2-13B (One of our internal SFT versions)
   - Params: 19B
   - Image size: 448 x 448
   - Number of visual tokens: 256
 
 - **Training Strategy:**
   - Pretraining Stage
-    - Learnable Component: InternViT-6B +
-    - Data: Trained on 72M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR
-    - Note: In this stage, we load the pretrained weights of InternViT-6B-224px and interpolate its position embedding to the size corresponding to 448 x 448 pixels. Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
-
-    - Learnable Component: MLP +
-    - Data: A comprehensive collection of open-source
+    - Learnable Component: InternViT-6B + LLaMA2-13B
+    - Data: Trained on 72M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR-related datasets.
+    - Note: In this stage, we load the pretrained weights of the original [InternViT-6B-224px](https://huggingface.co/OpenGVLab/InternViT-6B-224px) and interpolate its position embedding to the size corresponding to 448 x 448 pixels. Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle operation to reduce 1024 tokens to 256 tokens.
+  - Supervised Finetuning Stage
+    - Learnable Component: MLP + LLaMA2-13B
+    - Data: A comprehensive collection of open-source datasets, along with their Chinese translation versions, totaling approximately 6M samples.
 
 
 ## Model Usage
@@ -113,23 +109,7 @@ This model can also conduct an in-depth analysis of AAAI's official website and
 
 ## Evaluation
 
-
-
-\* Training set observed.
-
-| MathVista<br>(testmini) | MMB<br>(dev/test) | MMB-CN<br>(dev/test) | MMMU<br>(val/test) | CMMMU<br>(val/test) | MMVP | MME | POPE | Tiny LVLM | SEEDv1<br>(image) | LLaVA Wild | MM-Vet |
-| ----------------------- | ----------------- | -------------------- | ------------------ | ------------------- | ---- | -------------- | ---- | --------- | ----------------- | ---------- | ------ |
-| 34.5 | 76.7 / 75.4 | 71.9 / 70.3 | 39.1 / 35.3 | 34.8 / 34.0 | 44.7 | 1675.1 / 348.6 | 87.1 | 343.2 | 73.2 | 73.2 | 46.7 |
-
-**Image Captioning & Visual Question Answering**
-
-\* Training set observed.
-
-| COCO<br>(test) | Flickr30K<br>(test) | NoCaps<br>(val) | VQAv2<br>(testdev) | OKVQA<br>(val) | TextVQA<br>(val) | VizWiz<br>(val/test) | AI2D<br>(test) | GQA<br>(test) | ScienceQA<br>(image) |
-| -------------- | ------------------- | --------------- | ------------------ | -------------- | ---------------- | -------------------- | -------------- | ------------- | -------------------- |
-| 142.2\* | 85.3 | 120.8 | 80.9\* | 64.1\* | 65.9 | 59.0 / 57.3 | 72.2\* | 62.5\* | 90.1\* |
-
-- We found that incorrect images were used for training and testing in `AI2D`, meaning that for problems where `abcLabel` is True, `abc_images` were not utilized. We have now corrected the images used for testing, but the results may still be somewhat lower as a consequence.
+See [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#-evaluation) for detailed evaluation results.
 
 ## Citation
 
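The pretraining note in the updated Model Details section names two concrete operations: interpolating the InternViT-6B-224px position embeddings up to a 448 x 448 input, and pixel-shuffling 1024 visual tokens down to 256 before the MLP feeds them into LLaMA2-13B. Below is a minimal PyTorch-style sketch of what those two steps could look like, not the official InternVL implementation: the function names, the 14-pixel patch size (which is consistent with 448 / 14 = 32 and 32 x 32 = 1024 tokens), and the 3200-dim ViT hidden size are assumptions, while the 224 -> 448 resolutions and 1024 -> 256 token counts come from the README.

```python
# Minimal sketch (assumed shapes and names, not the official implementation) of the
# position-embedding interpolation and pixel-shuffle token reduction described above.
import torch
import torch.nn.functional as F


def interpolate_pos_embed(pos_embed: torch.Tensor, old_grid: int = 16, new_grid: int = 32) -> torch.Tensor:
    """Resize ViT position embeddings from an old_grid x old_grid patch grid
    (224 / 14 = 16) to a new_grid x new_grid grid (448 / 14 = 32).

    pos_embed: (1, 1 + old_grid**2, dim) -- a [CLS] slot followed by patch positions.
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # (1, N, dim) -> (1, dim, old_grid, old_grid) so bicubic interpolation can be applied.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)


def pixel_shuffle_tokens(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Merge every scale x scale neighbourhood of visual tokens into one token,
    trading spatial resolution for channel width: 1024 tokens -> 256 tokens.

    x: (batch, H * W, dim) with H = W = 32 for a 448 x 448 input.
    Returns (batch, (H // scale) * (W // scale), dim * scale**2).
    """
    b, n, c = x.shape
    h = w = int(n ** 0.5)
    x = x.reshape(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5)  # move each 2 x 2 neighbourhood next to the channel dim
    return x.reshape(b, (h // scale) * (w // scale), c * scale * scale)


if __name__ == "__main__":
    # 16 x 16 = 256 patch positions (plus [CLS]) grow to 32 x 32 = 1024 positions.
    pos = torch.randn(1, 1 + 16 * 16, 3200)           # 3200-dim ViT width is an assumption
    print(interpolate_pos_embed(pos).shape)           # torch.Size([1, 1025, 3200])

    # 1024 visual tokens from a 448 x 448 image become 256 wider tokens.
    vit_tokens = torch.randn(1, 1024, 3200)
    merged = pixel_shuffle_tokens(vit_tokens)
    print(merged.shape)                               # torch.Size([1, 256, 12800])

    # A projector (the "MLP" in the architecture line) would then map the 256 tokens
    # to the LLM embedding width; 5120 is LLaMA2-13B's hidden size. A single Linear
    # stands in here for whatever MLP the model actually uses.
    projector = torch.nn.Linear(12800, 5120)
    print(projector(merged).shape)                    # torch.Size([1, 256, 5120])
```

This ordering also explains the "Number of visual tokens: 256" entry: the token budget handed to the language model is fixed by the pixel-shuffle scale, not by the raw patch count.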