Update README.md

README.md CHANGED

@@ -18,26 +18,22 @@ pipeline_tag: visual-question-answering
 
 InternVL scales up the ViT to _**6B parameters**_ and aligns it with LLM.
 
-It is _**the largest open-source vision/vision-language foundation model (14B)**_ to date, achieving _**32 state-of-the-art**_ performances on a wide range of tasks such as visual perception, cross-modal retrieval, multimodal dialogue, etc.
-
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/4SynvLt2qH8JXFQVI_fmv.png)
-
 ## Model Details
-- **Model Type:** multimodal
+- **Model Type:** multimodal large language model (MLLM)
 - **Model Stats:**
-  - Architecture: InternViT-6B + MLP + LLaMA2-13B
+  - Architecture: [InternViT-6B-448px](https://huggingface.co/OpenGVLab/InternViT-6B-448px) + MLP + LLaMA2-13B (One of our internal SFT versions)
   - Params: 19B
   - Image size: 448 x 448
   - Number of visual tokens: 256
 
 - **Training Strategy:**
   - Pretraining Stage
-    - Learnable Component: InternViT-6B +
-    - Data: Trained on 72M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR
-    - Note: In this stage, we load the pretrained weights of InternViT-6B-224px and interpolate its position embedding to the size corresponding to 448 x 448 pixels. Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
-
-    - Learnable Component: MLP +
-    - Data: A comprehensive collection of open-source
+    - Learnable Component: InternViT-6B + LLaMA2-13B
+    - Data: Trained on 72M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR-related datasets.
+    - Note: In this stage, we load the pretrained weights of the original [InternViT-6B-224px](https://huggingface.co/OpenGVLab/InternViT-6B-224px) and interpolate its position embedding to the size corresponding to 448 x 448 pixels. Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle operation to reduce 1024 tokens to 256 tokens.
+  - Supervised Finetuning Stage
+    - Learnable Component: MLP + LLaMA2-13B
+    - Data: A comprehensive collection of open-source datasets, along with their Chinese translation versions, totaling approximately 6M samples.
 
 
 ## Model Usage
@@ -113,23 +109,7 @@ This model can also conduct an in-depth analysis of AAAI's official website and
 
 ## Evaluation
 
-
-
-\* Training set observed.
-
-| MathVista<br>(testmini) | MMB<br>(dev/test) | MMB-CN<br>(dev/test) | MMMU<br>(val/test) | CMMMU<br>(val/test) | MMVP | MME | POPE | Tiny LVLM | SEEDv1<br>(image) | LLaVA Wild | MM-Vet |
-| ----------------------- | ----------------- | -------------------- | ------------------ | ------------------- | ---- | -------------- | ---- | --------- | ----------------- | ---------- | ------ |
-| 34.5 | 76.7 / 75.4 | 71.9 / 70.3 | 39.1 / 35.3 | 34.8 / 34.0 | 44.7 | 1675.1 / 348.6 | 87.1 | 343.2 | 73.2 | 73.2 | 46.7 |
-
-**Image Captioning & Visual Question Answering**
-
-\* Training set observed.
-
-| COCO<br>(test) | Flickr30K<br>(test) | NoCaps<br>(val) | VQAv2<br>(testdev) | OKVQA<br>(val) | TextVQA<br>(val) | VizWiz<br>(val/test) | AI2D<br>(test) | GQA<br>(test) | ScienceQA<br>(image) |
-| -------------- | ------------------- | --------------- | ------------------ | -------------- | ---------------- | -------------------- | -------------- | ------------- | -------------------- |
-| 142.2\* | 85.3 | 120.8 | 80.9\* | 64.1\* | 65.9 | 59.0 / 57.3 | 72.2\* | 62.5\* | 90.1\* |
-
-- We found that incorrect images were used for training and testing in `AI2D`, meaning that for problems where `abcLabel` is True, `abc_images` were not utilized. We have now corrected the images used for testing, but the results may still be somewhat lower as a consequence.
+See [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#-evaluation) for detailed evaluation results.
 
 ## Citation
 
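The pretraining note in the updated Model Details section names two concrete operations: interpolating the InternViT-6B-224px position embeddings up to a 448 x 448 input, and pixel-shuffling 1024 visual tokens down to 256 before the MLP feeds them into LLaMA2-13B. Below is a minimal PyTorch-style sketch of what those two steps could look like, not the official InternVL implementation: the function names, the 14-pixel patch size (which is consistent with 448 / 14 = 32 and 32 x 32 = 1024 tokens), and the 3200-dim ViT hidden size are assumptions, while the 224 -> 448 resolutions and 1024 -> 256 token counts come from the README.

```python
# Minimal sketch (assumed shapes and names, not the official implementation) of the
# position-embedding interpolation and pixel-shuffle token reduction described above.
import torch
import torch.nn.functional as F


def interpolate_pos_embed(pos_embed: torch.Tensor, old_grid: int = 16, new_grid: int = 32) -> torch.Tensor:
    """Resize ViT position embeddings from an old_grid x old_grid patch grid
    (224 / 14 = 16) to a new_grid x new_grid grid (448 / 14 = 32).

    pos_embed: (1, 1 + old_grid**2, dim) -- a [CLS] slot followed by patch positions.
    """
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # (1, N, dim) -> (1, dim, old_grid, old_grid) so bicubic interpolation can be applied.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos], dim=1)


def pixel_shuffle_tokens(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Merge every scale x scale neighbourhood of visual tokens into one token,
    trading spatial resolution for channel width: 1024 tokens -> 256 tokens.

    x: (batch, H * W, dim) with H = W = 32 for a 448 x 448 input.
    Returns (batch, (H // scale) * (W // scale), dim * scale**2).
    """
    b, n, c = x.shape
    h = w = int(n ** 0.5)
    x = x.reshape(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5)  # move each 2 x 2 neighbourhood next to the channel dim
    return x.reshape(b, (h // scale) * (w // scale), c * scale * scale)


if __name__ == "__main__":
    # 16 x 16 = 256 patch positions (plus [CLS]) grow to 32 x 32 = 1024 positions.
    pos = torch.randn(1, 1 + 16 * 16, 3200)           # 3200-dim ViT width is an assumption
    print(interpolate_pos_embed(pos).shape)           # torch.Size([1, 1025, 3200])

    # 1024 visual tokens from a 448 x 448 image become 256 wider tokens.
    vit_tokens = torch.randn(1, 1024, 3200)
    merged = pixel_shuffle_tokens(vit_tokens)
    print(merged.shape)                               # torch.Size([1, 256, 12800])

    # A projector (the "MLP" in the architecture line) would then map the 256 tokens
    # to the LLM embedding width; 5120 is LLaMA2-13B's hidden size. A single Linear
    # stands in here for whatever MLP the model actually uses.
    projector = torch.nn.Linear(12800, 5120)
    print(projector(merged).shape)                    # torch.Size([1, 256, 5120])
```

This ordering also explains the "Number of visual tokens: 256" entry: the token budget handed to the language model is fixed by the pixel-shuffle scale, not by the raw patch count.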