czczup committed
Commit 33f58e3
Parent: 5b53772

Update README.md

Files changed (1): README.md (+11 -11)
README.md CHANGED
@@ -10,7 +10,7 @@ datasets:
 pipeline_tag: visual-question-answering
 ---

-# Model Card for InternVL-Chat-V1.2
+# Model Card for InternVL-Chat-V1-2
 <p align="center">
 <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/k0tma4PhPFrwJvpS_gVQf.webp" alt="Image Description" width="300" height="300">
 </p>
@@ -19,7 +19,7 @@ pipeline_tag: visual-question-answering

 [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[🚀 Quick Start\]](#model-usage) [\[🌐 Community-hosted API\]](https://rapidapi.com/adushar1320/api/internvl-chat) [\[📖 中文解读\]](https://zhuanlan.zhihu.com/p/675877376)

-We are excited to introduce InternVL-Chat-V1.2. Inspired by [LLaVA-NeXT-34B](https://llava-vl.github.io/blog/2024-01-30-llava-next/), we have also adopted [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as the language model. Below is the pipeline.
+We are excited to introduce InternVL-Chat-V1-2. Inspired by [LLaVA-NeXT-34B](https://llava-vl.github.io/blog/2024-01-30-llava-next/), we have also adopted [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B) as the language model. Below is the pipeline.

 <p align="center">
 <img width="600" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/GIEKCvNc1Y5iMQqLv645p.png">
@@ -50,10 +50,10 @@ For better training reproducibility, we follow the minimalist design and data ef

 | Model | Vision Foundation Model | Release Date | Note |
 | :---: | :---: | :---: | :--- |
-| InternVL-Chat-V1.5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | supports 4K images; super strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
-| InternVL-Chat-V1.2-Plus (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and stronger performance |
-| InternVL-Chat-V1.2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scales the LLM up to 34B |
-| InternVL-Chat-V1.1 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | supports Chinese and stronger OCR |
+| InternVL-Chat-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5)) | InternViT-6B-448px-V1-5 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-5)) | 2024.04.18 | supports 4K images; super strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
+| InternVL-Chat-V1-2-Plus (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2-Plus)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.21 | more SFT data and stronger performance |
+| InternVL-Chat-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2)) | InternViT-6B-448px-V1-2 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2)) | 2024.02.11 | scales the LLM up to 34B |
+| InternVL-Chat-V1-1 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-1)) | InternViT-6B-448px-V1-0 (🤗 [HF link](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0)) | 2024.01.24 | supports Chinese and stronger OCR |



@@ -70,9 +70,9 @@ For better training reproducibility, we follow the minimalist design and data ef
 | Qwen−VL−Max\* | unknown | 51.4 | 46.8 | 51.0 | 77.6 | 75.7 | - | - | - | - | 79.5 | - | - | - |
 | | | | | | | | | | | | | | | |
 | LLaVA−NEXT−34B | 672x672 | 51.1 | 44.7 | 46.5 | 79.3 | 79.0 | - | 1631/397 | 81.8 | 87.7 | 69.5 | 75.9 | 63.8 | 67.1 |
-| InternVL−Chat−V1.2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1687/489 | 83.3 | 88.0 | 72.5 | 75.6 | 60.0 | 64.0 |
+| InternVL−Chat−V1-2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1687/489 | 83.3 | 88.0 | 72.5 | 75.6 | 60.0 | 64.0 |

-- In most benchmarks, InternVL-Chat-V1.2 achieves better performance than LLaVA-NeXT-34B.
+- In most benchmarks, InternVL-Chat-V1-2 achieves better performance than LLaVA-NeXT-34B.
 - Update (2024-04-21): We have fixed a bug in the evaluation code, and the TextVQA result has been corrected to 72.5.

@@ -80,7 +80,7 @@ For better training reproducibility, we follow the minimalist design and data ef

 ### Data Preparation

-Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1.2, utilizing approximately 1.2M visual instruction tuning samples in total, all of which are fully open-source. In a macro sense, we build upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrate [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.
+Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1-2, utilizing approximately 1.2M visual instruction tuning samples in total, all of which are fully open-source. In a macro sense, we build upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrate [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.

 For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).

@@ -95,14 +95,14 @@ The hyperparameters used for finetuning are listed in the following table.

 | Hyperparameter | Trainable Param | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
 | ------------------ | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
-| InternVL−Chat−V1.2 | 40B (full model) | 512 | 1e-5 | 1 | 2048 | 0.05 |
+| InternVL−Chat−V1-2 | 40B (full model) | 512 | 1e-5 | 1 | 2048 | 0.05 |




 ## Model Usage

-We provide example code to run InternVL-Chat-V1.2 using `transformers`.
+We provide example code to run InternVL-Chat-V1-2 using `transformers`.

 You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.

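For reference, such usage typically looks like the minimal sketch below. It assumes the checkpoint's custom preprocessing and `chat()` helper, loaded via `trust_remote_code=True`; treat the exact names and signatures as assumptions and defer to the code shipped with the repository.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = "OpenGVLab/InternVL-Chat-V1-2"

# The modeling code ships with the checkpoint, hence trust_remote_code=True.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
image_processor = CLIPImageProcessor.from_pretrained(path)

# The model operates at a fixed 448x448 input resolution (see tables above).
image = Image.open("./example.jpg").convert("RGB").resize((448, 448))
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
question = "Please describe the image in detail."
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```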
 
 