czczup commited on
Commit
ccc6078
Β·
verified Β·
1 Parent(s): 072535f

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +5 -1
README.md CHANGED
@@ -16,7 +16,11 @@ pipeline_tag: visual-question-answering
16
 
17
  [\[πŸ€— HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[πŸš€ Quick Start\]](#model-usage) [\[🌐 Community-hosted API\]](https://rapidapi.com/adushar1320/api/internvl-chat) [\[πŸ“– 中文解读\]](https://zhuanlan.zhihu.com/p/675877376)
18
 
19
- We released InternVL-Chat-V1-1, featuring a structure similar to LLaVA, including a ViT, an MLP projector, and an LLM. In this version, we explored increasing the resolution to 448x448, enhancing OCR capabilities, and improving support for Chinese conversations.
 
 
 
 
20
 
21
  ## Model Details
22
  - **Model Type:** multimodal large language model (MLLM)
 
16
 
17
  [\[πŸ€— HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL) [\[πŸš€ Quick Start\]](#model-usage) [\[🌐 Community-hosted API\]](https://rapidapi.com/adushar1320/api/internvl-chat) [\[πŸ“– 中文解读\]](https://zhuanlan.zhihu.com/p/675877376)
18
 
19
+ We released InternVL-Chat-V1-1, featuring a structure similar to LLaVA, including a ViT, an MLP projector, and an LLM. As shown in the figure below, we connected our InternViT-6B to LLaMA2-13B through a simple MLP projector. Note that the LLaMA2-13B used here is not the original model but an internal chat version obtained by incrementally pre-training and fine-tuning the LLaMA2-13B base model for Chinese language tasks. Overall, our model has a total of 19 billion parameters.
20
+
21
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/HD29tU-g0An9FpQn1yK8X.png" style="width: 50%;">
22
+
23
+ In this version, we explored increasing the resolution to 448 Γ— 448, enhancing OCR capabilities, and improving support for Chinese conversations. Since the 448 Γ— 448 input image generates 1024 visual tokens after passing through the ViT, leading to a significant computational burden, we use a pixel shuffle operation to reduce the 1024 tokens to 256 tokens.
24
 
25
  ## Model Details
26
  - **Model Type:** multimodal large language model (MLLM)