# CogVLM2 Movie Caption LoRA

👋 Wechat · 💡 Online Demo · 🎈 Github Page · 📑 Paper

πŸ“Experience the larger-scale CogVLM model on the ZhipuAI Open Platform.

## Model introduction

We are launching the new-generation **CogVLM2** series of models and open-sourcing two models built on [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). Compared with the previous generation of open-source CogVLM models, the CogVLM2 series brings the following improvements (a minimal usage sketch follows the benchmark table at the end of this section):

1. Significant improvements on many benchmarks, such as `TextVQA` and `DocVQA`.
2. Support for **8K** content length.
3. Support for image resolutions up to **1344 * 1344**.
4. An open-source version that supports both **Chinese and English**.

The details of the CogVLM2 family of open-source models are given in the table below:

| Model name       | cogvlm2-llama3-chat-19B             | cogvlm2-llama3-chinese-chat-19B     |
|------------------|-------------------------------------|-------------------------------------|
| Base Model       | Meta-Llama-3-8B-Instruct            | Meta-Llama-3-8B-Instruct            |
| Language         | English                             | Chinese, English                    |
| Model size       | 19B                                 | 19B                                 |
| Task             | Image understanding, dialogue model | Image understanding, dialogue model |
| Text length      | 8K                                  | 8K                                  |
| Image resolution | 1344 * 1344                         | 1344 * 1344                         |

## Benchmark

Our open-source models achieve strong results on many leaderboards compared with the previous generation of open-source CogVLM models, and their performance is competitive with some closed-source models, as shown in the table below:

| Model                      | Open Source | LLM Size | TextVQA  | DocVQA   | ChartQA  | OCRbench | VCR_EASY | VCR_HARD | MMMU     | MMVet    | MMBench  |
|----------------------------|-------------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|
| CogVLM1.1                  | ✅          | 7B       | 69.7     | -        | 68.3     | 590      | 73.9     | 34.6     | 37.3     | 52.0     | 65.8     |
| LLaVA-1.5                  | ✅          | 13B      | 61.3     | -        | -        | 337      | -        | -        | 37.0     | 35.4     | 67.7     |
| Mini-Gemini                | ✅          | 34B      | 74.1     | -        | -        | -        | -        | -        | 48.0     | 59.3     | 80.6     |
| LLaVA-NeXT-LLaMA3          | ✅          | 8B       | -        | 78.2     | 69.5     | -        | -        | -        | 41.7     | -        | 72.1     |
| LLaVA-NeXT-110B            | ✅          | 110B     | -        | 85.7     | 79.7     | -        | -        | -        | 49.1     | -        | 80.5     |
| InternVL-1.5               | ✅          | 20B      | 80.6     | 90.9     | **83.8** | 720      | 14.7     | 2.0      | 46.8     | 55.4     | **82.3** |
| QwenVL-Plus                | ❌          | -        | 78.9     | 91.4     | 78.1     | 726      | -        | -        | 51.4     | 55.7     | 67.0     |
| Claude3-Opus               | ❌          | -        | -        | 89.3     | 80.8     | 694      | 63.85    | 37.8     | **59.4** | 51.7     | 63.3     |
| Gemini Pro 1.5             | ❌          | -        | 73.5     | 86.5     | 81.3     | -        | 62.73    | 28.1     | 58.5     | -        | -        |
| GPT-4V                     | ❌          | -        | 78.0     | 88.4     | 78.5     | 656      | 52.04    | 25.8     | 56.8     | **67.7** | 75.0     |
| **CogVLM2-LLaMA3**         | ✅          | 8B       | 84.2     | **92.3** | 81.0     | 756      | **83.3** | **38.0** | 44.3     | 60.4     | 80.5     |
| **CogVLM2-LLaMA3-Chinese** | ✅          | 8B       | **85.0** | 88.4     | 74.7     | **780**  | 79.9     | 25.1     | 42.8     | 60.5     | 78.9     |

All benchmark results were obtained without using any external OCR tools ("pixel only").
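For reference, below is a minimal inference sketch, assuming the remote-code API that the CogVLM2 repositories expose through `transformers` (loading with `trust_remote_code=True` and building inputs via the model's `build_conversation_input_ids` helper). The model path, image file, and prompt are placeholders; adapt them to the checkpoint you actually use, such as this movie caption LoRA.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; swap in the checkpoint you actually use.
MODEL_PATH = "THUDM/cogvlm2-llama3-chat-19B"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
TORCH_TYPE = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=TORCH_TYPE,
    trust_remote_code=True,
).to(DEVICE).eval()

# Placeholder image and prompt.
image = Image.open("movie_still.jpg").convert("RGB")
query = "Describe this movie still in detail."

# build_conversation_input_ids is provided by the model's remote code.
input_by_model = model.build_conversation_input_ids(
    tokenizer, query=query, history=[], images=[image], template_version="chat"
)
inputs = {
    "input_ids": input_by_model["input_ids"].unsqueeze(0).to(DEVICE),
    "token_type_ids": input_by_model["token_type_ids"].unsqueeze(0).to(DEVICE),
    "attention_mask": input_by_model["attention_mask"].unsqueeze(0).to(DEVICE),
    "images": [[input_by_model["images"][0].to(DEVICE).to(TORCH_TYPE)]],
}

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=2048)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```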