CogVLM2 Movie Caption LoRA

👋 WeChat · 💡 Online Demo · 🎈 GitHub Page · 📑 Paper

📍Experience the larger-scale CogVLM model on the ZhipuAI Open Platform.

Model introduction

We are releasing CogVLM2, a new generation of the CogVLM series, and open-sourcing two models built on Meta-Llama-3-8B-Instruct. Compared with the previous generation of open-source CogVLM models, the CogVLM2 open-source models offer the following improvements:

Significant improvements on many benchmarks, such as TextVQA and DocVQA.
Support for 8K text length.
Support for image resolutions up to 1344 * 1344.
An open-source model version that supports both Chinese and English.

The table below lists the details of the CogVLM2 family of open-source models:

Model name	cogvlm2-llama3-chat-19B	cogvlm2-llama3-chinese-chat-19B
Base Model	Meta-Llama-3-8B-Instruct	Meta-Llama-3-8B-Instruct
Language	English	Chinese, English
Model size	19B	19B
Task	Image understanding, dialogue model	Image understanding, dialogue model
Text length	8K	8K
Image resolution	1344 * 1344	1344 * 1344
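
To make the 1344 * 1344 resolution limit listed above concrete, here is a minimal sketch of a client-side size check. The helper name `fit_within` and the choice to downscale while preserving aspect ratio are assumptions made for illustration; the model's own preprocessing may handle image sizes differently.

```python
from typing import Tuple

# Maximum supported image side, taken from the table above.
MAX_SIDE = 1344


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> Tuple[int, int]:
    """Return (width, height) scaled down so neither side exceeds max_side.

    Hypothetical helper: images already within the limit are returned unchanged,
    and larger images keep their aspect ratio (integer arithmetic, rounded down).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    new_width = max(1, width * max_side // longest)
    new_height = max(1, height * max_side // longest)
    return new_width, new_height


if __name__ == "__main__":
    print(fit_within(4032, 3024))  # a typical phone photo, scaled to fit
    print(fit_within(800, 600))    # already within the limit
```

The resulting size could then be passed to an image library's resize call before the image is sent to the model.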

Benchmark

Our open-source models achieve better results on many benchmarks than the previous generation of open-source CogVLM models. Their performance is competitive with some closed-source models, as shown in the table below:

Model	Open Source	LLM Size	TextVQA	DocVQA	ChartQA	OCRbench	VCR_EASY	VCR_HARD	MMMU	MMVet	MMBench
CogVLM1.1	✅	7B	69.7	-	68.3	590	73.9	34.6	37.3	52.0	65.8
LLaVA-1.5	✅	13B	61.3	-	-	337	-	-	37.0	35.4	67.7
Mini-Gemini	✅	34B	74.1	-	-	-	-	-	48.0	59.3	80.6
LLaVA-NeXT-LLaMA3	✅	8B	-	78.2	69.5	-	-	-	41.7	-	72.1
LLaVA-NeXT-110B	✅	110B	-	85.7	79.7	-	-	-	49.1	-	80.5
InternVL-1.5	✅	20B	80.6	90.9	83.8	720	14.7	2.0	46.8	55.4	82.3
QwenVL-Plus	❌	-	78.9	91.4	78.1	726	-	-	51.4	55.7	67.0
Claude3-Opus	❌	-	-	89.3	80.8	694	63.85	37.8	59.4	51.7	63.3
Gemini Pro 1.5	❌	-	73.5	86.5	81.3	-	62.73	28.1	58.5	-	-
GPT-4V	❌	-	78.0	88.4	78.5	656	52.04	25.8	56.8	67.7	75.0
CogVLM2-LLaMA3	✅	8B	84.2	92.3	81.0	756	83.3	38.0	44.3	60.4	80.5
CogVLM2-LLaMA3-Chinese	✅	8B	85.0	88.4	74.7	780	79.9	25.1	42.8	60.5	78.9

All evaluation results were obtained without using any external OCR tools ("pixel only").
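
The comparison with the previous generation can be read off the table directly. The short sketch below computes the per-benchmark difference between the CogVLM2-LLaMA3 and CogVLM1.1 rows. The scores are copied from the table above; the function name `score_deltas` is an illustrative assumption, and benchmarks where either model has no reported score ("-") are skipped.

```python
from typing import Dict, List, Optional

BENCHMARKS = [
    "TextVQA", "DocVQA", "ChartQA", "OCRbench", "VCR_EASY",
    "VCR_HARD", "MMMU", "MMVet", "MMBench",
]

# Scores copied from the table above; None marks a "-" (not reported) entry.
COGVLM1_1: List[Optional[float]] = [69.7, None, 68.3, 590, 73.9, 34.6, 37.3, 52.0, 65.8]
COGVLM2_LLAMA3: List[Optional[float]] = [84.2, 92.3, 81.0, 756, 83.3, 38.0, 44.3, 60.4, 80.5]


def score_deltas(old: List[Optional[float]], new: List[Optional[float]]) -> Dict[str, float]:
    """Return new - old for each benchmark where both scores are reported."""
    deltas: Dict[str, float] = {}
    for name, old_score, new_score in zip(BENCHMARKS, old, new):
        if old_score is None or new_score is None:
            continue  # skip benchmarks without a reported score on either side
        deltas[name] = round(new_score - old_score, 1)
    return deltas


if __name__ == "__main__":
    for name, delta in score_deltas(COGVLM1_1, COGVLM2_LLAMA3).items():
        print(f"{name}: {delta:+}")
```

Note that OCRbench is reported on a different scale from the other benchmarks, so its difference is not directly comparable to the others.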