arxiv:2408.16500

CogVLM2: Visual Language Models for Image and Video Understanding

Published on Aug 29 · Submitted by akhaliq on Aug 30

#2 Paper of the day

Authors:

Wenyi Hong, Wenmeng Yu, Qingsong Lv, Yean Cheng, Shiyu Huang, Zhuoyi Yang, Guanyu Feng, Xixuan Song

Abstract

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to 1344 × 1344 pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
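
To make the idea of "multi-frame input with timestamps" concrete, the following is a minimal illustrative sketch, not the CogVLM2-Video implementation. It assumes uniform sampling at segment midpoints and a made-up text tag format; the function names `sample_frame_timestamps` and `tag_frames` are hypothetical, and the sampling strategy and timestamp encoding used by the actual model may differ.

```python
from typing import List, Tuple


def sample_frame_timestamps(duration_s: float, num_frames: int) -> List[Tuple[int, float]]:
    """Return (frame_index, timestamp_in_seconds) pairs, evenly spaced.

    Assumption: the video is split into num_frames equal segments and one
    frame is taken at the midpoint of each segment.
    """
    if duration_s <= 0 or num_frames <= 0:
        raise ValueError("duration_s and num_frames must be positive")
    step = duration_s / num_frames
    return [(i, round((i + 0.5) * step, 2)) for i in range(num_frames)]


def tag_frames(samples: List[Tuple[int, float]]) -> List[str]:
    """Render each sampled frame as a text tag carrying its timestamp.

    The tag format is hypothetical; it only illustrates pairing each
    frame with the time at which it occurs.
    """
    return [f"[frame {i} @ {t:.2f}s]" for i, t in samples]


if __name__ == "__main__":
    # Sample 4 frames from an 8-second clip and print their tags.
    for tag in tag_frames(sample_frame_timestamps(8.0, 4)):
        print(tag)
```

Pairing each frame with its time in the video is what allows a model to answer questions such as when an event occurs, which is the kind of temporal grounding the abstract refers to.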