Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
Abstract
Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect images with language. Existing methods either distill captions from LMMs or construct them from internet images or through human annotation. We propose to leverage off-the-shelf visual specialists, originally trained on annotated images for tasks other than image captioning, to enhance image captions. Our approach, named DCE, explores low-level and fine-grained object attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines these attributes into the descriptive caption. Experiments demonstrate that such visual specialists improve performance on visual understanding tasks as well as on reasoning that benefits from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists can easily be incorporated. The complete source code of the DCE pipeline and the datasets will be available at https://github.com/syp2ysy/DCE.
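To make the pipeline described above concrete, here is a minimal sketch of how per-object outputs from visual specialists (fine-grained category, depth, emotion, spatial relations, and HOI) could be collected into one structured record before being merged into a descriptive caption. This is an illustrative assumption, not the released DCE code; every class and function name below (`ObjectAnnotation`, `run_specialists`, `annotations_to_text`) is a hypothetical placeholder.

```python
# Hypothetical sketch of the specialist-annotation stage of a DCE-style pipeline.
# In a real setup each field would come from a separate pretrained model
# (detector, depth estimator, emotion classifier, HOI detector); here the
# outputs are hard-coded so the data flow is runnable end to end.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ObjectAnnotation:
    """Attributes produced by visual specialists for one detected object."""
    label: str                                            # fine-grained category
    box: Tuple[int, int, int, int]                        # (x1, y1, x2, y2) from a detector
    depth: str = "unknown"                                # coarse depth bucket, e.g. "foreground"
    emotion: str = ""                                     # only filled for people/faces
    relations: List[str] = field(default_factory=list)    # relative location and HOI phrases


def run_specialists(image_path: str) -> List[ObjectAnnotation]:
    """Stand-in for running the real visual specialists on an image."""
    return [
        ObjectAnnotation(
            label="golden retriever",
            box=(40, 120, 300, 420),
            depth="foreground",
            relations=["sitting next to a person", "left of the bench"],
        ),
        ObjectAnnotation(
            label="person",
            box=(280, 60, 520, 430),
            depth="foreground",
            emotion="happy",
            relations=["petting the dog"],   # HOI-style relation
        ),
    ]


def annotations_to_text(objects: List[ObjectAnnotation]) -> str:
    """Serialize specialist outputs into plain text that an LLM can rewrite."""
    lines = []
    for i, obj in enumerate(objects, start=1):
        parts = [f"object {i}: {obj.label}", f"depth: {obj.depth}"]
        if obj.emotion:
            parts.append(f"emotion: {obj.emotion}")
        if obj.relations:
            parts.append("relations: " + "; ".join(obj.relations))
        lines.append(", ".join(parts))
    return "\n".join(lines)


if __name__ == "__main__":
    print(annotations_to_text(run_specialists("example.jpg")))
```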
Community
DCE leverages visual specialists to replicate various human visual capabilities and subsequently employs large language models (LLMs) to simulate the human cognitive process. This combined approach enables us to generate high-quality image captions by closely mimicking the way humans perceive and interpret visual information. A rough illustration of the LLM fusion step is sketched below.
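The sketch below folds serialized specialist outputs into a caption-fusion prompt and leaves the actual LLM call as a placeholder. The prompt wording and the `call_llm` helper are assumptions for illustration only, not the paper's exact setup.

```python
# Hedged sketch of the LLM fusion step: structured specialist outputs are
# inserted into a prompt and a text-only LLM rewrites them into one fluent
# descriptive caption. The prompt text and `call_llm` are illustrative.

FUSION_PROMPT = """You are writing a detailed image caption.
Below are attributes extracted by vision models for each object
(fine-grained category, depth, emotion, spatial relations, interactions).
Merge them into one fluent, factual paragraph. Do not invent objects
that are not listed.

{specialist_notes}

Caption:"""


def build_fusion_prompt(specialist_notes: str) -> str:
    """Insert the serialized specialist outputs into the fusion prompt."""
    return FUSION_PROMPT.format(specialist_notes=specialist_notes)


def call_llm(prompt: str) -> str:
    """Placeholder for any instruction-tuned LLM endpoint (API or local)."""
    raise NotImplementedError("Plug in your LLM client here.")


if __name__ == "__main__":
    notes = (
        "object 1: golden retriever, depth: foreground, "
        "relations: sitting next to a person; left of the bench\n"
        "object 2: person, depth: foreground, emotion: happy, "
        "relations: petting the dog"
    )
    print(build_fusion_prompt(notes))
```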
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models (2024)
- Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment (2024)
- CompCap: Improving Multimodal Large Language Models with Composite Captions (2024)
- Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning (2024)
- VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information (2024)
- FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity (2024)
- TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning (2024)