MyVLM: Personalizing VLMs for User-Specific Queries
Abstract
Recent large-scale vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and generating textual descriptions for visual content. However, these models lack an understanding of user-specific concepts. In this work, we take a first step toward the personalization of VLMs, enabling them to learn and reason over user-provided concepts. For example, we explore whether these models can learn to recognize you in an image and communicate what you are doing, tailoring the model to reflect your personal experiences and relationships. To effectively recognize a variety of user-specific concepts, we augment the VLM with external concept heads that function as toggles for the model, enabling the VLM to identify the presence of specific target concepts in a given image. Having recognized the concept, we learn a new concept embedding in the intermediate feature space of the VLM. This embedding is tasked with guiding the language model to naturally integrate the target concept into its generated response. We apply our technique to BLIP-2 and LLaVA for personalized image captioning and further show its applicability to personalized visual question answering. Our experiments demonstrate our ability to generalize to unseen images of learned concepts while preserving the model behavior on unrelated inputs.
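To make the mechanism concrete, here is a minimal PyTorch sketch of the two components the abstract describes: an external concept head that detects a user-specific concept, and a learned concept embedding injected into the frozen VLM's intermediate visual features. The class and method names (`ConceptHead`, `PersonalizedVLMWrapper`, `vlm.generate(...)`) are illustrative assumptions, not the paper's released code or the actual BLIP-2/LLaVA interfaces.

```python
import torch
import torch.nn as nn


class ConceptHead(nn.Module):
    """Lightweight per-concept classifier acting as an external 'toggle'.

    Scores whether a user-specific concept appears in an image, given pooled
    features from a frozen image encoder. One head is kept per personalized
    concept; the VLM itself stays frozen.
    """

    def __init__(self, feature_dim: int = 768):
        super().__init__()
        self.classifier = nn.Linear(feature_dim, 1)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, feature_dim) pooled encoder output
        return torch.sigmoid(self.classifier(image_features)).squeeze(-1)


class PersonalizedVLMWrapper(nn.Module):
    """Wraps a frozen VLM with a single trainable concept embedding.

    When the concept head fires, the learned embedding is appended to the
    visual tokens the language model attends to, steering generation toward
    mentioning the concept (e.g., by its chosen name).
    """

    def __init__(self, vlm: nn.Module, vlm_hidden_dim: int, threshold: float = 0.5):
        super().__init__()
        self.vlm = vlm  # frozen BLIP-2 / LLaVA-style model (assumed interface)
        self.threshold = threshold
        # One trainable token living in the VLM's intermediate feature space.
        self.concept_embedding = nn.Parameter(torch.randn(1, 1, vlm_hidden_dim) * 0.02)

    def forward(self, visual_tokens: torch.Tensor, concept_score: torch.Tensor,
                prompt_ids: torch.Tensor):
        # visual_tokens: (batch, num_tokens, vlm_hidden_dim), e.g. Q-Former or
        # MLP-connector outputs fed to the language model.
        detected = concept_score > self.threshold
        if detected.any():
            batch = visual_tokens.shape[0]
            extra = self.concept_embedding.expand(batch, -1, -1)
            # Zero the added token for images where the concept is absent,
            # so unrelated inputs keep the original model behavior.
            extra = extra * detected.view(-1, 1, 1).float()
            visual_tokens = torch.cat([visual_tokens, extra], dim=1)
        # Hypothetical generate() signature; real BLIP-2 / LLaVA wrappers differ.
        return self.vlm.generate(visual_tokens=visual_tokens, prompt_ids=prompt_ids)
```

In this sketch only the concept head and the single embedding vector would be trained per concept, with the VLM kept frozen, which is one way to preserve the original behavior on unrelated inputs, as the abstract notes.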
Community
Interesting paper!
I have a question about the model architecture choice:
Does the Q-Former structure boost your method? Or can architectures without a Q-Former, using only an MLP connector, not benefit from your insight?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- RegionGPT: Towards Region Understanding Vision Language Model (2024)
- Question Aware Vision Transformer for Multimodal Reasoning (2024)
- FlexCap: Generating Rich, Localized, and Flexible Captions in Images (2024)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception (2024)
- The (R)Evolution of Multimodal Large Language Models: A Survey (2024)