VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
Abstract
Video Highlight Detection and Moment Retrieval (HD/MR) are essential in video analysis. Recent joint-prediction transformer models often overlook cross-task dynamics as well as video-text alignment and refinement. Moreover, most models rely on limited, uni-directional attention mechanisms, resulting in weakly integrated representations and suboptimal capture of the interdependence between the video and text modalities. Although large language and vision-language models (LLMs/LVLMs) have gained prominence across many domains, their application in this field remains relatively underexplored. Here we propose VideoLights, a novel HD/MR framework that addresses these limitations through (i) Convolutional Projection and Feature Refinement modules with an alignment loss for better video-text feature alignment, (ii) a Bi-Directional Cross-Modal Fusion network for strongly coupled, query-aware clip representations, and (iii) a uni-directional joint-task feedback mechanism that enhances both tasks through their correlation. In addition, (iv) we introduce hard positive/negative losses for adaptive error penalization and improved learning, and (v) we leverage LVLMs such as BLIP-2 for enhanced multimodal feature integration and for intelligent pretraining on synthetic data generated by LVLMs. Comprehensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance. Code and models are available at https://github.com/dpaul06/VideoLights.
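The bi-directional fusion in (ii) can be pictured as two cross-attention passes, one in each direction, whose outputs are combined into query-aware clip features. Below is a minimal PyTorch sketch of that idea; it is not the authors' implementation, and the module name, dimensions, and pooling choice are illustrative assumptions.

```python
# Sketch (not the authors' code): video clips attend to query tokens and query tokens
# attend to video clips; the two contexts are fused into query-aware clip representations.
import torch
import torch.nn as nn


class BiDirectionalCrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # video-to-text attention: clips (queries) attend over text tokens (keys/values)
        self.v2t_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # text-to-video attention: text tokens attend over clips
        self.t2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.LayerNorm(dim))

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # video: (B, Nv, D) clip features; text: (B, Nt, D) query-token features
        v_ctx, _ = self.v2t_attn(query=video, key=text, value=text)   # text-conditioned clips
        t_ctx, _ = self.t2v_attn(query=text, key=video, value=video)  # video-conditioned text
        # pool the video-conditioned text and broadcast it back to every clip
        t_pooled = t_ctx.mean(dim=1, keepdim=True).expand_as(v_ctx)
        return self.fuse(torch.cat([v_ctx, t_pooled], dim=-1))        # (B, Nv, D) query-aware clips


# toy usage with projected features of matching dimensionality
clips = torch.randn(2, 75, 256)   # e.g. 75 two-second clips
tokens = torch.randn(2, 20, 256)  # e.g. 20 query tokens
out = BiDirectionalCrossModalFusion()(clips, tokens)
print(out.shape)  # torch.Size([2, 75, 256])
```

In practice, such fused clip features would feed the downstream highlight-detection and moment-retrieval heads, so both tasks see representations already conditioned on the text query.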
Community
This paper examines the efficacy of feature refinement through multimodal matching and the effectiveness of intelligent pretraining-dataset generation using LVLMs. It also explores pretraining techniques for moment retrieval and highlight detection in videos based on natural-language queries.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding (2024)
- CLIP-SCGI: Synthesized Caption-Guided Inversion for Person Re-Identification (2024)
- Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding (2024)
- Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering (2024)
- VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos (2024)
- Multi-Modal interpretable automatic video captioning (2024)
- VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding (2024)