MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
Abstract
Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach for generating identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by maintaining memory states that store information from a longer past context and guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion-adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
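The abstract names two concrete mechanisms: memory states summarizing a longer past context and queried via linear attention, and an emotion-adaptive layer norm conditioned on emotion detected from the audio. The PyTorch sketch below illustrates one plausible reading of each under simplifying assumptions; the class names (MemoryLinearAttention, EmotionAdaLN), the ELU+1 feature map, and all dimensions are illustrative choices and are not the authors' implementation.

```python
# Minimal, hedged sketch of the two mechanisms described in the abstract.
# All module and variable names here are hypothetical, not from the MEMO codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryLinearAttention(nn.Module):
    """Toy memory-guided temporal attention: a running memory state accumulates
    key/value information from past frames, and each frame reads it with
    linear (kernelized) attention, so the past context cost stays constant."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim)
        q = F.elu(self.to_q(frames)) + 1  # positive feature map for linear attention
        k = F.elu(self.to_k(frames)) + 1
        v = self.to_v(frames)

        b, t, d = frames.shape
        memory = torch.zeros(b, d, d, device=frames.device)   # running sum of k^T v
        normalizer = torch.zeros(b, d, device=frames.device)  # running sum of k
        outputs = []
        for i in range(t):
            # Update memory state with the current frame, then query it.
            memory = memory + k[:, i].unsqueeze(-1) * v[:, i].unsqueeze(1)
            normalizer = normalizer + k[:, i]
            num = torch.einsum("bd,bde->be", q[:, i], memory)
            den = (q[:, i] * normalizer).sum(-1, keepdim=True).clamp(min=1e-6)
            outputs.append(num / den)
        return torch.stack(outputs, dim=1)


class EmotionAdaLN(nn.Module):
    """Toy emotion-adaptive layer norm: scale and shift are predicted from an
    emotion embedding (e.g., produced by an audio emotion classifier)."""

    def __init__(self, dim: int, emotion_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(emotion_dim, 2 * dim)

    def forward(self, x: torch.Tensor, emotion: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); emotion: (batch, emotion_dim)
        scale, shift = self.to_scale_shift(emotion).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


if __name__ == "__main__":
    x = torch.randn(2, 8, 64)   # 8 frames of 64-dim latent features (toy sizes)
    emo = torch.randn(2, 16)    # emotion embedding from an audio model (assumed)
    print(MemoryLinearAttention(64)(x).shape)  # torch.Size([2, 8, 64])
    print(EmotionAdaLN(64, 16)(x, emo).shape)  # torch.Size([2, 8, 64])
```

The per-frame memory update mirrors how linear attention admits a recurrent form, which is what makes conditioning on a long past context cheap; the adaptive layer norm simply modulates normalized features with emotion-dependent scale and shift.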
Community
MEMO is a state-of-the-art open-weight model for audio-driven talking video generation.
Project Page: https://memoavatar.github.io
The following similar papers were recommended by the Semantic Scholar API:
- EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion (2024)
- LetsTalk: Latent Diffusion Transformer for Talking Video Synthesis (2024)
- Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis (2024)
- SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model (2024)
- FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait (2024)
- Takin-ADA: Emotion Controllable Audio-Driven Animation with Canonical and Landmark Loss Optimization (2024)
- Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation (2024)