arxiv:2008.10010

A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild

Published on Aug 23, 2020
Authors:
K R Prajwal, Rudrabha Mukhopadhyay, Vinay P. Namboodiri, C. V. Jawahar

Abstract

In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. Current works excel at producing accurate lip movements on a static image or videos of specific people seen during the training phase. However, they fail to accurately morph the lip movements of arbitrary identities in dynamic, unconstrained talking face videos, resulting in significant parts of the video being out-of-sync with the new audio. We identify key reasons pertaining to this and hence resolve them by learning from a powerful lip-sync discriminator. Next, we propose new, rigorous evaluation benchmarks and metrics to accurately measure lip synchronization in unconstrained videos. Extensive quantitative evaluations on our challenging benchmarks show that the lip-sync accuracy of the videos generated by our Wav2Lip model is almost as good as real synced videos. We provide a demo video clearly showing the substantial impact of our Wav2Lip model and evaluation benchmarks on our website: cvit.iiit.ac.in/research/projects/cvit-projects/a-lip-sync-expert-is-all-you-need-for-speech-to-lip-generation-in-the-wild. The code and models are released at this GitHub repository: github.com/Rudrabha/Wav2Lip. You can also try out the interactive demo at this link: bhaasha.iiit.ac.in/lipsync.

Community

Proposes Wav2Lip: it morphs the lip movements of talking faces of arbitrary identities in dynamic, unconstrained settings (visemes/visual lip movements driven by the phonemes of the audio waveform) by learning from a strong lip-sync discriminator; the generated lip-sync is almost as good as real synced video, and the generated face is blended back into the target video. Also proposes the ReSyncED benchmark for measuring lip-sync in real videos. Constrained talking-face generation works only for specific speakers seen during training (e.g., ObamaNet); the unconstrained task is: given a speech segment and a face video of any identity, generate a lip-synced version that matches the audio.

The L1 reconstruction loss and the (conditional) discriminator loss of LipGAN are not adequate; a dedicated lip-sync discriminator is needed. The lip-sync expert is a modified SyncNet: it takes the lower half of a window of face frames together with the corresponding audio, builds in-sync and out-of-sync pairs from the shared timestamps, and is trained so that in-sync pairs score high and out-of-sync pairs score low. Compared with the original SyncNet, it uses color images, a deeper model with residual connections, and a cosine-similarity/BCE loss in which the dot product between the video and speech embeddings acts as the probability of being in sync (a sketch of this loss follows below). The expert is trained on the LRS2 dataset.

The generator follows LipGAN: an identity encoder, a speech encoder, and a face decoder. The identity encoder takes a random reference segment concatenated along the channel axis with pose-prior frames of the target face whose lower half is masked; the speech encoder applies 2D convolutions to the mel-spectrogram of the audio; the combined speech and identity features (which also carry the pose) go to the decoder, built from transposed convolutions and convolutions with skip connections from the identity encoder. Frames of the temporal window are stacked along the batch axis for the generator and along the channel axis for the expert discriminator, which sees only the lower half of the face; the expert's output probability gives the sync loss for the generator (see the stacking and objective sketches below). A second discriminator is trained for visual quality, to avoid blurry regions. The generator minimizes a combination of the expert sync loss, the visual-quality GAN loss, and the reconstruction loss; only the generator and the quality discriminator are trained (the lip-sync expert stays frozen), again on LRS2. At inference, the current frame itself is used as the identity reference instead of a random segment, and the target audio is left unchanged.

SSIM, PSNR, and LMD are not specific to lip-sync; the proposed LSE-D (lip-sync error as a distance between lip and audio representations) and LSE-C (average confidence score) measure it better. The extra visual-quality discriminator costs a slight drop in lip-sync scores but improves visual quality. ReSyncED, built from real YouTube videos, is used for evaluation and has dubbed, random-pair, and TTS-generated subsets (Google Translate for the text, DeepVoice for the TTS). Wav2Lip receives the highest human preference ratings, over LipGAN and Speech2Vid. A larger temporal window makes the discriminator stronger. From IIIT Hyderabad (CVIT).
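
A minimal PyTorch sketch of the expert's cosine-similarity/BCE objective described above. The function names, the 512-dimensional embeddings, and the clamping are illustrative assumptions, not taken from the released code; the only ingredients carried over from the summary are the dot-product similarity between the face-window and speech embeddings and the binary cross-entropy against in-/out-of-sync labels.

```python
import torch
import torch.nn.functional as F

def sync_probability(video_emb: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between the face-window and speech embeddings,
    read as P(in-sync). Embeddings are assumed non-negative (e.g. after a
    ReLU), so the similarity already lies in [0, 1]."""
    return F.cosine_similarity(video_emb, audio_emb, dim=1)

def expert_sync_loss(video_emb, audio_emb, is_synced):
    """BCE between the similarity 'probability' and the in-/out-of-sync label."""
    p = sync_probability(video_emb, audio_emb).clamp(1e-7, 1 - 1e-7)
    return F.binary_cross_entropy(p, is_synced.float())

# toy usage: a batch of 8 embedding pairs, half in-sync, half out-of-sync
v = torch.rand(8, 512)   # embedding of a stacked lower-half face window
a = torch.rand(8, 512)   # embedding of the matching mel-spectrogram window
y = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0])
loss = expert_sync_loss(v, a, y)
```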
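
The batch-versus-channel stacking of the temporal window can be shown with plain tensor reshapes. The window length of 5 frames and the 96x96 face crop below are assumptions for illustration, not figures quoted from the paper.

```python
import torch

B, T, C, H, W = 4, 5, 3, 96, 96        # batch, temporal window, RGB channels, face crop
window = torch.rand(B, T, C, H, W)     # consecutive face frames for each sample

# Generator: every frame is generated independently, so time folds into the batch axis.
gen_input = window.reshape(B * T, C, H, W)                     # (20, 3, 96, 96)

# Expert discriminator: the whole window is judged at once, so time folds into the
# channel axis, and only the lower half of each face is kept.
disc_input = window.reshape(B, T * C, H, W)[:, :, H // 2:, :]  # (4, 15, 48, 96)
```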
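
One way to read the combined generator objective from the summary (reconstruction + frozen-expert sync loss + visual-quality GAN loss) as code. The weights and the convention that the reconstruction term takes the remaining weight are illustrative assumptions, not the paper's reported hyperparameters.

```python
import torch
import torch.nn.functional as F

def generator_loss(pred_frames, target_frames,
                   expert_sync_prob, quality_disc_prob,
                   sync_wt=0.03, gan_wt=0.07):
    """Weighted sum of the three generator terms listed in the summary.
    `expert_sync_prob` is the frozen lip-sync expert's P(in-sync) for the
    generated lower-half frames; `quality_disc_prob` is the visual-quality
    discriminator's P(real). The weights here are illustrative defaults."""
    eps = 1e-7
    recon = F.l1_loss(pred_frames, target_frames)
    # push the expert's sync probability toward 1 for generated frames
    sync = F.binary_cross_entropy(expert_sync_prob.clamp(eps, 1 - eps),
                                  torch.ones_like(expert_sync_prob))
    # non-saturating GAN term against the visual-quality discriminator
    gan = F.binary_cross_entropy(quality_disc_prob.clamp(eps, 1 - eps),
                                 torch.ones_like(quality_disc_prob))
    return (1.0 - sync_wt - gan_wt) * recon + sync_wt * sync + gan_wt * gan
```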

Links: Website (CVIT page), arxiv, YouTube, Colab, GitHub


Models citing this paper 3

Datasets citing this paper 0


Spaces citing this paper 6

Collections including this paper 1