4M: Massively Multimodal Masked Modeling
Generate realistic talking heads from an image and audio
High-quality image generation in 3 seconds