---
license: apache-2.0
datasets:
- Loie/VGGSound
base_model:
- riffusion/riffusion-model-v1
pipeline_tag: video-to-audio
tags:
- video2audio
---

# Kandinsky-4-v2a: A Video-to-Audio Pipeline

Kandinsky 4.0 Post | Project Page | Technical Report | GitHub | Kandinsky 4.0 T2V Flash HuggingFace | Kandinsky 4.0 V2A HuggingFace
## Description

The Video-to-Audio pipeline consists of a visual encoder, a text encoder, a UNet diffusion model that generates a spectrogram, and the Griffin-Lim algorithm that converts the spectrogram into audio. The visual and text encoders share the same multimodal vision-language decoder ([cogvlm2-video-llama3-chat](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat)).

Our UNet diffusion model is a fine-tune of the music generation model [riffusion](https://huggingface.co/riffusion/riffusion-model-v1). We modified the architecture to condition on video frames and to improve synchronization between video and audio, and we replaced the original text encoder with the decoder of [cogvlm2-video-llama3-chat](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/mLXroYZt8X2brCDGPcPJZ.png)

## Installation

```bash
git clone https://github.com/ai-forever/Kandinsky-4.git
cd Kandinsky-4
conda install -c conda-forge ffmpeg -y
pip install -r kandinsky4_video2audio/requirements.txt
pip install "git+https://github.com/facebookresearch/pytorchvideo.git"
```

## Inference

Inference code for Video-to-Audio:

```python
import torch
import torchvision

from kandinsky4_video2audio.video2audio_pipe import Video2AudioPipeline
from kandinsky4_video2audio.utils import load_video, create_video

device = 'cuda:0'

# Load the pipeline weights from Hugging Face.
pipe = Video2AudioPipeline(
    "ai-forever/kandinsky-4-v2a",
    torch_dtype=torch.float16,
    device=device
)

# Read the input video; the third return value is an info dict
# that holds the frame rate under 'video_fps'.
video_path = 'assets/inputs/1.mp4'
video, _, fps = torchvision.io.read_video(video_path)

prompt = "clean. clear. good quality."
negative_prompt = "hissing noise. drumming rhythm. saying. poor quality."

# Prepare the video for the model: at most 96 frames / 12 seconds.
video_input, video_complete, duration_sec = load_video(
    video,
    fps['video_fps'],
    num_frames=96,
    max_duration_sec=12
)

# Generate audio conditioned on the video frames and the text prompt.
out = pipe(
    video_input,
    prompt,
    negative_prompt=negative_prompt,
    duration_sec=duration_sec,
)[0]

# Combine the generated audio with the original video and save it.
save_path = 'assets/outputs/1.mp4'
create_video(
    out,
    video_complete,
    display_video=True,
    save_path=save_path,
    device=device
)
```
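The `negative_prompt` argument steers generation through classifier-free guidance, the standard mechanism in Stable-Diffusion-derived models such as riffusion. Below is a minimal sketch of the guidance step under that assumption; the function, tensor shapes, and scale are illustrative, not Kandinsky-4-v2a internals:

```python
import torch

def guided_noise(noise_cond: torch.Tensor,
                 noise_uncond: torch.Tensor,
                 guidance_scale: float = 7.5) -> torch.Tensor:
    """Classifier-free guidance: move the denoiser's prediction toward
    the positive prompt and away from the negative one."""
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

# Stand-in UNet outputs; in a real pipeline these would be predictions
# conditioned on `prompt` and on `negative_prompt`, respectively.
noise_cond = torch.randn(1, 4, 64, 64)
noise_uncond = torch.randn(1, 4, 64, 64)
print(guided_noise(noise_cond, noise_uncond).shape)  # torch.Size([1, 4, 64, 64])
```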
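The final stage of the pipeline, inverting the generated spectrogram to a waveform with Griffin-Lim, can be reproduced in isolation with `torchaudio`. A minimal sketch, assuming a magnitude spectrogram as input; the STFT parameters below are illustrative and may not match the pipeline's actual settings:

```python
import torch
import torchaudio

# Illustrative STFT settings, not the pipeline's actual configuration.
n_fft = 1024
hop_length = 256
sample_rate = 16000

# Stand-in spectrogram with shape (freq_bins, time_frames); in the
# pipeline this is produced by the UNet.
spec = torch.rand(n_fft // 2 + 1, 512)

# Griffin-Lim iteratively estimates the phase that a magnitude
# spectrogram discards, then inverts the STFT to a waveform.
griffin_lim = torchaudio.transforms.GriffinLim(
    n_fft=n_fft,
    hop_length=hop_length,
    n_iter=32,
)
waveform = griffin_lim(spec)  # shape: (num_samples,)

torchaudio.save("reconstructed.wav", waveform.unsqueeze(0), sample_rate)
```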
## Authors

+ Zein Shaheen: [GitHub](https://github.com/zeinsh)
+ Arseniy Shakhmatov: [GitHub](https://github.com/cene555), [Blog](https://t.me/gradientdip)
+ Ivan Kirillov: [GitHub](https://github.com/funnylittleman)
+ Andrei Shutkin: [GitHub](https://github.com/maleficxp)
+ Denis Parkhomenko: [GitHub](https://github.com/nihao88)
+ Julia Agafonova: [GitHub](https://github.com/Julia132)
+ Andrey Kuznetsov: [GitHub](https://github.com/kuznetsoffandrey), [Blog](https://t.me/complete_ai)
+ Denis Dimitrov: [GitHub](https://github.com/denndimitrov), [Blog](https://t.me/dendi_math_ai)