---
license: apache-2.0
datasets:
- Loie/VGGSound
base_model:
- riffusion/riffusion-model-v1
pipeline_tag: video-to-audio
tags:
- video2audio
---
# Kandinsky-4-v2a: A Video-to-Audio Pipeline
## Description
The Video-to-Audio pipeline consists of a visual encoder, a text encoder, a UNet diffusion model that generates a spectrogram, and the Griffin-Lim algorithm that converts the spectrogram into audio.

The visual and text encoders share the same multimodal visual-language decoder ([cogvlm2-video-llama3-chat](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat)).

Our UNet diffusion model is a fine-tune of the music generation model [riffusion](https://huggingface.co/riffusion/riffusion-model-v1). We modified the architecture to condition on video frames and to improve synchronization between video and audio. We also replaced the original text encoder with the decoder of [cogvlm2-video-llama3-chat](https://huggingface.co/THUDM/cogvlm2-video-llama3-chat).
![image/png](https://cdn-uploads.huggingface.co/production/uploads/5f91b1208a61a359f44e1851/mLXroYZt8X2brCDGPcPJZ.png)
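The final spectrogram-to-audio step can be illustrated with a minimal Griffin-Lim implementation in plain PyTorch. This is a toy sketch of the algorithm (iterative phase recovery from a magnitude spectrogram), not the pipeline's actual code:

```python
import torch

def griffin_lim(mag, n_fft=1024, hop=256, n_iter=32):
    """Recover a waveform from a magnitude spectrogram by
    iteratively re-estimating the phase (Griffin-Lim)."""
    window = torch.hann_window(n_fft)
    # Start from a random phase and keep the known magnitudes fixed.
    angles = torch.exp(2j * torch.pi * torch.rand_like(mag))
    spec = mag * angles
    for _ in range(n_iter):
        wav = torch.istft(spec, n_fft, hop, window=window)
        rebuilt = torch.stft(wav, n_fft, hop, window=window, return_complex=True)
        spec = mag * torch.exp(1j * torch.angle(rebuilt))
    return torch.istft(spec, n_fft, hop, window=window)

# Demo on a random signal: take its magnitude spectrogram, then invert it.
wav = torch.randn(16000)
mag = torch.stft(wav, 1024, 256, window=torch.hann_window(1024),
                 return_complex=True).abs()
out = griffin_lim(mag)
print(out.shape)
```

`torchaudio.transforms.GriffinLim` provides a production-quality version of the same idea.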
## Installation
```bash
git clone https://github.com/ai-forever/Kandinsky-4.git
cd Kandinsky-4
conda install -c conda-forge ffmpeg -y
pip install -r kandinsky4_video2audio/requirements.txt
pip install "git+https://github.com/facebookresearch/pytorchvideo.git"
```
## Inference
Inference code for Video-to-Audio:
```python
import torch
import torchvision

from kandinsky4_video2audio.video2audio_pipe import Video2AudioPipeline
from kandinsky4_video2audio.utils import load_video, create_video

device = 'cuda:0'

pipe = Video2AudioPipeline(
    "ai-forever/kandinsky-4-v2a",
    torch_dtype=torch.float16,
    device=device,
)

video_path = 'assets/inputs/1.mp4'
video, _, fps = torchvision.io.read_video(video_path)

prompt = "clean. clear. good quality."
negative_prompt = "hissing noise. drumming rhythm. saying. poor quality."

# Sample up to 96 frames covering at most 12 seconds of video.
video_input, video_complete, duration_sec = load_video(
    video, fps['video_fps'], num_frames=96, max_duration_sec=12
)

out = pipe(
    video_input,
    prompt,
    negative_prompt=negative_prompt,
    duration_sec=duration_sec,
)[0]

save_path = 'assets/outputs/1.mp4'
create_video(
    out,
    video_complete,
    display_video=True,
    save_path=save_path,
    device=device,
)
```
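The `num_frames=96, max_duration_sec=12` arguments above can be read as: sample a fixed number of frames evenly over at most 12 seconds of the clip. A hypothetical sketch of that arithmetic (`sampled_frame_indices` is an illustration, not part of the library):

```python
def sampled_frame_indices(total_frames, fps, num_frames=96, max_duration_sec=12):
    """Return evenly spaced frame indices covering at most
    max_duration_sec seconds of the clip."""
    duration_sec = min(total_frames / fps, max_duration_sec)
    usable = int(duration_sec * fps)          # frames within the time budget
    step = usable / num_frames                # fractional stride between samples
    return [min(int(i * step), usable - 1) for i in range(num_frames)]

# A 30 s clip at 25 fps (750 frames) is capped to its first 12 s (300 frames),
# from which 96 evenly spaced frames are drawn.
indices = sampled_frame_indices(total_frames=750, fps=25, num_frames=96)
print(len(indices), indices[0], indices[-1])
```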
## Authors
+ Zein Shaheen: [GitHub](https://github.com/zeinsh)
+ Arseniy Shakhmatov: [GitHub](https://github.com/cene555), [Blog](https://t.me/gradientdip)
+ Ivan Kirillov: [GitHub](https://github.com/funnylittleman)
+ Andrei Shutkin: [GitHub](https://github.com/maleficxp)
+ Denis Parkhomenko: [GitHub](https://github.com/nihao88)
+ Julia Agafonova: [GitHub](https://github.com/Julia132)
+ Andrey Kuznetsov: [GitHub](https://github.com/kuznetsoffandrey), [Blog](https://t.me/complete_ai)
+ Denis Dimitrov: [GitHub](https://github.com/denndimitrov), [Blog](https://t.me/dendi_math_ai)