---
title: Vta Ldm
emoji: 🧠
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 4.39.0
app_file: app.py
pinned: false
short_description: Video-to-Audio Generation with Hidden Alignment
---
# Video-to-Audio Generation with Hidden Alignment

Manjie Xu, Chenxing Li, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, Dong Yu

Tencent AI Lab

<a href='https://arxiv.org/abs/2407.07464'>
  <img src='https://img.shields.io/badge/Paper-Arxiv-green?style=plastic&logo=arXiv&logoColor=green' alt='Paper Arxiv'>
</a>
<a href='https://sites.google.com/view/vta-ldm/home'>
  <img src='https://img.shields.io/badge/Project-Page-blue?style=plastic&logo=Google%20chrome&logoColor=blue' alt='Project Page'>
</a>
Generating audio that is semantically and temporally aligned with a given video has become a focal point for researchers, particularly following the remarkable breakthrough in text-to-video generation. We aim to offer insights into the video-to-audio generation paradigm.
## Install

First, install the Python requirements. We recommend using conda:

```
conda create -n vta-ldm python=3.10
conda activate vta-ldm
pip install -r requirements.txt
```
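
Optionally, you can check that PyTorch can see a GPU before continuing. This is a minimal sanity check, assuming PyTorch is installed via `requirements.txt` and a CUDA-capable GPU is present:

```
# Print the PyTorch version and whether CUDA is available
# (assumes requirements.txt installs PyTorch and the machine has a CUDA GPU)
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```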
Then download the checkpoints from [Hugging Face](https://huggingface.co/ariesssxu/vta-ldm-clip4clip-v-large). We recommend using Git LFS:

```
mkdir ckpt && cd ckpt
git clone https://huggingface.co/ariesssxu/vta-ldm-clip4clip-v-large
# pull the large files if they were skipped during the clone:
cd vta-ldm-clip4clip-v-large && git lfs pull
```
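
Alternatively, here is a sketch of the same download using the `huggingface_hub` CLI; it assumes the `huggingface_hub` package is installed in your environment:

```
# Download the checkpoint snapshot without Git LFS
# (assumes the huggingface_hub package, which provides huggingface-cli, is installed)
huggingface-cli download ariesssxu/vta-ldm-clip4clip-v-large \
  --local-dir ckpt/vta-ldm-clip4clip-v-large
```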
## Model List

- ✅ VTA_LDM (the base model)
- 🕳️ VTA_LDM+IB/LB/CAVP/VIVIT
- 🕳️ VTA_LDM+text
- 🕳️ VTA_LDM+PE
- 🕳️ VTA_LDM+text+concat
- 🕳️ VTA_LDM+pretrain+text+concat
## Inference

Put the video clips into the `data` directory, then run the provided inference script to generate audio for the input videos:

```
bash inference_from_video.sh
```

You can customize the hyperparameters to fit your requirements. We also provide an ffmpeg-based script that merges the generated audio with the original video:

```
bash tools/merge_video_audio
```
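
Under the hood, such a merge amounts to muxing the new audio track with the original video stream. A minimal ffmpeg sketch of that step (the file names below are placeholders, not files shipped with this repo):

```
# Replace a video's audio track with a generated one (placeholder file names)
ffmpeg -i input_video.mp4 -i generated_audio.wav \
  -c:v copy -c:a aac -map 0:v:0 -map 1:a:0 -shortest output_with_audio.mp4
```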
## Training

TBD. Code coming soon.
## Acknowledgements

This work builds on these great repositories:

- [diffusers](https://github.com/huggingface/diffusers)
- [Tango](https://github.com/declare-lab/tango)
- [AudioLDM](https://github.com/haoheliu/AudioLDM)
## Cite us

```
@misc{xu2024vta-ldm,
      title={Video-to-Audio Generation with Hidden Alignment},
      author={Manjie Xu and Chenxing Li and Yong Ren and Rilin Chen and Yu Gu and Wei Liang and Dong Yu},
      year={2024},
      eprint={2407.07464},
      archivePrefix={arXiv},
      url={https://arxiv.org/abs/2407.07464},
}
```
## Disclaimer

This is not an official product of Tencent Ltd.