# AudioLCM: Text-to-Audio Generation with Latent Consistency Models

#### Huadai Liu, Rongjie Huang, Yang Liu, Hengyuan Cao, Jialei Wang, Xize Cheng, Siqi Zheng, Zhou Zhao

PyTorch implementation of [AudioLCM]: efficient, high-quality text-to-audio generation with a latent consistency model.

<!-- [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2301.12661)
[![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-blue)](https://huggingface.co/spaces/AIGC-Audio/Make_An_Audio)
[![GitHub Stars](https://img.shields.io/github/stars/Text-to-Audio/Make-An-Audio?style=social)](https://github.com/Text-to-Audio/Make-An-Audio) -->

We provide our implementation and pretrained models as open source in this repository.

Visit our [demo page](https://audiolcm.github.io/) for audio samples.

<!-- [Text-to-Audio HuggingFace Space](https://huggingface.co/spaces/AIGC-Audio/Make_An_Audio) | [Audio Inpainting HuggingFace Space](https://huggingface.co/spaces/AIGC-Audio/Make_An_Audio_inpaint) -->

## News
<!-- - Jan, 2023: **[Make-An-Audio](https://arxiv.org/abs/2207.06389)** submitted to arxiv. -->
- June 2024: **[AudioLCM]** released on GitHub.

## Quick Start
We provide an example of how you can quickly generate high-fidelity samples using AudioLCM.

To try it on your own dataset, clone this repo onto a machine with an NVIDIA GPU and CUDA/cuDNN, then follow the instructions below.
### Supported Datasets and Pretrained Models

Download the AudioLCM weights from [Google Drive](https://drive.google.com/drive/folders/1zZTI3-nHrUIywKFqwxlFO6PjB66JA8jI?usp=drive_link).
Download the bert-base-uncased weights from [Hugging Face](https://huggingface.co/google-bert/bert-base-uncased), the t5-v1_1-large weights from [Hugging Face](https://huggingface.co/google/t5-v1_1-large), and the CLAP weights from [Hugging Face](https://huggingface.co/microsoft/msclap/blob/main/CLAP_weights_2022.pth).

```
Download:
audiolcm.ckpt and put it into ./ckpts
BigVGAN vocoder and put it into ./vocoder/logs/bigvnat16k93.5w
t5-v1_1-large and put it into ./ldm/modules/encoders/CLAP
bert-base-uncased and put it into ./ldm/modules/encoders/CLAP
CLAP_weights_2022.pth and put it into ./wav_evaluation/useful_ckpts/CLAP
```
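The Hugging Face checkpoints can also be fetched programmatically. The sketch below is illustrative only and assumes the `huggingface_hub` package is installed; the exact target subdirectories (e.g., whether each model lives in its own subfolder under `./ldm/modules/encoders/CLAP`) are assumptions, so adjust them to match the paths above. The audiolcm.ckpt and BigVGAN vocoder come from Google Drive and must be placed manually.

```python
# Illustrative download sketch (assumes: pip install huggingface_hub).
# The target subdirectories are assumptions; match them to your setup.
from huggingface_hub import hf_hub_download, snapshot_download

snapshot_download(repo_id="google-bert/bert-base-uncased",
                  local_dir="./ldm/modules/encoders/CLAP/bert-base-uncased")
snapshot_download(repo_id="google/t5-v1_1-large",
                  local_dir="./ldm/modules/encoders/CLAP/t5-v1_1-large")
hf_hub_download(repo_id="microsoft/msclap",
                filename="CLAP_weights_2022.pth",
                local_dir="./wav_evaluation/useful_ckpts/CLAP")
```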
<!-- The directory structure should be:
```
useful_ckpts/
├── bigvgan
│   ├── args.yml
│   └── best_netG.pt
├── CLAP
│   ├── config.yml
│   └── CLAP_weights_2022.pth
└── maa1_full.ckpt
``` -->

### Dependencies
See requirements in `requirement.txt`; they can be installed with `pip install -r requirement.txt`.
## Inference with pretrained model
```bash
python scripts/txt2audio_for_lcm.py --ddim_steps 2 -b configs/audiolcm.yaml --sample_rate 16000 --vocoder-ckpt vocoder/logs/bigvnat16k93.5w --outdir results --test-dataset audiocaps -r ckpt/audiolcm.ckpt
```
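A note on the main flags, as we read the command above (not an exhaustive reference): `--ddim_steps 2` sets the number of sampling steps, which the latent consistency model keeps very small; `-r` points to the AudioLCM checkpoint; `--vocoder-ckpt` points to the BigVGAN vocoder; and generated audio is written under `--outdir`.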
# Train
## Dataset preparation
We cannot provide dataset download links due to copyright issues, but we provide the preprocessing code to generate mel-spectrograms.

Before training, collect the dataset information into a .tsv file with the following columns: name (an id for each audio clip), dataset (the dataset the clip belongs to), audio_path (the path of the .wav file), caption (the caption of the audio), and mel_path (the path of the processed mel-spectrogram file for each clip). We provide a .tsv file of the AudioCaps test set as a sample: ./audiocaps_test_16000_struct.tsv. A minimal sketch of constructing such a file follows.
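For concreteness, here is a minimal, hypothetical sketch of writing such a file with Python's standard `csv` module. The row values are made up for illustration; the mel_path entries can be filled in after the preprocessing step below.

```python
import csv

# Hypothetical example rows; replace with entries from your own dataset.
rows = [{
    "name": "clip_000001",                      # unique id for the audio clip
    "dataset": "audiocaps",                     # which dataset the clip belongs to
    "audio_path": "data/audiocaps/000001.wav",  # path to the .wav file
    "caption": "A dog barks while birds chirp.",
    "mel_path": "processed/mel/000001.npy",     # filled in after mel preprocessing
}]

with open("tmp.tsv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["name", "dataset", "audio_path", "caption", "mel_path"],
        delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```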
### Generate the mel-spectrogram file of each audio clip
Assume you already have a .tsv file linking each caption to its audio_path, i.e., the file has "name", "audio_path", "dataset", and "caption" columns.
To compute the mel-spectrograms, run the following command, which saves them in ./processed:
```bash
python ldm/data/preprocess/mel_spec.py --tsv_path tmp.tsv
```
Then add the duration of each clip to the .tsv file:
```bash
python ldm/data/preprocess/add_duration.py
```
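Conceptually, the duration step just measures each waveform's length and appends it as a column. A rough sketch of the idea (assuming `pandas` and `soundfile`; the repo script may differ in details such as column names):

```python
import pandas as pd
import soundfile as sf

# Read the tsv, measure each clip's length, and append a duration column.
df = pd.read_csv("tmp.tsv", sep="\t")
df["duration"] = [sf.info(p).duration for p in df["audio_path"]]  # seconds
df.to_csv("tmp.tsv", sep="\t", index=False)
```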
## Train the variational autoencoder
Assume we have processed several datasets and saved the .tsv files under data/*.tsv. In the config file, set **data.params.spec_dir_path** to **data** (the directory containing the .tsv files). Then train the VAE with the following command. If your machine has fewer than 8 GPUs, adjust the device list accordingly, e.g. --gpus 0,1,...,N.
```bash
python main.py --base configs/train/vae.yaml -t --gpus 0,1,2,3,4,5,6,7
```
The training results will be saved in ./logs/.
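If you prefer to script the config edit above rather than editing the YAML by hand, a minimal sketch with `omegaconf` (assumed to be available, as in most latent-diffusion-style repos) could look like this; the config path and key come from the instructions above.

```python
from omegaconf import OmegaConf

# Point data.params.spec_dir_path at the directory holding the .tsv files.
cfg = OmegaConf.load("configs/train/vae.yaml")
cfg.data.params.spec_dir_path = "data"
OmegaConf.save(cfg, "configs/train/vae.yaml")
```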
## Train the latent diffusion model
After training the VAE, set model.params.first_stage_config.params.ckpt_path in the config file to your trained VAE checkpoint path.
Run the following command to train the diffusion model:
```bash
python main.py --base configs/autoencoder1d.yaml -t --gpus 0,1,2,3,4,5,6,7
```
The training results will be saved in ./logs/.
# Evaluation
## Generate AudioCaps samples
```bash
python scripts/txt2audio_for_lcm.py --ddim_steps 2 -b configs/audiolcm.yaml --sample_rate 16000 --vocoder-ckpt vocoder/logs/bigvnat16k93.5w --outdir results --test-dataset audiocaps -r ckpt/audiolcm.ckpt
```
## Calculate FD, FAD, IS, and KL
These are the Fréchet distance (FD), Fréchet audio distance (FAD), inception score (IS), and KL divergence. Install [audioldm_eval](https://github.com/haoheliu/audioldm_eval) with:
```bash
git clone git@github.com:haoheliu/audioldm_eval.git
```
Then test with:
```bash
python scripts/test.py --pred_wavsdir {the directory that saves the audios you generated} --gt_wavsdir {the directory that saves the AudioCaps test-set waveforms}
```
## Calculate the CLAP score
```bash
python wav_evaluation/cal_clap_score.py --tsv_path {the directory that saves the audios you generated}/result.tsv
```
## Acknowledgements
This implementation uses parts of the code from the following GitHub repos:
[Make-An-Audio](https://github.com/Text-to-Audio/Make-An-Audio),
[CLAP](https://github.com/LAION-AI/CLAP),
[Stable Diffusion](https://github.com/CompVis/stable-diffusion),
as described in our code.
<!-- ## Citations ##
If you find this code useful in your research, please consider citing:
```bibtex
@article{huang2023make,
  title={Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models},
  author={Huang, Rongjie and Huang, Jiawei and Yang, Dongchao and Ren, Yi and Liu, Luping and Li, Mingze and Ye, Zhenhui and Liu, Jinglin and Yin, Xiang and Zhao, Zhou},
  journal={arXiv preprint arXiv:2301.12661},
  year={2023}
}
``` -->

# Disclaimer
Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without his or her consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.