update readme
- README.md +27 -2
- scripts/train_vae.py +4 -3
README.md CHANGED
@@ -15,7 +15,10 @@ license: gpl-3.0
 
 ---
 
-**UPDATES**:
+**UPDATES**:
+
+15/10/2022
+Added latent audio diffusion (see below).
 
 4/10/2022
 It is now possible to mask parts of the input audio during generation which means you can stitch several samples together (think "out-painting").
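A rough illustration of the masking ("out-painting") update mentioned in the hunk above, from the Python side. This is only a sketch: the `AudioDiffusion` helper class is part of this repo, but the `generate_spectrogram_and_audio_from_audio` method and the `mask_start_secs` parameter shown here are assumptions about its interface, not confirmed by this commit.

```python
# Hypothetical sketch of stitching two samples by masking part of the input audio.
# Method and parameter names below are assumptions, not confirmed API.
from audiodiffusion import AudioDiffusion

audio_diffusion = AudioDiffusion(model_id="teticio/audio-diffusion-256")

# Generate a first sample unconditionally.
image, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio()

# Generate a continuation: keep the first couple of seconds of the previous
# sample fixed (masked) and let the model fill in the rest.
image2, (sample_rate, audio2) = audio_diffusion.generate_spectrogram_and_audio_from_audio(
    raw_audio=audio, mask_start_secs=2)
```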
@@ -49,6 +52,7 @@ You can play around with some pretrained models on [Google Colab](https://colab.
 ```bash
 pip install .
 ```
+
 #### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results.
 
 ```bash
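For intuition about the `hop_length` suggestion in the hunk above, a small back-of-the-envelope calculation. The 22050 Hz sample rate is an assumption (librosa's default), not something stated in this commit.

```python
# Rough arithmetic behind the 64x64 / hop_length 1024 suggestion.
# Assumes a 22050 Hz sample rate; the scripts may use a different value.
resolution = 64        # width of the Mel spectrogram in time frames
hop_length = 1024      # samples advanced per frame
sample_rate = 22050    # assumed sample rate in Hz

seconds_per_image = resolution * hop_length / sample_rate
print(f"{seconds_per_image:.2f} s of audio per 64x64 spectrogram")  # ~2.97 s
```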
@@ -58,8 +62,8 @@ python scripts/audio_to_images.py \
   --input_dir path-to-audio-files \
   --output_dir path-to-output-data
 ```
-#### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`).
 
+#### Generate dataset of 256x256 Mel spectrograms and push to hub (you will need to be authenticated with `huggingface-cli login`).
 ```bash
 python scripts/audio_to_images.py \
   --resolution 256 \
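Once the dataset has been pushed, it can be pulled back with the `datasets` library. A minimal sketch, assuming the default "train" split and an "image" column holding the Mel spectrogram as a PIL image; the actual column names may differ.

```python
# Sketch: consume the dataset pushed to the Hub by audio_to_images.py.
from datasets import load_dataset

ds = load_dataset("teticio/audio-diffusion-256", split="train")
example = ds[0]
example["image"].save("mel_spectrogram.png")  # assumed column name
```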
@@ -67,6 +71,7 @@ python scripts/audio_to_images.py \
   --output_dir data/audio-diffusion-256 \
   --push_to_hub teticio/audio-diffusion-256
 ```
+
 ## Train model
 #### Run training on local machine.
 ```bash
@@ -83,6 +88,7 @@ accelerate launch --config_file config/accelerate_local.yaml \
   --lr_warmup_steps 500 \
   --mixed_precision no
 ```
+
 #### Run training on local machine with `batch_size` of 2 and `gradient_accumulation_steps` 8 to compensate, so that 256x256 resolution model fits on commercial grade GPU and push to hub.
 ```bash
 accelerate launch --config_file config/accelerate_local.yaml \
@@ -101,6 +107,7 @@ accelerate launch --config_file config/accelerate_local.yaml \
   --hub_model_id audio-diffusion-256 \
   --hub_token $(cat $HOME/.huggingface/token)
 ```
+
 #### Run training on SageMaker.
 ```bash
 accelerate launch --config_file config/accelerate_sagemaker.yaml \
@@ -115,3 +122,21 @@ accelerate launch --config_file config/accelerate_sagemaker.yaml \
   --lr_warmup_steps 500 \
   --mixed_precision no
 ```
+## Latent Audio Diffusion
+Rather than denoising images directly, it is interesting to work in the "latent space" after first encoding images using an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of a much lower dimension, so it is much faster to train denoising diffusion models and run inference with them. Secondly, as the latent space is really an array (tensor) of Gaussian variables with a particular mean, decoded images are invariant to Gaussian noise. And thirdly, similar images tend to be clustered together and interpolating between two images in latent space can produce meaningful combinations.
+
+At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality, rather like its cousin `transformers` in the early days of development. In order to train a VAE (Variational Autoencoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format.
+
+#### Train an autoencoder.
+```bash
+python scripts/train_vae.py \
+  --dataset_name teticio/audio-diffusion-256 \
+  --batch_size 2 \
+  --gradient_accumulation_steps 12
+```
+
+#### Train latent diffusion model.
+```bash
+accelerate launch ...
+  --vae models/autoencoder-kl
+```
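To make the latent-space remarks in the new section above more concrete, here is a minimal sketch of encoding two spectrogram images with the converted VAE and interpolating between them. It assumes the checkpoint in `models/autoencoder-kl` is a `diffusers` `AutoencoderKL` (as the `--vae` flag suggests) and that inputs are 256x256 RGB images; the file names are placeholders. Treat it as an illustration rather than the project's actual inference code.

```python
# Sketch: interpolate between two Mel spectrogram images in VAE latent space.
# Assumes a diffusers-format AutoencoderKL checkpoint at models/autoencoder-kl.
import torch
from PIL import Image
from torchvision import transforms
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("models/autoencoder-kl")
to_tensor = transforms.Compose([
    transforms.ToTensor(),               # [0, 1]
    transforms.Normalize([0.5], [0.5]),  # [-1, 1], the usual VAE input range
])

def encode(path):
    image = Image.open(path).convert("RGB")
    x = to_tensor(image).unsqueeze(0)           # (1, 3, 256, 256)
    return vae.encode(x).latent_dist.mean       # mean of the Gaussian posterior

with torch.no_grad():
    z0, z1 = encode("mel_0.png"), encode("mel_1.png")   # placeholder file names
    z = torch.lerp(z0, z1, 0.5)                 # half-way between the two samples
    decoded = vae.decode(z).sample              # back to image space, in [-1, 1]

out = ((decoded[0] / 2 + 0.5).clamp(0, 1) * 255).to(torch.uint8)
Image.fromarray(out.permute(1, 2, 0).numpy()).save("mel_interpolated.png")
```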
scripts/train_vae.py CHANGED
@@ -4,7 +4,6 @@
 
 # TODO
 # grayscale
-# update README
 
 import os
 import argparse
@@ -107,7 +106,7 @@ class ImageLogger(Callback):
 
 class HFModelCheckpoint(ModelCheckpoint):
 
-    def __init__(self, ldm_config, hf_checkpoint
+    def __init__(self, ldm_config, hf_checkpoint, *args, **kwargs):
         super().__init__(*args, **kwargs)
         self.ldm_config = ldm_config
         self.hf_checkpoint = hf_checkpoint
@@ -131,7 +130,9 @@ if __name__ == "__main__":
     parser.add_argument("--ldm_checkpoint_dir",
                         type=str,
                         default="models/ldm-autoencoder-kl")
-    parser.add_argument("--hf_checkpoint_dir",
+    parser.add_argument("--hf_checkpoint_dir",
+                        type=str,
+                        default="models/autoencoder-kl")
    parser.add_argument("-r",
                        "--resume_from_checkpoint",
                        type=str,