Robert Smith committed • Commit 43ebb3b • 1 Parent: f34a81b
Commit message: tidy

Files changed:
- .gitignore +2 -0
- README.md +19 -12
- audiodiffusion/__init__.py +33 -24
- notebooks/test_model.ipynb +164 -17
.gitignore
CHANGED
@@ -9,3 +9,5 @@ audiodiffusion.egg-info
 lightning_logs
 taming
 checkpoints
+Pipfile
+Pipfile.lock
README.md
CHANGED
@@ -11,20 +11,19 @@ license: gpl-3.0
 ---
 # audio-diffusion [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/gradio_app.ipynb)
 
-### Apply
+### Apply diffusion models to synthesize music instead of images using the new Hugging Face [diffusers](https://github.com/huggingface/diffusers) package.
 
 ---
 
 **UPDATES**:
 
+**22/10/2022**. Added DDIM encoder and ability to interpolate between audios in latent "noise" space. Mel spectrograms no longer have to be square (thanks to Tristan for this one), so you can set the vertical (frequency) and horizontal (time) resolutions independently.
+
-Added latent audio diffusion (see below). Also added the possibility to train a model to use DDIM ([Denoising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf)) by setting `--scheduler ddim`. These have the benefit that samples can be generated with much fewer steps (~50) than used in training.
+**15/10/2022**. Added latent audio diffusion (see below). Also added the possibility to train a DDIM ([Denoising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf)). These have the benefit that samples can be generated with much fewer steps (~50) than used in training.
 
-It is now possible to mask parts of the input audio during generation which means you can stitch several samples together (think "out-painting").
+**4/10/2022**. It is now possible to mask parts of the input audio during generation which means you can stitch several samples together (think "out-painting").
+
+**27/9/2022**. You can now generate an audio based on a previous one. You can use this to generate variations of the same audio or even to "remix" a track (via a sort of "style transfer"). You can find examples of how to do this in the [`test_model.ipynb`](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) notebook.
 
 ---
 
@@ -32,11 +31,13 @@ You can now generate an audio based on a previous one. You can use this to gener
 
 ---
 
+## DDPM ([De-noising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239))
+
 Audio can be represented as images by transforming to a [mel spectrogram](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum), such as the one shown above. The class `Mel` in `mel.py` can convert a slice of audio into a mel spectrogram of `x_res` x `y_res` and vice versa. The higher the resolution, the less audio information will be lost. You can see how this works in the [`test_mel.ipynb`](https://github.com/teticio/audio-diffusion/blob/main/notebooks/test_mel.ipynb) notebook.
 
-A DDPM
+A DDPM is trained on a set of mel spectrograms that have been generated from a directory of audio files. It is then used to synthesize similar mel spectrograms, which are then converted back into audio.
 
-You can play around with some
+You can play around with some pre-trained models on [Google Colab](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) or [Hugging Face spaces](https://huggingface.co/spaces/teticio/audio-diffusion). Check out some automatically generated loops [here](https://soundcloud.com/teticio2/sets/audio-diffusion-loops).
 
 
 | Model | Dataset | Description |
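The hunk above describes how the `Mel` class turns slices of audio into mel spectrogram images of `x_res` x `y_res` and back again. A minimal usage sketch, pieced together from calls that appear elsewhere in this commit (`load_audio`, `get_audio_slice`, `image_to_audio`, `get_sample_rate`); the audio-to-image method name `audio_slice_to_image` is an assumption, as it is not shown in this diff:

```python
from audiodiffusion.mel import Mel

# 256x256 spectrograms, the resolution used in notebooks/test_model.ipynb
mel = Mel(x_res=256, y_res=256)

mel.load_audio("my_track.wav")        # or raw_audio=<np.ndarray>
audio_slice = mel.get_audio_slice(0)  # first slice of the track

# audio -> spectrogram image (method name assumed, not confirmed by this diff)
image = mel.audio_slice_to_image(0)

# spectrogram image -> audio round trip
reconstructed = mel.image_to_audio(image)
sample_rate = mel.get_sample_rate()
```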
@@ -54,7 +55,6 @@ pip install .
 ```
 
 #### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results.
-
 ```bash
 python scripts/audio_to_images.py \
 --resolution 64,64 \
@@ -119,10 +119,17 @@ accelerate launch --config_file config/accelerate_sagemaker.yaml \
 --lr_warmup_steps 500 \
 --mixed_precision no
 ```
+## DDIM ([De-noising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf))
+#### A DDIM can be trained by adding the parameter
+```bash
+--scheduler ddim
+```
+Inference can then be run with far fewer steps than the number used for training (e.g., ~50), allowing for much faster generation. Without retraining, the parameter `eta` can be used to replicate a DDPM if it is set to 1 or a DDIM if it is set to 0, with all values in between being valid. When `eta` is 0 (the default value), the de-noising procedure is deterministic, which means that it can be run in reverse as a kind of encoder that recovers the original noise used in generation. A function `encode` has been added to `AudioDiffusionPipeline` for this purpose. It is then possible to interpolate between audios in the latent "noise" space using the function `slerp` (Spherical Linear intERPolation).
+
 ## Latent Audio Diffusion
-Rather than
+Rather than de-noising images directly, it is interesting to work in the "latent space" after first encoding images using an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of a much lower dimension, so it is much faster to train de-noising diffusion models and run inference with them. Secondly, similar images tend to be clustered together and interpolating between two images in latent space can produce meaningful combinations.
 
-At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality
+At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality (rather like its cousin `transformers` in the early days of development). In order to train a VAE (Variational AutoEncoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format. Note that it uses a perceptual loss function for images; it would be nice to try a perceptual *audio* loss function.
 
 #### Install dependencies to train with Stable Diffusion
 ```
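A sketch of the DDIM workflow the hunk above describes (fewer inference steps, `eta`, the new `encode` method and `slerp` interpolation), assembled from the calls this commit adds to `notebooks/test_model.ipynb`; in a real run you would encode a second, different spectrogram before interpolating:

```python
import torch
from audiodiffusion import AudioDiffusion

audio_diffusion = AudioDiffusion(model_id="teticio/audio-diffusion-ddim-256")
generator = torch.Generator()

# DDIM: ~50 de-noising steps instead of the ~1000 used in training
image, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio(
    steps=50, generator=generator, eta=0)

# eta=0 is deterministic, so the process can be reversed to recover the noise...
noise = audio_diffusion.pipe.encode([image], steps=50)
# ...in practice encode a second spectrogram here; reusing `image` keeps the sketch runnable
noise2 = audio_diffusion.pipe.encode([image], steps=50)

# ...and two noises can be mixed with spherical linear interpolation
_, (sample_rate, audio_mix) = audio_diffusion.generate_spectrogram_and_audio(
    noise=audio_diffusion.pipe.slerp(noise, noise2, 0.5),
    steps=50,
    generator=generator)
```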
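The Latent Audio Diffusion hunk above is about de-noising in the latent space of a VAE instead of on spectrogram pixels. A rough sketch of that idea using the `AutoencoderKL` class this commit starts importing from `diffusers`; the checkpoint name and the `encode`/`decode` return-value handling are assumptions about the `diffusers` API rather than anything defined in this repo:

```python
import torch
from diffusers import AutoencoderKL

# Any KL autoencoder in diffusers format would do; this checkpoint name is illustrative
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

spectrograms = torch.randn(1, 3, 256, 256)  # stand-in for a batch of (3-channel) mel spectrograms

with torch.no_grad():
    latents = vae.encode(spectrograms).latent_dist.sample()  # much lower-dimensional than the input
    # ...a diffusion model would be trained on / sampled in `latents` space here...
    decoded = vae.decode(latents).sample                     # back to spectrogram space
```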
audiodiffusion/__init__.py
CHANGED
@@ -6,12 +6,12 @@ import numpy as np
 from PIL import Image
 from tqdm.auto import tqdm
 from librosa.beat import beat_track
-from diffusers import (DiffusionPipeline,
-
+from diffusers import (DiffusionPipeline, UNet2DConditionModel, DDIMScheduler,
+                       DDPMScheduler, AutoencoderKL)
 
 from .mel import Mel
 
-VERSION = "1.2.
+VERSION = "1.2.3"
 
 
 class AudioDiffusion:
@@ -24,7 +24,7 @@ class AudioDiffusion:
                  top_db: int = 80,
                  cuda: bool = torch.cuda.is_available(),
                  progress_bar: Iterable = tqdm):
-        """Class for generating audio using
+        """Class for generating audio using De-noising Diffusion Probabilistic Models.
 
         Args:
             model_id (String): name of model (local directory or Hugging Face Hub)
@@ -60,18 +60,21 @@ class AudioDiffusion:
                        top_db=top_db)
 
     def generate_spectrogram_and_audio(
-
-
-
-
-
+            self,
+            steps: int = None,
+            generator: torch.Generator = None,
+            step_generator: torch.Generator = None,
+            eta: float = 0,
+            noise: torch.Tensor = None
+    ) -> Tuple[Image.Image, Tuple[int, np.ndarray]]:
         """Generate random mel spectrogram and convert to audio.
 
         Args:
-            steps (int): number of de-noising steps
+            steps (int): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
             generator (torch.Generator): random number generator or None
-            step_generator (torch.Generator): random number generator used to
+            step_generator (torch.Generator): random number generator used to de-noise or None
             eta (float): parameter between 0 and 1 used with DDIM scheduler
+            noise (torch.Tensor): noisy image or None
 
         Returns:
             PIL Image: mel spectrogram
@@ -83,7 +86,8 @@ class AudioDiffusion:
            steps=steps,
            generator=generator,
            step_generator=step_generator,
-           eta=eta
+           eta=eta,
+           noise=noise)
        return images[0], (sample_rate, audios[0])
 
    def generate_spectrogram_and_audio_from_audio(
@@ -92,7 +96,7 @@ class AudioDiffusion:
            raw_audio: np.ndarray = None,
            slice: int = 0,
            start_step: int = 0,
-           steps: int =
+           steps: int = None,
            generator: torch.Generator = None,
            mask_start_secs: float = 0,
            mask_end_secs: float = 0,
@@ -107,11 +111,11 @@ class AudioDiffusion:
            raw_audio (np.ndarray): audio as numpy array
            slice (int): slice number of audio to convert
            start_step (int): step to start from
-           steps (int): number of de-noising steps
+           steps (int): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
            generator (torch.Generator): random number generator or None
            mask_start_secs (float): number of seconds of audio to mask (not generate) at start
            mask_end_secs (float): number of seconds of audio to mask (not generate) at end
-           step_generator (torch.Generator): random number generator used to
+           step_generator (torch.Generator): random number generator used to de-noise or None
            eta (float): parameter between 0 and 1 used with DDIM scheduler
            noise (torch.Tensor): noisy image or None
 
@@ -173,7 +177,7 @@ class AudioDiffusionPipeline(DiffusionPipeline):
            raw_audio: np.ndarray = None,
            slice: int = 0,
            start_step: int = 0,
-           steps: int =
+           steps: int = None,
            generator: torch.Generator = None,
            mask_start_secs: float = 0,
            mask_end_secs: float = 0,
@@ -190,23 +194,24 @@ class AudioDiffusionPipeline(DiffusionPipeline):
            raw_audio (np.ndarray): audio as numpy array
            slice (int): slice number of audio to convert
            start_step (int): step to start from
-           steps (int): number of de-noising steps
+           steps (int): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
            generator (torch.Generator): random number generator or None
            mask_start_secs (float): number of seconds of audio to mask (not generate) at start
            mask_end_secs (float): number of seconds of audio to mask (not generate) at end
-           step_generator (torch.Generator): random number generator used to
+           step_generator (torch.Generator): random number generator used to de-noise or None
            eta (float): parameter between 0 and 1 used with DDIM scheduler
-           noise (torch.Tensor):
+           noise (torch.Tensor): noise tensor of shape (batch_size, 1, height, width) or None
 
        Returns:
            List[PIL Image]: mel spectrograms
            (float, List[np.ndarray]): sample rate and raw audios
        """
 
+       steps = steps or 50 if isinstance(self.scheduler,
+                                         DDIMScheduler) else 1000
        self.scheduler.set_timesteps(steps)
        step_generator = step_generator or generator
-
-       # For backwards compatiibility
+       # For backwards compatibility
        if type(self.unet.sample_size) == int:
            self.unet.sample_size = (self.unet.sample_size,
                                     self.unet.sample_size)
@@ -215,6 +220,7 @@ class AudioDiffusionPipeline(DiffusionPipeline):
            (batch_size, self.unet.in_channels) + self.unet.sample_size,
            generator=generator)
        images = noise
+       mask = None
 
        if audio_file is not None or raw_audio is not None:
            mel.load_audio(audio_file, raw_audio)
@@ -289,11 +295,12 @@ class AudioDiffusionPipeline(DiffusionPipeline):
        return images, (mel.get_sample_rate(), audios)
 
    @torch.no_grad()
-   def encode(self, images: List[Image.Image]) -> np.ndarray:
+   def encode(self, images: List[Image.Image], steps: int = 50) -> np.ndarray:
        """Reverse step process: recover noisy image from generated image.
 
        Args:
            images (List[PIL Image]): list of images to encode
+           steps (int): number of encoding steps to perform (defaults to 50)
 
        Returns:
            np.ndarray: noise tensor of shape (batch_size, 1, height, width)
@@ -301,6 +308,7 @@ class AudioDiffusionPipeline(DiffusionPipeline):
 
        # Only works with DDIM as this method is deterministic
        assert isinstance(self.scheduler, DDIMScheduler)
+       self.scheduler.set_timesteps(steps)
        sample = np.array([
            np.frombuffer(image.tobytes(), dtype="uint8").reshape(
                (1, image.height, image.width)) for image in images
@@ -308,7 +316,8 @@ class AudioDiffusionPipeline(DiffusionPipeline):
        sample = ((sample / 255) * 2 - 1)
        sample = torch.Tensor(sample).to(self.device)
 
-       for t in torch.flip(self.scheduler.timesteps,
+       for t in self.progress_bar(torch.flip(self.scheduler.timesteps,
+                                             (0, ))):
            prev_timestep = (t - self.scheduler.num_train_timesteps //
                             self.scheduler.num_inference_steps)
            alpha_prod_t = self.scheduler.alphas_cumprod[t]
@@ -334,7 +343,7 @@ class AudioDiffusionPipeline(DiffusionPipeline):
        Args:
            x0 (torch.Tensor): first tensor to interpolate between
            x1 (torch.Tensor): second tensor to interpolate between
-           alpha (float): interpolation
+           alpha (float): interpolation between 0 and 1
 
        Returns:
            torch.Tensor: interpolated tensor
from PIL import Image
|
7 |
from tqdm.auto import tqdm
|
8 |
from librosa.beat import beat_track
|
9 |
+
from diffusers import (DiffusionPipeline, UNet2DConditionModel, DDIMScheduler,
|
10 |
+
DDPMScheduler, AutoencoderKL)
|
11 |
|
12 |
from .mel import Mel
|
13 |
|
14 |
+
VERSION = "1.2.3"
|
15 |
|
16 |
|
17 |
class AudioDiffusion:
|
|
|
24 |
top_db: int = 80,
|
25 |
cuda: bool = torch.cuda.is_available(),
|
26 |
progress_bar: Iterable = tqdm):
|
27 |
+
"""Class for generating audio using De-noising Diffusion Probabilistic Models.
|
28 |
|
29 |
Args:
|
30 |
model_id (String): name of model (local directory or Hugging Face Hub)
|
|
|
60 |
top_db=top_db)
|
61 |
|
62 |
def generate_spectrogram_and_audio(
|
63 |
+
self,
|
64 |
+
steps: int = None,
|
65 |
+
generator: torch.Generator = None,
|
66 |
+
step_generator: torch.Generator = None,
|
67 |
+
eta: float = 0,
|
68 |
+
noise: torch.Tensor = None
|
69 |
+
) -> Tuple[Image.Image, Tuple[int, np.ndarray]]:
|
70 |
"""Generate random mel spectrogram and convert to audio.
|
71 |
|
72 |
Args:
|
73 |
+
steps (int): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
|
74 |
generator (torch.Generator): random number generator or None
|
75 |
+
step_generator (torch.Generator): random number generator used to de-noise or None
|
76 |
eta (float): parameter between 0 and 1 used with DDIM scheduler
|
77 |
+
noise (torch.Tensor): noisy image or None
|
78 |
|
79 |
Returns:
|
80 |
PIL Image: mel spectrogram
|
|
|
86 |
steps=steps,
|
87 |
generator=generator,
|
88 |
step_generator=step_generator,
|
89 |
+
eta=eta,
|
90 |
+
noise=noise)
|
91 |
return images[0], (sample_rate, audios[0])
|
92 |
|
93 |
def generate_spectrogram_and_audio_from_audio(
|
|
|
96 |
raw_audio: np.ndarray = None,
|
97 |
slice: int = 0,
|
98 |
start_step: int = 0,
|
99 |
+
steps: int = None,
|
100 |
generator: torch.Generator = None,
|
101 |
mask_start_secs: float = 0,
|
102 |
mask_end_secs: float = 0,
|
|
|
111 |
raw_audio (np.ndarray): audio as numpy array
|
112 |
slice (int): slice number of audio to convert
|
113 |
start_step (int): step to start from
|
114 |
+
steps (int): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
|
115 |
generator (torch.Generator): random number generator or None
|
116 |
mask_start_secs (float): number of seconds of audio to mask (not generate) at start
|
117 |
mask_end_secs (float): number of seconds of audio to mask (not generate) at end
|
118 |
+
step_generator (torch.Generator): random number generator used to de-noise or None
|
119 |
eta (float): parameter between 0 and 1 used with DDIM scheduler
|
120 |
noise (torch.Tensor): noisy image or None
|
121 |
|
|
|
177 |
raw_audio: np.ndarray = None,
|
178 |
slice: int = 0,
|
179 |
start_step: int = 0,
|
180 |
+
steps: int = None,
|
181 |
generator: torch.Generator = None,
|
182 |
mask_start_secs: float = 0,
|
183 |
mask_end_secs: float = 0,
|
|
|
194 |
raw_audio (np.ndarray): audio as numpy array
|
195 |
slice (int): slice number of audio to convert
|
196 |
start_step (int): step to start from
|
197 |
+
steps (int): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
|
198 |
generator (torch.Generator): random number generator or None
|
199 |
mask_start_secs (float): number of seconds of audio to mask (not generate) at start
|
200 |
mask_end_secs (float): number of seconds of audio to mask (not generate) at end
|
201 |
+
step_generator (torch.Generator): random number generator used to de-noise or None
|
202 |
eta (float): parameter between 0 and 1 used with DDIM scheduler
|
203 |
+
noise (torch.Tensor): noise tensor of shape (batch_size, 1, height, width) or None
|
204 |
|
205 |
Returns:
|
206 |
List[PIL Image]: mel spectrograms
|
207 |
(float, List[np.ndarray]): sample rate and raw audios
|
208 |
"""
|
209 |
|
210 |
+
steps = steps or 50 if isinstance(self.scheduler,
|
211 |
+
DDIMScheduler) else 1000
|
212 |
self.scheduler.set_timesteps(steps)
|
213 |
step_generator = step_generator or generator
|
214 |
+
# For backwards compatibility
|
|
|
215 |
if type(self.unet.sample_size) == int:
|
216 |
self.unet.sample_size = (self.unet.sample_size,
|
217 |
self.unet.sample_size)
|
|
|
220 |
(batch_size, self.unet.in_channels) + self.unet.sample_size,
|
221 |
generator=generator)
|
222 |
images = noise
|
223 |
+
mask = None
|
224 |
|
225 |
if audio_file is not None or raw_audio is not None:
|
226 |
mel.load_audio(audio_file, raw_audio)
|
|
|
295 |
return images, (mel.get_sample_rate(), audios)
|
296 |
|
297 |
@torch.no_grad()
|
298 |
+
def encode(self, images: List[Image.Image], steps: int = 50) -> np.ndarray:
|
299 |
"""Reverse step process: recover noisy image from generated image.
|
300 |
|
301 |
Args:
|
302 |
images (List[PIL Image]): list of images to encode
|
303 |
+
steps (int): number of encoding steps to perform (defaults to 50)
|
304 |
|
305 |
Returns:
|
306 |
np.ndarray: noise tensor of shape (batch_size, 1, height, width)
|
|
|
308 |
|
309 |
# Only works with DDIM as this method is deterministic
|
310 |
assert isinstance(self.scheduler, DDIMScheduler)
|
311 |
+
self.scheduler.set_timesteps(steps)
|
312 |
sample = np.array([
|
313 |
np.frombuffer(image.tobytes(), dtype="uint8").reshape(
|
314 |
(1, image.height, image.width)) for image in images
|
|
|
316 |
sample = ((sample / 255) * 2 - 1)
|
317 |
sample = torch.Tensor(sample).to(self.device)
|
318 |
|
319 |
+
for t in self.progress_bar(torch.flip(self.scheduler.timesteps,
|
320 |
+
(0, ))):
|
321 |
prev_timestep = (t - self.scheduler.num_train_timesteps //
|
322 |
self.scheduler.num_inference_steps)
|
323 |
alpha_prod_t = self.scheduler.alphas_cumprod[t]
|
|
|
343 |
Args:
|
344 |
x0 (torch.Tensor): first tensor to interpolate between
|
345 |
x1 (torch.Tensor): seconds tensor to interpolate between
|
346 |
+
alpha (float): interpolation between 0 and 1
|
347 |
|
348 |
Returns:
|
349 |
torch.Tensor: interpolated tensor
|
notebooks/test_model.ipynb
CHANGED
@@ -53,6 +53,25 @@
 "from audiodiffusion import AudioDiffusion"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "b294a94a",
+"metadata": {},
+"outputs": [],
+"source": [
+"mel = Mel(x_res=256, y_res=256)\n",
+"generator = torch.Generator()"
+]
+},
+{
+"cell_type": "markdown",
+"id": "f3feb265",
+"metadata": {},
+"source": [
+"## DDPM (Denoising Diffusion Probabilistic Models)"
+]
+},
 {
 "cell_type": "markdown",
 "id": "7fd945bb",
@@ -74,8 +93,6 @@
 "\n",
 "#@markdown teticio/audio-diffusion-instrumental-hiphop-256 - trained on instrumental hiphop\n",
 "\n",
-"#@markdown teticio/audio-diffusion-ddim-256 - DDIM model trained on my Spotify \"liked\" playlist\n",
-"\n",
 "model_id = \"teticio/audio-diffusion-256\" #@param [\"teticio/audio-diffusion-256\", \"teticio/audio-diffusion-breaks-256\", \"audio-diffusion-instrumenal-hiphop-256\", \"teticio/audio-diffusion-ddim-256\"]"
 ]
 },
@@ -86,9 +103,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"audio_diffusion = AudioDiffusion(model_id=model_id)\n",
-"mel = Mel(x_res=256, y_res=256)\n",
-"generator = torch.Generator()"
+"audio_diffusion = AudioDiffusion(model_id=model_id)"
 ]
 },
 {
@@ -299,17 +314,90 @@
 " audio2) = audio_diffusion.generate_spectrogram_and_audio_from_audio(\n",
 " raw_audio=mel.get_audio_slice(slice),\n",
 " mask_start_secs=1,\n",
-" mask_end_secs=1)\n",
+" mask_end_secs=1,\n",
+" step_generator=torch.Generator())\n",
 "display(Audio(audio, rate=sample_rate))\n",
 "display(Audio(audio2, rate=sample_rate))"
 ]
 },
 {
 "cell_type": "markdown",
-"id": "
+"id": "efc32dae",
+"metadata": {},
+"source": [
+"## DDIM (Denoising Diffusion Implicit Models)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "a021f78a",
+"metadata": {},
+"outputs": [],
+"source": [
+"audio_diffusion = AudioDiffusion(model_id='teticio/audio-diffusion-ddim-256')"
+]
+},
+{
+"cell_type": "markdown",
+"id": "deb23339",
 "metadata": {},
 "source": [
-"###
+"### Generation can be done in many fewer steps with DDIMs"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "c105a497",
+"metadata": {},
+"outputs": [],
+"source": [
+"for _ in range(10):\n",
+" seed = generator.seed()\n",
+" print(f'Seed = {seed}')\n",
+" generator.manual_seed(seed)\n",
+" image, (sample_rate,\n",
+" audio) = audio_diffusion.generate_spectrogram_and_audio(\n",
+" generator=generator)\n",
+" display(image)\n",
+" display(Audio(audio, rate=sample_rate))\n",
+" loop = AudioDiffusion.loop_it(audio, sample_rate)\n",
+" if loop is not None:\n",
+" display(Audio(loop, rate=sample_rate))\n",
+" else:\n",
+" print(\"Unable to determine loop points\")"
+]
+},
+{
+"cell_type": "markdown",
+"id": "cab4692c",
+"metadata": {},
+"source": [
+"The parameter eta controls the variance:\n",
+"* 0 - DDIM (deterministic)\n",
+"* 1 - DDPM (Denoising Diffusion)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "72bdd207",
+"metadata": {},
+"outputs": [],
+"source": [
+"image, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio(\n",
+" steps=1000, generator=generator, eta=1)\n",
+"display(image)\n",
+"display(Audio(audio, rate=sample_rate))"
+]
+},
+{
+"cell_type": "markdown",
+"id": "b8d5442c",
+"metadata": {},
+"source": [
+"### DDIMs can be used as encoders..."
 ]
 },
 {
@@ -319,35 +407,94 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"
+"# Doesn't have to be an audio from the train dataset, this is just for convenience\n",
+"ds = load_dataset('teticio/audio-diffusion-256')"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "
+"id": "278d1d80",
 "metadata": {},
 "outputs": [],
 "source": [
-"image =
-"image"
+"image = ds['train'][264]['image']\n",
+"display(Audio(mel.image_to_audio(image), rate=mel.get_sample_rate()))"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "
 "metadata": {},
 "outputs": [],
 "source": [
-"
-
+"id": "912b54e4",
+"noise = audio_diffusion.pipe.encode([image], steps=50)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "c7b31f97",
+"metadata": {},
+"outputs": [],
+"source": [
+"# Reconstruct original audio from noise\n",
+"_, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio(\n",
+" noise=noise, generator=generator)\n",
+"display(Audio(audio, rate=sample_rate))"
+]
+},
+{
+"cell_type": "markdown",
+"id": "998c776b",
+"metadata": {},
+"source": [
+"### ...or to interpolate between audios"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "33f82367",
+"metadata": {},
+"outputs": [],
+"source": [
+"image2 = ds['train'][15978]['image']\n",
+"display(Audio(mel.image_to_audio(image2), rate=mel.get_sample_rate()))"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "f93fb6c0",
+"metadata": {},
+"outputs": [],
+"source": [
+"noise2 = audio_diffusion.pipe.encode([image2], steps=50)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "a4190563",
+"metadata": {},
+"outputs": [],
+"source": [
+"alpha = 0.5 #@param {type:\"slider\", min:0, max:1, step:.1}\n",
+"_, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio(\n",
+" noise=audio_diffusion.pipe.slerp(noise, noise2, alpha),\n",
+" steps=50,\n",
+" generator=generator)\n",
+"display(Audio(mel.image_to_audio(image), rate=mel.get_sample_rate()))\n",
+"display(Audio(mel.image_to_audio(image2), rate=mel.get_sample_rate()))\n",
+"display(Audio(audio, rate=sample_rate))"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "
+"id": "0b05539f",
 "metadata": {},
 "outputs": [],
 "source": []
@@ -374,7 +521,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.
+"version": "3.8.9"
 },
 "toc": {
 "base_numbering": 1,
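The generation loop added to the notebook calls `AudioDiffusion.loop_it(audio, sample_rate)` and prints "Unable to determine loop points" when it returns None. Its implementation is not part of this diff; given that `audiodiffusion/__init__.py` imports `librosa.beat.beat_track`, a purely illustrative sketch of how beat-aligned loop points might be found (not the repo's actual code) could look like this:

```python
from typing import Optional

import numpy as np
from librosa.beat import beat_track

def loop_it_sketch(audio: np.ndarray, sample_rate: int, loops: int = 2) -> Optional[np.ndarray]:
    """Illustrative only: cut the audio at detected beats and tile it into a loop."""
    _, beats = beat_track(y=audio, sr=sample_rate, units="samples")
    if len(beats) < 2:
        return None  # mirrors the notebook's "Unable to determine loop points" branch
    segment = audio[beats[0]:beats[-1]]
    return np.tile(segment, loops)
```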