Robert Smith committed • Commit 43ebb3b • 1 Parent: f34a81b
Commit message: tidy

Files changed:
- .gitignore +2 -0
- README.md +19 -12
- audiodiffusion/__init__.py +33 -24
- notebooks/test_model.ipynb +164 -17
.gitignore
CHANGED
@@ -9,3 +9,5 @@ audiodiffusion.egg-info
 lightning_logs
 taming
 checkpoints
+Pipfile
+Pipfile.lock
README.md
CHANGED
@@ -11,20 +11,19 @@ license: gpl-3.0
 ---
 # audio-diffusion [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/gradio_app.ipynb)
 
-### Apply
+### Apply diffusion models to synthesize music instead of images using the new Hugging Face [diffusers](https://github.com/huggingface/diffusers) package.
 
 ---
 
 **UPDATES**:
 
+**22/10/2022**. Added DDIM encoder and ability to interpolate between audios in latent "noise" space. Mel spectrograms no longer have to be square (thanks to Tristan for this one), so you can set the vertical (frequency) and horizontal (time) resolutions independently.
+
-Added latent audio diffusion (see below). Also added the possibility to train a model to use DDIM ([Denoising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf)) by setting `--scheduler ddim`. These have the benefit that samples can be generated with much fewer steps (~50) than used in training.
+**15/10/2022**. Added latent audio diffusion (see below). Also added the possibility to train a DDIM ([Denoising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf)). These have the benefit that samples can be generated with much fewer steps (~50) than used in training.
 
-It is now possible to mask parts of the input audio during generation which means you can stitch several samples together (think "out-painting").
+**4/10/2022**. It is now possible to mask parts of the input audio during generation which means you can stitch several samples together (think "out-painting").
+
+**27/9/2022**. You can now generate an audio based on a previous one. You can use this to generate variations of the same audio or even to "remix" a track (via a sort of "style transfer"). You can find examples of how to do this in the [`test_model.ipynb`](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) notebook.
 
 ---
 
@@ -32,11 +31,13 @@ You can now generate an audio based on a previous one. You can use this to gener
 
 ---
 
+## DDPM ([De-noising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239))
+
 Audio can be represented as images by transforming to a [mel spectrogram](https://en.wikipedia.org/wiki/Mel-frequency_cepstrum), such as the one shown above. The class `Mel` in `mel.py` can convert a slice of audio into a mel spectrogram of `x_res` x `y_res` and vice versa. The higher the resolution, the less audio information will be lost. You can see how this works in the [`test_mel.ipynb`](https://github.com/teticio/audio-diffusion/blob/main/notebooks/test_mel.ipynb) notebook.
 
-A DDPM
+A DDPM is trained on a set of mel spectrograms that have been generated from a directory of audio files. It is then used to synthesize similar mel spectrograms, which are then converted back into audio.
 
-You can play around with some
+You can play around with some pre-trained models on [Google Colab](https://colab.research.google.com/github/teticio/audio-diffusion/blob/master/notebooks/test_model.ipynb) or [Hugging Face spaces](https://huggingface.co/spaces/teticio/audio-diffusion). Check out some automatically generated loops [here](https://soundcloud.com/teticio2/sets/audio-diffusion-loops).
 
 
 | Model | Dataset | Description |
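The hunk above describes how the `Mel` class turns slices of audio into mel spectrogram images of `x_res` x `y_res` and back again. A minimal usage sketch, pieced together from calls that appear elsewhere in this commit (`load_audio`, `get_audio_slice`, `image_to_audio`, `get_sample_rate`); the audio-to-image method name `audio_slice_to_image` is an assumption, as it is not shown in this diff:

```python
from audiodiffusion.mel import Mel

# 256x256 spectrograms, the resolution used in notebooks/test_model.ipynb
mel = Mel(x_res=256, y_res=256)

mel.load_audio("my_track.wav")        # or raw_audio=<np.ndarray>
audio_slice = mel.get_audio_slice(0)  # first slice of the track

# audio -> spectrogram image (method name assumed, not confirmed by this diff)
image = mel.audio_slice_to_image(0)

# spectrogram image -> audio round trip
reconstructed = mel.image_to_audio(image)
sample_rate = mel.get_sample_rate()
```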
@@ -54,7 +55,6 @@ pip install .
 ```
 
 #### Training can be run with Mel spectrograms of resolution 64x64 on a single commercial grade GPU (e.g. RTX 2080 Ti). The `hop_length` should be set to 1024 for better results.
-
 ```bash
 python scripts/audio_to_images.py \
 --resolution 64,64 \
@@ -119,10 +119,17 @@ accelerate launch --config_file config/accelerate_sagemaker.yaml \
 --lr_warmup_steps 500 \
 --mixed_precision no
 ```
+## DDIM ([De-noising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf))
+#### A DDIM can be trained by adding the parameter
+```bash
+--scheduler ddim
+```
+Inference can then be run with far fewer steps than the number used for training (e.g., ~50), allowing for much faster generation. Without retraining, the parameter `eta` can be used to replicate a DDPM if it is set to 1 or a DDIM if it is set to 0, with all values in between being valid. When `eta` is 0 (the default value), the de-noising procedure is deterministic, which means that it can be run in reverse as a kind of encoder that recovers the original noise used in generation. A function `encode` has been added to `AudioDiffusionPipeline` for this purpose. It is then possible to interpolate between audios in the latent "noise" space using the function `slerp` (Spherical Linear intERPolation).
+
 ## Latent Audio Diffusion
-Rather than
+Rather than de-noising images directly, it is interesting to work in the "latent space" after first encoding images using an autoencoder. This has a number of advantages. Firstly, the information in the images is compressed into a latent space of a much lower dimension, so it is much faster to train de-noising diffusion models and run inference with them. Secondly, similar images tend to be clustered together and interpolating between two images in latent space can produce meaningful combinations.
 
-At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality
+At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality (rather like its cousin `transformers` in the early days of development). In order to train a VAE (Variational AutoEncoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format. Note that it uses a perceptual loss function for images; it would be nice to try a perceptual *audio* loss function.
 
 #### Install dependencies to train with Stable Diffusion
 ```
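A sketch of the DDIM workflow the hunk above describes (fewer inference steps, `eta`, the new `encode` method and `slerp` interpolation), assembled from the calls this commit adds to `notebooks/test_model.ipynb`; in a real run you would encode a second, different spectrogram before interpolating:

```python
import torch
from audiodiffusion import AudioDiffusion

audio_diffusion = AudioDiffusion(model_id="teticio/audio-diffusion-ddim-256")
generator = torch.Generator()

# DDIM: ~50 de-noising steps instead of the ~1000 used in training
image, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio(
    steps=50, generator=generator, eta=0)

# eta=0 is deterministic, so the process can be reversed to recover the noise...
noise = audio_diffusion.pipe.encode([image], steps=50)
# ...in practice encode a second spectrogram here; reusing `image` keeps the sketch runnable
noise2 = audio_diffusion.pipe.encode([image], steps=50)

# ...and two noises can be mixed with spherical linear interpolation
_, (sample_rate, audio_mix) = audio_diffusion.generate_spectrogram_and_audio(
    noise=audio_diffusion.pipe.slerp(noise, noise2, 0.5),
    steps=50,
    generator=generator)
```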
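The Latent Audio Diffusion hunk above is about de-noising in the latent space of a VAE instead of on spectrogram pixels. A rough sketch of that idea using the `AutoencoderKL` class this commit starts importing from `diffusers`; the checkpoint name and the `encode`/`decode` return-value handling are assumptions about the `diffusers` API rather than anything defined in this repo:

```python
import torch
from diffusers import AutoencoderKL

# Any KL autoencoder in diffusers format would do; this checkpoint name is illustrative
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

spectrograms = torch.randn(1, 3, 256, 256)  # stand-in for a batch of (3-channel) mel spectrograms

with torch.no_grad():
    latents = vae.encode(spectrograms).latent_dist.sample()  # much lower-dimensional than the input
    # ...a diffusion model would be trained on / sampled in `latents` space here...
    decoded = vae.decode(latents).sample                     # back to spectrogram space
```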
audiodiffusion/__init__.py
CHANGED
@@ -6,12 +6,12 @@ import numpy as np
 from PIL import Image
 from tqdm.auto import tqdm
 from librosa.beat import beat_track
-from diffusers import (DiffusionPipeline,
-
+from diffusers import (DiffusionPipeline, UNet2DConditionModel, DDIMScheduler,
+                       DDPMScheduler, AutoencoderKL)
 
 from .mel import Mel
 
-VERSION = "1.2.
+VERSION = "1.2.3"
 
 
 class AudioDiffusion:
@@ -24,7 +24,7 @@ class AudioDiffusion:
                  top_db: int = 80,
                  cuda: bool = torch.cuda.is_available(),
                  progress_bar: Iterable = tqdm):
-        """Class for generating audio using
+        """Class for generating audio using De-noising Diffusion Probabilistic Models.
 
         Args:
             model_id (String): name of model (local directory or Hugging Face Hub)
@@ -60,18 +60,21 @@ class AudioDiffusion:
                        top_db=top_db)
 
     def generate_spectrogram_and_audio(
-
-
-
-
-
+            self,
+            steps: int = None,
+            generator: torch.Generator = None,
+            step_generator: torch.Generator = None,
+            eta: float = 0,
+            noise: torch.Tensor = None
+    ) -> Tuple[Image.Image, Tuple[int, np.ndarray]]:
         """Generate random mel spectrogram and convert to audio.
 
         Args:
-            steps (int): number of de-noising steps
+            steps (int): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
             generator (torch.Generator): random number generator or None
-            step_generator (torch.Generator): random number generator used to
+            step_generator (torch.Generator): random number generator used to de-noise or None
             eta (float): parameter between 0 and 1 used with DDIM scheduler
+            noise (torch.Tensor): noisy image or None
 
         Returns:
             PIL Image: mel spectrogram
@@ -83,7 +86,8 @@ class AudioDiffusion:
            steps=steps,
            generator=generator,
            step_generator=step_generator,
-           eta=eta
+           eta=eta,
+           noise=noise)
        return images[0], (sample_rate, audios[0])
 
    def generate_spectrogram_and_audio_from_audio(
@@ -92,7 +96,7 @@ class AudioDiffusion:
            raw_audio: np.ndarray = None,
            slice: int = 0,
            start_step: int = 0,
-           steps: int =
+           steps: int = None,
            generator: torch.Generator = None,
            mask_start_secs: float = 0,
            mask_end_secs: float = 0,
@@ -107,11 +111,11 @@ class AudioDiffusion:
            raw_audio (np.ndarray): audio as numpy array
            slice (int): slice number of audio to convert
            start_step (int): step to start from
-           steps (int): number of de-noising steps
+           steps (int): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
            generator (torch.Generator): random number generator or None
            mask_start_secs (float): number of seconds of audio to mask (not generate) at start
            mask_end_secs (float): number of seconds of audio to mask (not generate) at end
-           step_generator (torch.Generator): random number generator used to
+           step_generator (torch.Generator): random number generator used to de-noise or None
            eta (float): parameter between 0 and 1 used with DDIM scheduler
            noise (torch.Tensor): noisy image or None
 
@@ -173,7 +177,7 @@ class AudioDiffusionPipeline(DiffusionPipeline):
            raw_audio: np.ndarray = None,
            slice: int = 0,
            start_step: int = 0,
-           steps: int =
+           steps: int = None,
            generator: torch.Generator = None,
            mask_start_secs: float = 0,
            mask_end_secs: float = 0,
@@ -190,23 +194,24 @@ class AudioDiffusionPipeline(DiffusionPipeline):
            raw_audio (np.ndarray): audio as numpy array
            slice (int): slice number of audio to convert
            start_step (int): step to start from
-           steps (int): number of de-noising steps
+           steps (int): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
            generator (torch.Generator): random number generator or None
            mask_start_secs (float): number of seconds of audio to mask (not generate) at start
            mask_end_secs (float): number of seconds of audio to mask (not generate) at end
-           step_generator (torch.Generator): random number generator used to
+           step_generator (torch.Generator): random number generator used to de-noise or None
            eta (float): parameter between 0 and 1 used with DDIM scheduler
-           noise (torch.Tensor):
+           noise (torch.Tensor): noise tensor of shape (batch_size, 1, height, width) or None
 
        Returns:
            List[PIL Image]: mel spectrograms
            (float, List[np.ndarray]): sample rate and raw audios
        """
 
+       steps = steps or 50 if isinstance(self.scheduler,
+                                         DDIMScheduler) else 1000
        self.scheduler.set_timesteps(steps)
        step_generator = step_generator or generator
-
-       # For backwards compatiibility
+       # For backwards compatibility
        if type(self.unet.sample_size) == int:
            self.unet.sample_size = (self.unet.sample_size,
                                     self.unet.sample_size)
@@ -215,6 +220,7 @@ class AudioDiffusionPipeline(DiffusionPipeline):
            (batch_size, self.unet.in_channels) + self.unet.sample_size,
            generator=generator)
        images = noise
+       mask = None
 
        if audio_file is not None or raw_audio is not None:
            mel.load_audio(audio_file, raw_audio)
@@ -289,11 +295,12 @@ class AudioDiffusionPipeline(DiffusionPipeline):
        return images, (mel.get_sample_rate(), audios)
 
    @torch.no_grad()
-   def encode(self, images: List[Image.Image]) -> np.ndarray:
+   def encode(self, images: List[Image.Image], steps: int = 50) -> np.ndarray:
        """Reverse step process: recover noisy image from generated image.
 
        Args:
            images (List[PIL Image]): list of images to encode
+           steps (int): number of encoding steps to perform (defaults to 50)
 
        Returns:
            np.ndarray: noise tensor of shape (batch_size, 1, height, width)
@@ -301,6 +308,7 @@ class AudioDiffusionPipeline(DiffusionPipeline):
 
        # Only works with DDIM as this method is deterministic
        assert isinstance(self.scheduler, DDIMScheduler)
+       self.scheduler.set_timesteps(steps)
        sample = np.array([
            np.frombuffer(image.tobytes(), dtype="uint8").reshape(
                (1, image.height, image.width)) for image in images
@@ -308,7 +316,8 @@ class AudioDiffusionPipeline(DiffusionPipeline):
        sample = ((sample / 255) * 2 - 1)
        sample = torch.Tensor(sample).to(self.device)
 
-       for t in torch.flip(self.scheduler.timesteps,
+       for t in self.progress_bar(torch.flip(self.scheduler.timesteps,
+                                             (0, ))):
            prev_timestep = (t - self.scheduler.num_train_timesteps //
                             self.scheduler.num_inference_steps)
            alpha_prod_t = self.scheduler.alphas_cumprod[t]
@@ -334,7 +343,7 @@ class AudioDiffusionPipeline(DiffusionPipeline):
        Args:
            x0 (torch.Tensor): first tensor to interpolate between
            x1 (torch.Tensor): second tensor to interpolate between
-           alpha (float): interpolation
+           alpha (float): interpolation between 0 and 1
 
        Returns:
            torch.Tensor: interpolated tensor
from PIL import Image
|
7 |
from tqdm.auto import tqdm
|
8 |
from librosa.beat import beat_track
|
9 |
+
from diffusers import (DiffusionPipeline, UNet2DConditionModel, DDIMScheduler,
|
10 |
+
DDPMScheduler, AutoencoderKL)
|
11 |
|
12 |
from .mel import Mel
|
13 |
|
14 |
+
VERSION = "1.2.3"
|
15 |
|
16 |
|
17 |
class AudioDiffusion:
|
|
|
24 |
top_db: int = 80,
|
25 |
cuda: bool = torch.cuda.is_available(),
|
26 |
progress_bar: Iterable = tqdm):
|
27 |
+
"""Class for generating audio using De-noising Diffusion Probabilistic Models.
|
28 |
|
29 |
Args:
|
30 |
model_id (String): name of model (local directory or Hugging Face Hub)
|
|
|
60 |
top_db=top_db)
|
61 |
|
62 |
def generate_spectrogram_and_audio(
|
63 |
+
self,
|
64 |
+
steps: int = None,
|
65 |
+
generator: torch.Generator = None,
|
66 |
+
step_generator: torch.Generator = None,
|
67 |
+
eta: float = 0,
|
68 |
+
noise: torch.Tensor = None
|
69 |
+
) -> Tuple[Image.Image, Tuple[int, np.ndarray]]:
|
70 |
"""Generate random mel spectrogram and convert to audio.
|
71 |
|
72 |
Args:
|
73 |
+
steps (int): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
|
74 |
generator (torch.Generator): random number generator or None
|
75 |
+
step_generator (torch.Generator): random number generator used to de-noise or None
|
76 |
eta (float): parameter between 0 and 1 used with DDIM scheduler
|
77 |
+
noise (torch.Tensor): noisy image or None
|
78 |
|
79 |
Returns:
|
80 |
PIL Image: mel spectrogram
|
|
|
86 |
steps=steps,
|
87 |
generator=generator,
|
88 |
step_generator=step_generator,
|
89 |
+
eta=eta,
|
90 |
+
noise=noise)
|
91 |
return images[0], (sample_rate, audios[0])
|
92 |
|
93 |
def generate_spectrogram_and_audio_from_audio(
|
|
|
96 |
raw_audio: np.ndarray = None,
|
97 |
slice: int = 0,
|
98 |
start_step: int = 0,
|
99 |
+
steps: int = None,
|
100 |
generator: torch.Generator = None,
|
101 |
mask_start_secs: float = 0,
|
102 |
mask_end_secs: float = 0,
|
|
|
111 |
raw_audio (np.ndarray): audio as numpy array
|
112 |
slice (int): slice number of audio to convert
|
113 |
start_step (int): step to start from
|
114 |
+
steps (int): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
|
115 |
generator (torch.Generator): random number generator or None
|
116 |
mask_start_secs (float): number of seconds of audio to mask (not generate) at start
|
117 |
mask_end_secs (float): number of seconds of audio to mask (not generate) at end
|
118 |
+
step_generator (torch.Generator): random number generator used to de-noise or None
|
119 |
eta (float): parameter between 0 and 1 used with DDIM scheduler
|
120 |
noise (torch.Tensor): noisy image or None
|
121 |
|
|
|
177 |
raw_audio: np.ndarray = None,
|
178 |
slice: int = 0,
|
179 |
start_step: int = 0,
|
180 |
+
steps: int = None,
|
181 |
generator: torch.Generator = None,
|
182 |
mask_start_secs: float = 0,
|
183 |
mask_end_secs: float = 0,
|
|
|
194 |
raw_audio (np.ndarray): audio as numpy array
|
195 |
slice (int): slice number of audio to convert
|
196 |
start_step (int): step to start from
|
197 |
+
steps (int): number of de-noising steps (defaults to 50 for DDIM, 1000 for DDPM)
|
198 |
generator (torch.Generator): random number generator or None
|
199 |
mask_start_secs (float): number of seconds of audio to mask (not generate) at start
|
200 |
mask_end_secs (float): number of seconds of audio to mask (not generate) at end
|
201 |
+
step_generator (torch.Generator): random number generator used to de-noise or None
|
202 |
eta (float): parameter between 0 and 1 used with DDIM scheduler
|
203 |
+
noise (torch.Tensor): noise tensor of shape (batch_size, 1, height, width) or None
|
204 |
|
205 |
Returns:
|
206 |
List[PIL Image]: mel spectrograms
|
207 |
(float, List[np.ndarray]): sample rate and raw audios
|
208 |
"""
|
209 |
|
210 |
+
steps = steps or 50 if isinstance(self.scheduler,
|
211 |
+
DDIMScheduler) else 1000
|
212 |
self.scheduler.set_timesteps(steps)
|
213 |
step_generator = step_generator or generator
|
214 |
+
# For backwards compatibility
|
|
|
215 |
if type(self.unet.sample_size) == int:
|
216 |
self.unet.sample_size = (self.unet.sample_size,
|
217 |
self.unet.sample_size)
|
|
|
220 |
(batch_size, self.unet.in_channels) + self.unet.sample_size,
|
221 |
generator=generator)
|
222 |
images = noise
|
223 |
+
mask = None
|
224 |
|
225 |
if audio_file is not None or raw_audio is not None:
|
226 |
mel.load_audio(audio_file, raw_audio)
|
|
|
295 |
return images, (mel.get_sample_rate(), audios)
|
296 |
|
297 |
@torch.no_grad()
|
298 |
+
def encode(self, images: List[Image.Image], steps: int = 50) -> np.ndarray:
|
299 |
"""Reverse step process: recover noisy image from generated image.
|
300 |
|
301 |
Args:
|
302 |
images (List[PIL Image]): list of images to encode
|
303 |
+
steps (int): number of encoding steps to perform (defaults to 50)
|
304 |
|
305 |
Returns:
|
306 |
np.ndarray: noise tensor of shape (batch_size, 1, height, width)
|
|
|
308 |
|
309 |
# Only works with DDIM as this method is deterministic
|
310 |
assert isinstance(self.scheduler, DDIMScheduler)
|
311 |
+
self.scheduler.set_timesteps(steps)
|
312 |
sample = np.array([
|
313 |
np.frombuffer(image.tobytes(), dtype="uint8").reshape(
|
314 |
(1, image.height, image.width)) for image in images
|
|
|
316 |
sample = ((sample / 255) * 2 - 1)
|
317 |
sample = torch.Tensor(sample).to(self.device)
|
318 |
|
319 |
+
for t in self.progress_bar(torch.flip(self.scheduler.timesteps,
|
320 |
+
(0, ))):
|
321 |
prev_timestep = (t - self.scheduler.num_train_timesteps //
|
322 |
self.scheduler.num_inference_steps)
|
323 |
alpha_prod_t = self.scheduler.alphas_cumprod[t]
|
|
|
343 |
Args:
|
344 |
x0 (torch.Tensor): first tensor to interpolate between
|
345 |
x1 (torch.Tensor): seconds tensor to interpolate between
|
346 |
+
alpha (float): interpolation between 0 and 1
|
347 |
|
348 |
Returns:
|
349 |
torch.Tensor: interpolated tensor
|
notebooks/test_model.ipynb
CHANGED
@@ -53,6 +53,25 @@
 "from audiodiffusion import AudioDiffusion"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "b294a94a",
+"metadata": {},
+"outputs": [],
+"source": [
+"mel = Mel(x_res=256, y_res=256)\n",
+"generator = torch.Generator()"
+]
+},
+{
+"cell_type": "markdown",
+"id": "f3feb265",
+"metadata": {},
+"source": [
+"## DDPM (Denoising Diffusion Probabilistic Models)"
+]
+},
 {
 "cell_type": "markdown",
 "id": "7fd945bb",
@@ -74,8 +93,6 @@
 "\n",
 "#@markdown teticio/audio-diffusion-instrumental-hiphop-256 - trained on instrumental hiphop\n",
 "\n",
-"#@markdown teticio/audio-diffusion-ddim-256 - DDIM model trained on my Spotify \"liked\" playlist\n",
-"\n",
 "model_id = \"teticio/audio-diffusion-256\" #@param [\"teticio/audio-diffusion-256\", \"teticio/audio-diffusion-breaks-256\", \"audio-diffusion-instrumenal-hiphop-256\", \"teticio/audio-diffusion-ddim-256\"]"
 ]
 },
@@ -86,9 +103,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"audio_diffusion = AudioDiffusion(model_id=model_id)\n",
-"mel = Mel(x_res=256, y_res=256)\n",
-"generator = torch.Generator()"
+"audio_diffusion = AudioDiffusion(model_id=model_id)"
 ]
 },
 {
@@ -299,17 +314,90 @@
 " audio2) = audio_diffusion.generate_spectrogram_and_audio_from_audio(\n",
 " raw_audio=mel.get_audio_slice(slice),\n",
 " mask_start_secs=1,\n",
-" mask_end_secs=1)\n",
+" mask_end_secs=1,\n",
+" step_generator=torch.Generator())\n",
 "display(Audio(audio, rate=sample_rate))\n",
 "display(Audio(audio2, rate=sample_rate))"
 ]
 },
 {
 "cell_type": "markdown",
-"id": "
+"id": "efc32dae",
+"metadata": {},
+"source": [
+"## DDIM (Denoising Diffusion Implicit Models)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "a021f78a",
+"metadata": {},
+"outputs": [],
+"source": [
+"audio_diffusion = AudioDiffusion(model_id='teticio/audio-diffusion-ddim-256')"
+]
+},
+{
+"cell_type": "markdown",
+"id": "deb23339",
 "metadata": {},
 "source": [
-"###
+"### Generation can be done in many fewer steps with DDIMs"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "c105a497",
+"metadata": {},
+"outputs": [],
+"source": [
+"for _ in range(10):\n",
+" seed = generator.seed()\n",
+" print(f'Seed = {seed}')\n",
+" generator.manual_seed(seed)\n",
+" image, (sample_rate,\n",
+" audio) = audio_diffusion.generate_spectrogram_and_audio(\n",
+" generator=generator)\n",
+" display(image)\n",
+" display(Audio(audio, rate=sample_rate))\n",
+" loop = AudioDiffusion.loop_it(audio, sample_rate)\n",
+" if loop is not None:\n",
+" display(Audio(loop, rate=sample_rate))\n",
+" else:\n",
+" print(\"Unable to determine loop points\")"
+]
+},
+{
+"cell_type": "markdown",
+"id": "cab4692c",
+"metadata": {},
+"source": [
+"The parameter eta controls the variance:\n",
+"* 0 - DDIM (deterministic)\n",
+"* 1 - DDPM (Denoising Diffusion)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "72bdd207",
+"metadata": {},
+"outputs": [],
+"source": [
+"image, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio(\n",
+" steps=1000, generator=generator, eta=1)\n",
+"display(image)\n",
+"display(Audio(audio, rate=sample_rate))"
+]
+},
+{
+"cell_type": "markdown",
+"id": "b8d5442c",
+"metadata": {},
+"source": [
+"### DDIMs can be used as encoders..."
 ]
 },
 {
@@ -319,35 +407,94 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"
+"# Doesn't have to be an audio from the train dataset, this is just for convenience\n",
+"ds = load_dataset('teticio/audio-diffusion-256')"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "
+"id": "278d1d80",
 "metadata": {},
 "outputs": [],
 "source": [
-"image =
-"image"
+"image = ds['train'][264]['image']\n",
+"display(Audio(mel.image_to_audio(image), rate=mel.get_sample_rate()))"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "
 "metadata": {},
 "outputs": [],
 "source": [
-"
-
+"id": "912b54e4",
+"noise = audio_diffusion.pipe.encode([image], steps=50)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "c7b31f97",
+"metadata": {},
+"outputs": [],
+"source": [
+"# Reconstruct original audio from noise\n",
+"_, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio(\n",
+" noise=noise, generator=generator)\n",
+"display(Audio(audio, rate=sample_rate))"
+]
+},
+{
+"cell_type": "markdown",
+"id": "998c776b",
+"metadata": {},
+"source": [
+"### ...or to interpolate between audios"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "33f82367",
+"metadata": {},
+"outputs": [],
+"source": [
+"image2 = ds['train'][15978]['image']\n",
+"display(Audio(mel.image_to_audio(image2), rate=mel.get_sample_rate()))"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "f93fb6c0",
+"metadata": {},
+"outputs": [],
+"source": [
+"noise2 = audio_diffusion.pipe.encode([image2], steps=50)"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "a4190563",
+"metadata": {},
+"outputs": [],
+"source": [
+"alpha = 0.5 #@param {type:\"slider\", min:0, max:1, step:.1}\n",
+"_, (sample_rate, audio) = audio_diffusion.generate_spectrogram_and_audio(\n",
+" noise=audio_diffusion.pipe.slerp(noise, noise2, alpha),\n",
+" steps=50,\n",
+" generator=generator)\n",
+"display(Audio(mel.image_to_audio(image), rate=mel.get_sample_rate()))\n",
+"display(Audio(mel.image_to_audio(image2), rate=mel.get_sample_rate()))\n",
+"display(Audio(audio, rate=sample_rate))"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "
+"id": "0b05539f",
 "metadata": {},
 "outputs": [],
 "source": []
@@ -374,7 +521,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.
+"version": "3.8.9"
 },
 "toc": {
 "base_numbering": 1,
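The generation loop added to the notebook calls `AudioDiffusion.loop_it(audio, sample_rate)` and prints "Unable to determine loop points" when it returns None. Its implementation is not part of this diff; given that `audiodiffusion/__init__.py` imports `librosa.beat.beat_track`, a purely illustrative sketch of how beat-aligned loop points might be found (not the repo's actual code) could look like this:

```python
from typing import Optional

import numpy as np
from librosa.beat import beat_track

def loop_it_sketch(audio: np.ndarray, sample_rate: int, loops: int = 2) -> Optional[np.ndarray]:
    """Illustrative only: cut the audio at detected beats and tile it into a loop."""
    _, beats = beat_track(y=audio, sr=sample_rate, units="samples")
    if len(beats) < 2:
        return None  # mirrors the notebook's "Unable to determine loop points" branch
    segment = audio[beats[0]:beats[-1]]
    return np.tile(segment, loops)
```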