teticio committed
Commit af8111a
Parent: b3e97c5

train latent dm with pre-trained vae from hf hub

Files changed (2)
  1. README.md +10 -1
  2. scripts/train_unconditional.py +14 -3
README.md CHANGED
@@ -119,11 +119,13 @@ accelerate launch --config_file config/accelerate_sagemaker.yaml \
   --lr_warmup_steps 500 \
   --mixed_precision no
 ```
+
 ## DDIM ([De-noising Diffusion Implicit Models](https://arxiv.org/pdf/2010.02502.pdf))
 #### A DDIM can be trained by adding the parameter
 ```bash
 --scheduler ddim
 ```
+
 Inference can then be run with far fewer steps than the number used for training (e.g., ~50), allowing for much faster generation. Without retraining, the parameter `eta` can be used to replicate a DDPM if it is set to 1 or a DDIM if it is set to 0, with all values in between being valid. When `eta` is 0 (the default value), the de-noising procedure is deterministic, which means that it can be run in reverse as a kind of encoder that recovers the original noise used in generation. A function `encode` has been added to `AudioDiffusionPipeline` for this purpose. It is then possible to interpolate between audios in the latent "noise" space using the function `slerp` (Spherical Linear intERPolation).
 
 ## Latent Audio Diffusion
@@ -131,7 +133,14 @@ Rather than de-noising images directly, it is interesting to work in the "latent
 
 At the time of writing, the Hugging Face `diffusers` library is geared towards inference and lacking in training functionality (rather like its cousin `transformers` in the early days of development). In order to train a VAE (Variational AutoEncoder), I use the [stable-diffusion](https://github.com/CompVis/stable-diffusion) repo from CompVis and convert the checkpoints to `diffusers` format. Note that it uses a perceptual loss function for images; it would be nice to try a perceptual *audio* loss function.
 
-#### Install dependencies to train with Stable Diffusion
+#### Train latent diffusion model using pre-trained VAE.
+```bash
+accelerate launch ...
+  ...
+  --vae teticio/latent-audio-diffusion-256
+```
+
+#### Install dependencies to train with Stable Diffusion.
 ```
 pip install omegaconf pytorch_lightning
 pip install -e git+https://github.com/CompVis/stable-diffusion.git@main#egg=latent-diffusion
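
As a rough illustration of what the new `--vae` option does, the sketch below (not taken from the repo) loads the pre-trained VAE from the Hub and measures the latent resolution the UNet will be trained at. It assumes `torch` and `diffusers` are installed and that the Hub id resolves to a stand-alone `AutoencoderKL` in `diffusers` format; the training script (see the diff below) also handles the case where the id points at a full pipeline by falling back to the pipeline's `vqvae`.

```python
# Minimal sketch of the --vae behaviour: load the pre-trained VAE and find the
# spatial size of its latents, so the UNet de-noises latents rather than
# full-size mel spectrograms. Assumes the Hub id holds a bare AutoencoderKL;
# the 256 resolution is an assumption matching the repo naming.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("teticio/latent-audio-diffusion-256")

resolution = 256  # pixel size of the mel spectrogram images (assumed)
with torch.no_grad():
    # Encode a dummy image with the VAE's expected channel count and read off
    # the latent spatial size, mirroring the latent_resolution computation in
    # train_unconditional.py.
    latents = vae.encode(
        torch.zeros(1, vae.config.in_channels, resolution,
                    resolution)).latent_dist.sample()
latent_resolution = latents.shape[2:]
print(latent_resolution)  # spatial size the UNet will be built for
```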
scripts/train_unconditional.py CHANGED
@@ -11,6 +11,7 @@ from accelerate.logging import get_logger
 from datasets import load_from_disk, load_dataset
 from diffusers import (DiffusionPipeline, DDPMScheduler, UNet2DModel,
                        DDIMScheduler, AutoencoderKL)
+from diffusers.modeling_utils import EntryNotFoundError
 from diffusers.hub_utils import init_git_repo, push_to_hub
 from diffusers.optimization import get_scheduler
 from diffusers.training_utils import EMAModel
@@ -85,7 +86,11 @@ def main(args):
 
     vqvae = None
     if args.vae is not None:
-        vqvae = AutoencoderKL.from_pretrained(args.vae)
+        try:
+            vqvae = AutoencoderKL.from_pretrained(args.vae)
+        except EnvironmentError:
+            vqvae = LatentAudioDiffusionPipeline.from_pretrained(
+                args.vae).vqvae
         # Determine latent resolution
         with torch.no_grad():
             latent_resolution = vqvae.encode(
@@ -93,10 +98,16 @@
                 resolution)).latent_dist.sample().shape[2:]
 
     if args.from_pretrained is not None:
-        pipeline = DiffusionPipeline.from_pretrained(args.from_pretrained)
+        pipeline = {
+            'LatentAudioDiffusionPipeline': LatentAudioDiffusionPipeline,
+            'AudioDiffusionPipeline': AudioDiffusionPipeline
+        }.get(
+            DiffusionPipeline.get_config_dict(
+                args.from_pretrained)['_class_name'], AudioDiffusionPipeline)
+        pipeline = pipeline.from_pretrained(args.from_pretrained)
         model = pipeline.unet
         if hasattr(pipeline, 'vqvae'):
-            vqvae = AutoencoderKL.from_pretrained(args.vae)
+            vqvae = pipeline.vqvae
     else:
         model = UNet2DModel(
             sample_size=resolution if vqvae is None else latent_resolution,
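
For reference, the dispatch on `_class_name` introduced above can be viewed as a small helper. The sketch below is a standalone illustration under the same `diffusers` version the commit targets (where `DiffusionPipeline.get_config_dict` is the config loader); the function and parameter names are hypothetical, and `AudioDiffusionPipeline` / `LatentAudioDiffusionPipeline` are assumed to be importable from this repo's own package, whose module path is not shown in the diff.

```python
# Standalone sketch (not from the repo) of the class-dispatch pattern used in
# the commit: read the saved pipeline's config and pick the pipeline class
# named in its `_class_name` field before calling from_pretrained.
from diffusers import DiffusionPipeline


def load_pipeline_by_class_name(pretrained_path, candidates, default):
    """Load a pipeline with the class recorded in its saved config.

    `candidates` maps `_class_name` strings to pipeline classes; `default`
    is used when the recorded name is not in the mapping.
    """
    class_name = DiffusionPipeline.get_config_dict(
        pretrained_path)['_class_name']
    cls = candidates.get(class_name, default)
    return cls.from_pretrained(pretrained_path)
```

In the commit this logic is written inline with `AudioDiffusionPipeline` as the default, so checkpoints saved as plain audio-diffusion pipelines still load, while latent checkpoints bring their VAE along via `pipeline.vqvae`.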