# MAGNeT: Masked Audio Generation using a Single Non-Autoregressive Transformer

AudioCraft provides the code and models for MAGNeT, [Masked Audio Generation using a Single Non-Autoregressive Transformer][arxiv].

MAGNeT is a text-to-music and text-to-sound model capable of generating high-quality audio samples conditioned on text descriptions.
It is a masked generative non-autoregressive Transformer trained over a 32 kHz EnCodec tokenizer with 4 codebooks sampled at 50 Hz.
Unlike prior work on masked generative audio Transformers, such as [SoundStorm](https://arxiv.org/abs/2305.09636) and [VampNet](https://arxiv.org/abs/2307.04686),
MAGNeT requires neither semantic token conditioning, model cascading, nor audio prompting: it performs full text-to-audio generation with a single non-autoregressive Transformer.

Check out our [sample page][magnet_samples] or test the available demo!

We use 16K hours of licensed music to train MAGNeT. Specifically, we rely on an internal dataset
of 10K high-quality music tracks, and on the ShutterStock and Pond5 music data.
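To make the tokenizer figures concrete, here is a quick back-of-the-envelope sketch (plain Python, no AudioCraft dependency; the helper name is ours, purely illustrative) of the token grid a sample maps to under a 50 Hz, 4-codebook tokenizer:

```python
# Token-grid arithmetic for MAGNeT's EnCodec tokenizer:
# 4 codebooks, each producing discrete tokens at 50 Hz.
CODEBOOKS = 4
FRAME_RATE_HZ = 50

def token_grid(duration_s: float) -> tuple[int, int]:
    """Return (codebooks, frames) for a clip of the given duration."""
    return CODEBOOKS, int(duration_s * FRAME_RATE_HZ)

# A 10-second sample corresponds to a 4 x 500 grid of discrete tokens,
# i.e. 2000 tokens, which MAGNeT predicts non-autoregressively.
print(token_grid(10.0))  # (4, 500)
```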
## Model Card

See [the model card](../model_cards/MAGNET_MODEL_CARD.md).
## Installation

Please follow the AudioCraft installation instructions from the [README](../README.md).

AudioCraft requires a GPU with at least 16 GB of memory to run inference with the medium-sized models (~1.5B parameters).
## Usage

We currently offer two ways to interact with MAGNeT:
1. You can run the Gradio demo locally: [`python -m demos.magnet_app --share`](../demos/magnet_app.py).
2. You can play with MAGNeT in the Jupyter notebook at [`demos/magnet_demo.ipynb`](../demos/magnet_demo.ipynb) locally (if you have a GPU).
## API

We provide a simple API and 6 pre-trained models:
- `facebook/magnet-small-10secs`: 300M model, text-to-music, generates 10-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-small-10secs)
- `facebook/magnet-medium-10secs`: 1.5B model, text-to-music, generates 10-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-medium-10secs)
- `facebook/magnet-small-30secs`: 300M model, text-to-music, generates 30-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-small-30secs)
- `facebook/magnet-medium-30secs`: 1.5B model, text-to-music, generates 30-second samples - [🤗 Hub](https://huggingface.co/facebook/magnet-medium-30secs)
- `facebook/audio-magnet-small`: 300M model, text-to-sound-effect - [🤗 Hub](https://huggingface.co/facebook/audio-magnet-small)
- `facebook/audio-magnet-medium`: 1.5B model, text-to-sound-effect - [🤗 Hub](https://huggingface.co/facebook/audio-magnet-medium)

In order to use MAGNeT locally **you must have a GPU**. We recommend 16 GB of memory, especially for
the medium-sized models.
Below is a quick example of using the API:

```python
import torchaudio
from audiocraft.models import MAGNeT
from audiocraft.data.audio import audio_write

model = MAGNeT.get_pretrained('facebook/magnet-small-10secs')
descriptions = ['disco beat', 'energetic EDM', 'funky groove']
wav = model.generate(descriptions)  # generates 3 samples.

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness", loudness_compressor=True)
```
## 🤗 Transformers Usage

Coming soon...
## Training

The [MagnetSolver](../audiocraft/solvers/magnet.py) implements MAGNeT's training pipeline.
It defines a masked generation task over multiple streams of discrete tokens
extracted from a pre-trained EnCodec model (see the [EnCodec documentation](./ENCODEC.md)
for more details on how to train such a model).

Note that **we do NOT provide any of the datasets** used for training MAGNeT.
We provide a dummy dataset containing just a few examples for illustrative purposes.

Please first read the [TRAINING documentation](./TRAINING.md), in particular the Environment Setup section.
### Example configurations and grids

We provide configurations to reproduce the released models and our research.
MAGNeT solver configurations are available in [config/solver/magnet](../config/solver/magnet),
in particular:
* MAGNeT model for text-to-music:
[`solver=magnet/magnet_32khz`](../config/solver/magnet/magnet_32khz.yaml)
* MAGNeT model for text-to-sound:
[`solver=magnet/audio_magnet_16khz`](../config/solver/magnet/audio_magnet_16khz.yaml)

We provide 3 different scales: `model/lm/model_scale=small` (300M), `medium` (1.5B), and `large` (3.3B).

Please find some example grids to train MAGNeT at
[audiocraft/grids/magnet](../audiocraft/grids/magnet/).
```shell
# text-to-music
dora grid magnet.magnet_32khz --dry_run --init

# text-to-sound
dora grid magnet.audio_magnet_16khz --dry_run --init

# Remove the `--dry_run --init` flags to actually schedule the jobs once everything is setup.
```
### Datasets and metadata

Learn more in the [datasets section](./DATASETS.md).

#### Music Models

MAGNeT's underlying dataset is an AudioDataset augmented with music-specific metadata.
The MAGNeT dataset implementation expects the metadata to be available as `.json` files
at the same location as the audio files.

#### Sound Models

Audio-MAGNeT's underlying dataset is an AudioDataset augmented with description metadata.
The Audio-MAGNeT dataset implementation expects the metadata to be available as `.json` files
at the same location as the audio files, or through a specified external folder.
### Audio tokenizers

See [MusicGen](./MUSICGEN.md).
### Fine-tuning existing models

You can initialize your model from one of the pretrained models by using the `continue_from` argument, in particular:

```bash
# Using a pretrained MAGNeT model.
dora run solver=magnet/magnet_32khz model/lm/model_scale=medium continue_from=//pretrained/facebook/magnet-medium-10secs conditioner=text2music

# Using another model you already trained with a Dora signature SIG.
dora run solver=magnet/magnet_32khz model/lm/model_scale=medium continue_from=//sig/SIG conditioner=text2music

# Or providing a path manually.
dora run solver=magnet/magnet_32khz model/lm/model_scale=medium continue_from=/checkpoints/my_other_xp/checkpoint.th
```

**Warning:** You are responsible for selecting the other parameters in a way that makes them compatible
with the model you are fine-tuning. Configuration is NOT automatically inherited from the model you continue from. In particular, make sure to select the proper `conditioner` and `model/lm/model_scale`.

**Warning:** We currently do not support fine-tuning a model with slightly different layers. If you decide
to change some parts, like the conditioning or some other parts of the model, you are responsible for manually crafting a checkpoint file from which we can safely run `load_state_dict`.
If you decide to do so, make sure your checkpoint is saved with `torch.save` and contains a dict
`{'best_state': {'model': model_state_dict_here}}`. Give the path to `continue_from` directly, without a `//pretrained/` prefix.
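A minimal sketch of crafting such a checkpoint (the helper name is ours; with a real model you would pass `model.state_dict()` and persist the result with `torch.save`, as the warning above requires):

```python
def craft_checkpoint(model_state_dict: dict) -> dict:
    """Wrap a model state dict in the layout `continue_from` expects:
    {'best_state': {'model': ...}}.
    """
    return {"best_state": {"model": model_state_dict}}

# With a real model, persist it with torch so `load_state_dict` can run safely:
#   import torch
#   torch.save(craft_checkpoint(model.state_dict()), "/checkpoints/my_crafted/checkpoint.th")
```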
### Evaluation stage

For the 6 pretrained MAGNeT models, the objective metrics can be reproduced using the following grids:

```shell
# text-to-music
REGEN=1 dora grid magnet.magnet_pretrained_32khz_eval --dry_run --init

# text-to-sound
REGEN=1 dora grid magnet.audio_magnet_pretrained_16khz_eval --dry_run --init

# Remove the `--dry_run --init` flags to actually schedule the jobs once everything is setup.
```

See [MusicGen](./MUSICGEN.md) for more details.
### Generation stage

See [MusicGen](./MUSICGEN.md).
### Playing with the model

Once you have launched some experiments, you can easily get access
to the Solver with the latest trained model using the following snippet:

```python
from audiocraft.solvers.magnet import MagnetSolver

solver = MagnetSolver.get_eval_solver_from_sig('SIG', device='cpu', batch_size=8)
solver.model
solver.dataloaders
```
### Importing / Exporting models

We currently do not support loading a model from the Hugging Face implementation, or exporting to it.
If you want to export your model in a way that is compatible with the `audiocraft.models.MAGNeT`
API, you can run:

```python
from audiocraft.utils import export
from audiocraft import train

xp = train.main.get_xp_from_sig('SIG_OF_LM')
export.export_lm(xp.folder / 'checkpoint.th', '/checkpoints/my_audio_lm/state_dict.bin')

# You also need to bundle the EnCodec model you used!
## Case 1) you trained your own EnCodec model.
xp_encodec = train.main.get_xp_from_sig('SIG_OF_ENCODEC')
export.export_encodec(xp_encodec.folder / 'checkpoint.th', '/checkpoints/my_audio_lm/compression_state_dict.bin')

## Case 2) you used a pretrained model. Give the name you used without the //pretrained/ prefix.
## This will not dump the actual model, simply a pointer to the right model to download.
export.export_pretrained_compression_model('facebook/encodec_32khz', '/checkpoints/my_audio_lm/compression_state_dict.bin')
```

Now you can load your custom model with:

```python
import audiocraft.models
magnet = audiocraft.models.MAGNeT.get_pretrained('/checkpoints/my_audio_lm/')
```
### Learn more

Learn more about AudioCraft training pipelines in the [dedicated section](./TRAINING.md).

## FAQ

#### What are top-k, top-p, temperature and classifier-free guidance?

Check out [@FurkanGozukara's tutorial](https://github.com/FurkanGozukara/Stable-Diffusion/blob/main/Tutorials/AI-Music-Generation-Audiocraft-Tutorial.md#more-info-about-top-k-top-p-temperature-and-classifier-free-guidance-from-chatgpt).

#### Should I use FSDP or autocast?

The two are mutually exclusive (because FSDP does autocast on its own).
You can use autocast up to 1.5B (medium), if you have enough RAM on your GPU.
FSDP makes everything more complex but will free up some memory for the actual
activations by sharding the optimizer state.
## Citation

```
@misc{ziv2024masked,
      title={Masked Audio Generation using a Single Non-Autoregressive Transformer},
      author={Alon Ziv and Itai Gat and Gael Le Lan and Tal Remez and Felix Kreuk and Alexandre Défossez and Jade Copet and Gabriel Synnaeve and Yossi Adi},
      year={2024},
      eprint={2401.04577},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```

## License

See license information in the [model card](../model_cards/MAGNET_MODEL_CARD.md).

[arxiv]: https://arxiv.org/abs/2401.04577
[magnet_samples]: https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT/