AudioCraft objective metrics

In addition to training losses, AudioCraft provides a set of objective metrics for audio synthesis and audio generation. Because these metrics may require extra dependencies and can be costly to compute, they are often disabled by default. This section explains how to set up and use these metrics in the AudioCraft training pipelines.

Available metrics

Audio synthesis quality metrics

SI-SNR

We provide an implementation of the Scale-Invariant Signal-to-Noise Ratio in PyTorch. This metric has no additional requirements. Activate it at the evaluation stage with the appropriate flag:

dora run <...> evaluate.metrics.sisnr=true
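To make the quantity concrete, here is a minimal sketch of SI-SNR in plain Python, not AudioCraft's PyTorch implementation. The function name `si_snr`, the `eps` stabiliser and the sign convention (higher is better, in dB) are assumptions made for illustration; the library's own normalisation and sign may differ.

```python
import math


def si_snr(reference, estimate, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length mono signals.

    Illustrative sketch only; `si_snr` is a hypothetical helper, not an AudioCraft API.
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    n = len(reference)

    # Remove the mean so a constant offset does not affect the score.
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [x - ref_mean for x in reference]
    est = [x - est_mean for x in estimate]

    # Project the estimate onto the reference: this is the "target" component.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]

    # Whatever is left after removing the target component counts as noise.
    noise = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


# Example: a sine wave, and the same wave with a small alternating perturbation.
reference = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
noisy = [x + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(reference)]
score = si_snr(reference, noisy)
rescaled_score = si_snr(reference, [3.0 * x for x in noisy])
```

Because the estimate is projected onto the reference before the ratio is taken, `score` and `rescaled_score` agree up to numerical error: rescaling the estimate does not change the result.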

ViSQOL

We provide a Python wrapper around the official ViSQOL implementation to conveniently run ViSQOL within the training pipelines.

To enable ViSQOL computation in AudioCraft, specify the path to the ViSQOL installation in the configuration:

# the first parameter activates ViSQOL computation, while the second specifies
# the path to the ViSQOL library used by our Python wrapper
dora run <...> evaluate.metrics.visqol=true metrics.visqol.bin=<path_to_visqol>

See an example grid: Compression with ViSQOL

To learn more about ViSQOL and how to build the ViSQOL binary using Bazel, please refer to the instructions available in the open source repository.

Audio generation metrics

Frechet Audio Distance

As with ViSQOL, we use a Python wrapper around the official TensorFlow implementation of the Frechet Audio Distance.

Note that we had to make several changes to the original code to make it work. Please refer to the FrechetAudioDistanceMetric class documentation for more details. We do not plan to provide further support in obtaining a working setup for the Frechet Audio Distance at this stage.

# the first parameter activates FAD metric computation, while the second specifies
# the path to the FAD library used by our Python wrapper
dora run <...> evaluate.metrics.fad=true metrics.fad.bin=<path_to_google_research_repository>

See an example grid: Evaluation with FAD
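For intuition about what a Frechet distance measures, the sketch below fits a Gaussian to each of two sets of embedding vectors and compares them. It is a deliberately simplified illustration, not the wrapped TensorFlow implementation: it assumes diagonal covariances, whereas the full metric uses complete covariance matrices and a matrix square root. The helper names `mean_and_var` and `frechet_distance_diag` are hypothetical.

```python
import math


def mean_and_var(embeddings):
    """Per-dimension mean and unbiased variance of a list of equal-length vectors."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[j] for e in embeddings) / n for j in range(dim)]
    var = [sum((e[j] - mean[j]) ** 2 for e in embeddings) / (n - 1) for j in range(dim)]
    return mean, var


def frechet_distance_diag(mu1, var1, mu2, var2):
    """Frechet distance between two Gaussians with DIAGONAL covariances.

    Simplifying assumption: with diagonal covariances the trace term reduces to
    sum(v1 + v2 - 2 * sqrt(v1 * v2)); the full metric needs a matrix square root.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term


# Toy embeddings: the "generated" set is the reference set shifted by 1 in each dimension.
reference_embeddings = [[0.0, 0.0], [1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
generated_embeddings = [[x + 1.0, y + 1.0] for x, y in reference_embeddings]

ref_stats = mean_and_var(reference_embeddings)
gen_stats = mean_and_var(generated_embeddings)
distance = frechet_distance_diag(*ref_stats, *gen_stats)
```

In this toy case the two sets have identical spread, so the distance comes entirely from the shift in the means.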

Kullback-Leibler Divergence

We provide a PyTorch implementation of the Kullback-Leibler Divergence computed over the label probabilities produced by a state-of-the-art audio classifier. Our implementation of the KLD uses the PaSST classifier.

In order to use the KLD metric over PaSST, you must install the PaSST library as an extra dependency:

pip install 'git+https://github.com/kkoutini/passt_hear21@0.0.19#egg=hear21passt'

Then, similarly, you can enable the metric by activating the corresponding flag:

# one could extend the kld metric with additional audio classifier models that can then be picked through the configuration
dora run <...> evaluate.metrics.kld=true metrics.kld.model=passt
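As a rough illustration of the underlying computation, the sketch below turns two sets of classifier logits into label probabilities and computes the KL divergence between them. It is plain Python rather than the PyTorch implementation, the logits are made-up numbers rather than PaSST outputs, and the direction of the divergence (reference relative to generated) is an assumption; `softmax` and `kl_divergence` are hypothetical helper names.

```python
import math


def softmax(logits):
    """Convert raw classifier scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-6):
    """KL(p || q) between two discrete distributions over the same labels."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))


# Made-up logits over three labels for a reference clip and a generated clip.
reference_probs = softmax([2.0, 0.5, -1.0])
generated_probs = softmax([-1.0, 0.5, 2.0])

# Assumed direction: divergence of the generated distribution from the reference one.
kld = kl_divergence(reference_probs, generated_probs)
```

The divergence is zero when both clips receive the same label distribution and grows as the classifier's predictions for the two clips drift apart.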

Text consistency

We provide a text-consistency metric, similar to the MuLan Cycle Consistency from MusicLM or the CLAP score used in Make-An-Audio. More specifically, we provide a PyTorch implementation of a text consistency metric relying on a pre-trained Contrastive Language-Audio Pretraining (CLAP) model.

Please install the CLAP library as an extra dependency prior to using the metric:

pip install laion_clap

Then, similarly, you can enable the metric by activating the corresponding flag:

# one could extend the text consistency metric with additional models that can then be picked through the configuration
dora run ... evaluate.metrics.text_consistency=true metrics.text_consistency.model=clap

Note that the CLAP-based text consistency metric requires a CLAP checkpoint to be provided in the configuration.

Chroma cosine similarity

Finally, as introduced in MusicGen, we provide a Chroma Cosine Similarity metric in PyTorch. This metric has no additional requirements. Activate it at the evaluation stage with the appropriate flag:

dora run ... evaluate.metrics.chroma_cosine=true
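The idea behind the metric can be sketched in a few lines of plain Python, assuming chroma features have already been extracted as sequences of 12-bin frames for the reference and the generated audio. The frame-by-frame averaging, the truncation to the shorter sequence and the helper names (`one_hot`, `cosine_similarity`, `chroma_cosine_similarity`) are illustrative assumptions, not the PyTorch implementation.

```python
import math


def one_hot(pitch_class, size=12):
    """Hypothetical helper: a chroma frame with a single active pitch class."""
    return [1.0 if i == pitch_class else 0.0 for i in range(size)]


def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def chroma_cosine_similarity(ref_frames, gen_frames):
    """Average frame-wise cosine similarity over the overlapping time steps."""
    n = min(len(ref_frames), len(gen_frames))
    if n == 0:
        raise ValueError("need at least one frame in each sequence")
    pairs = zip(ref_frames[:n], gen_frames[:n])
    return sum(cosine_similarity(r, g) for r, g in pairs) / n


# Three frames each: the first two pitch classes agree, the third does not.
ref_frames = [one_hot(0), one_hot(4), one_hot(7)]
gen_frames = [one_hot(0), one_hot(4), one_hot(9)]
similarity = chroma_cosine_similarity(ref_frames, gen_frames)
```

With one-hot frames each time step contributes either 1 (same pitch class) or 0 (different pitch class), so here `similarity` is two thirds.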

Comparing against reconstructed audio

For all the audio generation metrics above, the metric can be computed on the ground-truth audio reconstructed through EnCodec, instead of on the generated sample, by setting the flag <metric>.use_gt=true.

Example usage

You will find example configurations for the metrics introduced above in:

The default musicgen solver, for all audio generation metrics
The default compression solver, for all audio synthesis metrics

Similarly, we provide different examples in our grids: