What libraries can I use for Audio-to-Audio?

The asteroid, fairseq, and speechbrain libraries are compatible with Audio-to-Audio.

What models can I use for Audio-to-Audio?

The ResembleAI/resemble-enhance and microsoft/speecht5_vc models can be used for Audio-to-Audio.

What datasets can I use for Audio-to-Audio?

The Matthijs/cmu-arctic-xvectors dataset can be used for Audio-to-Audio.

What metrics can I use for Audio-to-Audio?

The snri and sdri metrics can be used for Audio-to-Audio.

Audio-to-Audio

Audio-to-Audio is a family of tasks in which the input is an audio recording and the output is one or more generated audio recordings. Example tasks include speech enhancement and source separation.

About Audio-to-Audio

Use Cases

Speech Enhancement (Noise removal)

Speech Enhancement is largely self-explanatory: it improves the quality of an audio recording by removing noise. Several libraries address this task, such as SpeechBrain, Asteroid, and ESPnet. Here is a simple example using SpeechBrain:

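# Newer SpeechBrain releases expose this class under speechbrain.inference.enhancement
from speechbrain.pretrained import SpectralMaskEnhancement

# Download the pretrained enhancement model from the Hugging Face Hub
model = SpectralMaskEnhancement.from_hparams(
  "speechbrain/mtl-mimic-voicebank"
)
# enhance_file returns the enhanced waveform as a tensor
enhanced = model.enhance_file("file.wav")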

Alternatively, you can use Inference Endpoints to solve this task:

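import os

import requests

# Assumes your Hugging Face access token is stored in the HF_TOKEN environment variable
API_TOKEN = os.environ["HF_TOKEN"]
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://router.huggingface.co/hf-inference/models/speechbrain/mtl-mimic-voicebank"

def query(filename):
    # Send the raw audio bytes to the model and return the parsed JSON response
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.post(API_URL, headers=headers, data=data)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()

data = query("sample1.flac")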
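
The JSON returned by `query` still has to be turned back into audio. The sketch below assumes the response is a list of objects with `label`, `content-type`, and a base64-encoded `blob` field; that payload shape is an assumption, so inspect the actual response for your model before relying on it. `save_outputs` is a hypothetical helper, not part of any library.

```python
import base64
from pathlib import Path


def save_outputs(response, out_dir="outputs"):
    """Decode each returned audio blob and write it to disk.

    Assumes `response` is a list of dicts with "label", "content-type"
    and a base64-encoded "blob" (an assumption about the payload shape).
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, item in enumerate(response):
        # "audio/flac" -> "flac"; fall back to "bin" if the type is missing
        extension = item.get("content-type", "application/bin").split("/")[-1]
        path = out / f"{index}_{item.get('label', 'audio')}.{extension}"
        path.write_bytes(base64.b64decode(item["blob"]))
        paths.append(path)
    return paths
```

With the `data` variable from the example above, `save_outputs(data)` would write one file per returned audio and return the list of paths.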

You can use huggingface.js to run inference with audio-to-audio models on the Hugging Face Hub:

import { InferenceClient } from "@huggingface/inference";

const inference = new InferenceClient(HF_TOKEN);
await inference.audioToAudio({
    data: await (await fetch("sample.flac")).blob(),
    model: "speechbrain/sepformer-wham",
});

Audio Source Separation

Audio Source Separation allows you to isolate different sounds from individual sources. For example, if you have an audio file with multiple people speaking, you can get an audio file for each of them. You can then use an Automatic Speech Recognition system to extract the text from each of these sources as an initial step for your system!

Audio-to-Audio can also be used to remove noise from audio files: you get one audio track for the person speaking and another for the noise. This is also useful for noisy recordings with several speakers: you can get one track for each person and one for the noise.
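
As a minimal sketch of the separation workflow described above, the following uses SpeechBrain's `SepformerSeparation` interface with the `speechbrain/sepformer-wham` checkpoint. The input file name, the `output_paths` and `separate` helpers, and the 8 kHz output sample rate are assumptions for illustration; check the model card for the sample rate your checkpoint expects.

```python
def output_paths(stem, n_sources):
    """Hypothetical naming scheme: one output file per separated source."""
    return [f"{stem}_source{i + 1}.wav" for i in range(n_sources)]


def separate(path, stem="separated", sample_rate=8000):
    """Separate a mixture file and save each estimated source as its own file."""
    # Imported lazily so output_paths works without the audio stack installed
    import torchaudio
    from speechbrain.pretrained import SepformerSeparation

    model = SepformerSeparation.from_hparams(source="speechbrain/sepformer-wham")
    # Estimated sources have shape [batch, time, n_sources]
    est_sources = model.separate_file(path=path)
    paths = output_paths(stem, est_sources.shape[-1])
    for i, out_path in enumerate(paths):
        # torchaudio.save expects a [channels, time] tensor
        torchaudio.save(out_path, est_sources[:, :, i].detach().cpu(), sample_rate)
    return paths
```

Calling `separate("mixture.wav")` would write one file per estimated source; each file could then be passed to an Automatic Speech Recognition model.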

Training a model for your own data

If you want to learn how to train models for the Audio-to-Audio task, we recommend the following tutorials:

Compatible libraries

Asteroid

Fairseq

SpeechBrain

Models for Audio-to-Audio

ResembleAI/resemble-enhance

Audio-to-Audio • Updated Dec 21, 2023 • 141

Note: A speech enhancement model.

microsoft/speecht5_vc

Audio-to-Audio • Updated Mar 22, 2023 • 14.5k • 103

Note: A model that can change the voice in a speech recording.

Datasets for Audio-to-Audio

Matthijs/cmu-arctic-xvectors

Viewer • Updated Feb 7, 2023 • 7.93k • 20.5k • 49

Note: 512-element X-vector embeddings of speakers from the CMU ARCTIC dataset.

Spaces using Audio-to-Audio

👁

younver/speechbrain-speech-separation

Note: An application for speech separation.

🎵🔄🎵

nakas/audio-diffusion_style_transfer

Note: An application for audio style transfer.

Metrics for Audio-to-Audio

snri: The Signal-to-Noise ratio is the relationship between the target signal level and the background noise level. It is calculated as the logarithm of the ratio of target signal power to background noise power, expressed in decibels.

sdri: The Signal-to-Distortion ratio is the relationship between the target signal and the sum of noise, interference, and artifact errors.
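
To make the two definitions concrete, here is a minimal pure-Python sketch. It assumes the signals are equal-length lists of samples, uses a simplified form of SDR (reference energy over the energy of the estimation error) rather than a full decomposition into noise, interference, and artifacts, and reads the trailing "i" in snri and sdri as "improvement", meaning the gain relative to the unprocessed mixture. All function names are illustrative.

```python
import math


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def snr_db(target, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(P_target / P_noise)."""
    return 10 * math.log10(power(target) / power(noise))


def sdr_db(reference, estimate):
    """Simplified signal-to-distortion ratio in decibels.

    Everything in the estimate that differs from the reference
    (noise, interference, artifacts) counts as distortion.
    """
    error = [e - r for r, e in zip(reference, estimate)]
    return 10 * math.log10(power(reference) / power(error))


def improvement_db(reference, mixture, estimate):
    """Improvement: ratio after processing minus ratio of the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)
```

Both ratios are undefined when the noise or error power is zero, so a practical implementation would typically add a small epsilon to the denominators.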