---
license: other
license_name: nsclv1
license_link: https://developer.nvidia.com/downloads/license/nsclv1
---

# NVIDIA NeMo Audio Codec 44kHz
<style>
img {
 display: inline-table;
 vertical-align: middle;
 margin: 0;
 padding: 0;
}
</style>
[![Model architecture](https://img.shields.io/badge/Model_Arch-HiFi--GAN-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-61.8M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets)

The NeMo Audio Codec is a neural audio codec that compresses audio into a quantized representation. The model can also be used as a vocoder for speech synthesis.

The model works with full-bandwidth 44.1 kHz speech. It may perform worse on low-bandwidth speech (e.g., 16 kHz speech upsampled to 44.1 kHz) or on non-speech audio.

| Sample Rate | Frame Rate | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
|:-----------:|:----------:|:--------:|:-----------:|:-------------:|:---------:|:------------:|
| 44.1 kHz | 86.1 fps | 6.9 kbps | 8 | 1000 | 32 | [8, 5, 5, 5] |

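These numbers are consistent with one another: 8 codebooks of 1000 entries each, emitted at 86.1 frames per second, come to roughly 6.9 kbps. A quick back-of-the-envelope check:

```python
import math

frame_rate = 86.1                    # frames per second, from the table above
num_codebooks = 8
bits_per_token = math.log2(1000)     # ~9.97 bits per codebook entry

bitrate = frame_rate * num_codebooks * bits_per_token
print(f"{bitrate / 1000:.1f} kbps")  # -> 6.9 kbps
```
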
## Model Architecture
The NeMo Audio Codec model uses a symmetric convolutional encoder-decoder architecture based on [HiFi-GAN](https://arxiv.org/abs/2010.05646). We use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505) with 8 codebooks and 1000 entries per codebook.

For more details, please refer to [our paper](https://arxiv.org/abs/2406.05298).

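To make the FSQ configuration concrete: each codebook quantizes a 4-dimensional latent slice with per-dimension level counts [8, 5, 5, 5], so a codebook has 8 × 5 × 5 × 5 = 1000 entries. The sketch below is an illustrative simplification of FSQ (it omits details such as the offset used for even level counts in the paper) and is not the NeMo implementation:

```python
import torch

levels = torch.tensor([8, 5, 5, 5])  # per-dimension level counts; product = 1000

def fsq_quantize(z: torch.Tensor) -> torch.Tensor:
    """Snap each latent dimension to one of `levels[i]` values (simplified)."""
    soft = (torch.tanh(z) + 1) / 2 * (levels - 1)  # bound each dim into (0, L_i - 1)
    hard = torch.round(soft)                       # nearest quantization level
    # straight-through estimator: gradients bypass the non-differentiable round
    return soft + (hard - soft).detach()

def fsq_index(digits: torch.Tensor) -> torch.Tensor:
    """Pack the four per-dimension digits into one token id in [0, 1000)."""
    place = torch.tensor([5 * 5 * 5, 5 * 5, 5, 1])  # mixed-radix place values
    return (digits.long() * place).sum(dim=-1)

z = torch.randn(2, 4)                # two example latent vectors
tokens = fsq_index(fsq_quantize(z))  # one integer token id per vector
```
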
### Input
- **Input Type:** Audio
- **Input Format(s):** .wav files
- **Input Parameters:** One-Dimensional (1D)
- **Other Properties Related to Input:** 44.1 kHz mono-channel audio

### Output
- **Output Type:** Audio
- **Output Format:** .wav files
- **Output Parameters:** One-Dimensional (1D)
- **Other Properties Related to Output:** 44.1 kHz mono-channel audio

## How to Use this Model

The model is available in the [NVIDIA NeMo toolkit](https://github.com/NVIDIA/NeMo), and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Inference
For inference, you can follow our [Audio Codec Inference Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Inference.ipynb), which downloads the model checkpoint automatically. Note that you will need to set the `model_name` parameter to "nvidia/audio-codec-44khz".

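Outside the tutorial, NeMo models can also typically be loaded by name via the `from_pretrained` API; a minimal sketch (assuming the checkpoint is published under this name and you have network access):

```python
from nemo.collections.tts.models import AudioCodecModel

# download the checkpoint by name and prepare it for inference
codec = AudioCodecModel.from_pretrained(model_name="nvidia/audio-codec-44khz").eval()
```
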
Alternatively, you can download the `.nemo` checkpoint from the "Files and versions" tab and use the code below to run inference with the model:

```python
import librosa
import torch
import soundfile as sf
from nemo.collections.tts.models import AudioCodecModel

codec_path = ???  # set here the model .nemo checkpoint path
path_to_input_audio = ???  # path of the input audio
path_to_output_audio = ???  # path of the reconstructed output audio

device = "cuda" if torch.cuda.is_available() else "cpu"

# load the codec checkpoint and move it to the same device as the audio tensors
nemo_codec_model = AudioCodecModel.restore_from(restore_path=codec_path, map_location="cpu").to(device).eval()

# load the input audio at the codec's sample rate (44.1 kHz)
audio, _ = librosa.load(path_to_input_audio, sr=nemo_codec_model.sample_rate)

audio_tensor = torch.from_numpy(audio).unsqueeze(dim=0).to(device)
audio_len = torch.tensor([audio_tensor.shape[1]]).to(device)

with torch.no_grad():
    # get discrete tokens from audio
    encoded_tokens, encoded_len = nemo_codec_model.encode(audio=audio_tensor, audio_len=audio_len)

    # reconstruct audio from tokens
    reconstructed_audio, _ = nemo_codec_model.decode(tokens=encoded_tokens, tokens_len=encoded_len)

# save reconstructed audio
output_audio = reconstructed_audio.cpu().numpy().squeeze()
sf.write(path_to_output_audio, output_audio, nemo_codec_model.sample_rate)
```

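If you only need the discrete tokens (for example, as targets for a speech language model), you can stop after `encode`. A quick sanity check on the result, assuming the (batch, codebooks, frames) token layout used by NeMo's codec models:

```python
# with 8 codebooks at ~86.1 frames per second, one second of 44.1 kHz audio
# yields roughly 86 frames of 8 tokens each
batch, num_codebooks, num_frames = encoded_tokens.shape
print(num_codebooks, num_frames, int(encoded_len[0]))
```
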
### Training

For fine-tuning on another dataset, please follow the steps in our [Audio Codec Training Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb). Note that you will need to set the `CONFIG_FILENAME` parameter to the "audio_codec_44100.yaml" config. You will also need to set `pretrained_model_name` to "audio-codec-44khz".

## Training, Testing, and Evaluation Datasets

### Training Datasets

The NeMo Audio Codec is trained on a total of 14.2k hours of speech data from 79 languages.

- [MLS English](https://www.openslr.org/94/) - 12.8k hours, 2.8k speakers, English
- [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) - 1.4k hours, 50k speakers, 79 languages

### Test Datasets

- [MLS English](https://www.openslr.org/94/) - 15 hours, 42 speakers, English
- [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) - 2 hours, 1356 speakers, 59 languages

## Performance

We evaluate our codec using several objective audio quality metrics: [ViSQOL](https://github.com/google/visqol) and [PESQ](https://lightning.ai/docs/torchmetrics/stable/audio/perceptual_evaluation_speech_quality.html) for perceptual quality, [ESTOI](https://ieeexplore.ieee.org/document/7539284) for intelligibility, mel spectrogram and STFT distances for spectral reconstruction accuracy, and [SI-SDR](https://arxiv.org/abs/1811.02508) for phase reconstruction accuracy. Metrics are reported on the test sets of both the MLS English and Common Voice data. The model has not been trained or evaluated on non-speech audio.

| Dataset | ViSQOL | PESQ | ESTOI | Mel Distance | STFT Distance | SI-SDR (dB) |
|:------------:|:------:|:----:|:-----:|:------------:|:-------------:|:-----------:|
| MLS English | 4.51 | 3.74 | 0.93 | 0.093 | 0.031 | 8.33 |
| Common Voice | 4.53 | 3.58 | 0.93 | 0.130 | 0.054 | 7.72 |

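As a rough illustration of how such metrics can be computed, here is a hedged sketch using [torchmetrics](https://lightning.ai/docs/torchmetrics/stable/) (the library linked above for PESQ); it is not the exact evaluation pipeline from the paper. Note that PESQ is only defined for 8 kHz and 16 kHz input, so 44.1 kHz signals must be resampled first:

```python
import torch
import torchaudio
from torchmetrics.audio import (
    PerceptualEvaluationSpeechQuality,
    ScaleInvariantSignalDistortionRatio,
    ShortTimeObjectiveIntelligibility,
)

def speech_metrics(ref: torch.Tensor, est: torch.Tensor, sr: int = 44100) -> dict:
    """Score a codec reconstruction `est` against the reference waveform `ref` (1D tensors)."""
    n = min(ref.shape[-1], est.shape[-1])  # reconstruction length may differ slightly
    ref, est = ref[..., :n], est[..., :n]

    si_sdr = ScaleInvariantSignalDistortionRatio()(est, ref)
    estoi = ShortTimeObjectiveIntelligibility(fs=sr, extended=True)(est, ref)

    # PESQ supports only 8 kHz / 16 kHz, so resample before scoring
    resample = torchaudio.transforms.Resample(sr, 16000)
    pesq = PerceptualEvaluationSpeechQuality(fs=16000, mode="wb")(resample(est), resample(ref))

    return {"si_sdr": si_sdr.item(), "estoi": estoi.item(), "pesq": pesq.item()}
```
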
## Software Integration

### Supported Hardware Microarchitecture Compatibility
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Hopper
- NVIDIA Jetson
- NVIDIA Lovelace
- NVIDIA Pascal
- NVIDIA Turing
- NVIDIA Volta

### Runtime Engine

- NeMo 2.0.0

### Preferred Operating System

- Linux

## License/Terms of Use
This model is for research and development only (non-commercial use), and its use is covered by the [NSCLv1](https://developer.nvidia.com/downloads/license/nsclv1) license.

## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).