---
language: en
tags:
- speech quantization
license: mit
datasets:
- LibriTTS
---

# Highlights
This model performs speech coding (quantization) on English utterances.
- Achieving higher codec quality at low bandwidths
- Training with structured dropout, so that a single model supports multiple bandwidths during inference
- Quantizing a raw speech waveform into a sequence of discrete tokens

# FunCodec model
This model is trained with [FunCodec](https://github.com/alibaba-damo-academy/FunCodec), an open-source toolkit for speech quantization (codec) from DAMO Academy, Alibaba Group.
This repository provides a model pre-trained on the LibriTTS corpus.
It can be applied to low-bandwidth speech communication, speech quantization, zero-shot speech synthesis and other academic research topics.
Compared with [EnCodec](https://arxiv.org/abs/2210.13438) and [SoundStream](https://arxiv.org/abs/2107.03312), the following techniques are used to train the model, resulting in higher codec quality and [ViSQOL](https://github.com/google/visqol) scores at the same bandwidth:
- A magnitude spectrum loss is employed to enhance the middle- and high-frequency signals
- Structured dropout is employed to smooth the code space and to enable multiple bandwidths in a single model
- Codes are initialized from k-means clusters rather than random values
- Codebooks are maintained with an exponential moving average and a dead-code elimination mechanism, resulting in high codebook utilization

## Model description
This model is a variational autoencoder that uses residual vector quantisation (RVQ) to obtain several parallel sequences of discrete latent representations. Here is an overview of FunCodec models.

FunCodec architecture

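The residual vector quantisation mentioned above can be illustrated with a toy example. This is a minimal pure-Python sketch, not FunCodec's implementation: the function names, the random codebooks and the sizes `DIM`, `SIZE` and `STAGES` are all hypothetical, and real codebooks are learned. Each stage quantizes the residual left by the previous stage, so one frame yields one index per stage. Decoding with only the first few indices gives a coarser reconstruction, which is the idea behind serving several bandwidths from one model.

```python
import random


def nearest(vec, codebook):
    """Index of the code vector closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda k: sum((v - c) ** 2 for v, c in zip(vec, codebook[k])))


def rvq_encode(frame, codebooks):
    """Quantize one frame with cascaded codebooks; returns one index per stage."""
    residual = list(frame)
    codes = []
    for codebook in codebooks:
        k = nearest(residual, codebook)
        codes.append(k)
        # The next stage only sees what this stage failed to represent.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes


def rvq_decode(codes, codebooks):
    """Sum the selected code vectors; passing fewer codes uses fewer stages."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


random.seed(0)
DIM, SIZE, STAGES = 4, 16, 4  # toy sizes, not the model's real configuration
# Random codebooks with shrinking scale stand in for learned ones (assumption).
codebooks = [[[random.gauss(0, 0.5 ** s) for _ in range(DIM)] for _ in range(SIZE)]
             for s in range(STAGES)]
frame = [random.gauss(0, 1) for _ in range(DIM)]

codes = rvq_encode(frame, codebooks)
for n in range(1, STAGES + 1):
    recon = rvq_decode(codes[:n], codebooks)
    err = sum((f - r) ** 2 for f, r in zip(frame, recon))
    print(f"{n} quantizer(s): squared error {err:.4f}")
```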
In general, FunCodec models consist of five modules: a domain transformation module, an encoder, an RVQ module, a decoder and a domain inversion module.
- Domain Transformation: transform signals into the time domain, short-time frequency domain, magnitude-angle domain or magnitude-phase domain.
- Encoder: encode signals into compact representations with stacked convolutional and LSTM layers.
- Semantic tokens (optional): augment encoder outputs with semantic tokens to enhance the content information. Not used in this model.
- RVQ: quantize the representations into parallel sequences of discrete tokens with cascaded vector quantizers.
- Decoder: decode quantized embeddings into the same signal domain as the inputs.
- Domain Inversion: re-synthesize perceptible waveforms from the different domains.

More details can be found at:
- Paper: [FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec](https://arxiv.org/abs/2309.07405)
- Codebase: [FunCodec](https://github.com/alibaba-damo-academy/FunCodec)

## Intended uses & scenarios
### Inference with FunCodec
You can extract codecs (discrete token sequences) from waveforms and reconstruct waveforms from them with the FunCodec repository.
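The extraction command in this section reads a list file, `input_wav.scp`, with one `uttid path/to/file.wav` pair per line. Below is a minimal sketch for building such a file from a directory of WAV files. The helper name `write_wav_scp` is hypothetical and not part of FunCodec, and using the file stem as the utterance ID is an assumption that only works when stems are unique and paths contain no spaces.

```python
from pathlib import Path


def write_wav_scp(wav_dir, scp_path):
    """Write one 'uttid path' line per .wav file found under wav_dir.

    The utterance ID is the file name without its extension (an assumption);
    IDs must be unique, so files with the same stem need renaming first.
    """
    wav_files = sorted(Path(wav_dir).rglob("*.wav"))
    # Absolute paths keep the list valid regardless of the working directory.
    lines = [f"{wav.stem} {wav.resolve()}" for wav in wav_files]
    Path(scp_path).write_text("".join(line + "\n" for line in lines),
                              encoding="utf-8")
    return len(lines)
```

For example, `write_wav_scp("my_wavs", "input_wav.scp")` (with `my_wavs` as a placeholder directory) writes the list and returns the number of entries.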
#### FunCodec installation

```sh
# Install PyTorch GPU (version >= 1.12.0):
conda install pytorch==1.12.0
# For other versions, see: https://pytorch.org/get-started/locally

# Download the codebase:
git clone https://github.com/alibaba-damo-academy/FunCodec.git

# Install the FunCodec codebase:
cd FunCodec
pip install --editable ./
```

#### Codec extraction
```sh
# Enter the example directory
cd egs/LibriTTS/codec
# Specify the model name
model_name="audio_codec-encodec-en-libritts-16k-nq32ds320-pytorch"
# Download the model
git lfs install
git clone https://huggingface.co/alibaba-damo/${model_name}
mkdir -p exp
mv ${model_name} exp/${model_name}
# Extract codecs for the utterances listed in "input_wav.scp"; the codecs are saved under "outputs/codecs"
bash encoding_decoding.sh --stage 1 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
  --model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
  --wav_scp input_wav.scp --out_dir outputs/codecs
# input_wav.scp has the following format:
# uttid1 path/to/file1.wav
# uttid2 path/to/file2.wav
# ...
```

#### Reconstruct waveforms from codecs
```shell
# Reconstruct waveforms into "outputs/recon_wavs"
bash encoding_decoding.sh --stage 2 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
  --model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
  --wav_scp outputs/codecs/codecs.txt --out_dir outputs/recon_wavs
# codecs.txt is the output of stage 1, which has the following format:
# uttid1 [[[1, 2, 3, ...],[2, 3, 4, ...], ...]]
# uttid2 [[[9, 7, 5, ...],[3, 1, 2, ...], ...]]
# ...
```

### Inference with Huggingface Transformers
Inference with the Hugging Face Transformers package is under development.

### Application scenarios

Running environment
- Currently, the model has only been tested on Linux-x86_64. macOS and Windows have not been tested.


Intended use scenarios
- This model is suitable for academic use
- Speech quantization, codec and tokenization for English utterances

## Evaluation results
### Training configuration
- Feature info: raw waveform input
- Train info: Adam, lr 3e-4, batch_size 32, 2 GPUs (Tesla V100), acc_grad 1, 300000 steps, speech_max_length 51200
- Loss info: L1, L2, discriminative loss
- Model info: SEANet, Conv, LSTM
- Train config: encodec_16k_n32_600k_step.yaml
- Model size: 15.14 M parameters

### Experimental Results
Test set: LibriTTS test-clean. Metric: ViSQOL score at each token rate (tokens per second).

| testset  | 50 tk/s  | 100 tk/s | 200 tk/s | 400 tk/s |
|:--------:|:--------:|:--------:|:--------:|:--------:|
| LibriTTS |   3.43   |   3.86   |   4.12   |   4.29   |

### Limitations and bias
- Not very robust to background noise and reverberation

### BibTeX entry and citation info
```BibTeX
@misc{du2023funcodec,
      title={FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec},
      author={Zhihao Du and Shiliang Zhang and Kai Hu and Siqi Zheng},
      year={2023},
      eprint={2309.07405},
      archivePrefix={arXiv},
      primaryClass={cs.Sound}
}
```