---
language: en
tags:
- speech quantization
license: mit
datasets:
- LibriTTS
---
# Highlights
This model is used for speech codec or quantization on English utterances.
- Low frame rate: 25 tokens/s for each quantizer
- Higher codec quality at low bandwidths
- Trained with structured dropout, so a single model supports various bandwidths at inference time (see the bitrate sketch below)
- Quantizes a raw speech waveform into a sequence of discrete tokens
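For a rough sense of how the number of active quantizers maps to token rate and bitrate, here is a minimal sketch. The 25 tokens/s frame rate comes from this card; the 1024-entry codebooks (10 bits per token) are an assumption for illustration, not a confirmed detail of this checkpoint.
```python
# Hedged sketch: active quantizers vs. token rate and approximate bitrate.
FRAME_RATE = 25        # tokens per second emitted by each quantizer (from this card)
BITS_PER_TOKEN = 10    # assumption: 1024-entry codebooks, log2(1024) = 10 bits

def token_rate(num_quantizers: int) -> int:
    """Total tokens per second when `num_quantizers` RVQ stages are active."""
    return num_quantizers * FRAME_RATE

def bitrate_bps(num_quantizers: int) -> int:
    """Approximate bandwidth in bits per second under the assumptions above."""
    return token_rate(num_quantizers) * BITS_PER_TOKEN

for nq in (2, 4, 8, 16):
    print(f"{nq:2d} quantizers -> {token_rate(nq):3d} tokens/s, ~{bitrate_bps(nq)} bps")
```
Under these assumptions, 2, 4, 8 and 16 active quantizers yield the 50, 100, 200 and 400 tokens/s operating points reported in the results table below.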
# FunCodec model
This model is trained with [FunCodec](https://github.com/alibaba-damo-academy/FunCodec),
an open-source toolkit for speech quantization (codec) from the DAMO Academy, Alibaba Group.
This repository provides a pre-trained model on the LibriTTS corpus.
It can be applied to low-bandwidth speech communication, speech quantization, zero-shot speech synthesis,
and other academic research topics.
Compared with [EnCodec](https://arxiv.org/abs/2210.13438) and [SoundStream](https://arxiv.org/abs/2107.03312),
the following improved techniques are used to train the model, resulting in higher codec quality and
[ViSQOL](https://github.com/google/visqol) scores at the same bandwidth:
- A magnitude spectrum loss is employed to enhance the middle- and high-frequency signals
- Structured dropout is employed to smooth the code space and to enable various bandwidths within a single model
- Codebooks are initialized with k-means cluster centers rather than random values
- Codebooks are maintained with an exponential moving average and a dead-code-elimination mechanism, resulting in a high codebook utilization factor (sketched below)
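As an illustration of the codebook maintenance described in the last item, here is a minimal sketch of an EMA codebook update with dead-code re-initialization. The function name, decay and threshold values are assumptions for the example, not FunCodec's exact implementation.
```python
import torch

def ema_codebook_update(codebook, cluster_size, embed_sum, inputs, assignments,
                        decay=0.99, dead_threshold=1.0):
    """One EMA update step. codebook/embed_sum: (K, D); cluster_size: (K,);
    inputs: (T, D) encoder residuals; assignments: (T,) selected code indices."""
    K = codebook.size(0)
    onehot = torch.nn.functional.one_hot(assignments, K).type_as(inputs)  # (T, K)
    # exponential moving averages of code usage counts and of the vectors assigned to each code
    cluster_size.mul_(decay).add_(onehot.sum(0), alpha=1 - decay)
    embed_sum.mul_(decay).add_(onehot.t() @ inputs, alpha=1 - decay)
    codebook.copy_(embed_sum / cluster_size.clamp(min=1e-5).unsqueeze(1))
    # dead-code elimination: re-seed rarely used codewords from the current batch
    dead = cluster_size < dead_threshold
    if dead.any():
        codebook[dead] = inputs[torch.randint(0, inputs.size(0), (int(dead.sum()),))]
```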
## Model description
This model is a variational autoencoder that uses residual vector quantization (RVQ) to obtain
several parallel sequences of discrete latent representations. Here is an overview of FunCodec models.
<p align="center">
<img src="fig/framework.png" alt="FunCodec architecture"/>
</p>
In general, FunCodec models consist of five modules: a domain transformation module,
an encoder, an RVQ module, a decoder and a domain inversion module.
- Domain Transformation: transforms signals into the time domain, short-time frequency domain, magnitude-angle domain or magnitude-phase domain.
- Encoder: encodes signals into compact representations with stacked convolutional and LSTM layers.
- Semantic tokens (optional): augment encoder outputs with semantic tokens to enhance the content information; not used in this model.
- RVQ: quantizes the representations into parallel sequences of discrete tokens with cascaded vector quantizers (a minimal sketch follows this list).
- Decoder: decodes the quantized embeddings back into the same signal domain as the input.
- Domain Inversion: re-synthesizes perceptible waveforms from the decoded representations.
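The sketch below illustrates the RVQ module described above: each stage quantizes the residual left by the previous stage, and keeping only a prefix of the stages (structured dropout during training) trades bandwidth for quality. The shapes, number of quantizers and codebook size are illustrative assumptions, not the exact configuration of this model.
```python
import torch

def rvq_encode(frames, codebooks, num_active):
    """Quantize `frames` (T, D) with the first `num_active` codebooks (each (K, D))."""
    residual, tokens = frames, []
    for cb in codebooks[:num_active]:
        idx = torch.cdist(residual, cb).argmin(dim=-1)  # nearest codeword per frame
        tokens.append(idx)
        residual = residual - cb[idx]                   # pass the residual to the next stage
    return torch.stack(tokens)                          # (num_active, T)

def rvq_decode(tokens, codebooks):
    """Sum the selected codewords of each stage to recover the quantized embedding."""
    return sum(cb[idx] for cb, idx in zip(codebooks, tokens))

# Toy setup: 8 quantizers, 1024-entry codebooks, 128-dim frames, one second at 25 Hz.
torch.manual_seed(0)
codebooks = [torch.randn(1024, 128) for _ in range(8)]
frames = torch.randn(25, 128)
num_active = torch.randint(1, len(codebooks) + 1, ()).item()  # structured-dropout draw
tokens = rvq_encode(frames, codebooks, num_active)
print(tokens.shape, rvq_decode(tokens, codebooks).shape)
```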
More details can be found at:
- Paper: [FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec](https://arxiv.org/abs/2309.07405)
- Codebase: [FunCodec](https://github.com/alibaba-damo-academy/FunCodec)
## Intended uses & scenarios
### Inference with FunCodec
You can extract codecs and reconstruct waveforms from them with the FunCodec repository.
#### FunCodec installation
```sh
# Install PyTorch (GPU, version >= 1.12.0):
conda install pytorch==1.12.0
# For other versions, please refer to: https://pytorch.org/get-started/locally
# Download codebase:
git clone https://github.com/alibaba-damo-academy/FunCodec.git
# Install FunCodec codebase:
cd FunCodec
pip install --editable ./
```
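As a quick sanity check of the editable install, you can try importing the package from Python. This assumes the repository installs under the package name `funcodec`, which matches its source layout.
```python
# Verify that the editable install resolves to the cloned repository.
import funcodec
print("FunCodec imported from:", funcodec.__file__)
```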
#### Codec extraction
```sh
# Enter the example directory
cd egs/LibriTTS/codec
# Specify the model name
model_name="audio_codec-encodec-en-libritts-16k-nq32ds640-pytorch"
# Download the model
git lfs install
git clone https://huggingface.co/alibaba-damo/${model_name}
mkdir exp
mv ${model_name} exp/$model_name
# Extract codecs for the wav files listed in "input_wav.scp"; the codecs are saved under "outputs/codecs"
bash encoding_decoding.sh --stage 1 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
--model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
--wav_scp input_wav.scp --out_dir outputs/codecs
# input_wav.scp has the following format:
# uttid1 path/to/file1.wav
# uttid2 path/to/file2.wav
# ...
```
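If you need to build `input_wav.scp` yourself, a minimal helper is sketched below; the directory path is a placeholder, and the utterance IDs are simply taken from the file names.
```python
# Build input_wav.scp ("uttid /path/to/file.wav" per line) from a directory of wav files.
from pathlib import Path

wav_dir = Path("/path/to/wavs")  # placeholder: your own 16 kHz wav files
with open("input_wav.scp", "w") as f:
    for wav in sorted(wav_dir.rglob("*.wav")):
        f.write(f"{wav.stem} {wav.resolve()}\n")
```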
#### Reconstruct waveforms from codecs
```sh
# Reconstruct waveforms into "outputs/recon_wavs"
bash encoding_decoding.sh --stage 2 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \
--model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \
--wav_scp outputs/codecs/codecs.txt --out_dir outputs/recon_wavs
# codecs.txt is the output of stage 1, which has the following format:
# uttid1 [[[1, 2, 3, ...],[2, 3, 4, ...], ...]]
# uttid2 [[[9, 7, 5, ...],[3, 1, 2, ...], ...]]
# ...
```
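For downstream use (for example, training models on the discrete tokens), the `codecs.txt` produced by stage 1 can be parsed line by line. The snippet below is a sketch based only on the format shown in the comments above; anything beyond that format is an assumption.
```python
# Parse codecs.txt into a dict of {uttid: numpy array of shape (num_quantizers, T)}.
import ast
import numpy as np

codecs = {}
with open("outputs/codecs/codecs.txt") as f:
    for line in f:
        uttid, token_str = line.strip().split(maxsplit=1)
        # The token field is a nested Python-style list, e.g. [[[1, 2, ...], [2, 3, ...], ...]]
        codecs[uttid] = np.array(ast.literal_eval(token_str)).squeeze(0)
print({uttid: arr.shape for uttid, arr in list(codecs.items())[:3]})
```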
### Inference with Huggingface Transformers
Inference with the Hugging Face Transformers package is under development.
### Application scenarios
Running environment
- Currently, the model has only been tested on Linux-x86_64; Mac and Windows systems are untested.
Intended use scenarios
- This model is suitable for academic use
- Speech quantization, codec and tokenization of English utterances
## Evaluation results
### Training configuration
- Feature info: raw waveform input
- Train info: Adam, lr 3e-4, batch_size 32, 2 GPUs (Tesla V100), acc_grad 1, 300000 steps, speech_max_length 51200
- Loss info: L1, L2, discriminative loss
- Model info: SEANet, Conv, LSTM
- Train config: config.yaml
- Model size: 57.83 M parameters
### Experimental Results
Test set: LibriTTS test set, ViSQOL scores
| Test set | 50 tokens/s | 100 tokens/s | 200 tokens/s | 400 tokens/s |
|:--------:|:-----------:|:------------:|:------------:|:------------:|
| LibriTTS |     3.64    |     3.94     |     4.16     |     4.29     |
### Limitations and bias
- Not very robust to background noise and reverberation
### BibTeX entry and citation info
```BibTeX
@misc{du2023funcodec,
title={FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec},
author={Zhihao Du and Shiliang Zhang and Kai Hu and Siqi Zheng},
year={2023},
eprint={2309.07405},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
```