|
--- |
|
language: en |
|
tags: |
|
- speech quantization |
|
license: mit |
|
datasets: |
|
- LibriTTS |
|
--- |
|
|
|
# Highlights |
|
This model performs speech coding and quantization on English utterances.

- Low frame rate: 25 tokens/s per quantizer (see the bitrate sketch after this list)

- Achieves higher codec quality at low bandwidths

- Trained with structured dropout, so a single model supports various bandwidths at inference

- Quantizes a raw speech waveform into parallel sequences of discrete tokens
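
For intuition on how token rate maps to bandwidth, here is a back-of-the-envelope bitrate calculation. It assumes 1024-entry codebooks (10 bits per token), which is an assumption rather than a documented property of this checkpoint; the figures are illustrative only.

```python
# Bitrate = frame rate x bits per token x number of active quantizers.
FRAME_RATE = 25      # tokens per second per quantizer
BITS_PER_TOKEN = 10  # log2(1024), assuming 1024-entry codebooks
for n_q in (2, 4, 8, 16, 32):
    print(f"{n_q:2d} quantizers -> {FRAME_RATE * BITS_PER_TOKEN * n_q} bit/s")
```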
|
|
|
# FunCodec model |
|
This model is trained with [FunCodec](https://github.com/alibaba-damo-academy/FunCodec),

an open-source toolkit for speech quantization (codec) from the DAMO Academy, Alibaba Group.

This repository provides a model pre-trained on the LibriTTS corpus.

It can be applied to low-bandwidth speech communication, speech quantization, zero-shot speech synthesis,

and other academic research topics.
|
Compared with [EnCodec](https://arxiv.org/abs/2210.13438) and [SoundStream](https://arxiv.org/abs/2107.03312),

the following improved techniques are used to train the model, resulting in higher codec quality and

higher [ViSQOL](https://github.com/google/visqol) scores at the same bandwidth:

- A magnitude-spectrum loss is employed to enhance mid- and high-frequency signals

- Structured dropout is employed to smooth the code space and to enable multiple bandwidths within a single model

- Codebooks are initialized with k-means clusters rather than random values

- Codebooks are maintained with an exponential moving average (EMA) and a dead-code-elimination mechanism, resulting in a high codebook utilization factor (a minimal sketch of this update follows)
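
As a rough illustration of the last point, the sketch below shows a common EMA-plus-dead-code-elimination recipe. The hyper-parameters (`decay`, usage `threshold`) and the toy driver are assumptions for illustration; this is the general technique, not necessarily FunCodec's exact implementation.

```python
import torch

def ema_codebook_update(codebook, ema_count, ema_sum, x, idx,
                        decay=0.99, threshold=1.0):
    """codebook: (K, D); x: (T, D) frames assigned to codewords idx: (T,)."""
    K = codebook.shape[0]
    one_hot = torch.nn.functional.one_hot(idx, K).type_as(x)  # (T, K)
    # exponential moving averages of usage counts and assigned vectors
    ema_count.mul_(decay).add_(one_hot.sum(0), alpha=1 - decay)
    ema_sum.mul_(decay).add_(one_hot.t() @ x, alpha=1 - decay)
    codebook.copy_(ema_sum / ema_count.clamp(min=1e-5).unsqueeze(1))
    # dead-code elimination: re-seed rarely used codewords from the batch
    dead = ema_count < threshold
    if dead.any():
        codebook[dead] = x[torch.randint(0, x.shape[0], (int(dead.sum()),))]
    return codebook

# toy driver with assumed sizes
K, D, T = 1024, 128, 256
codebook, ema_count = torch.randn(K, D), torch.ones(K)
ema_sum = codebook.clone()
x = torch.randn(T, D)
idx = torch.cdist(x, codebook).argmin(dim=-1)
ema_codebook_update(codebook, ema_count, ema_sum, x, idx)
```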
|
|
|
## Model description |
|
This model is a variational autoencoder that uses residual vector quantization (RVQ) to obtain
|
several parallel sequences of discrete latent representations. Here is an overview of FunCodec models. |
|
<p align="center"> |
|
<img src="fig/framework.png" alt="FunCodec architecture"/> |
|
</p> |
|
|
|
In general, FunCodec models consist of five modules: a domain-transformation module,

an encoder, an RVQ module, a decoder and a domain-inversion module.

- Domain Transformation: transforms input signals into the time domain, short-time frequency domain, magnitude-angle domain, or magnitude-phase domain.

- Encoder: encodes signals into compact representations with stacked convolutional and LSTM layers.

- Semantic tokens (optional): augment encoder outputs with semantic tokens to enhance content information; not used in this model.

- RVQ: quantizes the representations into parallel sequences of discrete tokens with cascaded vector quantizers (see the sketch after this list).

- Decoder: decodes the quantized embeddings back into the same signal domain as the input.

- Domain Inversion: re-synthesizes perceptible waveforms from the respective domains.
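
To make the RVQ step concrete, here is a minimal, self-contained sketch of residual vector quantization with toy shapes; it illustrates the technique itself, not the FunCodec implementation.

```python
import torch

def rvq_encode(x, codebooks):
    """x: (T, D) encoder frames; codebooks: list of (K, D) tensors."""
    residual, indices = x, []
    for codebook in codebooks:
        idx = torch.cdist(residual, codebook).argmin(dim=-1)  # nearest codeword
        residual = residual - codebook[idx]  # quantize and pass the residual on
        indices.append(idx)
    return torch.stack(indices)  # (n_q, T): parallel token streams

# toy example: 1 second at 25 Hz, 128-dim frames, 32 quantizers of 1024 codes
codebooks = [torch.randn(1024, 128) for _ in range(32)]
print(rvq_encode(torch.randn(25, 128), codebooks).shape)  # torch.Size([32, 25])
```

Decoding only a prefix of these parallel streams is what allows a single model to serve several bandwidths.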
|
|
|
More details can be found at: |
|
- Paper: [FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec](https://arxiv.org/abs/2309.07405) |
|
- Codebase: [FunCodec](https://github.com/alibaba-damo-academy/FunCodec) |
|
|
|
## Intended uses & scenarios
|
### Inference with FunCodec |
|
|
|
You can extract codec tokens and reconstruct waveforms from them with the FunCodec repository.
|
|
|
#### FunCodec installation |
|
```sh |
|
# Install PyTorch with GPU support (version >= 1.12.0):

conda install pytorch==1.12.0

# for other versions, please refer to: https://pytorch.org/get-started/locally
|
|
|
# Download codebase: |
|
git clone https://github.com/alibaba-damo-academy/FunCodec.git |
|
|
|
# Install FunCodec codebase: |
|
cd FunCodec |
|
pip install --editable ./ |
|
``` |
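
After installation, a quick import check can confirm the editable install; this assumes the toolkit installs under the top-level package name `funcodec`.

```python
# Sanity check: succeeds only if the editable install is on the Python path.
import funcodec
print("FunCodec imported from:", funcodec.__file__)
```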
|
|
|
#### Codec extraction |
|
```sh |
|
# Enter the example directory |
|
cd egs/LibriTTS/codec |
|
# Specify the model name |
|
model_name="audio_codec-encodec-en-libritts-16k-nq32ds640-pytorch" |
|
# Download the model |
|
git lfs install |
|
git clone https://huggingface.co/alibaba-damo/${model_name} |
|
mkdir exp |
|
mv ${model_name} exp/${model_name}

# Extract codec tokens for the utterances listed in "input_wav.scp"; the tokens are saved under "outputs/codecs"
|
bash encoding_decoding.sh --stage 1 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \ |
|
--model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \ |
|
--wav_scp input_wav.scp --out_dir outputs/codecs |
|
# input_wav.scp has the following format: |
|
# uttid1 path/to/file1.wav |
|
# uttid2 path/to/file2.wav |
|
# ... |
|
``` |
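
If you need to generate `input_wav.scp` programmatically, here is a minimal Python sketch that follows the two-column `uttid path` format shown above; `path/to/wavs` is a placeholder for your own 16 kHz recordings.

```python
# Write one "uttid path" line per wav file found in a directory.
from pathlib import Path

with open("input_wav.scp", "w") as f:
    for wav in sorted(Path("path/to/wavs").glob("*.wav")):
        f.write(f"{wav.stem} {wav.resolve()}\n")
```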
|
|
|
#### Reconstruct waveforms from codecs
|
```shell |
|
# Reconstruct waveforms into "outputs/recon_wavs" |
|
bash encoding_decoding.sh --stage 2 --batch_size 16 --num_workers 4 --gpu_devices "0,1" \ |
|
--model_dir exp/${model_name} --bit_width 16000 --file_sampling_rate 16000 \ |
|
--wav_scp outputs/codecs/codecs.txt --out_dir outputs/recon_wavs |
|
# codecs.txt is the output of stage 1, which has the following format: |
|
# uttid1 [[[1, 2, 3, ...],[2, 3, 4, ...], ...]] |
|
# uttid2 [[[9, 7, 5, ...],[3, 1, 2, ...], ...]] |
|
# ... |
|
``` |
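
To inspect the extracted tokens offline, here is a minimal parsing sketch, assuming the `uttid [[[...]]]` layout shown in the comments above:

```python
# Parse codecs.txt into {uttid: nested lists of token ids}.
import ast

codecs = {}
with open("outputs/codecs/codecs.txt") as f:
    for line in f:
        uttid, tokens = line.strip().split(maxsplit=1)
        codecs[uttid] = ast.literal_eval(tokens)

uttid = next(iter(codecs))
print(uttid, "->", len(codecs[uttid][0]), "quantizer streams")
```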
|
|
|
### Inference with Hugging Face Transformers

Inference with the Hugging Face Transformers package is under development.
|
|
|
### Application scenarios

Running environment:

- Currently, the model has only been tested on Linux-x86_64; macOS and Windows are untested.

Intended usage scenarios:

- This model is suitable for academic use

- Speech quantization, codec, and tokenization of English utterances
|
|
|
## Evaluation results |
|
|
|
### Training configuration |
|
- Feature info: raw waveform input |
|
- Train info: Adam, lr 3e-4, batch_size 32, 2 GPUs (Tesla V100), acc_grad 1, 300,000 steps, speech_max_length 51200
|
- Loss info: L1, L2, discriminative loss |
|
- Model info: SEANet, Conv, LSTM |
|
- Train config: config.yaml |
|
- Model size: 57.83 M parameters |
|
|
|
### Experimental Results |
|
|
|
Test set: LibriTTS test set. Metric: ViSQOL score (higher is better). At 25 tokens/s per quantizer, the token rates below correspond to 2, 4, 8, and 16 active quantizers.

| Test set | 50 tk/s | 100 tk/s | 200 tk/s | 400 tk/s |

|:--------:|:--------:|:--------:|:--------:|:--------:|

| LibriTTS | 3.64 | 3.94 | 4.16 | 4.29 |
|
|
|
### Limitations and bias |
|
- Not very robust to background noise and reverberation
|
|
|
### BibTeX entry and citation info |
|
```BibTeX |
|
@misc{du2023funcodec, |
|
title={FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec}, |
|
author={Zhihao Du and Shiliang Zhang and Kai Hu and Siqi Zheng},
|
year={2023}, |
|
eprint={2309.07405}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.SD}
|
} |
|
``` |
|
|