Generative Spoken Language Modeling

Paper
Demo

We build and evaluate generative speech2speech systems using Log Mel Filterbank, Modified CPC, HuBERT Base and Wav2Vec 2.0 Large. Our system is composed of three components: speech2unit, ulm and unit2speech. The models and usage of each component are described in its own sub-directory; see the links below.

Speech to Unit Model (speech2unit)

The speech to unit model quantizes raw speech into learned discrete speech units. More details
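As a rough illustration of what quantizing into discrete units means, the sketch below maps each feature frame to the index of its nearest centroid. It assumes the frames are already extracted as fixed-size feature vectors and that a codebook of centroids has been learned beforehand (for example by k-means); the function name `quantize` and the toy data are hypothetical and are not part of this repository.

```python
from typing import List, Sequence


def quantize(
    frames: Sequence[Sequence[float]],
    centroids: Sequence[Sequence[float]],
) -> List[int]:
    """Return, for each frame, the index of the closest centroid.

    Hypothetical helper for illustration only. Distance is squared
    Euclidean; ties go to the lowest centroid index.
    """
    units = []
    for frame in frames:
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(distances.index(min(distances)))
    return units


# Toy 2-dimensional "features" and a 3-entry codebook (made-up values).
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [0.05, 0.9]]
units = quantize(frames, centroids)
print(units)
```

In a real system the frames would come from one of the feature extractors listed above, and the codebook would be much larger than three entries.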

Unit Language Model (ulm)

The unit language model is a generative language model trained on discrete speech units. More details
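To make the idea of a generative model over unit sequences concrete, here is a deliberately tiny sketch: a count-based bigram model that continues a unit prompt by sampling the next unit from observed successors. This is a stand-in chosen for brevity, not the model used in this project; `train_bigram` and `generate` are hypothetical names.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def train_bigram(sequences: Sequence[Sequence[int]]) -> Dict[int, Counter]:
    """Count how often each unit follows each other unit."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(
    counts: Dict[int, Counter],
    prompt: Sequence[int],
    max_new: int,
    seed: int = 0,
) -> List[int]:
    """Extend a non-empty unit prompt by sampling from bigram counts."""
    if not prompt:
        raise ValueError("prompt must contain at least one unit")
    rng = random.Random(seed)
    out = list(prompt)
    for _ in range(max_new):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last unit was never seen with a successor
        candidates, weights = zip(*followers.items())
        out.append(rng.choices(candidates, weights=weights, k=1)[0])
    return out


# Made-up training data: two identical unit sequences.
model = train_bigram([[1, 2, 3], [1, 2, 3]])
print(generate(model, prompt=[1], max_new=5))
```

A neural language model replaces the count table with learned parameters and conditions on a longer history, but the generation loop has the same shape: read the units so far, sample the next one, repeat.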

Unit to Speech Model (unit2speech)

The unit to speech model synthesizes speech from discrete speech units. More details

Metrics

We show how to compute ASR-based metrics, as well as the zero-shot metrics proposed in our paper, here.
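For readers unfamiliar with ASR-based evaluation, the sketch below computes word error rate (WER) between a reference transcript and an ASR transcript of generated speech. WER is used here only as a familiar example of an ASR-based measure; the exact metrics and their definitions are those given in the paper and the linked instructions, and the function names below are hypothetical.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two token lists (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six words.
score = wer("the cat sat on the mat", "the cat sit on mat")
print(round(score, 3))
```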

Tools

We share two tools: one resynthesizes a given spoken utterance, and the other generates novel spoken language from a spoken prompt. More details