HuBERT

Pre-trained and fine-tuned (ASR) models

Model	Pretraining Data	Finetuning Dataset	Checkpoint
HuBERT Base (~95M params)	Librispeech 960 hr	No finetuning (Pretrained Model)	download
HuBERT Large (~316M params)	Libri-Light 60k hr	No finetuning (Pretrained Model)	download
HuBERT Extra Large (~1B params)	Libri-Light 60k hr	No finetuning (Pretrained Model)	download
HuBERT Large	Libri-Light 60k hr	Librispeech 960 hr	download
HuBERT Extra Large	Libri-Light 60k hr	Librispeech 960 hr	download

Load a model

from fairseq import checkpoint_utils

ckpt_path = "/path/to/the/checkpoint.pt"
# Returns the loaded models, the saved config, and the task.
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
model = models[0]

Train a new model

Data preparation

Follow the steps in ./simple_kmeans to create:

{train,valid}.tsv waveform list files
{train,valid}.km frame-aligned pseudo label files. The label_rate is the same as the feature frame rate used for clustering, which is 100Hz for MFCC features and 50Hz for HuBERT features by default.
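
Because the labels are frame-aligned, a quick sanity check before training is to compare each utterance's audio duration with the duration implied by its label count. The sketch below is illustrative only and is not part of ./simple_kmeans. It assumes the manifest's first line is the audio root directory and each following line is a tab-separated relative path and sample count, that each .km line holds space-separated cluster IDs for one utterance, and that the audio is sampled at 16 kHz. The names read_manifest and check_labels and the 0.1 s tolerance are hypothetical.

```python
from pathlib import Path


def read_manifest(tsv_path):
    """Return (root, [(relative_path, num_samples), ...]) from a .tsv manifest."""
    lines = Path(tsv_path).read_text().splitlines()
    root, entries = lines[0], []
    for line in lines[1:]:
        if not line.strip():
            continue
        rel_path, num_samples = line.split("\t")
        entries.append((rel_path, int(num_samples)))
    return root, entries


def check_labels(tsv_path, km_path, label_rate, sample_rate=16000, tol=0.1):
    """Return indices of utterances whose label duration differs from the
    audio duration by more than `tol` seconds (assumed tolerance)."""
    _, entries = read_manifest(tsv_path)
    label_lines = [
        line for line in Path(km_path).read_text().splitlines() if line.strip()
    ]
    if len(entries) != len(label_lines):
        raise ValueError(
            f"{len(entries)} utterances in manifest but {len(label_lines)} label lines"
        )
    mismatched = []
    for i, ((_, num_samples), labels) in enumerate(zip(entries, label_lines)):
        audio_sec = num_samples / sample_rate
        label_sec = len(labels.split()) / label_rate
        if abs(audio_sec - label_sec) > tol:
            mismatched.append(i)
    return mismatched
```

For example, check_labels("/path/to/data/train.tsv", "/path/to/labels/train.km", label_rate=100) should return an empty list when labels clustered from MFCC features line up with the audio.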

Pre-train a HuBERT model

Suppose {train,valid}.tsv are saved at /path/to/data, {train,valid}.km are saved at /path/to/labels, and the label rate is 100Hz.

To train a base model (12 layer transformer), run:

$ python fairseq_cli/hydra_train.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/pretrain \
  --config-name hubert_base_librispeech \
  task.data=/path/to/data task.label_dir=/path/to/labels model.label_rate=100

Fine-tune a HuBERT model with a CTC loss

Suppose {train,valid}.tsv are saved at /path/to/data, and their corresponding character transcripts {train,valid}.ltr are saved at /path/to/trans.
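
As a rough illustration of what a character transcript line can look like, the sketch below converts a word-level transcript to a letter sequence and back. It assumes the convention of fairseq's wav2vec label preparation, where characters are separated by spaces and "|" marks the end of each word; confirm this against your own .ltr files. The function names are hypothetical.

```python
def words_to_ltr(transcript):
    """Convert 'HELLO WORLD' into 'H E L L O | W O R L D |'."""
    words = transcript.strip().split()
    if not words:
        return ""
    # Join words with the boundary symbol, then space-separate every character.
    return " ".join("|".join(words)) + " |"


def ltr_to_words(ltr_line):
    """Invert words_to_ltr: 'H E L L O | W O R L D |' becomes 'HELLO WORLD'."""
    return "".join(ltr_line.split()).replace("|", " ").strip()
```

Each line of {train,valid}.ltr must correspond to the utterance on the same line of the matching .tsv file (after the header line).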

To fine-tune a pre-trained HuBERT model at /path/to/checkpoint, run:

$ python fairseq_cli/hydra_train.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/finetune \
  --config-name base_10h \
  task.data=/path/to/data task.label_dir=/path/to/trans \
  model.w2v_path=/path/to/checkpoint

Decode a HuBERT model

Suppose test.tsv and test.ltr are the waveform list and transcripts of the split to be decoded, saved at /path/to/data, and the fine-tuned model is saved at /path/to/checkpoint. We support three decoding modes:

Viterbi decoding: greedy decoding without a language model
KenLM decoding: decoding with an arpa-format KenLM n-gram language model
Fairseq-LM decoding: decoding with a Fairseq neural language model

Viterbi decoding

task.normalize needs to be consistent with the value used during fine-tuning. Decoding results will be saved at /path/to/experiment/directory/decode/viterbi/test.

$ python examples/speech_recognition/new/infer.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/decode \
  --config-name infer_viterbi \
  task.data=/path/to/data \
  task.normalize=[true|false] \
  decoding.exp_dir=/path/to/experiment/directory \
  common_eval.path=/path/to/checkpoint \
  dataset.gen_subset=test
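
For intuition about what greedy decoding without a language model does with CTC outputs, here is a minimal pure-Python sketch. It is not the code path used by infer.py. It picks the highest-scoring token per frame, collapses consecutive repeats, and drops blanks. The function name is hypothetical, and the blank index of 0 is an assumption that should be checked against the dictionary of the model being decoded.

```python
def ctc_greedy_decode(emissions, blank=0):
    """Greedy CTC decoding.

    emissions: sequence of per-frame score lists, shape (frames, vocab).
    blank: index of the CTC blank token (assumed to be 0 here).
    Returns the decoded list of token indices.
    """
    tokens = []
    prev = None
    for frame in emissions:
        # Best-scoring token for this frame.
        idx = max(range(len(frame)), key=frame.__getitem__)
        # Emit only when the token changes and is not blank.
        if idx != prev and idx != blank:
            tokens.append(idx)
        prev = idx
    return tokens
```

A blank between two identical tokens keeps both of them, which is how CTC represents repeated characters.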

KenLM / Fairseq-LM decoding

Suppose the pronunciation lexicon and the n-gram LM are saved at /path/to/lexicon and /path/to/arpa, respectively. Decoding results will be saved at /path/to/experiment/directory/decode/kenlm/test.

$ python examples/speech_recognition/new/infer.py \
  --config-dir /path/to/fairseq-py/examples/hubert/config/decode \
  --config-name infer_kenlm \
  task.data=/path/to/data \
  task.normalize=[true|false] \
  decoding.exp_dir=/path/to/experiment/directory \
  common_eval.path=/path/to/checkpoint \
  dataset.gen_subset=test \
  decoding.decoder.lexicon=/path/to/lexicon \
  decoding.decoder.lmpath=/path/to/arpa

The command above uses the default decoding hyperparameters, which can be found in examples/speech_recognition/hydra/decoder.py. These parameters can be overridden from the command line. For example, to search with a beam size of 500, append decoding.decoder.beam=500 to the command above. Important parameters include:

decoding.decoder.beam
decoding.decoder.beamthreshold
decoding.decoder.lmweight
decoding.decoder.wordscore
decoding.decoder.silweight
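
When tuning several of these parameters, it can be convenient to generate the override strings programmatically and append them to the command. The sketch below only builds strings of the form decoding.decoder.NAME=VALUE; it does not invoke infer.py. The helper names are hypothetical, and the grid values in the usage note are placeholders rather than recommended settings.

```python
from itertools import product


def decoder_overrides(**params):
    """Build override strings such as 'decoding.decoder.beam=500'."""
    return [f"decoding.decoder.{name}={value}" for name, value in params.items()]


def sweep(grid):
    """Yield one list of override strings per combination in `grid`.

    grid: dict mapping a decoder parameter name to a list of candidate values.
    """
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield decoder_overrides(**dict(zip(names, values)))
```

For instance, list(sweep({"lmweight": [1.0, 2.0], "wordscore": [-1.0]})) produces two override lists, one per lmweight value, each of which can be joined with spaces and appended to the decoding command.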

To decode with a Fairseq LM, use --config-name infer_fsqlm instead, and change the lexicon and LM paths accordingly.