
Transcription normalization

#2
by qanastek - opened

Thank you very much for your contribution to the community in sharing both the models and the training scripts.

You have mentioned that the training dataset consists of a private subset with 40K hours of English speech plus 25K hours from the following public datasets:

  • Librispeech 960 hours of English speech
  • Fisher Corpus
  • Switchboard-1 Dataset
  • WSJ-0 and WSJ-1
  • National Speech Corpus (Part 1, Part 6)
  • VCTK
  • VoxPopuli (EN)
  • Europarl-ASR (EN)
  • Multilingual Librispeech (MLS EN) - 2,000 hour subset
  • Mozilla Common Voice (v7.0)
  • People's Speech - 12,000 hour subset

But you haven't mentioned any of the normalization steps applied to the transcriptions, even though each corpus has its own annotation protocol. Do you share these pre-processing steps anywhere? I cannot find them in the NeMo GitHub repository.

Regards.

NVIDIA org
β€’
edited Jan 3

Some of the dataset preprocessing scripts are made available here: https://github.com/NVIDIA/NeMo/tree/main/scripts/dataset_processing

Eventually we will make all public dataset pre-processing scripts available.
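For readers who want a starting point before the official scripts land, a generic transcript-normalization pass usually lowercases the text, strips punctuation, and collapses whitespace. The sketch below is purely illustrative and is not NVIDIA's actual pipeline; the function name and the exact character set kept are assumptions.

```python
import re
import unicodedata

def normalize_transcript(text: str) -> str:
    """Illustrative ASR transcript normalization (NOT NeMo's pipeline):
    Unicode-normalize, lowercase, strip punctuation, collapse whitespace."""
    # Normalize Unicode forms (e.g. compatibility characters -> canonical ones)
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    # Keep letters, digits, apostrophes, and spaces; replace other punctuation
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    # Collapse runs of whitespace into single spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(normalize_transcript("Hello,   WORLD! It's a test."))
# -> hello world it's a test
```

In practice each corpus needs its own additional rules (number expansion, disfluency markers, speaker tags), which is exactly why per-dataset scripts matter.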

smajumdar94 changed discussion status to closed
