nvidia/parakeet-rnnt-1.1b

Automatic Speech Recognition

NeMo

PyTorch

nithinraok

vlavrukhin committed on Dec 28, 2023

Commit 72f931d • 1 Parent(s): 5305dc0

Update README.md (#1)

- Update README.md (6b85e86a3ced40934391192b5a23a8233fba4d8d)

Co-authored-by: Vitaly Lavrukhin <vlavrukhin@users.noreply.huggingface.co>

Files changed (1)

README.md +7 -7

@@ -179,13 +179,13 @@ img {
 | [![Language](https://img.shields.io/badge/Language-en-lightgrey#model-badge)](#datasets)
-parakeet-rnnt-1.1b is an ASR model that transcribes speech in lower case English alphabet. This model is jointly developed by [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) team and [Suno.ai](https://www.suno.ai/).
-It is a "extra extra large" version of FastConformer Transducer[1] (around 1.1B parameters) model.
 See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) for complete architecture details.
 ## NVIDIA NeMo: Training
-To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed latest Pytorch version.
 ```
 pip install nemo_toolkit['all']
 ```
@@ -221,7 +221,7 @@ python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
 ### Input
-This model accepts 16000 Hz Mono-channel Audio (wav files) as input.
 ### Output
@@ -241,7 +241,7 @@ The tokenizers for these models were built using the text transcripts of the tra
 The model was trained on 65K hours of English speech collected and prepared by NVIDIA NeMo and Suno teams.
-Dataset contains following Public English speech sets (25K hrs)
 - Librispeech 960 hours of English speech
 - Fisher Corpus
@@ -251,9 +251,9 @@ Dataset contains following Public English speech sets (25K hrs)
 - VCTK
 - VoxPopuli (EN)
 - Europarl-ASR (EN)
-- Multilingual Librispeech (MLS EN) - 2,000 hrs subset
 - Mozilla Common Voice (v7.0)
-- People's Speech  - 12,000 hrs subset
 ## Performance

 | [![Language](https://img.shields.io/badge/Language-en-lightgrey#model-badge)](#datasets)
+parakeet-rnnt-1.1b is an ASR model that transcribes speech in lower case English alphabet. This model is jointly developed by [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) and [Suno.ai](https://www.suno.ai/) teams.
+It is an XXL version of FastConformer Transducer [1] (around 1.1B parameters) model.
 See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) for complete architecture details.
 ## NVIDIA NeMo: Training
+To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed latest PyTorch version.
 ```
 pip install nemo_toolkit['all']
 ```
 ### Input
+This model accepts 16000 Hz mono-channel audio (wav files) as input.
 ### Output
 The model was trained on 65K hours of English speech collected and prepared by NVIDIA NeMo and Suno teams.
+Dataset contains following Public English speech sets (25K hours)
 - Librispeech 960 hours of English speech
 - Fisher Corpus
 - VCTK
 - VoxPopuli (EN)
 - Europarl-ASR (EN)
+- Multilingual Librispeech (MLS EN) - 2,000 hour subset
 - Mozilla Common Voice (v7.0)
+- People's Speech  - 12,000 hour subset
 ## Performance
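
The input requirement stated in the updated README (16000 Hz mono-channel wav files) can be checked before transcription. The following is a minimal sketch using only the Python standard library; the function name `is_supported_wav` and the two constants are hypothetical helpers introduced for illustration and are not part of NeMo or this repository.

```python
import wave

# Values taken from the README's "Input" section.
EXPECTED_RATE_HZ = 16000
EXPECTED_CHANNELS = 1  # mono


def is_supported_wav(path):
    """Return True if the WAV file at `path` is 16000 Hz mono.

    Hypothetical helper: it only inspects the WAV header and does not
    resample or downmix anything.
    """
    with wave.open(path, "rb") as wav_file:
        rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    return rate == EXPECTED_RATE_HZ and channels == EXPECTED_CHANNELS
```

A file that fails this check would presumably need to be resampled or downmixed with an audio tool of your choice before being passed to the model.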