Update README.md
#1 by vlavrukhin - opened
README.md CHANGED
@@ -179,13 +179,13 @@ img {
 | [![Language](https://img.shields.io/badge/Language-en-lightgrey#model-badge)](#datasets)
 
 
-parakeet-rnnt-1.1b is an ASR model that transcribes speech in lower case English alphabet. This model is jointly developed by [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)
-It is
+parakeet-rnnt-1.1b is an ASR model that transcribes speech in lower case English alphabet. This model is jointly developed by [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) and [Suno.ai](https://www.suno.ai/) teams.
+It is an XXL version of FastConformer Transducer [1] (around 1.1B parameters) model.
 See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#fast-conformer) for complete architecture details.
 
 ## NVIDIA NeMo: Training
 
-To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed latest
+To train, fine-tune or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed latest PyTorch version.
 ```
 pip install nemo_toolkit['all']
 ```
@@ -221,7 +221,7 @@ python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py
 
 ### Input
 
-This model accepts 16000 Hz
+This model accepts 16000 Hz mono-channel audio (wav files) as input.
 
 ### Output
 
@@ -241,7 +241,7 @@ The tokenizers for these models were built using the text transcripts of the tra
 
 The model was trained on 65K hours of English speech collected and prepared by NVIDIA NeMo and Suno teams.
 
-Dataset contains following Public English speech sets (25K hrs)
+Dataset contains following Public English speech sets (25K hours)
 
 - Librispeech 960 hours of English speech
 - Fisher Corpus
@@ -251,9 +251,9 @@ Dataset contains following Public English speech sets (25K hrs)
 - VCTK
 - VoxPopuli (EN)
 - Europarl-ASR (EN)
-- Multilingual Librispeech (MLS EN) - 2,000
+- Multilingual Librispeech (MLS EN) - 2,000 hour subset
 - Mozilla Common Voice (v7.0)
-- People's Speech - 12,000
+- People's Speech - 12,000 hour subset
 
 ## Performance
 
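The updated Input line in this change states that the model expects 16000 Hz mono-channel wav files. As an illustration only, the following sketch shows one way a caller might check a file against that requirement before transcription. It uses Python's standard `wave` module rather than anything from NeMo, and the names `wav_format_problems` and `write_silence` are hypothetical helpers introduced here, not part of the README or the toolkit.

```python
import os
import tempfile
import wave


def wav_format_problems(path, expected_rate=16000, expected_channels=1):
    """Return a list of mismatches between a PCM WAV file and the expected format.

    The defaults (16000 Hz, 1 channel) mirror the Input section of the README.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channels, expected {expected_channels}")
    return problems


def write_silence(path, rate, channels, seconds=1):
    """Hypothetical demo helper: write a short silent 16-bit PCM WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * channels * seconds)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.wav")
        bad = os.path.join(tmp, "bad.wav")
        write_silence(good, rate=16000, channels=1)
        write_silence(bad, rate=44100, channels=2)
        print(good, wav_format_problems(good))
        print(bad, wav_format_problems(bad))
```

An empty list means the file matches the stated input format. The `wave` module only reads uncompressed PCM WAV data, so other encodings raise `wave.Error` and would need to be converted first.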