The NeMo toolkit [3] was used to train the models for several hundred epochs. The models were trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/conformer/conformer_transducer_bpe.yaml).

The vocabulary we use contains 28 characters:
```python
[' ', "'", 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
```

Rare symbols with diacritics were replaced during preprocessing.
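The exact replacement rules are not included in this card; the snippet below is only a sketch of that kind of mapping, assuming standard Unicode decomposition (the helper name and sample strings are illustrative, not taken from the training pipeline).

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Illustrative sketch: map accented characters to their base form.

    This is an assumption about the kind of replacement described above,
    not the exact preprocessing used for these models.
    """
    decomposed = unicodedata.normalize("NFD", text)  # split base characters from combining marks
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("café naïve"))  # -> "cafe naive"
```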
The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).

For the vocabulary of size 1024 we restrict the maximum subtoken length to 4 symbols to avoid populating the vocabulary with specific frequent words from the dataset. This does not affect model performance and potentially helps the model adapt to other domains without retraining the tokenizer.
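The tokenizer script builds SentencePiece tokenizers, and this length restriction maps to SentencePiece's `max_sentencepiece_length` training option; the sketch below shows the same restriction using the `sentencepiece` package directly (the transcript file name and model prefix are placeholders, not paths from the original setup).

```python
import sentencepiece as spm

# Sketch only: build a BPE tokenizer with vocabulary size 1024 and no subtoken
# longer than 4 symbols. "train_transcripts.txt" is a placeholder for the
# train-set transcripts (one utterance per line), not a file from this repo.
spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",
    model_prefix="tokenizer_spe_bpe_v1024",
    model_type="bpe",
    vocab_size=1024,
    max_sentencepiece_length=4,  # the 4-symbol subtoken limit described above
)
```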
Full config can be found inside the .nemo files.
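One way to inspect that config is to restore a checkpoint and print what it stores; a minimal sketch, assuming a locally downloaded file named `model.nemo` and the NeMo ASR collection installed:

```python
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Restore a downloaded checkpoint and print the full config stored inside it.
# "model.nemo" is a placeholder path, not a file shipped with this card.
model = nemo_asr.models.EncDecRNNTBPEModel.restore_from("model.nemo")
print(OmegaConf.to_yaml(model.cfg))
```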