Update README.md
README.md
CHANGED
@@ -103,12 +103,14 @@ Conformer-CTC model is a non-autoregressive variant of Conformer model [1] for A
 
 The NeMo toolkit [3] was used for training the models for over several hundred epochs. These models were trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_ctc/speech_to_text_ctc_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/conformer/conformer_ctc_bpe.yaml).
 
-The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
-
 The vocabulary we use contains 28 characters:
 ```python
 [' ', "'", 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
 ```
+Rare symbols with diacritics were replaced during preprocessing.
+
+The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).
+For a vocabulary of size 128, we restrict the maximum subtoken length to 2 symbols to avoid populating the vocabulary with specific frequent words from the dataset. This does not affect model performance and potentially helps the model adapt to other domains without retraining the tokenizer.
 
 Full config can be found inside the .nemo files.
 
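
The preprocessing and tokenizer settings described in this change could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the procedure actually used for these models: the README does not say how diacritics were replaced, so the sketch assumes Unicode NFKD decomposition; the helper names (`normalize_transcript`, `tokenizer_train_kwargs`, `train_tokenizer`) and the `"bpe"` model type are hypothetical; and it calls the SentencePiece Python package directly rather than the linked NeMo script.

```python
import unicodedata

# The 28-character vocabulary listed in the README: space, apostrophe, a-z.
VOCAB = set(" '" + "abcdefghijklmnopqrstuvwxyz")


def normalize_transcript(text: str) -> str:
    """Lowercase, strip diacritics, and blank out characters outside VOCAB.

    Assumption: a letter with a diacritic is replaced by its base letter
    via NFKD decomposition. Characters with no decomposition (for example
    'ø') fall outside VOCAB and become spaces.
    """
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks left behind by the decomposition.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch if ch in VOCAB else " " for ch in base)
    return " ".join(kept.split())  # collapse runs of whitespace


def tokenizer_train_kwargs(corpus_path: str, model_prefix: str,
                           model_type: str = "bpe") -> dict:
    """Build keyword arguments for sentencepiece.SentencePieceTrainer.train.

    vocab_size=128 and max_sentencepiece_length=2 mirror the README text;
    model_type is an assumption, since the README does not state it.
    """
    return {
        "input": corpus_path,
        "model_prefix": model_prefix,
        "vocab_size": 128,
        "max_sentencepiece_length": 2,  # subtokens of at most 2 symbols
        "model_type": model_type,
    }


def train_tokenizer(corpus_path: str, model_prefix: str) -> None:
    """Train a tokenizer on a file of normalized transcripts, one per line."""
    import sentencepiece as spm  # third-party: pip install sentencepiece

    spm.SentencePieceTrainer.train(
        **tokenizer_train_kwargs(corpus_path, model_prefix)
    )


print(normalize_transcript("Café DÉJÀ vu!"))  # -> cafe deja vu
```

Capping the subtoken length at 2 means no whole word longer than two characters can enter the vocabulary, which is the effect the README describes. Training needs a corpus large enough to yield 128 distinct pieces.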