Spaces:
Sleeping
Sleeping
File size: 11,993 Bytes
13aac28 4744f27 0b43409 a979b42 ecc051b 0b43409 13aac28 0b43409 f30f955 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 |
# voice-recognition-ua
This is a repository with aim to apply [DeepSpeech](https://github.com/mozilla/DeepSpeech "DeepSpeech") (state-of-the-art speech recognition model) on Ukrainian language.
You can see online demo here: https://voice-recognition-ua.herokuapp.com (your voice is not stored).
Source code is in this repository together with auto-deploy pipeline scripts.
P.S. Due to small size of dataset (20 hours), don't expect production-grade performance.
Contribute your voice to [Common Voice project](https://commonvoice.mozilla.org/uk "Common Voice") yourself, so we can improve model accuracy.
## Pre-run requirements
Make sure to download:
1. https://github.com/robinhad/voice-recognition-ua/releases/download/v0.2/uk.tflite
3. https://github.com/mozilla/DeepSpeech/releases/download/v0.9.1/deepspeech-0.9.1-models.tflite
## How to launch
```
export FLASK_APP=main.py
flask run
```
# How to train your own model
Most of the guide is took from there:
https://deepspeech.readthedocs.io/en/v0.9.1/TRAINING.html
## Steps:
1. Create g4dn.xlarge instance on AWS, Deep Learning AMI (Ubuntu 18.04), 150 GB of space.
2. Install Python requirements:
```
sudo apt-get install python3-dev sox libsox-fmt-mp3 # sox is used for audio reading
```
3. Clone DeepSpeech branch v0.9.1
```
git clone --branch v0.9.1 https://github.com/mozilla/DeepSpeech
```
4. Go into DeepSpeech directory:
```
cd DeepSpeech
```
5. Create virtual environment using conda (it will be easier to manage CUDA libraries):
```
conda create --prefix $HOME/tmp/deepspeech-train-venv/ python=3.7
```
6. Activate it:
```
conda activate /home/ubuntu/tmp/deepspeech-train-venv
```
7. Install DeepSpeech requirements:
```
pip3 install --upgrade pip==20.2.2 wheel==0.34.2 setuptools==49.6.0
pip3 install --upgrade -e .
```
8. Install required CUDA libraries:
```
conda install cudnn=7.6=cuda10.1_0
pip3 install 'tensorflow-gpu==1.15.4'
```
9. Open https://commonvoice.mozilla.org/uk/datasets and copy link to Ukrainian dataset.
```
cd ..
wget <your_link_to_dataset>
tar -xf uk.tar.gz
```
You'll get a folder named `cv-corpus-5.1-2020-06-22`
10. Download alphabet, used for dataset.
Alphabet is a file with all possible symbols, that are going to be in a dataset. Outputs are directly formed from alphabet. Alphabet is also used for filtering, data, that contain symbols not in alphabet, will be skipped.
```
cd ./DeepSpeech
mkdir data_uk
cd ./data_uk
wget https://github.com/robinhad/voice-recognition-ua/releases/download/v0.2/alphabet.txt
```
NOTE: if you create your alphabet, make sure it's in UTF-8 format
11. Filter data, that contains symbols not in alphabet:
```
cd .. # DeepSpeech
bin/import_cv2.py --filter_alphabet ./data_uk/alphabet.txt ../cv-corpus-5.1-2020-06-22/uk
```
12. (Optional step if you want to create model from scratch, expect low performance because of small dataset (~20 hours for Ukrainian))
```
python3 DeepSpeech.py --train_files ../data/CV/en/clips/train.csv --dev_files ../data/CV/en/clips/dev.csv --test_files ../data/CV/en/clips/test.csv
```
13. Transfer Learning
Transfer learning is method of using existing, pre-trained model on one dataset and apply it on similar, but another. In example, if we do speech recognition, we can use a fact that with each layer model deals with more general concept. Starting layers recognize different sound and low-level patterns, whereas later layers are more involved in final output (letters). So in that case we freeze all the layers (they don't update during training) except the specified last ones, where we substitute English alphabet with Ukrainian one.
Below we will download English model checkpoint and create folder for Ukrainian one.
```
mkdir checkpoints
cd ./checkpoints
wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.1/deepspeech-0.9.1-checkpoint.tar.gz
tar -xf deepspeech-0.9.1-checkpoint.tar.gz
mkdir uk_transfer_checkpoint
cd ..
```
14. Start a training itself. (if you want to make changes to training parameters, run `python3 DeepSpeech.py --helpfull` for list of all parameters).
When model finishes training, there will be error due to bug in DeepSpeech that will prevent evaluating performance for now, we will fix it in the next step.
It will take a while, ~11 minutes per epoch.
```
python3 DeepSpeech.py \
--train_cudnn \
--drop_source_layers 2 \
--alphabet_config_path ./data_uk/alphabet.txt \
--save_checkpoint_dir ./checkpoints/uk_transfer_checkpoint \
--load_checkpoint_dir ./checkpoints/deepspeech-0.9.1-checkpoint \
--train_files ../cv-corpus-5.1-2020-06-22/uk/clips/train.csv \
--dev_files ../cv-corpus-5.1-2020-06-22/uk/clips/dev.csv \
--test_files ../cv-corpus-5.1-2020-06-22/uk/clips/test.csv \
--epochs 10 \
```
15. Evaluate model:
```
python3 DeepSpeech.py \
--train_cudnn \
--alphabet_config_path ./data_uk/alphabet.txt \
--load_checkpoint_dir ./checkpoints/uk_transfer_checkpoint \
--train_files ../cv-corpus-5.1-2020-06-22/uk/clips/train.csv \
--dev_files ../cv-corpus-5.1-2020-06-22/uk/clips/dev.csv \
--test_files ../cv-corpus-5.1-2020-06-22/uk/clips/test.csv \
--test_batch_size 40 \
--epochs 0
```
It will take a while, approximately 20-30 minutes.
You will get performance report:
WER - Word Error Rate, calculates how much characters were guessed correctly.
CER - Character Error Rate, calculates how much characters were guessed correctly.
Here we have WER 95% and CER 36%.
It is high because we don't use scorer (language model that maps chacter sequence to the closest word match) during training, you can improve performance if you create scorer for Ukrainian language. As a text corpus you can use Wikipedia articles.
```
Test on ../cv-corpus-5.1-2020-06-22/uk/clips/test.csv - WER: 0.950863, CER: 0.357779, loss: 59.444176
--------------------------------------------------------------------------------
Best WER:
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 2.696858
- wav: file://../cv-corpus-5.1-2020-06-22/uk/clips/common_voice_uk_21203420.wav
- src: "я замер"
- res: "я замер"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 1.772630
- wav: file://../cv-corpus-5.1-2020-06-22/uk/clips/common_voice_uk_21755897.wav
- src: "що саме"
- res: "що саме"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 0.269474
- wav: file://../cv-corpus-5.1-2020-06-22/uk/clips/common_voice_uk_21350648.wav
- src: "ні"
- res: "ні"
--------------------------------------------------------------------------------
WER: 0.250000, CER: 0.066667, loss: 7.652889
- wav: file://../cv-corpus-5.1-2020-06-22/uk/clips/common_voice_uk_22161067.wav
- src: "і вухом не веде"
- res: "і вухом не виде"
--------------------------------------------------------------------------------
WER: 0.333333, CER: 0.142857, loss: 22.727850
- wav: file://../cv-corpus-5.1-2020-06-22/uk/clips/common_voice_uk_20894315.wav
- src: "подробиці наразі уточнюються"
- res: "подробиці наразі удочнвітцся"
--------------------------------------------------------------------------------
Median WER:
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.408163, loss: 77.099953
- wav: file://../cv-corpus-5.1-2020-06-22/uk/clips/common_voice_uk_21565481.wav
- src: "це було висвітлено і в засобах масової інформації"
- res: "сцеболовистітоно ів засовавнасавинсерматції"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.304878, loss: 76.661797
- wav: file://../cv-corpus-5.1-2020-06-22/uk/clips/common_voice_uk_21568626.wav
- src: "всі ці зірки для тебе сказав хлопчик і ударив дівчинку металевим тазіком по голові"
- res: "сицізяртідлетебе сказавни хлобчик юдаревдів чимкуметалевимтазіком поговолі"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.261364, loss: 76.638161
- wav: file://../cv-corpus-5.1-2020-06-22/uk/clips/common_voice_uk_22071941.wav
- src: "кабінет міністрів україни складає повноваження перед новообраною верховною радою україни"
- res: "кабіна міністрівукаїни колале повнваженя перебновообрануюварховли радийву країни"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.403846, loss: 76.634865
- wav: file://../cv-corpus-5.1-2020-06-22/uk/clips/common_voice_uk_21381457.wav
- src: "механізм формування агатів остаточно не встановлений"
- res: "махенізаформовання оатья востотачномистоновлими"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.415094, loss: 76.133347
- wav: file://../cv-corpus-5.1-2020-06-22/uk/clips/common_voice_uk_21567387.wav
- src: "засідання верховної ради україни проводяться відкрито"
- res: "засі веневорковмаградиукраїне проодізівікрипо"
--------------------------------------------------------------------------------
Worst WER:
--------------------------------------------------------------------------------
WER: 1.500000, CER: 0.266667, loss: 18.258444
- wav: file://../cv-corpus-5.1-2020-06-22/uk/clips/common_voice_uk_20900153.wav
- src: "вона віддасться"
- res: "пона віддас ця"
--------------------------------------------------------------------------------
WER: 1.500000, CER: 0.307692, loss: 15.984250
- wav: file://../cv-corpus-5.1-2020-06-22/uk/clips/common_voice_uk_22247322.wav
- src: "ескулап лікар"
- res: "е скула лліка"
--------------------------------------------------------------------------------
WER: 1.500000, CER: 0.277778, loss: 15.076320
- wav: file://../cv-corpus-5.1-2020-06-22/uk/clips/common_voice_uk_21582521.wav
- src: "цензура заборонена"
- res: "зан зура забороонено"
--------------------------------------------------------------------------------
WER: 1.666667, CER: 0.478261, loss: 42.762665
- wav: file://../cv-corpus-5.1-2020-06-22/uk/clips/common_voice_uk_21568871.wav
- src: "пегас символізує поезію"
- res: "веляс це волі зуя поєсі"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 0.333333, loss: 10.796988
- wav: file://../cv-corpus-5.1-2020-06-22/uk/clips/common_voice_uk_21563967.wav
- src: "легітимність"
- res: "вегі пимнсть"
--------------------------------------------------------------------------------
```
16. To export model for later usage:
```
mkdir model
# export .pb file
python3 DeepSpeech.py \
--train_cudnn \
--alphabet_config_path ./data_uk/alphabet.txt \
--load_checkpoint_dir ./checkpoints/uk_transfer_checkpoint \
--export_dir ./model \
--epochs 0
# export .tflite file for embedded usage
python3 DeepSpeech.py \
--train_cudnn \
--alphabet_config_path ./data_uk/alphabet.txt \
--load_checkpoint_dir ./checkpoints/uk_transfer_checkpoint \
--export_tflite --export_dir ./model \
--epochs 0
```
For advanced usage please refer to https://deepspeech.readthedocs.io/en/v0.9.1/USING.html
|