|
---
base_model: facebook/w2v-bert-2.0
datasets:
- common_voice_10_0
metrics:
- wer
model-index:
- name: w2v-bert-2.0-uk
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common_voice_10_0
      type: common_voice_10_0
      config: uk
      split: test
      args: uk
    metrics:
    - name: Wer
      type: wer
      value: 0.0655
---
|
|
|
# wav2vec2-bert-uk |
|
|
|
🇺🇦 Join our **Discord server** - https://discord.gg/nmUCXz55 - where we talk about Data Science, ML, DL, and AI.
|
|
|
🇺🇦 Join our Speech Recognition Group on Telegram: https://t.me/speech_recognition_uk
|
|
|
## Metrics |
|
|
|
Evaluated on the Common Voice 10 `uk` test split:

- AM (acoustic model only):
  - WER: 0.0727
  - CER: 0.0151
  - Accuracy: 92.73%
- AM + LM (decoding with a language model):
  - WER: 0.0655
  - CER: 0.0139
  - Accuracy: 93.45%
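
To reproduce WER/CER numbers like these for your own outputs, here is a minimal sketch using the `evaluate` library (not part of this repository; the prediction and reference strings below are placeholders):

```python
# pip install evaluate jiwer
import evaluate

# Placeholder texts; substitute your model transcriptions and ground-truth references.
predictions = ['привіт світ']
references = ['привіт світе']

wer = evaluate.load('wer').compute(predictions=predictions, references=references)
cer = evaluate.load('cer').compute(predictions=predictions, references=references)

print(f'WER: {wer:.4f}, CER: {cer:.4f}')
```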
|
|
|
## Hyperparameters |
|
|
|
This model was trained with the following hyperparameters on 2× RTX A4000 GPUs:
|
|
|
``` |
|
torchrun --standalone --nnodes=1 --nproc-per-node=2 ../train_w2v2_bert.py \
  --custom_set ~/cv10/train.csv \
  --custom_set_eval ~/cv10/test.csv \
  --num_train_epochs 15 \
  --tokenize_config . \
  --w2v2_bert_model facebook/w2v-bert-2.0 \
  --batch 4 \
  --num_proc 5 \
  --grad_accum 1 \
  --learning_rate 3e-5 \
  --logging_steps 20 \
  --eval_step 500 \
  --group_by_length \
  --attention_dropout 0.0 \
  --activation_dropout 0.05 \
  --feat_proj_dropout 0.05 \
  --feat_quantizer_dropout 0.0 \
  --hidden_dropout 0.05 \
  --layerdrop 0.0 \
  --final_dropout 0.0 \
  --mask_time_prob 0.0 \
  --mask_time_length 10 \
  --mask_feature_prob 0.0 \
  --mask_feature_length 10
|
``` |
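
Assuming `--batch` is the per-device batch size, the effective batch size is 4 × 2 GPUs × 1 gradient-accumulation step = 8 samples per optimizer step.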
|
|
|
## Usage |
|
|
|
```python |
|
# pip install -U torch soundfile transformers |
|
|
|
import torch |
|
import soundfile as sf |
|
from transformers import AutoModelForCTC, Wav2Vec2BertProcessor |
|
|
|
# Config |
|
model_name = 'Yehor/w2v-bert-2.0-uk' |
|
device = 'cuda:0'  # or 'cpu'
|
sampling_rate = 16_000 |
|
|
|
# Load the model |
|
asr_model = AutoModelForCTC.from_pretrained(model_name).to(device)
asr_model.eval()  # disable dropout for inference
|
processor = Wav2Vec2BertProcessor.from_pretrained(model_name) |
|
|
|
paths = [ |
|
'sample1.wav', |
|
] |
|
|
|
# Extract audio |
|
audio_inputs = [] |
|
for path in paths: |
|
audio_input, _ = sf.read(path) |
|
audio_inputs.append(audio_input) |
|
|
|
# Transcribe the audio |
|
inputs = processor(audio_inputs, sampling_rate=sampling_rate, padding=True, return_tensors='pt')
features = inputs.input_features.to(device)
|
|
|
with torch.no_grad(): |
|
logits = asr_model(features).logits |
|
|
|
predicted_ids = torch.argmax(logits, dim=-1) |
|
predictions = processor.batch_decode(predicted_ids) |
|
|
|
# Log outputs |
|
print('---') |
|
print('Predictions:') |
|
print(predictions) |
|
|
print('---') |
|
``` |
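
Note that `sf.read` returns audio at the file's native sampling rate and does not resample, while the model expects 16 kHz input. If your files may differ, resample before calling the processor; here is a minimal sketch using `torchaudio` (an extra dependency, not required by the snippet above):

```python
# pip install -U torchaudio
import torchaudio

def load_16k(path):
    # Read the file and resample to the 16 kHz the model expects.
    waveform, sr = torchaudio.load(path)
    if sr != 16_000:
        waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16_000)
    # Assumes mono audio; mix down multi-channel files yourself if needed.
    return waveform.squeeze(0).numpy()

audio_inputs = [load_16k(path) for path in paths]
```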
|
|
|
### Licenses |
|
|
|
- Acoustic Model: Apache 2.0
- Language Model (from https://huggingface.co/Yehor/kenlm-ukrainian): CC BY-NC-SA 4.0
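
The AM + LM metrics above come from decoding with the KenLM language model linked above. Here is a minimal sketch of such decoding with `pyctcdecode`, reusing `processor`, `asr_model`, and `features` from the Usage section (the LM filename and the `alpha`/`beta` weights are assumptions, not values from this card):

```python
# pip install pyctcdecode kenlm
import torch
from pyctcdecode import build_ctcdecoder

# Order the vocabulary by token id so label indices line up with the logits.
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
# Depending on the tokenizer, special tokens (e.g. the pad/blank token) may
# need to be remapped before building the decoder.

# Hypothetical path and weights; download a model from Yehor/kenlm-ukrainian.
decoder = build_ctcdecoder(labels, kenlm_model_path='lm.binary', alpha=0.5, beta=1.5)

with torch.no_grad():
    logits = asr_model(features).logits

# pyctcdecode expects per-utterance logits as numpy arrays of shape (time, vocab).
texts = [decoder.decode(utt.cpu().numpy()) for utt in logits]
print(texts)
```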
|
|