metadata
base_model: facebook/w2v-bert-2.0
datasets:
- common_voice_10_0
metrics:
- wer
model-index:
- name: w2v-bert-2.0-uk
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: common_voice_10_0
type: common_voice_10_0
config: uk
split: test
args: uk
metrics:
- name: Wer
type: wer
value: 0.0655
license: apache-2.0
π¨π¨π¨ ATTENTION! π¨π¨π¨
Use an updated model: https://huggingface.co/Yehor/w2v-bert-uk-v2.1
w2v-bert-uk v1
Community
- Discord: https://discord.gg/yVAjkBgmt4
- Speech Recognition: https://t.me/speech_recognition_uk
- Speech Synthesis: https://t.me/speech_synthesis_uk
Google Colab
You can run this model using a Google Colab notebook: https://colab.research.google.com/drive/1QoKw2DWo5a5XYw870cfGE3dJf1WjZgrj?usp=sharing
Metrics
- AM:
- WER: 0.0727
- CER: 0.0151
- Accuracy: 92.73%
- AM + LM:
- WER: 0.0655
- CER: 0.0139
- Accuracy: 93.45%
Hyperparameters
This model was trained with the following hparams using 2 RTX A4000:
torchrun --standalone --nnodes=1 --nproc-per-node=2 ../train_w2v2_bert.py \
--custom_set ~/cv10/train.csv \
--custom_set_eval ~/cv10/test.csv \
--num_train_epochs 15 \
--tokenize_config . \
--w2v2_bert_model facebook/w2v-bert-2.0 \
--batch 4 \
--num_proc 5 \
--grad_accum 1 \
--learning_rate 3e-5 \
--logging_steps 20 \
--eval_step 500 \
--group_by_length \
--attention_dropout 0.0 \
--activation_dropout 0.05 \
--feat_proj_dropout 0.05 \
--feat_quantizer_dropout 0.0 \
--hidden_dropout 0.05 \
--layerdrop 0.0 \
--final_dropout 0.0 \
--mask_time_prob 0.0 \
--mask_time_length 10 \
--mask_feature_prob 0.0 \
--mask_feature_length 10
Usage
# pip install -U torch soundfile transformers
import torch
import soundfile as sf
from transformers import AutoModelForCTC, Wav2Vec2BertProcessor
# Config
model_name = 'Yehor/w2v-bert-2.0-uk'
device = 'cuda:1' # or cpu
sampling_rate = 16_000
# Load the model
asr_model = AutoModelForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2BertProcessor.from_pretrained(model_name)
paths = [
'sample1.wav',
]
# Extract audio
audio_inputs = []
for path in paths:
audio_input, _ = sf.read(path)
audio_inputs.append(audio_input)
# Transcribe the audio
inputs = processor(audio_inputs, sampling_rate=sampling_rate).input_features
features = torch.tensor(inputs).to(device)
with torch.no_grad():
logits = asr_model(features).logits
predicted_ids = torch.argmax(logits, dim=-1)
predictions = processor.batch_decode(predicted_ids)
# Log results
print('Predictions:')
print(predictions)