---
language: ar
datasets:
- common_voice
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
license: apache-2.0
model-index:
- name: Sinai Voice Arabic Speech Recognition Model
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice ar
      type: common_voice
      args: ar
    metrics:
    - name: Test WER
      type: wer
      value: 23.80
---

# Sinai Voice Arabic Speech Recognition Model

**Sinai Voice** (صوت سيناء) recognizes Modern Standard Arabic speech and transcribes it to text. It was built by fine-tuning [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on the Arabic portion of the [Common Voice](https://huggingface.co/datasets/common_voice) dataset.

Most of the evaluation code in this documentation is inspired by [elgeish/wav2vec2-large-xlsr-53-arabic](https://huggingface.co/elgeish/wav2vec2-large-xlsr-53-arabic).

Please install:

- [PyTorch](https://pytorch.org/)
- `$ pip3 install jiwer lang_trans torchaudio datasets transformers pandas tqdm`
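
The benchmark and evaluation scripts below assume a CUDA-capable GPU. As a quick sanity check of the environment (nothing here is model-specific), you can verify the installed versions and GPU visibility:

```python
import torch
import torchaudio
import transformers

# the benchmark/evaluation scripts move the model to "cuda";
# without a GPU, drop the .to("cuda") calls and expect slower inference
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```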

## Benchmark

We evaluated the model against several other Arabic STT Wav2Vec2 models.

**WER** (Word Error Rate): the lower the score, the better the model.
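
As a toy illustration of the metric (not part of the benchmark), jiwer can score a single reference/hypothesis pair directly:

```python
import jiwer

# one substituted word out of six -> WER = 1/6 ≈ 0.17
print(jiwer.wer("the cat sat on the mat", "the cat sat on the hat"))
```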

|    | Model                                 | [Using transliteration](https://pypi.org/project/lang-trans/) | WER      | Training Datasets                     |
|---:|:--------------------------------------|:--------------------------------------------------------------|---------:|:--------------------------------------|
|  1 | bakrianoo/sinai-voice-ar-stt          | True                                                           | 0.238001 | Common Voice 6                        |
|  2 | elgeish/wav2vec2-large-xlsr-53-arabic | True                                                           | 0.266527 | Common Voice 6 + Arabic Speech Corpus |
|  3 | othrif/wav2vec2-large-xlsr-arabic     | True                                                           | 0.298122 | Common Voice 6                        |
|  4 | bakrianoo/sinai-voice-ar-stt          | False                                                          | 0.448987 | Common Voice 6                        |
|  5 | othrif/wav2vec2-large-xlsr-arabic     | False                                                          | 0.464004 | Common Voice 6                        |
|  6 | anas/wav2vec2-large-xlsr-arabic       | True                                                           | 0.506191 | Common Voice 4                        |
|  7 | anas/wav2vec2-large-xlsr-arabic       | False                                                          | 0.622288 | Common Voice 4                        |
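
"Using transliteration" means both the reference and the prediction were mapped to Buckwalter transliteration before scoring, which makes the WER insensitive to some orthographic differences. A quick sketch of what `lang_trans` does (the comment shows the output we expect):

```python
from lang_trans.arabic import buckwalter

# maps Arabic script one-to-one onto Latin (Buckwalter) characters,
# e.g. "السلام عليكم" -> "AlslAm Elykm"
print(buckwalter.trans("السلام عليكم"))
```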

<details>
<summary>We used the following <b>code</b> to generate the results above</summary>

```python
import jiwer
import torch
from tqdm.auto import tqdm
import torchaudio
from datasets import load_dataset
from lang_trans.arabic import buckwalter
from transformers import set_seed, Wav2Vec2ForCTC, Wav2Vec2Processor
import pandas as pd

# load the test dataset
set_seed(42)
test_split = load_dataset("common_voice", "ar", split="test")

# init sample rate resamplers
resamplers = {  # all three sampling rates exist in the test split
    48000: torchaudio.transforms.Resample(48000, 16000),
    44100: torchaudio.transforms.Resample(44100, 16000),
    32000: torchaudio.transforms.Resample(32000, 16000),
}

# WER composer
transformation = jiwer.Compose([
    # normalize some diacritics, remove punctuation, and replace Persian letters with Arabic ones
    jiwer.SubstituteRegexes({
        r'[auiFNKo\~_،؟»\?;:\-,\.؛«!"]': "", "\u06D6": "",
        r"[\|\{]": "A", "p": "h", "ک": "k", "ی": "y"}),
    # default transformation below
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
    jiwer.SentencesToListOfWords(),
    jiwer.RemoveEmptyStrings(),
])

def prepare_example(example):
    speech, sampling_rate = torchaudio.load(example["path"])
    if sampling_rate in resamplers:
        example["speech"] = resamplers[sampling_rate](speech).squeeze().numpy()
    else:
        # fall back to building a resampler for any other sampling rate
        resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
        example["speech"] = resampler(speech).squeeze().numpy()
    return example

def predict(batch):
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        predicted = torch.argmax(model(inputs.input_values.to("cuda")).logits, dim=-1)
    predicted[predicted == -100] = processor.tokenizer.pad_token_id  # see fine-tuning script
    batch["predicted"] = processor.batch_decode(predicted)
    return batch

# prepare the test dataset
test_split = test_split.map(prepare_example)

stt_models = [
    "elgeish/wav2vec2-large-xlsr-53-arabic",
    "othrif/wav2vec2-large-xlsr-arabic",
    "anas/wav2vec2-large-xlsr-arabic",
    "bakrianoo/sinai-voice-ar-stt",
]

stt_results = []

for model_path in tqdm(stt_models):
    processor = Wav2Vec2Processor.from_pretrained(model_path)
    model = Wav2Vec2ForCTC.from_pretrained(model_path).to("cuda").eval()

    test_split_preds = test_split.map(predict, batched=True, batch_size=56, remove_columns=["speech"])

    # WER on the raw Arabic text
    orig_metrics = jiwer.compute_measures(
        truth=[s for s in test_split_preds["sentence"]],
        hypothesis=[s for s in test_split_preds["predicted"]],
        truth_transform=transformation,
        hypothesis_transform=transformation,
    )

    # WER after Buckwalter transliteration
    trans_metrics = jiwer.compute_measures(
        truth=[buckwalter.trans(s) for s in test_split_preds["sentence"]],
        hypothesis=[buckwalter.trans(s) for s in test_split_preds["predicted"]],
        truth_transform=transformation,
        hypothesis_transform=transformation,
    )

    stt_results.append({
        "model": model_path,
        "using_transliteration": True,
        "WER": trans_metrics["wer"],
    })

    stt_results.append({
        "model": model_path,
        "using_transliteration": False,
        "WER": orig_metrics["wer"],
    })

    del model
    del processor

stt_results_df = pd.DataFrame(stt_results)
stt_results_df = stt_results_df.sort_values("WER", axis=0, ascending=True)
stt_results_df.head(n=50)
```

</details>

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

dataset = load_dataset("common_voice", "ar", split="test[:10]")

resamplers = {  # all three sampling rates exist in the test split
    48000: torchaudio.transforms.Resample(48000, 16000),
    44100: torchaudio.transforms.Resample(44100, 16000),
    32000: torchaudio.transforms.Resample(32000, 16000),
}

def prepare_example(example):
    speech, sampling_rate = torchaudio.load(example["path"])
    if sampling_rate in resamplers:
        example["speech"] = resamplers[sampling_rate](speech).squeeze().numpy()
    else:
        # fall back to building a resampler for any other sampling rate
        resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
        example["speech"] = resampler(speech).squeeze().numpy()
    return example

dataset = dataset.map(prepare_example)
processor = Wav2Vec2Processor.from_pretrained("bakrianoo/sinai-voice-ar-stt")
model = Wav2Vec2ForCTC.from_pretrained("bakrianoo/sinai-voice-ar-stt").eval()

def predict(batch):
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        predicted = torch.argmax(model(inputs.input_values).logits, dim=-1)
    predicted[predicted == -100] = processor.tokenizer.pad_token_id  # see fine-tuning script
    batch["predicted"] = processor.tokenizer.batch_decode(predicted)
    return batch

dataset = dataset.map(predict, batched=True, batch_size=1, remove_columns=["speech"])

for reference, predicted in zip(dataset["sentence"], dataset["predicted"]):
    print("reference:", reference)
    print("predicted:", predicted)
    print("--")
```

Here's the output:

```
reference: ألديك قلم ؟
predicted: ألديك قلم
--
reference: ليست هناك مسافة على هذه الأرض أبعد من يوم أمس.
predicted: ليست نارك مسافة على هذه الأرض أبعد من يوم أمس
--
reference: إنك تكبر المشكلة.
predicted: إنك تكبر المشكلة
--
reference: يرغب أن يلتقي بك.
predicted: يرغب أن يلتقي بك
--
reference: إنهم لا يعرفون لماذا حتى.
predicted: إنهم لا يعرفون لماذا حتى
--
reference: سيسعدني مساعدتك أي وقت تحب.
predicted: سيسعدن مساعثتك أي وقد تحب
--
reference: أَحَبُّ نظريّة علمية إليّ هي أن حلقات زحل مكونة بالكامل من الأمتعة المفقودة.
predicted: أحب نظرية علمية إلي هي أن أحلقتز حلم كوينا بالكامل من الأمت عن المفقودة
--
reference: سأشتري له قلماً.
predicted: سأشتري له قلما
--
reference: أين المشكلة ؟
predicted: أين المشكل
--
reference: وَلِلَّهِ يَسْجُدُ مَا فِي السَّمَاوَاتِ وَمَا فِي الْأَرْضِ مِنْ دَابَّةٍ وَالْمَلَائِكَةُ وَهُمْ لَا يَسْتَكْبِرُونَ
predicted: ولله يسجد ما في السماوات وما في الأرض من دابة والملائكة وهم لا يستكبرون
```
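
To transcribe a recording of your own instead of Common Voice samples, the same steps apply. A minimal sketch, assuming a mono audio file at the hypothetical path `my_recording.wav`:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("bakrianoo/sinai-voice-ar-stt")
model = Wav2Vec2ForCTC.from_pretrained("bakrianoo/sinai-voice-ar-stt").eval()

# load the file and resample it to the 16 kHz rate the model expects
speech, sampling_rate = torchaudio.load("my_recording.wav")  # hypothetical path
speech = torchaudio.transforms.Resample(sampling_rate, 16000)(speech).squeeze().numpy()

inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    predicted = torch.argmax(model(inputs.input_values).logits, dim=-1)
print(processor.tokenizer.batch_decode(predicted)[0])
```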

## Evaluation

The model can be evaluated as follows on the Arabic test data of Common Voice:

```python
import jiwer
import torch
import torchaudio
from datasets import load_dataset
from lang_trans.arabic import buckwalter
from transformers import set_seed, Wav2Vec2ForCTC, Wav2Vec2Processor

set_seed(42)
test_split = load_dataset("common_voice", "ar", split="test")

resamplers = {  # all three sampling rates exist in the test split
    48000: torchaudio.transforms.Resample(48000, 16000),
    44100: torchaudio.transforms.Resample(44100, 16000),
    32000: torchaudio.transforms.Resample(32000, 16000),
}

def prepare_example(example):
    speech, sampling_rate = torchaudio.load(example["path"])
    if sampling_rate in resamplers:
        example["speech"] = resamplers[sampling_rate](speech).squeeze().numpy()
    else:
        # fall back to building a resampler for any other sampling rate
        resampler = torchaudio.transforms.Resample(sampling_rate, 16000)
        example["speech"] = resampler(speech).squeeze().numpy()
    return example

test_split = test_split.map(prepare_example)
processor = Wav2Vec2Processor.from_pretrained("bakrianoo/sinai-voice-ar-stt")
model = Wav2Vec2ForCTC.from_pretrained("bakrianoo/sinai-voice-ar-stt").to("cuda").eval()

def predict(batch):
    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
    with torch.no_grad():
        predicted = torch.argmax(model(inputs.input_values.to("cuda")).logits, dim=-1)
    predicted[predicted == -100] = processor.tokenizer.pad_token_id  # see fine-tuning script
    batch["predicted"] = processor.batch_decode(predicted)
    return batch

test_split = test_split.map(predict, batched=True, batch_size=16, remove_columns=["speech"])

transformation = jiwer.Compose([
    # normalize some diacritics, remove punctuation, and replace Persian letters with Arabic ones
    jiwer.SubstituteRegexes({
        r'[auiFNKo\~_،؟»\?;:\-,\.؛«!"]': "", "\u06D6": "",
        r"[\|\{]": "A", "p": "h", "ک": "k", "ی": "y"}),
    # default transformation below
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
    jiwer.SentencesToListOfWords(),
    jiwer.RemoveEmptyStrings(),
])

metrics = jiwer.compute_measures(
    truth=[buckwalter.trans(s) for s in test_split["sentence"]],  # Buckwalter transliteration
    hypothesis=[buckwalter.trans(s) for s in test_split["predicted"]],
    truth_transform=transformation,
    hypothesis_transform=transformation,
)
print(f"WER: {metrics['wer']:.2%}")
```

**Test Result**: 23.80%

**WER** (Word Error Rate): the lower the score, the better the model.
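
Depending on your jiwer version, `jiwer.compute_measures` also returns related measures alongside WER; if they are present in the returned dict, they can be printed the same way:

```python
# reported alongside "wer" by jiwer.compute_measures in jiwer 2.x
print(f"MER: {metrics['mer']:.2%}")  # match error rate
print(f"WIL: {metrics['wil']:.2%}")  # word information lost
```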

## Other Arabic Speech Recognition Models

Words are not enough to thank those who believe there is hope and strive for it:

- [elgeish/wav2vec2-large-xlsr-53-arabic](https://huggingface.co/elgeish/wav2vec2-large-xlsr-53-arabic)
- [othrif/wav2vec2-large-xlsr-arabic](https://huggingface.co/othrif/wav2vec2-large-xlsr-arabic)
- [anas/wav2vec2-large-xlsr-arabic](https://huggingface.co/anas/wav2vec2-large-xlsr-arabic)