wav2vec2-large-xlsr-53-arabic / README.md

elgeish

Updated metadata

b5e6df1 over 2 years ago

preview code

raw

history blame contribute delete

No virus

8.86 kB

	---
	language: ar
	datasets:
	- arabic_speech_corpus
	- mozilla-foundation/common_voice_6_1
	metrics:
	- wer
	tags:
	- audio
	- automatic-speech-recognition
	- speech
	- xlsr-fine-tuning-week
	- hf-asr-leaderboard
	license: apache-2.0
	model-index:
	- name: elgeish-wav2vec2-large-xlsr-53-arabic
	results:
	- task:
	name: Automatic Speech Recognition
	type: automatic-speech-recognition
	dataset:
	name: Common Voice 6.1 (Arabic)
	type: mozilla-foundation/common_voice_6_1
	config: ar
	split: test
	args:
	language: ar
	metrics:
	- name: Test WER
	type: wer
	value: 26.55
	- name: Validation WER
	type: wer
	value: 23.39
	---

	# Wav2Vec2-Large-XLSR-53-Arabic

	Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
	on Arabic using the `train` splits of [Common Voice](https://huggingface.co/datasets/common_voice)
	and [Arabic Speech Corpus](https://huggingface.co/datasets/arabic_speech_corpus).
	When using this model, make sure that your speech input is sampled at 16kHz.

	## Usage

	The model can be used directly (without a language model) as follows:

	```python
	import torch
	import torchaudio
	from datasets import load_dataset
	from lang_trans.arabic import buckwalter
	from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

	dataset = load_dataset("common_voice", "ar", split="test[:10]")
	resamplers = { # all three sampling rates exist in test split
	48000: torchaudio.transforms.Resample(48000, 16000),
	44100: torchaudio.transforms.Resample(44100, 16000),
	32000: torchaudio.transforms.Resample(32000, 16000),
	}

	def prepare_example(example):
	speech, sampling_rate = torchaudio.load(example["path"])
	example["speech"] = resamplers[sampling_rate](speech).squeeze().numpy()
	return example

	dataset = dataset.map(prepare_example)
	processor = Wav2Vec2Processor.from_pretrained("elgeish/wav2vec2-large-xlsr-53-arabic")
	model = Wav2Vec2ForCTC.from_pretrained("elgeish/wav2vec2-large-xlsr-53-arabic").eval()

	def predict(batch):
	inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
	with torch.no_grad():
	predicted = torch.argmax(model(inputs.input_values).logits, dim=-1)
	predicted[predicted == -100] = processor.tokenizer.pad_token_id # see fine-tuning script
	batch["predicted"] = processor.tokenizer.batch_decode(predicted)
	return batch

	dataset = dataset.map(predict, batched=True, batch_size=1, remove_columns=["speech"])

	for reference, predicted in zip(dataset["sentence"], dataset["predicted"]):
	print("reference:", reference)
	print("predicted:", buckwalter.untrans(predicted))
	print("--")
	```

	Here's the output:

	```
	reference: ألديك قلم ؟
	predicted: هلديك قالر
	--
	reference: ليست هناك مسافة على هذه الأرض أبعد من يوم أمس.
	predicted: ليست نالك مسافة على هذه الأرض أبعد من يوم أمس
	--
	reference: إنك تكبر المشكلة.
	predicted: إنك تكبر المشكلة
	--
	reference: يرغب أن يلتقي بك.
	predicted: يرغب أن يلتقي بك
	--
	reference: إنهم لا يعرفون لماذا حتى.
	predicted: إنهم لا يعرفون لماذا حتى
	--
	reference: سيسعدني مساعدتك أي وقت تحب.
	predicted: سيسئدني مساعد سكرأي وقت تحب
	--
	reference: أَحَبُّ نظريّة علمية إليّ هي أن حلقات زحل مكونة بالكامل من الأمتعة المفقودة.
	predicted: أحب ناضريةً علمية إلي هي أنحل قتزح المكونا بالكامل من الأمت عن المفقودة
	--
	reference: سأشتري له قلماً.
	predicted: سأشتري له قلما
	--
	reference: أين المشكلة ؟
	predicted: أين المشكل
	--
	reference: وَلِلَّهِ يَسْجُدُ مَا فِي السَّمَاوَاتِ وَمَا فِي الْأَرْضِ مِنْ دَابَّةٍ وَالْمَلَائِكَةُ وَهُمْ لَا يَسْتَكْبِرُونَ
	predicted: ولله يسجد ما في السماوات وما في الأرض من دابة والملائكة وهم لا يستكبرون
	--
	```

	## Evaluation

	The model can be evaluated as follows on the Arabic test data of Common Voice:

	```python
	import jiwer
	import torch
	import torchaudio
	from datasets import load_dataset
	from lang_trans.arabic import buckwalter
	from transformers import set_seed, Wav2Vec2ForCTC, Wav2Vec2Processor

	set_seed(42)
	test_split = load_dataset("common_voice", "ar", split="test")
	resamplers = { # all three sampling rates exist in test split
	48000: torchaudio.transforms.Resample(48000, 16000),
	44100: torchaudio.transforms.Resample(44100, 16000),
	32000: torchaudio.transforms.Resample(32000, 16000),
	}

	def prepare_example(example):
	speech, sampling_rate = torchaudio.load(example["path"])
	example["speech"] = resamplers[sampling_rate](speech).squeeze().numpy()
	return example

	test_split = test_split.map(prepare_example)
	processor = Wav2Vec2Processor.from_pretrained("elgeish/wav2vec2-large-xlsr-53-arabic")
	model = Wav2Vec2ForCTC.from_pretrained("elgeish/wav2vec2-large-xlsr-53-arabic").to("cuda").eval()

	def predict(batch):
	inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
	with torch.no_grad():
	predicted = torch.argmax(model(inputs.input_values.to("cuda")).logits, dim=-1)
	predicted[predicted == -100] = processor.tokenizer.pad_token_id # see fine-tuning script
	batch["predicted"] = processor.batch_decode(predicted)
	return batch

	test_split = test_split.map(predict, batched=True, batch_size=16, remove_columns=["speech"])
	transformation = jiwer.Compose([
	# normalize some diacritics, remove punctuation, and replace Persian letters with Arabic ones
	jiwer.SubstituteRegexes({
	r'[auiFNKo\~_،؟»\?;:\-,\.؛«!"]': "", "\u06D6": "",
	r"[\\|\{]": "A", "p": "h", "ک": "k", "ی": "y"}),
	# default transformation below
	jiwer.RemoveMultipleSpaces(),
	jiwer.Strip(),
	jiwer.SentencesToListOfWords(),
	jiwer.RemoveEmptyStrings(),
	])
	metrics = jiwer.compute_measures(
	truth=[buckwalter.trans(s) for s in test_split["sentence"]], # Buckwalter transliteration
	hypothesis=test_split["predicted"],
	truth_transform=transformation,
	hypothesis_transform=transformation,
	)
	print(f"WER: {metrics['wer']:.2%}")
	```

	Test Result: 26.55%

	## Training

	For more details, see [Fine-Tuning with Arabic Speech Corpus](https://github.com/huggingface/transformers/tree/1c06240e1b3477728129bb58e7b6c7734bb5074e/examples/research_projects/wav2vec2#fine-tuning-with-arabic-speech-corpus).

	This model represents Arabic in a format called [Buckwalter transliteration](https://en.wikipedia.org/wiki/Buckwalter_transliteration).
	The Buckwalter format only includes ASCII characters, some of which are non-alpha (e.g., `">"` maps to `"أ"`).
	The [lang-trans](https://github.com/kariminf/lang-trans) package is used to convert (transliterate) Arabic abjad.

	[This script](https://github.com/huggingface/transformers/blob/1c06240e1b3477728129bb58e7b6c7734bb5074e/examples/research_projects/wav2vec2/finetune_large_xlsr_53_arabic_speech_corpus.sh)
	was used to first fine-tune [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
	on the `train` split of the [Arabic Speech Corpus](https://huggingface.co/datasets/arabic_speech_corpus) dataset;
	the `test` split was used for model selection; the resulting model at this point is saved as [elgeish/wav2vec2-large-xlsr-53-levantine-arabic](https://huggingface.co/elgeish/wav2vec2-large-xlsr-53-levantine-arabic).

	Training was then resumed using the `train` split of the [Common Voice](https://huggingface.co/datasets/common_voice) dataset;
	the `validation` split was used for model selection;
	training was stopped to meet the deadline of [Fine-Tune-XLSR Week](https://github.com/huggingface/transformers/blob/700229f8a4003c4f71f29275e0874b5ba58cd39d/examples/research_projects/wav2vec2/FINE_TUNE_XLSR_WAV2VEC2.md):
	this model is the checkpoint at 100k steps and a validation WER of 23.39%.

	<img src="https://huggingface.co/elgeish/wav2vec2-large-xlsr-53-arabic/raw/main/validation_wer.png" alt="Validation WER" width="100%" />

	It's worth noting that validation WER is trending down, indicating the potential of further training (resuming the decaying learning rate at 7e-6).

	## Future Work
	One area to explore is using `attention_mask` in model input, which is recommended [here](https://huggingface.co/blog/fine-tune-xlsr-wav2vec2).
	Also, exploring data augmentation using datasets used to train models listed [here](https://paperswithcode.com/sota/speech-recognition-on-common-voice-arabic).