---
language: fa
datasets:
- common_voice_6_1
tags:
- audio
- automatic-speech-recognition
license: mit
widget:
- example_title: Common Voice Sample 1
  src: https://datasets-server.huggingface.co/assets/common_voice/--/fa/train/0/audio/audio.mp3
- example_title: Common Voice Sample 2
  src: https://datasets-server.huggingface.co/assets/common_voice/--/fa/train/1/audio/audio.mp3
model-index:
- name: Sharif-wav2vec2
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice Corpus 6.1
      type: common_voice_6_1
      config: fa
      split: test
      args:
        language: fa
    metrics:
    - name: Test WER
      type: wer
      value: 6.0
---

# Sharif-wav2vec2

This is the fine-tuned version of Sharif wav2vec 2.0 for Farsi. The base model was fine-tuned on 108 hours of Common Voice Farsi samples at a 16 kHz sampling rate. When using the model, make sure that your speech input is also sampled at 16 kHz.

Before using the model, you may need to install the dependencies below:

```shell
pip -q install pyctcdecode
python -m pip -q install pypi-kenlm
```

For testing, you can use the hosted inference API on Hugging Face (examples from Common Voice are provided); it may take a while to transcribe the given audio. Alternatively, you can run the model locally with the code below:

```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForCTC

# load the processor (feature extractor, tokenizer, and language-model decoder) and the acoustic model
processor = AutoProcessor.from_pretrained("SLPL/Sharif-wav2vec2")
model = AutoModelForCTC.from_pretrained("SLPL/Sharif-wav2vec2")

# load a 16 kHz audio file
speech_array, sampling_rate = torchaudio.load("path/to/your.wav")
speech_array = speech_array.squeeze().numpy()

features = processor(
    speech_array,
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
    padding=True)

with torch.no_grad():
    logits = model(
        features.input_values,
        attention_mask=features.attention_mask).logits

prediction = processor.batch_decode(logits.numpy()).text

print(prediction[0])
# تست
```

# Base model

Sharif-wav2vec2 is fine-tuned from wav2vec 2.0 by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. The original model can be found at https://github.com/pytorch/fairseq/tree/master/examples/wav2vec#wav2vec-20.

# Usage

To transcribe Persian audio files, the model can be used as a standalone acoustic model as follows:

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("SLPL/Sharif-wav2vec2")
model = Wav2Vec2ForCTC.from_pretrained("SLPL/Sharif-wav2vec2")

# load a dummy dataset and read sound files
# (English demo data, used only to illustrate the API; replace with your own Persian audio)
ds = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")

# tokenize
input_values = processor(ds[0]["audio"]["array"], return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```
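The snippets above assume audio that is already sampled at 16 kHz. If your recordings use a different sampling rate, they should be resampled before being passed to the processor. A minimal sketch using torchaudio (the file path is a placeholder):

```python
import torchaudio

# load an audio file at its native sampling rate
speech_array, sampling_rate = torchaudio.load("path/to/your.wav")

# resample to the 16 kHz rate the model was trained on, if necessary
if sampling_rate != 16_000:
    resampler = torchaudio.transforms.Resample(orig_freq=sampling_rate, new_freq=16_000)
    speech_array = resampler(speech_array)

# collapse to a 1-D numpy array, as expected by the processor
speech_array = speech_array.squeeze().numpy()
```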
## Evaluation

This code snippet shows how the base English model, **facebook/wav2vec2-base-960h**, can be evaluated on LibriSpeech's "clean" and "other" test data.

```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
from jiwer import wer

librispeech_eval = load_dataset("librispeech_asr", "clean", split="test")

model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

def map_to_pred(batch):
    # batch["audio"] is a list of audio dicts in batched mode
    input_values = processor(
        [x["array"] for x in batch["audio"]],
        sampling_rate=16_000,
        return_tensors="pt",
        padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits

    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)
    batch["transcription"] = transcription
    return batch

result = librispeech_eval.map(map_to_pred, batched=True, batch_size=1, remove_columns=["audio"])

print("WER:", wer(result["text"], result["transcription"]))
```

*Result (WER of facebook/wav2vec2-base-960h)*:

| "clean" | "other" |
|---|---|
| 3.4 | 8.6 |
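To evaluate Sharif-wav2vec2 itself on Persian speech, the same recipe can be adapted to Common Voice. The sketch below is only an illustration and makes several assumptions: the gated `mozilla-foundation/common_voice_6_1` dataset with the `fa` config is accessible (authentication may be required), the `sentence` column is used as the reference text, and no text normalization is applied, so the score may differ from the reported 6.0 WER:

```python
from datasets import load_dataset, Audio
from transformers import AutoModelForCTC, AutoProcessor
import torch
from jiwer import wer

# Persian test split of Common Voice 6.1 (assumed dataset id and config)
cv_fa = load_dataset("mozilla-foundation/common_voice_6_1", "fa", split="test")
# Common Voice audio ships at 48 kHz; resample on the fly to the 16 kHz the model expects
cv_fa = cv_fa.cast_column("audio", Audio(sampling_rate=16_000))

processor = AutoProcessor.from_pretrained("SLPL/Sharif-wav2vec2")
model = AutoModelForCTC.from_pretrained("SLPL/Sharif-wav2vec2").to("cuda")

def map_to_pred(batch):
    features = processor(
        batch["audio"]["array"],
        sampling_rate=16_000,
        return_tensors="pt",
        padding=True)
    with torch.no_grad():
        logits = model(features.input_values.to("cuda")).logits
    # the processor bundles a pyctcdecode/KenLM decoder, so decoding works directly on the logits
    batch["prediction"] = processor.batch_decode(logits.cpu().numpy()).text[0]
    return batch

result = cv_fa.map(map_to_pred, remove_columns=["audio"])

print("WER:", wer(result["sentence"], result["prediction"]))
```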