---
language: fa
datasets:
- common_voice_6_1
tags:
- audio
- automatic-speech-recognition
license: mit
widget:
- example_title: Common Voice Sample 1
  src: https://datasets-server.huggingface.co/assets/common_voice/--/fa/train/0/audio/audio.mp3
- example_title: Common Voice Sample 2
  src: https://datasets-server.huggingface.co/assets/common_voice/--/fa/train/1/audio/audio.mp3
model-index:
- name: Sharif-wav2vec2
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice Corpus 6.1 (clean)
      type: common_voice_6_1
      config: clean
      split: test
      args:
        language: fa
    metrics:
    - name: Test WER
      type: wer
      value: 6.0
---

# Sharif-wav2vec2

This is the fine-tuned version of Sharif Wav2vec2 for Farsi. The base model was fine-tuned on 108 hours of Common Voice Farsi samples with a sampling rate of 16 kHz. Afterward, we trained a 5-gram language model using the [kenlm](https://github.com/kpu/kenlm) toolkit and used it in the processor, which increased our accuracy on online ASR.

When using the model, make sure that your speech input is sampled at 16 kHz. Before use, you may need to install the dependencies below:

```shell
pip install pyctcdecode
pip install pypi-kenlm
```

For testing, you can use the hosted inference API on Hugging Face (examples from Common Voice are provided); it may take a while to transcribe the given audio.
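Alternatively, you can use the code below to run the model locally:

```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForCTC

processor = AutoProcessor.from_pretrained("SLPL/Sharif-wav2vec2")
model = AutoModelForCTC.from_pretrained("SLPL/Sharif-wav2vec2")

# Load the audio file; torchaudio returns a (channels, samples) tensor.
speech_array, sampling_rate = torchaudio.load("path/to/your.wav")

# The model expects 16 kHz input, so resample if the file uses another rate.
target_rate = processor.feature_extractor.sampling_rate
if sampling_rate != target_rate:
    speech_array = torchaudio.functional.resample(
        speech_array, sampling_rate, target_rate)

# Drop the channel dimension (assumes mono audio) and convert to NumPy.
speech_array = speech_array.squeeze().numpy()

features = processor(
    speech_array,
    sampling_rate=target_rate,
    return_tensors="pt",
    padding=True)

with torch.no_grad():
    logits = model(
        features.input_values,
        attention_mask=features.attention_mask).logits

# Decode with the 5-gram language model bundled in the processor.
prediction = processor.batch_decode(logits.numpy()).text

print(prediction[0])
# تست
```

*Result (WER)*:

| "clean" | "other" |
|---|---|
| 3.4 | 8.6 |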
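
For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that definition in plain Python. The function name `word_error_rate` is hypothetical, and this is not the evaluation script used to produce the numbers above; it applies no text normalization, which a real evaluation would likely need.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length.

    Hypothetical helper for illustration only; tokenization is a plain
    whitespace split with no normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Calling `word_error_rate(reference_text, prediction[0])` on a single utterance returns a fraction; a corpus-level figure is usually computed by summing edit distances and reference lengths over all utterances before dividing.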