---
language: fa
datasets:
- common_voice_6_1
tags:
- audio
- automatic-speech-recognition
license: mit
widget:
- example_title: Common Voice Sample 1
  src: https://datasets-server.huggingface.co/assets/common_voice/--/fa/train/0/audio/audio.mp3
- example_title: Common Voice Sample 2
  src: https://datasets-server.huggingface.co/assets/common_voice/--/fa/train/1/audio/audio.mp3
model-index:
- name: Sharif-wav2vec2
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Common Voice Corpus 6.1 (clean)
      type: common_voice_6_1
      config: clean
      split: test
      args: 
        language: fa
    metrics:
    - name: Test WER
      type: wer
      value: 6.0
---

# Sharif-wav2vec2

This is a fine-tuned version of Sharif Wav2vec2 for Farsi. The base model was fine-tuned on 108 hours of Common Voice Farsi samples at a sampling rate of 16 kHz. Afterward, we trained a 5-gram language model with the [kenlm](https://github.com/kpu/kenlm) toolkit and used it in the processor, which increased our accuracy on online ASR.

## Usage 

When using the model, make sure that your speech input is sampled at 16 kHz. Before using it, you may need to install the following dependencies:

```shell
pip install pyctcdecode
pip install pypi-kenlm
```

For testing, you can use the hosted inference API on Hugging Face (examples from Common Voice are provided); it may take a while to transcribe the given audio. Alternatively, you can use the code below for a local run:

```python
import torch
import torchaudio

from transformers import AutoProcessor, AutoModelForCTC

processor = AutoProcessor.from_pretrained("SLPL/Sharif-wav2vec2")
model = AutoModelForCTC.from_pretrained("SLPL/Sharif-wav2vec2")

speech_array, sampling_rate = torchaudio.load("path/to/your.wav")
speech_array = speech_array.squeeze().numpy()

features = processor(
    speech_array,
    sampling_rate=processor.feature_extractor.sampling_rate,
    return_tensors="pt",
    padding=True)

with torch.no_grad():
    logits = model(
        features.input_values,
        attention_mask=features.attention_mask).logits
    prediction = processor.batch_decode(logits.numpy()).text

print(prediction[0])
# تست
```

## Evaluation
To evaluate the model, prepare a CSV file in the format below and run the evaluation script that follows.

*Input CSV file format:*

| path | reference |
|---|---|
| path to the audio file | corresponding transcription |

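As a rough illustration of the input format, a CSV file with the `path` and `reference` columns could be written with Python's standard `csv` module. This is only a sketch: the function name `write_manifest` and the sample rows are hypothetical placeholders, not files shipped with the model.

```python
import csv
import os
import tempfile


def write_manifest(rows, csv_path):
    """Write (audio_path, transcription) pairs as a path/reference CSV."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "reference"])
        writer.writeheader()
        for audio_path, transcription in rows:
            writer.writerow({"path": audio_path, "reference": transcription})


if __name__ == "__main__":
    # Placeholder entries; replace them with your own audio paths and transcripts.
    rows = [
        ("clips/sample_0.wav", "transcription of sample 0"),
        ("clips/sample_1.wav", "transcription of sample 1"),
    ]
    out_path = os.path.join(tempfile.mkdtemp(), "test.csv")
    write_manifest(rows, out_path)
    print(out_path)
```

UTF-8 encoding is set explicitly so that Farsi transcriptions are written correctly regardless of the platform's default encoding.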
```python
import torch
import torchaudio
import librosa
from datasets import load_dataset, load_metric
import numpy as np
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

model = Wav2Vec2ForCTC.from_pretrained("SLPL/Sharif-wav2vec2") 
processor = Wav2Vec2ProcessorWithLM.from_pretrained("SLPL/Sharif-wav2vec2") 

def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    speech_array = speech_array.squeeze().numpy()
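    # Resample to the rate the model expects; keyword arguments are required by recent librosa releases.
    speech_array = librosa.resample(
        np.asarray(speech_array),
        orig_sr=sampling_rate,
        target_sr=processor.feature_extractor.sampling_rate,
    )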
    batch["speech"] = speech_array
    return batch

def predict(batch):
    features = processor(
        batch["speech"], 
        sampling_rate=processor.feature_extractor.sampling_rate, 
        return_tensors="pt", 
        padding=True
    )
    
    input_values = features.input_values
    attention_mask = features.attention_mask

    with torch.no_grad():
        # The LM-backed processor decodes the full logits, not argmax(logits).
        logits = model(input_values, attention_mask=attention_mask).logits
    batch["prediction"] = processor.batch_decode(logits.numpy()).text
    return batch
    
dataset = load_dataset("csv", data_files={"test":"path/to/your.csv"}, delimiter=",")["test"] 
dataset = dataset.map(speech_file_to_array_fn)

result = dataset.map(predict, batched=True, batch_size=4)
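# Note: `load_metric` is deprecated in recent `datasets` releases;
# `evaluate.load` from the `evaluate` package is its replacement.
wer = load_metric("wer")
cer = load_metric("cer")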

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["prediction"], references=result["reference"])))
print("CER: {:.2f}".format(100 * cer.compute(predictions=result["prediction"], references=result["reference"])))
```

*Result (WER)*:

| clean | other |
|---|---|
| 6.0 | 16.4 |
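
For readers unfamiliar with the metric, WER is the word-level edit distance between prediction and reference, divided by the number of reference words; CER is the same quantity over characters. The sketch below illustrates that definition in plain Python. It is not the implementation behind `load_metric`, and details such as text normalization or whitespace handling may differ from it. The names `edit_distance` and `error_rate` are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def error_rate(references: List[str], predictions: List[str], unit: str = "word") -> float:
    """Corpus-level error rate: total edits divided by total reference units."""
    errors = total = 0
    for ref, hyp in zip(references, predictions):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        errors += edit_distance(ref_units, hyp_units)
        total += len(ref_units)
    return errors / total


if __name__ == "__main__":
    refs = ["the cat sat down"]
    hyps = ["the bat sat"]
    # One substitution and one deletion against four reference words.
    print("WER: {:.2f}".format(100 * error_rate(refs, hyps)))  # WER: 50.00
```

Passing `unit="char"` to `error_rate` gives the character-level counterpart under the same definition.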


## Citation
If you want to cite this model you can use this:

```bibtex
?
```