---
license: cc-by-4.0
metrics:
- cer
pipeline_tag: automatic-speech-recognition
datasets:
- ivangtorre/second_americas_nlp_2022
tags:
- audio
- automatic-speech-recognition
- speech
- quechua
- xlsr-fine-tuning
model-index:
- name: Wav2Vec2 XLSR 300M Quechua Model by M Romero and Ivan G Torre
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Americas NLP 2022 Quechua
      type: second_americas_nlp_2022
      args: Quechua
    metrics:
    - name: Test CER
      type: cer
      value: 49.2

---

This model was fine-tuned from the 300M-parameter wav2vec 2.0 XLS-R model on the Quechua training partition of the Americas NLP 2022 dataset. The challenge took place during NeurIPS 2022.



## Example of usage

The model can be used directly (without a language model) as follows:

```python
import soundfile as sf
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("ivangtorre/wav2vec2-xlsr-300m-quechua")
model = Wav2Vec2ForCTC.from_pretrained("ivangtorre/wav2vec2-xlsr-300m-quechua")

# Path to wav file (mono audio; XLS-R was pretrained on 16 kHz speech)
pathfile = "/path/to/wavfile"

# Load and normalize the file
wav, curr_sample_rate = sf.read(pathfile, dtype="float32")
feats = torch.from_numpy(wav).float()
with torch.no_grad():
    # zero mean, unit variance over the whole utterance
    feats = F.layer_norm(feats, feats.shape)
    feats = torch.unsqueeze(feats, 0)  # add the batch dimension
    logits = model(feats).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print("HF prediction: ", transcription)
```
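
The "take argmax and decode" step above is greedy CTC decoding: pick the most likely token per frame, merge consecutive repeats, then drop the blank (padding) token. The sketch below illustrates that idea in plain Python. It is not the processor's actual implementation, and `toy_vocab`, `blank_id` and the frame ids are hypothetical values chosen for illustration; the real model ships its own vocabulary.

```python
from itertools import groupby


def greedy_ctc_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeated ids, then drop the blank id."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)


# Hypothetical toy vocabulary (assumption, not the model's real one).
toy_vocab = {0: "<pad>", 1: "a", 2: "l", 3: "p"}

# Per-frame argmax ids: a blank between the two "l" runs keeps both letters.
frame_ids = [1, 1, 0, 2, 2, 0, 2, 3, 3, 1]
print(greedy_ctc_decode(frame_ids, toy_vocab))  # prints "allpa"
```

Note that a blank between two runs of the same id is what allows a doubled letter to survive the collapse.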


This code snippet shows how to evaluate wav2vec2-xlsr-300m-quechua on the [Second Americas NLP 2022 Quechua dev set](https://huggingface.co/datasets/ivangtorre/second_americas_nlp_2022):

```python
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import torch.nn.functional as F
from jiwer import cer
import soundfile as sf

americasnlp = load_dataset("ivangtorre/second_americas_nlp_2022", "quechua", split="dev")
quechua = americasnlp.filter(lambda example: example["subset"] == "quechua")

model = Wav2Vec2ForCTC.from_pretrained("ivangtorre/wav2vec2-xlsr-300m-quechua")
processor = Wav2Vec2Processor.from_pretrained("ivangtorre/wav2vec2-xlsr-300m-quechua")

def map_to_pred(batch):
    # With batched=True and batch_size=1, each field is a one-element list
    wav = batch["audio"][0]["array"]
    feats = torch.from_numpy(wav).float()
    feats = F.layer_norm(feats, feats.shape) # Normalization performed during finetuning
    feats = torch.unsqueeze(feats, 0)
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(feats).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    batch["transcription"] = processor.batch_decode(predicted_ids)
    return batch

result = quechua.map(map_to_pred, batched=True, batch_size=1)

print("CER:", cer(result["source_processed"], result["transcription"]))
```
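
For readers unfamiliar with the metric, the character error rate is the number of character-level edits (insertions, deletions, substitutions) needed to turn the hypothesis into the reference, divided by the number of reference characters. The following is a minimal plain-Python sketch of that definition, aggregated over all utterances. It is an illustration only, not `jiwer`'s code; the helper names and the example strings are assumptions made for this sketch.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(references, hypotheses):
    """Total character edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# Hypothetical example strings: one missing character over nine reference characters.
print(char_error_rate(["allpa", "wasi"], ["alpa", "wasi"]))
```

The result is a fraction; multiply by 100 to express it as a percentage.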

## Citation

```bibtex
@article{romero2024asr,
  title={ASR advancements for indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana},
  author={Romero, Monica and Gomez, Sandra and Torre, Iv{\'a}n G},
  journal={arXiv preprint arXiv:2404.08368},
  year={2024}
}
```