---
metrics:
- wer
- cer
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- Pomak
- Slavic
---

# wav2vec2-xls-r-slavic-pomak

Pomak is an endangered South East Slavic language variety spoken in Northern Greece.
This is the first automatic speech recognition (ASR) model for Pomak. 
To train the model, we fine-tuned a Slavic model ([classla/wav2vec2-large-slavic-parlaspeech-hr](https://huggingface.co/classla/wav2vec2-large-slavic-parlaspeech-hr)) on 11h of recorded Pomak speech.
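
As a minimal usage sketch, the fine-tuned checkpoint could be loaded with the `transformers` ASR pipeline, matching the `library_name` and `pipeline_tag` declared in the metadata. The repository namespace in `MODEL_ID` and the helper name `transcribe` are assumptions made for illustration; this card does not state the full model id.

```python
# Hypothetical placeholder: replace <namespace> with the account hosting this model.
MODEL_ID = "<namespace>/wav2vec2-xls-r-slavic-pomak"


def transcribe(audio_path, model_id=MODEL_ID):
    """Return the transcription of a single audio file (illustrative helper)."""
    # Imported lazily so this snippet can be loaded without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    # For a file path input the pipeline returns a dict with a "text" key.
    return asr(audio_path)["text"]
```

Calling `transcribe("sample.wav")` (a hypothetical file name) should return the recognized text as a string. Decoding audio from a file path requires `ffmpeg` to be available on the system.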

## Recordings

Four native Pomak speakers (two female and two male) agreed to read Pomak texts at the ILSP audio-visual studio in Xanthi, Greece, resulting in a corpus of over 14 hours.

|Speaker|Gender|Total recorded hours|
|---|---|---|
|NK9dIF | F | 4h 44m 45s|
|xoVY9q | M | 4h 36m 12s|
|9G75fk | F | 1h 44m 03s|
|n5WzHj | M | 3h 44m 04s|
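
The corpus size can be checked by summing the per-speaker durations in the table. The snippet below is only an arithmetic check on those four values; the variable names are illustrative.

```python
from datetime import timedelta

# Per-speaker durations copied from the table above.
durations = {
    "NK9dIF": timedelta(hours=4, minutes=44, seconds=45),
    "xoVY9q": timedelta(hours=4, minutes=36, seconds=12),
    "9G75fk": timedelta(hours=1, minutes=44, seconds=3),
    "n5WzHj": timedelta(hours=3, minutes=44, seconds=4),
}

# Sum the timedeltas, then break the total into hours, minutes, seconds.
total = sum(durations.values(), timedelta())
hours, rem = divmod(int(total.total_seconds()), 3600)
minutes, seconds = divmod(rem, 60)

print(f"{hours}h {minutes:02d}m {seconds:02d}s")  # prints "14h 49m 04s"
```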

To fine-tune the model, we split the long recordings into smaller segments of a maximum of 25 seconds each.
This removed the majority of pauses and resulted in a total dataset duration of 11h 8m.
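
The card does not say how the segment boundaries were found, so the following is only a sketch of one possible approach: greedily merging detected speech intervals into segments of at most 25 seconds and dropping longer silences. The input intervals, the `max_gap` threshold, and the function name `group_intervals` are all hypothetical.

```python
MAX_SEGMENT_S = 25.0  # upper bound on segment length stated in the text


def group_intervals(intervals, max_len=MAX_SEGMENT_S, max_gap=0.5):
    """Greedily merge sorted (start, end) speech intervals, in seconds.

    max_gap is a hypothetical pause threshold: a silence longer than this
    always starts a new segment, so the pause itself is discarded.
    """
    segments = []
    for start, end in intervals:
        # Hard-split a single interval that exceeds the limit on its own.
        while end - start > max_len:
            segments.append((start, start + max_len))
            start += max_len
        if segments:
            prev_start, prev_end = segments[-1]
            fits = end - prev_start <= max_len
            close = start - prev_end <= max_gap
            if fits and close:
                # Extend the current segment instead of opening a new one.
                segments[-1] = (prev_start, end)
                continue
        segments.append((start, end))
    return segments


# Made-up intervals, e.g. from a voice activity detector (not real corpus data).
speech = [(0.0, 10.0), (10.3, 20.0), (20.2, 27.0), (30.0, 34.0), (34.4, 40.0)]
segments = group_intervals(speech)
print(segments)  # [(0.0, 20.0), (20.2, 27.0), (30.0, 40.0)]
```

In this toy input the 3-second silence between 27.0 and 30.0 falls outside every segment. For scale, 11h 8m is about 75% of the 14h 49m 04s obtained by summing the per-speaker table.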

## Metrics

We evaluated the model on the test split, which comprises 10% of the dataset recordings.

|Model|WER|CER|
|---|---|---|
|pre-trained|87.31%|31.47%|
|fine-tuned|9.06%|3.12%|
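
For readers unfamiliar with the two metrics, word error rate (WER) and character error rate (CER) are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below is a plain-Python illustration, not the evaluation code used for the table, and the example strings are toy English text rather than Pomak data.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (sub, ins, del cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"WER: {wer(reference, hypothesis):.2%}")  # 2 word edits / 6 words
print(f"CER: {cer(reference, hypothesis):.2%}")  # 5 char edits / 22 chars
```

Over a whole test set, the edit counts and reference lengths are usually summed across utterances before dividing, rather than averaging per-utterance rates.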

## Training hyperparameters

We fine-tuned the baseline model (`wav2vec2-large-slavic-parlaspeech-hr`) on an NVIDIA GeForce RTX 3090, using the following hyperparameters:

| arg                           | value |
|-------------------------------|-------|
| `per_device_train_batch_size` | 8     |
| `gradient_accumulation_steps` | 2     |
| `num_train_epochs`            | 35    |
| `learning_rate`               | 3e-4  |
| `warmup_steps`                | 500   |
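
Two quantities follow from these values: the effective batch size per optimizer step, and the learning rate during warmup. The sketch below assumes a single GPU (as stated above) and a linear warmup; the warmup shape is an assumption, since the card does not name the scheduler.

```python
# Hyperparameters copied from the table above.
hparams = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "num_train_epochs": 35,
    "learning_rate": 3e-4,
    "warmup_steps": 500,
}
NUM_GPUS = 1  # one RTX 3090, per the text

# Number of samples contributing to each optimizer update.
effective_batch_size = (
    hparams["per_device_train_batch_size"]
    * hparams["gradient_accumulation_steps"]
    * NUM_GPUS
)


def warmup_lr(step):
    """Learning rate at an optimizer step, assuming linear warmup (assumption).

    Only the ramp up to the peak is modeled; any decay after warmup is omitted.
    """
    peak = hparams["learning_rate"]
    return peak * min(1.0, step / hparams["warmup_steps"])


print(effective_batch_size)  # 16
print(warmup_lr(250))        # halfway through warmup: half the peak rate
```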

## Citation

To cite this work or read more about the training pipeline, see [this paper](https://aclanthology.org/2023.fieldmatters-1.5/):

```
@inproceedings{tsoukala-etal-2023-asr,
    title = "{ASR} pipeline for low-resourced languages: A case study on Pomak",
    author = "Tsoukala, Chara  and
      Kritsis, Kosmas  and
      Douros, Ioannis  and
      Katsamanis, Athanasios  and
      Kokkas, Nikolaos  and
      Arampatzakis, Vasileios  and
      Sevetlidis, Vasileios  and
      Markantonatou, Stella  and
      Pavlidis, George",
    booktitle = "Proceedings of the Second Workshop on NLP Applications to Field Linguistics",
    month = may,
    year = "2023",
    address = "Dubrovnik, Croatia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.fieldmatters-1.5",
    doi = "10.18653/v1/2023.fieldmatters-1.5",
    pages = "40--45",
    abstract = "Automatic Speech Recognition (ASR) models can aid field linguists by facilitating the creation of text corpora from oral material. Training ASR systems for low-resource languages can be a challenging task not only due to lack of resources but also due to the work required for the preparation of a training dataset. We present a pipeline for data processing and ASR model training for low-resourced languages, based on the language family. As a case study, we collected recordings of Pomak, an endangered South East Slavic language variety spoken in Greece. Using the proposed pipeline, we trained the first Pomak ASR model.",
}
```