---
language: en
license: apache-2.0
tags:
  - audio-classification
  - generated_from_trainer
metrics:
  - accuracy
  - f1
model-index:
  - name: distil-wav2vec2-xls-r-adult-child-cls-89m
    results: []
---

# DistilWav2Vec2 XLS-R Adult/Child Speech Classifier 89M

DistilWav2Vec2 XLS-R Adult/Child Speech Classifier is an audio classification model based on the [XLS-R](https://arxiv.org/abs/2111.09296) architecture. It is a distilled version of [wav2vec2-xls-r-adult-child-cls](https://huggingface.co/bookbot/wav2vec2-xls-r-adult-child-cls), trained on a private adult/child speech classification dataset.

This model was trained using the Hugging Face Transformers library with PyTorch. All training was done on a Tesla P100 GPU provided by Kaggle. Training metrics were logged via TensorBoard.

## Model

| Model                                       | #params | Arch. | Training/Validation data                  |
| ------------------------------------------- | ------- | ----- | ----------------------------------------- |
| `distil-wav2vec2-xls-r-adult-child-cls-89m` | 89M     | XLS-R | Adult/Child Speech Classification Dataset |
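
As a rough usage sketch, the model can probably be loaded through the Transformers `audio-classification` pipeline. The Hub ID below is an assumption (the model name from this card under the `bookbot` namespace used by the teacher model), and `top_prediction` and `classify` are hypothetical helper names introduced for illustration. The label strings the model returns are not documented in this card.

```python
from typing import Dict, List

# Assumed Hub ID: the model name from this card under the `bookbot` namespace.
MODEL_ID = "bookbot/distil-wav2vec2-xls-r-adult-child-cls-89m"


def top_prediction(results: List[Dict]) -> Dict:
    """Return the highest-scoring entry from audio-classification pipeline output.

    The pipeline returns a list of dicts with "label" and "score" keys.
    """
    return max(results, key=lambda r: r["score"])


def classify(audio_path: str, model_id: str = MODEL_ID) -> Dict:
    """Classify one audio file; the checkpoint is downloaded on first use."""
    # Imported lazily so that top_prediction can be used without Transformers installed.
    from transformers import pipeline

    classifier = pipeline("audio-classification", model=model_id)
    return top_prediction(classifier(audio_path))
```

Calling `classify("sample.wav")`, where `sample.wav` is a placeholder path, would return a single dict holding the predicted label and its score. Decoding audio from a file path typically requires `ffmpeg` to be installed.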

## Evaluation Results

The model achieves the following results on evaluation:

| Dataset                           | Loss   | Accuracy | F1     |
| --------------------------------- | ------ | -------- | ------ |
| Adult/Child Speech Classification | 0.3048 | 93.54%   | 0.9420 |
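
For reference, the two reported metrics can be computed as in the minimal sketch below. The card does not state which class is treated as positive or how F1 is averaged, so this assumes a binary F1 with label `1` as the positive class. The toy labels are made up for illustration and are not drawn from the evaluation set.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the assumed positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Toy example (illustrative labels only).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred), binary_f1(y_true, y_pred))
```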

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- `learning_rate`: 3e-05
- `train_batch_size`: 32
- `eval_batch_size`: 32
- `seed`: 42
- `gradient_accumulation_steps`: 4
- `total_train_batch_size`: 128
- `optimizer`: Adam with `betas=(0.9,0.999)` and `epsilon=1e-08`
- `lr_scheduler_type`: linear
- `lr_scheduler_warmup_ratio`: 0.1
- `num_epochs`: 5
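
The relationship between these values can be sketched in plain Python. The effective batch size assumes a single device, consistent with the one Tesla P100 mentioned above. The total of 480 optimizer steps is taken from the final row of the training results table. `linear_lr` is a hypothetical helper that mirrors the usual linear warmup and decay rule, not code taken from the training run.

```python
import math

LEARNING_RATE = 3e-5
TRAIN_BATCH_SIZE = 32
GRAD_ACCUM_STEPS = 4
WARMUP_RATIO = 0.1
TOTAL_STEPS = 480  # 5 epochs x 96 optimizer steps per epoch

# Samples contributing to each optimizer update (single device assumed).
effective_batch_size = TRAIN_BATCH_SIZE * GRAD_ACCUM_STEPS

# Number of steps spent ramping the learning rate up from zero.
warmup_steps = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def linear_lr(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return LEARNING_RATE * step / warmup_steps
    remaining = max(0, TOTAL_STEPS - step)
    return LEARNING_RATE * remaining / (TOTAL_STEPS - warmup_steps)


for step in (0, 24, 48, 264, 480):
    print(step, linear_lr(step))
```

Under these assumptions the learning rate peaks at `3e-05` after 48 steps, halfway through the first epoch, and reaches zero at step 480.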

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |   F1   |
| :-----------: | :---: | :--: | :-------------: | :------: | :----: |
|    0.7711     |  1.0  |  96  |     0.5413      |  0.9017  | 0.9156 |
|    0.5551     |  2.0  | 192  |     0.4627      |  0.9164  | 0.9272 |
|    0.4166     |  3.0  | 288  |     0.3832      |  0.9261  | 0.9352 |
|    0.3928     |  4.0  | 384  |     0.3242      |  0.9331  | 0.9406 |
|    0.3622     |  5.0  | 480  |     0.3048      |  0.9354  | 0.9420 |

## Disclaimer

Consider the biases in the pre-training datasets, which may carry over into the results of this model.

## Authors

DistilWav2Vec2 XLS-R Adult/Child Speech Classifier was trained and evaluated by [Wilson Wongso](https://w11wo.github.io/). All computation and development were done on Kaggle.

## Framework versions

- Transformers 4.17.0.dev0
- PyTorch 1.10.2+cu102
- Datasets 1.18.3
- Tokenizers 0.11.0