---
language: en
datasets:
- timit_asr
tags:
- audio
- automatic-speech-recognition
license: apache-2.0
widget:
- label: Sample 1 (from LibriSpeech)
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
---

# Wav2Vec2-Base-TIMIT

This model is [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base)
fine-tuned on the [timit_asr dataset](https://huggingface.co/datasets/timit_asr).
When using this model, make sure that your speech input is sampled at 16 kHz.
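
If a recording was captured at a different rate, it needs to be resampled before it reaches the processor. The following is a minimal sketch of one way to do that, assuming SciPy and NumPy are installed and the audio is already loaded as a NumPy array; `resample_to_16k` and `TARGET_RATE` are hypothetical names introduced here, not part of this model or of `transformers`.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_RATE = 16_000  # sampling rate the model expects


def resample_to_16k(speech: np.ndarray, source_rate: int) -> np.ndarray:
    """Resample `speech` from `source_rate` to 16 kHz (hypothetical helper)."""
    if source_rate == TARGET_RATE:
        return speech
    # Express the rate ratio in lowest terms for the polyphase resampler.
    divisor = gcd(TARGET_RATE, source_rate)
    return resample_poly(speech, TARGET_RATE // divisor, source_rate // divisor)


# One second of a 440 Hz tone at 44.1 kHz stands in for a real recording.
source_rate = 44_100
t = np.arange(source_rate) / source_rate
tone = np.sin(2 * np.pi * 440 * t)
resampled = resample_to_16k(tone, source_rate)
print(len(tone), "->", len(resampled))
```

With `soundfile`, the source rate is the second value returned by `sf.read`, so it can be passed straight to a helper like this one.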

## Usage

The model can be used directly (without a language model) as follows:

```python
import soundfile as sf
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "elgeish/wav2vec2-base-timit"
processor = Wav2Vec2Processor.from_pretrained(model_name, do_lower_case=True)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

# Transcribe the first 10 utterances of the TIMIT test split.
dataset = load_dataset("timit_asr", split="test[:10]")


def prepare_example(example):
    # Read the raw waveform; the returned sampling rate is discarded.
    example["speech"], _ = sf.read(example["file"])
    return example


dataset = dataset.map(prepare_example, remove_columns=["file"])

# Pad every waveform to the longest one in the batch.
inputs = processor(dataset["speech"], sampling_rate=16000, return_tensors="pt", padding="longest")

with torch.no_grad():
    # Greedy decoding: take the most likely token for each audio frame.
    predicted_ids = torch.argmax(model(inputs.input_values).logits, dim=-1)

predicted_transcripts = processor.tokenizer.batch_decode(predicted_ids)
for reference, predicted in zip(dataset["text"], predicted_transcripts):
    print("reference:", reference)
    print("predicted:", predicted)
    print("--")
```

Here's the output:

```
reference: The bungalow was pleasantly situated near the shore.
predicted: the bunglow was plesntly situated near the shor
--
reference: Don't ask me to carry an oily rag like that.
predicted: don't ask me to carry an oily rag like that
--
reference: Are you looking for employment?
predicted: are you oking for employment
--
reference: She had your dark suit in greasy wash water all year.
predicted: she had your dark suit in greasy wash water all year
--
reference: At twilight on the twelfth day we'll have Chablis.
predicted: at twilight on the twelfth day we'll have shiple
--
reference: Eating spinach nightly increases strength miraculously.
predicted: eating spanage nightly increases strength moraculously
--
reference: Got a heck of a buy on this, dirt cheap.
predicted: got a heck of a by on this dert cheep
--
reference: The scalloped edge is particularly appealing.
predicted: the scaliped edge iuse particularly appeling
--
reference: A big goat idly ambled through the farmyard.
predicted: a big goat idely ambled through the farmyard
--
reference: This group is secularist and their program tends to be technological.
predicted: this croup is secularist and their program tens to be technological
--
```
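
Decoding without a language model, as in the usage snippet, amounts to greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, and drop the blank token. The sketch below illustrates that idea in plain Python. The five-token vocabulary, the `BLANK` and `WORD_DELIMITER` symbols, and `greedy_ctc_decode` are assumptions made for illustration; the model's real vocabulary and special tokens come from its tokenizer, which does this work inside `batch_decode`.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one ships with the model's tokenizer.
BLANK = "<pad>"
WORD_DELIMITER = "|"
VOCAB = [BLANK, WORD_DELIMITER, "a", "c", "t"]


def greedy_ctc_decode(frame_ids):
    """Collapse repeated frame labels, drop blanks, and map delimiters to spaces."""
    # Step 1: merge runs of identical consecutive ids into a single id.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove the blank token, which only separates repeated characters.
    tokens = [VOCAB[i] for i in collapsed if VOCAB[i] != BLANK]
    # Step 3: join characters and turn word delimiters into spaces.
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax ids for one utterance, standing in for torch.argmax(logits, dim=-1).
frame_ids = [2, 2, 0, 1, 3, 3, 0, 2, 2, 4, 0, 0]
print(greedy_ctc_decode(frame_ids))  # prints "a cat"
```

Because each frame is decoded independently, nothing steers the output toward valid spellings, which is consistent with errors such as "bunglow" and "shiple" in the transcripts above.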

## Fine-Tuning Script

The script used to produce this model is
[finetune_base_timit_asr.sh](https://github.com/elgeish/transformers/blob/f2b98f876b040bab3c3db8561ec39c1abb2c733c/examples/research_projects/wav2vec2/finetune_base_timit_asr.sh).