---
language: en
datasets:
- timit_asr
tags:
- audio
- automatic-speech-recognition
license: apache-2.0
widget:
- label: Sample 1 (from LibriSpeech)
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
---

# Wav2Vec2-Base-TIMIT

This model is [facebook/wav2vec2-base](https://huggingface.co/facebook/wav2vec2-base)
fine-tuned on the [timit_asr dataset](https://huggingface.co/datasets/timit_asr).
When using this model, make sure that your speech input is sampled at 16 kHz.
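
If your audio was recorded at a different rate, it needs to be resampled before it reaches the processor. The following is a minimal sketch of one way to do that, assuming NumPy and SciPy are installed; `to_16khz` is a hypothetical helper name and is not part of this model's code. It assumes a waveform array with time along the first axis, such as the one `soundfile.read` returns.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly

TARGET_SAMPLE_RATE = 16_000  # the rate this model expects


def to_16khz(speech: np.ndarray, sample_rate: int) -> np.ndarray:
    """Resample `speech` to 16 kHz (hypothetical helper, for illustration only)."""
    if sample_rate == TARGET_SAMPLE_RATE:
        return speech
    # Reduce the up/down factors by their common divisor to keep the filter small.
    divisor = gcd(TARGET_SAMPLE_RATE, sample_rate)
    up = TARGET_SAMPLE_RATE // divisor
    down = sample_rate // divisor
    # Polyphase resampling applies a low-pass filter, which limits aliasing.
    return resample_poly(speech, up, down)


# One second of a 440 Hz tone sampled at 44.1 kHz.
t = np.arange(44_100) / 44_100
tone = np.sin(2 * np.pi * 440 * t)
resampled = to_16khz(tone, 44_100)
print(len(resampled))  # 16000
```

Other resamplers (for example those in `librosa` or `torchaudio`) would serve equally well; the only requirement stated by this card is the 16 kHz rate.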

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
from datasets import load_dataset
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "elgeish/wav2vec2-base-timit"
processor = Wav2Vec2Processor.from_pretrained(model_name, do_lower_case=True)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
# Take the first 10 examples of the TIMIT test split
dataset = load_dataset("timit_asr", split="test[:10]")

def prepare_example(example):
    # Load the waveform from disk; the returned sample rate is discarded (TIMIT audio is 16 kHz)
    example["speech"], _ = sf.read(example["file"])
    return example

dataset = dataset.map(prepare_example, remove_columns=["file"])
# Pad every waveform in the batch to the length of the longest one
inputs = processor(dataset["speech"], sampling_rate=16000, return_tensors="pt", padding="longest")

with torch.no_grad():
    # Greedy decoding: pick the most likely token for each audio frame
    predicted_ids = torch.argmax(model(inputs.input_values).logits, dim=-1)
predicted_transcripts = processor.tokenizer.batch_decode(predicted_ids)
for reference, predicted in zip(dataset["text"], predicted_transcripts):
    print("reference:", reference)
    print("predicted:", predicted)
    print("--")
```
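
To make the decoding step less opaque: taking the `argmax` over the logits yields one token id per audio frame, and CTC decoding then merges consecutive repeats and removes blank tokens. The sketch below illustrates that idea in plain Python. The toy vocabulary, the blank id of `0`, and the `|` word delimiter are assumptions chosen for illustration; the real values come from the model's tokenizer, and `greedy_ctc_decode` is a hypothetical name rather than a library function.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one ships with the model's tokenizer.
VOCAB = {0: "<pad>", 1: "|", 2: "a", 3: "c", 4: "t"}
BLANK_ID = 0          # assumed id of the CTC blank token
WORD_DELIMITER = "|"  # assumed token that marks word boundaries


def greedy_ctc_decode(frame_ids):
    """Merge consecutive repeats, drop blanks, and turn delimiters into spaces."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    tokens = [VOCAB[token_id] for token_id in collapsed if token_id != BLANK_ID]
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# One id per audio frame, as produced by an argmax over the logits.
frame_ids = [2, 2, 0, 1, 3, 3, 0, 2, 2, 4, 0, 0]
print(greedy_ctc_decode(frame_ids))  # a cat
```

The blank token is what lets CTC emit a doubled letter: `[4, 0, 4]` decodes to `tt`, whereas `[4, 4]` collapses to a single `t`.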

Here's the output:

```
reference: The bungalow was pleasantly situated near the shore.
predicted: the bunglow was plesntly situated near the shor
--
reference: Don't ask me to carry an oily rag like that.
predicted: don't ask me to carry an oily rag like that
--
reference: Are you looking for employment?
predicted: are you oking for employment
--
reference: She had your dark suit in greasy wash water all year.
predicted: she had your dark suit in greasy wash water all year
--
reference: At twilight on the twelfth day we'll have Chablis.
predicted: at twilight on the twelfth day we'll have shiple
--
reference: Eating spinach nightly increases strength miraculously.
predicted: eating spanage nightly increases strength moraculously
--
reference: Got a heck of a buy on this, dirt cheap.
predicted: got a heck of a by on this dert cheep
--
reference: The scalloped edge is particularly appealing.
predicted: the scaliped edge iuse particularly appeling
--
reference: A big goat idly ambled through the farmyard.
predicted: a big goat idely ambled through the farmyard
--
reference: This group is secularist and their program tends to be technological.
predicted: this croup is secularist and their program tens to be technological
--
```
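
The references keep their casing and punctuation while the predictions are lowercase and unpunctuated, so a fair comparison needs some normalization first. Here is a minimal sketch of a per-utterance word error rate in plain Python. The normalization scheme (lowercasing, turning hyphens into spaces, dropping everything except letters, apostrophes, and spaces) is an assumption made for this example and is not necessarily what the fine-tuning script does; `normalize_words` and `word_error_rate` are hypothetical names.

```python
import re


def normalize_words(text):
    """Lowercase, strip punctuation except apostrophes, and split into words."""
    text = text.lower().replace("-", " ")
    return re.sub(r"[^a-z' ]", "", text).split()


def word_error_rate(reference, predicted):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize_words(reference), normalize_words(predicted)
    # previous[j] is the edit distance between the reference prefix and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1] / len(ref)  # assumes a non-empty reference


# Two pairs taken from the output above.
pairs = [
    ("Don't ask me to carry an oily rag like that.", "don't ask me to carry an oily rag like that"),
    ("Are you looking for employment?", "are you oking for employment"),
]
for reference, predicted in pairs:
    print(f"{word_error_rate(reference, predicted):.2f}", predicted)
```

Under this normalization the first pair scores 0.00 and the second 0.20 (one substituted word out of five). A corpus-level figure would normally sum the edit distances and reference lengths over all utterances instead of averaging per-utterance rates.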

## Fine-Tuning Script

You can find the script used to produce this model
[here](https://github.com/elgeish/transformers/blob/f2b98f876b040bab3c3db8561ec39c1abb2c733c/examples/research_projects/wav2vec2/finetune_base_timit_asr.sh).