---
language:
- en
datasets:
- mozilla-foundation/common_voice_13_0
- facebook/voxpopuli
- LIUM/tedlium
- librispeech_asr
- fisher_corpus
- WSJ-0
metrics:
- wer
pipeline_tag: automatic-speech-recognition
model-index:
- name: tbd
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 3.4
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 7.7
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: tedlium-v3
      type: LIUM/tedlium
      config: release1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 5.5
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Vox Populi
      type: facebook/voxpopuli
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 8.3
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Mozilla Common Voice 13.0
      type: mozilla-foundation/common_voice_13_0
      config: en
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 16.1
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: FLEURS
      type: google/fleurs
      split: test
      args:
        language: en_us
    metrics:
    - type: wer
      value: 9.9
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Switchboard
      type: unk
      split: eval2000
      args:
        language: en
    metrics:
    - type: wer
      value: 12.5
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Wall Street Journal
      type: unk
      split: eval92
      args:
        language: en
    metrics:
    - type: wer
      value: 2.4
      name: Test WER
---

# DeCRED-base

This is a **39M-parameter encoder-decoder E-Branchformer model** trained on 6,000 hours of open-source normalised English data.

Architecture details, training hyperparameters, and a description of the proposed technique will be added soon.

*Disclaimer: The model currently hallucinates on segments that contain only silence, because it was not trained on such data. A fix will be added soon.*
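
Until that fix is available, one possible workaround is to filter out silence-only segments before they reach the model. The following is a minimal sketch and not part of the model or of `transformers`: the helpers `rms_dbfs` and `is_silent` and the -50 dBFS threshold are hypothetical and would need tuning on real data. It assumes each segment is a plain Python sequence of float samples in the range [-1, 1].

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of float samples in [-1, 1], in dB relative to full scale."""
    if len(samples) == 0:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    # 10 * log10(mean square) equals 20 * log10(RMS)
    return 10.0 * math.log10(mean_square)


def is_silent(samples, threshold_dbfs=-50.0):
    """Heuristic: treat a segment as silence when its RMS level is below the threshold.

    The default threshold is an assumed value, not one recommended by the model authors.
    """
    return rms_dbfs(samples) < threshold_dbfs
```

A segment for which `is_silent` returns `True` could then be skipped, or given an empty transcript, instead of being passed to the pipeline.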

The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline) class to transcribe audio files of arbitrary length.

```python
from transformers import pipeline

model_id = "BUT-FIT/ED-small"
pipe = pipeline("automatic-speech-recognition", model=model_id, feature_extractor=model_id, trust_remote_code=True)
# In newer versions of transformers (>4.31.0), there is a bug in the pipeline inference type.
# The warning can be ignored.
pipe.type = "seq2seq"

# Run beam search decoding with joint CTC-attention scorer
result_beam = pipe("audio.wav")

# Run greedy decoding without joint CTC-attention scorer
pipe.model.generation_config.ctc_weight = 0.0
pipe.model.generation_config.num_beams = 1

result_greedy = pipe("audio.wav")
```
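
The `ctc_weight` setting can be read as an interpolation weight between the attention decoder and the CTC head. The sketch below illustrates the common hybrid CTC/attention formulation on toy numbers. Whether this model's remote code uses exactly this formula is an assumption, and `joint_score` and `pick_best` are hypothetical helpers rather than part of `transformers` or of the model.

```python
def joint_score(att_logprob, ctc_logprob, ctc_weight):
    """Interpolate attention and CTC log-probabilities for one hypothesis."""
    return (1.0 - ctc_weight) * att_logprob + ctc_weight * ctc_logprob


def pick_best(hypotheses, ctc_weight):
    """Return the text of the hypothesis with the highest joint score.

    `hypotheses` is a list of (text, att_logprob, ctc_logprob) tuples.
    """
    best = max(hypotheses, key=lambda h: joint_score(h[1], h[2], ctc_weight))
    return best[0]


# Toy scores, made up for illustration only.
hyps = [("a cat", -1.0, -5.0), ("the cat", -1.5, -2.0)]
print(pick_best(hyps, ctc_weight=0.0))  # ranking by attention score alone
print(pick_best(hyps, ctc_weight=0.3))  # ranking by the joint score
```

With `ctc_weight = 0.0` the CTC term drops out and the ranking depends on the attention decoder alone, which matches the greedy configuration shown earlier.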