from pathlib import Path

# Directories where model evaluation requests are stored
DIR_OUTPUT_REQUESTS = Path("requested_models")
EVAL_REQUESTS_PATH = Path("eval_requests")

##########################
# Text definitions       #
##########################

banner_url = "https://huggingface.co/datasets/reach-vb/random-images/resolve/main/asr_leaderboard.png"
BANNER = f'<div style="display: flex; justify-content: space-around;"><img src="{banner_url}" alt="Banner" style="width: 40vw; min-width: 300px; max-width: 600px;"> </div>'

TITLE = "<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> 🤗 Open Automatic Speech Recognition Leaderboard </h1> </body> </html>"

INTRODUCTION_TEXT = "📐 The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models \
    on the Hugging Face Hub. \
    \nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️) and [RTF](https://openvoice-tech.net/index.php/Real-time-factor) (⬇️) - the lower, the better. Models are ranked based on their Average WER, from lowest to highest. Check the 📈 Metrics tab to understand how the models are evaluated. \
    \nIf you want results for a model that is not listed here, you can submit a request for it to be included ✉️✨. \
    \nThe leaderboard currently focuses on English speech recognition, and will be expanded to multilingual evaluation in later versions."

CITATION_TEXT = """@misc{open-asr-leaderboard,
	title        = {Open Automatic Speech Recognition Leaderboard},
	author       = {Srivastav, Vaibhav and Majumdar, Somshubra and Koluguri, Nithin and Moumen, Adel and Gandhi, Sanchit and Hugging Face Team and Nvidia NeMo Team and SpeechBrain Team},
	year         = 2023,
	publisher    = {Hugging Face},
	howpublished = "\\url{https://huggingface.co/spaces/open-asr-leaderboard/leaderboard}"
}
"""

METRICS_TAB_TEXT = """
Here you will find details about the speech recognition metrics and datasets reported in our leaderboard.

## Metrics

🎯 Word Error Rate (WER) and Real-Time Factor (RTF) are popular metrics for evaluating speech recognition models: 
WER measures how accurate a model's predictions are, and RTF measures how quickly they are returned. We explain each 
of them below.

### Word Error Rate (WER)

Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It calculates the percentage 
of words in the system's output that differ from the reference (correct) transcript. **A lower WER value indicates higher accuracy**.

```
Example: If the reference transcript is "I really love cats" and the ASR system outputs "I don't love dogs",
the WER would be `50%`, because 2 out of the 4 words are incorrect.
```

For a fair comparison, we calculate **zero-shot** (i.e. pre-trained models only) *normalised WER* for all the model checkpoints. You can find the evaluation code on our [Github repository](https://github.com/huggingface/open_asr_leaderboard). To read more about how the WER is computed, refer to the [Audio Transformers Course](https://huggingface.co/learn/audio-course/chapter5/evaluation).
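
As an illustration, a score like the one in the toy example above can be reproduced with the 🤗 [`evaluate`](https://huggingface.co/docs/evaluate) library. This is a minimal sketch, not the leaderboard's evaluation pipeline, and assumes `evaluate` and `jiwer` are installed:

```python
# Minimal sketch: compute the WER for the toy example above with the `evaluate` library.
# This is not the leaderboard's evaluation script; it only illustrates the metric.
import evaluate

wer_metric = evaluate.load("wer")

references = ["i really love cats"]   # normalised reference transcript
predictions = ["i don't love dogs"]   # normalised model output

# 2 substitutions out of 4 reference words -> WER = 0.5
wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.0%}")  # WER: 50%
```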

### Real Time Factor (RTF)

Real Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes a 
model to process a given amount of speech. It's usually expressed as a multiple of real time. An RTF of 1 means the system processes 
speech as fast as it's spoken, while an RTF of 2 means it takes twice as long. Thus, **a lower RTF value indicates lower latency**.

```
Example: If it takes an ASR system 10 seconds to transcribe 10 seconds of speech, the RTF is 1. 
If it takes 20 seconds to transcribe the same 10 seconds of speech, the RTF is 2.
```

For the benchmark, we report the RTF averaged over a 10-minute audio sample, with 5 warm-up batches followed by 3 graded batches.
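
As a rough sketch (not the exact benchmark harness; `transcribe` and the batch layout below are hypothetical stand-ins), RTF can be measured by timing the model over batches of audio, discarding the warm-up runs, and dividing by the audio duration:

```python
# Illustrative sketch of an RTF measurement; `transcribe` and the batch layout are
# hypothetical stand-ins for the real benchmark code.
import time

def real_time_factor(transcribe, batches, audio_seconds_per_batch, n_warmup=5):
    # Warm-up runs are discarded so one-off costs (model loading, caching) don't skew timing.
    for batch in batches[:n_warmup]:
        transcribe(batch)

    timed = batches[n_warmup:]
    elapsed = 0.0
    for batch in timed:
        start = time.perf_counter()
        transcribe(batch)
        elapsed += time.perf_counter() - start

    # RTF = processing time / audio duration; RTF < 1 means faster than real time.
    return elapsed / (len(timed) * audio_seconds_per_batch)
```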

## How to reproduce our results

The ASR Leaderboard is an ongoing effort to benchmark open-source and open-access speech recognition models where possible.
Along with the Leaderboard we're open-sourcing the codebase used for running these evaluations.
For more details head over to our repo at: https://github.com/huggingface/open_asr_leaderboard 

P.S. We'd love to know which other models you'd like us to benchmark next. Contributions are more than welcome! ♥️

## Benchmark datasets

Evaluating Speech Recognition systems is a hard problem. We use the multi-dataset benchmarking strategy proposed in the 
[ESB paper](https://arxiv.org/abs/2210.13352) to obtain robust evaluation scores for each model.

ESB is a benchmark for evaluating the performance of a single automatic speech recognition (ASR) system across a broad 
set of speech datasets. It comprises eight English speech recognition datasets, capturing a broad range of domains, 
acoustic conditions, speaker styles, and transcription requirements. As such, it gives a better indication of how 
a model is likely to perform on downstream ASR compared to evaluating it on one dataset alone.

The ESB score is calculated as a macro-average of the WER scores across the ESB datasets. The models in the leaderboard
are ranked based on their average WER scores, from lowest to highest.
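
For illustration, the macro-average is the unweighted mean of the per-dataset WERs (the values below are placeholders, not real leaderboard numbers):

```python
# Hypothetical per-dataset WERs (placeholder values) macro-averaged into an ESB-style score.
per_dataset_wer = {"librispeech": 0.04, "voxpopuli": 0.08, "ami": 0.16}

esb_score = sum(per_dataset_wer.values()) / len(per_dataset_wer)
print(f"Macro-average WER: {esb_score:.2%}")  # Macro-average WER: 9.33%
```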

| Dataset                                                                                 | Domain                      | Speaking Style        | Train (h) | Dev (h) | Test (h) | Transcriptions     | License         |
|-----------------------------------------------------------------------------------------|-----------------------------|-----------------------|-----------|---------|----------|--------------------|-----------------|
| [LibriSpeech](https://huggingface.co/datasets/librispeech_asr)                          | Audiobook                   | Narrated              | 960       | 11      | 11       | Normalised         | CC-BY-4.0       |
| [Common Voice 9](https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0)   | Wikipedia                   | Narrated              | 1409      | 27      | 27       | Punctuated & Cased | CC0-1.0         |
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli)                         | European Parliament         | Oratory               | 523       | 5       | 5        | Punctuated         | CC0             |
| [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium)                                | TED talks                   | Oratory               | 454       | 2       | 3        | Normalised         | CC-BY-NC-ND 3.0 |
| [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech)                    | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500      | 12      | 40       | Punctuated         | apache-2.0      |
| [SPGISpeech](https://huggingface.co/datasets/kensho/spgispeech)                         | Financial meetings          | Oratory, spontaneous  | 4900      | 100     | 100      | Punctuated & Cased | User Agreement  |
| [Earnings-22](https://huggingface.co/datasets/revdotcom/earnings22)                     | Financial meetings          | Oratory, spontaneous  | 105       | 5       | 5        | Punctuated & Cased | CC-BY-SA-4.0    |
| [AMI](https://huggingface.co/datasets/edinburghcstr/ami)                                | Meetings                    | Spontaneous           | 78        | 9       | 9        | Punctuated & Cased | CC-BY-4.0       |

For more details on the individual datasets and how models are evaluated to give the ESB score, refer to the [ESB paper](https://arxiv.org/abs/2210.13352).
"""