from pathlib import Path

# Directory where model requests are stored
DIR_OUTPUT_REQUESTS = Path("requested_models")
EVAL_REQUESTS_PATH = Path("eval_requests")

##########################
#    Text definitions    #
##########################

banner_url = "https://huggingface.co/datasets/reach-vb/random-images/resolve/main/asr_leaderboard.png"
BANNER = f'<div style="display: flex; justify-content: space-around;"><img src="{banner_url}" alt="Banner" style="width: 40vw; min-width: 300px; max-width: 600px;"> </div>'
TITLE = "<html> <head> <style> h1 {text-align: center;} </style> </head> <body> <h1> 🤗 Open Automatic Speech Recognition Leaderboard </h1> </body> </html>"
INTRODUCTION_TEXT = "The 🤗 Open ASR Leaderboard ranks and evaluates speech recognition models \
on the Hugging Face Hub. \
\nWe report the Average [WER](https://huggingface.co/spaces/evaluate-metric/wer) (⬇️) and [RTF](https://openvoice-tech.net/index.php/Real-time-factor) (⬇️); lower is better for both. Models are ranked based on their Average WER, from lowest to highest. Check the Metrics tab to understand how the models are evaluated. \
\nIf you want results for a model that is not listed here, you can submit a request for it to be included ✉️✨. \
\nThe leaderboard currently focuses on English speech recognition, and will be expanded to multilingual evaluation in later versions."
CITATION_TEXT = """@misc{open-asr-leaderboard,
title = {Open Automatic Speech Recognition Leaderboard},
author = {Srivastav, Vaibhav and Majumdar, Somshubra and Koluguri, Nithin and Moumen, Adel and Gandhi, Sanchit and {Hugging Face Team} and {Nvidia NeMo Team} and {SpeechBrain Team}},
year = 2023,
publisher = {Hugging Face},
howpublished = "\\url{https://huggingface.co/spaces/hf-audio/open_asr_leaderboard}"
}
"""
METRICS_TAB_TEXT = """
Here you will find details about the speech recognition metrics and datasets reported in our leaderboard.
## Metrics
🎯 Word Error Rate (WER) and Real-Time Factor (RTF) are popular metrics for evaluating speech recognition
models: WER estimates how accurate a model's predictions are, and RTF how quickly they are returned. We explain each
below.
### Word Error Rate (WER)
Word Error Rate is used to measure the **accuracy** of automatic speech recognition systems. It counts the word-level errors
(substitutions, deletions and insertions) in the system's output relative to the reference (correct) transcript, expressed as a percentage of the number of words in the reference. **A lower WER value indicates higher accuracy**.
```
Example: If the reference transcript is "I really love cats" and the ASR system outputs "I don't love dogs",
the WER is `50%` because 2 out of 4 words are incorrect.
```
For a fair comparison, we calculate **zero-shot** (i.e. pre-trained models only) *normalised WER* for all the model checkpoints. You can find the evaluation code on our [GitHub repository](https://github.com/huggingface/open_asr_leaderboard). To read more about how the WER is computed, refer to the [Audio Transformers Course](https://huggingface.co/learn/audio-course/chapter5/evaluation).
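As a rough illustration of the WER example above, the sketch below computes a word-level edit distance and divides it by the number of reference words. The function name `word_error_rate` and the simple lowercase-and-split tokenisation are assumptions made for this example; the leaderboard's own normalisation and evaluation code live in the repository linked above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    # Simplified tokenisation (an assumption): lowercase, split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    # Errors are normalised by the length of the reference transcript.
    return prev[-1] / len(ref)


print(word_error_rate("I really love cats", "I don't love dogs"))  # 0.5
```

Because the count of errors is divided by the reference length, a hypothesis with many inserted words can score above 100%.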
### Real Time Factor (RTF)
Real Time Factor is a measure of the **latency** of automatic speech recognition systems, i.e. how long it takes a
model to process a given amount of speech. It's usually expressed as a multiple of real time. An RTF of 1 means it processes
speech as fast as it's spoken, while an RTF of 2 means it takes twice as long. Thus, **a lower RTF value indicates lower latency**.
```
Example: If it takes an ASR system 10 seconds to transcribe 10 seconds of speech, the RTF is 1.
If it takes 20 seconds to transcribe the same 10 seconds of speech, the RTF is 2.
```
For the benchmark, we report RTF averaged over a 10-minute audio sample with 5 warm-up batches followed by 3 graded batches.
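The RTF definition above can be sketched as processing time divided by audio duration. The helper below is a minimal illustration, not the leaderboard's benchmarking code: `measure_rtf`, its parameters and the `transcribe` callable are hypothetical names, and the defaults of 5 warm-up and 3 timed runs only mirror the description above.

```python
import time


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    # RTF = time spent transcribing / duration of the audio transcribed.
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def measure_rtf(transcribe, audio, audio_seconds, warmup_runs=5, timed_runs=3):
    # `transcribe` is a placeholder for any callable that runs a model on `audio`.
    # Warm-up runs are executed but not timed.
    for _ in range(warmup_runs):
        transcribe(audio)

    start = time.perf_counter()
    for _ in range(timed_runs):
        transcribe(audio)
    elapsed = time.perf_counter() - start

    # Average the processing time over the timed runs before normalising.
    return real_time_factor(elapsed / timed_runs, audio_seconds)


print(real_time_factor(20, 10))  # 2.0
```

Discarding the warm-up runs keeps one-off costs such as model loading or cache initialisation out of the reported figure.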
## How to reproduce our results
The ASR Leaderboard is an ongoing effort to benchmark open-source and open-access speech recognition models where possible.
Along with the Leaderboard, we're open-sourcing the codebase used for running these evaluations.
For more details, head over to our repo at: https://github.com/huggingface/open_asr_leaderboard
P.S. We'd love to know which other models you'd like us to benchmark next. Contributions are more than welcome! ♥️
## Benchmark datasets
Evaluating speech recognition systems is a hard problem. We use the multi-dataset benchmarking strategy proposed in the
[ESB paper](https://arxiv.org/abs/2210.13352) to obtain robust evaluation scores for each model.
ESB is a benchmark for evaluating the performance of a single automatic speech recognition (ASR) system across a broad
set of speech datasets. It comprises eight English speech recognition datasets, capturing a broad range of domains,
acoustic conditions, speaker styles, and transcription requirements. As such, it gives a better indication of how
a model is likely to perform on downstream ASR than evaluating it on one dataset alone.
The ESB score is calculated as a macro-average of the WER scores across the ESB datasets. The models in the leaderboard
are ranked based on their average WER scores, from lowest to highest.
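As a rough illustration of the macro-average and ranking described above, the sketch below gives every dataset equal weight and sorts models by the resulting score. The function names, model names and WER values are invented for this example; they are not leaderboard results or leaderboard code.

```python
from statistics import mean


def macro_average_wer(wer_by_dataset: dict) -> float:
    # Every dataset counts equally, regardless of how many hours it contains.
    if not wer_by_dataset:
        raise ValueError("at least one dataset score is required")
    return mean(wer_by_dataset.values())


def rank_models(scores: dict) -> list:
    # Lowest average WER first, matching the leaderboard ordering.
    return sorted(scores, key=lambda model: macro_average_wer(scores[model]))


# Hypothetical WER percentages, for illustration only.
example_scores = {
    "model_a": {"librispeech": 4.0, "ami": 20.0},
    "model_b": {"librispeech": 5.0, "ami": 15.0},
}

print(macro_average_wer(example_scores["model_a"]))  # 12.0
print(rank_models(example_scores))  # ['model_b', 'model_a']
```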
| Dataset | Domain | Speaking Style | Train (h) | Dev (h) | Test (h) | Transcriptions | License |
|-----------------------------------------------------------------------------------------|-----------------------------|-----------------------|-----------|---------|----------|--------------------|-----------------|
| [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) | Audiobook | Narrated | 960 | 11 | 11 | Normalised | CC-BY-4.0 |
| [Common Voice 9](https://huggingface.co/datasets/mozilla-foundation/common_voice_9_0) | Wikipedia | Narrated | 1409 | 27 | 27 | Punctuated & Cased | CC0-1.0 |
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | European Parliament | Oratory | 523 | 5 | 5 | Punctuated | CC0 |
| [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium) | TED talks | Oratory | 454 | 2 | 3 | Normalised | CC-BY-NC-ND 3.0 |
| [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) | Audiobook, podcast, YouTube | Narrated, spontaneous | 2500 | 12 | 40 | Punctuated | apache-2.0 |
| [SPGISpeech](https://huggingface.co/datasets/kensho/spgispeech) | Financial meetings | Oratory, spontaneous | 4900 | 100 | 100 | Punctuated & Cased | User Agreement |
| [Earnings-22](https://huggingface.co/datasets/revdotcom/earnings22) | Financial meetings | Oratory, spontaneous | 105 | 5 | 5 | Punctuated & Cased | CC-BY-SA-4.0 |
| [AMI](https://huggingface.co/datasets/edinburghcstr/ami) | Meetings | Spontaneous | 78 | 9 | 9 | Punctuated & Cased | CC-BY-4.0 |
For more details on the individual datasets and how models are evaluated to give the ESB score, refer to the [ESB paper](https://arxiv.org/abs/2210.13352).
"""