from textwrap import dedent

from iso639 import Lang

BANNER_TEXT = """
<div style="text-align: center;">
<h1><a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit Benchmarks</a></h1>
</div>
"""
INTRO_LABEL = """We present comprehensive benchmarks for WhisperKit, our on-device ASR solution, compared against a reference implementation. These benchmarks aim to help developers and enterprises make informed decisions when choosing optimized or compressed variants of machine learning models for production use. Show more."""
INTRO_TEXT = """
<h3 style="display: flex;
justify-content: center;
align-items: center;
"></h3>
\n📈 Key Metrics:
Word Error Rate (WER) (⬇️): The percentage of words incorrectly transcribed. Lower is better.
Quality of Inference (QoI) (⬆️): Percentage of examples where WhisperKit performs no worse than the reference model. Higher is better.
Tokens per Second (⬆️): The number of output tokens generated per second. Higher is better.
Speed (⬆️): Input audio seconds transcribed per second. Higher is better.
🎯 WhisperKit is evaluated across different datasets, with a focus on per-example no-regressions (QoI) and overall accuracy (WER).
\n💻 Our benchmarks include:
Reference: <a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a> (OpenAI's Whisper API)
On-device: <a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit</a> (various versions and optimizations)
ℹ️ Reference Implementation:
<a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a> sets the reference standard. We assume it uses the equivalent of openai/whisper-large-v2 in float16 precision, along with additional undisclosed optimizations from OpenAI. As of 02/29/24, it costs $0.36 per hour of audio and has a 25MB file size limit per request.
\n🔍 We use two primary datasets:
<a href='https://huggingface.co/datasets/argmaxinc/librispeech'>LibriSpeech</a>: ~5 hours of short English audio clips
<a href='https://huggingface.co/datasets/argmaxinc/earnings22'>Earnings22</a>: ~120 hours of English audio from earnings calls
🌐 Multilingual Benchmarks:
These benchmarks aim to demonstrate WhisperKit's capabilities across diverse languages, helping developers assess its suitability for multilingual applications.
\nDataset:
<a href='https://huggingface.co/datasets/argmaxinc/whisperkit-evals-multilingual'>Common Voice 17.0</a>: Short-form audio files (<30s/clip), with a maximum of 400 samples per language, from Common Voice 17.0. The test set covers a wide range of languages to test the model's versatility.
\nMetrics:
Average WER: Provides an overall measure of model performance across all languages.
Language-specific WER: Allows for detailed analysis of model performance for each supported language.
Language Detection Accuracy: Measured using a confusion matrix, showing the model's ability to identify the correct language.
Results are shown for both forced (correct language given as input) and unforced (model detects language) scenarios.
🔄 Results are periodically updated using our automated evaluation pipeline on Apple Silicon Macs.
\n🛠️ Developers can use <a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit</a> to reproduce these results or run evaluations on their own custom datasets.
🔗 Links:
- <a href='https://github.com/argmaxinc/WhisperKit'>WhisperKit</a>
- <a href='https://github.com/argmaxinc/whisperkittools'>whisperkittools</a>
- <a href='https://huggingface.co/datasets/argmaxinc/librispeech'>LibriSpeech</a>
- <a href='https://huggingface.co/datasets/argmaxinc/earnings22'>Earnings22</a>
- <a href='https://huggingface.co/datasets/argmaxinc/whisperkit-evals-multilingual'>Common Voice 17.0</a>
- <a href='https://platform.openai.com/docs/guides/speech-to-text'>WhisperOpenAIAPI</a>
"""
METHODOLOGY_TEXT = dedent(
    """
    # Methodology
    ## Overview
    WhisperKit Benchmarks is the one-stop shop for on-device performance and quality testing of WhisperKit models across supported devices, OS versions and audio datasets.
    ## Metrics
    - **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit latency for transcribing that audio. A speed factor of N means N seconds of input audio were transcribed in 1 second.
    - **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.
      - This metric varies with input data because the pace of speech changes the text decoder's share of overall latency. It should not be confused with the reciprocal of the text decoder latency, which is constant across input files.
    - **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
    - **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit performs no worse than the reference model.
      - This metric does not capture improvements to the reference. It only measures potential regressions.
    - **Multilingual results**: Separated into "language hinted" and "language predicted" categories to evaluate performance with and without prior knowledge of the input language.
    ## Data
    - **Short-form**: 5 hours of English audiobook clips with 30s/clip comprising the [librispeech test set](https://huggingface.co/datasets/argmaxinc/librispeech). Proxy for average streaming performance.
    - **Long-form**: 12 hours of earnings call recordings with ~1hr/clip in English with various accents. Built by randomly selecting 10% of the [earnings22 test set](https://huggingface.co/datasets/argmaxinc/earnings22-12hours). Proxy for average from-file performance.
    - Full datasets are used for English Quality tests and random 10-minute subsets are used for Performance tests.
    - **Multilingual**: Max 400 samples per language with <30s/clip from [Common Voice 17.0 Test Set](https://huggingface.co/datasets/argmaxinc/common_voice_17_0-argmax_subset-400). Common Voice covers 77 of the 99 languages supported by Whisper.
    ## Performance Measurement
    1. On-device testing is conducted with [WhisperKit Regression Test Automations](https://github.com/argmaxinc/WhisperKit/blob/main/BENCHMARKS.md) on iPhones, iPads, and Macs, across different iOS and macOS versions.
    2. Performance is recorded on the 10-minute short- and long-form datasets described above.
    3. Quality metrics are recorded on full datasets on Apple M2 Ultra Mac Studios to allow fast processing of many configurations and to provide a consistent, high-performance baseline for all evaluations displayed in the English Quality tab.
    4. Quality is also sanity-checked on 10-minute datasets in order to catch potential correctness regressions across different device and OS combinations despite running the same version of WhisperKit.
    5. Results are aggregated and presented in the dashboard, allowing for easy comparison and analysis.
    ## Dashboard Features
    - Performance: Interactive filtering by model, device, OS, and performance metrics
    - Timeline: Visualizations of performance trends
    - English Quality: English transcription quality on short- and long-form audio
    - Multilingual Quality: Multilingual (77 languages) transcription quality on short-form audio with and without language prediction
    - Device Support: Matrix of supported device, OS and model version combinations. Unsupported combinations are marked with :warning:.

    This methodology ensures a comprehensive and fair evaluation of speech recognition models supported by WhisperKit across a wide range of scenarios and use cases.
    """
)
PERFORMANCE_TEXT = dedent(
    """
    ## Metrics
    - **Speed factor** (⬆️): Computed as the ratio of input audio length to end-to-end WhisperKit latency for transcribing that audio. A speed factor of N means N seconds of input audio were transcribed in 1 second.
    - **Tok/s (Tokens per second)** (⬆️): Total number of text decoder forward passes divided by the end-to-end processing time.
    ## Data
    - **Short-form**: 5 hours of English audiobook clips with 30s/clip comprising the [librispeech test set](https://huggingface.co/datasets/argmaxinc/librispeech).
    - **Long-form**: 12 hours of earnings call recordings with ~1hr/clip in English with various accents. Built by randomly selecting 10% of the [earnings22 test set](https://huggingface.co/datasets/argmaxinc/earnings22-12hours).
    """
)
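
# Illustrative sketch, not part of the WhisperKit tooling: the two performance
# metrics described above, written out as plain arithmetic. The function and
# parameter names are hypothetical; only the ratios themselves come from the
# metric definitions in PERFORMANCE_TEXT.
def speed_factor(audio_seconds: float, latency_seconds: float) -> float:
    """Seconds of input audio transcribed per second of end-to-end latency."""
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return audio_seconds / latency_seconds


def tokens_per_second(decoder_forward_passes: int, latency_seconds: float) -> float:
    """Text decoder forward passes divided by end-to-end processing time."""
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return decoder_forward_passes / latency_seconds


# Example: 600 s of audio transcribed in 20 s end to end is a speed factor of
# 30; if the text decoder ran 1500 forward passes in that time, Tok/s is 75.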
QUALITY_TEXT = dedent(
    """
    ## Metrics
    - **WER (Word Error Rate)** (⬇️): The ratio of words incorrectly transcribed when comparing the model's output to reference transcriptions, with lower values indicating better accuracy.
    - **QoI (Quality of Inference)** (⬆️): The ratio of examples where WhisperKit performs no worse than the reference model.
      - This metric does not capture improvements to the reference. It only measures potential regressions.
    """
)
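
# Illustrative sketch, not the evaluation pipeline's actual code. It assumes
# the conventional WER definition (word-level edit distance divided by the
# number of reference words, with no text normalization) and assumes that
# "no worse" in QoI is judged by comparing per-example WER. All names below
# are hypothetical.
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # At the start of iteration i, prev[j] is the edit distance between the
    # first i - 1 reference words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def quality_of_inference(whisperkit_wers, reference_wers) -> float:
    """Fraction of examples where WhisperKit's WER is no higher than the reference's."""
    if not whisperkit_wers or len(whisperkit_wers) != len(reference_wers):
        raise ValueError("expected two non-empty sequences of equal length")
    no_regressions = sum(w <= r for w, r in zip(whisperkit_wers, reference_wers))
    return no_regressions / len(whisperkit_wers)


# Note that an example where WhisperKit beats the reference counts the same as
# a tie, which is why QoI only measures potential regressions.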
COL_NAMES = {
    "model.model_version": "Model",
    "device.product_name": "Device",
    "device.os": "OS",
    "average_wer": "Average WER",
    "qoi": "QoI",
    "speed": "Speed",
    "tokens_per_second": "Tok / s",
    "model": "Model",
    "device": "Device",
    "os": "OS",
    "english_wer": "English WER",
    "multilingual_wer": "Multilingual WER",
}
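
# Illustrative sketch of how a mapping like COL_NAMES could be applied to raw
# result records before display. `rename_columns` is a hypothetical helper,
# not part of the dashboard code. Keys absent from the mapping are kept as-is.
# Because several raw keys can map to the same display name (for example
# "model.model_version" and "model" both map to "Model"), a record containing
# both would keep only the value of whichever key comes last.
def rename_columns(records, col_names):
    """Return copies of `records` with keys replaced by their display names."""
    return [
        {col_names.get(key, key): value for key, value in record.items()}
        for record in records
    ]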
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""@misc{whisperkit-argmax,
title = {WhisperKit},
author = {Argmax, Inc.},
year = {2024},
URL = {https://github.com/argmaxinc/WhisperKit}
}"""
HEADER = """<div align="center">
<div style="position: relative;">
<img
src=""
style="display:block;width:7%;height:auto;"
/>
</div>
</div>"""
EARNINGS22_URL = (
    "https://huggingface.co/datasets/argmaxinc/earnings22-debug/resolve/main/{0}"
)
LIBRISPEECH_URL = (
    "https://huggingface.co/datasets/argmaxinc/librispeech-debug/resolve/main/{0}"
)
AUDIO_URL = (
    "https://huggingface.co/datasets/argmaxinc/whisperkit-test-data/resolve/main/"
)
WHISPER_OPEN_AI_LINK = "https://huggingface.co/datasets/argmaxinc/whisperkit-evals/tree/main/WhisperKit/{}/{}"
BASE_WHISPERKIT_BENCHMARK_URL = "https://huggingface.co/datasets/argmaxinc/whisperkit-evals-dataset/blob/main/benchmark_data"
AVAILABLE_LANGUAGES = [
    "af",
    "am",
    "ar",
    "as",
    "az",
    "ba",
    "be",
    "bg",
    "bn",
    "br",
    "ca",
    "cs",
    "cy",
    "da",
    "de",
    "el",
    "en",
    "es",
    "et",
    "eu",
    "fa",
    "fi",
    "fr",
    "gl",
    "ha",
    "he",
    "hi",
    "hu",
    "hy",
    "id",
    "it",
    "ja",
    "ka",
    "kk",
    "ko",
    "lo",
    "lt",
    "lv",
    "mk",
    "ml",
    "mn",
    "mr",
    "mt",
    "ne",
    "nl",
    "nn",
    "oc",
    "pa",
    "pl",
    "ps",
    "pt",
    "ro",
    "ru",
    "sk",
    "sl",
    "sq",
    "sr",
    "sv",
    "sw",
    "ta",
    "te",
    "th",
    "tk",
    "tr",
    "tt",
    "uk",
    "ur",
    "uz",
    "vi",
    "yi",
    "yo",
    "yue",
    "zh",
]
LANGUAGE_MAP = {lang: Lang(lang).name for lang in AVAILABLE_LANGUAGES}
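
# Illustrative sketch: building a display label from a code-to-name mapping
# such as LANGUAGE_MAP, falling back to the raw code when a language is not in
# the mapping. `language_label` is a hypothetical helper, not part of the
# dashboard code.
def language_label(code, language_map):
    """Return 'Name (code)' for known codes and the bare code otherwise."""
    name = language_map.get(code)
    return f"{name} ({code})" if name else code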