
This directory supplements the paper with more details on how we prepared the data for evaluation, to help replicate our experiments.

Short-form English-only datasets

LibriSpeech

We used the test-clean and test-other splits from the LibriSpeech ASR corpus.

TED-LIUM 3

We used the test split of TED-LIUM Release 3, using the segmented manual transcripts included in the release.

Common Voice 5.1

We downloaded the English subset of Common Voice Corpus 5.1 from the official website.

Artie

We used the Artie bias corpus. This is a subset of the Common Voice dataset.

CallHome & Switchboard

We used the two corpora from LDC2002S09 and LDC2002T43 and followed the eval2000_data_prep.sh script for preprocessing. The wav.scp files can be converted to WAV files with the following bash commands:

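mkdir -p wav
# Each wav.scp line is "<name> <command ending in a pipe>".
while read -r name cmd; do
    echo "$name"
    # Drop the pipe, append the output path, and run the resulting command.
    echo ${cmd/\|/} wav/$name.wav | bash
done < wav.scp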
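To make explicit what the loop above runs for each entry, here is a minimal Python sketch that builds the same command string without executing it. The function name `wav_command` and the sample line are hypothetical, and the line format (utterance name followed by a command that ends in a pipe) is assumed to match the Kaldi-style wav.scp produced by the preprocessing script.

```python
def wav_command(scp_line: str, out_dir: str = "wav") -> str:
    """Turn one wav.scp line into a command that writes a WAV file."""
    # The first field is the utterance name; the rest is the piped command.
    name, cmd = scp_line.strip().split(maxsplit=1)
    # Drop the first pipe character, as ${cmd/\|/} does in the bash loop.
    cmd = cmd.replace("|", "", 1)
    # Append the output path; split/join collapses repeated whitespace,
    # mirroring the unquoted echo in the bash loop.
    return " ".join(f"{cmd} {out_dir}/{name}.wav".split())


# Hypothetical wav.scp line, for illustration only.
line = "en_4156-A sph2pipe -f wav -p -c 1 /data/en_4156.sph |"
print(wav_command(line))
```

Printing the commands first is a cheap way to check the paths before running the conversion over the whole corpus.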

WSJ

We used LDC93S6B and LDC94S13B and followed the s5 recipe to preprocess the dataset.

CORAAL

We used the 231 interviews from CORAAL (v. 2021.07) and used the segmentations from the FairSpeech project.

CHiME-6

We downloaded the CHiME-5 dataset and followed stage 0 of the s5_track1 recipe to create the CHiME-6 dataset, which fixes synchronization. We then used the binaural recordings (*_P??.wav) and the corresponding transcripts.

AMI-IHM, AMI-SDM1

We preprocessed the AMI Corpus by following stages 0 and 2 of the s5b recipe.

Long-form English-only datasets

TED-LIUM 3

To create a long-form transcription dataset from the TED-LIUM 3 dataset, we sliced the audio between the beginning of the first labeled segment and the end of the last labeled segment of each talk, and we used the concatenated text as the label. Below are the timestamps used for slicing each of the 11 TED talks in the test split.

Filename	Begin time (s)	End time (s)
DanBarber_2010	16.09	1116.24
JaneMcGonigal_2010	15.476	1187.61
BillGates_2010	15.861	1656.94
TomWujec_2010U	16.26	402.17
GaryFlake_2010	16.06	367.14
EricMead_2009P	18.434	536.44
MichaelSpecter_2010	16.11	979.312
DanielKahneman_2010	15.8	1199.44
AimeeMullins_2009P	17.82	1296.59
JamesCameron_2010	16.75	1010.65
RobertGupta_2010U	16.8	387.03
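The slicing rule described above can be sketched as follows. This is an illustration under stated assumptions, not the script used for the paper: the `(start, end, text)` tuple layout, the function name `long_form_label`, the example segments, and the 16 kHz sample rate are all hypothetical choices.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start in seconds, end in seconds, text)


def long_form_label(segments: List[Segment], sample_rate: int = 16000):
    """Return the slice boundaries and the concatenated label for one talk."""
    # Order segments by start time so the text is joined chronologically.
    ordered = sorted(segments, key=lambda seg: seg[0])
    begin = ordered[0][0]                 # beginning of the first labeled segment
    end = max(seg[1] for seg in ordered)  # end of the last labeled segment
    # Sample indices to cut the waveform at (assumes the given sample rate).
    sample_span = (round(begin * sample_rate), round(end * sample_rate))
    text = " ".join(seg[2].strip() for seg in ordered)
    return begin, end, sample_span, text


# Hypothetical segments; only the two boundary times come from the table.
segments = [
    (16.09, 21.5, "first labeled segment"),
    (1100.0, 1116.24, "last labeled segment"),
]
begin, end, sample_span, text = long_form_label(segments)
print(begin, end, sample_span, text)
```

With a waveform loaded as an array, `audio[sample_span[0]:sample_span[1]]` would then give the long-form clip for that talk.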

Meanwhile

This dataset consists of 64 segments from The Late Show with Stephen Colbert. The YouTube video ID, start and end timestamps, and the labels can be found in meanwhile.json. The labels are collected from the closed-caption data for each video and corrected with manual inspection.

Rev16

We use a subset of 16 files from the 30 podcast episodes in Rev.AI's Podcast Transcription Benchmark. We found multiple cases where a significant portion of the audio and the labels did not match, mostly in the parts introducing the sponsors, so we selected 16 episodes that do not have this error. Their "file number" values are:

3 4 9 10 11 14 17 18 20 21 23 24 26 27 29 32
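For scripting, the file numbers above can be kept as a constant and used to filter a local copy of the benchmark. The sketch below assumes a hypothetical mapping from file number to audio path; the benchmark's actual file naming is not specified here, so the paths are placeholders.

```python
# File numbers of the 16 selected episodes, copied from the list above.
REV16_FILE_NUMBERS = {3, 4, 9, 10, 11, 14, 17, 18, 20, 21, 23, 24, 26, 27, 29, 32}


def select_rev16(paths_by_number: dict) -> dict:
    """Keep only the selected episodes, ordered by file number."""
    return {
        number: paths_by_number[number]
        for number in sorted(REV16_FILE_NUMBERS)
        if number in paths_by_number
    }


# Placeholder paths for illustration only.
all_episodes = {number: f"episode_{number}.wav" for number in range(1, 33)}
rev16 = select_rev16(all_episodes)
print(len(rev16))
```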

Kincaid46

This dataset consists of 46 audio files and the corresponding transcripts compiled in the blog article Which automatic transcription service is the most accurate - 2018 by Jason Kincaid. We used the 46 audio files and reference transcripts from the Airtable widget in the article.

For the human transcription benchmark in the paper, we use a subset of 25 examples from this data, whose "Ref ID" values are:

2 4 5 8 9 10 12 13 14 16 19 21 23 25 26 28 29 30 33 35 36 37 42 43 45

Earnings-21, Earnings-22

For these datasets, we used the files available in the speech-datasets repository, as of their 202206 version.

CORAAL

We used the 231 interviews from CORAAL (v. 2021.07) and used the full-length interview files and transcripts.

Multilingual datasets

Multilingual LibriSpeech

We used the test splits from each language in the Multilingual LibriSpeech (MLS) corpus.

Fleurs

We collected audio files and transcripts using the implementation available as HuggingFace datasets. To use it as a translation dataset, we matched the numerical utterance IDs to find the corresponding transcript in English.
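The ID matching described above might look like the sketch below. It works on plain dictionaries so that it runs without downloading anything; the field names `id` and `transcription` are assumptions about the dataset rows, and `pair_with_english` is a hypothetical helper, not part of any library.

```python
from typing import Dict, Iterable, List, Tuple


def pair_with_english(
    source_rows: Iterable[Dict],
    english_rows: Iterable[Dict],
    id_key: str = "id",
    text_key: str = "transcription",
) -> List[Tuple[Dict, str]]:
    """Attach the English transcript with the same utterance ID to each row."""
    # Index the English transcripts by their numerical utterance ID.
    english_by_id = {row[id_key]: row[text_key] for row in english_rows}
    pairs = []
    for row in source_rows:
        target = english_by_id.get(row[id_key])
        # Skip source utterances that have no English counterpart.
        if target is not None:
            pairs.append((row, target))
    return pairs


# Toy rows for illustration only.
source = [
    {"id": 7, "transcription": "bonjour"},
    {"id": 9, "transcription": "merci"},
]
english = [{"id": 7, "transcription": "hello"}]
for row, target in pair_with_english(source, english):
    print(row["id"], row["transcription"], "->", target)
```

Building the lookup table once keeps the matching linear in the number of rows, and several source recordings that share an ID all receive the same English reference.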

VoxPopuli

We used the get_asr_data.py script from the official repository to collect the ASR data in 14 languages.

Common Voice 9

We downloaded the Common Voice Corpus 9 from the official website.

CoVOST 2

We collected the X-into-English data using the official repository.