hf-audio/open_asr_leaderboard · [Average WER Calculation] Drop Common Voice WER.

Hugging Face for Audio org Mar 7, 2024

Hey hey!

Opening this discussion to propose dropping the CV WER from the calculation of the overall Avg. WER. The reason for removing it is as follows:

Common Voice (CV) does not keep its train and test splits consistent from one CV generation to the next. Each new generation of CV essentially resamples the train, validation, and test splits at random. This results in test data leakage for models trained on later generations of CV.

ref: https://discourse.mozilla.org/t/how-are-the-dev-test-train-datasets-split/36381/4
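To make the leakage concern concrete, here is a minimal sketch of how one could check for overlap between the test split of one CV generation and the train split of a later one. The clip IDs and the `leaked_clips` helper are hypothetical and for illustration only; they are not real Common Voice data or part of any leaderboard code.

```python
def leaked_clips(old_test_ids, new_train_ids):
    """Return the clip IDs that are in the old test split and the new train split."""
    return set(old_test_ids) & set(new_train_ids)


# Hypothetical clip IDs (assumption: clips keep a stable ID across generations).
cv_old_test = ["clip_001", "clip_002", "clip_003"]
cv_new_train = ["clip_002", "clip_003", "clip_104"]

leaked = leaked_clips(cv_old_test, cv_new_train)
# Fraction of the old test split that a model trained on the newer generation has seen.
leak_rate = len(leaked) / len(cv_old_test)
```

Any non-empty overlap means the test score on the older generation is no longer a clean held-out measurement for that model.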

I recomputed the WERs, and you can see them in the datasets below:

Without CV: https://huggingface.co/datasets/reach-vb/open-asr-leaderboard-evals-ex-cv/viewer
With CV: https://huggingface.co/datasets/reach-vb/open-asr-leaderboard-evals-all
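For reference, a rough sketch of what the change amounts to, assuming the Avg. WER is an unweighted mean of the per-dataset WERs. The dataset keys and WER values here are made up for illustration and are not taken from the leaderboard; `average_wer` is a hypothetical helper.

```python
from statistics import mean


def average_wer(wer_by_dataset, exclude=()):
    """Unweighted mean of per-dataset WERs, skipping any dataset named in `exclude`."""
    kept = [wer for name, wer in wer_by_dataset.items() if name not in exclude]
    if not kept:
        raise ValueError("no datasets left to average")
    return mean(kept)


# Made-up WERs (in percent) for a single hypothetical model.
scores = {"librispeech_test_clean": 2.0, "ami": 16.0, "common_voice": 12.0}

avg_all = average_wer(scores)                              # with CV
avg_ex_cv = average_wer(scores, exclude={"common_voice"})  # without CV
```

Because the change only removes one term from the mean, the ranking can shift for models whose CV WER is unusually low or high relative to their other scores.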

What do we think? @smajumdar94 @NithinK @sanchit-gandhi

Cheers!
VB

sanchit-gandhi

Mar 8, 2024

+1, agree with the above - it's impossible to guarantee no data leakage if we include the MCV series. I would be in favour of removing this from the overall score, but maybe still keeping it as a column in the leaderboard, with an asterisk to highlight that it's a leaked dataset.

reach-vb

Hugging Face for Audio org Mar 8, 2024

> Would be in favour removing this from the overall score, but maybe still keeping it as a column in the leaderboard with an asterisk to highlight that it's a leaked dataset

I wonder if this will cause some confusion in the community?

sanchit-gandhi

Mar 11, 2024

True - unless it was a "hidden" column that you expanded out upon click, but I think even this is too complex. Happy to exclude it and add a note in the README explaining why the dataset was removed post-release

nithinraok

Mar 11, 2024

I would prefer to remove the MCV numbers altogether from the leaderboard, as we are not using them for the average WER calculation and ranking.