[Average WER Calculation] Drop Common Voice WER.

#14
by reach-vb HF staff - opened
Hugging Face for Audio org

Hey hey!

Opening this discussion to propose dropping CV WER from the overall calculation of the Avg. WER. The reason for removing it is as follows:

Common Voice (CV) does not maintain the integrity of train and test splits across CV generations. Each new generation of CV essentially samples the train, test, and validation splits randomly, so a clip that sits in the test split of one generation can end up in the train split of another. This results in test data leakage for models trained on later generations of CV.

ref: https://discourse.mozilla.org/t/how-are-the-dev-test-train-datasets-split/36381/4
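To make the leakage concern concrete, here is a minimal sketch of how one might check for overlap between the test split of one CV generation and the train split of another. The clip IDs and the `leaked_ids` helper are hypothetical and purely illustrative; a real check would load the identifiers from the actual split files of the two releases.

```python
def leaked_ids(train_ids, test_ids):
    """Return the identifiers that appear in both the train and test splits."""
    return set(train_ids) & set(test_ids)

# Hypothetical clip identifiers, not taken from any real CV release.
older_test = {"clip_001", "clip_002", "clip_003"}
newer_train = {"clip_002", "clip_003", "clip_004", "clip_005"}

overlap = leaked_ids(newer_train, older_test)
# Fraction of the older test split that a model trained on the newer
# train split would already have seen.
leak_rate = len(overlap) / len(older_test)
print(sorted(overlap), round(leak_rate, 2))
```

Any non-empty overlap means the test score on that split can no longer be treated as a held-out measurement for models trained on the other generation.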

I recomputed the WERs, and you can see them in the datasets below:

Without CV: https://huggingface.co/datasets/reach-vb/open-asr-leaderboard-evals-ex-cv/viewer
With CV: https://huggingface.co/datasets/reach-vb/open-asr-leaderboard-evals-all
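For illustration, a minimal sketch of what recomputing the average with and without CV could look like. It assumes the Avg. WER is an unweighted mean over per-dataset WERs, which may not match the leaderboard's exact method; the model names, dataset keys, and numbers are made up and are not leaderboard results.

```python
from statistics import mean

# Hypothetical per-dataset WERs (in %) for two made-up models.
scores = {
    "model_a": {"librispeech_clean": 2.0, "tedlium": 4.0, "common_voice": 12.0},
    "model_b": {"librispeech_clean": 3.0, "tedlium": 5.0, "common_voice": 7.0},
}

def average_wer(per_dataset, exclude=()):
    """Unweighted mean of per-dataset WER, skipping any excluded datasets."""
    kept = [wer for name, wer in per_dataset.items() if name not in exclude]
    if not kept:
        raise ValueError("no datasets left after exclusion")
    return mean(kept)

def rank(all_scores, exclude=()):
    """Model names sorted by average WER, best (lowest) first."""
    return sorted(all_scores, key=lambda m: average_wer(all_scores[m], exclude))

ranking_with_cv = rank(scores)
ranking_without_cv = rank(scores, exclude=("common_voice",))
print(ranking_with_cv, ranking_without_cv)
```

In this toy example the ranking flips once the CV column is excluded, which is the kind of effect worth checking in the two datasets linked above.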

What do we think? @smajumdar94 @NithinK @sanchit-gandhi

Cheers!
VB

Hugging Face for Audio org

+1, agree with the above - it's impossible to guarantee no data leakage if we include the MCV series. Would be in favour of removing this from the overall score, but maybe still keeping it as a column in the leaderboard with an asterisk to highlight that it's a leaked dataset.

Hugging Face for Audio org

Would be in favour of removing this from the overall score, but maybe still keeping it as a column in the leaderboard with an asterisk to highlight that it's a leaked dataset

I wonder if this will cause some confusion in the community?

Hugging Face for Audio org

True - unless it was a "hidden" column that you expanded out upon click, but I think even this is too complex. Happy to exclude it and add a note in the README explaining why the dataset was removed post-release

I would prefer to remove the MCV numbers altogether from the leaderboard, as we are not using them for the average WER calculation and ranking.
