Late fusion ensembles for speech recognition on diverse input audio representations
Abstract
We explore diverse representations of speech audio and their effect on the performance of a late fusion ensemble of E-Branchformer models applied to the Automatic Speech Recognition (ASR) task. Although ensemble methods are generally known to improve system performance, including in speech recognition, it is interesting to explore how ensembles of complex state-of-the-art models, such as medium-sized and large E-Branchformers, cope in this setting when their base models are trained on diverse representations of the input speech audio. The results are evaluated on four widely used benchmark datasets: LibriSpeech, AISHELL, GigaSpeech, and TED-LIUM v2, and show that improvements of 1%-14% can still be achieved over state-of-the-art models trained using comparable techniques on these datasets. A noteworthy observation is that such an ensemble offers improvements even with the use of language models, although the gap is closing.
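To illustrate the core idea of late fusion, here is a minimal sketch that combines per-frame token log-probabilities from several models by weighted averaging. This is one common form of late fusion, not necessarily the exact combination scheme used in the paper; the function name and toy data are illustrative assumptions.

```python
import numpy as np

def late_fusion(log_probs_list, weights=None):
    """Fuse per-frame token log-probabilities from several ASR models
    by weighted averaging (a simple late fusion scheme; illustrative,
    not the paper's exact method)."""
    n = len(log_probs_list)
    if weights is None:
        weights = [1.0 / n] * n  # uniform weights by default
    stacked = np.stack(log_probs_list)               # (n_models, T, vocab)
    fused = np.tensordot(weights, stacked, axes=1)   # (T, vocab)
    return fused

# Toy example: two "models", one frame, vocabulary of three tokens.
a = np.log(np.array([[0.7, 0.2, 0.1]]))
b = np.log(np.array([[0.5, 0.4, 0.1]]))
fused = late_fusion([a, b])
best = fused.argmax(axis=-1)  # fused token choice per frame
```

In a real system the fused scores would feed a beam-search decoder (and optionally a language model) rather than a frame-wise argmax.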