sadrasabouri committed on
Commit a0b45c2
1 Parent(s): 6671ee0

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -51,11 +51,11 @@ model-index:
 The base model fine-tuned on 108 hours of Commonvoice on 16kHz sampled speech audio. When using the model
 make sure that your speech input is also sampled at 16Khz.
 
-#[Paper](https://arxiv.org/abs/2006.11477)
+# [Paper](https://arxiv.org/abs/2006.11477)
 
-#Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli
+# Authors: Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli
 
-#**Abstract**
+# **Abstract**
 
 #We show for the first time that learning powerful representations from speech audio alone followed by fine-tuning on transcribed speech can #outperform the best semi-supervised methods while being conceptually simpler. wav2vec 2.0 masks the speech input in the latent space and #solves a contrastive task defined over a quantization of the latent representations which are jointly learned. Experiments using all #labeled data of Librispeech achieve 1.8/3.3 WER on the clean/other test sets. When lowering the amount of labeled data to one hour, wav2vec #2.0 outperforms the previous state of the art on the 100 hour subset while using 100 times less labeled data. Using just ten minutes of #labeled data and pre-training on 53k hours of unlabeled data still achieves 4.8/8.2 WER. This demonstrates the feasibility of speech #recognition with limited amounts of labeled data.
 
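The diff leaves the README's usage note unchanged: input audio must be sampled at 16 kHz. As a minimal sketch of what that requirement looks like in practice (not part of this commit), the snippet below loads a clip with `torchaudio`, resamples it to 16 kHz when needed, and transcribes it with the standard `transformers` Wav2Vec2 CTC API. The repository ID `your-username/wav2vec2-commonvoice` and the file `sample.wav` are placeholders, not names taken from this repository.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Placeholder repository ID -- substitute the actual model ID of this checkpoint.
MODEL_ID = "your-username/wav2vec2-commonvoice"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Load an arbitrary clip (placeholder path) and keep only the first channel.
waveform, sample_rate = torchaudio.load("sample.wav")
speech = waveform[0]

# Resample to the 16 kHz rate the model expects, as the README instructs.
if sample_rate != 16_000:
    speech = torchaudio.functional.resample(speech, orig_freq=sample_rate, new_freq=16_000)

# Run CTC inference and decode the most likely token per frame into text.
inputs = processor(speech.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])
```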