Wav2Vec2-Large-XLSR-53-German-GPT2
This is an encoder-decoder model for automatic speech recognition, trained on the German split (de) of the mozilla-foundation/common_voice_7_0 dataset. The encoder was initialized from jonatasgrosman/wav2vec2-large-xlsr-53-german and the decoder from dbmdz/german-gpt2.
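Such a combination can be assembled with the transformers library's SpeechEncoderDecoderModel class. The following is a minimal sketch, not necessarily the exact setup used for this model:

```python
from transformers import SpeechEncoderDecoderModel

# Combine a pre-trained speech encoder with a pre-trained language-model
# decoder; the cross-attention weights between them are newly initialized.
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "jonatasgrosman/wav2vec2-large-xlsr-53-german",
    "dbmdz/german-gpt2",
)
```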
It was trained using a two-step process:
- fine-tuning only the cross-attention weights and the decoder, using the pre-computed outputs of the Wav2Vec2 model (a sketch follows this list)
  - relatively fast training
  - also works on a small GPU (e.g. 8 GB)
  - but may take a lot of disk space
  - should already yield decent results
- fine-tuning the model end-to-end
  - much slower
  - needs a bigger GPU
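One simple way to approximate the first step is to freeze the encoder so that only the decoder (including the newly initialized cross-attention weights) receives gradient updates. This is only an illustrative sketch; the actual training scripts pre-compute the encoder outputs instead of running the encoder on every step:

```python
from transformers import SpeechEncoderDecoderModel

# Hypothetical: start from the freshly combined encoder-decoder model.
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "jonatasgrosman/wav2vec2-large-xlsr-53-german", "dbmdz/german-gpt2"
)

# Freeze the encoder so only the decoder (which contains the
# cross-attention weights) remains trainable.
for param in model.encoder.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```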
There is also one trick that seemed to improve performance significantly: adding position embeddings to the encoder outputs and initializing them with the pre-trained position embeddings of the GPT2 model (see eval.py).
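A minimal sketch of this trick might look as follows, assuming the embeddings are added to the encoder outputs after they have been projected to the decoder's hidden size (the exact placement in this model may differ; see eval.py):

```python
import torch
from torch import nn
from transformers import GPT2LMHeadModel

decoder = GPT2LMHeadModel.from_pretrained("dbmdz/german-gpt2")

# GPT2 keeps its learned position embeddings in transformer.wpe.
gpt2_pos = decoder.transformer.wpe
enc_pos_emb = nn.Embedding(gpt2_pos.num_embeddings, gpt2_pos.embedding_dim)
enc_pos_emb.weight.data.copy_(gpt2_pos.weight.data)  # initialize from GPT2

def add_position_embeddings(encoder_hidden_states: torch.Tensor) -> torch.Tensor:
    # encoder_hidden_states: (batch, seq_len, hidden); assumed to already be
    # projected to the decoder's hidden size (768 for dbmdz/german-gpt2).
    seq_len = encoder_hidden_states.size(1)
    positions = torch.arange(seq_len, device=encoder_hidden_states.device)
    return encoder_hidden_states + enc_pos_emb(positions)
```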
The training notebooks are still early drafts. Results can probably be improved considerably, for example by using a learning rate schedule.
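For instance, transformers provides a linear schedule with warmup. This is a hypothetical sketch of how one could be attached to an optimizer; the step counts are placeholders, not the settings used for this model:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# `model` is any trainable model, e.g. the encoder-decoder assembled above.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10_000
)
# In the training loop: loss.backward(); optimizer.step(); scheduler.step()
```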
Evaluation results
- Test WER on Common Voice 7 (self-reported): 10.02
- Test CER on Common Voice 7 (self-reported): 4.70
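For reference, WER and CER of this kind can be computed with the jiwer library. A toy sketch with made-up strings, not the actual Common Voice 7 evaluation data:

```python
import jiwer

references = ["das ist ein beispiel"]   # ground-truth transcripts
predictions = ["das ist ein beispiel"]  # model outputs

print("WER:", jiwer.wer(references, predictions))
print("CER:", jiwer.cer(references, predictions))
```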