Wav2Vec2-Large-XLSR-53-German-GPT2

This is an encoder-decoder model for automatic speech recognition trained on the MOZILLA-FOUNDATION/COMMON_VOICE_7_0 - DE dataset. The encoder was initialized from jonatasgrosman/wav2vec2-large-xlsr-53-german and the decoder from dbmdz/german-gpt2.
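
For reference, a minimal sketch of how such an encoder-decoder model can be assembled with the transformers SpeechEncoderDecoderModel class (the token-id settings below are assumptions for illustration, not this card's verified configuration):

```python
from transformers import SpeechEncoderDecoderModel

# Combine a pre-trained speech encoder with a pre-trained text decoder;
# cross-attention weights are added to the decoder and randomly initialized.
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "jonatasgrosman/wav2vec2-large-xlsr-53-german", "dbmdz/german-gpt2"
)

# Assumed generation settings (GPT2 has no dedicated pad token, so the eos
# token is reused here); the actual checkpoint may be configured differently.
model.config.decoder_start_token_id = model.decoder.config.bos_token_id
model.config.pad_token_id = model.decoder.config.eos_token_id
```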

It was trained using a two-step process:

  • fine-tuning only the cross-attention weights and the decoder, using the pre-computed outputs of the Wav2Vec2 model (see the sketch after this list)
    • relatively fast training
    • also works on a small GPU (e.g. 8 GB)
    • but may take a lot of disk space
    • should already yield decent results
  • fine-tuning the model end-to-end
    • much slower
    • needs a bigger GPU
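
A hedged sketch of step one, assuming the typical setup rather than the exact code from the training notebooks: the encoder is frozen, its outputs are pre-computed once and cached to disk, and only the decoder, including its cross-attention weights, receives gradients.

```python
import torch
from transformers import SpeechEncoderDecoderModel

model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "jonatasgrosman/wav2vec2-large-xlsr-53-german", "dbmdz/german-gpt2"
)

# Freeze the encoder so that only the decoder (and its cross-attention
# weights) is updated in step one.
for param in model.encoder.parameters():
    param.requires_grad = False

# Pre-compute the encoder outputs once per training example and cache them,
# trading disk space for speed (dummy audio stands in for a real clip here).
input_values = torch.randn(1, 16_000)  # one second of 16 kHz audio
with torch.no_grad():
    encoder_hidden_states = model.encoder(input_values).last_hidden_state
torch.save(encoder_hidden_states, "example_0_encoder_outputs.pt")
```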

There is also one trick that seemed to improve performance significantly: adding position embeddings to the encoder outputs and initializing them with the pre-trained position embeddings of the GPT2 model (see eval.py).
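
A minimal sketch of this trick, assuming learned absolute position embeddings copied from GPT2's `wpe` table (see eval.py for the actual implementation). In a SpeechEncoderDecoderModel the addition would happen on hidden states of the decoder's size, i.e. after any encoder-to-decoder projection:

```python
import torch
from transformers import GPT2Model

gpt2 = GPT2Model.from_pretrained("dbmdz/german-gpt2")

# New position embedding table, initialized from GPT2's pre-trained
# position embeddings (`wpe`).
pos_emb = torch.nn.Embedding(*gpt2.wpe.weight.shape)
pos_emb.weight.data.copy_(gpt2.wpe.weight.data)

def add_position_embeddings(encoder_hidden_states: torch.Tensor) -> torch.Tensor:
    """Add position information to the encoder outputs before cross-attention.

    Assumes the hidden states already have the decoder's hidden size,
    i.e. this runs after any encoder-to-decoder projection.
    """
    positions = torch.arange(
        encoder_hidden_states.size(1), device=encoder_hidden_states.device
    )
    return encoder_hidden_states + pos_emb(positions)
```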

The training notebooks are still early drafts. Results can probably also be improved substantially, for example by using a learning rate schedule.
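
As an illustration of that suggestion, transformers ships ready-made schedules such as linear warm-up and decay (the hyperparameters below are placeholders, not values used for this model):

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Placeholder parameters and step counts for illustration only; in practice
# these would be the model's parameters and tuned values.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10_000
)

# During training, call scheduler.step() after each optimizer.step().
```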
