metadata

language:
  - de
license: apache-2.0
tags:
  - automatic-speech-recognition
  - mozilla-foundation/common_voice_7_0
  - de
datasets:
  - mozilla-foundation/common_voice_7_0
model-index:
  - name: Wav2Vec2-Large-XLSR-53-German-GPT2
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Common Voice 7
          type: mozilla-foundation/common_voice_7_0
          args: de
        metrics:
          - name: Test WER
            type: wer
            value: 10.02
          - name: Test CER
            type: cer
            value: 4.7

Wav2Vec2-Large-XLSR-53-German-GPT2

This is an encoder-decoder model for automatic speech recognition trained on on the MOZILLA-FOUNDATION/COMMON_VOICE_7_0 - DE dataset. The encoder was initialized from jonatasgrosman/wav2vec2-large-xlsr-53-german and the decoder from dbmdz/german-gpt2.

It was trained using a two step process:

fine-tuning only the cross-attention weights and the decoder using the pre-computed outputs of the Wav2Vec-Modell
fine-tuning the model end-to-end

There is also one trick, which seemed to improve performance significantly: adding position embeddings to the encoder outputs and initializing them with the pre-trained position embeddings of the GPT2 model (See eval.py).