File size: 1,742 Bytes
9da7e2a fa8d49f 3cb1ad3 9da7e2a fa8d49f 9da7e2a 004e907 9da7e2a 004e907 9da7e2a 004e907 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 |
---
language:
- de
license: apache-2.0
tags:
- automatic-speech-recognition
- de
- hf-asr-leaderboard
- mozilla-foundation/common_voice_7_0
- robust-speech-event
datasets:
- mozilla-foundation/common_voice_7_0
model-index:
- name: Wav2Vec2-Large-XLSR-53-German-GPT2
results:
- task:
name: Automatic Speech Recognition
type: automatic-speech-recognition
dataset:
name: Common Voice 7
type: mozilla-foundation/common_voice_7_0
args: de
metrics:
- name: Test WER
type: wer
value: 10.02
- name: Test CER
type: cer
value: 4.7
---
# Wav2Vec2-Large-XLSR-53-German-GPT2
This is an encoder-decoder model for automatic speech recognition trained on on the
MOZILLA-FOUNDATION/COMMON_VOICE_7_0 - DE dataset. The encoder was initialized from
[jonatasgrosman/wav2vec2-large-xlsr-53-german](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-german) and
the decoder from [dbmdz/german-gpt2](https://huggingface.co/dbmdz/german-gpt2).
It was trained using a two step process:
* fine-tuning only the cross-attention weights and the decoder using the pre-computed outputs of the Wav2Vec-Modell
* relatively fast training
* also works on small GPU (eg. 8 GB)
* but may take a lot of disk space
* should already yield decent results
* fine-tuning the model end-to-end
* much slower
* needs a bigger GPU
There is also one trick, which seemed to improve performance significantly: adding position embeddings to the
encoder outputs and initializing them with the pre-trained position embeddings of the GPT2 model (See `eval.py`).
The training notebooks are still early drafts. Also results can probably improved a lot by using for example a learning
rate schedule. |