|
--- |
|
language: |
|
- de |
|
license: apache-2.0 |
|
tags: |
|
- automatic-speech-recognition |
|
- de |
|
- hf-asr-leaderboard |
|
- mozilla-foundation/common_voice_7_0 |
|
- robust-speech-event |
|
datasets: |
|
- mozilla-foundation/common_voice_7_0 |
|
model-index: |
|
- name: Wav2Vec2-Large-XLSR-53-German-GPT2 |
|
results: |
|
- task: |
|
name: Automatic Speech Recognition |
|
type: automatic-speech-recognition |
|
dataset: |
|
name: Common Voice 7 |
|
type: mozilla-foundation/common_voice_7_0 |
|
args: de |
|
metrics: |
|
- name: Test WER |
|
type: wer |
|
value: 10.02 |
|
- name: Test CER |
|
type: cer |
|
value: 4.7 |
|
--- |
|
|
|
# Wav2Vec2-Large-XLSR-53-German-GPT2 |
|
|
|
This is an encoder-decoder model for automatic speech recognition trained on on the |
|
MOZILLA-FOUNDATION/COMMON_VOICE_7_0 - DE dataset. The encoder was initialized from |
|
[jonatasgrosman/wav2vec2-large-xlsr-53-german](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-german) and |
|
the decoder from [dbmdz/german-gpt2](https://huggingface.co/dbmdz/german-gpt2). |
|
|
|
It was trained using a two step process: |
|
* fine-tuning only the cross-attention weights and the decoder using the pre-computed outputs of the Wav2Vec-Modell |
|
* relatively fast training |
|
* also works on small GPU (eg. 8 GB) |
|
* but may take a lot of disk space |
|
* should already yield decent results |
|
* fine-tuning the model end-to-end |
|
* much slower |
|
* needs a bigger GPU |
|
|
|
There is also one trick, which seemed to improve performance significantly: adding position embeddings to the |
|
encoder outputs and initializing them with the pre-trained position embeddings of the GPT2 model (See `eval.py`). |
|
|
|
The training notebooks are still early drafts. Also results can probably improved a lot by using for example a learning |
|
rate schedule. |