metadata

language: cs
tags:
  - Czech
  - KKY
  - FAV
license: cc-by-nc-sa-4.0

wav2vec2-base-cs-80k-ClTRUS

Czech language TRansformer from Unlabeled Speech (ClTRUS) is a monolingual Czech Wav2Vec 2.0 base model pre-trained on more than 80 thousand hours of Czech speech.

This model does not have a tokenizer because it was pre-trained on audio alone. To use it for speech recognition, create a tokenizer and fine-tune the model on labeled (transcribed) speech data.
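As a rough illustration of the tokenizer step, the sketch below builds a character-level vocabulary from transcripts, which is a common starting point for CTC fine-tuning. The special token names (`[PAD]`, `[UNK]`, `|`) and the helper `build_vocab` are assumptions made for this example, not part of this model or of the authors' recipe.

```python
import json

# Hypothetical special tokens; the names follow common CTC fine-tuning recipes.
PAD_TOKEN = "[PAD]"
UNK_TOKEN = "[UNK]"
WORD_DELIMITER = "|"  # stands in for the space between words


def build_vocab(transcripts):
    """Map every character seen in the transcripts to an integer id."""
    chars = set()
    for text in transcripts:
        chars.update(text.lower())
    chars.discard(" ")  # spaces are represented by WORD_DELIMITER
    vocab = {PAD_TOKEN: 0, UNK_TOKEN: 1, WORD_DELIMITER: 2}
    for ch in sorted(chars):
        if ch not in vocab:  # do not overwrite a special token
            vocab[ch] = len(vocab)
    return vocab


def save_vocab(vocab, path="vocab.json"):
    """Write the vocabulary as UTF-8 JSON so Czech diacritics stay readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)


vocab = build_vocab(["dobrý den", "na shledanou"])
```

A file saved this way could then be loaded with `Wav2Vec2CTCTokenizer` (using the same unknown, padding, and word delimiter tokens), and the model loaded as `Wav2Vec2ForCTC` with a matching `vocab_size` before fine-tuning.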

Note: This is a checkpoint of the model after 4 epochs over the whole dataset. For earlier or later checkpoints, please contact the author (jlehecka(at)kky.zcu.cz).

Pretraining data

More than 80 thousand hours of unlabeled Czech speech:

recordings from radio (22k hours),
unlabeled data from VoxPopuli dataset (18.7k hours),
TV shows (15k hours),
shadow speakers (12k hours),
sports (5k hours),
telephone data (2k hours),
and a smaller amount of data from several other domains including the public CommonVoice dataset.

Usage

Inputs must be 16kHz mono audio files.
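The input requirement can be checked before running the model. The following is a minimal sketch using only the Python standard library; the helper name `check_wav_format` is hypothetical, and the `wave` module reads only uncompressed PCM WAV files.

```python
import wave

EXPECTED_RATE = 16_000  # Hz, as required by the model
EXPECTED_CHANNELS = 1   # mono


def check_wav_format(path):
    """Return a list of format problems; an empty list means the file is usable."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != EXPECTED_RATE:
        problems.append(f"sampling rate is {rate} Hz, expected {EXPECTED_RATE} Hz")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"file has {channels} channels, expected {EXPECTED_CHANNELS}")
    return problems
```

Files that fail the check need to be resampled and downmixed first, for example with `torchaudio.functional.resample` and an average over the channel dimension.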

This model can be used e.g. to extract per-frame contextual embeddings from audio:

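from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
import torch
import torchaudio

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("fav-kky/wav2vec2-base-cs-80k-ClTRUS")
model = Wav2Vec2Model.from_pretrained("fav-kky/wav2vec2-base-cs-80k-ClTRUS")
model.eval()  # inference mode (disables dropout)

# torchaudio returns a tensor of shape (channels, num_samples)
speech_array, sampling_rate = torchaudio.load("/path/to/audio/file.wav")

# Pass the file's real sampling rate so that the feature extractor
# raises an error if the audio is not 16 kHz
inputs = feature_extractor(
    speech_array[0].numpy(),  # first (only) channel as a 1-D array
    sampling_rate=sampling_rate,
    return_tensors="pt"
).input_values  # shape (1, num_samples)

with torch.no_grad():  # no gradients are needed for feature extraction
    output = model(inputs)
embeddings = output.last_hidden_state[0].numpy()  # shape (num_frames, 768)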
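The number of embedding frames depends on the length of the audio. The sketch below estimates it from the convolutional feature encoder, assuming this model uses the default `conv_kernel` and `conv_stride` values of `Wav2Vec2Config` (an assumption, not confirmed here); with those values the frames are spaced 320 samples (20 ms) apart. The function name `num_frames` is hypothetical.

```python
# Assumed values: the defaults of Wav2Vec2Config for the feature encoder.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples):
    """Output length of the stacked 1-D convolutions (no padding).

    Valid for inputs of at least 400 samples, the receptive field
    of one frame under the assumed configuration.
    """
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        length = (length - kernel) // stride + 1
    return length


print(num_frames(16_000))  # 49 frames for one second of 16 kHz audio
```

Comparing this estimate with `embeddings.shape[0]` is a quick sanity check that the audio was loaded at the expected sampling rate.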

Speech recognition results

After fine-tuning, the model achieved the following results on public datasets:

Czech portion of CommonVoice v7.0: WER = 3.8%
Czech portion of VoxPopuli: WER = 8.8%
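For readers unfamiliar with the metric, word error rate (WER) is the number of word substitutions, deletions, and insertions needed to turn the recognized text into the reference, divided by the number of reference words. The following is a minimal sketch of that definition for a single utterance; it is not the evaluation script used for the results above, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Over a whole test set, WER is usually computed from the total number of errors and the total number of reference words rather than by averaging per-utterance values.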

See our paper for details.

Paper

The preprint of our paper (accepted to INTERSPEECH 2022) is available at http://arxiv.org/abs/2206.07627

Citation

If you find this model useful, please cite our paper:

@inproceedings{wav2vec2-base-cs-80k-ClTRUS,
  title = {{Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of Czech}},
  author = {
    Jan Lehe\v{c}ka and 
    Jan \v{S}vec and 
    Ale\v{s} Pra\v{z}\'ak and 
    Josef V. Psutka
  },
  booktitle={Proc. Interspeech 2022},
  pages={1831--1835},
  year = {2022},
  doi={10.21437/Interspeech.2022-10439}
}
