---
license: apache-2.0
datasets:
- denis-berezutskiy-lad/ru_transcription_punctuation
language:
- ru
metrics:
- f1
- precision
- recall
library_name: nemo
pipeline_tag: token-classification
---

# About

This is a punctuation and capitalization model for the Russian language, trained with NeMo scripts (https://github.com/NVIDIA/NeMo) on a dataset of continuous professional transcriptions (mostly from legislative bodies, plus some OpenSubtitles data). See the dataset at https://huggingface.co/datasets/denis-berezutskiy-lad/ru_transcription_punctuation for details.

Note that although the model was prepared using NeMo, the standard NeMo inference scripts for producing the result text do not work well with this model, because it uses some advanced labels that require custom handling. For that reason a set of ipynb notebooks was created, covering model training, inference, and creation of the above-mentioned dataset:

https://github.com/denis-berezutskiy-lad/transcription-bert-ru-punctuator-scripts/tree/main

The underlying base model is https://huggingface.co/DeepPavlov/rubert-base-cased-conversational

# Why one more punctuator

The idea behind the project is to train on large continuous professional transcriptions rather than on short, low-quality samples of 1-2 sentences, which are typical of the most popular Russian datasets. Our experiments show significant improvements compared to BERT models trained on the standard Russian datasets (social comments, Omnia Russica, etc.). That is why we mainly use transcriptions published by Russian legislatures (Gosduma, Mosgorduma), with some film subtitles from the OpenSubtitles project added.

# Supported labels

Please note that some of the new labels (`-`, `—`, `T`) are not supported by the NeMo scripts out of the box, so they need special handling. See the inference notebook for details.

## Punctuation

`O,.?!:;…⁈-—`

## Capitalization

`OUT` (`T` means abbreviation, i.e. "total" uppercase)
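
To illustrate why the extra labels need custom handling, here is a minimal sketch of how per-word predictions could be turned back into text. It is not the code from the inference notebook. The function name `apply_labels` is hypothetical, and the label semantics beyond what is stated above are assumptions: `O` is taken to mean "no change", `U` to mean "capitalize the first letter" (the usual NeMo convention), `-` is assumed to join a word to the next one without a space, and `—` is assumed to be surrounded by spaces.

```python
from typing import List, Tuple

# Label sets as listed in the "Supported labels" section.
PUNCT_LABELS = set("O,.?!:;…⁈-—")
CAPIT_LABELS = set("OUT")


def apply_labels(words: List[str], labels: List[Tuple[str, str]]) -> str:
    """Rebuild text from words and (punctuation, capitalization) label pairs.

    Hypothetical helper: the handling of "-", "—" and "U" is an assumption,
    not the behaviour of the author's notebook.
    """
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")

    pieces = []
    for word, (punct, capit) in zip(words, labels):
        if punct not in PUNCT_LABELS or capit not in CAPIT_LABELS:
            raise ValueError(f"unknown label pair: {punct!r}, {capit!r}")

        # Capitalization: U = first letter upper, T = whole word upper.
        if capit == "U":
            word = word[:1].upper() + word[1:]
        elif capit == "T":
            word = word.upper()

        # Punctuation is attached after the word.
        if punct == "O":
            pieces.append(word + " ")
        elif punct == "-":
            pieces.append(word + "-")        # assumed: hyphen, no space
        elif punct == "—":
            pieces.append(word + " — ")      # assumed: dash with spaces
        else:
            pieces.append(word + punct + " ")

    return "".join(pieces).rstrip()


if __name__ == "__main__":
    words = ["госдума", "рф", "приняла", "закон"]
    labels = [("O", "U"), ("O", "T"), ("O", "O"), (".", "O")]
    print(apply_labels(words, labels))
```

A stock NeMo post-processing step that only knows `O`/`U` capitalization and appends the punctuation mark followed by a space would mishandle the `T` label and the hyphen, which is the kind of case the custom notebooks are meant to cover.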