---
license: apache-2.0
datasets:
- denis-berezutskiy-lad/ru_transcription_punctuation
language:
- ru
metrics:
- f1
- precision
- recall
library_name: nemo
pipeline_tag: token-classification
---
## About
This is a punctuation and capitalization model for the Russian language, trained with the NeMo scripts (https://github.com/NVIDIA/NeMo) on a dataset of long continuous professional transcriptions (mostly from legislative bodies, plus some OpenSubtitles material). See the dataset https://huggingface.co/datasets/denis-berezutskiy-lad/ru_transcription_punctuation for details.
Note that even though the model was prepared with NeMo, the standard NeMo inference scripts for producing the final text do not work well with this model, because it uses some additional labels that require custom handling. For this reason, a set of Jupyter notebooks was created (covering model training, inference, and creation of the above-mentioned dataset):
https://github.com/denis-berezutskiy-lad/transcription-bert-ru-punctuator-scripts/tree/main
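
As a minimal sketch of the standard NeMo loading path (the checkpoint file name below is an assumption, and the built-in post-processing does not know about the extra labels, which is exactly why the custom inference notebook exists):

```python
# Minimal sketch, assuming the trained checkpoint is saved locally as "ru_punctuator.nemo".
# The standard add_punctuation_capitalization() call runs, but its default
# post-processing cannot handle the extra labels (-, —, T); the project's
# inference notebook replaces that step with custom label handling.
from nemo.collections.nlp.models import PunctuationCapitalizationModel

model = PunctuationCapitalizationModel.restore_from("ru_punctuator.nemo")

queries = ["привет как дела что нового"]
results = model.add_punctuation_capitalization(queries)
print(results[0])
```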
The underlying base model is https://huggingface.co/DeepPavlov/rubert-base-cased-conversational
## Why another punctuator
The idea behind the project is to train on long continuous professional transcriptions rather than on short, low-quality samples of one or two sentences (which is typical for the most popular Russian datasets). Our experiments show significant improvements compared to BERT models trained on the standard Russian datasets (social-media comments, Omnia Russica, etc.). That is why we mainly use transcriptions published by Russian legislatures (Gosduma, Mosgorduma), with some film subtitles from the OpenSubtitles project added.
## Supported labels
Please note that some of the labels (-, —, T) are not supported by the NeMo scripts out of the box, so special handling has to be added for them. See the inference notebook for details.
### Punctuation
`O` `,` `.` `?` `!` `:` `;` `…` `⁈` `-` `—`
### Capitalization
`O` `U` `T`
(`T` marks an abbreviation, i.e. a fully uppercase token)
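
For illustration only, here is a hedged sketch of how these labels can be applied when reconstructing text from token-level predictions. The helper function and the label-per-token format are assumptions for this example, not the exact code from the inference notebook:

```python
# Hypothetical helper illustrating the label semantics; the actual inference
# notebook may structure this step differently.
def apply_labels(tokens, punct_labels, capit_labels):
    """tokens: lowercase words; punct_labels/capit_labels: one label per token."""
    out = []
    for word, punct, capit in zip(tokens, punct_labels, capit_labels):
        if capit == "U":      # capitalize the first letter
            word = word.capitalize()
        elif capit == "T":    # abbreviation: the whole token is uppercase
            word = word.upper()
        # capit == "O": leave the word unchanged
        if punct != "O":      # append the predicted punctuation mark
            word += punct
        out.append(word)
    return " ".join(out)

print(apply_labels(
    ["мид", "россии", "сделал", "заявление"],
    ["O", "O", "O", "."],
    ["T", "U", "O", "O"],
))
# МИД России сделал заявление.
```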