Japanese Punctuation Restoration with BERT

Overview

This is a simple BERT model that inserts punctuation into Japanese sentences. It is trained on transcript.txt from the Kaggle dataset below, which is the transcript of an audio recording of "Meian" by Natsume Soseki.

https://www.kaggle.com/datasets/bryanpark/japanese-single-speaker-speech-dataset/data?select=transcript.txt

The model is based on tohoku-nlp/bert-base-japanese-char-v3. A linear layer is added on top of the hidden-layer output to predict, for each token, whether it immediately precedes a punctuation mark ("、" or "。").
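To make the prediction target concrete, here is a minimal sketch of how punctuated text could be converted into per-character labels for training. The three-class scheme (0 = nothing follows, 1 = "、" follows, 2 = "。" follows) and the names PUNCT_TO_LABEL and make_training_pair are assumptions for illustration, not taken from this repository, which may encode its labels differently.

```python
# Hypothetical label scheme (an assumption, not taken from the repository):
# 0 = no punctuation follows, 1 = "、" follows, 2 = "。" follows.
PUNCT_TO_LABEL = {"、": 1, "。": 2}


def make_training_pair(punctuated: str) -> tuple[list[str], list[int]]:
    """Strip punctuation and record, per character, which mark followed it."""
    chars: list[str] = []
    labels: list[int] = []
    for ch in punctuated:
        if ch in PUNCT_TO_LABEL:
            # Attach the mark to the preceding character, if there is one.
            if labels:
                labels[-1] = PUNCT_TO_LABEL[ch]
        else:
            chars.append(ch)
            labels.append(0)
    return chars, labels


chars, labels = make_training_pair("彼は笑った。しかし、女は黙っていた。")
print("".join(chars))  # 彼は笑ったしかし女は黙っていた
print(labels)          # [0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2]
```

Because the base model tokenizes at the character level, one label per character lines up naturally with one label per token in this sketch.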

To use the model, call the process_long_text function from insert_punctuation.py with a long, unpunctuated Japanese text:

from insert_punctuation import process_long_text

process_long_text("女は昨夕艶めかしい姿をして彼の浴室の戸を開けた人に違なかった風呂場で彼を驚ろかした大きな髷をいつの間にか崩して尋常の束髪に結い更えたので彼はつい同じ人と気がつかずにいた彼はさらに声を聴いただけで顔を知らなかった伴の男の方をよそながらの初対面といった風に女と眺め比べた")

# -> 女は昨夕艶めかしい姿をして彼の浴室の戸を開けた人に違なかった。風呂場で彼を驚ろかした大きな髷をいつの間にか崩して尋常の束髪に結い更えたので彼はつい同じ人と気がつかずにいた。彼はさらに声を聴いただけで顔を知らなかった伴の男の方を、よそながらの初対面といった風に女と眺め比べた。
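The internals of process_long_text are not shown here. As a rough sketch of the final step only, the following shows how per-character predictions could be turned back into punctuated text. The label scheme (0 = none, 1 = "、", 2 = "。") and the names LABEL_TO_PUNCT and insert_marks are hypothetical, and the label list is hand-written rather than produced by the model.

```python
# Hypothetical mapping from predicted class to the mark inserted after a character.
LABEL_TO_PUNCT = {0: "", 1: "、", 2: "。"}


def insert_marks(text: str, labels: list[int]) -> str:
    """Append the predicted punctuation mark (if any) after each character."""
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    return "".join(ch + LABEL_TO_PUNCT[label] for ch, label in zip(text, labels))


# Hand-written labels standing in for model predictions.
restored = insert_marks(
    "彼は笑ったしかし女は黙っていた",
    [0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2],
)
print(restored)  # 彼は笑った。しかし、女は黙っていた。
```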