
MenakBERT

A Hebrew diacritizer built on a BERT-style, character-level backbone: a Hebrew masked language model operating over characters, pre-trained by masking spans of characters, similarly to SpanBERT (Joshi et al., 2020). The model predicts diacritical marks in a seq2seq fashion.

Model Description

This model takes tau/tavbert-he and adds a three-headed classification head that outputs three sequences, corresponding to the three types of Hebrew Niqqud (diacritics). It was fine-tuned on the dataset generously provided by Elazar Gershuni of Nakdimon.

  • Developed by: Jacob Gidron, Ido Cohen and Idan Pinto
  • Model type: BERT
  • Language: Hebrew
  • Finetuned from model: tau/tavbert-he

Use

The model expects undotted Hebrew text, which may contain numbers and punctuation.

The output is three sequences of diacritical marks, corresponding to:

  1. The dot distinguishing the letters Shin and Sin.
  2. The dot in the center of a letter, which in some cases changes the pronunciation of the letter and in other cases creates an effect similar to emphasis on the letter, or gemination.
  3. All the remaining marks, used mostly for vocalization.

The length of each sequence is the same as that of the input, with each mark corresponding to the character at the same position in the input.

The provided script weaves the sequences together.
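
Pending the provided script itself, the sketch below illustrates what the weaving step described above could look like. Function and argument names are illustrative assumptions, not the actual script shipped with the model.

```python
# Hypothetical sketch of the "weaving" step: the three output sequences
# are aligned character-by-character with the input, so dotted text can
# be rebuilt by appending each character's marks to it.
def weave(text: str, shin_sin: list, dagesh: list, niqqud: list) -> str:
    """Rebuild dotted text from undotted input and three mark sequences.

    Each mark sequence is assumed to have the same length as `text`,
    holding a Unicode combining mark (e.g. U+05C1 shin dot, U+05BC dagesh,
    vowel points) or an empty string where no mark applies.
    """
    assert len(text) == len(shin_sin) == len(dagesh) == len(niqqud)
    pieces = []
    for ch, s, d, n in zip(text, shin_sin, dagesh, niqqud):
        # Combining marks follow their base character in Unicode.
        pieces.append(ch + s + d + n)
    return "".join(pieces)
```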

How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
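
In the meantime, here is a minimal sketch that loads the character-level backbone with the transformers library, assuming the backbone exposes a standard tokenizer through AutoTokenizer. It only produces the backbone's hidden states; the MenakBERT checkpoint with its three-headed classification head and the weaving script must be obtained from this repository.

```python
# Minimal sketch: encode undotted Hebrew text with the tau/tavbert-he
# backbone. This does NOT include the MenakBERT three-headed
# classification head or the weaving script.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("tau/tavbert-he")
backbone = AutoModel.from_pretrained("tau/tavbert-he")

text = "שלום עולם"  # undotted Hebrew input (may contain numbers and punctuation)
inputs = tokenizer(text, return_tensors="pt")
# Contextual character-level representations that the classification heads would consume.
hidden_states = backbone(**inputs).last_hidden_state
```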

Training Data

The backbone tau/tavbert-he was trained on the Hebrew section of OSCAR (Ortiz, 2019) (10 GB of text, 20 million sentences). The fine-tuning was done on the Nakdimon dataset, which can be found at https://github.com/elazarg/hebrew_diacritized and contains 274,436 dotted Hebrew tokens across 413 documents. For more information, see https://arxiv.org/abs/2105.05209.

Model Card Contact

Ido Cohen - its.ido@gmail.com
