BEREL is a pre-trained language model for Rabbinic Hebrew, introduced in the paper "Introducing BEREL: BERT Embeddings for Rabbinic-Encoded Language" by Avi Shmidman, Joshua Guedalia, Shaltiel Shmidman, Cheyn Shmuel Shmidman, Eli Handel, and Moshe Koppel.
The abstract of the paper is:
We present a new pre-trained language model (PLM) for Rabbinic Hebrew, termed Berel (BERT Embeddings for Rabbinic-Encoded Language). Whilst other PLMs exist for processing Hebrew texts (e.g., HeBERT, AlephBert), they are all trained on modern Hebrew texts, which diverges substantially from Rabbinic Hebrew in terms of its lexicographical, morphological, syntactic and orthographic norms. We demonstrate the superiority of Berel on Rabbinic texts via a challenge set of Hebrew homographs. We release the new model and homograph challenge set for unrestricted use.
Usage:
In general, BEREL is used like any other BERT model, but it is crucial to use our modified word-piece tokenizer, enclosed here (rabtokenizer.py). A Python script using BEREL should therefore include the following imports and initializations:
import os
from rabtokenizer import RabbinicTokenizer
from transformers import BertTokenizer, BertForMaskedLM
# bert_path must be set to the local directory holding the BEREL model files, including vocab.txt
tokenizer = RabbinicTokenizer(BertTokenizer.from_pretrained(os.path.join(bert_path, 'vocab.txt')))
model = BertForMaskedLM.from_pretrained(bert_path)
Demo site:
You can experiment with the model in a GUI here: https://dicta-bert-demo.netlify.app/?genre=rabbinic
- The main part of the GUI consists of word buttons that visualize the tokenization of the sentence. Clicking on a word button masks that word, and a bubble then shows three BEREL word predictions for it. Clicking on the bubble expands it to 10 predictions; alternatively, ctrl-clicking on the initial bubble expands it to 30 predictions.
- Ctrl-clicking adjacent word buttons combines them into a single token for the mask.
- The edit box at the top contains the input sentence; it can be modified at will, and the word buttons will adjust accordingly.
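The demo shows 3, 10, or 30 candidate words for a masked position. As a rough illustration of how such a ranked list can be derived from a masked language model's scores, here is a minimal sketch in plain Python. It is not the demo's actual code: the function name `top_k_predictions`, the toy vocabulary (transliterated placeholder tokens), and the toy scores are all assumptions made for this example. In real use, the scores would be the logits that the model outputs at the masked position, and the vocabulary would come from vocab.txt.

```python
import heapq
import math


def top_k_predictions(logits, vocab, k=3):
    """Return the k highest-scoring vocabulary entries with softmax probabilities.

    Hypothetical helper for illustration; not part of BEREL or rabtokenizer.py.
    """
    if len(logits) != len(vocab):
        raise ValueError("logits and vocab must have the same length")
    # Numerically stable softmax over the whole vocabulary.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the k largest probabilities, highest first.
    best = heapq.nlargest(k, range(len(vocab)), key=lambda i: probs[i])
    return [(vocab[i], probs[i]) for i in best]


# Toy data standing in for the model's scores at one masked position.
toy_vocab = ["amar", "rabbi", "tanya", "ela", "hakhi"]
toy_logits = [2.0, 0.5, 1.0, -1.0, 0.0]

for word, prob in top_k_predictions(toy_logits, toy_vocab, k=3):
    print(f"{word}\t{prob:.3f}")
```

Changing `k` to 10 or 30 mirrors the expanded views in the demo, provided the vocabulary has at least that many entries.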