Legal-HeBERT
Legal-HeBERT is a BERT model for Hebrew legal and legislative domains. It is intended to improve the legal NLP research and tools development in Hebrew. We release two versions of Legal-HeBERT. The first version is a fine-tuned model of HeBERT applied on legal and legislative documents. The second version uses HeBERT's architecture guidlines to train a BERT model from scratch.
We continue collecting legal data, examining different architectural designs, and performing tagged datasets and legal tasks for evaluating and to development of a Hebrew legal tools.
Training Data
Our training datasets are:
| Name | Hebrew Description | Size (GB) | Documents | Sentences | Words | Notes | |----------------------------------------------------------------------------------------------------------------------------------- |-------------------------------------------------------------------------- |----------- |----------- |------------ |------------- |----------------------------------------- | | The Israeli Law Book | ספר החוקים הישראלי | 0.05 | 2338 | 293352 | 4851063 | | | Judgments of the Supreme Court | מאגר פסקי הדין של בית המשפט העליון | 0.7 | 212348 | 5790138 | 79672415 | | | custody courts | החלטות בתי הדין למשמורת | 2.46 | 169,708 | 8,555,893 | 213,050,492 | | | Law memoranda, drafts of secondary legislation and drafts of support tests that have been distributed to the public for comment | תזכירי חוק, טיוטות חקיקת משנה וטיוטות מבחני תמיכה שהופצו להערות הציבור | 0.4 | 3,291 | 294,752 | 7,218,960 | | | Supervisors of Land Registration judgments | מאגר פסקי דין של המפקחים על רישום המקרקעין | 0.02 | 559 | 67,639 | 1,785,446 | | | Decisions of the Labor Court - Corona | מאגר החלטות בית הדין לעניין שירות התעסוקה – קורונה | 0.001 | 146 | 3505 | 60195 | | | Decisions of the Israel Lands Council | החלטות מועצת מקרקעי ישראל | | 118 | 11283 | 162692 | aggregate file | | Judgments of the Disciplinary Tribunal and the Israel Police Appeals Tribunal | פסקי דין של בית הדין למשמעת ובית הדין לערעורים של משטרת ישראל | 0.02 | 54 | 83724 | 1743419 | aggregate files | | Disciplinary Appeals Committee in the Ministry of Health | ועדת ערר לדין משמעתי במשרד הבריאות | 0.004 | 252 | 21010 | 429807 | 465 files are scanned and didn't parser | | Attorney General's Positions | מאגר התייצבויות היועץ המשפטי לממשלה | 0.008 | 281 | 32724 | 813877 | | | Legal-Opinion of the Attorney General | מאגר חוות דעת היועץ המשפטי לממשלה | 0.002 | 44 | 7132 | 188053 | | | | | | | | | | | total | | 3.665 | 389,139 | 15,161,152 | 309,976,419 | |
We thank Yair Gardin for the referring to the governance data, Elhanan Schwarts for collecting and parsing The Israeli law book, and Jonathan Schler for collecting the judgments of the supreme court.
Training process
- Vocabulary size: 50,000 tokens
- 4 epochs (1M steps±)
- lr=5e-5
- mlm_probability=0.15
- batch size = 32 (for each gpu)
- NVIDIA GeForce RTX 2080 TI + NVIDIA GeForce RTX 3090 (1 week training)
Additional training settings:
Fine-tuned HeBERT model: The first eight layers were freezed (like Lee et al. (2019) suggest)
Legal-HeBERT trained from scratch: The training process is similar to HeBERT and inspired by Chalkidis et al. (2020)
How to use
The models can be found in huggingface hub and can be fine-tunned to any down-stream task:
# !pip install transformers==4.14.1
from transformers import AutoTokenizer, AutoModel
model_name = 'avichr/Legal-heBERT_ft' # for the fine-tuned HeBERT model
model_name = 'avichr/Legal-heBERT' # for legal HeBERT model trained from scratch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
from transformers import pipeline
fill_mask = pipeline(
"fill-mask",
model=model_name,
)
fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")
Stay tuned!
We are still working on our models and the datasets. We will edit this page as we progress. We are open for collaborations.
If you used this model please cite us as :
Chriqui, Avihay, Yahav, Inbal and Bar-Siman-Tov, Ittai, Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts (June 27, 2022). Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4147127
@article{chriqui2021hebert,
title={Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts},
author={Chriqui, Avihay, Yahav, Inbal and Bar-Siman-Tov, Ittai},
journal={SSRN preprint:4147127},
year={2022}
}
Contact us
Avichay Chriqui, The Coller AI Lab
Inbal yahav, The Coller AI Lab
Ittai Bar-Siman-Tov, the BIU Innovation Lab for Law, Data-Science and Digital Ethics
Thank you, תודה, شكرا