Legal-heBERT / README.md

Update README.md

dd77503 over 2 years ago

8.22 kB

	# Legal-HeBERT
	Legal-HeBERT is a BERT model for Hebrew legal and legislative domains. It is intended to improve the legal NLP research and tools development in Hebrew. We release two versions of Legal-HeBERT. The first version is a fine-tuned model of [HeBERT](https://github.com/avichaychriqui/HeBERT) applied on legal and legislative documents. The second version uses [HeBERT](https://github.com/avichaychriqui/HeBERT)'s architecture guidlines to train a BERT model from scratch. <br>
	We continue collecting legal data, examining different architectural designs, and performing tagged datasets and legal tasks for evaluating and to development of a Hebrew legal tools.

	## Training Data
	Our training datasets are:

	\| Name \| Hebrew Description \| Size (GB) \| Documents \| Sentences \| Words \| Notes \|
	\|----------------------------------------------------------------------------------------------------------------------------------- \|-------------------------------------------------------------------------- \|----------- \|----------- \|------------ \|------------- \|----------------------------------------- \|
	\| The Israeli Law Book \| ספר החוקים הישראלי \| 0.05 \| 2338 \| 293352 \| 4851063 \| \|
	\| Judgments of the Supreme Court \| מאגר פסקי הדין של בית המשפט העליון \| 0.7 \| 212348 \| 5790138 \| 79672415 \| \|
	\| custody courts \| החלטות בתי הדין למשמורת \| 2.46 \| 169,708 \| 8,555,893 \| 213,050,492 \| \|
	\| Law memoranda, drafts of secondary legislation and drafts of support tests that have been distributed to the public for comment \| תזכירי חוק, טיוטות חקיקת משנה וטיוטות מבחני תמיכה שהופצו להערות הציבור \| 0.4 \| 3,291 \| 294,752 \| 7,218,960 \| \|
	\| Supervisors of Land Registration judgments \| מאגר פסקי דין של המפקחים על רישום המקרקעין \| 0.02 \| 559 \| 67,639 \| 1,785,446 \| \|
	\| Decisions of the Labor Court - Corona \| מאגר החלטות בית הדין לעניין שירות התעסוקה – קורונה \| 0.001 \| 146 \| 3505 \| 60195 \| \|
	\| Decisions of the Israel Lands Council \| החלטות מועצת מקרקעי ישראל \| \| 118 \| 11283 \| 162692 \| aggregate file \|
	\| Judgments of the Disciplinary Tribunal and the Israel Police Appeals Tribunal \| פסקי דין של בית הדין למשמעת ובית הדין לערעורים של משטרת ישראל \| 0.02 \| 54 \| 83724 \| 1743419 \| aggregate files \|
	\| Disciplinary Appeals Committee in the Ministry of Health \| ועדת ערר לדין משמעתי במשרד הבריאות \| 0.004 \| 252 \| 21010 \| 429807 \| 465 files are scanned and didn't parser \|
	\| Attorney General's Positions \| מאגר התייצבויות היועץ המשפטי לממשלה \| 0.008 \| 281 \| 32724 \| 813877 \| \|
	\| Legal-Opinion of the Attorney General \| מאגר חוות דעת היועץ המשפטי לממשלה \| 0.002 \| 44 \| 7132 \| 188053 \| \|
	\| \| \| \| \| \| \| \|
	\| total \| \| 3.665 \| 389,139 \| 15,161,152 \| 309,976,419 \| \|

	We thank <b>Yair Gardin</b> for the referring to the governance data, <b>Elhanan Schwarts</b> for collecting and parsing The Israeli law book, and <b>Jonathan Schler</b> for collecting the judgments of the supreme court.


	## Training process
	* Vocabulary size: 50,000 tokens
	* 4 epochs (1M steps±)
	* lr=5e-5
	* mlm_probability=0.15
	* batch size = 32 (for each gpu)
	* NVIDIA GeForce RTX 2080 TI + NVIDIA GeForce RTX 3090 (1 week training)

	### Additional training settings:
	<b>Fine-tuned [HeBERT](https://github.com/avichaychriqui/HeBERT) model:</b> The first eight layers were freezed (like [Lee et al. (2019)](https://arxiv.org/abs/1911.03090) suggest)<br>
	<b>Legal-HeBERT trained from scratch:</b> The training process is similar to [HeBERT](https://github.com/avichaychriqui/HeBERT) and inspired by [Chalkidis et al. (2020)](https://arxiv.org/abs/2010.02559) <br>

	## How to use
	The models can be found in huggingface hub and can be fine-tunned to any down-stream task:
	```
	# !pip install transformers==4.14.1
	from transformers import AutoTokenizer, AutoModel

	model_name = 'avichr/Legal-heBERT_ft' # for the fine-tuned HeBERT model
	model_name = 'avichr/Legal-heBERT' # for legal HeBERT model trained from scratch

	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModel.from_pretrained(model_name)

	from transformers import pipeline
	fill_mask = pipeline(
	"fill-mask",
	model=model_name,
	)
	fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")
	```
	## Stay tuned!
	We are still working on our models and the datasets. We will edit this page as we progress. We are open for collaborations.

	## If you used this model please cite us as :
	Chriqui, Avihay, Yahav, Inbal and Bar-Siman-Tov, Ittai, Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts (June 27, 2022). Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4147127

	```
	@article{chriqui2021hebert,
	title={Legal HeBERT: A BERT-based NLP Model for Hebrew Legal, Judicial and Legislative Texts},
	author={Chriqui, Avihay, Yahav, Inbal and Bar-Siman-Tov, Ittai},
	journal={SSRN preprint:4147127},
	year={2022}
	}
	```

	## Contact us
	[Avichay Chriqui](mailto:avichayc@mail.tau.ac.il), The Coller AI Lab <br>
	[Inbal yahav](mailto:inbalyahav@tauex.tau.ac.il), The Coller AI Lab <br>
	[Ittai Bar-Siman-Tov](mailto:Ittai.Bar-Siman-Tov@biu.ac.il), the BIU Innovation Lab for Law, Data-Science and Digital Ethics <br>

	Thank you, תודה, شكرا <br>