---
license: cc-by-sa-4.0
datasets:
- HaifaCLGroup/KnessetCorpus
language:
- he
tags:
- hebrew
- nlp
- masked-language-model
- transformers
- BERT
- parliamentary-proceedings
- language-model
- Knesset
- DictaBERT
- fine-tuning
---

# Knesset-DictaBERT

**Knesset-DictaBERT** is a Hebrew language model fine-tuned on the [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus), which comprises Israeli parliamentary proceedings. The model is based on [DictaBERT](https://huggingface.co/dicta-il/dictabert) and is designed to understand and generate text in Hebrew, with a specific focus on parliamentary language and context.

## Model Details

- **Model type**: BERT-based (Bidirectional Encoder Representations from Transformers)
- **Language**: Hebrew
- **Training data**: [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus) (Israeli parliamentary proceedings)
- **Base model**: [DictaBERT](https://huggingface.co/dicta-il/dictabert)

## Training Procedure

The model was fine-tuned on the Knesset Corpus with the masked language modeling (MLM) objective: a portion of the tokens in each sentence is masked, and the model learns to predict the original tokens from their context, building contextual representations of words in the process. A minimal sketch of such a fine-tuning setup appears at the end of this card.

## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT")
model.eval()

sentence = "הכנסת היא הרשות [MASK] של מדינת ישראל."

# Tokenize the input sentence and run the model
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs)

# Locate the [MASK] token instead of relying on a hardcoded position
mask_token_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
top_2_tokens = torch.topk(output.logits[0, mask_token_index, :], 2).indices

# Convert token IDs to tokens and print them
print("\n".join(tokenizer.convert_ids_to_tokens(top_2_tokens)))
# Example output: המבצעת / המחוקקת
```

## Evaluation

Evaluation was conducted on a held-out 10% test set of the Knesset Corpus, comprising approximately 3.2 million sentences. Perplexity was computed on this full test set; due to time constraints, the accuracy measures were computed on a subset of approximately 3 million sentences (roughly 520 million tokens). A sketch of one way to compute these metrics appears at the end of this card.

#### Perplexity

Knesset-DictaBERT reaches a perplexity of 6.60 on the full test set, compared with 22.87 for the original DictaBERT.

#### Accuracy

- **Top-1 accuracy**: Knesset-DictaBERT identifies the correct token in its top prediction in 52.55% of cases, compared with 48.02% for the original DictaBERT.
- **Top-2 accuracy**: Knesset-DictaBERT identifies the correct token within its top-2 predictions in 63.07% of cases, compared with 58.60% for the original DictaBERT.
- **Top-5 accuracy**: Knesset-DictaBERT identifies the correct token within its top-5 predictions in 73.59% of cases, compared with 68.98% for the original DictaBERT.

## Acknowledgments

This model builds upon the work of the Dicta team, whose contributions are gratefully acknowledged.

## Citation

If you use this model in your work, please cite:

```bibtex
@misc{Knesset-DictaBERT,
  author       = {Gili Goldin},
  title        = {Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/GiliGold/Knesset-DictaBERT}},
}
```
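
## Appendix: Fine-Tuning Sketch

The exact training script and hyperparameters for this model were not published with this card. The following is a minimal sketch of an MLM fine-tuning setup using the Hugging Face `Trainer`; the dataset split name, the `sentence_text` column name, and all hyperparameter values are illustrative assumptions, not the actual configuration.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from the base model that Knesset-DictaBERT was fine-tuned from
tokenizer = AutoTokenizer.from_pretrained("dicta-il/dictabert")
model = AutoModelForMaskedLM.from_pretrained("dicta-il/dictabert")

# Assumption: a "train" split with a "sentence_text" column; check the dataset card
dataset = load_dataset("HaifaCLGroup/KnessetCorpus", split="train")

def tokenize(batch):
    return tokenizer(batch["sentence_text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Standard MLM collator: randomly masks 15% of tokens, as in the original BERT recipe
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="knesset-dictabert",
    per_device_train_batch_size=32,  # illustrative value
    num_train_epochs=1,              # illustrative value
    learning_rate=5e-5,              # illustrative value
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```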
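
## Appendix: Evaluation Sketch

The evaluation script is likewise not included in this card. The sketch below shows one standard way to estimate masked-LM perplexity (the exponential of the mean cross-entropy over masked positions) and top-k accuracy at those positions. The 15% random masking policy is an assumption, and the single example sentence stands in for the real test set, so the output will not reproduce the figures reported above.

```python
import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT")
model.eval()

# Stand-in for the held-out test sentences
sentences = ["הכנסת היא הרשות המחוקקת של מדינת ישראל."]

loss_fct = torch.nn.CrossEntropyLoss(reduction="sum")
total_loss, total_masked = 0.0, 0
top1_hits, top2_hits, top5_hits = 0, 0, 0

for sentence in sentences:
    enc = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)
    input_ids = enc["input_ids"].clone()
    labels = enc["input_ids"].clone()

    # Randomly mask 15% of the non-special tokens, mirroring the MLM objective (assumed rate)
    special = torch.tensor(
        tokenizer.get_special_tokens_mask(input_ids[0].tolist(), already_has_special_tokens=True),
        dtype=torch.bool,
    ).unsqueeze(0)
    mask = (torch.rand(input_ids.shape) < 0.15) & ~special
    if not mask.any():
        continue
    input_ids[mask] = tokenizer.mask_token_id

    with torch.no_grad():
        logits = model(input_ids=input_ids, attention_mask=enc["attention_mask"]).logits

    masked_logits = logits[mask]  # (num_masked, vocab_size)
    masked_labels = labels[mask]  # (num_masked,)
    total_loss += loss_fct(masked_logits, masked_labels).item()
    total_masked += masked_labels.numel()

    # Top-k accuracy: is the original token among the k highest-scoring predictions?
    topk = masked_logits.topk(5, dim=-1).indices
    hits = topk == masked_labels.unsqueeze(-1)
    top1_hits += hits[:, :1].any(dim=-1).sum().item()
    top2_hits += hits[:, :2].any(dim=-1).sum().item()
    top5_hits += hits.any(dim=-1).sum().item()

if total_masked:
    print(f"perplexity:     {math.exp(total_loss / total_masked):.2f}")
    print(f"top-1 accuracy: {top1_hits / total_masked:.4f}")
    print(f"top-2 accuracy: {top2_hits / total_masked:.4f}")
    print(f"top-5 accuracy: {top5_hits / total_masked:.4f}")
```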