	---
	license: cc-by-sa-4.0
	datasets:
	- HaifaCLGroup/KnessetCorpus
	language:
	- he
	tags:
	- hebrew
	- nlp
	- masked-language-model
	- transformers
	- BERT
	- parliamentary-proceedings
	- language-model
	- Knesset
	- DictaBERT
	- fine-tuning

	---
	# Knesset-DictaBERT
	Knesset-DictaBERT is a Hebrew language model fine-tuned on the [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus),
	which comprises Israeli parliamentary proceedings.

	This model is based on the [DictaBERT](https://huggingface.co/dicta-il/dictabert) architecture
	and is designed to represent Hebrew text and predict masked tokens, with a specific focus on parliamentary language and context.


	## Model Details

	- Model type: BERT-based (Bidirectional Encoder Representations from Transformers)
	- Language: Hebrew
	- Training Data: [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus) (Israeli parliamentary proceedings)
	- Base Model: [DictaBERT](https://huggingface.co/dicta-il/dictabert)

	## Training Procedure

	The model was fine-tuned using the masked language modeling (MLM) task on the Knesset Corpus. The MLM task involves predicting masked words in a sentence, allowing the model to learn contextual representations of words.
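
To make the MLM objective concrete, here is a minimal sketch of how training inputs are typically corrupted. It assumes the standard BERT recipe (select 15% of tokens; of those, 80% become `[MASK]`, 10% a random token, 10% stay unchanged). The card does not state the exact masking settings used for Knesset-DictaBERT, and `mask_tokens` is a hypothetical helper that works on whitespace-split words rather than real subword tokens.

```python
import random


def mask_tokens(tokens, mask_token="[MASK]", vocab=None, mask_prob=0.15, seed=0):
    """Corrupt a token list for MLM and return (masked_tokens, labels).

    Assumption: standard BERT masking ratios, not taken from this model card.
    labels[i] holds the original token at selected positions and None elsewhere.
    """
    rng = random.Random(seed)
    vocab = vocab or tokens  # fall back to the sentence itself as a toy vocabulary
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model is trained to recover this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(mask_token)         # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the original token
        else:
            labels.append(None)  # position does not contribute to the loss
            masked.append(tok)
    return masked, labels


tokens = "יש לנו דיון על זה בשבוע הבא".split()
# mask_prob is raised to 0.5 only so that a short sentence shows some masking
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
print(masked)
print(labels)
```

In the `transformers` library this step is normally handled by `DataCollatorForLanguageModeling`, which applies the same idea to subword token IDs.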

	## Usage
	```python
	from transformers import AutoModelForMaskedLM, AutoTokenizer
	import torch

	tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
	model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT")
	model.eval()
	sentence = "יש לנו [MASK] על זה בשבוע הבא"

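	# Tokenize the input sentence and get predictions (no gradients needed for inference)
	inputs = tokenizer.encode(sentence, return_tensors='pt')
	with torch.no_grad():
	    output = model(inputs)

	# Locate the [MASK] position instead of hard-coding it
	mask_token_index = (inputs[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
	top_2_tokens = torch.topk(output.logits[0, mask_token_index, :], 2).indices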

	# Convert token IDs to tokens and print them
	print('\n'.join(tokenizer.convert_ids_to_tokens(top_2_tokens)))

	# Example output: ישיבה / דיון
	```

	## Evaluation
	The evaluation was conducted on a 10% test set of the Knesset Corpus, consisting of approximately 3.2 million sentences.
	The perplexity was calculated on this full test set.
	Due to time constraints, accuracy was measured on a subset of this test set: approximately 300,000 sentences (about 3.5 million tokens).

	#### Perplexity
	The perplexity of the original DictaBERT on the full test set is 22.87.

	The perplexity of Knesset-DictaBERT on the full test set is 6.60.
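
For readers who want to relate these numbers to a training loss, the sketch below assumes perplexity is the exponential of the mean per-token negative log-likelihood (in nats) over the predicted tokens. The card does not spell out the exact procedure used for a masked model, so treat this as the usual definition rather than the authors' evaluation script; both function names are hypothetical.

```python
import math


def perplexity(token_losses):
    """Perplexity as exp(mean negative log-likelihood), with losses in nats."""
    if not token_losses:
        raise ValueError("token_losses must not be empty")
    return math.exp(sum(token_losses) / len(token_losses))


def mean_loss_from_perplexity(ppl):
    """Inverse mapping: the mean per-token loss (nats) implied by a perplexity."""
    return math.log(ppl)


# Perplexities reported in this card, converted to the implied mean loss
for name, ppl in [("DictaBERT", 22.87), ("Knesset-DictaBERT", 6.60)]:
    loss = mean_loss_from_perplexity(ppl)
    print(f"{name}: perplexity {ppl:.2f}, implied mean loss {loss:.3f} nats")
```

Under this definition, a lower perplexity means the model assigns higher probability to the held-out tokens on average.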

	#### Accuracy

	- Top-1 accuracy

	Knesset-DictaBERT identified the correct token as its top-1 prediction in 52.55% of the cases.

	The original DictaBERT model achieved a top-1 accuracy of 48.02%.


	- Top-2 accuracy

	Knesset-DictaBERT identified the correct token within the top-2 predictions in 63.07% of the cases.

	The original DictaBERT model achieved a top-2 accuracy of 58.60%.


	- Top-5 accuracy

	Knesset-DictaBERT identified the correct token within the top-5 predictions in 73.59% of the cases.

	The original DictaBERT model achieved a top-5 accuracy of 68.98%.
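
The accuracy figures above count a prediction as correct when the original token appears among the model's k highest-ranked candidates for a masked position. A minimal sketch of that metric follows, assuming the predictions are already available as ranked candidate lists; `top_k_accuracy` and the toy data are hypothetical and are not taken from the authors' evaluation code.

```python
def top_k_accuracy(ranked_predictions, gold_tokens, k):
    """Fraction of masked positions whose gold token is among the top-k candidates.

    ranked_predictions: one list of candidate tokens per position, best first.
    gold_tokens: the original (unmasked) token for each position.
    """
    if len(ranked_predictions) != len(gold_tokens):
        raise ValueError("predictions and gold tokens must have the same length")
    if not gold_tokens:
        raise ValueError("at least one masked position is required")
    hits = sum(
        gold in candidates[:k]
        for candidates, gold in zip(ranked_predictions, gold_tokens)
    )
    return hits / len(gold_tokens)


# Toy data for illustration only: three masked positions, three candidates each
ranked = [
    ["דיון", "ישיבה", "הצבעה"],
    ["חוק", "הצעה", "ועדה"],
    ["שר", "חבר", "יושב"],
]
gold = ["ישיבה", "חוק", "ועדה"]

for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(ranked, gold, k):.2%}")
```

By construction, top-k accuracy can only stay equal or increase as k grows, which matches the ordering of the reported top-1, top-2, and top-5 figures.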

	## Acknowledgments
	This model is built upon the work of the Dicta team, and their contributions are gratefully acknowledged.

	## Citation
	If you use this model in your work, please cite:
	```bibtex
	@misc{goldin2024knessetdictaberthebrewlanguagemodel,
	title={Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings},
	author={Gili Goldin and Shuly Wintner},
	year={2024},
	eprint={2407.20581},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2407.20581},
	}
	```