---
license: cc-by-sa-4.0
datasets:
- HaifaCLGroup/KnessetCorpus
language:
- he
tags:
- hebrew
- nlp
- masked-language-model
- transformers
- BERT
- parliamentary-proceedings
- language-model
- Knesset
- DictaBERT
- fine-tuning
---

# Knesset-DictaBERT

**Knesset-DictaBERT** is a Hebrew language model fine-tuned on the [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus), which comprises Israeli parliamentary proceedings.

This model is based on the [DictaBERT](https://huggingface.co/dicta-il/dictabert) architecture and is designed to understand and generate text in Hebrew, with a specific focus on parliamentary language and context.

## Model Details

- **Model type**: BERT-based (Bidirectional Encoder Representations from Transformers)
- **Language**: Hebrew
- **Training Data**: [Knesset Corpus](https://huggingface.co/datasets/HaifaCLGroup/KnessetCorpus) (Israeli parliamentary proceedings)
- **Base Model**: [DictaBERT](https://huggingface.co/dicta-il/dictabert)

## Training Procedure

The model was fine-tuned with the masked language modeling (MLM) objective on the Knesset Corpus. In MLM, some of the tokens in each sentence are masked and the model is trained to predict them from the surrounding context, which lets it learn contextual representations of words.
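
To make the objective concrete, here is a minimal sketch of BERT-style input masking in plain Python. The 15% masking rate and the 80/10/10 replacement split are the defaults of the original BERT recipe and are assumptions here, since this card does not state the settings used for fine-tuning. The function name `mask_tokens`, the toy vocabulary, and the English placeholder tokens are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) for one MLM training example.

    labels[i] holds the original token where a prediction is required
    and None elsewhere (those positions contribute nothing to the loss).
    mask_prob and the 80/10/10 split are assumed BERT defaults.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mask_prob:
            # Position not selected: keep the token, no prediction needed.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            corrupted.append(token)              # 10%: leave unchanged
    return corrupted, labels


# Toy example with English placeholder tokens and an exaggerated masking rate.
tokens = "we have a meeting about this next week".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5)
print(corrupted)
print(labels)
```

During training, the loss is computed only at positions where `labels` is not `None`, so the model has to recover the original tokens from context.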

## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("GiliGold/Knesset-DictaBERT")
model = AutoModelForMaskedLM.from_pretrained("GiliGold/Knesset-DictaBERT")
model.eval()

# English gloss: "We have a [MASK] about this next week"
sentence = "יש לנו [MASK] על זה בשבוע הבא"

# Tokenize the input sentence and get predictions (no gradients needed for inference)
inputs = tokenizer.encode(sentence, return_tensors='pt')
with torch.no_grad():
    output = model(inputs)

# Locate the [MASK] token instead of hard-coding its position
mask_token_index = (inputs[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0].item()
top_2_tokens = torch.topk(output.logits[0, mask_token_index, :], 2)[1]

# Convert token IDs to tokens and print them
print('\n'.join(tokenizer.convert_ids_to_tokens(top_2_tokens)))

# Example output: ישיבה / דיון ("meeting" / "discussion")
```

## Evaluation

The model was evaluated on a held-out test set comprising 10% of the Knesset Corpus, approximately 3.2 million sentences. Perplexity was calculated on this full test set. Due to time constraints, accuracy was calculated on a subset of the test set consisting of approximately 300,000 sentences (approximately 3.5 million tokens).

#### Perplexity

On the full test set, the original DictaBERT has a perplexity of 22.87, while Knesset-DictaBERT has a perplexity of 6.60.
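
For reference, perplexity is the exponential of the average negative log-likelihood that a model assigns to the target tokens. The sketch below shows that arithmetic in plain Python. It assumes the scores are raw logits at the evaluated positions, and it is not the evaluation script behind the numbers above. The names `log_softmax` and `perplexity` are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def perplexity(logits_per_position, target_ids):
    """exp(mean negative log-likelihood of the target tokens)."""
    nll = [
        -log_softmax(logits)[target]
        for logits, target in zip(logits_per_position, target_ids)
    ]
    return math.exp(sum(nll) / len(nll))


# Toy check: a model that is uniform over a 4-token vocabulary is exactly
# as uncertain as a fair 4-way choice, so the result is 4 up to rounding.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
print(perplexity(uniform, [0, 1, 2]))
```

Read this way, lower perplexity means the model is, on average, less uncertain about the tokens it is asked to predict.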

#### Accuracy

- **Top-1 accuracy**

Knesset-DictaBERT identified the correct token in the top-1 prediction in 52.55% of the cases. The original DictaBERT model achieved a top-1 accuracy of 48.02%.

- **Top-2 accuracy**

Knesset-DictaBERT identified the correct token within the top-2 predictions in 63.07% of the cases. The original DictaBERT model achieved a top-2 accuracy of 58.60%.

- **Top-5 accuracy**

Knesset-DictaBERT identified the correct token within the top-5 predictions in 73.59% of the cases. The original DictaBERT model achieved a top-5 accuracy of 68.98%.
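
Top-k accuracy counts a masked position as correct when the true token is among the k highest-scoring predictions. A minimal sketch in plain Python follows. The function `top_k_accuracy` and the toy scores are illustrative and are not taken from the evaluation code used for the numbers above.

```python
def top_k_accuracy(scores_per_position, target_ids, k):
    """Fraction of positions whose target is among the k best-scoring tokens."""
    hits = 0
    for scores, target in zip(scores_per_position, target_ids):
        # Token ids sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(target_ids)


# Three masked positions over a 4-token toy vocabulary.
scores = [
    [0.10, 0.70, 0.20, 0.00],  # target 1 is ranked first
    [0.50, 0.05, 0.30, 0.15],  # target 2 is ranked second
    [0.30, 0.15, 0.05, 0.50],  # target 2 is ranked last
]
targets = [1, 2, 2]

for k in (1, 2, 5):
    print(k, top_k_accuracy(scores, targets, k))
```

Because every top-1 hit is also a top-2 and a top-5 hit, the accuracy can only stay the same or grow as k increases, which matches the ordering of the figures reported for both models.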

## Acknowledgments

This model is built upon the work of the Dicta team, and their contributions are gratefully acknowledged.

## Citation

If you use this model in your work, please cite:

```bibtex
@misc{goldin2024knessetdictaberthebrewlanguagemodel,
      title={Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings},
      author={Gili Goldin and Shuly Wintner},
      year={2024},
      eprint={2407.20581},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.20581},
}
```