Create README.md
README.md
ADDED
@@ -0,0 +1,55 @@
---
license: cc-by-4.0
language:
- he
---
# MsBERT: A New Model for the Reconstruction of Lacunae in Hebrew Manuscripts

MsBERT (short for Manuscript BERT) is a pretrained BERT model designed from the ground up to handle Hebrew manuscript text.
MsBERT substantially outperforms all existing Hebrew BERT models at predicting missing words in fragmentary Hebrew manuscript transcriptions across multiple genres, and at differentiating between quoted passages and exegetical elaborations.
We provide MsBERT for free download and unrestricted use, along with an interactive, user-friendly website that lets manuscript scholars apply MsBERT to their work of reconstructing fragmentary Hebrew manuscripts.

You can try out the website here: [https://msbert.dicta.org.il](https://msbert.dicta.org.il).

Sample usage:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('dicta-il/MsBERT')
model = AutoModelForMaskedLM.from_pretrained('dicta-il/MsBERT')

model.eval()  # inference mode: disables dropout

text = '''ืืืฆืคืื ืืื [MASK] ืืจืืื ืืจืืื ืจ' [MASK] ื' [MASK] ืคืจืืืช ืืกืืืื ืื ืจ' ืืืื ื' ืื [MASK] ืฉืืืื ืืืื ืืืจืืื'''

input_ids = tokenizer.encode(text, return_tensors='pt')
with torch.no_grad():  # no gradients are needed for prediction
    output = model(input_ids)

# the first [MASK] is token #4 (counting from 0, where token #0 is [CLS])
top_2 = torch.topk(output.logits[0, 4, :], 2)[1]  # [1] selects the indices (token ids)
print('\n'.join(tokenizer.convert_ids_to_tokens(top_2))) # should print ืืืืจ / ืกืืืจ
```
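
The sample above reads predictions at a single, hard-coded token position. As a minimal sketch (not part of the original model card), the hypothetical helper `top_k_per_mask` below collects the top-k candidate token ids at every masked position instead. It works on plain Python lists so it can be checked without downloading the model. The toy ids and scores are made up for illustration, and the mask id `103` is an assumption; with the real tokenizer, use `tokenizer.mask_token_id`.

```python
import heapq
from typing import Dict, List, Sequence


def top_k_per_mask(
    input_ids: Sequence[int],
    logits: Sequence[Sequence[float]],
    mask_token_id: int,
    k: int = 2,
) -> Dict[int, List[int]]:
    """Return {position: [k best vocabulary ids]} for every masked position.

    Hypothetical helper for illustration; logits[i] holds the vocabulary
    scores for token position i.
    """
    predictions: Dict[int, List[int]] = {}
    for position, token_id in enumerate(input_ids):
        if token_id == mask_token_id:
            scores = logits[position]
            # Indices of the k highest scores, best first.
            predictions[position] = heapq.nlargest(
                k, range(len(scores)), key=scores.__getitem__
            )
    return predictions


# Toy data (made up): 4 token positions, a vocabulary of 5 entries, mask id 103.
toy_ids = [101, 103, 7, 103]
toy_logits = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 2.5, 0.3, 1.9, 0.2],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [3.0, 0.4, 0.5, 0.6, 2.0],
]
toy_predictions = top_k_per_mask(toy_ids, toy_logits, mask_token_id=103, k=2)
print(toy_predictions)  # {1: [1, 3], 3: [0, 4]}
```

With the real model, one would presumably pass the first row of the tensor returned by `tokenizer.encode(text, return_tensors='pt')` and `output.logits[0]`, each converted with `.tolist()`, and then map the resulting ids back to tokens with `tokenizer.convert_ids_to_tokens`.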

## Citation

If you use MsBERT in your research, please cite `MsBERT: A New Model for the Reconstruction of Lacunae in Hebrew Manuscripts`.

**BibTeX:**

```bibtex
to fill in
```

## License

Shield: [![CC BY 4.0][cc-by-shield]][cc-by]

This work is licensed under a
[Creative Commons Attribution 4.0 International License][cc-by].

[![CC BY 4.0][cc-by-image]][cc-by]

[cc-by]: http://creativecommons.org/licenses/by/4.0/
[cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
[cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg