Shaltiel committed on
Commit bf54b0f • 1 Parent(s): e4c44b3

Update README.md

Files changed (1)
  1. README.md +56 -3
README.md CHANGED
@@ -1,3 +1,56 @@
- ---
- license: cc-by-4.0
- ---
+ ---
+ license: cc-by-4.0
+ language:
+ - he
+ ---
+ # OtoBERT: Identifying Suffixed Verbal Forms in Modern Hebrew Literature
+
+ OtoBERT is a new language model for Hebrew, designed specifically for identifying suffixed verbal forms in Modern Hebrew. It is introduced in [this paper](https://arxiv.org/abs/2308.16687).
+
+ This is the base model, pretrained with the masked-language-modeling objective.
+
+ The model was trained with a special tokenizer that combines the bound suffix of an object pronoun into a single unit (e.g., `ראיתי אותו`, "I saw him", becomes one unit), and it was also trained to predict those units during masked-token prediction. For more details, see the paper linked above.
+
+ Sample usage:
+
+ ```python
+ import torch
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained('dicta-il/otobert')
+ model = AutoModelForMaskedLM.from_pretrained('dicta-il/otobert')
+
+ model.eval()
+
+ # English gloss: "I can't tell you when [MASK] most recently."
+ sentence = 'אני לא יכול להגיד לך מתי [MASK] לאחרונה.' # Supposed to be ראיתי אותו ("I saw him")
+
+ input_ids = tokenizer.encode(sentence, return_tensors='pt')
+ with torch.no_grad():  # inference only, so no gradients are needed
+     output = model(input_ids)
+
+ # locate [MASK] instead of hard-coding its position (index 7 here, with [CLS] at index 0)
+ mask_idx = (input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
+ top_2 = torch.topk(output.logits[0, mask_idx, :], 2)[1]
+ print('\n'.join(tokenizer.convert_ids_to_tokens(top_2.tolist()))) # should print נפגשנו / ראיתי_אותו
+
+ ```
+
+
+ ## Citation
+
+ If you use OtoBERT in your research, please cite ```OtoBERT: Identifying Suffixed Verbal Forms in Modern Hebrew Literature```
+
+ **BibTeX:**
+
+ ```bibtex
+ tbd
+ ```
+
+ ## License
+
+ Shield: [![CC BY 4.0][cc-by-shield]][cc-by]
+
+ This work is licensed under a
+ [Creative Commons Attribution 4.0 International License][cc-by].
+
+ [![CC BY 4.0][cc-by-image]][cc-by]
+
+ [cc-by]: http://creativecommons.org/licenses/by/4.0/
+ [cc-by-image]: https://i.creativecommons.org/l/by/4.0/88x31.png
+ [cc-by-shield]: https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg
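
The README text in this diff says the tokenizer treats a form such as `ראיתי אותו` as a single unit, and the sample output shows the token `ראיתי_אותו`. The toy sketch below may help make that idea concrete. It is not the OtoBERT tokenizer and is not part of this commit: the function name `merge_object_pronouns`, the pronoun list, and the rule of merging with whatever word comes directly before the pronoun are all assumptions made purely for illustration.

```python
# Toy illustration only; this is NOT the OtoBERT tokenizer.
# Assumed list of inflected forms of the Hebrew direct-object pronoun.
OBJECT_PRONOUNS = {
    "אותי", "אותך", "אותו", "אותה", "אותנו",
    "אתכם", "אתכן", "אותם", "אותן",
}


def merge_object_pronouns(words, joiner="_"):
    """Join each object pronoun to the word directly before it."""
    merged = []
    for word in words:
        can_merge = (
            word in OBJECT_PRONOUNS
            and merged                              # there is a preceding word
            and merged[-1] not in OBJECT_PRONOUNS   # which is not itself a pronoun
            and joiner not in merged[-1]            # and has not been merged already
        )
        if can_merge:
            merged[-1] = merged[-1] + joiner + word
        else:
            merged.append(word)
    return merged


# The sample sentence from the README with the expected fill in place of [MASK].
sentence = "אני לא יכול להגיד לך מתי ראיתי אותו לאחרונה"
print(merge_object_pronouns(sentence.split()))
```

A real system would need to check that the preceding word is actually a verb. This sketch skips that step, so it will also merge non-verbs that happen to precede an object pronoun.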