aviadrom committed on
Commit
19ead9e
1 Parent(s): 7f0b259

Create README.md

Files changed (1)
  1. README.md +23 -0
README.md ADDED
@@ -0,0 +1,23 @@
+ ---
+ datasets:
+ - oscar
+ language:
+ - he
+ - ar
+ ---
+ # HeArBERT
+
+ A bilingual BERT for Arabic and Hebrew, pretrained on the Arabic and Hebrew portions of the OSCAR corpus.
+
+ To process Arabic with this model, first transliterate it into Hebrew script. The code for doing so is available in the [preprocessing](./preprocessing.py) file and can be used as follows:
+
+ ```python
+ from transformers import AutoTokenizer
+ from preprocessing import transliterate_arabic_to_hebrew  # preprocessing.py from this repo must be importable
+
+ tokenizer = AutoTokenizer.from_pretrained("aviadrom/HeArBERT")
+
+ text_ar = "مرحبا"
+ text_he = transliterate_arabic_to_hebrew(text_ar)  # Arabic input rewritten in Hebrew script
+ tokenizer(text_he)  # tokenize the transliterated text, not the original Arabic
+ ```
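
The README does not show what `transliterate_arabic_to_hebrew` does internally. As a rough illustration of the idea only, the sketch below maps Arabic letters to their conventional Hebrew counterparts one character at a time. The table `ARABIC_TO_HEBREW` and the function `sketch_transliterate` are hypothetical names introduced here; they are not taken from `preprocessing.py`, which may handle diacritics, extra letters, and final forms differently.

```python
# Hypothetical, partial letter-for-letter table (Arabic -> Hebrew).
# Assumption: this is NOT the mapping used by the repository's preprocessing.py.
ARABIC_TO_HEBREW = {
    "ا": "א", "ب": "ב", "ج": "ג", "د": "ד",
    "ه": "ה", "و": "ו", "ز": "ז", "ح": "ח",
    "ط": "ט", "ي": "י", "ك": "כ", "ل": "ל",
    "م": "מ", "ن": "נ", "س": "ס", "ع": "ע",
    "ف": "פ", "ص": "צ", "ق": "ק", "ر": "ר",
    "ش": "ש", "ت": "ת",
}


def sketch_transliterate(text: str) -> str:
    """Replace each mapped Arabic letter; leave every other character unchanged."""
    return "".join(ARABIC_TO_HEBREW.get(ch, ch) for ch in text)


# The README's example word, rewritten in Hebrew script by the sketch.
print(sketch_transliterate("مرحبا"))
```

For real inputs, use the repository's own `transliterate_arabic_to_hebrew`, since the tokenizer's vocabulary was presumably built from text transliterated with that function.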