---
datasets:
- oscar
language:
- he
- ar
---
# HeArBERT

A bilingual BERT for Arabic and Hebrew, pretrained on the respective parts of the OSCAR corpus.

To process Arabic text with this model, it must first be transliterated into Hebrew script. The code for doing so is available in the [preprocessing file](./preprocessing.py) and can be used as follows:

```python
from transformers import AutoTokenizer
from preprocessing import transliterate_arabic_to_hebrew

tokenizer = AutoTokenizer.from_pretrained("aviadrom/HeArBERT")

# Transliterate the Arabic input into Hebrew script before tokenizing
text_ar = "مرحبا"
text_he = transliterate_arabic_to_hebrew(text_ar)
tokenizer(text_he)
```
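The tokenized output can then be passed to the model like any other BERT checkpoint. As a minimal sketch, assuming the checkpoint loads as a standard BERT encoder via `AutoModel`:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from preprocessing import transliterate_arabic_to_hebrew

tokenizer = AutoTokenizer.from_pretrained("aviadrom/HeArBERT")
model = AutoModel.from_pretrained("aviadrom/HeArBERT")

# Arabic input is transliterated to Hebrew script before tokenization
text_he = transliterate_arabic_to_hebrew("مرحبا")
inputs = tokenizer(text_he, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings, shape: (batch, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```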


# Citation
If you find our work useful in your research, please consider citing:

```
@article{rom2024training,
  title={Training a Bilingual Language Model by Mapping Tokens onto a Shared Character Space},
  author={Rom, Aviad and Bar, Kfir},
  journal={arXiv preprint arXiv:2402.16065},
  year={2024}
}
```