---
datasets:
- oscar
language:
- he
- ar
---
# HeArBERT

A bilingual BERT for Arabic and Hebrew, pretrained on the respective parts of the OSCAR corpus.

To process Arabic text with this model, first transliterate it into Hebrew script. The transliteration code is available in the [preprocessing](./preprocessing.py) file and can be used as follows:

```python
from transformers import AutoTokenizer

# transliterate_arabic_to_hebrew is defined in this repository's preprocessing.py
from preprocessing import transliterate_arabic_to_hebrew

tokenizer = AutoTokenizer.from_pretrained("aviadrom/HeArBERT")

text_ar = "مرحبا"
# Transliterate the Arabic input into Hebrew script before tokenizing
text_he = transliterate_arabic_to_hebrew(text_ar)
tokenizer(text_he)
```
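
For readers unfamiliar with the idea, the following is a rough sketch of what character-level transliteration from Arabic to Hebrew script can look like. It is not the code in `preprocessing.py`: the names `ILLUSTRATIVE_MAP` and `sketch_transliterate` are hypothetical, the table covers only a handful of cognate letters chosen for illustration, and the real function may handle cases this sketch ignores (for example diacritics, Hebrew final letter forms, or letters without a direct counterpart). Use `transliterate_arabic_to_hebrew` for actual model input.

```python
# Illustrative only: a tiny, assumed subset of Arabic -> Hebrew letter pairs.
# The mapping used by preprocessing.py may differ from this table.
ILLUSTRATIVE_MAP = {
    "\u0627": "\u05D0",  # Arabic alef  -> Hebrew alef
    "\u0628": "\u05D1",  # Arabic ba    -> Hebrew bet
    "\u062F": "\u05D3",  # Arabic dal   -> Hebrew dalet
    "\u062D": "\u05D7",  # Arabic ha    -> Hebrew het
    "\u0644": "\u05DC",  # Arabic lam   -> Hebrew lamed
    "\u0645": "\u05DE",  # Arabic mim   -> Hebrew mem
    "\u0631": "\u05E8",  # Arabic ra    -> Hebrew resh
}


def sketch_transliterate(text: str) -> str:
    """Replace each mapped Arabic letter; leave every other character unchanged."""
    return "".join(ILLUSTRATIVE_MAP.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # Same example word as in the snippet above.
    print(sketch_transliterate("\u0645\u0631\u062D\u0628\u0627"))
```

Characters outside the table pass through untouched, so a sketch like this degrades gracefully on mixed-script input, but its output should not be expected to match the repository's function.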