---
datasets:
- oscar
language:
- he
- ar
---
# HeArBERT
A bilingual BERT for Arabic and Hebrew, pretrained on the respective parts of the OSCAR corpus.
To process Arabic with this model, first transliterate it to Hebrew script. The transliteration code is available in the [preprocessing](./preprocessing.py) file and can be used as follows:
```python
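from transformers import AutoTokenizer

# preprocessing.py from this repository must be on the import path
from preprocessing import transliterate_arabic_to_hebrew

tokenizer = AutoTokenizer.from_pretrained("aviadrom/HeArBERT")

text_ar = "مرحبا"
# Transliterate the Arabic text into Hebrew script before tokenizing
text_he = transliterate_arabic_to_hebrew(text_ar)
tokenizer(text_he)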
```
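When inputs may be either Hebrew or Arabic, it can help to transliterate only the texts that actually contain Arabic script. The following is a minimal sketch, not part of this repository: `contains_arabic` and `to_model_script` are hypothetical helper names, and the check assumes that the basic Arabic Unicode block (U+0600 to U+06FF) is sufficient for your data.

```python
from typing import Callable


def contains_arabic(text: str) -> bool:
    """Return True if any character falls in the basic Arabic block.

    Assumption: only U+0600-U+06FF is checked; extended Arabic blocks
    and presentation forms are ignored.
    """
    return any("\u0600" <= ch <= "\u06FF" for ch in text)


def to_model_script(text: str, transliterate: Callable[[str], str]) -> str:
    """Apply `transliterate` only when the text contains Arabic script.

    `transliterate` is expected to be a function such as
    `transliterate_arabic_to_hebrew` from preprocessing.py.
    """
    return transliterate(text) if contains_arabic(text) else text
```

With the imports from the example above, this could be called as `tokenizer(to_model_script(text, transliterate_arabic_to_hebrew))`, so that Hebrew text passes through unchanged.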