tarekeldeeb committed
Commit 5d048d3 · 1 parent: 1e9677b
Update README.md
README.md
CHANGED
@@ -1,3 +1,14 @@
 ---
 license: other
+language:
+- ar
 ---
+Arabic BPE Tokenization Using Google SentencePiece.
+
+Natural Language Processing (NLP) is a branch of AI. One of the first steps in any NLP system is encoding text for the language model. The challenge is how to represent words efficiently. Sub-word encoding is well suited to Arabic. For example, the word مدرساتهم is not treated as a single token but is split into three: مدرس, ات, and هم. This is the basic intuition. The splitting is learned automatically, without any hand-written rules or preprocessing.
+
+Vocab size: 8000 (32K also available)
+
+Project: https://github.com/tarekeldeeb/arabic_byte_pair_encoding
+
+License: [Waqf v2](https://github.com/ojuba-org/waqf/tree/master/2.0)
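The sub-word intuition in the README paragraph can be sketched in plain Python. This is only an illustration of how byte-pair-encoding merges segment a word. It is not the SentencePiece implementation and does not use the project's trained model. The merge table `MERGES` and the function `bpe_segment` are hypothetical names, and the merge rules are hand-picked so that the README's example word splits the way the README describes. A real model learns its merges from corpus statistics.

```python
from typing import Dict, List, Tuple

# Hypothetical, hand-picked merge rules in priority order
# (lower index = higher priority). A trained BPE model would
# learn these from pair frequencies in a corpus.
MERGES: List[Tuple[str, str]] = [
    ("م", "د"),
    ("ر", "س"),
    ("مد", "رس"),
    ("ا", "ت"),
    ("ه", "م"),
]
RANKS: Dict[Tuple[str, str], int] = {pair: i for i, pair in enumerate(MERGES)}


def bpe_segment(word: str, ranks: Dict[Tuple[str, str], int] = RANKS) -> List[str]:
    """Split `word` into sub-word units by repeatedly applying the
    highest-priority merge found among adjacent symbols."""
    symbols = list(word)  # start from individual characters
    while len(symbols) > 1:
        # Collect (rank, position) for every adjacent pair that has a merge rule.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks
        ]
        if not candidates:
            break  # no applicable merge is left
        _, i = min(candidates)  # best rank, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


if __name__ == "__main__":
    print(bpe_segment("مدرساتهم"))
```

With this toy merge table, the README's example word comes out as the three pieces it names (مدرس, ات, هم). A word with no applicable merges falls back to single characters.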