tarekeldeeb/arabic_bpe_8k


	---
	license: other
	language:
	- ar
	---
	Arabic BPE tokenization using Google SentencePiece.

	Natural Language Processing (NLP) is a branch of AI. One of the first steps in any NLP system is encoding text into tokens that a language model can consume. The challenge is to represent words efficiently. Sub-word encoding suits Arabic well, because Arabic words are often built from a stem plus affixes. For example, the word مدرساتهم is not treated as a single token but is split into three: مدرس, ات, and هم. That is the basic intuition. The splits are learned automatically from the training text, without hand-written rules or preprocessing.
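To make the intuition concrete, here is a minimal sketch of byte pair encoding in plain Python. It is a toy illustration, not the SentencePiece implementation and not the code used to train this model: the function names (`learn_bpe`, `merge_pair`, `encode`) and the tiny word-frequency table are assumptions made up for the example. The real model learns its merges from a large corpus, so the pieces printed here will generally differ from its actual splits.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge rules from a {word: count} mapping."""
    # Start with every word split into single characters.
    corpus = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by word count.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        corpus = {merge_pair(s, best): f for s, f in corpus.items()}
    return merges


def encode(word, merges):
    """Split `word` into sub-word pieces by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: word -> frequency (made up for illustration).
toy_corpus = {"مدرس": 10, "مدرسات": 6, "مدرساتهم": 3, "كتابهم": 5, "معلمات": 4}
merges = learn_bpe(toy_corpus, num_merges=10)
print(encode("مدرساتهم", merges))
```

Frequent character sequences such as a shared stem are merged early and end up as single pieces, while rarer combinations stay split. No linguistic rules are involved; the merges come only from pair counts in the data.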

	Vocab size: 8000 (32K also available)

	Project: https://github.com/tarekeldeeb/arabic_byte_pair_encoding

	License: [Waqf v2](https://github.com/ojuba-org/waqf/tree/master/2.0)