Wajdi Ghezaiel

Wajdi1976
AI & ML interests

Speaker recognition, speaker diarization, speech processing, speech recognition

Recent Activity

Organizations

LinTO, LINAGORA Labs

Wajdi1976's activity

reacted to Salama1429's post with 👍 4 months ago
📚 Introducing the 101 Billion Arabic Words Dataset

🌍 Exciting Milestone in Arabic Language Technology! #NLP #ArabicLLM #LanguageModels

🚀 Why It Matters:
1. 🌟 Large Language Models (LLMs) have brought transformative changes, primarily in English. It's time for Arabic to shine!
2. 🎯 This project addresses the critical challenge of bias in Arabic LLMs due to reliance on translated datasets.

🔍 Approach:
1. 💪 Undertook a massive data mining initiative focusing exclusively on Arabic from Common Crawl WET files.
2. 🧹 Employed state-of-the-art cleaning and deduplication processes to maintain data quality and uniqueness.
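
To make the cleaning and deduplication step concrete, here is a minimal sketch in Python. It is not the project's actual pipeline: the Arabic-ratio filter, the `min_ratio` threshold, the whitespace normalization, and the function names are all assumptions chosen for illustration, and exact hash matching stands in for whatever deduplication method the authors used.

```python
import hashlib
import re

# Assumption: Arabic script is approximated by the basic Arabic Unicode block.
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")


def arabic_ratio(text: str) -> float:
    """Return the share of alphabetic characters that are Arabic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if ARABIC_CHAR.match(ch))
    return arabic / len(letters)


def normalize(text: str) -> str:
    """Collapse runs of whitespace so trivially different copies hash alike."""
    return " ".join(text.split())


def filter_and_dedup(docs, min_ratio=0.8):
    """Keep mostly-Arabic documents and drop exact duplicates.

    min_ratio is a hypothetical threshold, not a value from the paper.
    """
    seen = set()
    kept = []
    for doc in docs:
        cleaned = normalize(doc)
        if arabic_ratio(cleaned) < min_ratio:
            continue  # not predominantly Arabic (or empty)
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(cleaned)
    return kept
```

Exact hashing only catches identical text after normalization; near-duplicate detection (for example MinHash) would be needed to catch lightly edited copies.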

📈 Impact:
1. 🏆 Created the largest Arabic dataset to date with 101 billion words.
2. Enables the development of Arabic LLMs that are linguistically and culturally accurate.
3. 🌍 Sets a global benchmark for future Arabic language research.


🔗 Paper: https://lnkd.in/dGAiaygn
🔗 Dataset: https://lnkd.in/dGTMe5QV

- 🔄 Share your thoughts and let's drive the future of Arabic NLP together!

#DataScience #MachineLearning #ArtificialIntelligence #Innovation #ArabicData