---
base_model:
- aubmindlab/bert-base-arabertv02
language:
- ar
model-index:
- name: omarelshehy/Arabic-Retrieval-v1.0
  results:
  - dataset:
      config: ar
      name: MTEB MIRACLRetrieval (ar)
      revision: main
      split: dev
      type: miracl/mmteb-miracl
    metrics:
    - type: main_score
      value: 58.664
    - type: map_at_1
      value: 32.399
    - type: map_at_10
      value: 50.236000000000004
    - type: map_at_100
      value: 51.87199999999999
    - type: map_at_1000
      value: 51.926
    - type: ndcg_at_1
      value: 48.376999999999995
    - type: ndcg_at_10
      value: 58.664
    - type: ndcg_at_100
      value: 63.754999999999995
    - type: ndcg_at_1000
      value: 64.672
    - type: ndcg_at_20
      value: 61.111000000000004
    - type: ndcg_at_3
      value: 51.266
    - type: ndcg_at_5
      value: 54.529
    task:
      type: Retrieval
  - dataset:
      config: ar
      name: MTEB MIRACLRetrievalHardNegatives (ar)
      revision: 95c8db7d4a6e9c1d8a60601afd63d553ae20a2eb
      split: dev
      type: mteb/miracl-hard-negatives
    metrics:
    - type: main_score
      value: 60.026
    - type: map_at_1
      value: 32.547
    - type: map_at_10
      value: 51.345
    - type: map_at_100
      value: 53.190000000000005
    - type: map_at_1000
      value: 53.237
    - type: ndcg_at_1
      value: 48.3
    - type: ndcg_at_10
      value: 60.026
    - type: ndcg_at_100
      value: 65.62400000000001
    - type: ndcg_at_1000
      value: 66.282
    - type: ndcg_at_20
      value: 62.856
    - type: ndcg_at_3
      value: 52.1
    - type: ndcg_at_5
      value: 55.627
    task:
      type: Retrieval
  - dataset:
      config: ara-ara
      name: MTEB MLQARetrieval (ara-ara)
      revision: 397ed406c1a7902140303e7faf60fff35b58d285
      split: test
      type: facebook/mlqa
    metrics:
    - type: main_score
      value: 56.032000000000004
    - type: map_at_1
      value: 45.218
    - type: map_at_10
      value: 52.32599999999999
    - type: map_at_100
      value: 53.001
    - type: map_at_1000
      value: 53.047999999999995
    - type: ndcg_at_1
      value: 45.228
    - type: ndcg_at_10
      value: 56.032000000000004
    - type: ndcg_at_100
      value: 59.486000000000004
    - type: ndcg_at_1000
      value: 60.938
    - type: ndcg_at_20
      value: 57.507
    - type: ndcg_at_3
      value: 52.05800000000001
    - type: ndcg_at_5
      value: 54.005
    task:
      type: Retrieval
  - dataset:
      config: ara-ara
      name: MTEB MLQARetrieval (ara-ara)
      revision: 397ed406c1a7902140303e7faf60fff35b58d285
      split: validation
      type: facebook/mlqa
    metrics:
    - type: main_score
      value: 71.11
    - type: map_at_1
      value: 58.221000000000004
    - type: map_at_10
      value: 67.089
    - type: map_at_100
      value: 67.62700000000001
    - type: map_at_1000
      value: 67.648
    - type: ndcg_at_1
      value: 58.221000000000004
    - type: ndcg_at_10
      value: 71.11
    - type: ndcg_at_100
      value: 73.824
    - type: ndcg_at_1000
      value: 74.292
    - type: ndcg_at_20
      value: 72.381
    - type: ndcg_at_3
      value: 67.472
    - type: ndcg_at_5
      value: 69.803
    task:
      type: Retrieval
  - dataset:
      config: ar
      name: MTEB MintakaRetrieval (ar)
      revision: efa78cc2f74bbcd21eff2261f9e13aebe40b814e
      split: test
      type: jinaai/mintakaqa
    metrics:
    - type: main_score
      value: 22.778000000000002
    - type: map_at_1
      value: 13.345
    - type: map_at_10
      value: 19.336000000000002
    - type: map_at_100
      value: 20.116999999999997
    - type: map_at_1000
      value: 20.246
    - type: ndcg_at_1
      value: 13.345
    - type: ndcg_at_10
      value: 22.778000000000002
    - type: ndcg_at_100
      value: 26.997
    - type: ndcg_at_1000
      value: 31.564999999999998
    - type: ndcg_at_20
      value: 24.368000000000002
    - type: ndcg_at_3
      value: 18.622
    - type: ndcg_at_5
      value: 20.72
    task:
      type: Retrieval
  - dataset:
      config: arabic
      name: MTEB MrTidyRetrieval (arabic)
      revision: fc24a3ce8f09746410daee3d5cd823ff7a0675b7
      split: test
      type: mteb/mrtidy
    metrics:
    - type: main_score
      value: 55.584999999999994
    - type: map_at_1
      value: 34.197
    - type: map_at_10
      value: 48.658
    - type: map_at_100
      value: 49.491
    - type: map_at_1000
      value: 49.518
    - type: ndcg_at_1
      value: 36.91
    - type: ndcg_at_10
      value: 55.584999999999994
    - type: ndcg_at_100
      value: 59.082
    - type: ndcg_at_1000
      value: 59.711000000000006
    - type: ndcg_at_20
      value: 57.537000000000006
    - type: ndcg_at_3
      value: 48.732
    - type: ndcg_at_5
      value: 52.834
    task:
      type: Retrieval
  - dataset:
      config: default
      name: MTEB SadeemQuestionRetrieval (default)
      revision: 3cb0752b182e5d5d740df547748b06663c8e0bd9
      split: test
      type: sadeem-ai/sadeem-ar-eval-retrieval-questions
    metrics:
    - type: main_score
      value: 67.916
    - type: map_at_1
      value: 31.785999999999998
    - type: map_at_10
      value: 58.18600000000001
    - type: map_at_100
      value: 58.287
    - type: map_at_1000
      value: 58.29
    - type: ndcg_at_1
      value: 31.785999999999998
    - type: ndcg_at_10
      value: 67.916
    - type: ndcg_at_100
      value: 68.44200000000001
    - type: ndcg_at_1000
      value: 68.53399999999999
    - type: ndcg_at_20
      value: 68.11
    - type: ndcg_at_3
      value: 66.583
    - type: ndcg_at_5
      value: 67.5
    task:
      type: Retrieval
  - dataset:
      config: ara-ara
      name: MTEB XPQARetrieval (ara-ara)
      revision: c99d599f0a6ab9b85b065da6f9d94f9cf731679f
      split: test
      type: jinaai/xpqa
    metrics:
    - type: main_score
      value: 43.622
    - type: map_at_1
      value: 19.236
    - type: map_at_10
      value: 37.047000000000004
    - type: map_at_100
      value: 38.948
    - type: map_at_1000
      value: 39.054
    - type: ndcg_at_1
      value: 35.333
    - type: ndcg_at_10
      value: 43.622
    - type: ndcg_at_100
      value: 50.761
    - type: ndcg_at_1000
      value: 52.932
    - type: ndcg_at_20
      value: 46.686
    - type: ndcg_at_3
      value: 37.482
    - type: ndcg_at_5
      value: 39.635999999999996
    task:
      type: Retrieval
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- generated_from_trainer
- loss:MultipleNegativesRankingLoss
- retrieval
- mteb
pipeline_tag: sentence-similarity
library_name: sentence-transformers
license: apache-2.0
---

### 🚀 Arabic-Retrieval-v1.0

This is a high-performance Arabic information retrieval model built using the robust **sentence-transformers** framework. It delivers **state-of-the-art performance** and is tailored to the richness and complexity of the Arabic language.

## 🔑 Key Features

- **🔥 Outstanding Performance**: Performs on par with top-tier multilingual models such as `multilingual-e5-large`. See [evaluation](https://huggingface.co/omarelshehy/Arabic-retrieval-v1.0#evaluation).
- **💡 Arabic-Focused**: Designed specifically for the nuances and dialects of Arabic, ensuring more accurate and context-aware results.
- **📉 Lightweight Efficiency**: Requires **25%-50% less memory**, making it ideal for environments with limited resources or edge deployments.

## 🌍 Why This Model?

Multilingual models are powerful, but they’re often bulky and not optimized for specific languages. This model bridges that gap, offering Arabic-native capabilities without sacrificing performance or efficiency.

Whether you’re working on search engines, chatbots, or large-scale NLP pipelines, this model provides a **fast, accurate, and resource-efficient solution**.

## Model Details

### Model Description

- **Model Type:** Sentence Transformer
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

## Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference. It is important to add the prefixes `<query>:` and `<passage>:` to your queries and passages when retrieving, in the following way:

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("omarelshehy/Arabic-Retrieval-v1.0")

# Query ("How can artificial intelligence improve traditional teaching methods?")
query = "<query>: كيف يمكن للذكاء الاصطناعي تحسين طرق التدريس التقليدية؟"

# Passages
passages = [
    "<passage>: طرق التدريس التقليدية تستفيد من الذكاء الاصطناعي عبر تحسين عملية المتابعة وتخصيص التجربة التعليمية. يقوم الذكاء الاصطناعي بتحليل بيانات الطلاب وتقديم توصيات فعالة للمعلمين حول طرق التدريس الأفضل.",
    "<passage>: تطوير التعليم الشخصي يعتمد بشكل كبير على الذكاء الاصطناعي، الذي يقوم بمتابعة تقدم الطلاب بشكل فردي. يقدم الذكاء الاصطناعي حلولاً تعليمية مخصصة لكل طالب بناءً على مستواه وأدائه.",
    "<passage>: الدقة في تقييم الطلاب تتزايد بفضل الذكاء الاصطناعي الذي يقارن النتائج مع معايير متقدمة. بالرغم من التحديات التقليدية، الذكاء الاصطناعي يوفر أدوات تحليل تتيح تقييماً أدق لأداء الطلاب.",
]

# Encode query and passages
embeddings_query = model.encode(query)
embeddings_passages = model.encode(passages)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings_query, embeddings_passages)

# Get best matching passage to query
best_match = passages[similarities.argmax().item()]
print(f"Best matching passage is {best_match}")
```

## Evaluation

This model has been evaluated on 3 different datasets using the NDCG@10 metric:

- Dataset 1: [castorini/mr-tydi](https://huggingface.co/datasets/castorini/mr-tydi)
- Dataset 2: [Omartificial-Intelligence-Space/Arabic-finanical-rag-embedding-dataset](https://huggingface.co/datasets/Omartificial-Intelligence-Space/Arabic-finanical-rag-embedding-dataset)
- Dataset 3: [sadeem-ai/sadeem-ar-eval-retrieval-questions](https://huggingface.co/datasets/sadeem-ai/sadeem-ar-eval-retrieval-questions)

It is compared to other highly performant models below (best score per dataset in bold):

| **Model**                           | **Dataset 1** | **Dataset 2** | **Dataset 3** |
|-------------------------------------|---------------|---------------|---------------|
| Arabic-Retrieval-v1.0               | 0.875         | **0.72**      | 0.679         |
| intfloat/multilingual-e5-large      | **0.89**      | 0.719         | **0.698**     |
| intfloat/multilingual-e5-base       | 0.87          | 0.69          | 0.686         |

## Citation

### BibTeX

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MultipleNegativesRankingLoss

```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
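
## Appendix: NDCG@10 Sketch

For readers unfamiliar with the NDCG@10 metric reported in the Evaluation section, the following is a minimal sketch of how it can be computed for a single query. It assumes the common linear-gain formulation, where each result contributes `relevance / log2(rank + 1)`. The function names `dcg_at_k` and `ndcg_at_k` and the toy relevance labels are illustrative assumptions, not the evaluation code used for this model.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain of the first k results (linear gain)."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: relevance label of every judged document, listed in
    the order the retriever ranked them (0 = not relevant).
    """
    # The ideal ranking places the most relevant documents first.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical toy ranking: the only relevant passage was ranked second.
toy_ranking = [0, 1, 0]
print(f"NDCG@10 = {ndcg_at_k(toy_ranking):.3f}")
```

A dataset-level score is typically the mean of this per-query value over all queries. Note that the table in the Evaluation section reports scores on a 0-1 scale, while the MTEB metadata at the top of this card uses a 0-100 scale.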