
# GATE-AraBert-V1

This is **GATE | General Arabic Text Embedding**, trained with the Sentence Transformers library in a multi-task setup on the AllNLI and STS datasets.

## Model Details

### Model Description

GATE-AraBert-V1 is a 135M-parameter Arabic sentence-embedding model (F32 weights) that maps sentences to 768-dimensional dense vectors for semantic textual similarity and related tasks.

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```shell
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Omartificial-Intelligence-Space/GATE-AraBert-v1")
# Run inference
sentences = [
    'الكلب البني مستلقي على جانبه على سجادة بيج، مع جسم أخضر في المقدمة.',  # "The brown dog is lying on its side on a beige rug, with a green object in the foreground."
    'لقد مات الكلب',  # "The dog died."
    'شخص طويل القامة',  # "A tall person."
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
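For SentenceTransformer models, `model.similarity` typically computes cosine similarity between the embeddings. A minimal NumPy sketch of that computation, using small made-up vectors rather than real 768-dimensional model output:

```python
import numpy as np

# Toy 4-dimensional "embeddings" standing in for model.encode output
# (hypothetical values for illustration only).
emb = np.array([
    [0.9, 0.1, 0.0, 0.2],   # sentence A
    [0.8, 0.2, 0.1, 0.3],   # a paraphrase-like neighbor of A
    [0.0, 0.9, 0.8, 0.1],   # an unrelated sentence
])

def cosine_similarity(a, b):
    """Cosine similarity: dot product of L2-normalized row vectors."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

sims = cosine_similarity(emb, emb)
print(np.round(sims, 3))
```

The diagonal is 1.0 (each vector compared with itself), and semantically close sentences score higher than unrelated ones, which is how ranking or retrieval on top of the embeddings works.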

## Evaluation

### Metrics

#### Semantic Similarity

| Metric             |  Value |
|:-------------------|-------:|
| pearson_cosine     | 0.8391 |
| spearman_cosine    | 0.8410 |
| pearson_manhattan  | 0.8277 |
| spearman_manhattan | 0.8361 |
| pearson_euclidean  | 0.8274 |
| spearman_euclidean | 0.8358 |
| pearson_dot        | 0.8154 |
| spearman_dot       | 0.8180 |
| pearson_max        | 0.8391 |
| spearman_max       | 0.8410 |
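Each metric is the Pearson or Spearman correlation between the model's similarity scores (under a given distance: cosine, Manhattan, Euclidean, dot product) and the gold human annotations; `*_max` reports the best of the four. A minimal sketch of how the two correlations differ, with made-up scores rather than the actual evaluation data:

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: normalized covariance of the raw values."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))

def spearman(x, y):
    """Spearman correlation: Pearson on the rank-transformed values."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(np.asarray(v_ := x)), rank(np.asarray(y)))

# Hypothetical gold STS annotations vs. model cosine similarities.
gold  = [5.0, 4.2, 1.0, 0.5, 3.3]
model = [0.92, 0.80, 0.15, 0.20, 0.65]
print(round(pearson(gold, model), 3), round(spearman(gold, model), 3))
```

Pearson rewards a linear relationship between the two score lists, while Spearman only cares about the ranking agreement, so it is robust to monotonic rescaling of the model's scores.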

#### Semantic Similarity

| Metric             |  Value |
|:-------------------|-------:|
| pearson_cosine     | 0.8130 |
| spearman_cosine    | 0.8173 |
| pearson_manhattan  | 0.8114 |
| spearman_manhattan | 0.8164 |
| pearson_euclidean  | 0.8103 |
| spearman_euclidean | 0.8158 |
| pearson_dot        | 0.7908 |
| spearman_dot       | 0.7887 |
| pearson_max        | 0.8130 |
| spearman_max       | 0.8173 |

## Acknowledgments

The author would like to thank Prince Sultan University for their invaluable support in this project. Their contributions and resources have been instrumental in the development and fine-tuning of these models.

## Citation

If you use GATE, please cite it as follows:

```bibtex
@misc{nacar2025GATE,
      title={GATE: General Arabic Text Embedding for Enhanced Semantic Textual Similarity with Hybrid Loss Training},
      author={Omer Nacar and Anis Koubaa and Serry Taiseer Sibaee and Lahouari Ghouti},
      year={2025},
      note={Submitted to COLING 2025},
      url={https://huggingface.co/Omartificial-Intelligence-Space/GATE-AraBert-v1},
}
```
