---

datasets:
- Omartificial-Intelligence-Space/Arabic-NLi-Triplet
language:
- ar
base_model: "intfloat/multilingual-e5-small"
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- arabic
- triplet-loss
widget: []
---


# Arabic NLI Triplet - Sentence Transformer Model

This repository contains a Sentence Transformer model fine-tuned on the [Omartificial-Intelligence-Space/Arabic-NLi-Triplet](https://huggingface.co/datasets/Omartificial-Intelligence-Space/Arabic-NLi-Triplet) dataset. It produces 384-dimensional embeddings for Arabic semantic-similarity tasks such as paraphrase mining, sentence similarity, and clustering.

## Model Overview

- **Model Type:** Sentence Transformer
- **Base Model:** `intfloat/multilingual-e5-small`
- **Training Dataset:** [Omartificial-Intelligence-Space/Arabic-NLi-Triplet](https://huggingface.co/datasets/Omartificial-Intelligence-Space/Arabic-NLi-Triplet)
- **Similarity Function:** Cosine Similarity
- **Embedding Dimensionality:** 384 dimensions
- **Maximum Sequence Length:** 128 tokens
- **Performance Improvement:** Roughly a 10% improvement over the base model when evaluated on the dataset's test split.
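
As a quick sanity check, the dimensionality and sequence-length figures above can be read directly off the loaded model. A minimal sketch (using the repository id from the usage section below; the printed values should be 384 and 128 if the saved configuration matches this overview):

```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model from the Hub
model = SentenceTransformer("gimmeursocks/ara-e5-small")

print(model.get_sentence_embedding_dimension())  # embedding dimensionality
print(model.max_seq_length)                      # maximum sequence length
```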

## Dataset

### Arabic NLI Triplet Dataset
The dataset contains Arabic sentence triplets: an anchor sentence, a positive sentence (semantically similar to the anchor), and a negative sentence (semantically dissimilar to the anchor). It is designed for learning sentence representations with triplet margin loss.

Dataset Link: [Omartificial-Intelligence-Space/Arabic-NLi-Triplet](https://huggingface.co/datasets/Omartificial-Intelligence-Space/Arabic-NLi-Triplet)
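
For reference, a minimal sketch of loading and inspecting the triplets (the `anchor`, `positive`, and `negative` column names match those used in the training code below):

```python
from datasets import load_dataset

# Load the Arabic NLI triplet dataset from the Hub
dataset = load_dataset("Omartificial-Intelligence-Space/Arabic-NLi-Triplet")

# Each example is an (anchor, positive, negative) sentence triplet
example = dataset["train"][0]
print(example["anchor"], example["positive"], example["negative"], sep="\n")
```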

## Training Process

### Loss Function: Triplet Margin Loss

We used the Triplet Margin Loss with a margin of `1.0`. The model is trained to pull anchor and positive embeddings together while pushing anchor and negative embeddings apart by at least the margin.
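
Concretely, for an anchor `a`, positive `p`, and negative `n`, PyTorch's `TripletMarginLoss` with the default Euclidean distance and this margin computes (here `f(·)` is the mean-pooled encoder output used in the training code below):

$$
\mathcal{L}(a, p, n) = \max\bigl(\lVert f(a) - f(p) \rVert_2 - \lVert f(a) - f(n) \rVert_2 + 1.0,\ 0\bigr)
$$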

### Training Loss Progress
Below is the training loss recorded at various steps during the training process:

| Step  | Training Loss |
|-------|---------------|
| 500   | 0.136500      |
| 1000  | 0.126500      |
| 1500  | 0.127300      |
| 2000  | 0.114500      |
| 2500  | 0.110600      |
| 3000  | 0.102300      |
| 3500  | 0.101300      |
| 4000  | 0.106900      |
| 4500  | 0.097200      |
| 5000  | 0.091700      |
| 5500  | 0.092400      |
| 6000  | 0.095500      |

## Model Training Code

The model was trained using the following code (without resuming from checkpoints):

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel, TrainingArguments, Trainer
from torch.nn import TripletMarginLoss

# Load dataset
dataset = load_dataset("Omartificial-Intelligence-Space/Arabic-NLi-Triplet")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")

# Tokenize anchor, positive, and negative sentences separately (max length 128)
def tokenize_function(examples):
    anchor_encodings = tokenizer(examples['anchor'], truncation=True, padding='max_length', max_length=128)
    positive_encodings = tokenizer(examples['positive'], truncation=True, padding='max_length', max_length=128)
    negative_encodings = tokenizer(examples['negative'], truncation=True, padding='max_length', max_length=128)

    return {
        'anchor_input_ids': anchor_encodings['input_ids'],
        'anchor_attention_mask': anchor_encodings['attention_mask'],
        'positive_input_ids': positive_encodings['input_ids'],
        'positive_attention_mask': positive_encodings['attention_mask'],
        'negative_input_ids': negative_encodings['input_ids'],
        'negative_attention_mask': negative_encodings['attention_mask'],
    }

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=dataset["train"].column_names)

# Define triplet loss (Euclidean distance, margin 1.0)
triplet_loss = TripletMarginLoss(margin=1.0)

def compute_loss(anchor_embedding, positive_embedding, negative_embedding):
    return triplet_loss(anchor_embedding, positive_embedding, negative_embedding)

# Load base model
model = AutoModel.from_pretrained("intfloat/multilingual-e5-small")

class TripletTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        anchor_input_ids = inputs['anchor_input_ids'].to(self.args.device)
        anchor_attention_mask = inputs['anchor_attention_mask'].to(self.args.device)
        positive_input_ids = inputs['positive_input_ids'].to(self.args.device)
        positive_attention_mask = inputs['positive_attention_mask'].to(self.args.device)
        negative_input_ids = inputs['negative_input_ids'].to(self.args.device)
        negative_attention_mask = inputs['negative_attention_mask'].to(self.args.device)

        # Mean-pool the last hidden state to get one embedding per sentence
        anchor_embeds = model(input_ids=anchor_input_ids, attention_mask=anchor_attention_mask).last_hidden_state.mean(dim=1)
        positive_embeds = model(input_ids=positive_input_ids, attention_mask=positive_attention_mask).last_hidden_state.mean(dim=1)
        negative_embeds = model(input_ids=negative_input_ids, attention_mask=negative_attention_mask).last_hidden_state.mean(dim=1)

        loss = compute_loss(anchor_embeds, positive_embeds, negative_embeds)
        return (loss, None) if return_outputs else loss

# Training arguments
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='/content/drive/MyDrive/logs',
    remove_unused_columns=False,
    fp16=True,
    save_total_limit=3,
)

# Initialize trainer
trainer = TripletTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
)

# Start training
trainer.train()

# Save the fine-tuned model
trainer.save_model("/content/drive/MyDrive/fine-tuned-multilingual-e5")

# Note: evaluate() requires an eval_dataset (e.g. the dataset's test split)
# to be passed to the Trainer or to evaluate() itself
results = trainer.evaluate()
print(results)
```

## Framework Versions

- Python: 3.10.11
- Sentence Transformers: 3.0.1
- Transformers: 4.44.2
- PyTorch: 2.4.0
- Datasets: 2.21.0

## How to Use

To use the model, install the required libraries and load the model with the following code:

```bash
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model
model = SentenceTransformer("gimmeursocks/ara-e5-small")

# Run inference
# ('I am happy', 'The weather is nice today', 'This is a big dog')
sentences = ['أنا سعيد', 'الجو جميل اليوم', 'هذا كلب كبير']
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384)
```
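
Since the model targets sentence similarity, the embeddings are typically compared with cosine similarity. A minimal follow-up sketch using `sentence_transformers.util` on the embeddings computed above:

```python
from sentence_transformers import util

# Pairwise cosine similarity between the three embeddings
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)  # 3x3 matrix; higher values indicate more similar sentences
```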

## Citation

If you use this model or dataset, please cite the corresponding paper or dataset source.