---

datasets:
- Omartificial-Intelligence-Space/Arabic-NLi-Triplet
language:
- ar
base_model: "intfloat/multilingual-e5-small"
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- arabic
- triplet-loss
widget: []
---


# Arabic NLI Triplet - Sentence Transformer Model

This repository contains a Sentence Transformer model fine-tuned on the [Omartificial-Intelligence-Space/Arabic-NLi-Triplet](https://huggingface.co/datasets/Omartificial-Intelligence-Space/Arabic-NLi-Triplet) dataset. It produces 384-dimensional embeddings for Arabic semantic-similarity tasks such as paraphrase mining, sentence similarity, and clustering.

## Model Overview

- **Model Type:** Sentence Transformer
- **Base Model:** `intfloat/multilingual-e5-small`
- **Training Dataset:** [Omartificial-Intelligence-Space/Arabic-NLi-Triplet](https://huggingface.co/datasets/Omartificial-Intelligence-Space/Arabic-NLi-Triplet)
- **Similarity Function:** Cosine Similarity
- **Embedding Dimensionality:** 384 dimensions
- **Maximum Sequence Length:** 128 tokens
- **Performance Improvement:** Roughly a 10% improvement over the base model when evaluated on the dataset's test split.
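
As a quick sanity check, the dimensionality and sequence-length figures above can be read directly off the loaded model. A minimal sketch (using the repository id from the usage section below; the printed values should be 384 and 128 if the saved configuration matches this overview):

```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model from the Hub
model = SentenceTransformer("gimmeursocks/ara-e5-small")

print(model.get_sentence_embedding_dimension())  # embedding dimensionality
print(model.max_seq_length)                      # maximum sequence length
```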

## Dataset

### Arabic NLI Triplet Dataset
The dataset contains Arabic sentence triplets: an anchor sentence, a positive sentence (semantically similar to the anchor), and a negative sentence (semantically dissimilar to the anchor). It is designed for learning sentence representations with triplet margin loss.

Dataset Link: [Omartificial-Intelligence-Space/Arabic-NLi-Triplet](https://huggingface.co/datasets/Omartificial-Intelligence-Space/Arabic-NLi-Triplet)
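
For reference, a minimal sketch of loading and inspecting the triplets (the `anchor`, `positive`, and `negative` column names match those used in the training code below):

```python
from datasets import load_dataset

# Load the Arabic NLI triplet dataset from the Hub
dataset = load_dataset("Omartificial-Intelligence-Space/Arabic-NLi-Triplet")

# Each example is an (anchor, positive, negative) sentence triplet
example = dataset["train"][0]
print(example["anchor"], example["positive"], example["negative"], sep="\n")
```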

## Training Process

### Loss Function: Triplet Margin Loss

We used the Triplet Margin Loss with a margin of `1.0`. The model is trained to pull anchor and positive embeddings together while pushing anchor and negative embeddings apart by at least the margin.
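
Concretely, for an anchor `a`, positive `p`, and negative `n`, PyTorch's `TripletMarginLoss` with the default Euclidean distance and this margin computes (here `f(·)` is the mean-pooled encoder output used in the training code below):

$$
\mathcal{L}(a, p, n) = \max\bigl(\lVert f(a) - f(p) \rVert_2 - \lVert f(a) - f(n) \rVert_2 + 1.0,\ 0\bigr)
$$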

### Training Loss Progress
Below is the training loss recorded at various steps during the training process:

| Step  | Training Loss |
|-------|---------------|
| 500   | 0.136500      |
| 1000  | 0.126500      |
| 1500  | 0.127300      |
| 2000  | 0.114500      |
| 2500  | 0.110600      |
| 3000  | 0.102300      |
| 3500  | 0.101300      |
| 4000  | 0.106900      |
| 4500  | 0.097200      |
| 5000  | 0.091700      |
| 5500  | 0.092400      |
| 6000  | 0.095500      |

## Model Training Code

The model was trained using the following code (without resuming from checkpoints):

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel, TrainingArguments, Trainer
from torch.nn import TripletMarginLoss

# Load dataset
dataset = load_dataset("Omartificial-Intelligence-Space/Arabic-NLi-Triplet")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")

# Tokenize anchor, positive, and negative sentences separately (max length 128)
def tokenize_function(examples):
    anchor_encodings = tokenizer(examples['anchor'], truncation=True, padding='max_length', max_length=128)
    positive_encodings = tokenizer(examples['positive'], truncation=True, padding='max_length', max_length=128)
    negative_encodings = tokenizer(examples['negative'], truncation=True, padding='max_length', max_length=128)

    return {
        'anchor_input_ids': anchor_encodings['input_ids'],
        'anchor_attention_mask': anchor_encodings['attention_mask'],
        'positive_input_ids': positive_encodings['input_ids'],
        'positive_attention_mask': positive_encodings['attention_mask'],
        'negative_input_ids': negative_encodings['input_ids'],
        'negative_attention_mask': negative_encodings['attention_mask'],
    }

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=dataset["train"].column_names)

# Define triplet loss (Euclidean distance, margin 1.0)
triplet_loss = TripletMarginLoss(margin=1.0)

def compute_loss(anchor_embedding, positive_embedding, negative_embedding):
    return triplet_loss(anchor_embedding, positive_embedding, negative_embedding)

# Load base model
model = AutoModel.from_pretrained("intfloat/multilingual-e5-small")

class TripletTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        anchor_input_ids = inputs['anchor_input_ids'].to(self.args.device)
        anchor_attention_mask = inputs['anchor_attention_mask'].to(self.args.device)
        positive_input_ids = inputs['positive_input_ids'].to(self.args.device)
        positive_attention_mask = inputs['positive_attention_mask'].to(self.args.device)
        negative_input_ids = inputs['negative_input_ids'].to(self.args.device)
        negative_attention_mask = inputs['negative_attention_mask'].to(self.args.device)

        # Mean-pool the last hidden state to get one embedding per sentence
        anchor_embeds = model(input_ids=anchor_input_ids, attention_mask=anchor_attention_mask).last_hidden_state.mean(dim=1)
        positive_embeds = model(input_ids=positive_input_ids, attention_mask=positive_attention_mask).last_hidden_state.mean(dim=1)
        negative_embeds = model(input_ids=negative_input_ids, attention_mask=negative_attention_mask).last_hidden_state.mean(dim=1)

        loss = compute_loss(anchor_embeds, positive_embeds, negative_embeds)
        return (loss, None) if return_outputs else loss

# Training arguments
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='/content/drive/MyDrive/logs',
    remove_unused_columns=False,
    fp16=True,
    save_total_limit=3,
)

# Initialize trainer
trainer = TripletTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
)

# Start training
trainer.train()

# Save the fine-tuned model
trainer.save_model("/content/drive/MyDrive/fine-tuned-multilingual-e5")

# Note: evaluate() requires an eval_dataset (e.g. the dataset's test split)
# to be passed to the Trainer or to evaluate() itself
results = trainer.evaluate()
print(results)
```

## Framework Versions

- Python: 3.10.11
- Sentence Transformers: 3.0.1
- Transformers: 4.44.2
- PyTorch: 2.4.0
- Datasets: 2.21.0

## How to Use

To use the model, install the required libraries and load the model with the following code:

```bash
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model
model = SentenceTransformer("gimmeursocks/ara-e5-small")

# Run inference
# ('I am happy', 'The weather is nice today', 'This is a big dog')
sentences = ['أنا سعيد', 'الجو جميل اليوم', 'هذا كلب كبير']
embeddings = model.encode(sentences)
print(embeddings.shape)  # (3, 384)
```
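
Since the model targets sentence similarity, the embeddings are typically compared with cosine similarity. A minimal follow-up sketch using `sentence_transformers.util` on the embeddings computed above:

```python
from sentence_transformers import util

# Pairwise cosine similarity between the three embeddings
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)  # 3x3 matrix; higher values indicate more similar sentences
```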

## Citation

If you use this model or dataset, please cite the corresponding paper or dataset source.