---
datasets:
- Omartificial-Intelligence-Space/Arabic-NLi-Triplet
language:
- ar
base_model: "intfloat/multilingual-e5-small"
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- arabic
- triplet-loss
widget: []
---

# Arabic NLI Triplet - Sentence Transformer Model

This repository contains a Sentence Transformer model fine-tuned on the [Omartificial-Intelligence-Space/Arabic-NLi-Triplet](https://huggingface.co/datasets/Omartificial-Intelligence-Space/Arabic-NLi-Triplet) dataset. The model generates 384-dimensional embeddings for Arabic semantic similarity tasks such as paraphrase mining, sentence similarity, and clustering.

## Model Overview

- **Model Type:** Sentence Transformer
- **Base Model:** `intfloat/multilingual-e5-small`
- **Training Dataset:** [Omartificial-Intelligence-Space/Arabic-NLi-Triplet](https://huggingface.co/datasets/Omartificial-Intelligence-Space/Arabic-NLi-Triplet)
- **Similarity Function:** Cosine Similarity
- **Embedding Dimensionality:** 384 dimensions
- **Maximum Sequence Length:** 128 tokens
- **Performance Improvement:** Around 10% improvement over the base model when evaluated on the dataset's test set.

## Dataset

### Arabic NLI Triplet Dataset

The dataset contains triplets of Arabic sentences: an anchor sentence, a positive sentence (semantically similar to the anchor), and a negative sentence (semantically dissimilar to the anchor). It is designed for learning sentence representations with a triplet margin loss.

Dataset Link: [Omartificial-Intelligence-Space/Arabic-NLi-Triplet](https://huggingface.co/datasets/Omartificial-Intelligence-Space/Arabic-NLi-Triplet)
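
As a quick sanity check, you can load the dataset and inspect a single triplet. The column names `anchor`, `positive`, and `negative` match those used in the training code below:

```python
from datasets import load_dataset

# Load the Arabic NLI triplet dataset
dataset = load_dataset("Omartificial-Intelligence-Space/Arabic-NLi-Triplet")

# Print one example triplet from the training split
example = dataset["train"][0]
print(example["anchor"])
print(example["positive"])
print(example["negative"])
```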

## Training Process

### Loss Function: Triplet Margin Loss

We used Triplet Margin Loss with a margin of `1.0`. The model is trained to minimize the distance between anchor and positive embeddings while maximizing the distance between anchor and negative embeddings.
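
Concretely, for an anchor embedding `a`, positive `p`, and negative `n`, the loss is `max(||a - p||_2 - ||a - n||_2 + margin, 0)`. A minimal PyTorch sketch of the loss configuration used during training, with random tensors standing in for real embeddings:

```python
import torch
from torch.nn import TripletMarginLoss

# Same loss configuration as in the training code below
triplet_loss = TripletMarginLoss(margin=1.0)

# Dummy 384-dimensional embeddings for a batch of 4 triplets
anchor = torch.randn(4, 384)
positive = torch.randn(4, 384)
negative = torch.randn(4, 384)

# max(||a - p||_2 - ||a - n||_2 + margin, 0), averaged over the batch
loss = triplet_loss(anchor, positive, negative)
print(loss.item())
```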

### Training Loss Progress

Below is the training loss recorded at various steps during training:

| Step | Training Loss |
|------|---------------|
| 500  | 0.136500      |
| 1000 | 0.126500      |
| 1500 | 0.127300      |
| 2000 | 0.114500      |
| 2500 | 0.110600      |
| 3000 | 0.102300      |
| 3500 | 0.101300      |
| 4000 | 0.106900      |
| 4500 | 0.097200      |
| 5000 | 0.091700      |
| 5500 | 0.092400      |
| 6000 | 0.095500      |

## Model Training Code

The model was trained using the following code (without resuming from checkpoints):

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModel, TrainingArguments, Trainer
from torch.nn import TripletMarginLoss

# Load dataset
dataset = load_dataset("Omartificial-Intelligence-Space/Arabic-NLi-Triplet")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-small")

# Tokenize anchor, positive, and negative sentences separately
def tokenize_function(examples):
    anchor_encodings = tokenizer(examples['anchor'], truncation=True, padding='max_length', max_length=128)
    positive_encodings = tokenizer(examples['positive'], truncation=True, padding='max_length', max_length=128)
    negative_encodings = tokenizer(examples['negative'], truncation=True, padding='max_length', max_length=128)

    return {
        'anchor_input_ids': anchor_encodings['input_ids'],
        'anchor_attention_mask': anchor_encodings['attention_mask'],
        'positive_input_ids': positive_encodings['input_ids'],
        'positive_attention_mask': positive_encodings['attention_mask'],
        'negative_input_ids': negative_encodings['input_ids'],
        'negative_attention_mask': negative_encodings['attention_mask'],
    }

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=dataset["train"].column_names)

# Define triplet loss (margin of 1.0, Euclidean distance)
triplet_loss = TripletMarginLoss(margin=1.0)

def compute_loss(anchor_embedding, positive_embedding, negative_embedding):
    return triplet_loss(anchor_embedding, positive_embedding, negative_embedding)

# Load base model
model = AutoModel.from_pretrained("intfloat/multilingual-e5-small")

class TripletTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        anchor_input_ids = inputs['anchor_input_ids'].to(self.args.device)
        anchor_attention_mask = inputs['anchor_attention_mask'].to(self.args.device)
        positive_input_ids = inputs['positive_input_ids'].to(self.args.device)
        positive_attention_mask = inputs['positive_attention_mask'].to(self.args.device)
        negative_input_ids = inputs['negative_input_ids'].to(self.args.device)
        negative_attention_mask = inputs['negative_attention_mask'].to(self.args.device)

        # Mean-pool the last hidden states to get one embedding per sentence
        anchor_embeds = model(input_ids=anchor_input_ids, attention_mask=anchor_attention_mask).last_hidden_state.mean(dim=1)
        positive_embeds = model(input_ids=positive_input_ids, attention_mask=positive_attention_mask).last_hidden_state.mean(dim=1)
        negative_embeds = model(input_ids=negative_input_ids, attention_mask=negative_attention_mask).last_hidden_state.mean(dim=1)

        loss = compute_loss(anchor_embeds, positive_embeds, negative_embeds)
        return (loss, (anchor_embeds, positive_embeds, negative_embeds)) if return_outputs else loss

# Training arguments
training_args = TrainingArguments(
    output_dir="/content/drive/MyDrive/results",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_dir='/content/drive/MyDrive/logs',
    remove_unused_columns=False,
    fp16=True,
    save_total_limit=3,
)

# Initialize trainer
trainer = TripletTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
)

# Start training
trainer.train()

# Save the model
trainer.save_model("/content/drive/MyDrive/fine-tuned-multilingual-e5")

# Evaluate (note: trainer.evaluate() requires an eval_dataset, passed either
# to the Trainer constructor or to evaluate() itself)
results = trainer.evaluate()
print(results)
```
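
Note that `trainer.save_model` writes a plain Hugging Face checkpoint, while the usage section below loads the model through `sentence-transformers`. A minimal sketch of one way to wrap such a checkpoint into a SentenceTransformer with mean pooling (matching the mean pooling in the training loop above); the checkpoint path is the example Drive path from the training code, and the output path is only illustrative:

```python
from sentence_transformers import SentenceTransformer, models

# Wrap the fine-tuned Hugging Face checkpoint as a Sentence Transformer module
word_embedding_model = models.Transformer(
    "/content/drive/MyDrive/fine-tuned-multilingual-e5",
    max_seq_length=128,
)

# Mean pooling over token embeddings, as in the training loop above
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode="mean",
)

st_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
st_model.save("/content/drive/MyDrive/ara-e5-small-sentence-transformer")
```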

## Framework Versions

- Python: 3.10.11
- Sentence Transformers: 3.0.1
- Transformers: 4.44.2
- PyTorch: 2.4.0
- Datasets: 2.21.0

## How to Use

To use the model, install the required libraries and load the model with the following code:

```bash
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned model
model = SentenceTransformer("gimmeursocks/ara-e5-small")

# Run inference on example Arabic sentences
# ("I am happy", "The weather is beautiful today", "This is a big dog")
sentences = ['أنا سعيد', 'الجو جميل اليوم', 'هذا كلب كبير']
embeddings = model.encode(sentences)
print(embeddings.shape)
```
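
Since the model uses cosine similarity, you can score sentence pairs directly from the embeddings. A short follow-up sketch using `sentence_transformers.util.cos_sim`:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("gimmeursocks/ara-e5-small")

sentences = ['أنا سعيد', 'الجو جميل اليوم', 'هذا كلب كبير']
embeddings = model.encode(sentences)

# Pairwise cosine similarity matrix (3 x 3) between the sentences
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```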

## Citation

If you use this model or dataset, please cite the corresponding paper or dataset source.