---
license: mit
datasets:
- jatinmehra/MIT-PLAGAIRISM-DETECTION-DATASET
language:
- en
metrics:
- accuracy
- f1
- recall
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
new_version: jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection
pipeline_tag: text-classification
library_name: transformers
tags:
- legal
- plagiarism-detection
---

# SmolLM Fine-Tuned for Plagiarism Detection

This repository hosts a fine-tuned version of SmolLM (135M parameters) that detects plagiarism by classifying sentence pairs as either plagiarized or non-plagiarized. Fine-tuning was performed on the [MIT Plagiarism Detection Dataset](https://www.kaggle.com/datasets/ruvelpereira/mit-plagairism-detection-dataset) to improve the model's accuracy in identifying textual similarities.

## Model Information

- **Base Model**: HuggingFaceTB/SmolLM2-135M-Instruct
- **Fine-tuned Model Name**: `jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection`
- **License**: MIT
- **Language**: English
- **Task**: Text Classification
- **Metrics**: Accuracy, F1 Score, Recall

## Dataset

The model was fine-tuned on the MIT Plagiarism Detection Dataset, which provides pairs of sentences labeled to indicate whether one is a rephrased version of the other (i.e., plagiarized). The dataset is well suited to sentence-level similarity detection, and its labels (`1` for plagiarized, `0` for non-plagiarized) map directly onto a binary classification objective.

## Training Procedure

Fine-tuning was performed with the Hugging Face `transformers` library. Key details:

- **Model Architecture**: The base model was adapted for sequence classification with two output labels.
- **Optimizer**: AdamW with a learning rate of 2e-5.
- **Loss Function**: Cross-entropy loss.
- **Batch Size**: 16, balancing memory use and throughput.
- **Epochs**: 3.
- **Padding**: A custom padding token was added to meet SmolLM's requirements, ensuring smooth tokenization.

Training used a DataLoader that fed sentence pairs into the model, tokenized with attention masking, truncation, and padding. After training, the model reached roughly 99.66% accuracy on the training set.
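For readers who want to reproduce this setup, the following is a minimal sketch of the fine-tuning loop described above, not the exact training script. The use of `load_dataset`, the column names (`sentence1`, `sentence2`, `label`), and the `[PAD]` token choice are assumptions.

```python
# Illustrative fine-tuning sketch; hyperparameters follow the card (AdamW, lr=2e-5, batch size 16, 3 epochs).
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
model = AutoModelForSequenceClassification.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M-Instruct", num_labels=2
)

# SmolLM's tokenizer defines no padding token by default; add one and resize embeddings.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})  # "[PAD]" is an assumed choice
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

# Column names are assumptions about the dataset schema.
dataset = load_dataset("jatinmehra/MIT-PLAGAIRISM-DETECTION-DATASET", split="train")

def collate(batch):
    # Tokenize sentence pairs with truncation, padding, and attention masks.
    enc = tokenizer(
        [row["sentence1"] for row in batch],
        [row["sentence2"] for row in batch],
        truncation=True, padding=True, return_tensors="pt",
    )
    enc["labels"] = torch.tensor([row["label"] for row in batch])
    return enc

loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=collate)
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss  # cross-entropy over the two labels
        loss.backward()
        optimizer.step()
```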
## Usage

This model can be used directly with the Hugging Face Transformers library to classify sentence pairs as plagiarized or non-plagiarized. Load the model and tokenizer from the `jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection` repository and pass sentence pairs as inputs; the output logits indicate whether plagiarism is detected.

Example:

```python
import torch
from transformers import GPT2Tokenizer, LlamaForSequenceClassification

tokenizer = GPT2Tokenizer.from_pretrained("jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection")
model = LlamaForSequenceClassification.from_pretrained("jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection", num_labels=2)
model.eval()

# Classify a sentence pair: 1 = plagiarized, 0 = non-plagiarized
inputs = tokenizer("First sentence.", "A second sentence.", return_tensors="pt", truncation=True)
with torch.no_grad():
    prediction = model(**inputs).logits.argmax(dim=-1).item()
```

## Evaluation

The model performed robustly during evaluation:

#### Accuracy on Validation Set: 96%

#### Classification Report on Test Set

**Accuracy**: 96.20%

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 0     | 0.96      | 0.97   | 0.96     | 36,586  |
| 1     | 0.97      | 0.96   | 0.96     | 36,888  |

**Overall Metrics**:

- **Accuracy**: 0.96
- **Macro Average**:
  - Precision: 0.96
  - Recall: 0.96
  - F1-Score: 0.96
- **Weighted Average**:
  - Precision: 0.96
  - Recall: 0.96
  - F1-Score: 0.96
- **Total Support**: 73,474

## Model and Tokenizer Saving

After fine-tuning, the model and tokenizer were saved for deployment and easy loading in future projects. They can be loaded from Hugging Face or saved locally for custom applications; a minimal save/reload sketch appears at the end of this card.

## License

This model and the associated code are released under the MIT License, allowing both personal and commercial use.

### Connect with Me

I appreciate your support and am happy to connect!

[GitHub](https://github.com/Jatin-Mehra119) | [Email](mailto:jatinmehra@outlook.in) | [LinkedIn](https://www.linkedin.com/in/jatin-mehra119/) | [Portfolio](https://jatin-mehra119.github.io/Profile/)
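As referenced in the "Model and Tokenizer Saving" section, here is a minimal save-and-reload sketch. It assumes the `model` and `tokenizer` objects from the Usage example above are in scope, and the local directory name is hypothetical.

```python
from transformers import GPT2Tokenizer, LlamaForSequenceClassification

save_dir = "./smolLM-plagiarism-detector"  # hypothetical local path

# Save the fine-tuned model and tokenizer locally.
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# Reload them later for inference.
model = LlamaForSequenceClassification.from_pretrained(save_dir, num_labels=2)
tokenizer = GPT2Tokenizer.from_pretrained(save_dir)
```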