---
license: mit
datasets:
- jatinmehra/MIT-PLAGAIRISM-DETECTION-DATASET
language:
- en
metrics:
- accuracy
- f1
- recall
base_model:
- HuggingFaceTB/SmolLM2-135M-Instruct
new_version: jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection
pipeline_tag: text-classification
library_name: transformers
tags:
- legal
- plagiarism-detection
---

# SmolLM Fine-Tuned for Plagiarism Detection

This repository hosts a fine-tuned version of SmolLM (135M parameters) that detects plagiarism by classifying sentence pairs as either plagiarized or non-plagiarized. Fine-tuning was performed on the [MIT Plagiarism Detection Dataset](https://www.kaggle.com/datasets/ruvelpereira/mit-plagairism-detection-dataset) to improve the model's accuracy in identifying textual similarities.

## Model Information

- **Base Model**: HuggingFaceTB/SmolLM2-135M-Instruct
- **Fine-tuned Model Name**: `jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection`
- **License**: MIT
- **Language**: English
- **Task**: Text Classification
- **Metrics**: Accuracy, F1 Score, Recall

## Dataset

The model was fine-tuned on the MIT Plagiarism Detection Dataset, which provides pairs of sentences labeled to indicate whether one is a rephrased version of the other (i.e., plagiarized). The dataset is well suited to sentence-level similarity detection, and its labels (`1` for plagiarized, `0` for non-plagiarized) map directly onto a binary classification objective.

## Training Procedure

Fine-tuning was performed with the Hugging Face `transformers` library. Key details:

- **Model Architecture**: The base model was adapted for sequence classification with two output labels.
- **Optimizer**: AdamW with a learning rate of 2e-5.
- **Loss Function**: Cross-entropy loss.
- **Batch Size**: 16, balancing memory use and throughput.
- **Epochs**: 3.
- **Padding**: A custom padding token was added to meet SmolLM's requirements, ensuring smooth tokenization.

Training used a DataLoader that fed sentence pairs into the model, tokenized with attention masking, truncation, and padding. After training, the model reached roughly 99.66% accuracy on the training set.
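For readers who want to reproduce this setup, the following is a minimal sketch of the fine-tuning loop described above, not the exact training script. The use of `load_dataset`, the column names (`sentence1`, `sentence2`, `label`), and the `[PAD]` token choice are assumptions.

```python
# Illustrative fine-tuning sketch; hyperparameters follow the card (AdamW, lr=2e-5, batch size 16, 3 epochs).
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")
model = AutoModelForSequenceClassification.from_pretrained(
    "HuggingFaceTB/SmolLM2-135M-Instruct", num_labels=2
)

# SmolLM's tokenizer defines no padding token by default; add one and resize embeddings.
tokenizer.add_special_tokens({"pad_token": "[PAD]"})  # "[PAD]" is an assumed choice
model.resize_token_embeddings(len(tokenizer))
model.config.pad_token_id = tokenizer.pad_token_id

# Column names are assumptions about the dataset schema.
dataset = load_dataset("jatinmehra/MIT-PLAGAIRISM-DETECTION-DATASET", split="train")

def collate(batch):
    # Tokenize sentence pairs with truncation, padding, and attention masks.
    enc = tokenizer(
        [row["sentence1"] for row in batch],
        [row["sentence2"] for row in batch],
        truncation=True, padding=True, return_tensors="pt",
    )
    enc["labels"] = torch.tensor([row["label"] for row in batch])
    return enc

loader = DataLoader(dataset, batch_size=16, shuffle=True, collate_fn=collate)
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss  # cross-entropy over the two labels
        loss.backward()
        optimizer.step()
```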
## Usage

This model can be used directly with the Hugging Face Transformers library to classify sentence pairs as plagiarized or non-plagiarized. Load the model and tokenizer from the `jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection` repository and pass sentence pairs as inputs; the output logits indicate whether plagiarism is detected.

Example:

```python
import torch
from transformers import GPT2Tokenizer, LlamaForSequenceClassification

tokenizer = GPT2Tokenizer.from_pretrained("jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection")
model = LlamaForSequenceClassification.from_pretrained("jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection", num_labels=2)
model.eval()

# Classify a sentence pair: 1 = plagiarized, 0 = non-plagiarized
inputs = tokenizer("First sentence.", "A second sentence.", return_tensors="pt", truncation=True)
with torch.no_grad():
    prediction = model(**inputs).logits.argmax(dim=-1).item()
```

## Evaluation

The model performed robustly during evaluation:

#### Accuracy on Validation Set: 96%

#### Classification Report on Test Set

**Accuracy**: 96.20%

| Class | Precision | Recall | F1-Score | Support |
|-------|-----------|--------|----------|---------|
| 0     | 0.96      | 0.97   | 0.96     | 36,586  |
| 1     | 0.97      | 0.96   | 0.96     | 36,888  |

**Overall Metrics**:

- **Accuracy**: 0.96
- **Macro Average**:
  - Precision: 0.96
  - Recall: 0.96
  - F1-Score: 0.96
- **Weighted Average**:
  - Precision: 0.96
  - Recall: 0.96
  - F1-Score: 0.96
- **Total Support**: 73,474

## Model and Tokenizer Saving

After fine-tuning, the model and tokenizer were saved for deployment and easy loading in future projects. They can be loaded from Hugging Face or saved locally for custom applications; a minimal save/reload sketch appears at the end of this card.

## License

This model and the associated code are released under the MIT License, allowing both personal and commercial use.

### Connect with Me

I appreciate your support and am happy to connect!

[GitHub](https://github.com/Jatin-Mehra119) | [Email](mailto:jatinmehra@outlook.in) | [LinkedIn](https://www.linkedin.com/in/jatin-mehra119/) | [Portfolio](https://jatin-mehra119.github.io/Profile/)
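As referenced in the "Model and Tokenizer Saving" section, here is a minimal save-and-reload sketch. It assumes the `model` and `tokenizer` objects from the Usage example above are in scope, and the local directory name is hypothetical.

```python
from transformers import GPT2Tokenizer, LlamaForSequenceClassification

save_dir = "./smolLM-plagiarism-detector"  # hypothetical local path

# Save the fine-tuned model and tokenizer locally.
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# Reload them later for inference.
model = LlamaForSequenceClassification.from_pretrained(save_dir, num_labels=2)
tokenizer = GPT2Tokenizer.from_pretrained(save_dir)
```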