diegovelilla committed on
Commit
ae7fe37
·
verified ·
1 Parent(s): 150b021

First initial commit

Browse files

All notebooks + python script + readme + requirements

README.md CHANGED
@@ -1,3 +1,130 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ # EssAI: AI-generated essays detector
2
+
3
+ ## Table of Contents
4
+
5
+ 1. [Overview](#overview)
6
+ 2. [Features](#features)
7
+ 3. [Files](#files)
8
+ 4. [Installation](#installation)
9
+ 5. [Usage](#usage)
10
+ 6. [Model Details](#model-details)
11
+ 7. [Dataset](#dataset)
12
+ 8. [Fine-tuning](#fine-tuning)
13
+ 9. [Results](#results)
14
+ 10. [Additional Resources](#additional-resources)
15
+ 11. [License](#license)
16
+ 12. [Contact](#contact)
17
+
18
+ ## Overview
19
+
20
+ This project fine-tunes a Large Language Model (LLM) to detect AI-generated essays. The model aims to help educators, researchers, and other individuals identify AI-generated text and verify the authenticity of written content.
21
+
22
+ ## Features
23
+
24
+ - Detects AI-generated essays with very high accuracy (over 95%).
25
+ - Fine-tuned on a large dataset combining ~500K human-written and AI-generated essays.
26
+
27
+ ## Files
28
+
29
+ ### `requirements.txt`
30
+ This file lists all the Python packages required to run the project. It ensures that all necessary dependencies are installed for the project to function correctly.
31
+
32
+ ### `essai_user_input.py`
33
+ This script handles user input. Just paste in your essay and run it to get a prediction.
34
+
35
+ ### `training.py`
36
+ This script handles the training process. It includes code for loading the dataset, fine-tuning the model, and saving the trained weights.
37
+
38
+ ### `testing.py`
39
+ This script is used to evaluate the performance of the trained model. It loads the test dataset, performs predictions, and calculates performance metrics such as accuracy and F1-score.
40
+
41
+ ### `data_insights.py`
42
+ This script generates insights and visualizations from the data used in this project. It includes functions for analyzing dataset statistics, plotting graphs, and summarizing key data points to help understand the dataset better.
43
+
44
+ ## Installation
45
+
46
+ To install the required dependencies, clone the repository and install the Python packages listed in **requirements.txt**:
47
+
48
+ ```bash
49
+ git clone https://github.com/diegovelilla/EssAI
50
+ cd EssAI
51
+ pip install -r requirements.txt
52
+ ```
53
+
54
+ ## Usage
55
+
56
+ You can use the model to check your own essays by running the **essai_user_input.py** file after copying your text into the input section right after the imports:
57
+
58
+ ```python
59
+ # --- INPUT ---
60
+
61
+ input_list = [""" WRITE HERE YOUR FIRST ESSAY """,
62
+ """ WRITE HERE YOUR SECOND ESSAY """]
63
+
64
+ # -------------
65
+ ```
66
+ As you can see, you can check more than one essay at a time. The model was trained on essays roughly 350-400 words long, so keep that in mind when using it. Learn more about the data used in the [data_insights](https://github.com/diegovelilla/EssAI/blob/main/essai_data_insights.ipynb) notebook.
67
+
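+ If you prefer to call the fine-tuned checkpoint directly from your own code, here is a minimal sketch using the Transformers `pipeline` API. Note that `essai_user_input.py` also cleans the text (stopword removal, lower-casing) before tokenizing, so this shortcut may give slightly different scores, and the raw label names (`LABEL_0` for human, `LABEL_1` for AI) are an assumption based on how the script interprets the two output classes.
+
+ ```python
+ # Sketch: classify essays with the published checkpoint via the pipeline API.
+ from transformers import pipeline
+
+ classifier = pipeline("text-classification", model="diegovelilla/EssAI")
+
+ essays = ["Paste your first essay here.", "Paste your second essay here."]
+ for i, result in enumerate(classifier(essays, truncation=True), start=1):
+     # LABEL_0 is assumed to map to human-written text and LABEL_1 to AI-generated text.
+     print(i, result["label"], f"{result['score'] * 100:.2f}%")
+ ```
+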
68
+ ## Model Details
69
+ The base model selected for this project was [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) from Hugging Face. BERT (Bidirectional Encoder Representations from Transformers) is a transformer model developed and published in 2018 by Google's AI research team. It is an open-source model with 110M parameters, pretrained on a large corpus of English text with two objectives:
70
+
71
+ - Predicting missing words in a sentence.
72
+ - Guessing if two sentences were next to each other in the original text.
73
+
74
+ These objectives make it a very capable text classification model and a great candidate for this project.
75
+
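+ For reference, the [training](https://github.com/diegovelilla/EssAI/blob/main/essai_training.ipynb) notebook instantiates the base model as a two-label sequence classifier through the Transformers Auto classes. A minimal sketch of that step:
+
+ ```python
+ # Sketch of how essai_training.ipynb loads the base model:
+ # bert-base-uncased with a 2-label classification head (0 = human, 1 = AI-generated).
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ checkpoint = "bert-base-uncased"
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+ model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
+ ```
+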
76
+ ## Dataset
77
+ The dataset was taken from Kaggle and can be found [here](https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text). It contains about 500K essays, of which around 60% are human-written and the remaining 40% AI-generated. For more information about the data, check out the [data_insights](https://github.com/diegovelilla/EssAI/blob/main/essai_data_insights.ipynb) notebook. Also check out the [training](https://github.com/diegovelilla/EssAI/blob/main/essai_training.ipynb) and [testing](https://github.com/diegovelilla/EssAI/blob/main/essai_testing.ipynb) notebooks if you are interested in how the model was fine-tuned or want to check its performance (instructions inside).
78
+
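+ The notebooks fetch the data with the Kaggle CLI (`kaggle datasets download -d shanegerami/ai-vs-human-text`) and read the unzipped `AI_Human.csv` with pandas. A small sketch of loading the CSV and checking the class balance, assuming the file sits in the working directory:
+
+ ```python
+ # Sketch: load the Kaggle CSV and inspect the human/AI class balance.
+ # The file has a 'text' column and a 'generated' column (0 = human, 1 = AI).
+ import pandas as pd
+
+ df = pd.read_csv("AI_Human.csv")
+ print(len(df), "essays")
+ print(df["generated"].value_counts(normalize=True))  # roughly 60% human / 40% AI
+ ```
+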
79
+ ## Fine-tuning
80
+ Due to resource constraints, and since this was intended as a learning project, only 1% of the full 500K dataset was used, which still amounts to a training set of about 4,000 essays and a test set of about 1,000 essays.
81
+
82
+ I encourage anyone reading this to further train the model by increasing the amount of data used in the [training](https://github.com/diegovelilla/EssAI/blob/main/essai_training.ipynb) notebook; the sketch below shows the line that controls it.
83
+
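+ The amount of data used is controlled by a single `sample(frac=...)` call. A sketch of that step, mirroring `essai_training.ipynb` (the 0.05 value here is only an illustrative, larger fraction; the notebook uses 0.01):
+
+ ```python
+ # Sketch: balance the classes and subsample the data, as in essai_training.ipynb.
+ import pandas as pd
+
+ df = pd.read_csv("AI_Human.csv")
+ df_human = df[df["generated"] == 0]
+ df_ai = df[df["generated"] == 1]
+ df_human = df_human.sample(n=len(df_ai))   # balance human vs. AI essays
+
+ frac = 0.05                                # hypothetical: the notebook uses 0.01
+ df_balanced = pd.concat([df_human, df_ai])
+ df_sample = df_balanced.sample(frac=frac).reset_index(drop=True)
+ print(len(df_sample), "essays selected for fine-tuning")
+ ```
+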
84
+ ## Results
85
+ In the initial testing phase with a sample of 1,000 essays, the model demonstrated an accuracy of 98%. In a subsequent, more extensive test involving 20,000 essays, it maintained a high accuracy of 97%.
89
+
90
+ For more detailed evaluation and further testing, please refer to the [testing](https://github.com/diegovelilla/EssAI/blob/main/essai_testing.ipynb) notebook.
91
+
92
+ ## Additional Resources
93
+
94
+ Throughout development I found some resources very useful, and I would like to share them here along with others related to the project.
95
+
96
+ ### Tutorials and Documentation
97
+
98
+ - **[Hugging Face NLP Course](https://huggingface.co/learn/nlp-course/)**: Comprehensive tutorials and documentation on what NLP is and how to use Hugging Face's libraries.
99
+ - **[Hugging Face Transformers Documentation](https://huggingface.co/transformers/)**: The official documentation for the Transformers library.
100
+
101
+ ### Articles and Papers
102
+
103
+ - **[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)**: The original research paper on BERT, which provided insights into the architecture and capabilities of the model.
104
+ - **[A Comprehensive Guide to Fine-Tuning BERT](https://towardsdatascience.com/a-comprehensive-guide-to-fine-tuning-bert-for-nlp-tasks-39ef4a51c7d3)**: An article that outlines various techniques for fine-tuning BERT models for specific tasks.
105
+
106
+ ### Tools and Libraries
107
+
108
+ - **[Kaggle Datasets](https://www.kaggle.com/datasets)**: Platform used to source the dataset for this project.
109
+ - **[Git Large File Storage (LFS)](https://git-lfs.github.com/)**: Tool used for managing large files in the Git repository. Very useful for moving big files like the ones that make up the model.
110
+
111
+ ### YouTube Channels
112
+
113
+ - **[Andrej Karpathy](https://www.youtube.com/@AndrejKarpathy)**: One of my favourite ML/DL YouTube channels with amazing videos. Can't stress enough how much I have learned from this man.
114
+ - **[DotCSV](https://www.youtube.com/@DotCSV)**: The first AI-related YouTube channel I ever followed. A great Spanish-speaking channel for keeping up with AI news.
115
+
116
+ These resources provided valuable information and tools throughout the project's development. If you’re working on similar projects, they might be helpful to you as well.
117
+
118
+ ## License
119
+ This project is licensed under the **Apache 2.0 License**. See the [LICENSE](https://github.com/diegovelilla/EssAI/blob/main/LICENSE) file for more details.
120
+
121
+ ## Contact
122
+
123
+ For any questions or feedback please reach out to:
124
+
125
+ - **Email**: [diegovelillarecio@gmail.com](mailto:diegovelillarecio@gmail.com)
126
+ - **GitHub Profile**: [diegovelilla](https://github.com/diegovelilla)
127
+ - **Hugging Face Profile**: [diegovelilla](https://huggingface.co/diegovelilla)
128
+ - **LinkedIn**: [Diego Velilla Recio](https://www.linkedin.com/in/diego-velilla-recio/)
129
+
130
+ Feel free to open an issue on GitHub or contact me through any of the channels above if you have any queries or suggestions.
essai_data_insights.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
essai_testing.ipynb ADDED
@@ -0,0 +1,280 @@
1
+ {
2
+ "metadata": {
3
+ "colab": {
4
+ "provenance": []
5
+ },
6
+ "kernelspec": {
7
+ "name": "python3",
8
+ "display_name": "Python 3"
9
+ },
10
+ "language_info": {
11
+ "name": "python",
12
+ "version": "3.10.13",
13
+ "mimetype": "text/x-python",
14
+ "codemirror_mode": {
15
+ "name": "ipython",
16
+ "version": 3
17
+ },
18
+ "pygments_lexer": "ipython3",
19
+ "nbconvert_exporter": "python",
20
+ "file_extension": ".py"
21
+ },
22
+ "kaggle": {
23
+ "accelerator": "none",
24
+ "dataSources": [],
25
+ "dockerImageVersionId": 30746,
26
+ "isInternetEnabled": false,
27
+ "language": "python",
28
+ "sourceType": "notebook",
29
+ "isGpuEnabled": false
30
+ }
31
+ },
32
+ "nbformat_minor": 0,
33
+ "nbformat": 4,
34
+ "cells": [
35
+ {
36
+ "cell_type": "code",
37
+ "source": [
38
+ "# --- INSTALLATION ---\n",
39
+ "\n",
40
+ "!pip install pandas numpy matplotlib nltk scikit-learn transformers datasets torch\n",
41
+ "!kaggle datasets download -d shanegerami/ai-vs-human-text\n",
42
+ "!unzip -n ai-vs-human-text.zip\n",
43
+ "!rm ai-vs-human-text.zip\n",
44
+ "\n",
45
+ "# -------------------------"
46
+ ],
47
+ "metadata": {
48
+ "id": "XKWBDF8lir6o",
49
+ "execution": {
50
+ "iopub.status.busy": "2024-08-14T18:13:18.903225Z",
51
+ "iopub.execute_input": "2024-08-14T18:13:18.903635Z",
52
+ "iopub.status.idle": "2024-08-14T18:14:34.119173Z",
53
+ "shell.execute_reply.started": "2024-08-14T18:13:18.903599Z",
54
+ "shell.execute_reply": "2024-08-14T18:14:34.117649Z"
55
+ },
56
+ "trusted": true
57
+ },
58
+ "execution_count": null,
59
+ "outputs": []
60
+ },
61
+ {
62
+ "cell_type": "code",
63
+ "source": [
64
+ "# --- IMPORTS ---\n",
65
+ "\n",
66
+ "import pandas as pd\n",
67
+ "import numpy as np\n",
68
+ "import matplotlib.pyplot as plt\n",
69
+ "import re\n",
70
+ "import nltk\n",
71
+ "from nltk.corpus import stopwords\n",
72
+ "nltk.download('stopwords')\n",
73
+ "stopwords = set(stopwords.words('english'))\n",
74
+ "from sklearn.model_selection import train_test_split\n",
75
+ "from sklearn.metrics import accuracy_score, precision_recall_fscore_support\n",
76
+ "from transformers import AutoTokenizer, AutoModelForSequenceClassification\n",
77
+ "from transformers import Trainer, TrainingArguments, DataCollatorWithPadding\n",
78
+ "from datasets import Dataset\n",
79
+ "import torch\n",
80
+ "\n",
81
+ "# -------------------------"
82
+ ],
83
+ "metadata": {
84
+ "id": "q9TGKRUIiPMy"
85
+ },
86
+ "execution_count": null,
87
+ "outputs": []
88
+ },
89
+ {
90
+ "cell_type": "code",
91
+ "source": [
92
+ "# --- USEFUL FUNCTIONS ----\n",
93
+ "\n",
94
+ "def clean_text(text):\n",
95
+ " \"\"\"\n",
96
+ " This function removes non-alphabetical characters and stopwords, and lower-cases the text.\n",
97
+ "\n",
98
+ " Args:\n",
99
+ " text (str): The text to be cleaned\n",
100
+ "\n",
101
+ " Returns:\n",
102
+ " text (str): The cleaned text\n",
103
+ "\n",
104
+ " Example:\n",
105
+ " df['text'] = df['text'].apply(clean_text)\n",
106
+ " \"\"\"\n",
107
+ " text = re.sub(r'[^a-zA-Z]', ' ', text)\n",
108
+ " text = text.lower()\n",
109
+ " words = text.split()\n",
110
+ " words = [word for word in words if word not in stopwords]\n",
111
+ " text = ' '.join(words)\n",
112
+ " return text\n",
113
+ "\n",
114
+ "def tokenize_function(dataframe):\n",
115
+ " \"\"\"\n",
116
+ " This function tokenizes the 'text' field of the dataframe.\n",
117
+ "\n",
118
+ " Args:\n",
119
+ " dataframe (pandas.DataFrame): The dataframe to be tokenized\n",
120
+ "\n",
121
+ " Returns:\n",
122
+ " dataframe (pandas.DataFrame): The tokenized dataframe\n",
123
+ "\n",
124
+ " Example and output:\n",
125
+ " train_dataset_token = train_dataset.map(tokenize_function, batched=True)\n",
126
+ " \"\"\"\n",
127
+ " return tokenizer(dataframe[\"text\"], truncation=True)\n",
128
+ "\n",
129
+ "def compute_metrics(eval_pred):\n",
130
+ " \"\"\"\n",
131
+ " This function computes the accuracy, precision, recall and f1 score of the model.\n",
132
+ "\n",
133
+ " It is passed to the trainer, and its output is reported when evaluating the model.\n",
134
+ "\n",
135
+ " Args:\n",
136
+ " eval_pred (tuple): The predictions and labels of the model\n",
137
+ "\n",
138
+ " Returns:\n",
139
+ " dict: The accuracy, precision, recall and f1 score of the model\n",
140
+ "\n",
141
+ " Example:\n",
142
+ " >>> trainer.evaluate()\n",
143
+ " {\n",
144
+ " 'accuracy': accuracy,\n",
145
+ " 'precision': precision,\n",
146
+ " 'recall': recall,\n",
147
+ " 'f1': f1\n",
148
+ " }\n",
149
+ " \"\"\"\n",
150
+ " predictions, labels = eval_pred\n",
151
+ " predictions = predictions.argmax(axis=-1)\n",
152
+ " accuracy = accuracy_score(labels, predictions)\n",
153
+ " precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')\n",
154
+ " return {\n",
155
+ " 'accuracy': accuracy,\n",
156
+ " 'precision': precision,\n",
157
+ " 'recall': recall,\n",
158
+ " 'f1': f1\n",
159
+ " }\n",
160
+ "\n",
161
+ "# -------------------------"
162
+ ],
163
+ "metadata": {
164
+ "id": "JtYsc4hJAnk3"
165
+ },
166
+ "execution_count": 41,
167
+ "outputs": []
168
+ },
169
+ {
170
+ "cell_type": "code",
171
+ "source": [
172
+ "# --- LOADING THE MODEL ---\n",
173
+ "\n",
174
+ "# Load the fine-tuned tokenizer and model from the Hugging Face Hub\n",
175
+ "checkpoint = \"diegovelilla/EssAI\"\n",
176
+ "tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n",
177
+ "model = AutoModelForSequenceClassification.from_pretrained(checkpoint)\n",
178
+ "\n",
179
+ "# -------------------------"
180
+ ],
181
+ "metadata": {
182
+ "id": "P87A1UTgJIia"
183
+ },
184
+ "execution_count": 42,
185
+ "outputs": []
186
+ },
187
+ {
188
+ "cell_type": "code",
189
+ "source": [
190
+ "# --- DATA PREPROCESSING ---\n",
191
+ "\n",
192
+ "df = pd.read_csv('AI_Human.csv')\n",
193
+ "\n",
194
+ "# Separate human from ai\n",
195
+ "df_human = df[df[\"generated\"] == 0]\n",
196
+ "df_ai = df[df[\"generated\"] == 1]\n",
197
+ "\n",
198
+ "# Take as many human-written essays as AI-generated ones since the dataset is a bit unbalanced\n",
199
+ "df_ai_len = df_ai[\"text\"].count()\n",
200
+ "df_human = df_human.sample(n=df_ai_len)\n",
201
+ "\n",
202
+ "# Concatenate both dataframes, shuffle them, and then take 1% of the data since that is enough for evaluation\n",
203
+ "# and with my current resources I can't process more. For better results, increase the fraction of the data used.\n",
204
+ "df_unshuffled = pd.concat([df_human, df_ai])\n",
205
+ "df = df_unshuffled.sample(frac=0.01).reset_index(drop=True)\n",
206
+ "\n",
207
+ "# Get rid of non-alphabetical characters and stopwords, and lower-case the text.\n",
208
+ "df['text'] = df['text'].apply(clean_text)\n",
209
+ "\n",
210
+ "# The Hugging Face Transformers library expects the target column to be named 'labels' and to hold ints\n",
211
+ "df = df.rename(columns={'generated': 'labels'})\n",
212
+ "df['labels'] = df['labels'].astype(int)\n",
213
+ "\n",
214
+ "# Convert the pandas dataframe into a Hugging Face dataset and tokenize it\n",
215
+ "ds = Dataset.from_pandas(df)\n",
216
+ "ds_token = ds.map(tokenize_function, batched=True)\n",
217
+ "\n",
218
+ "# Drop columns that are not necessary and set the dataset format to pytorch tensors\n",
219
+ "ds_token = ds_token.remove_columns([\"text\", \"token_type_ids\"])\n",
220
+ "ds_token.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])\n",
221
+ "\n",
222
+ "# -------------------------\n"
223
+ ],
224
+ "metadata": {
225
+ "id": "rYaHbUCDG7tf"
226
+ },
227
+ "execution_count": null,
228
+ "outputs": []
229
+ },
230
+ {
231
+ "cell_type": "code",
232
+ "source": [
233
+ "# --- INSTANTIATING TRAINER ----\n",
234
+ "\n",
235
+ "# Instantiate a DataCollatorWithPadding to pad the inputs in batches during evaluation\n",
236
+ "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)\n",
237
+ "\n",
238
+ "# Create the training arguments\n",
239
+ "training_args = TrainingArguments(\"./results\")\n",
240
+ "\n",
241
+ "# Create the trainer\n",
242
+ "trainer = Trainer(\n",
243
+ " model,\n",
244
+ " training_args,\n",
245
+ " eval_dataset=ds_token,\n",
246
+ " data_collator=data_collator,\n",
247
+ " tokenizer=tokenizer,\n",
248
+ " compute_metrics = compute_metrics\n",
249
+ ")\n",
250
+ "\n",
251
+ "# -------------------------"
252
+ ],
253
+ "metadata": {
254
+ "id": "Golh92ee33aA"
255
+ },
256
+ "execution_count": 50,
257
+ "outputs": []
258
+ },
259
+ {
260
+ "cell_type": "code",
261
+ "source": [
262
+ "# --- EVALUATION ---\n",
263
+ "\n",
264
+ "evaluation_results = trainer.evaluate()\n",
265
+ "\n",
266
+ "print(\"Accuracy:\", evaluation_results['eval_accuracy'])\n",
267
+ "print(\"Precision:\", evaluation_results['eval_precision'])\n",
268
+ "print(\"Recall:\", evaluation_results['eval_recall'])\n",
269
+ "print(\"F1:\", evaluation_results['eval_f1'])\n",
270
+ "\n",
271
+ "# -------------------------"
272
+ ],
273
+ "metadata": {
274
+ "id": "WkQgrxgFPkpJ"
275
+ },
276
+ "execution_count": null,
277
+ "outputs": []
278
+ }
279
+ ]
280
+ }
essai_training.ipynb ADDED
@@ -0,0 +1,333 @@
1
+ {
2
+ "metadata": {
3
+ "colab": {
4
+ "provenance": []
5
+ },
6
+ "kernelspec": {
7
+ "name": "python3",
8
+ "display_name": "Python 3"
9
+ },
10
+ "language_info": {
11
+ "name": "python",
12
+ "version": "3.10.13",
13
+ "mimetype": "text/x-python",
14
+ "codemirror_mode": {
15
+ "name": "ipython",
16
+ "version": 3
17
+ },
18
+ "pygments_lexer": "ipython3",
19
+ "nbconvert_exporter": "python",
20
+ "file_extension": ".py"
21
+ },
22
+ "kaggle": {
23
+ "accelerator": "none",
24
+ "dataSources": [],
25
+ "dockerImageVersionId": 30746,
26
+ "isInternetEnabled": false,
27
+ "language": "python",
28
+ "sourceType": "notebook",
29
+ "isGpuEnabled": false
30
+ }
31
+ },
32
+ "nbformat_minor": 0,
33
+ "nbformat": 4,
34
+ "cells": [
35
+ {
36
+ "cell_type": "code",
37
+ "source": [
38
+ "# --- INSTALLATION ---\n",
39
+ "\n",
40
+ "!pip install pandas numpy matplotlib nltk scikit-learn transformers datasets torch\n",
41
+ "!kaggle datasets download -d shanegerami/ai-vs-human-text\n",
42
+ "!unzip -n ai-vs-human-text.zip\n",
43
+ "!rm ai-vs-human-text.zip\n",
44
+ "\n",
45
+ "# -------------------------"
46
+ ],
47
+ "metadata": {
48
+ "id": "XKWBDF8lir6o",
49
+ "execution": {
50
+ "iopub.status.busy": "2024-08-14T18:13:18.903225Z",
51
+ "iopub.execute_input": "2024-08-14T18:13:18.903635Z",
52
+ "iopub.status.idle": "2024-08-14T18:14:34.119173Z",
53
+ "shell.execute_reply.started": "2024-08-14T18:13:18.903599Z",
54
+ "shell.execute_reply": "2024-08-14T18:14:34.117649Z"
55
+ },
56
+ "trusted": true
57
+ },
58
+ "execution_count": null,
59
+ "outputs": []
60
+ },
61
+ {
62
+ "cell_type": "code",
63
+ "source": [
64
+ "# --- IMPORTS ---\n",
65
+ "\n",
66
+ "import pandas as pd\n",
67
+ "import numpy as np\n",
68
+ "import matplotlib.pyplot as plt\n",
69
+ "import re\n",
70
+ "import nltk\n",
71
+ "from nltk.corpus import stopwords\n",
72
+ "nltk.download('stopwords')\n",
73
+ "stopwords = set(stopwords.words('english'))\n",
74
+ "from sklearn.model_selection import train_test_split\n",
75
+ "from sklearn.metrics import accuracy_score, precision_recall_fscore_support\n",
76
+ "from transformers import AutoTokenizer, AutoModelForSequenceClassification\n",
77
+ "from transformers import Trainer, TrainingArguments, DataCollatorWithPadding\n",
78
+ "from datasets import Dataset\n",
79
+ "import torch\n",
80
+ "\n",
81
+ "# -------------------------"
82
+ ],
83
+ "metadata": {
84
+ "id": "q9TGKRUIiPMy"
85
+ },
86
+ "execution_count": null,
87
+ "outputs": []
88
+ },
89
+ {
90
+ "cell_type": "code",
91
+ "source": [
92
+ "# --- USEFUL FUNCTIONS ----\n",
93
+ "\n",
94
+ "def clean_text(text):\n",
95
+ " \"\"\"\n",
96
+ " This function removes non-alphabetical characters and stopwords, and lower-cases the text.\n",
97
+ "\n",
98
+ " Args:\n",
99
+ " text (str): The text to be cleaned\n",
100
+ "\n",
101
+ " Returns:\n",
102
+ " text (str): The cleaned text\n",
103
+ "\n",
104
+ " Example:\n",
105
+ " df['text'] = df['text'].apply(clean_text)\n",
106
+ " \"\"\"\n",
107
+ " text = re.sub(r'[^a-zA-Z]', ' ', text)\n",
108
+ " text = text.lower()\n",
109
+ " words = text.split()\n",
110
+ " words = [word for word in words if word not in stopwords]\n",
111
+ " text = ' '.join(words)\n",
112
+ " return text\n",
113
+ "\n",
114
+ "def tokenize_function(dataframe):\n",
115
+ " \"\"\"\n",
116
+ " This function tokenizes the 'text' field of the dataframe.\n",
117
+ "\n",
118
+ " Args:\n",
119
+ " dataframe (pandas.DataFrame): The dataframe to be tokenized\n",
120
+ "\n",
121
+ " Returns:\n",
122
+ " dataframe (pandas.DataFrame): The tokenized dataframe\n",
123
+ "\n",
124
+ " Example and output:\n",
125
+ " train_dataset_token = train_dataset.map(tokenize_function, batched=True)\n",
126
+ " \"\"\"\n",
127
+ " return tokenizer(dataframe[\"text\"], truncation=True)\n",
128
+ "\n",
129
+ "def compute_metrics(eval_pred):\n",
130
+ " \"\"\"\n",
131
+ " This function computes the accuracy, precision, recall and f1 score of the model.\n",
132
+ "\n",
133
+ " It is passed to the trainer, and its output is reported when evaluating the model.\n",
134
+ "\n",
135
+ " Args:\n",
136
+ " eval_pred (tuple): The predictions and labels of the model\n",
137
+ "\n",
138
+ " Returns:\n",
139
+ " dict: The accuracy, precision, recall and f1 score of the model\n",
140
+ "\n",
141
+ " Example:\n",
142
+ " >>> trainer.evaluate()\n",
143
+ " {\n",
144
+ " 'accuracy': accuracy,\n",
145
+ " 'precision': precision,\n",
146
+ " 'recall': recall,\n",
147
+ " 'f1': f1\n",
148
+ " }\n",
149
+ " \"\"\"\n",
150
+ " predictions, labels = eval_pred\n",
151
+ " predictions = predictions.argmax(axis=-1)\n",
152
+ " accuracy = accuracy_score(labels, predictions)\n",
153
+ " precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average='binary')\n",
154
+ " return {\n",
155
+ " 'accuracy': accuracy,\n",
156
+ " 'precision': precision,\n",
157
+ " 'recall': recall,\n",
158
+ " 'f1': f1\n",
159
+ " }\n",
160
+ "\n",
161
+ "# -------------------------"
162
+ ],
163
+ "metadata": {
164
+ "id": "JtYsc4hJAnk3"
165
+ },
166
+ "execution_count": 3,
167
+ "outputs": []
168
+ },
169
+ {
170
+ "cell_type": "code",
171
+ "source": [
172
+ "# --- INSTANTIATING THE MODEL ---\n",
173
+ "\n",
174
+ "# Load the initial tokenizer and model, setting the number of labels to classify to 2\n",
175
+ "checkpoint = \"bert-base-uncased\"\n",
176
+ "tokenizer = AutoTokenizer.from_pretrained(checkpoint)\n",
177
+ "model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)\n",
178
+ "\n",
179
+ "# -------------------------------"
180
+ ],
181
+ "metadata": {
182
+ "id": "Golh92ee33aA"
183
+ },
184
+ "execution_count": null,
185
+ "outputs": []
186
+ },
187
+ {
188
+ "cell_type": "code",
189
+ "source": [
190
+ "# --- DATA PREPROCESSING ---\n",
191
+ "\n",
192
+ "df = pd.read_csv('AI_Human.csv')\n",
193
+ "\n",
194
+ "# Separate human from ai\n",
195
+ "df_human = df[df[\"generated\"] == 0]\n",
196
+ "df_ai = df[df[\"generated\"] == 1]\n",
197
+ "\n",
198
+ "# Take as many human-written essays as AI-generated ones since the dataset is a bit unbalanced\n",
199
+ "df_ai_len = df_ai[\"text\"].count()\n",
200
+ "df_human = df_human.sample(n=df_ai_len)\n",
201
+ "\n",
202
+ "# Concatenate both dataframes, shuffle them, and then take 1% of the data since that is enough to fine-tune the model\n",
203
+ "# and with my current resources I can't process more. For better results, increase the fraction of the data used.\n",
204
+ "df_unshuffled = pd.concat([df_human, df_ai])\n",
205
+ "df = df_unshuffled.sample(frac=0.01).reset_index(drop=True)\n",
206
+ "\n",
207
+ "# Get rid of non-alphabetical characters and stopwords, and lower-case the text.\n",
208
+ "df['text'] = df['text'].apply(clean_text)\n",
209
+ "\n",
210
+ "# Split in train/test (I used 80%/20%)\n",
211
+ "df_train, df_test = train_test_split(df, test_size=0.2)\n",
212
+ "\n",
213
+ "# The Hugging Face Transformers library expects the target column to be named 'labels' and to hold ints\n",
214
+ "df_train = df_train.rename(columns={'generated': 'labels'})\n",
215
+ "df_test = df_test.rename(columns={'generated': 'labels'})\n",
216
+ "df_train['labels'] = df_train['labels'].astype(int)\n",
217
+ "df_test['labels'] = df_test['labels'].astype(int)\n",
218
+ "\n",
219
+ "# Convert the pandas dataframes into Hugging Face datasets and tokenize both of them\n",
220
+ "train_dataset = Dataset.from_pandas(df_train)\n",
221
+ "test_dataset = Dataset.from_pandas(df_test)\n",
222
+ "train_dataset_token = train_dataset.map(tokenize_function, batched=True)\n",
223
+ "test_dataset_token = test_dataset.map(tokenize_function, batched=True)\n",
224
+ "\n",
225
+ "# Drop columns that are not necessary and set the dataset format to pytorch tensors\n",
226
+ "train_dataset_token = train_dataset_token.remove_columns([\"text\", \"__index_level_0__\", \"token_type_ids\"])\n",
227
+ "test_dataset_token = test_dataset_token.remove_columns([\"text\", \"__index_level_0__\", \"token_type_ids\"])\n",
228
+ "train_dataset_token.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])\n",
229
+ "test_dataset_token.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])\n",
230
+ "\n",
231
+ "# -------------------------\n"
232
+ ],
233
+ "metadata": {
234
+ "id": "GUNv7d5lkg2z"
235
+ },
236
+ "execution_count": null,
237
+ "outputs": []
238
+ },
239
+ {
240
+ "cell_type": "code",
241
+ "source": [
242
+ "# --- INSTANTIATING TRAINER ---\n",
243
+ "\n",
244
+ "# We instantiate a DataCollatorWithPadding in order to pad the inputs in batches while training\n",
245
+ "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)\n",
246
+ "\n",
247
+ "# Create the training arguments\n",
248
+ "training_args = TrainingArguments(\n",
249
+ " output_dir=\"./results\",\n",
250
+ " per_device_train_batch_size=16, # Adjust based on GPU memory\n",
251
+ " per_device_eval_batch_size=16,\n",
252
+ " num_train_epochs=3,\n",
253
+ " weight_decay=0.01,\n",
254
+ " logging_dir=\"./logs\",\n",
255
+ " logging_steps=100,\n",
256
+ ")\n",
257
+ "\n",
258
+ "# Create the trainer\n",
259
+ "trainer = Trainer(\n",
260
+ " model,\n",
261
+ " training_args,\n",
262
+ " train_dataset=train_dataset_token,\n",
263
+ " eval_dataset=test_dataset_token,\n",
264
+ " data_collator=data_collator,\n",
265
+ " tokenizer=tokenizer,\n",
266
+ " compute_metrics = compute_metrics\n",
267
+ ")\n",
268
+ "\n",
269
+ "# -------------------------"
270
+ ],
271
+ "metadata": {
272
+ "id": "FhqLhZv5HFot"
273
+ },
274
+ "execution_count": null,
275
+ "outputs": []
276
+ },
277
+ {
278
+ "cell_type": "code",
279
+ "source": [
280
+ "# --- TRAINING ---\n",
281
+ "\n",
282
+ "trainer.train()\n",
283
+ "\n",
284
+ "# ----------------"
285
+ ],
286
+ "metadata": {
287
+ "id": "T65B4LitLfsN"
288
+ },
289
+ "execution_count": null,
290
+ "outputs": []
291
+ },
292
+ {
293
+ "cell_type": "code",
294
+ "source": [
295
+ "# --- EVALUATION ---\n",
296
+ "\n",
297
+ "evaluation_results = trainer.evaluate()\n",
298
+ "\n",
299
+ "print(\"Accuracy:\", evaluation_results['eval_accuracy'])\n",
300
+ "print(\"Precision:\", evaluation_results['eval_precision'])\n",
301
+ "print(\"Recall:\", evaluation_results['eval_recall'])\n",
302
+ "print(\"F1:\", evaluation_results['eval_f1'])\n",
303
+ "\n",
304
+ "# -------------------------"
305
+ ],
306
+ "metadata": {
307
+ "id": "WkQgrxgFPkpJ"
308
+ },
309
+ "execution_count": null,
310
+ "outputs": []
311
+ },
312
+ {
313
+ "cell_type": "code",
314
+ "source": [
315
+ "# --- EXPORTING THE MODEL (optional) ---\n",
316
+ "\n",
317
+ "# Save the model and tokenizer\n",
318
+ "#model.save_pretrained(\"./AI-Detector-Model/Model\")\n",
319
+ "#tokenizer.save_pretrained(\"./AI-Detector-Model/Tokenizer\")\n",
320
+ "\n",
321
+ "# Zip the model\n",
322
+ "#!zip -r AI-Detector-Model.zip AI-Detector-Model\n",
323
+ "\n",
324
+ "# --------------------------"
325
+ ],
326
+ "metadata": {
327
+ "id": "DF-ZWbjHSxuE"
328
+ },
329
+ "execution_count": null,
330
+ "outputs": []
331
+ }
332
+ ]
333
+ }
essai_user_input.py ADDED
@@ -0,0 +1,166 @@
1
+ # --- IMPORTS ---
2
+
3
+ import torch
4
+ from datasets import Dataset
5
+ from transformers import Trainer, TrainingArguments, DataCollatorWithPadding
6
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
7
+ from sklearn.metrics import accuracy_score, precision_recall_fscore_support
8
+ from sklearn.model_selection import train_test_split
9
+ import pandas as pd
10
+ import numpy as np
11
+ import matplotlib.pyplot as plt
12
+ import re
13
+ import nltk
14
+ from nltk.corpus import stopwords
15
+ nltk.download('stopwords')
16
+ stopwords = set(stopwords.words('english'))
17
+
18
+ # -------------------------
19
+
20
+ # --- INPUT ---
21
+
22
+ input_list = [""" WRITE HERE YOUR FIRST ESSAY """,
23
+ """ WRITE HERE YOUR SECOND ESSAY """]
24
+
25
+ # -------------
26
+
27
+ # --- USEFUL FUNCTIONS ----
28
+
29
+
30
+ def clean_text(text):
31
+ """
32
+ This function removes non-alphabetical characters and stopwords, and lower-cases the text.
33
+
34
+ Args:
35
+ text (str): The text to be cleaned
36
+
37
+ Returns:
38
+ text (str): The cleaned text
39
+
40
+ Example:
41
+ df['text'] = df['text'].apply(clean_text)
42
+ """
43
+ text = re.sub(r'[^a-zA-Z]', ' ', text)
44
+ text = text.lower()
45
+ words = text.split()
46
+ words = [word for word in words if word not in stopwords]
47
+ text = ' '.join(words)
48
+ return text
49
+
50
+
51
+ def tokenize_function(dataframe):
52
+ """
53
+ This function tokenizes the 'text' field of the dataframe.
54
+
55
+ Args:
56
+ dataframe (pandas.DataFrame): The dataframe to be tokenized
57
+
58
+ Returns:
59
+ dataframe (pandas.DataFrame): The tokenized dataframe
60
+
61
+ Example and output:
62
+ train_dataset_token = train_dataset.map(tokenize_function, batched=True)
63
+ """
64
+ return tokenizer(dataframe["text"], truncation=True)
65
+
66
+
67
+ def compute_metrics(eval_pred):
68
+ """
69
+ This function computes the accuracy, precision, recall and f1 score of the model.
70
+
71
+ It is passed to the trainer, and its output is reported when evaluating the model.
72
+
73
+ Args:
74
+ eval_pred (tuple): The predictions and labels of the model
75
+
76
+ Returns:
77
+ dict: The accuracy, precision, recall and f1 score of the model
78
+
79
+ Example:
80
+ >>> trainer.evaluate()
81
+ {
82
+ 'accuracy': accuracy,
83
+ 'precision': precision,
84
+ 'recall': recall,
85
+ 'f1': f1
86
+ }
87
+ """
88
+ predictions, labels = eval_pred
89
+ predictions = predictions.argmax(axis=-1)
90
+ accuracy = accuracy_score(labels, predictions)
91
+ precision, recall, f1, _ = precision_recall_fscore_support(
92
+ labels, predictions, average='binary')
93
+ return {
94
+ 'accuracy': accuracy,
95
+ 'precision': precision,
96
+ 'recall': recall,
97
+ 'f1': f1
98
+ }
99
+
100
+ # -------------------------
101
+
102
+ # --- LOADING THE MODEL ---
103
+
104
+
105
+ # Load the fine-tuned tokenizer and model from the Hugging Face Hub
106
+ checkpoint = "diegovelilla/EssAI"
107
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
108
+ model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
109
+
110
+ # -------------------------
111
+
112
+ # --- DATA PREPROCESSING ---
113
+
114
+ n_input = len(input_list)
115
+
116
+ # Now we convert the input to a dataset
117
+ df = pd.DataFrame({'text': input_list})
118
+
119
+
120
+ # Get rid of non-alphabetical characters and stopwords, and lower-case the text.
121
+ df['text'] = df['text'].apply(clean_text)
122
+
123
+ # Convert the pandas dataframe into a Hugging Face dataset and tokenize it
124
+ ds = Dataset.from_pandas(df)
125
+ ds_token = ds.map(tokenize_function, batched=True)
126
+
127
+ # Drop columns that are not necessary and set the dataset format to pytorch tensors
128
+ ds_token = ds_token.remove_columns(["text", "token_type_ids"])
129
+ ds_token.set_format(type='torch', columns=['input_ids', 'attention_mask'])
130
+
131
+ # -------------------------
132
+
133
+ # --- INSTANTIATING TRAINER ----
134
+
135
+ # Instantiate a DataCollatorWithPadding to pad the inputs in batches during inference
136
+ data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
137
+
138
+ # Create the training arguments
139
+ training_args = TrainingArguments(".")
140
+
141
+ # Create the trainer
142
+ trainer = Trainer(
143
+ model,
144
+ training_args,
145
+ eval_dataset=ds_token,
146
+ data_collator=data_collator,
147
+ tokenizer=tokenizer,
148
+ compute_metrics=compute_metrics
149
+ )
150
+
151
+ # -------------------------
152
+
153
+ # --- PREDICT ---
154
+
155
+ # We predict and then format the output
156
+ predictions = trainer.predict(ds_token)
157
+ predictions = torch.from_numpy(predictions.predictions)
158
+ predictions = torch.nn.functional.softmax(predictions, dim=-1)
159
+
160
+ print('\n\n')
161
+ for i in range(n_input):
162
+ index = torch.argmax(predictions[i])
163
+ print(f'{i+1}: HUMAN with {round(predictions[i][0].item() * 100, 2)}% confidence.') if index == 0 else print(
164
+ f'{i+1}: AI with {round(predictions[i][1].item() * 100, 2)}% confidence.')
165
+
166
+ # -------------------------
requirements.txt ADDED
@@ -0,0 +1,10 @@
1
+ pandas
2
+ numpy
3
+ matplotlib
4
+ nltk
5
+ scikit-learn
6
+ transformers[pytorch]
7
+ accelerate>=0.21.0
8
+ datasets
9
+ torch
10
+ seaborn