---
title: AI Text Detector
emoji: 📈
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: 3.27.0
app_file: app.py
pinned: false
license: gpl-3.0
---

# Introduction

This project presents a tool that predicts whether a given text was generated by a large language model (LLM) such as ChatGPT. To do this, a machine learning model analyzes patterns in word choice and sentence structure that are typical of human or LLM writing. The system outputs a prediction with a confidence score and shows the factors that led to its decision. These factors are reported as percentiles relative to text the model has seen before. This tool is not 100% accurate and can incorrectly flag texts as human- or AI-written when they are not, so it should not be used as the sole measure for cheating detection.

# Usage

To run this project, visit https://huggingface.co/spaces/atyshka/ai-detector. The interface allows you to enter any text and click "Submit" to evaluate it. On the right, you will see 3 outputs:

![GUI Interface](GUI.png "GUI Interface")

First, you will see the prediction from the model (Human or AI) and its confidence score. Next, you will see the factors that contributed to this decision. The percentages do not represent how much each feature contributed to the decision; rather, they are percentiles indicating whether the feature is high or low compared to other text the model has seen. For example, a perplexity of 95% indicates that the perplexity (i.e. usage of rare words) is very high, while 5% indicates very low perplexity. Finally, you will see a visualization of the perplexity, where words are highlighted according to their "rareness".

At the bottom of the page, there are 4 examples. Examples 1 & 3 are written by ChatGPT, while 2 & 4 are human-written. Feel free to modify these examples or generate your own samples from scratch to see how the model scores change.
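The percentile outputs can be illustrated with a small sketch. Note that the reference perplexities below are invented for illustration; the real app derives its percentiles from the text it was trained on:

```python
# Illustrative only: these reference perplexities are made up,
# not taken from the actual training corpus.
reference_perplexities = [12.0, 15.5, 18.2, 21.0, 24.7, 28.3, 31.9, 35.4, 40.1, 55.6]

def percentile_of(value, reference):
    """Percentage of reference values that fall below `value`."""
    below = sum(1 for r in reference if r < value)
    return 100.0 * below / len(reference)

# A text scoring perplexity 50 uses rarer words than most reference texts:
print(percentile_of(50.0, reference_perplexities))  # 90.0
```

A 90th-percentile perplexity, as in this toy example, would lean the prediction toward "Human", since LLM text tends to use more predictable words.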
# Documentation

The heart of this project is a logistic regression model that uses features of the text to predict whether it was written by an LLM. The model uses 4 input features: perplexity, mean sentence length, the standard deviation of sentence length, and the Shapiro-Wilk p-value for sentence length. Perplexity is measured as the negative log likelihood of the sequence, while the Shapiro-Wilk p-value measures how well the sentence lengths fit a normal distribution. I also experimented with a multilayer perceptron, but the F1 score improved by only 1 point (.93 -> .94). Given its better interpretability, I kept the logistic regression approach for the final product.

For calculating perplexity, the system uses GPT2-large, an autoregressive decoder-only transformer with 774M parameters. This model was trained by OpenAI on the WebText dataset, which consists of outbound links from Reddit. OpenAI did not share the training time, epochs, or other specifics of the training procedure, noting only that the learning rate was manually tuned. GPT-2 can exhibit many biases and factual inaccuracies; however, since the model is not used generatively in this project, these problems are largely irrelevant here.

For training the classifier, I use the GPT Wiki Intro dataset, which consists of Wikipedia intro paragraphs written by GPT-3. The Curie model variant was used to generate intro paragraphs given the title of the article and its first 7 words as a prompt. The dataset covers 150k topics; the creators did not share how these topics were selected. Each LLM-written paragraph is paired with the human-written version; having paired examples from the same domain is why I chose this dataset. For computational efficiency, I use only 4000 topics for training and 1000 for testing.

The project uses Scikit-Learn for the logistic regression model, Huggingface/PyTorch for the GPT-2 model, and Gradio for the user interface.
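The perplexity feature can be sketched in miniature. The real implementation scores each token with GPT2-large; here the per-token probabilities are invented, purely to show how the negative log likelihood is computed from them:

```python
import math

# Hypothetical next-token probabilities assigned by a language model;
# the real system obtains these from GPT2-large.
token_probs = [0.25, 0.10, 0.40, 0.05, 0.20]

# Mean negative log likelihood: high values mean the model found the
# tokens surprising (rare words), low values mean predictable text.
nll = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Conventional perplexity is the exponential of the mean NLL.
perplexity = math.exp(nll)
```

Predictable, "safe" word choices push this number down, which is exactly the signal that tends to distinguish LLM output from human writing.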
NLTK and SciPy are also used for calculating the features for the model input.

# Contributions

A perplexity-based LLM detector is not a new idea; many solutions such as GPTZero were published in the wake of ChatGPT. However, most of the popular solutions are not open-source. The initial design was based on Huggingface's perplexity example (https://huggingface.co/docs/transformers/perplexity), but it needed quite a bit of modification to obtain word-level perplexities and then map subword tokens back to full words. I then added the other statistics on the distribution of sentence lengths and evaluated two types of classifiers. I also planned to add synonym-frequency usage with WordNet as a feature, but there was not sufficient time to implement it.

From the user side, I decided to focus on interpretability of the results, showing how the 4 features contributed to the final decision. I also added the perplexity visualization to help non-experts understand what the model is paying attention to. I hope this added interpretability makes the model less of a black box for users.

# Limitations

The LLM detector does not work well with short spans of text, as there is not sufficient data to make a strong inference. It also works best in its original domain of Wikipedia intros; other domains such as fiction may have different distributions of sentence length and perplexity. Text that an LLM has paraphrased from a vocabulary-rich sample is particularly hard to detect, because the model will reuse those high-perplexity words. In general, longer prompts fool the classifier more, because they significantly alter the word distribution. I also suspect that text from state-of-the-art models such as GPT-3.5/4 or PaLM may be harder to detect than text from the much simpler Curie GPT-3 model. Ideally, I would use these models to generate training data for my classifier; however, this would incur significant expense.
Finally, the classifier can be tricked with simple modifications of LLM-written text, such as adding rare words or long/short sentences. Given these limitations, I would emphasize that this detector serves only as a loose guide and should not be used for cheating detection.

# Code

The model inference is run by app.py, and dependencies are listed in requirements.txt. The training code is in main.ipynb.
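For illustration, the way a trained logistic regression combines the 4 features can be sketched as follows. The weights and standardized feature values here are invented, not the coefficients actually fitted in main.ipynb:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Invented weights for [perplexity, mean sentence length,
# sentence-length std, Shapiro-Wilk p-value]; the real model's
# coefficients are learned with Scikit-Learn in main.ipynb.
weights = [-1.2, -0.4, -0.9, 0.7]
bias = 0.3

def p_ai(features):
    """Probability that the text is AI-written under this toy model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Standardized features for a hypothetical text with low perplexity
# and very uniform sentence lengths -- hallmarks of LLM output:
print(p_ai([-1.5, 0.2, -1.0, 0.8]))  # close to 1.0, i.e. likely AI
```

Because the decision is just a weighted sum, each feature's contribution can be read directly off the coefficients, which is the interpretability advantage mentioned above.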