title: Arxiv Plagiarism Checker LLM
emoji: π
colorFrom: pink
colorTo: pink
sdk: docker
app_port: 7860
pinned: true
Arxiv Plagiarism Checker LLM
Demo - Link
Dataset - Link
Check an arXiv author's papers for plagiarism just by entering the author's name.
Docs & Working
INPUT - Author's Name
OUTPUT - Plagiarism Check Results
You can get a list of MIT authors here - Link
Dataset & Embeddings
We used the arXiv dataset for the years 2023 and 2024 and generated embeddings for the documents with the OpenAI embeddings API.
- Install gsutil - Link
# Single-year files (-r copies the matching directories recursively)
gsutil cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/19*/ ./papers_from_2019/
# Single file
gsutil cp gs://arxiv-dataset/arxiv/arxiv/pdf/2310/2310.00001v1.pdf .
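The bucket paths in the commands above follow a regular pattern: a `YYMM` folder, then the arXiv id with a version suffix. As a minimal sketch, assuming that pattern holds for every new-style arXiv id, a small helper can build the path for a given paper. The function name `arxiv_pdf_uri` and the constant `BUCKET_PREFIX` are hypothetical and not part of this project.

```python
import re

# Assumption: bucket prefix copied from the gsutil examples in this README.
BUCKET_PREFIX = "gs://arxiv-dataset/arxiv/arxiv/pdf"

# New-style arXiv identifiers look like YYMM.NNNNN (4 or 5 digits after the dot).
_ARXIV_ID = re.compile(r"^(\d{4})\.(\d{4,5})$")


def arxiv_pdf_uri(arxiv_id: str, version: int = 1) -> str:
    """Build the assumed bucket path for one paper (hypothetical helper)."""
    match = _ARXIV_ID.match(arxiv_id)
    if match is None:
        raise ValueError(f"not a new-style arXiv id: {arxiv_id!r}")
    if version < 1:
        raise ValueError("version must be >= 1")
    yymm = match.group(1)  # the folder name is the YYMM part of the id
    return f"{BUCKET_PREFIX}/{yymm}/{arxiv_id}v{version}.pdf"


print(arxiv_pdf_uri("2310.00001"))
```

The printed path can be passed directly to `gsutil cp` as in the single-file example.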
Tech Stack
- Gradio
- ChromaDB
- SERP API
- OpenAI GPT Embeddings & LLM Models
We collected the data for 2023 and 2024 from the arXiv bucket on Google Cloud and used text-embedding-3-large to generate the document embeddings. This amounts to about 10 GB.
Document text is extracted in two formats, together with metadata:
- Document Level
- Paragraph Level
- MetaData
Metadata example
{
"id": "2106.09680",
"title": "Accuracy, Interpretability, and Differential Privacy via Explainable Boosting",
"summary": "We show that adding differential privacy to Explainable Boosting Machines\n(EBMs), a recent method for training interpretable ML models, yields\nstate-of-the-art accuracy while protecting privacy. Our experiments on multiple\nclassification and regression datasets show that DP-EBM models suffer\nsurprisingly little accuracy loss even with strong differential privacy\nguarantees. In addition to high accuracy, two other benefits of applying DP to\nEBMs are: a) trained models provide exact global and local interpretability,\nwhich is often important in settings where differential privacy is needed; and\nb) the models can be edited after training without loss of privacy to correct\nerrors which DP noise may have introduced.",
"source": "http://arxiv.org/pdf/2106.09680",
"authors": "Harsha Nori Rich Caruana Zhiqi Bu Judy Hanwen Shen Janardhan Kulkarni",
"references": ""
}
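The document-level and paragraph-level extraction described above could be sketched as follows. This is an illustration only: it assumes the PDF text is already available as a plain string and that paragraphs are separated by blank lines, which may not match how the project splits text. The names `split_paragraphs` and `build_records` and the `-p<index>` id scheme are hypothetical.

```python
import re


def split_paragraphs(text: str, min_chars: int = 1) -> list[str]:
    """Split extracted text into paragraphs on blank lines (assumed convention)."""
    paragraphs = []
    for chunk in re.split(r"\n\s*\n", text):
        # Collapse line breaks and repeated spaces left over from PDF extraction.
        cleaned = " ".join(chunk.split())
        if len(cleaned) >= min_chars:
            paragraphs.append(cleaned)
    return paragraphs


def build_records(paper_id: str, text: str) -> dict:
    """Return one document-level record and one record per paragraph."""
    paragraphs = split_paragraphs(text)
    return {
        "document": {"id": paper_id, "text": " ".join(paragraphs)},
        "paragraphs": [
            # Hypothetical id scheme: <paper id>-p<paragraph index>
            {"id": f"{paper_id}-p{index}", "text": paragraph}
            for index, paragraph in enumerate(paragraphs)
        ],
    }
```

Keeping both levels lets whole-document matches and passage-level matches be searched separately, which is the split the pipeline below relies on.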
Embeddings are generated for both the documents and the paragraphs using OpenAI models.
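A minimal sketch of the embedding step, assuming the official `openai` Python package (v1 or later) and an `OPENAI_API_KEY` environment variable. The helper names `batched` and `embed_texts` and the batch size are assumptions, not taken from this project.

```python
from typing import Iterator

EMBEDDING_MODEL = "text-embedding-3-large"  # model named in this README


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_texts(texts: list[str], batch_size: int = 64) -> list[list[float]]:
    """Embed texts in batches (hypothetical helper; makes network calls)."""
    # Imported lazily so `batched` can be used without the SDK installed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        response = client.embeddings.create(model=EMBEDDING_MODEL, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return vectors
```

The same function can be called once with document texts and once with paragraph texts, so each level gets its own set of vectors.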
The author is then searched through the Google SERP API, and the top 10 documents returned are compared individually with the stored document embeddings.
For the retrieved documents and the top 3 similar papers on the topic from the Google SERP API:
- Metadata and text are extracted
Once the text is extracted, unique lines and paragraphs are identified and compared using an LLM (GPT-4 Preview model, 128K context).
Unique lines are then compared with the document embeddings, and paragraphs are compared with the paragraph embeddings.
The top 3 similar texts and their respective documents are then returned to the user as plagiarised content.
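The final ranking step can be illustrated with a plain-Python stand-in for the vector store. This sketch assumes cosine similarity as the metric, which this README does not state, and uses an in-memory dictionary instead of ChromaDB. The names `cosine_similarity` and `top_k_similar` are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: list[float], corpus: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Return the k corpus entries most similar to the query, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

In the real app a vector database query would replace the loop over `corpus`; the default `k=3` mirrors the top 3 results returned to the user.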
Research Points
Top Plagiarism Checkers API
- ProWritingAid API V2 - Free Plan
- Unicheck - Request Demo
- Copyleaks - Request Demo
- EDEN AI - Free Plan
Requirements
- Python 3.9+
- Gradio
- OpenAI API key (for the GPT and embedding models)
Installation
pip install -r requirements.txt
Usage
The plagiarism checker is implemented as a Gradio app. Start it with either of these commands:
python app.py
gradio app.py
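For orientation, a Gradio entry point for this kind of app might look like the sketch below. It is not the project's actual `app.py`: the handler `check_author` is a hypothetical placeholder that only echoes the input, and the port is taken from the `app_port` value in the Space configuration.

```python
def check_author(author_name: str) -> str:
    """Hypothetical handler: the real search and comparison pipeline would run here."""
    name = author_name.strip()
    if not name:
        return "Please enter an author name."
    # Placeholder result; no plagiarism check is performed in this sketch.
    return f"Plagiarism check requested for: {name}"


def build_demo():
    """Build the Gradio interface (gradio is imported lazily)."""
    import gradio as gr

    return gr.Interface(
        fn=check_author,
        inputs=gr.Textbox(label="Author name"),
        outputs=gr.Textbox(label="Plagiarism check results"),
        title="Arxiv Plagiarism Checker LLM",
    )


def main() -> None:
    # 7860 matches app_port in the Space configuration.
    build_demo().launch(server_name="0.0.0.0", server_port=7860)
```

Calling `main()` at the bottom of the file would start the server when the script is run with `python app.py`.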