We trained DistilBERT on this dataset: https://huggingface.co/datasets/nothingiisreal/Human_Stories
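For reference, here is a minimal fine-tuning sketch. The column names (`text`, `label`), the `train` split, and the hyperparameters are assumptions for illustration, not the exact settings of our run:

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Assumed: the dataset has a "train" split with "text" and "label" columns.
dataset = load_dataset("nothingiisreal/Human_Stories")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate to DistilBERT's 512-token context window.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset["train"].map(tokenize, batched=True)

# Binary classifier: label 0 = human, label 1 = AI.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Illustrative hyperparameters, not the ones actually used.
args = TrainingArguments(
    output_dir="distilbert-ai-detector",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized)
trainer.train()
```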

It's decent for sampling, but it needs improvement and exposure to more synthetic data and to more of the kinds of mistakes LLMs make.

Overall I'm extremely impressed with how well this 68-million-parameter model works, and extremely disappointed that every single AI gets picked up even though BERT was only trained on the GPT-3.5 rows of the data.

Class label 0 means human, 1 means AI.
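A quick inference sketch using the labels above; the repo id is a placeholder, so substitute this model's actual id:

```python
from transformers import pipeline

# "your-username/your-detector" is a placeholder repo id.
detector = pipeline("text-classification", model="your-username/your-detector")

result = detector("Once upon a time, ...")[0]
# LABEL_0 -> human, LABEL_1 -> AI (unless id2label was customized at training time).
print(result)  # e.g. {'label': 'LABEL_1', 'score': 0.97}
```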

We tested these models, all of which were detected:

- GPT-3.5, GPT-4, GPT-4o
- Claude Sonnet, Claude Opus
- WizardLM-2
- Gemini 1.5 Pro

It's really blatant how every single AI company is using the same watermark, whether knowingly or unknowingly (through LLM "incest", i.e. models training on each other's outputs).
