---
tags:
- text-classification
metrics:
- accuracy
- f1
- roc_auc
base_model:
- intfloat/e5-small
library_name: transformers
datasets:
- liamdugan/raid
model-index:
- name: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors
  results:
  - task:
      type: text-classification
    dataset:
      name: RAID-test
      type: RAID-test
    metrics:
    - name: accuracy
      type: accuracy
      value: 0.939
    source:
      name: RAID Benchmark Leaderboard
      url: https://raid-bench.xyz/leaderboard
pipeline_tag: text-classification
---

# My LoRA Fine-Tuned AI-Generated Text Detector

This is an `e5-small` model fine-tuned with LoRA for sequence classification. It is trained to classify text as AI-generated or human-written with high accuracy.

- **Label 0**: Represents **human-written** content.
- **Label 1**: Represents **AI-generated** content.
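
A minimal inference sketch, assuming the LoRA adapter has been merged into the base model and pushed as a standard `transformers` checkpoint; the model id below is a placeholder for this repository's path:

```python
from transformers import pipeline

# Placeholder model id; point this at the actual repository path.
detector = pipeline("text-classification", model="user/e5-small-lora-ai-detector")

texts = [
    "I wrote this post on my phone during the commute, typos and all.",
    "As an advanced language model, I can generate fluent, coherent prose.",
]
for out in detector(texts):
    # Default labels are LABEL_0 (human-written) and LABEL_1 (AI-generated).
    label = "AI-generated" if out["label"].endswith("1") else "human-written"
    print(f"{label} (score={out['score']:.3f})")
```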

## Model Details

- **Base Model**: `intfloat/e5-small`
- **Fine-Tuning Technique**: LoRA (Low-Rank Adaptation)
- **Task**: Sequence Classification
- **Use Cases**: AI-generated text detection.
- **Hyperparameters**: 
   - Learning rate: `5e-5`
   - Epochs: `3`
   - LoRA rank: `8`
   - LoRA alpha: `16`
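
A minimal sketch of this setup with the `peft` library, using the hyperparameters listed above; the target modules and output directory are assumptions, not values confirmed by the authors:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification, TrainingArguments

# Load the base encoder with a two-class classification head.
base = AutoModelForSequenceClassification.from_pretrained(
    "intfloat/e5-small", num_labels=2
)

lora_config = LoraConfig(
    r=8,                                # LoRA rank (see Hyperparameters)
    lora_alpha=16,                      # LoRA alpha
    target_modules=["query", "value"],  # assumption: attention projections
    task_type="SEQ_CLS",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trained

# Training arguments matching the listed learning rate and epoch count.
training_args = TrainingArguments(
    output_dir="e5-small-lora-detector",  # placeholder path
    learning_rate=5e-5,
    num_train_epochs=3,
)
```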


## Training Details

- **Dataset**:
    - 10,000 tweets plus 10,000 GPT-4o-mini rewrites of those tweets.
    - 80,000 human-written texts from [RAID](https://github.com/liamdugan/raid).
    - 128,000 AI-generated texts from [RAID](https://github.com/liamdugan/raid).
- **Hardware**: Fine-tuned on a single NVIDIA A100 GPU.
- **Training Time**: Approximately 2 hours.
- **Evaluation Metrics**:

| Metric   | E5-small (raw) | Fine-tuned |
|----------|---------------:|-----------:|
| Accuracy | 65.2%          | 89.0%      |
| F1 Score | 0.653          | 0.887      |
| AUC      | 0.697          | 0.976      |
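
These numbers can be computed with standard scikit-learn metrics along the following lines (a sketch with placeholder arrays; substitute the detector's labels and probabilities on the held-out set):

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Placeholder data; substitute the detector's outputs on the evaluation set.
y_true = [0, 1, 1, 0, 1]             # gold labels: 0 = human, 1 = AI
y_score = [0.1, 0.9, 0.7, 0.4, 0.8]  # model probability of label 1

y_pred = [int(p >= 0.5) for p in y_score]  # threshold at 0.5

print(f"Accuracy: {accuracy_score(y_true, y_pred):.3f}")
print(f"F1 Score: {f1_score(y_true, y_pred):.3f}")
print(f"AUC:      {roc_auc_score(y_true, y_score):.3f}")
```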

## Collaborators

- **Menglin Zhou**
- **Jiaping Liu**
- **Xiaotian Zhan**


## Citation
If you use this model, please cite the RAID dataset as follows:
```bibtex
@inproceedings{dugan-etal-2024-raid,
    title = "{RAID}: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors",
    author = "Dugan, Liam  and
      Hwang, Alyssa  and
      Trhl{\'\i}k, Filip  and
      Zhu, Andrew  and
      Ludan, Josh Magnus  and
      Xu, Hainiu  and
      Ippolito, Daphne  and
      Callison-Burch, Chris",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.674",
    pages = "12463--12492",
}
```