---
library_name: transformers
base_model: deberta-v3-xsmall-quality-pretrain
tags:
- generated_from_trainer
model-index:
- name: deberta-v3-xsmall-quality
  results: []
license: mit
datasets:
- agentlans/text-quality
- allenai/c4
- HuggingFaceFW/fineweb-edu
- monology/pile-uncopyrighted
- agentlans/common-crawl-sample
- agentlans/wikipedia-paragraphs
language:
- en
pipeline_tag: text-classification
---

# English Text Quality Classifier

The **deberta-v3-xsmall-quality** model evaluates text quality using a composite score that combines the outputs of multiple classifiers. This provides a broader assessment than purely educational metrics, making it suitable for a variety of NLP and AI applications.

## Intended Uses & Limitations

**Intended Uses**:
- Quality assessment of text across various domains.
- Enhancing NLP applications by providing a robust measure of text quality.
- Supporting research and development in AI by offering insights into text quality metrics.

**Limitations**:
- The model's performance may vary depending on the specific characteristics of the input text.
- The model is a black box: it is difficult to explain why one text is scored as higher quality than another.
- It is essential to consider the context in which the model is applied, as different domains may have unique quality requirements.
- May still be biased towards non-fiction and educational genres.

## Training and Evaluation Data

The model was trained on the [agentlans/text-quality](https://huggingface.co/datasets/agentlans/text-quality) dataset comprising **100,000 sentences** sourced from five distinct datasets, with **20,000 sentences** drawn from each of the following:

1. **allenai/c4**
2. **HuggingFaceFW/fineweb-edu**
3. **monology/pile-uncopyrighted**
4. **agentlans/common-crawl-sample**
5. **agentlans/wikipedia-paragraphs**

This diverse dataset enables the model to generalize well across different text types and domains.

90% of the rows were used for training and the remaining 10% for evaluation.
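A minimal sketch of such a split using the `datasets` library is shown below; the 90/10 ratio comes from the description above, while the use of `train_test_split` and the seed value are assumptions for illustration, not the published training script:

```python
from datasets import load_dataset

# Load the 100,000-row quality dataset from the Hugging Face Hub
dataset = load_dataset("agentlans/text-quality", split="train")

# Hold out 10% of the rows for evaluation; seed=42 mirrors the training
# seed reported below, but the exact split procedure is an assumption
split = dataset.train_test_split(test_size=0.1, seed=42)
train_data, eval_data = split["train"], split["test"]
```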

## How to use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name="agentlans/deberta-v3-xsmall-quality"

# Put model on GPU or else CPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def quality(text):
    """Processes the text using the model and returns its logits.
    In this case, it's interpreted as the the combined quality score for that text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze().cpu()
    return logits.tolist()

# Example usage
text = [
    "Congratulations! You've won a $1,000 gift card! Click here to claim your prize now!!!",
    "Page 1 2 3 4 5 Next Last>>",
    "Urgent: Your account has been compromised! Click this link to verify your identity and secure your account immediately!!!",
    "Today marks a significant milestone in our journey towards sustainability! 🌍✨ We’re excited to announce our partnership with local organizations to plant 10,000 trees in our community this fall. Join us in making a positive impact on our environment!",
    "In recent years, the impact of climate change has become increasingly evident, affecting ecosystems and human livelihoods across the globe."]

result = quality(text)
[round(x, 2) for x in result] # Estimated quality for each text [0.19, -3.06, 0.15, 1.77, 1.34]
```
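The returned scores are raw regression logits, so they are most useful for ranking or thresholding rather than as absolute values. As a hypothetical follow-on to the example above, a corpus could be filtered by keeping only texts above a chosen cutoff (the threshold of 1.0 here is arbitrary):

```python
# Hypothetical filtering step, continuing from the example above:
# keep only texts scoring above an arbitrary quality threshold
threshold = 1.0
high_quality = [t for t, score in zip(text, result) if score > threshold]
# With the scores above, this keeps the sustainability announcement
# and the climate-change sentence
```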

## Training Procedure

<details>
<summary>Training hyperparameters, results, framework</summary>

### Training Hyperparameters

The following hyperparameters were utilized during training:
- **Learning Rate**: 5e-05
- **Training Batch Size**: 8
- **Evaluation Batch Size**: 8
- **Seed**: 42
- **Optimizer**: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- **Learning Rate Scheduler Type**: Linear
- **Number of Epochs**: 3.0
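
These settings correspond to a standard `Trainer` configuration. A hypothetical reconstruction is sketched below; the `output_dir` name and the use of `TrainingArguments` itself are assumptions, since the exact training script is not published:

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the reported settings; Adam with
# betas=(0.9, 0.999) and epsilon=1e-08 is already the Trainer default
# optimizer, so it needs no explicit configuration here
training_args = TrainingArguments(
    output_dir="deberta-v3-xsmall-quality",  # assumed name
    learning_rate=5e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=3.0,
)
```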

### Training Results

- **Loss**: 0.1280
- **Mean Squared Error (MSE)**: 0.1280

(The loss and MSE coincide because the model is a single-output regressor trained with a mean-squared-error objective.)

### Framework Versions

The model was developed using the following frameworks and libraries:
- **Transformers**: 4.44.2
- **PyTorch**: 2.2.2+cu121
- **Datasets**: 2.18.0
- **Tokenizers**: 0.19.1
</details>