English Text Quality Classifier

The deberta-v3-xsmall-quality model evaluates text quality using a composite score that combines the results from multiple classifiers. This approach aims to give a broader assessment than metrics that focus only on educational value, which makes it applicable to a variety of NLP and AI applications.

Intended Uses & Limitations

Intended Uses:

Quality assessment of text across various domains.
Enhancing NLP applications by providing a robust measure of text quality.
Supporting research and development in AI by offering insights into text quality metrics.

Limitations:

The model's performance may vary depending on the specific characteristics of the input text.
The model is a black box, so it is hard to explain why one text is rated as higher quality than another.
It is essential to consider the context in which the model is applied, as different domains may have unique quality requirements.
The model may still be biased towards non-fiction and educational genres.

Training and Evaluation Data

The model was trained on the agentlans/text-quality dataset comprising 100,000 sentences sourced from five distinct datasets, with 20,000 sentences drawn from each of the following:

allenai/c4
HuggingFaceFW/fineweb-edu
monology/pile-uncopyrighted
agentlans/common-crawl-sample
agentlans/wikipedia-paragraphs

Drawing on these varied sources is intended to help the model generalize across different text types and domains.

90% of the rows were used for training and the remaining 10% for evaluation.
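As a rough illustration of the composition and split described above, the sketch below builds a stand-in table of 100,000 rows (20,000 per source) and holds out 10% for evaluation. The model card does not say how the split was made, so the random shuffle, the seed of 42 (borrowed from the training hyperparameters), and the `split_rows` helper are assumptions for illustration only.

```python
import random

SOURCES = [
    "allenai/c4",
    "HuggingFaceFW/fineweb-edu",
    "monology/pile-uncopyrighted",
    "agentlans/common-crawl-sample",
    "agentlans/wikipedia-paragraphs",
]
ROWS_PER_SOURCE = 20_000


def split_rows(rows, eval_fraction=0.1, seed=42):
    """Shuffle rows and return (train, eval).

    Hypothetical helper: the actual split method is not documented.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Stand-in rows: (source name, row index) pairs instead of real sentences.
rows = [(source, i) for source in SOURCES for i in range(ROWS_PER_SOURCE)]
train_rows, eval_rows = split_rows(rows)

print(len(rows), len(train_rows), len(eval_rows))  # 100000 90000 10000
```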

How to use

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "agentlans/deberta-v3-xsmall-quality"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Put the model on the GPU if one is available, otherwise on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

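def quality(text):
    """Scores one string or a list of strings with the model.

    Returns a list of logits, one per input text. Each logit is interpreted
    as the combined quality score for that text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        # Logits have shape (batch, 1); drop only the last dimension so that a
        # single input still yields a one-element list rather than a bare float.
        logits = model(**inputs).logits.squeeze(-1).cpu()
    return logits.tolist()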

# Example usage
text = [
    "Congratulations! You've won a $1,000 gift card! Click here to claim your prize now!!!",
    "Page 1 2 3 4 5 Next Last>>",
    "Urgent: Your account has been compromised! Click this link to verify your identity and secure your account immediately!!!",
    "Today marks a significant milestone in our journey towards sustainability! 🌍✨ We’re excited to announce our partnership with local organizations to plant 10,000 trees in our community this fall. Join us in making a positive impact on our environment!",
    "In recent years, the impact of climate change has become increasingly evident, affecting ecosystems and human livelihoods across the globe."]

result = quality(text)
print([round(x, 2) for x in result])  # Estimated quality for each text: [-0.89, -0.76, -0.7, 0.3, 1.64]
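One common way to use such scores is to rank or filter texts. The sketch below works on plain Python lists, so it runs without loading the model, and it reuses the example scores reported above. The helper name `filter_by_quality` and the threshold of 0.0 are assumptions for illustration; the model card does not recommend a cutoff.

```python
def filter_by_quality(texts, scores, threshold=0.0):
    """Return (text, score) pairs with score >= threshold, highest score first.

    Hypothetical helper: the threshold is an arbitrary choice, not a recommended cutoff.
    """
    if len(texts) != len(scores):
        raise ValueError("texts and scores must have the same length")
    kept = [(t, s) for t, s in zip(texts, scores) if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Shortened stand-ins for the five example texts, paired with the reported scores.
example_texts = [
    "gift card spam",
    "pagination links",
    "phishing alert",
    "tree-planting announcement",
    "climate change sentence",
]
example_scores = [-0.89, -0.76, -0.7, 0.3, 1.64]

for text, score in filter_by_quality(example_texts, example_scores):
    print(f"{score:5.2f}  {text}")
```

With the real model, `scores` would come from calling `quality(texts)` on the same list of texts.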

Training Procedure

Training Hyperparameters

The following hyperparameters were utilized during training:

Learning Rate: 5e-05
Training Batch Size: 8
Evaluation Batch Size: 8
Seed: 42
Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
Learning Rate Scheduler Type: Linear
Number of Epochs: 3.0
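
To make the linear schedule concrete, the sketch below computes the learning rate at a given optimizer step. It assumes 90,000 training rows (90% of 100,000), no warmup, no gradient accumulation, a single device, and decay to zero at the final step. None of these are stated in the hyperparameter list, so the step counts should be read as estimates.

```python
import math

LEARNING_RATE = 5e-05
BATCH_SIZE = 8
NUM_EPOCHS = 3
TRAIN_ROWS = 90_000  # assumption: 90% of the 100,000-row dataset

steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)
total_steps = steps_per_epoch * NUM_EPOCHS


def linear_lr(step, base_lr=LEARNING_RATE, total=total_steps):
    """Learning rate after `step` optimizer updates, decaying linearly to zero (no warmup assumed)."""
    return base_lr * max(0.0, (total - step) / total)


print(total_steps)                  # 33750
print(linear_lr(0))                 # 5e-05
print(linear_lr(total_steps // 2))  # 2.5e-05
print(linear_lr(total_steps))       # 0.0
```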

Training Results

Loss: 0.0924
MSE: 0.0924
Number of Input Tokens Seen: 34,560,000
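
Two quick sanity checks can help when reading these numbers. Because the loss and the MSE are identical, the training objective appears to be mean squared error, and its square root gives an error on the same scale as the quality scores. The token count also divides evenly by the number of training rows and epochs. The figure of 90,000 training rows and the reading of the quotient as a fixed per-example token length are inferences, not statements from the model card.

```python
import math

mse = 0.0924
tokens_seen = 34_560_000
train_rows = 90_000  # assumption: 90% of the 100,000-row dataset
num_epochs = 3

# Root mean squared error, on the same scale as the predicted scores.
rmse = math.sqrt(mse)

# Average number of tokens processed per training row per epoch.
tokens_per_row = tokens_seen / (train_rows * num_epochs)

print(round(rmse, 3))  # 0.304
print(tokens_per_row)  # 128.0
```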

Framework Versions

The model was developed using the following frameworks and libraries:

Transformers 4.45.1
PyTorch 2.4.1+cu121
Datasets 3.0.1
Tokenizers 0.20.0
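
To compare a local environment against the versions listed above, a small check such as the following may help. It uses only the standard library. The distribution names (`transformers`, `torch`, `datasets`, `tokenizers`) are the usual PyPI names and are assumed here, and a mismatch does not necessarily mean the model will fail to load.

```python
from importlib.metadata import PackageNotFoundError, version

LISTED_VERSIONS = {
    "transformers": "4.45.1",
    "torch": "2.4.1+cu121",
    "datasets": "3.0.1",
    "tokenizers": "0.20.0",
}


def installed_versions(packages=LISTED_VERSIONS):
    """Map each package name to its installed version, or None if it is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, installed in installed_versions().items():
    print(f"{name}: listed {LISTED_VERSIONS[name]}, installed {installed}")
```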
