---
library_name: transformers
base_model: deberta-v3-xsmall-quality-pretrain
tags:
- generated_from_trainer
model-index:
- name: deberta-v3-xsmall-quality
  results: []
license: mit
datasets:
- agentlans/text-quality
- allenai/c4
- HuggingFaceFW/fineweb-edu
- monology/pile-uncopyrighted
- agentlans/common-crawl-sample
- agentlans/wikipedia-paragraphs
language:
- en
pipeline_tag: text-classification
---

# English Text Quality Classifier

The **deberta-v3-xsmall-quality** model evaluates English text quality using a composite score that combines the outputs of multiple classifiers. This gives a broader assessment than a single educational-quality metric, making the model useful for a variety of NLP and AI applications.
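
The exact scoring recipe is defined by the [agentlans/text-quality](https://huggingface.co/datasets/agentlans/text-quality) dataset. Purely as an illustration of the idea, a composite score of this kind can be formed by standardizing and averaging per-classifier scores; the classifier names and numbers below are hypothetical, not the actual recipe:

```python
import numpy as np

# Hypothetical outputs of two quality classifiers on three texts.
scores = {
    "classifier_a": np.array([0.2, 0.9, 0.4]),
    "classifier_b": np.array([0.1, 0.8, 0.5]),
}

# Standardize each classifier's scores, then average them into one composite score per text.
standardized = [(s - s.mean()) / s.std() for s in scores.values()]
composite = np.mean(standardized, axis=0)
print(composite)
```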

## Intended Uses & Limitations

**Intended Uses**:
- Quality assessment of text across various domains.
- Enhancing NLP applications by providing a robust measure of text quality.
- Supporting research and development in AI by offering insights into text quality metrics.

**Limitations**:
- The model's performance may vary depending on the specific characteristics of the input text.
- The model is a black box: it is difficult to explain why one text is scored as higher quality than another.
- Consider the context in which the model is applied, as different domains may have different quality requirements.
- The model may still be biased towards non-fiction and educational genres.

## Training and Evaluation Data

The model was trained on the [agentlans/text-quality](https://huggingface.co/datasets/agentlans/text-quality) dataset comprising **100,000 sentences** sourced from five distinct datasets, with **20,000 sentences** drawn from each of the following:

1. **allenai/c4**
2. **HuggingFaceFW/fineweb-edu**
3. **monology/pile-uncopyrighted**
4. **agentlans/common-crawl-sample**
5. **agentlans/wikipedia-paragraphs**

This diverse dataset enables the model to generalize well across different text types and domains.

90% of the rows were used for training and the remaining 10% for evaluation.
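
As a rough sketch, the dataset can be loaded and split with the `datasets` library. The split name, seed, and column names below are assumptions rather than details taken from the original preprocessing:

```python
from datasets import load_dataset

# Load the composite quality dataset from the Hugging Face Hub.
# The "train" split name is an assumption about the dataset layout.
dataset = load_dataset("agentlans/text-quality", split="train")

# Reproduce a 90%/10% train/evaluation split (seed chosen here arbitrarily).
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]

print(len(train_ds), len(eval_ds))  # roughly 90,000 and 10,000 rows
```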

## How to use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "agentlans/deberta-v3-xsmall-quality"

# Load the tokenizer and model, moving the model to GPU if one is available.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def quality(text):
    """Runs the text through the model and returns its logits,
    interpreted here as the combined quality score for that text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze().cpu()
    return logits.tolist()

# Example usage
text = [
    "Congratulations! You've won a $1,000 gift card! Click here to claim your prize now!!!",
    "Page 1 2 3 4 5 Next Last>>",
    "Urgent: Your account has been compromised! Click this link to verify your identity and secure your account immediately!!!",
    "Today marks a significant milestone in our journey towards sustainability! 🌍✨ We’re excited to announce our partnership with local organizations to plant 10,000 trees in our community this fall. Join us in making a positive impact on our environment!",
    "In recent years, the impact of climate change has become increasingly evident, affecting ecosystems and human livelihoods across the globe.",
]

result = quality(text)
[round(x, 2) for x in result]  # Estimated quality for each text: [0.19, -3.06, 0.15, 1.77, 1.34]
```
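
For corpus filtering, the scores returned by the `quality()` helper above can be thresholded. The cutoff below is an arbitrary example and should be tuned on your own data:

```python
# Keep only texts whose estimated quality exceeds a chosen cutoff.
# A threshold of 1.0 is used here purely as an example, not a recommended value.
corpus = [
    "Buy now!!! Limited offer!!!",
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
]
scores = quality(corpus)
filtered = [t for t, s in zip(corpus, scores) if s > 1.0]
print(filtered)
```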

## Training Procedure

<details>
<summary>Training hyperparameters, results, framework</summary>

### Training Hyperparameters

The following hyperparameters were used during training (a minimal `Trainer` sketch follows the list):
- **Learning Rate**: 5e-05
- **Training Batch Size**: 8
- **Evaluation Batch Size**: 8
- **Seed**: 42
- **Optimizer**: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- **Learning Rate Scheduler Type**: linear
- **Number of Epochs**: 3.0
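
The original training script is not included here. The sketch below shows how these hyperparameters could map onto the Hugging Face `Trainer` API; the regression setup (`num_labels=1`), the dataset column names (`text`, `quality`), the split details, and the base checkpoint path are assumptions rather than confirmed details of the original run:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Base checkpoint id copied from the metadata above; it may need its full hub path.
base = "deberta-v3-xsmall-quality-pretrain"
tokenizer = AutoTokenizer.from_pretrained(base)

# Single-output regression head: the model predicts one quality score per text.
model = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=1, problem_type="regression"
)

# 90%/10% split (split name, seed, and column names are assumptions).
splits = load_dataset("agentlans/text-quality", split="train").train_test_split(
    test_size=0.1, seed=42
)

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True)
    enc["labels"] = [float(q) for q in batch["quality"]]
    return enc

args = TrainingArguments(
    output_dir="deberta-v3-xsmall-quality",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=3.0,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"].map(tokenize, batched=True),
    eval_dataset=splits["test"].map(tokenize, batched=True),
    tokenizer=tokenizer,
)
trainer.train()
```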

### Training Results

- **Loss**: 0.1280
- **Mean Squared Error (MSE)**: 0.1280

### Framework Versions

The model was developed using the following frameworks and libraries:
- **Transformers**: 4.44.2
- **PyTorch**: 2.2.2+cu121
- **Datasets**: 2.18.0
- **Tokenizers**: 0.19.1

</details>