snowflake-arctic-xs-grammar-classifier

This model is a fine-tuned version of agentlans/snowflake-arctic-embed-xs-zyda-2 for grammar classification. It achieves an accuracy of 0.8724 on the evaluation set.

Model description

The snowflake-arctic-xs-grammar-classifier is designed to classify the grammatical correctness of English sentences. It is based on the snowflake-arctic-embed-xs-zyda-2 model and has been fine-tuned on a grammar classification dataset derived from the C4 (Colossal Clean Crawled Corpus).

Intended uses & limitations

This model is intended for classifying the grammatical correctness of English sentences. It can be used in various applications such as writing assistance tools, educational software, or content moderation systems.
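
For example, a writing-assistance or content-moderation workflow can combine the predicted label with a confidence threshold so that only sentences the model is confident are ungrammatical get flagged. A minimal sketch building on the pipeline usage shown in the next section; the check relies only on the positive label 'grammatical' seen in the example output, and the 0.7 threshold is purely illustrative:

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="agentlans/snowflake-arctic-xs-grammar-classifier",
)

def needs_review(sentence: str, threshold: float = 0.7) -> bool:
    """Flag a sentence when the model is confident it is not grammatical."""
    prediction = classifier(sentence)[0]
    # Illustrative rule: anything not labelled 'grammatical' with high confidence gets flagged.
    return prediction["label"] != "grammatical" and prediction["score"] >= threshold

print(needs_review("Him go store yesterday."))         # expected: True
print(needs_review("I absolutely loved this movie!"))  # expected: False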

Usage example

from transformers import pipeline
import torch

# Use the first GPU if one is available, otherwise fall back to CPU
device = 0 if torch.cuda.is_available() else -1
classifier = pipeline(
    "text-classification",
    model="agentlans/snowflake-arctic-xs-grammar-classifier",
    device=device,
)

text = "I absolutely loved this movie!"
result = classifier(text)
print(result)  # [{'label': 'grammatical', 'score': 0.8963921666145325}]
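
If you prefer to call the model directly rather than through the pipeline, the sketch below uses the standard AutoTokenizer / AutoModelForSequenceClassification interface; the label names are read from the model's own id2label mapping, so nothing about them is assumed here:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "agentlans/snowflake-arctic-xs-grammar-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

inputs = tokenizer("I absolutely loved this movie!", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Convert logits to probabilities and look up the predicted label name.
probs = torch.softmax(logits, dim=-1)[0]
predicted_id = int(probs.argmax())
print(model.config.id2label[predicted_id], round(float(probs[predicted_id]), 4))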

Example Classifications

| Status | Text | Explanation |
|--------|------|-------------|
| ✔️ | I absolutely loved this movie! | Grammatically correct, clear sentence structure |
| ❌ | How do I shot web? | Grammatically incorrect, improper verb usage |
| ✔️ | Beware the Jabberwock, my son! | Poetic language, grammatically sound |
| ✔️ | Colourless green ideas sleep furiously. | Grammatically correct, though semantically nonsensical |
| ❌ | Has anyone really been far even as decided to use even go want to do look more like? | Completely incoherent and grammatically incorrect |

Limitations

The model's performance is limited by the quality and diversity of its training data. It may not perform well on specialized or domain-specific text, or on languages other than English. Additionally, it may struggle with complex grammatical structures or nuanced language use.

Training and evaluation data

The model was trained on the agentlans/grammar-classification dataset, which contains 600,000 examples for binary classification of grammatical correctness in English. This dataset is derived from a subset of the C4_200M Synthetic Dataset for Grammatical Error Correction.
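
To inspect the data, the dataset can be loaded with the datasets library. A small sketch; the split and column names are not documented here, so check them against the dataset card before relying on them:

from datasets import load_dataset

dataset = load_dataset("agentlans/grammar-classification")
print(dataset)  # shows the available splits, column names, and sizes
# print(dataset["train"][0])  # inspect one example once the split name is confirmed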

Training procedure

Training hyperparameters

  • Learning rate: 5e-05
  • Batch size: 128
  • Number of epochs: 10
  • Optimizer: AdamW with betas=(0.9,0.999) and epsilon=1e-08
  • Learning rate scheduler: Linear
📊 Detailed Training Results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | Input Tokens Seen |
|---------------|-------|------|-----------------|----------|-------------------|
| 0.5192 | 1.0 | 3750 | 0.4722 | 0.7738 | 61,440,000 |
| 0.4875 | 2.0 | 7500 | 0.4521 | 0.7881 | 122,880,000 |
| 0.4590 | 3.0 | 11250 | 0.3895 | 0.8227 | 184,320,000 |
| 0.4351 | 4.0 | 15000 | 0.3981 | 0.8197 | 245,760,000 |
| 0.4157 | 5.0 | 18750 | 0.3690 | 0.8337 | 307,200,000 |
| 0.3955 | 6.0 | 22500 | 0.3260 | 0.8585 | 368,640,000 |
| 0.3788 | 7.0 | 26250 | 0.3267 | 0.8566 | 430,080,000 |
| 0.3616 | 8.0 | 30000 | 0.3192 | 0.8621 | 491,520,000 |
| 0.3459 | 9.0 | 33750 | 0.3017 | 0.8707 | 552,960,000 |
| 0.3382 | 10.0 | 37500 | 0.2971 | 0.8724 | 614,400,000 |

Framework versions

  • Transformers: 4.46.3
  • PyTorch: 2.5.1+cu124
  • Datasets: 3.2.0
  • Tokenizers: 20.3