fluency-score-classification-ja
This model is a fine-tuned version of line-corporation/line-distilbert-base-japanese on the "日本語文法誤りデータセット" (a Japanese grammatical error dataset). It achieves the following results on the evaluation set:
- Loss: 0.1912
- ROC AUC: 0.9811
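The ROC AUC above treats the model's predicted probability of the "fluent" class as a ranking score. A minimal sketch of how such a score can be computed with scikit-learn; the label and probability arrays below are hypothetical placeholders, not the actual evaluation data:

from sklearn.metrics import roc_auc_score

# Hypothetical gold labels (1 = fluent, 0 = not fluent) and predicted P(fluent)
eval_labels = [1, 0, 1, 1, 0]
fluent_probs = [0.77, 0.05, 0.56, 0.24, 0.10]

print(roc_auc_score(eval_labels, fluent_probs))  # area under the ROC curve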
Model description
This model wraps line-corporation/line-distilbert-base-japanese with DistilBertForSequenceClassification to make a binary classifier.
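Concretely, the wrapping amounts to loading the base checkpoint into a sequence-classification model with two labels. This is a minimal sketch under that assumption, not the exact code used to build this checkpoint:

from transformers import DistilBertForSequenceClassification

# Load the base DistilBERT into a 2-label classifier; the classification
# head is newly initialized and is what fine-tuning trains.
model = DistilBertForSequenceClassification.from_pretrained(
    "line-corporation/line-distilbert-base-japanese",
    num_labels=2)  # index 0: not_fluent, index 1: fluent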
Intended uses & limitations
This model can be used to classify whether given Japanese text is fluent (i.e., free of grammatical errors). Example usage:
# Load the tokenizer & the model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "line-corporation/line-distilbert-base-japanese", trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "liwii/fluency-score-classification-ja")

# Make predictions
input_tokens = tokenizer([
    '黒い猫が',
    '黒い猫がいます',
    'あっちの方で黒い猫があくびをしています',
    'あっちの方でで黒い猫ががあくびをしています',
    'ある日の暮方の事である。一人の下人が、羅生門の下で雨やみを待っていた。'
], return_tensors='pt', padding=True)

# Run the model without tracking gradients (inference only)
with torch.no_grad():
    output = model(**input_tokens)
    # Probabilities of [not_fluent, fluent]
    probs = torch.nn.functional.softmax(output.logits, dim=1)

probs[:, 1]  # => tensor([0.1007, 0.2416, 0.5635, 0.0453, 0.7701])
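The outputs are probabilities, not labels. Continuing from the example above, one way to binarize them is to apply a threshold to probs[:, 1]; the 0.5 cut-off below is an assumption, not part of the model, and should be tuned for the target application:

# Hypothetical 0.5 threshold on P(fluent)
is_fluent = probs[:, 1] >= 0.5  # => tensor([False, False,  True, False,  True])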
The scores can be low for short sentences even if they do not contain any grammatical errors, because the training dataset consists of long sentences.
Training and evaluation data
From "日本語文法誤りデータセット", used 512 rows as the evaluation dataset and the rest of the dataset as the training dataset. For each dataset split, Used the "original" rows as the data with "fluent" label, and "perturbed" as the data with "not fluent" data.
Training procedure
The model was fine-tuned for 5 epochs. The parameters of the original DistilBERT were frozen during fine-tuning.
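With the encoder frozen, only the newly added classification head receives gradient updates. A minimal sketch of the freezing step, assuming the standard DistilBertForSequenceClassification layout:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "line-corporation/line-distilbert-base-japanese", num_labels=2)

# Freeze the pre-trained DistilBERT encoder; the pre_classifier and classifier
# layers (the sequence-classification head) remain trainable.
for param in model.distilbert.parameters():
    param.requires_grad = False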
Training hyperparameters
The following hyperparameters were used during training (a configuration sketch follows the list):
- learning_rate: 1e-05
- train_batch_size: 64
- eval_batch_size: 8
- seed: 42
- distributed_type: tpu
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
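A hedged sketch of how these values map onto transformers.TrainingArguments; the actual training script is not published with this card, so anything beyond the listed values (e.g. the output directory) is an assumption. The Adam betas and epsilon above match the Trainer defaults:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="fluency-score-classification-ja",  # hypothetical path
    learning_rate=1e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=5)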
Training results
| Training Loss | Epoch | Step | Validation Loss | ROC AUC |
|---|---|---|---|---|
| 0.4582 | 1.0 | 647 | 0.2887 | 0.9679 |
| 0.2664 | 2.0 | 1294 | 0.2224 | 0.9761 |
| 0.2177 | 3.0 | 1941 | 0.2047 | 0.9793 |
| 0.1899 | 4.0 | 2588 | 0.1944 | 0.9807 |
| 0.1865 | 5.0 | 3235 | 0.1912 | 0.9811 |
Framework versions
- Transformers 4.34.0
- Pytorch 2.0.0+cu118
- Datasets 2.14.5
- Tokenizers 0.14.0