fluency-score-classification-ja
This model is a fine-tuned version of line-corporation/line-distilbert-base-japanese on the "日本語文法誤りデータセット" (a Japanese grammatical error dataset). It achieves the following results on the evaluation set:
- Loss: 0.1912
- ROC AUC: 0.9811
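The ROC AUC above treats the model's predicted probability of the "fluent" class as a ranking score. A minimal sketch of how such a score can be computed with scikit-learn; the label and probability arrays below are hypothetical placeholders, not the actual evaluation data:

from sklearn.metrics import roc_auc_score

# Hypothetical gold labels (1 = fluent, 0 = not fluent) and predicted P(fluent)
eval_labels = [1, 0, 1, 1, 0]
fluent_probs = [0.77, 0.05, 0.56, 0.24, 0.10]

print(roc_auc_score(eval_labels, fluent_probs))  # area under the ROC curve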
Model description
This model wraps line-corporation/line-distilbert-base-japanese with DistilBertForSequenceClassification to make a binary classifier.
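Concretely, the wrapping amounts to loading the base checkpoint into a sequence-classification model with two labels. This is a minimal sketch under that assumption, not the exact code used to build this checkpoint:

from transformers import DistilBertForSequenceClassification

# Load the base DistilBERT into a 2-label classifier; the classification
# head is newly initialized and is what fine-tuning trains.
model = DistilBertForSequenceClassification.from_pretrained(
    "line-corporation/line-distilbert-base-japanese",
    num_labels=2)  # index 0: not_fluent, index 1: fluent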
Intended uses & limitations
This model can be used to classify whether given Japanese text is fluent (i.e., free of grammatical errors). Example usage:
# Load the tokenizer & the model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "line-corporation/line-distilbert-base-japanese", trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "liwii/fluency-score-classification-ja")

# Make predictions
input_tokens = tokenizer([
    '黒い猫が',
    '黒い猫がいます',
    'あっちの方で黒い猫があくびをしています',
    'あっちの方でで黒い猫ががあくびをしています',
    'ある日の暮方の事である。一人の下人が、羅生門の下で雨やみを待っていた。'
], return_tensors='pt', padding=True)

# Run the model without tracking gradients (inference only)
with torch.no_grad():
    output = model(**input_tokens)
    # Probabilities of [not_fluent, fluent]
    probs = torch.nn.functional.softmax(output.logits, dim=1)

probs[:, 1]  # => tensor([0.1007, 0.2416, 0.5635, 0.0453, 0.7701])
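The outputs are probabilities, not labels. Continuing from the example above, one way to binarize them is to apply a threshold to probs[:, 1]; the 0.5 cut-off below is an assumption, not part of the model, and should be tuned for the target application:

# Hypothetical 0.5 threshold on P(fluent)
is_fluent = probs[:, 1] >= 0.5  # => tensor([False, False,  True, False,  True])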
The scores can be low for short sentences even if they do not contain any grammatical errors, because the training dataset consists of long sentences.
Training and evaluation data
From "日本語文法誤りデータセット", used 512 rows as the evaluation dataset and the rest of the dataset as the training dataset. For each dataset split, Used the "original" rows as the data with "fluent" label, and "perturbed" as the data with "not fluent" data.
Training procedure
The model was fine-tuned for 5 epochs. The parameters of the original DistilBERT were frozen during fine-tuning.
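With the encoder frozen, only the newly added classification head receives gradient updates. A minimal sketch of the freezing step, assuming the standard DistilBertForSequenceClassification layout:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "line-corporation/line-distilbert-base-japanese", num_labels=2)

# Freeze the pre-trained DistilBERT encoder; the pre_classifier and classifier
# layers (the sequence-classification head) remain trainable.
for param in model.distilbert.parameters():
    param.requires_grad = False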
Training hyperparameters
The following hyperparameters were used during training (a configuration sketch follows the list):
- learning_rate: 1e-05
- train_batch_size: 64
- eval_batch_size: 8
- seed: 42
- distributed_type: tpu
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
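A hedged sketch of how these values map onto transformers.TrainingArguments; the actual training script is not published with this card, so anything beyond the listed values (e.g. the output directory) is an assumption. The Adam betas and epsilon above match the Trainer defaults:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="fluency-score-classification-ja",  # hypothetical path
    learning_rate=1e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=5)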
Training results
| Training Loss | Epoch | Step | Validation Loss | ROC AUC |
|---|---|---|---|---|
| 0.4582 | 1.0 | 647 | 0.2887 | 0.9679 |
| 0.2664 | 2.0 | 1294 | 0.2224 | 0.9761 |
| 0.2177 | 3.0 | 1941 | 0.2047 | 0.9793 |
| 0.1899 | 4.0 | 2588 | 0.1944 | 0.9807 |
| 0.1865 | 5.0 | 3235 | 0.1912 | 0.9811 |
Framework versions
- Transformers 4.34.0
- Pytorch 2.0.0+cu118
- Datasets 2.14.5
- Tokenizers 0.14.0