---
license: apache-2.0
base_model: line-corporation/line-distilbert-base-japanese
tags:
- generated_from_trainer
model-index:
- name: fluency-score-classification-ja
  results: []
---

# fluency-score-classification-ja

This model is a fine-tuned version of [line-corporation/line-distilbert-base-japanese](https://huggingface.co/line-corporation/line-distilbert-base-japanese) on the ["日本語文法誤りデータセット"](https://github.com/liwii/ja_perturbed/tree/main) (a Japanese grammatical-error dataset).
It achieves the following results on the evaluation set:
- Loss: 0.1912
- ROC AUC: 0.9811

## Model description

This model wraps [line-corporation/line-distilbert-base-japanese](https://huggingface.co/line-corporation/line-distilbert-base-japanese) with [DistilBertForSequenceClassification](https://huggingface.co/docs/transformers/v4.34.0/en/model_doc/distilbert#transformers.DistilBertForSequenceClassification) to make a binary classifier.

## Intended uses & limitations

This model can be used to classify whether a given Japanese text is fluent (i.e., free of grammatical errors).

Example usage:

```python
# Load the tokenizer & the model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "line-corporation/line-distilbert-base-japanese",
    trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "liwii/fluency-score-classification-ja")

# Make predictions
input_tokens = tokenizer([
    '黒い猫が',
    '黒い猫がいます',
    'あっちの方で黒い猫があくびをしています',
    'あっちの方でで黒い猫ががあくびをしています',
    'ある日の暮方の事である。一人の下人が、羅生門の下で雨やみを待っていた。'
], return_tensors='pt', padding=True)

with torch.no_grad():
    output = model(**input_tokens)
    # Probabilities of [not_fluent, fluent]
    probs = torch.nn.functional.softmax(output.logits, dim=1)

probs[:, 1]
# => tensor([0.1007, 0.2416, 0.5635, 0.0453, 0.7701])
```

The scores can be low for short sentences even when they contain no grammatical errors, because the training dataset consists of long sentences.

## Training and evaluation data

From the ["日本語文法誤りデータセット"](https://github.com/liwii/ja_perturbed/tree/main), 512 rows were held out as the evaluation dataset and the rest were used as the training dataset. In each split, the "original" sentences were used as the "fluent" examples and the "perturbed" sentences as the "not fluent" examples (an illustrative data-preparation sketch is included in the appendix at the end of this card).

## Training procedure

The model was fine-tuned for 5 epochs with the parameters of the original DistilBERT encoder frozen, so that only the classification head was updated (see the fine-tuning sketch in the appendix at the end of this card).

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 64
- eval_batch_size: 8
- seed: 42
- distributed_type: tpu
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5

### Training results

| Training Loss | Epoch | Step | Validation Loss | ROC AUC |
|:-------------:|:-----:|:----:|:---------------:|:-------:|
| 0.4582        | 1.0   | 647  | 0.2887          | 0.9679  |
| 0.2664        | 2.0   | 1294 | 0.2224          | 0.9761  |
| 0.2177        | 3.0   | 1941 | 0.2047          | 0.9793  |
| 0.1899        | 4.0   | 2588 | 0.1944          | 0.9807  |
| 0.1865        | 5.0   | 3235 | 0.1912          | 0.9811  |

### Framework versions

- Transformers 4.34.0
- Pytorch 2.0.0+cu118
- Datasets 2.14.5
- Tokenizers 0.14.0
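
## Appendix: illustrative sketches

The exact preprocessing script is not part of this card. The following is a minimal sketch of how the labeled data described under "Training and evaluation data" could be built: only the "original" / "perturbed" columns and the 512-row evaluation split come from the description above; the file name, file format, and helper names are assumptions for illustration.

```python
import pandas as pd
from datasets import Dataset

# Hypothetical file name and format for the ja_perturbed data.
df = pd.read_csv("ja_perturbed.tsv", sep="\t")

# Hold out 512 source rows for evaluation, keep the rest for training.
eval_rows = df.sample(n=512, random_state=42)
train_rows = df.drop(eval_rows.index)

def to_examples(rows: pd.DataFrame) -> Dataset:
    # Each source row yields one fluent and one not-fluent example.
    examples = []
    for _, row in rows.iterrows():
        examples.append({"text": row["original"], "label": 1})   # fluent
        examples.append({"text": row["perturbed"], "label": 0})  # not fluent
    return Dataset.from_list(examples)

train_ds = to_examples(train_rows)
eval_ds = to_examples(eval_rows)
```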
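
Likewise, this is a sketch of the frozen-backbone fine-tuning described under "Training procedure", reusing the hyperparameters listed above. The `Trainer` setup and the tokenization helper are assumptions, not the original training script; `train_ds` / `eval_ds` refer to the datasets from the sketch above.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

base = "line-corporation/line-distilbert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

# Freeze the pretrained DistilBERT encoder so that only the
# classification head receives gradient updates.
for param in model.distilbert.parameters():
    param.requires_grad = False

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

train_ds = train_ds.map(tokenize, batched=True)
eval_ds = eval_ds.map(tokenize, batched=True)

# Hyperparameters as listed under "Training hyperparameters".
args = TrainingArguments(
    output_dir="fluency-score-classification-ja",
    learning_rate=1e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    lr_scheduler_type="linear",
    seed=42,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
```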