Update README.md
results: []
---
# fluency-score-classification-ja
This model is a fine-tuned version of [line-corporation/line-distilbert-base-japanese](https://huggingface.co/line-corporation/line-distilbert-base-japanese) on the ["日本語文法誤りデータセット"](https://github.com/liwii/ja_perturbed/tree/main) (a Japanese grammatical-error dataset).
It achieves the following results on the evaluation set:
- Loss: 0.1912
- ROC AUC: 0.9811
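The ROC AUC above is presumably computed from the predicted probability of the "fluent" class; a minimal sketch of how such a score can be reproduced with scikit-learn (the `eval_labels` / `eval_probs` names and values are illustrative assumptions, not part of this repository):

```python
# Illustrative only: compute ROC AUC from fluent-class probabilities.
from sklearn.metrics import roc_auc_score

eval_labels = [1, 0, 1, 0]              # 1 = fluent ("original"), 0 = not fluent ("perturbed")
eval_probs = [0.92, 0.11, 0.85, 0.03]   # P(fluent) predicted by the classifier
print(roc_auc_score(eval_labels, eval_probs))  # 1.0 for this toy example
```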
## Model description
This model wraps [line-corporation/line-distilbert-base-japanese](https://huggingface.co/line-corporation/line-distilbert-base-japanese) with [DistilBertForSequenceClassification](https://huggingface.co/docs/transformers/v4.34.0/en/model_doc/distilbert#transformers.DistilBertForSequenceClassification) to make a binary classifier.
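For reference, a minimal sketch of how such a wrapper can be constructed from the base checkpoint; the explicit `num_labels=2` is an assumption inferred from the binary-classifier description, not a quote from the training code:

```python
# Sketch only: attach a fresh 2-label classification head to the pretrained encoder.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "line-corporation/line-distilbert-base-japanese", num_labels=2)

# model.distilbert is the pretrained DistilBERT body;
# model.pre_classifier / model.classifier form the newly initialized head.
```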
## Intended uses & limitations
This model can be used to classify whether a given Japanese text is fluent (i.e., free of grammatical errors).
Example usage:
```python
# Load the tokenizer & the model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("line-corporation/line-distilbert-base-japanese", trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained("liwii/fluency-score-classification-ja")

# Make predictions
input_tokens = tokenizer([
    '黒い猫が',  # sentence fragment ("a black cat ...")
    '黒い猫がいます',  # short but fluent ("there is a black cat")
    'あっちの方で黒い猫があくびをしています',  # fluent ("a black cat is yawning over there")
    'あっちの方でで黒い猫ががあくびをしています',  # ungrammatical (duplicated particles)
    'ある日の暮方の事である。一人の下人が、羅生門の下で雨やみを待っていた。'  # the opening of Akutagawa's "Rashomon" (long, fluent)
], return_tensors='pt', padding=True)

with torch.no_grad():
    output = model(**input_tokens)
    # Probabilities of [not_fluent, fluent]
    probs = torch.nn.functional.softmax(output.logits, dim=1)

probs[:, 1]  # => tensor([0.1007, 0.2416, 0.5635, 0.0453, 0.7701])
```
The scores may be low for short sentences even when they contain no grammatical errors, because the training dataset consists of long sentences.
## Training and evaluation data
From the ["日本語文法誤りデータセット"](https://github.com/liwii/ja_perturbed/tree/main), 512 rows were used as the evaluation dataset and the rest as the training dataset.
For each split, the "original" rows were used as examples with the "fluent" label and the "perturbed" rows as examples with the "not fluent" label.
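A rough sketch of how these splits could be assembled; the file name and the "original" / "perturbed" column names are assumptions about the ja_perturbed layout, not verified details:

```python
# Sketch only: build labelled train/eval splits from paired original/perturbed rows.
import pandas as pd

pairs = pd.read_json("ja_perturbed.jsonl", lines=True)        # hypothetical path
eval_pairs, train_pairs = pairs.iloc[:512], pairs.iloc[512:]  # 512 rows held out for evaluation

def to_examples(split: pd.DataFrame) -> pd.DataFrame:
    fluent = pd.DataFrame({"text": split["original"], "label": 1})       # fluent
    not_fluent = pd.DataFrame({"text": split["perturbed"], "label": 0})  # not fluent
    return pd.concat([fluent, not_fluent], ignore_index=True)

train_df, eval_df = to_examples(train_pairs), to_examples(eval_pairs)
```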
## Training procedure
The model was fine-tuned for 5 epochs, with the parameters of the original DistilBERT frozen during fine-tuning.
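A minimal sketch of the freezing step, assuming the standard `DistilBertForSequenceClassification` attribute names; the actual training script may differ:

```python
# Sketch only: freeze the pretrained encoder so only the classification head is trained.
from transformers import AutoModelForSequenceClassification, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained(
    "line-corporation/line-distilbert-base-japanese", num_labels=2)

for param in model.distilbert.parameters():  # the original DistilBERT body
    param.requires_grad = False

# These arguments would then be passed to transformers.Trainer together with the
# tokenized datasets; only num_train_epochs=5 is stated in this card.
training_args = TrainingArguments(output_dir="checkpoints", num_train_epochs=5)
```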
### Training hyperparameters
### Framework versions
- Transformers 4.34.0
- Pytorch 2.0.0+cu118
- Datasets 2.14.5
- Tokenizers 0.14.0