---
license: apache-2.0
datasets:
- shibing624/CSC
language:
- zh
metrics:
- accuracy
pipeline_tag: text2text-generation
tags:
- CSC
- CGED
- spelling error
---
# CSC T5 - T5 for Traditional and Simplified Chinese Spelling Correction

This model was obtained by instruction-tuning the ClueAI/PromptCLUE-base-v1-5 model on a Chinese spelling error corpus.
## Model Details

### Model Description
- **Language(s) (NLP):** Chinese
- **Pretrained from model:** ClueAI/PromptCLUE-base-v1-5
- **Pretrained with dataset:** 1M UDN news corpus
- **Finetuned with dataset:** shibing624/CSC spelling error corpus (Simplified + Traditional Chinese)
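A minimal sketch of how the fine-tuning pairs can be loaded and cast into the instruction format used below. The `original_text`/`correct_text` field names are assumptions about the shibing624/CSC schema, and the instruction prefix is taken from the usage example further down; verify both against the dataset card before relying on this.

```python
from datasets import load_dataset

# Load the spelling-error corpus used for fine-tuning.
dataset = load_dataset("shibing624/CSC")

def to_instruction(example):
    # Assumed field names; the prefix means "Correct the typos in the sentence:".
    return {
        "source": "糾正句子裡的錯字: " + example["original_text"],
        "target": example["correct_text"],
    }

train = dataset["train"].map(to_instruction)
```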
### Model Sources
## Evaluation

- Chinese spelling error correction task (SIGHAN2015)
- FPR: False Positive Rate (lower is better)
| Model | Base Model | Accuracy | Recall | Precision | F1 | FPR |
|---|---|---|---|---|---|---|
| GECToR | hfl/chinese-macbert-base | 71.7 | 71.6 | 71.8 | 71.7 | 28.2 |
| GECToR_large | hfl/chinese-macbert-large | 73.7 | 76.5 | 72.5 | 74.4 | 29.1 |
| T5 w/ pretrain | ClueAI/PromptCLUE-base-v1-5 | 79.2 | 69.2 | 85.8 | 76.6 | 11.1 |
| T5 w/o pretrain | ClueAI/PromptCLUE-base-v1-5 | 75.1 | 63.1 | 82.2 | 71.4 | 13.3 |
| PTCSpell | N/A | 79.0 | 89.4 | 83.8 | N/A | N/A |
| MDCSpell | N/A | 77.2 | 81.5 | 79.3 | N/A | N/A |
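For reference, a minimal sketch of how sentence-level metrics like these might be computed, assuming lists of (source, reference, prediction) sentence triples. The function name and the exact counting scheme are illustrative assumptions, not taken from this card.

```python
def sentence_level_metrics(sources, references, predictions):
    """Sentence-level correction metrics: a true positive is a sentence
    that contained an error and was corrected to exactly match the
    reference; a false positive is a clean sentence the model edited."""
    tp = fp = fn = tn = 0
    for src, ref, pred in zip(sources, references, predictions):
        has_error = src != ref
        if has_error and pred == ref:
            tp += 1
        elif has_error:
            fn += 1
        elif pred != src:  # clean sentence, but the model changed it
            fp += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(sources)
    fpr = fp / (fp + tn) if fp + tn else 0.0  # false positives among clean sentences
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "fpr": fpr}
```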
## Usage
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("CodeTed/Chinese_Spelling_Correction_T5")
model = T5ForConditionalGeneration.from_pretrained("CodeTed/Chinese_Spelling_Correction_T5")

# Prompt: "Correct the typos in the sentence: To mitigate the declining birth
# rate, the government can promote policies that reward childbearing."
input_text = '糾正句子裡的錯字: 為了降低少子化,政府可以堆動獎勵生育的政策。'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids, max_length=256)
edited_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(edited_text)
```
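The instruction prefix 糾正句子裡的錯字: ("correct the typos in the sentence:") is part of the model input. For this example, the expected behaviour is that the misspelled 堆動 is rewritten as 推動 ("promote") while the rest of the sentence is left unchanged.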
## Related Project

- CodeTed/CGEDit - Chinese Grammatical Error Diagnosis by Task-Specific Instruction Tuning