---
license: apache-2.0
datasets:
- shibing624/CSC
language:
- zh
metrics:
- accuracy
pipeline_tag: text2text-generation
tags:
- CSC
- CGED
- spelling error
---

# CSC T5 - T5 for Traditional and Simplified Chinese Spelling Correction

This model was obtained by instruction-tuning `ClueAI/PromptCLUE-base-v1-5` on a Chinese spelling error corpus.
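
The instruction format matches the prompt used in the usage example below. As a rough sketch of how such instruction-tuning pairs could be assembled (the prompt string comes from the usage example; the helper and field names are illustrative assumptions, not the repository's training code):

```python
# Sketch of building instruction-tuning pairs for CSC (illustrative only).
PROMPT = "糾正句子裡的錯字: "  # "Correct the typos in the sentence: "

def make_example(wrong: str, correct: str) -> dict:
    """Turn a (misspelled, corrected) sentence pair into a T5 text2text example."""
    return {"input_text": PROMPT + wrong, "target_text": correct}

pairs = [
    ("為了降低少子化,政府可以堆動獎勵生育的政策。",
     "為了降低少子化,政府可以推動獎勵生育的政策。"),
]
examples = [make_example(wrong, correct) for wrong, correct in pairs]
```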

## Model Details

### Model Description

- Language(s) (NLP): `Chinese` (Traditional and Simplified)
- Base model: `ClueAI/PromptCLUE-base-v1-5`
- Pretraining dataset: 1M UDN news corpus
- Fine-tuning dataset: `shibing624/CSC` spelling error corpus (Simplified + Traditional Chinese)

### Model Sources

- Repository: [https://github.com/TedYeh/Chinese_spelling_Correction](https://github.com/TedYeh/Chinese_spelling_Correction)

### Evaluation

- Chinese spelling error correction task (SIGHAN 2015); all scores in %
- FPR: False Positive Rate

| Model           | Base Model                  | Accuracy | Recall | Precision | F1    | FPR   |
|:---------------:|:---------------------------:|:--------:|:------:|:---------:|:-----:|:-----:|
| GECToR          | hfl/chinese-macbert-base    | 71.7     | 71.6   | 71.8      | 71.7  | 28.2  |
| GECToR_large    | hfl/chinese-macbert-large   | 73.7     | 76.5   | 72.5      | 74.4  | 29.1  |
| T5 w/ pretrain  | ClueAI/PromptCLUE-base-v1-5 | 79.2     | 69.2   | 85.8      | 76.6  | 11.1  |
| T5 w/o pretrain | ClueAI/PromptCLUE-base-v1-5 | 75.1     | 63.1   | 82.2      | 71.4  | 13.3  |
| PTCSpell        | N/A                         | N/A      | 79.0   | 89.4      | 83.8  | N/A   |
| MDCSpell        | N/A                         | N/A      | 77.2   | 81.5      | 79.3  | N/A   |
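
For context, SIGHAN-style sentence-level scores can be recovered from a confusion matrix over sentences. Below is a minimal sketch of the metric definitions behind the columns above (assuming sentence-level evaluation; this is not the repository's evaluation script, and the table reports these ratios in %):

```python
def csc_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sentence-level CSC metrics from a confusion matrix.

    tp: erroneous sentences fully corrected
    fp: error-free sentences wrongly changed
    tn: error-free sentences left untouched
    fn: erroneous sentences missed or mis-corrected
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
        "FPR": fp / (fp + tn),  # false alarms on error-free sentences
    }
```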

## Usage

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the fine-tuned CSC model and its tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("CodeTed/Chinese_Spelling_Correction_T5")
model = T5ForConditionalGeneration.from_pretrained("CodeTed/Chinese_Spelling_Correction_T5")

# The prompt prefix "糾正句子裡的錯字: " means "Correct the typos in the sentence: ".
input_text = '糾正句子裡的錯字: 為了降低少子化,政府可以堆動獎勵生育的政策。'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Generate the corrected sentence and decode it back to text.
outputs = model.generate(input_ids, max_length=256)
edited_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
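
On this input the expected output replaces the typo 堆動 with 推動 ("promote"). Continuing from the snippet above, several sentences can be corrected in one call with padded batching; the sketch below is illustrative (the second example sentence and generation settings are assumptions):

```python
# Batched correction: tokenize a list of prompted sentences with padding,
# generate once, and decode each corrected sentence.
sentences = [
    "糾正句子裡的錯字: 為了降低少子化,政府可以堆動獎勵生育的政策。",
    "糾正句子裡的錯字: 他的身體一直都很鍵康。",  # 鍵康 -> 健康 ("healthy")
]
batch = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.generate(**batch, max_length=256)
corrected = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
print(corrected)
```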

### Related Project

[CodeTed/CGEDit](https://huggingface.co/CodeTed/CGEDit) - Chinese Grammatical Error Diagnosis by Task-Specific Instruction Tuning |