---
license: apache-2.0
datasets:
- shibing624/CSC
language:
- zh
metrics:
- accuracy
pipeline_tag: text2text-generation
tags:
- CSC
- CGED
- spelling error
---

# CSC T5 - T5 for Traditional and Simplified Chinese Spelling Correction

This model was obtained by instruction-tuning the `ClueAI/PromptCLUE-base-v1-5` model on a Chinese spelling error corpus.
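The exact training prompt is not published on this card. As a minimal sketch, assuming the same instruction prefix as in the Usage section below, each training example pairs a prefixed misspelled sentence with its corrected target:

```python
# Sketch of how instruction-tuning pairs might be built. The prefix matches
# the one in the Usage section; the actual training prompt is an assumption.
PREFIX = "糾正句子裡的錯字: "  # "Correct the typos in the sentence: "

def make_pair(wrong: str, correct: str) -> dict:
    """Turn a (misspelled, corrected) sentence pair into a seq2seq example."""
    return {"input": PREFIX + wrong, "target": correct}

example = make_pair(
    "為了降低少子化,政府可以堆動獎勵生育的政策。",  # 堆動 is a typo for 推動
    "為了降低少子化,政府可以推動獎勵生育的政策。",
)
```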

## Model Details
### Model Description
- Language(s) (NLP): `Chinese` (Simplified and Traditional)
- Base model: `ClueAI/PromptCLUE-base-v1-5`
- Pretraining dataset: 1M UDN news corpus
- Fine-tuning dataset: `shibing624/CSC` spelling error corpus (Simplified + Traditional Chinese); a loading sketch follows below
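
The fine-tuning corpus is available on the Hub. A minimal loading sketch with the `datasets` library (the column layout is an assumption; check `column_names` on the loaded split):

```python
from datasets import load_dataset

# Load the spelling-error corpus used for fine-tuning.
ds = load_dataset("shibing624/CSC")

# Column names are an assumption based on the dataset card;
# inspect ds["train"].column_names to confirm.
print(ds["train"].column_names)
print(ds["train"][0])
```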

### Model Sources
- Repository: [https://github.com/TedYeh/Chinese_spelling_Correction](https://github.com/TedYeh/Chinese_spelling_Correction)

### Evaluation

- Chinese spelling error correction task (SIGHAN 2015); all scores are percentages:
  - FPR: False Positive Rate, i.e. the share of error-free sentences that the model wrongly edits (a computation sketch follows the table)

| Model           | Base Model                  | Accuracy | Recall | Precision | F1   | FPR  |
|:---------------:|:---------------------------:|:--------:|:------:|:---------:|:----:|:----:|
| GECToR          | hfl/chinese-macbert-base    | 71.7     | 71.6   | 71.8      | 71.7 | 28.2 |
| GECToR_large    | hfl/chinese-macbert-large   | 73.7     | 76.5   | 72.5      | 74.4 | 29.1 |
| T5 w/ pretrain  | ClueAI/PromptCLUE-base-v1-5 | 79.2     | 69.2   | 85.8      | 76.6 | 11.1 |
| T5 w/o pretrain | ClueAI/PromptCLUE-base-v1-5 | 75.1     | 63.1   | 82.2      | 71.4 | 13.3 |
| PTCSpell        | N/A                         | N/A      | 79.0   | 89.4      | 83.8 | N/A  |
| MDCSpell        | N/A                         | N/A      | 77.2   | 81.5      | 79.3 | N/A  |
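
A sketch of how such sentence-level numbers can be computed, assuming one common CSC convention (an assumption, not taken from this model's evaluation code): a prediction counts as positive when the model changes the sentence, and as correct when it exactly matches the gold sentence.

```python
def csc_metrics(sources, golds, preds):
    """Sentence-level CSC metrics under one common convention (an assumption):
    positive = model changed the sentence; a true positive requires the
    prediction to exactly match the gold sentence."""
    tp = fp = fn = tn = 0
    for src, gold, pred in zip(sources, golds, preds):
        has_error = src != gold   # gold says the sentence contains typos
        changed = pred != src     # model proposed a correction
        if has_error and changed and pred == gold:
            tp += 1
        elif not has_error and changed:
            fp += 1               # error-free sentence wrongly edited
        elif has_error:
            fn += 1               # missed or wrongly corrected
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(sources)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "fpr": fpr}
```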

## Usage
```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("CodeTed/Chinese_Spelling_Correction_T5")
model = T5ForConditionalGeneration.from_pretrained("CodeTed/Chinese_Spelling_Correction_T5")

# Prompt: "Correct the typos in the sentence: To reduce the declining birth
# rate, the government can promote policies that reward childbirth."
# (堆動 is a typo for 推動, "to promote".)
input_text = '糾正句子裡的錯字: 為了降低少子化,政府可以堆動獎勵生育的政策。'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_length=256)
edited_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(edited_text)  # expected: ...政府可以推動獎勵生育的政策。
```
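
For this example the model should return the input with 堆動 corrected to 推動 ("to promote"). To correct several sentences at once, continuing from the snippet above (the padding and generation settings here are assumptions, not documented defaults):

```python
sentences = [
    "為了降低少子化,政府可以堆動獎勵生育的政策。",
    "他每天都很努力的工作。",  # an already-correct sentence should pass through
]
# Reuses `tokenizer` and `model` from the snippet above.
batch = tokenizer(
    ["糾正句子裡的錯字: " + s for s in sentences],
    return_tensors="pt",
    padding=True,
)
outputs = model.generate(**batch, max_length=256)
corrected = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```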

### Related Project
[CodeTed/CGEDit](https://huggingface.co/CodeTed/CGEDit) - Chinese Grammatical Error Diagnosis by Task-Specific Instruction Tuning