---
license: apache-2.0
datasets:
- shibing624/CSC
language:
- zh
metrics:
- accuracy
pipeline_tag: text2text-generation
tags:
- CSC
- CGED
- spelling error
---

# CSC T5 - T5 for Traditional and Simplified Chinese Spelling Correction

This model was obtained by instruction-tuning `ClueAI/PromptCLUE-base-v1-5` on a Chinese spelling error corpus.
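
The instruction format matches the prompt used in the usage example below. As a rough sketch of how such instruction-tuning pairs could be assembled (the prompt string comes from the usage example; the helper and field names are illustrative assumptions, not the repository's training code):

```python
# Sketch of building instruction-tuning pairs for CSC (illustrative only).
PROMPT = "糾正句子裡的錯字: "  # "Correct the typos in the sentence: "

def make_example(wrong: str, correct: str) -> dict:
    """Turn a (misspelled, corrected) sentence pair into a T5 text2text example."""
    return {"input_text": PROMPT + wrong, "target_text": correct}

pairs = [
    ("為了降低少子化,政府可以堆動獎勵生育的政策。",
     "為了降低少子化,政府可以推動獎勵生育的政策。"),
]
examples = [make_example(wrong, correct) for wrong, correct in pairs]
```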

## Model Details

### Model Description

- Language(s) (NLP): `Chinese` (Traditional and Simplified)
- Base model: `ClueAI/PromptCLUE-base-v1-5`
- Pretraining dataset: 1M UDN news corpus
- Fine-tuning dataset: `shibing624/CSC` spelling error corpus (Simplified + Traditional Chinese)

### Model Sources

- Repository: [https://github.com/TedYeh/Chinese_spelling_Correction](https://github.com/TedYeh/Chinese_spelling_Correction)

### Evaluation

- Chinese spelling error correction task (SIGHAN 2015); all scores in %
- FPR: False Positive Rate

| Model           | Base Model                  | Accuracy | Recall | Precision | F1    | FPR   |
|:---------------:|:---------------------------:|:--------:|:------:|:---------:|:-----:|:-----:|
| GECToR          | hfl/chinese-macbert-base    | 71.7     | 71.6   | 71.8      | 71.7  | 28.2  |
| GECToR_large    | hfl/chinese-macbert-large   | 73.7     | 76.5   | 72.5      | 74.4  | 29.1  |
| T5 w/ pretrain  | ClueAI/PromptCLUE-base-v1-5 | 79.2     | 69.2   | 85.8      | 76.6  | 11.1  |
| T5 w/o pretrain | ClueAI/PromptCLUE-base-v1-5 | 75.1     | 63.1   | 82.2      | 71.4  | 13.3  |
| PTCSpell        | N/A                         | N/A      | 79.0   | 89.4      | 83.8  | N/A   |
| MDCSpell        | N/A                         | N/A      | 77.2   | 81.5      | 79.3  | N/A   |
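
For context, SIGHAN-style sentence-level scores can be recovered from a confusion matrix over sentences. Below is a minimal sketch of the metric definitions behind the columns above (assuming sentence-level evaluation; this is not the repository's evaluation script, and the table reports these ratios in %):

```python
def csc_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sentence-level CSC metrics from a confusion matrix.

    tp: erroneous sentences fully corrected
    fp: error-free sentences wrongly changed
    tn: error-free sentences left untouched
    fn: erroneous sentences missed or mis-corrected
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "F1": 2 * precision * recall / (precision + recall),
        "FPR": fp / (fp + tn),  # false alarms on error-free sentences
    }
```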

## Usage

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the fine-tuned CSC model and its tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("CodeTed/Chinese_Spelling_Correction_T5")
model = T5ForConditionalGeneration.from_pretrained("CodeTed/Chinese_Spelling_Correction_T5")

# The prompt prefix "糾正句子裡的錯字: " means "Correct the typos in the sentence: ".
input_text = '糾正句子裡的錯字: 為了降低少子化,政府可以堆動獎勵生育的政策。'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

# Generate the corrected sentence and decode it back to text.
outputs = model.generate(input_ids, max_length=256)
edited_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
```
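
On this input the expected output replaces the typo 堆動 with 推動 ("promote"). Continuing from the snippet above, several sentences can be corrected in one call with padded batching; the sketch below is illustrative (the second example sentence and generation settings are assumptions):

```python
# Batched correction: tokenize a list of prompted sentences with padding,
# generate once, and decode each corrected sentence.
sentences = [
    "糾正句子裡的錯字: 為了降低少子化,政府可以堆動獎勵生育的政策。",
    "糾正句子裡的錯字: 他的身體一直都很鍵康。",  # 鍵康 -> 健康 ("healthy")
]
batch = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.generate(**batch, max_length=256)
corrected = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
print(corrected)
```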

### Related Project

[CodeTed/CGEDit](https://huggingface.co/CodeTed/CGEDit) - Chinese Grammatical Error Diagnosis by Task-Specific Instruction Tuning |