---
language: 
- zh
tags:
- chatglm
- pytorch
- zh
- Text2Text-Generation
license: "apache-2.0"
widget:
- text: "对下面中文拼写纠错:\n少先队员因该为老人让坐。\n答:"

---

# Chinese Spelling Correction LoRA Model
A LoRA model for Chinese text correction, fine-tuned from ChatGLM3-6B.

Example result of `shibing624/chatglm3-6b-csc-chinese-lora` on the CSC **test** set:

|prefix|input_text|target_text|pred|
|:-- |:--- |:--- |:-- |
|对下面文本纠错:|少先队员因该为老人让坐。|少先队员应该为老人让座。|少先队员应该为老人让座。|

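On the CSC test set, the generated corrections are highly accurate. Because the model is built on ChatGLM3-6B, its output often goes beyond plain correction: it can also polish and rewrite sentences.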


## Usage

This model is open-sourced as part of the [pycorrector](https://github.com/shibing624/pycorrector) project, which supports both the native ChatGLM model and LoRA fine-tuned models. Call it as follows:

Install package:
```shell
pip install -U pycorrector
```

```python
from pycorrector.gpt.gpt_model import GptModel
model = GptModel("chatglm", "THUDM/chatglm3-6b", peft_name="shibing624/chatglm3-6b-csc-chinese-lora")
r = model.predict(["对下面文本纠错:\n少先队员因该为老人让坐。"])
print(r) # ['少先队员应该为老人让座。']
```
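The prompt passed to `predict` is an instruction prefix, a newline, and the sentence to correct. When correcting many sentences, it can help to build the prompts in one place. The sketch below is a minimal illustration: `build_prompts` and `DEFAULT_PREFIX` are hypothetical names, not part of pycorrector, and the prefix is simply copied from the example above.

```python
from typing import List

# Instruction prefix copied from the pycorrector example above.
DEFAULT_PREFIX = "对下面文本纠错:"


def build_prompts(sentences: List[str], prefix: str = DEFAULT_PREFIX) -> List[str]:
    """Hypothetical helper: prepend the instruction prefix to each non-empty sentence."""
    return [f"{prefix}\n{s.strip()}" for s in sentences if s.strip()]


prompts = build_prompts(["少先队员因该为老人让坐。", "  "])
print(prompts)
# The prompts can then be passed to the model from the example above:
# results = model.predict(prompts)
```

Blank inputs are dropped so that the model is not asked to correct an empty string.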

## Usage (HuggingFace Transformers)
Without [pycorrector](https://github.com/shibing624/pycorrector), you can use the model like this: 

First, you pass your input through the transformer model, then you get the generated sentence.

Install packages:
```shell
pip install transformers peft
```

```python
from peft import PeftModel
from transformers import AutoModel, AutoTokenizer

# Load the base model, then attach the LoRA adapter on top of it.
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True, device_map='auto')
model = PeftModel.from_pretrained(model, "shibing624/chatglm3-6b-csc-chinese-lora")
model = model.half().cuda()  # fp16
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)

# Each prompt is the instruction prefix, a newline, then the sentence to correct.
sents = ['对下面中文拼写纠错:\n少先队员因该为老人让坐。',
         '对下面中文拼写纠错:\n下个星期,我跟我朋唷打算去法国玩儿。']
for s in sents:
    # chat() returns (response, history); the history is not needed here
    # because every sentence is corrected independently.
    response, _ = model.chat(tokenizer, s, max_length=128, eos_token_id=tokenizer.eos_token_id)
    print(response)
```

output:
```shell
少先队员应该为老人让座。
下个星期,我跟我朋友打算去法国玩儿。
```
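The model returns the corrected sentence, not the positions of the errors. If you need to know which characters changed, you can diff the input against the output. The sketch below uses Python's standard `difflib` for this; `list_corrections` is a hypothetical helper name, and the approach is an illustration, not part of this model or of pycorrector.

```python
from difflib import SequenceMatcher


def list_corrections(source, corrected):
    """Return (position, original, replacement) for every span that differs."""
    edits = []
    matcher = SequenceMatcher(None, source, corrected, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # For insertions the original span is empty; for deletions the replacement is.
            edits.append((i1, source[i1:i2], corrected[j1:j2]))
    return edits


pairs = [
    ("少先队员因该为老人让坐。", "少先队员应该为老人让座。"),
    ("下个星期,我跟我朋唷打算去法国玩儿。", "下个星期,我跟我朋友打算去法国玩儿。"),
]
for source, corrected in pairs:
    print(list_corrections(source, corrected))
```

Because the model sometimes rewrites a sentence instead of fixing single characters, the diff may contain longer spans, insertions, or deletions.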


Model files:
```
chatglm3-6b-csc-chinese-lora
    ├── adapter_config.json
    └── adapter_model.bin
```

#### Training parameters:

![loss](train_loss.png)

- num_epochs: 5
- per_device_train_batch_size: 6
- learning_rate: 2e-05
- best steps: 25100
- train_loss: 0.0834
- lr_scheduler_type: linear
- base model: THUDM/chatglm3-6b
- warmup_steps: 50
- save_strategy: steps
- save_steps: 500
- save_total_limit: 10
- bf16: false
- fp16: true
- optim: adamw_torch
- ddp_find_unused_parameters: false
- gradient_checkpointing: true
- max_seq_length: 512
- max_length: 512
- prompt_template_name: vicuna
- hardware: 6 × V100 32GB, 48 hours of training
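
For anyone setting up a comparable run, the listed values can be collected into a plain dictionary. This is a rough sketch, not the original training script: the keys copy the names in the list above, and how the actual script consumed them is not stated here. The effective batch size also assumes no gradient accumulation (`GRAD_ACCUM_STEPS = 1`), which the list does not specify.

```python
# Values copied from the parameter list above.
train_config = {
    "num_epochs": 5,
    "per_device_train_batch_size": 6,
    "learning_rate": 2e-05,
    "lr_scheduler_type": "linear",
    "warmup_steps": 50,
    "save_strategy": "steps",
    "save_steps": 500,
    "save_total_limit": 10,
    "bf16": False,
    "fp16": True,
    "optim": "adamw_torch",
    "ddp_find_unused_parameters": False,
    "gradient_checkpointing": True,
    "max_seq_length": 512,
}

NUM_GPUS = 6          # from "6 × V100 32GB"
GRAD_ACCUM_STEPS = 1  # assumption: not stated in the list above

# Number of samples that contribute to one optimizer step across all GPUs.
effective_batch_size = (
    train_config["per_device_train_batch_size"] * NUM_GPUS * GRAD_ACCUM_STEPS
)
print(effective_batch_size)  # 36 under the assumption above
```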

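### Training datasets
The training set consists of the following data:

- Chinese spelling correction dataset: https://huggingface.co/datasets/shibing624/CSC
- Chinese grammar correction dataset: https://github.com/shibing624/pycorrector/tree/llm/examples/data/grammar
- General GPT-4 Q&A dataset: https://huggingface.co/datasets/shibing624/sharegpt_gpt4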


To train a GPT model yourself, see [https://github.com/shibing624/pycorrector](https://github.com/shibing624/pycorrector).



## Citation

```latex
@software{pycorrector,
  author = {Ming Xu},
  title = {pycorrector: Text Error Correction Tool},
  year = {2023},
  url = {https://github.com/shibing624/pycorrector},
}
```