---
language:
- zh
tags:
- chatglm
- pytorch
- zh
- Text2Text-Generation
license: "apache-2.0"
widget:
- text: "对下面中文拼写纠错:\n少先队员因该为老人让坐。\n答:"
---
# Chinese Spelling Correction LoRA Model
A LoRA adapter for Chinese spelling correction (CSC), fine-tuned from ChatGLM3-6B.

Example from evaluating `shibing624/chatglm3-6b-csc-chinese-lora` on the CSC **test** set:
|prefix|input_text|target_text|pred|
|:-- |:--- |:--- |:-- |
|对下面文本纠错:|少先队员因该为老人让坐。|少先队员应该为老人让座。|少先队员应该为老人让座。|
On the CSC test set, the generated corrections are highly accurate. Because the model is based on ChatGLM3-6B, the results are often pleasantly surprising: beyond error correction, it can also polish and rewrite sentences.
## Usage
This model is open-sourced as part of the [pycorrector](https://github.com/shibing624/pycorrector) project, which supports both the original ChatGLM model and LoRA fine-tuned variants. Call it as follows:
Install package:
```shell
pip install -U pycorrector
```
```python
from pycorrector.gpt.gpt_model import GptModel

# Load the ChatGLM3-6B base model and apply the CSC LoRA adapter
model = GptModel("chatglm", "THUDM/chatglm3-6b", peft_name="shibing624/chatglm3-6b-csc-chinese-lora")
r = model.predict(["对下面文本纠错:\n少先队员因该为老人让坐。"])
print(r)  # ['少先队员应该为老人让座。']
```
## Usage (HuggingFace Transformers)
Without [pycorrector](https://github.com/shibing624/pycorrector), you can use the model directly with `transformers` and `peft`: load the base ChatGLM3-6B model, apply the LoRA adapter, and generate the corrected sentence with `model.chat`.
Install package:
```
pip install transformers peft
```
```python
from peft import PeftModel
from transformers import AutoModel, AutoTokenizer

# Load the base ChatGLM3-6B model and apply the CSC LoRA adapter
model = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True, device_map='auto')
model = PeftModel.from_pretrained(model, "shibing624/chatglm3-6b-csc-chinese-lora")
model = model.half().cuda()  # fp16
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)

sents = ['对下面中文拼写纠错:\n少先队员因该为老人让坐。',
         '对下面中文拼写纠错:\n下个星期,我跟我朋唷打算去法国玩儿。']
for s in sents:
    # model.chat returns (response, history) for ChatGLM models
    response, history = model.chat(tokenizer, s, max_length=128, eos_token_id=tokenizer.eos_token_id)
    print(response)
```
output:
```shell
少先队员应该为老人让座。
下个星期,我跟我朋友打算去法国玩儿。
```
Model files:
```
chatglm3-6b-csc-chinese-lora
├── adapter_config.json
└── adapter_model.bin
```
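
Optionally, if a single standalone checkpoint is more convenient than base model plus adapter, the LoRA weights can be folded into the base model with peft's `merge_and_unload()`. This is a minimal sketch; the output directory is only an example path:

```python
from peft import PeftModel
from transformers import AutoModel, AutoTokenizer

# Load the base model, attach the LoRA adapter
# (adapter_config.json + adapter_model.bin), then merge the adapter
# deltas into the base weights so the result can be saved on its own.
base = AutoModel.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "shibing624/chatglm3-6b-csc-chinese-lora")
merged = model.merge_and_unload()

merged.save_pretrained("chatglm3-6b-csc-merged")  # example output path
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b", trust_remote_code=True)
tokenizer.save_pretrained("chatglm3-6b-csc-merged")
```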
#### Training parameters
![loss](train_loss.png)
- num_epochs: 5
- per_device_train_batch_size: 6
- learning_rate: 2e-05
- best steps: 25100
- train_loss: 0.0834
- lr_scheduler_type: linear
- base model: THUDM/chatglm3-6b
- warmup_steps: 50
- "save_strategy": "steps"
- "save_steps": 500
- "save_total_limit": 10
- "bf16": false
- "fp16": true
- "optim": "adamw_torch"
- "ddp_find_unused_parameters": false
- "gradient_checkpointing": true
- max_seq_length: 512
- max_length: 512
- prompt_template_name: vicuna
- hardware: 6 * V100 32GB, ~48 hours of training (see the `TrainingArguments` sketch below)
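
For illustration only, here is a sketch of how the hyperparameters above could be expressed as a `transformers.TrainingArguments` object; the actual training script lives in pycorrector, and `output_dir` below is an assumed placeholder:

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the settings listed above; not the
# exact configuration used to train this adapter.
training_args = TrainingArguments(
    output_dir="outputs-chatglm3-6b-csc-lora",  # assumed placeholder
    num_train_epochs=5,
    per_device_train_batch_size=6,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_steps=50,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=10,
    bf16=False,
    fp16=True,
    optim="adamw_torch",
    ddp_find_unused_parameters=False,
    gradient_checkpointing=True,
)
```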
### Training datasets
The training set includes the following data:
- Chinese spelling correction dataset: https://huggingface.co/datasets/shibing624/CSC
- Chinese grammatical error correction dataset: https://github.com/shibing624/pycorrector/tree/llm/examples/data/grammar
- General GPT-4 Q&A dataset: https://huggingface.co/datasets/shibing624/sharegpt_gpt4

To train your own GPT-style correction model, refer to [https://github.com/shibing624/pycorrector](https://github.com/shibing624/pycorrector); a sketch of loading the CSC dataset is shown below.
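
A minimal sketch of pulling the CSC dataset from the Hugging Face Hub with the `datasets` library; split and column names should be checked against what the dataset actually ships with:

```python
from datasets import load_dataset

# Download the Chinese spelling correction dataset listed above.
csc = load_dataset("shibing624/CSC")
print(csc)  # inspect available splits and columns before building prompts

first_split = next(iter(csc))
print(csc[first_split][0])  # peek at one raw example
```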
## Citation
```latex
@software{pycorrector,
author = {Ming Xu},
title = {pycorrector: Text Error Correction Tool},
year = {2023},
url = {https://github.com/shibing624/pycorrector},
}
```