shibing624 committed on
Commit
beb5107
·
verified ·
1 Parent(s): 83b1665

Update README.md

Files changed (1)
  1. README.md +153 -3
README.md CHANGED
@@ -1,3 +1,153 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ library_name: transformers
3
+ base_model: Qwen/Qwen2.5-7B-Instruct
4
+ license: apache-2.0
5
+ datasets:
6
+ - shibing624/chinese_text_correction
7
+ language:
8
+ - zh
9
+ metrics:
10
+ - f1
11
+ tags:
12
+ - text-generation-inference
13
+ widget:
14
+ - text: "文本纠错:\n少先队员因该为老人让坐。"
15
+ ---
16
+
17
+
18
+
19
+ # Chinese Text Correction Model
20
+ Chinese text correction model chinese-text-correction-7b: for spelling correction and grammar correction.
21
+
22
+ Evaluation of `shibing624/chinese-text-correction-7b` on the test data:
23
+
24
+ Sample prediction on the CSC (Chinese Spelling Correction) **test** set:
25
+
26
+ |input_text|predict_text|
27
+ |:--- |:--- |
28
+ |文本纠错:\n少先队员因该为老人让坐。|少先队员应该为老人让座。|
29
+
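To see exactly which characters changed in a row like the one above, the input and the prediction can be compared character by character. The sketch below uses Python's standard `difflib`; `list_edits` is a hypothetical helper written for illustration and is not part of the model or of pycorrector.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def list_edits(source: str, corrected: str) -> List[Tuple[int, str, str]]:
    """Return (position, original_text, corrected_text) for every changed span.

    Positions are character offsets into `source`.
    """
    matcher = SequenceMatcher(None, source, corrected, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep replacements, insertions and deletions
            edits.append((i1, source[i1:i2], corrected[j1:j2]))
    return edits


if __name__ == "__main__":
    # Sentence pair taken from the table above (without the prompt prefix).
    for position, wrong, right in list_edits("少先队员因该为老人让坐。", "少先队员应该为老人让座。"):
        print(position, wrong, "->", right)
```

For the sample sentence this reports two single-character substitutions, which matches the two spelling errors in the input.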
30
+ # Models
31
+
32
+ | Name | Base Model | Download |
33
+ |-----------------|-------------------|-----------------------------------------------------------------------|
34
+ | chinese-text-correction-1.5b | Qwen/Qwen2.5-1.5B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-1.5b) |
35
+ | chinese-text-correction-1.5b-lora | Qwen/Qwen2.5-1.5B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-1.5b-lora) |
36
+ | chinese-text-correction-7b | Qwen/Qwen2.5-7B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-7b) |
37
+ | chinese-text-correction-7b-lora | Qwen/Qwen2.5-7B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-7b-lora) |
38
+
39
+
40
+
41
+ ## Usage (pycorrector)
42
+
43
+ This model is open-sourced as part of the [pycorrector](https://github.com/shibing624/pycorrector) project, which supports fine-tuning large language models for text correction. Use it as follows:
44
+
45
+ Install package:
46
+ ```shell
47
+ pip install -U pycorrector
48
+ ```
49
+
50
+ ```python
51
+ from pycorrector.gpt.gpt_corrector import GptCorrector
52
+
53
+ if __name__ == '__main__':
54
+     error_sentences = [
55
+         '真麻烦你了。希望你们好好的跳无',
56
+         '少先队员因该为老人让坐',
57
+         '机七学习是人工智能领遇最能体现智能的一个分知',
58
+         '一只小鱼船浮在平净的河面上',
59
+         '我的家乡是有明的渔米之乡',
60
+     ]
61
+     m = GptCorrector("shibing624/chinese-text-correction-7b")
62
+
63
+     batch_res = m.correct_batch(error_sentences)
64
+     for i in batch_res:
65
+         print(i)
66
+         print()
67
+ ```
68
+
69
+ ## Usage (HuggingFace Transformers)
70
+ Without [pycorrector](https://github.com/shibing624/pycorrector), you can use the model like this:
71
+
72
+ Pass the input through the model, then decode the generated sentence.
73
+
74
+ Install package:
75
+ ```shell
76
+ pip install transformers
77
+ ```
78
+
79
+ ```python
80
+ # pip install transformers
81
+ from transformers import AutoModelForCausalLM, AutoTokenizer
82
+ checkpoint = "shibing624/chinese-text-correction-7b"
83
+
84
+ device = "cuda" # for GPU usage or "cpu" for CPU usage
85
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
86
+ model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
87
+
88
+ input_content = "文本纠错:\n少先队员因该为老人让坐。"
89
+
90
+ messages = [{"role": "user", "content": input_content}]
91
+ input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)  # end the prompt with the assistant turn marker
92
+
93
+ print(input_text)
94
+
95
+ inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
96
+ outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)  # greedy decoding; temperature is not used when do_sample=False
97
+
98
+ print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))  # decode only the newly generated tokens
99
+ ```
100
+
101
+ output:
102
+ ```shell
103
+ 少先队员应该为老人让座。
104
+ ```
105
+
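The example above sends the sentence as a single user turn prefixed with `文本纠错:\n`. A minimal sketch of that prompt construction, so several sentences can be prepared the same way, is shown below. `PROMPT_PREFIX`, `build_messages` and `build_batch` are hypothetical names introduced here for illustration; only the prefix string and the message layout come from the example above.

```python
from typing import Dict, List

# Instruction prefix copied from the example input above.
PROMPT_PREFIX = "文本纠错:\n"


def build_messages(sentence: str) -> List[Dict[str, str]]:
    """Wrap one raw sentence in the single-turn chat format used above."""
    return [{"role": "user", "content": PROMPT_PREFIX + sentence.strip()}]


def build_batch(sentences: List[str]) -> List[List[Dict[str, str]]]:
    """Build one message list per non-empty sentence."""
    return [build_messages(s) for s in sentences if s.strip()]


if __name__ == "__main__":
    for messages in build_batch(["少先队员因该为老人让坐。", "我的家乡是有明的渔米之乡"]):
        print(messages)
```

Each resulting message list can be passed to `tokenizer.apply_chat_template` in place of the hand-written `messages` variable.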
106
+
107
+ Model files:
108
+ ```
109
+ shibing624/chinese-text-correction-7b
110
+ |-- added_tokens.json
111
+ |-- config.json
112
+ |-- generation_config.json
113
+ |-- merges.txt
114
+ |-- model.safetensors
115
+ |-- model.safetensors.index.json
116
+ |-- README.md
117
+ |-- special_tokens_map.json
118
+ |-- tokenizer_config.json
119
+ |-- tokenizer.json
120
+ `-- vocab.json
121
+ ```
122
+
123
+ #### Training parameters:
124
+
125
+ - num_epochs: 8
126
+ - batch_size: 2
127
+ - steps: 36000
128
+ - eval_loss: 0.12
129
+ - base model: Qwen/Qwen2.5-7B-Instruct
130
+ - train data: [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction)
131
+ - train time: 10 days
132
+ - eval_loss curve: ![eval loss curve](https://huggingface.co/shibing624/chinese-text-correction-7b-lora/resolve/main/eval_loss_7b.png)
133
+ - train_loss curve: ![train loss curve](https://huggingface.co/shibing624/chinese-text-correction-7b-lora/resolve/main/train_loss_7b.png)
134
+
135
+ ### Training dataset
136
+ #### Chinese text correction dataset
137
+
138
+ - Data: [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction)
139
+
140
+
141
+ To train a Qwen-based correction model, see [https://github.com/shibing624/pycorrector](https://github.com/shibing624/pycorrector) or [https://github.com/shibing624/MedicalGPT](https://github.com/shibing624/MedicalGPT).
142
+
143
+ ## Citation
144
+
145
+ ```bibtex
146
+ @software{pycorrector,
147
+ author = {Xu Ming},
148
+ title = {pycorrector: Implementation of language model finetune},
149
+ year = {2024},
150
+ url = {https://github.com/shibing624/pycorrector},
151
+ }
152
+ ```
153
+