mav23 committed
Commit: 7cd0bb0
Parent(s): 45efdef

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+chinese-text-correction-1.5b.Q4_0.gguf filter=lfs diff=lfs merge=lfs -text
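The rules above use gitattributes-style glob patterns to route matching files through Git LFS. As a rough illustration of which file names the patterns catch, here is a sketch using Python's `fnmatch` (only an approximation of git's pattern matcher, not the real implementation):

```python
from fnmatch import fnmatch

# Patterns from the .gitattributes hunk above (new GGUF rule included)
lfs_patterns = [
    "*.zip",
    "*.zst",
    "*tfevents*",
    "chinese-text-correction-1.5b.Q4_0.gguf",
]

def tracked_by_lfs(filename: str) -> bool:
    """Return True if the file name matches any LFS pattern."""
    return any(fnmatch(filename, pattern) for pattern in lfs_patterns)

print(tracked_by_lfs("chinese-text-correction-1.5b.Q4_0.gguf"))  # True
print(tracked_by_lfs("events.out.tfevents.1700000000"))          # True
print(tracked_by_lfs("README.md"))                               # False
```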
README.md ADDED
@@ -0,0 +1,168 @@
---
library_name: transformers
base_model: Qwen/Qwen2.5-1.5B-Instruct
license: apache-2.0
datasets:
- shibing624/chinese_text_correction
language:
- zh
metrics:
- f1
tags:
- text-generation-inference
widget:
- text: "文本纠错:\n少先队员因该为老人让坐。"
---
# Chinese Text Correction Model

`chinese-text-correction-1.5b` is a Chinese text correction model for spelling and grammar correction.

Example of `shibing624/chinese-text-correction-1.5b` on the CSC **test** data:

| input_text | predict_text |
|:--- |:--- |
| 文本纠错:\n少先队员因该为老人让坐。 | 少先队员应该为老人让座。 |
# Models

| Name | Base Model | Download |
|-----------------|-------------------|-----------------------------------------------------------------------|
| chinese-text-correction-1.5b | Qwen/Qwen2.5-1.5B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-1.5b) |
| chinese-text-correction-1.5b-lora | Qwen/Qwen2.5-1.5B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-1.5b-lora) |
| chinese-text-correction-7b | Qwen/Qwen2.5-7B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-7b) |
| chinese-text-correction-7b-lora | Qwen/Qwen2.5-7B-Instruct | [🤗 Hugging Face](https://huggingface.co/shibing624/chinese-text-correction-7b-lora) |
### Evaluation Results
- Evaluation metric: F1
- CSC (Chinese Spelling Correction): spelling-correction models that handle length-preserving errors such as phonetically similar characters, visually similar characters, and grammar mistakes
- CTC (Chinese Text Correction): text-correction models that, in addition to length-preserving spelling and grammar errors, also handle length-changing errors such as extra or missing characters
- GPU: Tesla V100, 32 GB memory

| Model Name | Model Link | Base Model | Avg | SIGHAN-2015 | EC-LAW | MCSC | GPU/CPU | QPS |
|:-----------------|:------------------------------------------------------------------------------------------------------------------------|:---------------------------|:-----------|:------------|:-------|:-------|:--------|:--------|
| Kenlm-CSC | [shibing624/chinese-kenlm-klm](https://huggingface.co/shibing624/chinese-kenlm-klm) | kenlm | 0.3409 | 0.3147 | 0.3763 | 0.3317 | CPU | 9 |
| Mengzi-T5-CSC | [shibing624/mengzi-t5-base-chinese-correction](https://huggingface.co/shibing624/mengzi-t5-base-chinese-correction) | mengzi-t5-base | 0.3984 | 0.7758 | 0.3156 | 0.1039 | GPU | 214 |
| ERNIE-CSC | [PaddleNLP/ernie-csc](https://github.com/PaddlePaddle/PaddleNLP/tree/develop/legacy/examples/text_correction/ernie-csc) | PaddlePaddle/ernie-1.0-base-zh | 0.4353 | 0.8383 | 0.3357 | 0.1318 | GPU | 114 |
| MacBERT-CSC | [shibing624/macbert4csc-base-chinese](https://huggingface.co/shibing624/macbert4csc-base-chinese) | hfl/chinese-macbert-base | 0.3993 | 0.8314 | 0.1610 | 0.2055 | GPU | **224** |
| ChatGLM3-6B-CSC | [shibing624/chatglm3-6b-csc-chinese-lora](https://huggingface.co/shibing624/chatglm3-6b-csc-chinese-lora) | THUDM/chatglm3-6b | 0.4538 | 0.6572 | 0.4369 | 0.2672 | GPU | 3 |
| Qwen2.5-1.5B-CTC | [shibing624/chinese-text-correction-1.5b](https://huggingface.co/shibing624/chinese-text-correction-1.5b) | Qwen/Qwen2.5-1.5B-Instruct | 0.6802 | 0.3032 | 0.7846 | 0.9529 | GPU | 6 |
| Qwen2.5-7B-CTC | [shibing624/chinese-text-correction-7b](https://huggingface.co/shibing624/chinese-text-correction-7b) | Qwen/Qwen2.5-7B-Instruct | **0.8225** | 0.4917 | 0.9798 | 0.9959 | GPU | 3 |
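For reference, sentence-level correction F1 can be computed roughly as follows. This is a simplified sketch (the exact evaluation protocol behind the table above is defined in the pycorrector project, not reproduced here): a prediction counts as a true positive when the model changed the source sentence and the result exactly matches the reference.

```python
def correction_f1(sources, predictions, targets):
    """Simplified sentence-level correction F1.

    tp: model changed the sentence and matched the reference
    fp: model changed the sentence but did not match the reference
    fn: sentence needed fixing but the prediction is still wrong
    """
    tp = fp = fn = 0
    for src, pred, tgt in zip(sources, predictions, targets):
        has_error = src != tgt   # reference says the sentence needs fixing
        changed = pred != src    # model attempted a correction
        if changed and pred == tgt:
            tp += 1
        elif changed and pred != tgt:
            fp += 1
        if has_error and pred != tgt:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

sources     = ["少先队员因该为老人让坐。", "今天天气很好。"]
targets     = ["少先队员应该为老人让座。", "今天天气很好。"]
predictions = ["少先队员应该为老人让座。", "今天天气很好。"]
print(correction_f1(sources, predictions, targets))  # 1.0
```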
## Usage (pycorrector)

This model is open-sourced in the [pycorrector](https://github.com/shibing624/pycorrector) project, which supports fine-tuning large language models for text correction. Use it as follows:

Install package:
```shell
pip install -U pycorrector
```
```python
from pycorrector.gpt.gpt_corrector import GptCorrector

if __name__ == '__main__':
    # Sentences containing typical spelling and grammar errors
    error_sentences = [
        '真麻烦你了。希望你们好好的跳无',
        '少先队员因该为老人让坐',
        '机七学习是人工智能领遇最能体现智能的一个分知',
        '一只小鱼船浮在平净的河面上',
        '我的家乡是有明的渔米之乡',
    ]
    m = GptCorrector("shibing624/chinese-text-correction-1.5b")

    batch_res = m.correct_batch(error_sentences)
    for i in batch_res:
        print(i)
        print()
```
## Usage (HuggingFace Transformers)
Without [pycorrector](https://github.com/shibing624/pycorrector), you can use the model like this:

First, pass your input through the transformer model, then decode the generated sentence.

Install package:
```
pip install transformers
```
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "shibing624/chinese-text-correction-1.5b"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Task prefix "文本纠错:" followed by the sentence to correct
input_content = "文本纠错:\n少先队员因该为老人让坐。"

messages = [{"role": "user", "content": input_content}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False)

print(input_text)

inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)

print(tokenizer.decode(outputs[0]))
```

output:
```shell
少先队员应该为老人让座。
```
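To see exactly which characters the model changed, a small helper built on Python's standard `difflib` can align the input with the corrected output. This is purely illustrative and not part of the model or the pycorrector API:

```python
import difflib

def show_edits(source: str, corrected: str):
    """Yield (operation, source_span, corrected_span) for each change."""
    matcher = difflib.SequenceMatcher(None, source, corrected)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            yield op, source[i1:i2], corrected[j1:j2]

src = "少先队员因该为老人让坐。"
out = "少先队员应该为老人让座。"
for edit in show_edits(src, out):
    print(edit)
# ('replace', '因', '应')
# ('replace', '坐', '座')
```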
Model files:
```
shibing624/chinese-text-correction-1.5b
|-- added_tokens.json
|-- config.json
|-- generation_config.json
|-- merges.txt
|-- model.safetensors
|-- model.safetensors.index.json
|-- README.md
|-- special_tokens_map.json
|-- tokenizer_config.json
|-- tokenizer.json
`-- vocab.json
```
#### Training parameters

- num_epochs: 8
- batch_size: 4
- steps: 36000
- eval_loss: 0.14
- base model: Qwen/Qwen2.5-1.5B-Instruct
- train data: [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction)
- train time: 9 days 8 hours
- eval loss curve: ![](https://huggingface.co/shibing624/chinese-text-correction-1.5b-lora/resolve/main/eval_loss_1.5b.png)
- train loss curve: ![](https://huggingface.co/shibing624/chinese-text-correction-1.5b-lora/resolve/main/train_loss_1.5b.png)
### Training dataset
#### Chinese text correction dataset

- Data: [shibing624/chinese_text_correction](https://huggingface.co/datasets/shibing624/chinese_text_correction)


To train your own Qwen-based correction model, see [https://github.com/shibing624/pycorrector](https://github.com/shibing624/pycorrector) or [https://github.com/shibing624/MedicalGPT](https://github.com/shibing624/MedicalGPT)
## Citation

```latex
@software{pycorrector,
  author = {Xu Ming},
  title = {pycorrector: Implementation of language model finetune},
  year = {2024},
  url = {https://github.com/shibing624/pycorrector},
}
```
chinese-text-correction-1.5b.Q4_0.gguf ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c17743a891f3c2d52b329b6f0f05037fd6316c3a685e5f15a61ccd6c3aedcd44
+size 934955040
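What is actually committed here is a Git LFS pointer file, not the GGUF weights themselves: three `key value` lines giving the spec version, the content hash, and the byte size. A few lines of Python can parse such a pointer (a sketch for inspection only; real workflows should let the git-lfs client resolve pointers):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into a key/value dict."""
    entries = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        entries[key] = value
    return entries

# Pointer content from the diff above
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:c17743a891f3c2d52b329b6f0f05037fd6316c3a685e5f15a61ccd6c3aedcd44
size 934955040
"""
info = parse_lfs_pointer(pointer)
print(info["oid"])
print(int(info["size"]) / 1024**3)  # roughly 0.87 GiB for the Q4_0 file
```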