shibing624 committed • Commit bb7d55d • 1 Parent(s): d210fa5
Update README.md
README.md CHANGED
@@ -1,3 +1,86 @@
---
language:
- zh
tags:
- bert
- pytorch
- zh
- ner
license: "apache-2.0"
---

# BERT for Chinese Named Entity Recognition (bert4ner) Model
Chinese named entity recognition model.

`bert4ner-base-chinese` evaluated on the CNER test set:

- precision: 0.9395, recall: 0.9604, f1: 0.9498

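
As a quick sanity check, the reported F1 can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The function name `f1_score` below is illustrative and not part of nerpy.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported above for bert4ner-base-chinese on the CNER test set.
reported_f1 = f1_score(0.9395, 0.9604)
print(f"{reported_f1:.4f}")  # 0.9498
```
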
Because the training data included the CNER training set, the model reaches near-SOTA performance on the CNER test set.

Model architecture: the standard BertSoftmax network:

![arch](bert.png)

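
In a BertSoftmax tagger as commonly used for NER, the encoder yields one vector of scores (logits) per token, a softmax turns each vector into a probability distribution over labels, and the most probable label is chosen independently for each token. The following is a minimal pure-Python sketch of that decoding step only; the label list and the logits are made-up illustration values, not output of this model.

```python
import math
from typing import List, Sequence

# Hypothetical label set for illustration; the real one comes from the model's config.
LABELS = ["O", "B-NAME", "I-NAME"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one token's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(token_logits: Sequence[Sequence[float]]) -> List[str]:
    """Pick the most probable label for each token independently (no CRF layer)."""
    tags = []
    for logits in token_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        tags.append(LABELS[best])
    return tags


# Made-up logits for three tokens.
example_logits = [[0.1, 2.3, 0.2], [0.0, 0.4, 3.1], [4.0, 0.5, 0.3]]
print(decode(example_logits))  # ['B-NAME', 'I-NAME', 'O']
```
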
## Usage

This model is open-sourced as part of the NER project [nerpy](https://github.com/shibing624/nerpy), which supports bert4ner models and can be called as follows:

```python
from nerpy import NERModel

# Load the pretrained model by its Hugging Face model name
model = NERModel("bert", "shibing624/bert4ner-base-chinese")
sentences = ["常建良,男,1963年出生,工科学士,高级工程师"]
predictions, raw_outputs, entities = model.predict(sentences, split_on_space=False)
print("entities:", entities)
# Output shown in the original README:
# entities: [('常建良', 'NAME'), ('工科', 'PRO'), ('学士', 'EDU'), ('高级工程师', 'TITLE')]
```

Model files:
```
bert4ner-base-chinese
├── config.json
├── model_args.json
├── eval_result.txt
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
└── vocab.txt
```

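
Before working with a locally downloaded copy, it can help to confirm that the directory contains the files listed above. The sketch below uses only the standard library; the helper name `missing_model_files` and the directory path are hypothetical, and the check says nothing about which files are strictly required for loading.

```python
from pathlib import Path
from typing import List

# File names taken from the listing above.
EXPECTED_FILES = [
    "config.json",
    "model_args.json",
    "eval_result.txt",
    "pytorch_model.bin",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "vocab.txt",
]


def missing_model_files(model_dir: str) -> List[str]:
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the model was downloaded.
    print(missing_model_files("bert4ner-base-chinese"))
```
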
### Training datasets
#### Chinese NER datasets

| Dataset | Corpus | Download link | File size |
| :------- | :--------- | :---------: | :---------: |
| **`CNER Chinese NER dataset`** | CNER (120,000 characters) | [CNER github](https://github.com/shibing624/nerpy/tree/main/examples/data/cner)| 1.1MB |
| **`PEOPLE Chinese NER dataset`** | People's Daily entity set (2 million characters) | [PEOPLE github](https://github.com/shibing624/nerpy/tree/main/examples/data/people)| 12.8MB |

Data format of the CNER Chinese NER dataset (one character and its tag per line, with a blank line between sentences):

```text
美 B-LOC
国 I-LOC
的 O
华 B-PER
莱 I-PER
士 I-PER

我 O
跟 O
他 O
```

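
To make the format concrete, here is a small sketch that reads such text into sentences and groups `B-`/`I-` tags into entity spans. It assumes whitespace-separated `character tag` lines and a blank line between sentences, as in the sample above; `parse_bio` and `extract_entities` are illustrative names, not nerpy functions.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (character, tag) pairs


def parse_bio(text: str) -> List[Sentence]:
    """Split BIO-formatted text into sentences of (character, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:  # blank (or malformed) line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Merge a B-X tag and the I-X tags that follow it into (entity_text, X)."""
    entities: List[Tuple[str, str]] = []
    chars: List[str] = []
    label = ""
    for char, tag in sentence:
        if tag.startswith("I-") and chars and tag[2:] == label:
            chars.append(char)
            continue
        if chars:  # any other tag closes the open entity
            entities.append(("".join(chars), label))
            chars, label = [], ""
        if tag.startswith("B-"):
            chars, label = [char], tag[2:]
    if chars:
        entities.append(("".join(chars), label))
    return entities


sample = "美 B-LOC\n国 I-LOC\n的 O\n华 B-PER\n莱 I-PER\n士 I-PER\n\n我 O\n跟 O\n他 O\n"
for sent in parse_bio(sample):
    print(extract_entities(sent))
# [('美国', 'LOC'), ('华莱士', 'PER')]
# []
```
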
To train bert4ner yourself, see [https://github.com/shibing624/nerpy/tree/main/examples](https://github.com/shibing624/nerpy/tree/main/examples).

## Citation

```latex
@software{nerpy,
  author = {Xu Ming},
  title = {nerpy: Named Entity Recognition toolkit},
  year = {2022},
  url = {https://github.com/shibing624/nerpy},
}
```