---
license: apache-2.0
inference: false
tags:
- ner
- zero-shot
- information extraction
---

# Erlangshen-UniEX-RoBERTa-110M-Chinese

- Github: [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen/examples/UniEX/)
- Docs: [Fengshenbang-Docs](https://fengshenbang-doc.readthedocs.io/)

## 简介 Brief Introduction

UniEX 的核心思想是将信息抽取转化为 token-pair 任务,从而将实体识别、关系抽取、事件抽取等抽取任务统一起来。我们使用一张表来识别实体的位置,其他表用来识别实体的类型或者关系的类型。此外,我们将标签信息和要抽取的文本拼接在一起,通过 transformer 进行编码,得到标签的表示和文本的表示。最后通过 Triaffine 注意力机制使得所有任务可以共享一套参数。

The core idea of UniEX is to cast information extraction as a token-pair task, unifying extraction tasks such as entity recognition, relation extraction, and event extraction. One table identifies the location of each entity, while additional tables identify entity types or relation types. In addition, we concatenate the label information with the text to be extracted and encode them together with a transformer, obtaining representations of both the labels and the text. Finally, a Triaffine attention mechanism lets all tasks share a single set of parameters.
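
The token-pair ("table filling") idea above can be sketched in a few lines. This is an illustrative toy, not the UniEX implementation: the function name `spans_to_table` and the example sentence are invented for demonstration, and real models predict such tables from transformer scores rather than from gold spans.

```python
import numpy as np

def spans_to_table(num_tokens, spans):
    """Build a token-pair table: table[i, j] = 1 iff tokens i..j (inclusive) form an entity span."""
    table = np.zeros((num_tokens, num_tokens), dtype=np.int64)
    for start, end in spans:
        table[start, end] = 1
    return table

# Toy sentence, tokenized per character: "华为发布新手机"
tokens = list("华为发布新手机")
# One gold entity span: "华为" covers tokens 0..1, so cell (0, 1) is set
table = spans_to_table(len(tokens), [(0, 1)])
print(table)
```

A separate table of the same shape can then mark, at the same cell, which entity type (or relation type) the span carries, which is how one mechanism serves all extraction tasks.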

## 模型分类 Model Taxonomy

| 需求 Demand | 任务 Task | 系列 Series | 模型 Model | 参数 Parameter | 额外 Extra |
| :----: | :----: | :----: | :----: | :----: | :----: |
| 抽取 Extraction | 自然语言理解 NLU | 二郎神 Erlangshen | RoBERTa | 110M | Chinese |

## 模型信息 Model Information

由于 UniEX 可以统一所有抽取任务,且经过预训练之后,UniEX 拥有不错的 Few-Shot 和 Zero-Shot 性能。为了方便社区做中文领域的抽取任务,我们使用百度百科这种结构化的数据构建弱监督数据集,清洗过后得到大约 600M 的数据;此外也收集了 16 个实体识别、7 个关系抽取、6 个事件抽取、11 个阅读理解数据集。我们将收集得到的数据混合在一起输入模型进行预训练。

Because UniEX unifies all extraction tasks, it achieves strong Few-Shot and Zero-Shot performance after pre-training. To make Chinese-domain extraction easier for the community, we built a weakly supervised dataset from the structured data of Baidu Encyclopedia, yielding about 600M of data after cleaning. In addition, we collected 16 entity recognition, 7 relation extraction, 6 event extraction, and 11 reading comprehension datasets. We mix all of this data together and feed it to the model for pre-training.


### 下游效果 Performance
| Task type | Dataset | TANL(t5-base) | UniEX(roberta-base) | UIE(t5-large) | UniEX(roberta-large) |
|:-------------------------:|:-------------:|:-------------:|:-------------------:|:-------------:|:--------------------:|
| Relation Extraction | CoNLL04 | 71.4 | 71.79 | 73.07 | 73.4 |
| | SciERC | - | - | 33.36 | 38 |
| | ACE05 | 63.7 | 63.64 | 64.68 | 64.9 |
| | ADE | 80.6 | 83.81 | - | - |
| Named Entity Recognition | CoNLL03 | 91.7 | 92.13 | 92.17 | 92.65 |
| | ACE04 | - | - | 86.52 | 87.12 |
| | ACE05 | 84.9 | 85.96 | 85.52 | 87.02 |
| | GENIA | 76.4 | 76.69 | - | - |
| Sentiment Extraction | 14lap | - | - | 63.15 | 65.23 |
| | 14res | - | - | 73.78 | 74.77 |
| | 15res | - | - | 66.1 | 68.58 |
| | 16res | - | - | 73.87 | 76.02 |
| Event Extraction | ACE05-Trigger | 68.4 | 70.86 | 72.63 | 74.08 |
| | ACE05-Role | 47.6 | 50.67 | 54.67 | 53.92 |
| | CASIE-Trigger | - | - | 68.98 | 71.46 |
| | CASIE-Role | - | - | 60.37 | 62.91 |

## 使用 Usage
```shell
git clone https://github.com/IDEA-CCNL/Fengshenbang-LM.git
cd Fengshenbang-LM
pip install --editable .
```


```python3
import argparse
from fengshen.pipelines.multiplechoice import UniEXPipelines

# Build the argument parser and attach the pipeline's own arguments
total_parser = argparse.ArgumentParser("TASK NAME")
total_parser = UniEXPipelines.piplines_args(total_parser)
args = total_parser.parse_args()

pretrained_model_path = 'IDEA-CCNL/Erlangshen-UniEX-RoBERTa-110M-Chinese'
args.learning_rate = 2e-5
args.max_length = 512
args.max_epochs = 3
args.batchsize = 8
args.default_root_dir = './'
model = UniEXPipelines(args, pretrained_model_path)

train_data = []
dev_data = []
test_data = [
    {
        "texta": "放弃了途观L和荣威RX5,果断入手这部车,外观霸气又好开",
        "textb": "",
        "question": "下面新闻属于哪一个类别?",
        "choice": ["房产", "汽车", "教育", "科技"],
        "answer": "汽车",
        "label": 1,
        "id": 7759,
    }
]

if args.train:
    model.train(train_data, dev_data)
result = model.predict(test_data)
for line in result[:20]:
    print(line)
```


## 引用 Citation

如果您在您的工作中使用了我们的模型,可以引用我们的[论文](https://arxiv.org/abs/2209.02970):

If you use this resource in your work, please cite our [paper](https://arxiv.org/abs/2209.02970):

```text
@article{fengshenbang,
  author    = {Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen and Ruyi Gan and Jiaxing Zhang},
  title     = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
  journal   = {CoRR},
  volume    = {abs/2209.02970},
  year      = {2022}
}
```

也可以引用我们的[网站](https://github.com/IDEA-CCNL/Fengshenbang-LM/):

You can also cite our [website](https://github.com/IDEA-CCNL/Fengshenbang-LM/):

```text
@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2021},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}
```