language:
  - en
tags:
  - bert
  - pytorch
  - en
  - ner
license: apache-2.0

BERT for English Named Entity Recognition (bert4ner) Model

English named entity recognition model

Evaluation of bert4ner-base-uncased on the CoNLL-2003 test set:

Overall performance of BERT on the CoNLL-2003 test set:

Model        Accuracy  Recall  F1
BertSoftmax  0.8956    0.9132  0.9043

On the CoNLL-2003 test set, the model comes close to state-of-the-art performance.
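As a quick sanity check on the reported scores, the snippet below recomputes F1 as the harmonic mean of the first two columns. It assumes the "Accuracy" column plays the role of precision; the source does not state this, but the numbers are consistent with it.

```python
# Values copied from the results table.
# Assumption: the "Accuracy" column is used here as precision.
precision = 0.8956
recall = 0.9132

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.9043
```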

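BertSoftmax network structure (vanilla BERT).

This model is open-sourced as part of the named entity recognition project nerpy, which supports the bert4ner model. It can be called as follows:

English named entity recognition: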

>>> from nerpy import NERModel
>>> model = NERModel("bert", "shibing624/bert4ner-base-uncased")
>>> predictions, raw_outputs, entities = model.predict(["AL-AIN, United Arab Emirates 1996-12-06"], split_on_space=True)
entities:  [('AL-AIN,', 'LOC'), ('United Arab Emirates', 'LOC')]

Model files:

bert4ner-base-uncased
    ├── config.json
    ├── model_args.json
    ├── pytorch_model.bin
    ├── special_tokens_map.json
    ├── tokenizer_config.json
    └── vocab.txt

Usage (HuggingFace Transformers)

Without nerpy, you can use the model like this:

First, pass the input through the transformer model to get a predicted tag for each token, then decode the tag sequence to recover the entity words.

Install package:

pip install transformers seqeval

import os
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification
from seqeval.metrics.sequence_labeling import get_entities

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Load model from HuggingFace Hub (a local directory such as "../bert4ner-base-uncased" also works)
tokenizer = AutoTokenizer.from_pretrained("shibing624/bert4ner-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("shibing624/bert4ner-base-uncased")
label_list = ["E-ORG", "E-LOC", "S-MISC", "I-MISC", "S-PER", "E-PER", "B-MISC", "O", "S-LOC",
              "E-MISC", "B-ORG", "S-ORG", "I-ORG", "B-LOC", "I-LOC", "B-PER", "I-PER"]

sentence = "AL-AIN, United Arab Emirates 1996-12-06"


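def get_entity(sentence):
    """Tag a sentence and print its (tokens, entity_type) spans."""
    # WordPiece tokens, without the [CLS]/[SEP] special tokens
    tokens = tokenizer.tokenize(sentence)
    # Input ids, with [CLS] and [SEP] added
    inputs = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(inputs).logits
    predictions = torch.argmax(outputs, dim=2)
    # Drop the predictions for [CLS] and [SEP] so they line up with tokens
    word_tags = [(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].numpy()[1:-1])]
    print(sentence)
    print(word_tags)

    pred_labels = [i[1] for i in word_tags]
    entities = []
    # get_entities returns (entity_type, start, end) with an inclusive end index
    line_entities = get_entities(pred_labels)
    for i in line_entities:
        word = tokens[i[1]: i[2] + 1]
        entity_type = i[0]
        entities.append((word, entity_type))

    print("Sentence entity:")
    print(entities)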


get_entity(sentence)
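
The entities printed by the function above are lists of WordPiece tokens rather than plain strings. A minimal sketch for joining them back into readable text follows; `merge_wordpieces` is a hypothetical helper, not part of nerpy or Transformers, and the token lists in the usage lines are illustrative rather than actual model output. It assumes the usual BERT convention that continuation pieces start with `##`, and it does not restore the original spacing around punctuation.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens into a space-separated string.

    Pieces starting with "##" are glued onto the previous word.
    """
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: append to the previous word without "##"
            words[-1] += token[2:]
        else:
            words.append(token)
    return " ".join(words)


# Illustrative inputs (not real tokenizer output)
print(merge_wordpieces(["united", "arab", "emirates"]))
print(merge_wordpieces(["black", "##burn"]))
```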

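Datasets

Named entity recognition datasets

Dataset                      Corpus                                         Download link   File size
CNER Chinese NER dataset     CNER (120,000 characters)                      CNER github     1.1MB
PEOPLE Chinese NER dataset   People's Daily dataset (2 million characters)  PEOPLE github   12.8MB
CoNLL03 English NER dataset  CoNLL-2003 dataset (220,000 words)             CoNLL03 github  1.7MB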

Input format

Each line holds one token and its label, separated by a tab; the BIOES tag scheme is preferred. Sentences are separated by a blank line.

EU	S-ORG
rejects	O
German	S-MISC
call	O
to	O
boycott	O
British	S-MISC
lamb	O
.	O

Peter	B-PER
Blackburn	E-PER
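
To make the format concrete, here is a minimal sketch that parses text in this layout and decodes the BIOES tags into entities. `read_sentences` and `bioes_to_entities` are hypothetical helper names used for illustration only, not nerpy APIs, and malformed tag sequences are simply dropped.

```python
SAMPLE = (
    "EU\tS-ORG\nrejects\tO\nGerman\tS-MISC\ncall\tO\nto\tO\n"
    "boycott\tO\nBritish\tS-MISC\nlamb\tO\n.\tO\n"
    "\n"
    "Peter\tB-PER\nBlackburn\tE-PER\n"
)


def read_sentences(text):
    """Split text into sentences, each a list of (token, label) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        token, label = line.split()  # token and label are whitespace-separated
        current.append((token, label))
    if current:
        sentences.append(current)
    return sentences


def bioes_to_entities(pairs):
    """Decode BIOES labels into (entity_text, entity_type) tuples."""
    entities, buffer, buffer_type = [], [], None
    for token, label in pairs:
        prefix, _, etype = label.partition("-")
        if prefix == "S":
            # Single-token entity
            entities.append((token, etype))
            buffer, buffer_type = [], None
        elif prefix == "B":
            # Start of a multi-token entity
            buffer, buffer_type = [token], etype
        elif prefix == "I" and buffer and etype == buffer_type:
            buffer.append(token)
        elif prefix == "E" and buffer and etype == buffer_type:
            # End of a multi-token entity
            buffer.append(token)
            entities.append((" ".join(buffer), etype))
            buffer, buffer_type = [], None
        else:
            # "O" or an inconsistent tag: discard any open span
            buffer, buffer_type = [], None
    return entities


sentences = read_sentences(SAMPLE)
for sentence in sentences:
    print(bioes_to_entities(sentence))
```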

To train bert4ner, see the examples at https://github.com/shibing624/nerpy/tree/main/examples

Citation

@software{nerpy,
  author = {Xu Ming},
  title = {nerpy: Named Entity Recognition toolkit},
  year = {2022},
  url = {https://github.com/shibing624/nerpy},
}