---
{}
---

# Patent Entity Extraction Model

### Model Description

**patent_entities_ner** is a fine-tuned [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model, trained on a custom dataset of OCR'd front pages of patent specifications published by the British Patent Office and filed between 1617 and 1899.

It has been trained to recognize six classes of named entities:

- PER: full name of the inventor
- OCC: occupation of the inventor
- ADD: full (permanent) address of the inventor
- DATE: patent filing, submission, or approval dates
- FIRM: name of the firm affiliated with the inventor
- COMM: name of, and information about, the communicant

We take the original xlm-roberta-large [weights](https://huggingface.co/FacebookAI/xlm-roberta-large/blob/main/pytorch_model.bin) and fine-tune on our custom dataset for 29 epochs with a learning rate of 5e-05 and a batch size of 21. The learning rate was chosen by tuning on the validation set. A sketch of this fine-tuning setup is shown below.
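
The following is a minimal sketch of how such a fine-tuning run could be set up with the Hugging Face `Trainer`. Only the hyperparameters (29 epochs, learning rate 5e-05, batch size 21) come from the description above; the IOB label list, the dataset variables, and the output directory are illustrative assumptions, not the authors' published training script.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# IOB2 label set for the six entity classes (assumed; the card lists the classes, not the tag scheme)
label_list = ["O",
              "B-PER", "I-PER", "B-OCC", "I-OCC", "B-ADD", "I-ADD",
              "B-DATE", "I-DATE", "B-FIRM", "I-FIRM", "B-COMM", "I-COMM"]

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "FacebookAI/xlm-roberta-large", num_labels=len(label_list)
)

# hyperparameters from the description above; output_dir is a placeholder
training_args = TrainingArguments(
    output_dir="patent_entities_ner",
    num_train_epochs=29,
    learning_rate=5e-5,
    per_device_train_batch_size=21,
)

# train_dataset / eval_dataset: tokenized splits with labels aligned to word pieces (not shown here)
train_dataset = eval_dataset = None  # placeholders

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```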

### Usage

This model can be used with the Hugging Face Transformers pipeline API for NER:

```python
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gbpatentdata/patent_entities_ner")
model = AutoModelForTokenClassification.from_pretrained("gbpatentdata/patent_entities_ner")


def custom_recognizer(text, model=model, tokenizer=tokenizer, device=0):

    # HF ner pipeline (device=0 assumes a GPU; pass device=-1 to run on CPU)
    token_level_results = pipeline("ner", model=model, device=device, tokenizer=tokenizer)(text)

    # keep track of aggregated entities
    entities = []
    current_entity = None

    for item in token_level_results:

        tag = item['entity']

        # replace '▁' with space for easier reading ('▁' is created by the XLM-RoBERTa tokenizer)
        word = item['word'].replace('▁', ' ')

        # aggregate I-O-B tagged entities
        if tag.startswith('B-'):

            if current_entity:
                entities.append(current_entity)

            current_entity = {'type': tag[2:], 'text': word.strip(), 'start': item['start'], 'end': item['end']}

        elif tag.startswith('I-'):

            if current_entity and tag[2:] == current_entity['type']:
                current_entity['text'] += word
                current_entity['end'] = item['end']
            else:
                if current_entity:
                    entities.append(current_entity)

                current_entity = {'type': tag[2:], 'text': word.strip(), 'start': item['start'], 'end': item['end']}

        else:
            # deal with O tag: close any open entity
            if current_entity:
                entities.append(current_entity)
                current_entity = None

    if current_entity:
        # add the final open entity
        entities.append(current_entity)

    # merge adjacent entities of the same type
    merged_entities = []
    for entity in entities:
        if merged_entities and merged_entities[-1]['type'] == entity['type'] and merged_entities[-1]['end'] == entity['start']:
            merged_entities[-1]['text'] += entity['text']
            merged_entities[-1]['end'] = entity['end']
        else:
            merged_entities.append(entity)

    # clean up extra spaces
    for entity in merged_entities:
        entity['text'] = ' '.join(entity['text'].split())

    # convert to list of dicts
    return [{'class': entity['type'],
             'entity_text': entity['text'],
             'start': entity['start'],
             'end': entity['end']} for entity in merged_entities]


example = """
Date of Application, 1st Aug., 1890-Accepted, 6th Sept., 1890
COMPLETE SPECIFICATION.
Improvements in Coin-freed Apparatus for the Sale of Goods.
I, CHARLES LOTINGA, of 33 Cambridge Street, Lower Grange, Cardiff, in the County of Glamorgan, Gentleman,
do hereby declare the nature of this invention and in what manner the same is to be performed,
to be particularly described and ascertained in and by the following statement
"""

ner_results = custom_recognizer(example)
print(ner_results)
```

### Training Data

The custom dataset of front page texts of patent specifications was assembled in the following steps:

1. We fine-tuned a YOLO vision [model](https://huggingface.co/gbpatentdata/yolov8_patent_layouts) to detect bounding boxes around text, and used it to identify text regions on the front pages of patent specifications.
2. We used [Google Cloud Vision](https://cloud.google.com/vision?hl=en) to OCR the detected text regions, and then concatenated the OCR text (see the sketch after this list).
3. We randomly sampled 200 front page texts (and another 201 oversampled from those that contain either firm or communicant information).
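
The sketch below illustrates how steps 1 and 2 could be wired together with the `ultralytics` YOLO API and the Google Cloud Vision client. The weight path and file names are placeholders, and this is not the authors' published pipeline.

```python
# Hypothetical sketch of steps 1-2: detect text regions with the YOLO layout model,
# OCR each crop with Google Cloud Vision, and concatenate the results.
import io

from PIL import Image
from ultralytics import YOLO
from google.cloud import vision

layout_model = YOLO("yolov8_patent_layouts.pt")  # placeholder path to the fine-tuned weights
ocr_client = vision.ImageAnnotatorClient()       # requires Google Cloud credentials


def ocr_front_page(image_path: str) -> str:
    page = Image.open(image_path)
    detections = layout_model(image_path)[0]

    texts = []
    for box in detections.boxes.xyxy.tolist():   # each box is [x1, y1, x2, y2]
        crop = page.crop(tuple(box))
        buffer = io.BytesIO()
        crop.save(buffer, format="PNG")

        response = ocr_client.document_text_detection(
            image=vision.Image(content=buffer.getvalue())
        )
        texts.append(response.full_text_annotation.text)

    # concatenate the OCR text from all detected regions
    return "\n".join(texts)
```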

Our custom dataset has accurate manual labels created jointly by an undergraduate student and an economics professor. The final dataset is split 60-20-20 (train-val-test). If a front page text is too long, we restrict it to the first 512 tokens.
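
A minimal illustration of that truncation, assuming the standard tokenizer call (the exact preprocessing code is not published here):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gbpatentdata/patent_entities_ner")

front_page_text = "..."  # placeholder: an OCR'd front page text
# keep only the first 512 tokens of over-long texts
encoding = tokenizer(front_page_text, truncation=True, max_length=512)
```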

### Evaluation

Our evaluation metric is F1 at the full entity level. That is, we aggregate adjacent tagged tokens into full entities and compute F1 scores that require an exact match between predicted and gold entity spans. Scores on the test set are reported below; an illustrative sketch of this metric follows the table.

<table>
  <thead>
    <tr>
      <th>Full Entity</th>
      <th>Precision</th>
      <th>Recall</th>
      <th>F1-Score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>PER</td>
      <td>92.2%</td>
      <td>97.7%</td>
      <td>94.9%</td>
    </tr>
    <tr>
      <td>OCC</td>
      <td>93.8%</td>
      <td>93.8%</td>
      <td>93.8%</td>
    </tr>
    <tr>
      <td>ADD</td>
      <td>88.6%</td>
      <td>91.2%</td>
      <td>89.9%</td>
    </tr>
    <tr>
      <td>DATE</td>
      <td>93.7%</td>
      <td>98.7%</td>
      <td>96.1%</td>
    </tr>
    <tr>
      <td>FIRM</td>
      <td>64.0%</td>
      <td>94.1%</td>
      <td>76.2%</td>
    </tr>
    <tr>
      <td>COMM</td>
      <td>77.1%</td>
      <td>87.1%</td>
      <td>81.8%</td>
    </tr>
    <tr>
      <td>Overall (micro avg)</td>
      <td>89.9%</td>
      <td>95.3%</td>
      <td>92.5%</td>
    </tr>
    <tr>
      <td>Overall (macro avg)</td>
      <td>84.9%</td>
      <td>93.8%</td>
      <td>88.9%</td>
    </tr>
    <tr>
      <td>Overall (weighted avg)</td>
      <td>90.3%</td>
      <td>95.3%</td>
      <td>92.7%</td>
    </tr>
  </tbody>
</table>
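
As an illustration of the metric (not the authors' exact evaluation script), entity-level precision, recall, and F1 with exact-match spans can be computed with `seqeval`, which aggregates adjacent IOB tags into full entities. The tag sequences below are hypothetical:

```python
from seqeval.metrics import classification_report

# gold and predicted IOB2 tag sequences for two hypothetical documents
y_true = [["B-PER", "I-PER", "O", "B-OCC", "O", "B-ADD", "I-ADD"],
          ["B-DATE", "I-DATE", "O", "B-FIRM"]]
y_pred = [["B-PER", "I-PER", "O", "B-OCC", "O", "B-ADD", "O"],
          ["B-DATE", "I-DATE", "O", "B-FIRM"]]

# only exact span matches count as true positives, mirroring the evaluation described above
print(classification_report(y_true, y_pred, digits=3))
```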