---
language:
- en
base_model:
- FacebookAI/xlm-roberta-large
pipeline_tag: token-classification
library_name: transformers
---
# Patent Entity Extraction Model
### Model Description
**patent_entities_ner** is a fine-tuned [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model, trained on a custom dataset of OCR'd front pages of patent specifications published by the British Patent Office and filed between 1617 and 1899.
It has been trained to recognize six classes of named entities:
- PER: full name of the inventor
- OCC: occupation of the inventor
- ADD: full (permanent) address of the inventor
- DATE: patent filing, submission, or approval dates
- FIRM: name of the firm affiliated with the inventor
- COMM: name of and information about the communicant
We take the original XLM-RoBERTa-large [weights](https://huggingface.co/FacebookAI/xlm-roberta-large/blob/main/pytorch_model.bin) and fine-tune on our custom dataset for 29 epochs with a learning rate of 5e-05 and a batch size of 21. We chose the learning rate by tuning on the validation set.
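The sketch below illustrates one way to reproduce this setup with the `transformers` `Trainer` API. The hyperparameters are those listed above; the label ordering and the dataset objects are placeholders, not the exact training pipeline.

```python
# Minimal sketch of the fine-tuning setup described above. Hyperparameters come
# from the text; the label ordering and dataset objects are placeholders.
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    TrainingArguments,
    Trainer,
)

label_list = ["O",
              "B-PER", "I-PER", "B-OCC", "I-OCC", "B-ADD", "I-ADD",
              "B-DATE", "I-DATE", "B-FIRM", "I-FIRM", "B-COMM", "I-COMM"]

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "FacebookAI/xlm-roberta-large",
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id={label: i for i, label in enumerate(label_list)},
)

args = TrainingArguments(
    output_dir="patent_entities_ner",
    learning_rate=5e-5,              # selected on the validation set
    num_train_epochs=29,
    per_device_train_batch_size=21,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,     # placeholder: tokenized, IOB-labelled split
    eval_dataset=val_dataset,        # placeholder
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
trainer.train()
```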
### Usage
This model can be used with HuggingFace Transformer's Pipelines API for NER:
```python
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gbpatentdata/patent_entities_ner")
model = AutoModelForTokenClassification.from_pretrained("gbpatentdata/patent_entities_ner")
def custom_recognizer(text, model=model, tokenizer=tokenizer, device=0):

    # HF NER pipeline
    token_level_results = pipeline("ner", model=model, device=device, tokenizer=tokenizer)(text)

    # track aggregated entities
    entities = []
    current_entity = None

    for item in token_level_results:
        tag = item['entity']
        # replace '▁' with space for easier reading ('▁' is produced by the XLM-RoBERTa tokenizer)
        word = item['word'].replace('▁', ' ')

        # aggregate I-O-B tagged entities
        if tag.startswith('B-'):
            if current_entity:
                entities.append(current_entity)
            current_entity = {'type': tag[2:], 'text': word.strip(), 'start': item['start'], 'end': item['end']}
        elif tag.startswith('I-'):
            if current_entity and tag[2:] == current_entity['type']:
                current_entity['text'] += word
                current_entity['end'] = item['end']
            else:
                if current_entity:
                    entities.append(current_entity)
                current_entity = {'type': tag[2:], 'text': word.strip(), 'start': item['start'], 'end': item['end']}
        else:
            # deal with the O tag
            if current_entity:
                entities.append(current_entity)
            current_entity = None

    # add the final entity, if any
    if current_entity:
        entities.append(current_entity)

    # merge adjacent entities of the same type
    merged_entities = []
    for entity in entities:
        if merged_entities and merged_entities[-1]['type'] == entity['type'] and merged_entities[-1]['end'] == entity['start']:
            merged_entities[-1]['text'] += entity['text']
            merged_entities[-1]['end'] = entity['end']
        else:
            merged_entities.append(entity)

    # clean up extra spaces
    for entity in merged_entities:
        entity['text'] = ' '.join(entity['text'].split())

    # convert to a list of dicts
    return [{'class': entity['type'],
             'entity_text': entity['text'],
             'start': entity['start'],
             'end': entity['end']} for entity in merged_entities]
example = """
Date of Application, 1st Aug., 1890-Accepted, 6th Sept., 1890
COMPLETE SPECIFICATION.
Improvements in Coin-freed Apparatus for the Sale of Goods.
I, CHARLES LOTINGA, of 33 Cambridge Street, Lower Grange, Cardiff, in the County of Glamorgan, Gentleman,
do hereby declare the nature of this invention and in what manner the same is to be performed,
to be particularly described and ascertained in and by the following statement
"""
ner_results = custom_recognizer(example)
print(ner_results)
```
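The function returns a list of dictionaries, one per detected entity, with the keys `class`, `entity_text`, `start`, and `end` (character offsets into the input). The shape below is illustrative only; the spans and offsets are hypothetical, not verified model output.

```python
# Illustrative output structure (hypothetical values):
# [
#   {'class': 'PER', 'entity_text': 'CHARLES LOTINGA', 'start': ..., 'end': ...},
#   {'class': 'ADD', 'entity_text': '33 Cambridge Street, ...', 'start': ..., 'end': ...},
#   ...
# ]
```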
### Training Data
The custom dataset of front page texts of patent specifications was assembled in the following steps:
1. We fine-tune a YOLO vision [model](https://huggingface.co/gbpatentdata/yolov8_patent_layouts) to detect bounding boxes around text, and use it to identify text regions on the front pages of patent specifications.
2. We use [Google Cloud Vision](https://cloud.google.com/vision?hl=en) to OCR the detected text regions, and then concatenate the OCR text.
3. We randomly sample 200 front page texts (and another 201 oversampled from those that contain either firm or communicant information).
Our custom dataset has accurate manual labels created jointly by an undergraduate student and an economics professor. The final dataset is split 60-20-20 (train-val-test). Where a front page text is too long for the model, we restrict it to the first 512 tokens.
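As a sketch, the 512-token restriction can be implemented with the model's own tokenizer; this illustrates the truncation step only and is not necessarily the exact preprocessing code.

```python
# Sketch of the truncation step: cut long front-page texts to 512 tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")

def truncate_to_512(text, max_tokens=512):
    # Tokenize, keep at most `max_tokens` tokens, then decode back to text.
    ids = tokenizer(text, truncation=True, max_length=max_tokens)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)
```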
### Evaluation
Our evaluation metric is F1 at the full entity level. That is, adjacent tagged tokens are aggregated into full entities, and a predicted entity counts as correct only if it matches the gold entity exactly. Test set scores are below.
<table>
<thead>
<tr>
<th>Full Entity</th>
<th>Precision</th>
<th>Recall</th>
<th>F1-Score</th>
</tr>
</thead>
<tbody>
<tr>
<td>PER</td>
<td>92.2%</td>
<td>97.7%</td>
<td>94.9%</td>
</tr>
<tr>
<td>OCC</td>
<td>93.8%</td>
<td>93.8%</td>
<td>93.8%</td>
</tr>
<tr>
<td>ADD</td>
<td>88.6%</td>
<td>91.2%</td>
<td>89.9%</td>
</tr>
<tr>
<td>DATE</td>
<td>93.7%</td>
<td>98.7%</td>
<td>96.1%</td>
</tr>
<tr>
<td>FIRM</td>
<td>64.0%</td>
<td>94.1%</td>
<td>76.2%</td>
</tr>
<tr>
<td>COMM</td>
<td>77.1%</td>
<td>87.1%</td>
<td>81.8%</td>
</tr>
<tr>
<td>Overall (micro avg)</td>
<td>89.9%</td>
<td>95.3%</td>
<td>92.5%</td>
</tr>
<tr>
<td>Overall (macro avg)</td>
<td>84.9%</td>
<td>93.8%</td>
<td>88.9%</td>
</tr>
<tr>
<td>Overall (weighted avg)</td>
<td>90.3%</td>
<td>95.3%</td>
<td>92.7%</td>
</tr>
</tbody>
</table>
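Entity-level scores of this kind can be computed with a library such as [seqeval](https://github.com/chakki-works/seqeval), which merges adjacent IOB tags into full entities and requires an exact span-and-type match. The snippet below is a minimal sketch with placeholder tag sequences, not necessarily the exact evaluation code used here.

```python
from seqeval.metrics import classification_report

# IOB tag sequences, one list per document (illustrative placeholders).
y_true = [["B-PER", "I-PER", "O", "B-DATE", "I-DATE", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-DATE", "O", "O"]]

# seqeval aggregates adjacent B-/I- tags into full entities and counts a
# prediction as correct only if both span and type match exactly.
print(classification_report(y_true, y_pred, digits=3))
```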
## Citation
If you use our model or custom training/evaluation data in your research, please cite our accompanying paper as follows:
```bibtex
@article{bct2025,
title = {300 Years of British Patents},
author = {Enrico Berkes and Matthew Lee Chen and Matteo Tranchero},
journal = {arXiv preprint arXiv:2401.12345},
year = {2025},
url = {https://arxiv.org/abs/2401.12345}
}
``` |