|
--- |
|
license: cc-by-2.0 |
|
language: |
|
- en |
|
pipeline_tag: token-classification |
|
--- |
|
|
|
# Historical newspaper NER |
|
|
|
## Model description |
|
|
|
**historical_newspaper_ner** is a fine-tuned RoBERTa-large model intended for use on text that may contain OCR errors.
|
|
|
It has been trained to recognize four types of entities: locations (LOC), organizations (ORG), persons (PER), and miscellaneous (MISC).
|
|
|
It was trained on a custom historical newspaper dataset with highly accurate labels: all data were double-entered by two highly skilled Harvard undergraduates, and all discrepancies were resolved by hand.
|
|
|
|
|
## Intended uses |
|
|
|
You can use this model with the Transformers NER pipeline.
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and its tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("dell-research-harvard/historical_newspaper_ner")
model = AutoModelForTokenClassification.from_pretrained("dell-research-harvard/historical_newspaper_ner")

# Build a token-classification pipeline and run it on a sample sentence
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Wolfgang and I live in Berlin"

ner_results = nlp(example)
print(ner_results)
|
``` |
|
|
|
## Limitations and bias |
|
|
|
This model was trained on historical news and may reflect biases from a specific period of time. It may also not generalise well to other settings.

Additionally, the model occasionally tags subword tokens as entities, so post-processing of the results may be necessary to handle those cases.
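If whole-word spans are preferable, one mitigation is to let the pipeline merge subword pieces at inference time. The snippet below is a minimal sketch using the standard Transformers `aggregation_strategy` option rather than anything specific to this model, and the built-in grouping heuristics may still need manual checks on noisy OCR text.

```python
from transformers import pipeline

# Merge subword pieces into whole-word entity spans; "simple" is one of the
# built-in strategies ("first", "average" and "max" are alternatives).
nlp = pipeline(
    "ner",
    model="dell-research-harvard/historical_newspaper_ner",
    aggregation_strategy="simple",
)

results = nlp("Wolfgang travelled from Berlin to the offices of the Associated Press.")
print(results)  # each dict has an aggregated word, entity_group, score and character offsets
```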
|
|
|
## Training data |
|
|
|
The training dataset distinguishes between the beginning and continuation of an entity, so that if there are back-to-back entities of the same type, the model can output where the second entity begins. Each token is classified as one of the following classes (an illustrative example follows the table):
|
|
|
Abbreviation|Description
-|-
O|Outside of a named entity
B-MISC|Beginning of a miscellaneous entity
I-MISC|Miscellaneous entity
B-PER|Beginning of a person's name
I-PER|Person's name
B-ORG|Beginning of an organization
I-ORG|Organization
B-LOC|Beginning of a location
I-LOC|Location
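For instance, back-to-back entities of the same type are separated by starting a fresh B- tag rather than continuing the previous span. The example below is invented for illustration and is not drawn from the training data:

```python
# Hypothetical tagged sentence illustrating the BIO scheme described above.
tokens = ["Mr.", "Truman", "Dewey", "spoke", "in", "St.", "Louis"]
tags   = ["O",   "B-PER",  "B-PER", "O",     "O",  "B-LOC", "I-LOC"]
# "Truman" and "Dewey" are back-to-back PER entities, so the second one starts
# a new B-PER span instead of continuing the first with I-PER.
```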
|
|
|
This model was fine-tuned on historical English-language news that had been OCR'd from American newspapers.

Unlike many other NER datasets, this data has highly accurate labels: all data were double-entered by two highly skilled Harvard undergraduates, and all discrepancies were resolved by hand.
|
|
|
|
|
#### Number of training examples per entity type

Dataset|Articles|PER|ORG|LOC|MISC
-|-|-|-|-|-
Train|227|1345|450|1191|1037
Dev|48|231|59|192|149
Test|48|261|83|199|181
|
|
|
|
|
## Training procedure |
|
|
|
The data were used to fine-tune a RoBERTa-large model (Liu et al., 2019) with a learning rate of 4.7e-05 and a batch size of 128 for 184 epochs.
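For reference, a run with these hyperparameters might look roughly like the sketch below. The tiny stand-in dataset and label alignment are placeholders (the hand-labelled newspaper corpus itself is not distributed with this card), so treat it as an outline of the recipe rather than the exact training script.

```python
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

label_list = ["O", "B-MISC", "I-MISC", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained("roberta-large", add_prefix_space=True)
model = AutoModelForTokenClassification.from_pretrained("roberta-large", num_labels=len(label_list))

# Stand-in training example; the real corpus is the hand-labelled newspaper data described above.
enc = tokenizer(["Wolfgang lives in Berlin"], truncation=True)
enc["labels"] = [[-100] * len(enc["input_ids"][0])]  # real labels would align BIO tags to subwords
train_dataset = Dataset.from_dict(dict(enc))

args = TrainingArguments(
    output_dir="historical_newspaper_ner",  # placeholder output directory
    learning_rate=4.7e-5,                   # hyperparameters reported above
    per_device_train_batch_size=128,
    num_train_epochs=184,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
)
# trainer.train()  # uncomment once the real dataset is in place
```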
|
|
|
|
|
## Eval results |
|
Entity type|F1
-|-
PER|94.3
ORG|80.7
LOC|90.8
MISC|79.6
Overall (stringent)|86.5
Overall (ignoring entity type)|90.4
|
|
|
|
|
|
|
|
|
## Notes |
|
|
|
This model card was influenced by that of [dslim/bert-base-NER](https://huggingface.co/dslim/bert-base-NER).
|
|
|
|
|
## Citation |
|
|
|
If you use this model, please cite the following paper:
|
|
|
``` |
|
@misc{franklin2024ndjv, |
|
title={News Deja Vu: Connecting Past and Present with Semantic Search}, |
|
  author={Brevin Franklin and Emily Silcock and Abhishek Arora and Tom Bryan and Melissa Dell},
|
year={2024}, |
|
eprint={2406.15593}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2406.15593}, |
|
} |
|
``` |
|
|
|
|
|
## Applications |
|
|
|
We applied this model to a century of historical news articles. You can see all the named entities in the [NEWSWIRE dataset](https://huggingface.co/datasets/dell-research-harvard/newswire). |
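If you want to browse those outputs programmatically, a minimal sketch for pulling the dataset from the Hub is below; the split name is an assumption, and the record schema is documented on the dataset card rather than here.

```python
from datasets import load_dataset

# Split name is an assumption; see the NEWSWIRE dataset card for configs and fields.
newswire = load_dataset("dell-research-harvard/newswire", split="train")
print(newswire[0])  # inspect one record, including its tagged named entities
```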
|
|
|
|