|
--- |
|
|
|
language: |
|
- multilingual |
|
- af |
|
- am |
|
- ar |
|
- as |
|
- az |
|
- be |
|
- bg |
|
- bm |
|
- bn |
|
- br |
|
- bs |
|
- ca |
|
- cs |
|
- cy |
|
- da |
|
- de |
|
- el |
|
- en |
|
- eo |
|
- es |
|
- et |
|
- eu |
|
- fa |
|
- ff |
|
- fi |
|
- fr |
|
- fy |
|
- ga |
|
- gd |
|
- gl |
|
- gn |
|
- gu |
|
- ha |
|
- he |
|
- hi |
|
- hr |
|
- ht |
|
- hu |
|
- hy |
|
- id |
|
- ig |
|
- is |
|
- it |
|
- ja |
|
- jv |
|
- ka |
|
- kg |
|
- kk |
|
- km |
|
- kn |
|
- ko |
|
- ku |
|
- ky |
|
- la |
|
- lg |
|
- ln |
|
- lo |
|
- lt |
|
- lv |
|
- mg |
|
- mk |
|
- ml |
|
- mn |
|
- mr |
|
- ms |
|
- my |
|
- ne |
|
- nl |
|
- no |
|
- om |
|
- or |
|
- pa |
|
- pl |
|
- ps |
|
- pt |
|
- qu |
|
- ro |
|
- ru |
|
- sa |
|
- sd |
|
- si |
|
- sk |
|
- sl |
|
- so |
|
- sq |
|
- sr |
|
- ss |
|
- su |
|
- sv |
|
- sw |
|
- ta |
|
- te |
|
- th |
|
- ti |
|
- tl |
|
- tn |
|
- tr |
|
- uk |
|
- ur |
|
- uz |
|
- vi |
|
- wo |
|
- xh |
|
- yo |
|
- zh |
|
|
|
|
|
tags: |
|
- retrieval |
|
- entity-retrieval |
|
- named-entity-disambiguation |
|
- entity-disambiguation |
|
- named-entity-linking |
|
- entity-linking |
|
- text2text-generation |
|
--- |
|
|
|
|
|
# mGENRE |
|
|
|
|
|
The historical multilingual named entity linking (NEL) model is based on mGENRE (multilingual Generative ENtity REtrieval) system as presented in [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528). mGENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on finetuned [mBART](https://arxiv.org/abs/2001.08210) architecture. |
|
GENRE performs retrieval generating the unique entity name conditioned on the input text using constrained beam search to only generate valid identifiers. |
|
|
|
This model was finetuned on the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data), composed of the following datasets. |
|
|
|
| Dataset alias | README | Document type | Languages | Suitable for | Project | License | |
|
|---------|---------|---------------|-----------| ---------------|---------------| ---------------| |
|
| ajmc | [link](documentation/README-ajmc.md) | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | [AjMC](https://mromanello.github.io/ajax-multi-commentary/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) | |
|
| hipe2020 | [link](documentation/README-hipe2020.md)| historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | [CLEF-HIPE-2020](https://impresso.github.io/CLEF-HIPE-2020)| [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)| |
|
| topres19th | [link](documentation/README-topres19th.md) | historical newspapers | en | NERC-Coarse, EL |[Living with Machines](https://livingwithmachines.ac.uk/) | [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)| |
|
| newseye | [link](documentation/README-newseye.md)| historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | [NewsEye](https://www.newseye.eu/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)| |
|
| sonar | [link](documentation/README-sonar.md) | historical newspapers | de | NERC-Coarse, EL | [SoNAR](https://sonar.fh-potsdam.de/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)| |
|
|
|
|
|
## BibTeX entry and citation info |
|
|
|
|
|
## Usage |
|
|
|
Here is an example of generation for Wikipedia page disambiguation: |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-hipe-multilingual") |
|
model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-hipe-multilingual").eval() |
|
|
|
sentences = ["[START] United Press [END] - On the home front, the British populace remains steadfast in the face of ongoing air raids.", |
|
"In [START] London [END], trotz der Zerstörung, ist der Geist der Menschen ungebrochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.", |
|
"Les rapports des correspondants de la [START] AFP [END] mettent en lumière la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire."] |
|
|
|
for sentence in sentences: |
|
outputs = model.generate( |
|
**tokenizer([sentence], return_tensors="pt"), |
|
num_beams=5, |
|
num_return_sequences=5 |
|
) |
|
|
|
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) |
|
``` |
|
which outputs the following top-5 predictions (using constrained beam search) |
|
``` |
|
['United Press International >> en ', 'The United Press International >> en ', 'United Press International >> de ', 'United Press >> en ', 'Associated Press >> en '] |
|
['London >> de ', 'London >> de ', 'London >> de ', 'Stadt London >> de ', 'Londonderry >> de '] |
|
['Agence France-Presse >> fr ', 'Agence France-Presse >> fr ', 'Agence France-Presse de la Presse écrite >> fr ', 'Agence France-Presse de la porte de Vincennes >> fr ', 'Agence France-Presse de la porte océanique >> fr '] |
|
``` |
|
|
|
Example with simulated OCR noise: |
|
```python |
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-hipe-multilingual") |
|
model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-hipe-multilingual").eval() |
|
|
|
sentences = ["[START] Un1ted Press [END] - On the h0me fr0nt, the British p0pulace remains steadfast in the f4ce of 0ngoing air raids.", |
|
"In [START] Lon6on [END], trotz d3r Zerstörung, ist der Geist der M3nschen ungeb4ochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.", |
|
"Les rapports des correspondants de la [START] AFP [END] mettent en lumiére la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire."] |
|
|
|
for sentence in sentences: |
|
outputs = model.generate( |
|
**tokenizer([sentence], return_tensors="pt"), |
|
num_beams=5, |
|
num_return_sequences=5 |
|
) |
|
|
|
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)) |
|
``` |
|
|
|
``` |
|
['United Press International >> en ', 'Un1ted Press >> en ', 'Joseph Bradley Varnum >> en ', 'The Press >> en ', 'The Unused Press >> en '] |
|
['London >> de ', 'Longbourne >> de ', 'Longbon >> de ', 'Longston >> de ', 'Lyon >> de '] |
|
['Agence France-Presse >> fr ', 'Agence France-Presse >> fr ', 'Agence France-Presse de la Presse écrite >> fr ', 'Agence France-Presse de la porte de Vincennes >> fr ', 'Agence France-Presse de la porte océanique >> fr '] |
|
``` |
|
|
|
--- |
|
license: agpl-3.0 |
|
--- |
|
|
|
|