File size: 5,771 Bytes
debc4ef
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a76cf3f
 
debc4ef
57b93e7
a76cf3f
 
 
 
 
 
 
 
debc4ef
 
 
 
 
 
 
19638c4
5b2d2fe
 
a72b9b3
5b2d2fe
964213b
 
 
 
 
5b2d2fe
beedfc5
 
 
 
964213b
 
 
 
beedfc5
964213b
 
 
beedfc5
 
 
964213b
 
 
beedfc5
 
c381a7f
 
 
debc4ef
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
---

language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bm
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- ff
- fi
- fr
- fy
- ga
- gd
- gl
- gn
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- is
- it
- ja
- jv
- ka
- kg
- kk
- km
- kn
- ko
- ku
- ky
- la
- lg
- ln
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- qu
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- ss
- su
- sv
- sw
- ta
- te
- th
- ti
- tl
- tn
- tr
- uk
- ur
- uz
- vi
- wo
- xh
- yo
- zh


tags:
- retrieval
- entity-retrieval
- named-entity-disambiguation
- entity-disambiguation
- named-entity-linking
- entity-linking
- text2text-generation
---


# mGENRE 


The historical multilingual named entity linking (NEL) model is based on mGENRE (multilingual Generative ENtity REtrieval) system as presented in [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528). mGENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on finetuned [mBART](https://arxiv.org/abs/2001.08210) architecture. 
GENRE performs retrieval generating the unique entity name conditioned on the input text using constrained beam search to only generate valid identifiers. 

This model was finetuned on the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data), composed of the following datasets.

| Dataset alias | README | Document type | Languages |  Suitable for | Project | License |
|---------|---------|---------------|-----------| ---------------|---------------| ---------------|
| ajmc       | [link](documentation/README-ajmc.md)  | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | [AjMC](https://mromanello.github.io/ajax-multi-commentary/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |
| hipe2020   | [link](documentation/README-hipe2020.md)| historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | [CLEF-HIPE-2020](https://impresso.github.io/CLEF-HIPE-2020)| [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)|
| topres19th | [link](documentation/README-topres19th.md) | historical newspapers | en | NERC-Coarse, EL |[Living with Machines](https://livingwithmachines.ac.uk/) | [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/)|
| newseye    | [link](documentation/README-newseye.md)|  historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL |  [NewsEye](https://www.newseye.eu/) |  [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)|
| sonar      | [link](documentation/README-sonar.md) | historical newspapers  | de | NERC-Coarse, EL |  [SoNAR](https://sonar.fh-potsdam.de/)  | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)|


## BibTeX entry and citation info


## Usage

Here is an example of generation for Wikipedia page disambiguation with simulated OCR noise:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline

NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"

# Load the tokenizer and model from the specified pre-trained model name
# The model used here is "https://huggingface.co/impresso-project/nel-mgenre-multilingual"
nel_tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-mgenre-multilingual")

sentences = ["[START] Un1ted Press [END] - On the h0me fr0nt, the British p0pulace remains steadfast in the f4ce of 0ngoing air raids.",
             "In [START] Lon6on [END], trotz d3r Zerstörung, ist der Geist der M3nschen ungeb4ochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.",
             "Les rapports des correspondants de la [START] AFP [END] mettent en lumiére la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire."]

nel_pipeline = pipeline("generic-nel", model=NEL_MODEL_NAME, 
                        tokenizer=nel_tokenizer, 
                        trust_remote_code=True,
                        device='cpu')
for sentence in sentences:
    print(sentence)
    linked_entity = nel_pipeline(sentence)
    print(linked_entity)
```

```
[{'title': 'United Press International', 'qid': 'Q493845', 'url': 'https://en.wikipedia.org/wiki/United_Press_International'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'Joseph Bradley Varnum', 'qid': 'Q1706673', 'url': 'https://en.wikipedia.org/wiki/Joseph_Bradley_Varnum'}, {'title': 'The Press', 'qid': 'Q2413590', 'url': 'https://en.wikipedia.org/wiki/The_Press'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}]
[{'title': 'London', 'qid': 'Q84', 'url': 'https://en.wikipedia.org/wiki/London'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'Lyon', 'qid': 'Q456', 'url': 'https://en.wikipedia.org/wiki/Lyon'}]
[{'title': 'Agence France-Presse', 'qid': 'Q40464', 'url': 'https://en.wikipedia.org/wiki/Agence_France-Presse'}, {'title': 'Agence France-Presse', 'qid': 'Q40464', 'url': 'https://en.wikipedia.org/wiki/Agence_France-Presse'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}]
```

---
license: agpl-3.0
---