
	---

	language:
	- multilingual
	- af
	- am
	- ar
	- as
	- az
	- be
	- bg
	- bm
	- bn
	- br
	- bs
	- ca
	- cs
	- cy
	- da
	- de
	- el
	- en
	- eo
	- es
	- et
	- eu
	- fa
	- ff
	- fi
	- fr
	- fy
	- ga
	- gd
	- gl
	- gn
	- gu
	- ha
	- he
	- hi
	- hr
	- ht
	- hu
	- hy
	- id
	- ig
	- is
	- it
	- ja
	- jv
	- ka
	- kg
	- kk
	- km
	- kn
	- ko
	- ku
	- ky
	- la
	- lg
	- ln
	- lo
	- lt
	- lv
	- mg
	- mk
	- ml
	- mn
	- mr
	- ms
	- my
	- ne
	- nl
	- no
	- om
	- or
	- pa
	- pl
	- ps
	- pt
	- qu
	- ro
	- ru
	- sa
	- sd
	- si
	- sk
	- sl
	- so
	- sq
	- sr
	- ss
	- su
	- sv
	- sw
	- ta
	- te
	- th
	- ti
	- tl
	- tn
	- tr
	- uk
	- ur
	- uz
	- vi
	- wo
	- xh
	- yo
	- zh


	tags:
	- retrieval
	- entity-retrieval
	- named-entity-disambiguation
	- entity-disambiguation
	- named-entity-linking
	- entity-linking
	- text2text-generation
	---


	# mGENRE


	This historical multilingual named entity linking (NEL) model is based on the mGENRE (multilingual Generative ENtity REtrieval) system presented in [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528). mGENRE takes a sequence-to-sequence approach to entity retrieval (e.g., entity linking), built on a fine-tuned [mBART](https://arxiv.org/abs/2001.08210) architecture.
	mGENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search so that only valid identifiers are generated.
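
To illustrate the idea behind constrained beam search, here is a minimal sketch, not the implementation used by mGENRE or by this model. It assumes the valid entity names have already been tokenized into id sequences (the ids below are made up) and stores them in a small prefix tree. The `Trie` class and the toy data are hypothetical. The callback follows the `prefix_allowed_tokens_fn(batch_id, input_ids)` shape accepted by `generate` in `transformers`. In real use, the stored sequences would also have to include whatever start tokens the decoder emits.

```python
from typing import Dict, Iterable, List, Sequence


class Trie:
    """Prefix tree over token-id sequences (hypothetical helper for illustration)."""

    def __init__(self, sequences: Iterable[Sequence[int]] = ()) -> None:
        self.root: Dict[int, dict] = {}
        for sequence in sequences:
            self.add(sequence)

    def add(self, sequence: Sequence[int]) -> None:
        # Walk down the tree, creating child nodes as needed.
        node = self.root
        for token in sequence:
            node = node.setdefault(token, {})

    def next_tokens(self, prefix: Sequence[int]) -> List[int]:
        """Return the token ids that may follow `prefix`, or [] if the prefix is unknown."""
        node = self.root
        for token in prefix:
            if token not in node:
                return []
            node = node[token]
        return sorted(node)


# Toy data: made-up token ids standing in for tokenized entity names.
valid_names = [
    [5, 8, 2],
    [5, 9, 2],
    [7, 3, 2],
]
trie = Trie(valid_names)


def prefix_allowed_tokens_fn(batch_id: int, input_ids) -> List[int]:
    """Restrict the next token to continuations that keep the output inside the trie."""
    # `generate` passes a tensor; plain lists are accepted here for the demo.
    ids = input_ids.tolist() if hasattr(input_ids, "tolist") else list(input_ids)
    return trie.next_tokens(ids)


print(trie.next_tokens([]))      # [5, 7]
print(trie.next_tokens([5]))     # [8, 9]
print(trie.next_tokens([5, 8]))  # [2]
```

At each decoding step the beam search may only pick from the returned ids, so every finished hypothesis is one of the stored names.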

	This model was fine-tuned on the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data), which comprises the following datasets:

	| Dataset alias | README | Document type | Languages | Suitable for | Project | License |
	|---------------|--------|---------------|-----------|--------------|---------|---------|
	| ajmc | [link](documentation/README-ajmc.md) | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | [AjMC](https://mromanello.github.io/ajax-multi-commentary/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |
	| hipe2020 | [link](documentation/README-hipe2020.md) | historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | [CLEF-HIPE-2020](https://impresso.github.io/CLEF-HIPE-2020) | [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
	| topres19th | [link](documentation/README-topres19th.md) | historical newspapers | en | NERC-Coarse, EL | [Living with Machines](https://livingwithmachines.ac.uk/) | [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
	| newseye | [link](documentation/README-newseye.md) | historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | [NewsEye](https://www.newseye.eu/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |
	| sonar | [link](documentation/README-sonar.md) | historical newspapers | de | NERC-Coarse, EL | [SoNAR](https://sonar.fh-potsdam.de/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |


	## BibTeX entry and citation info


	## Usage

	Here is an example of generation for Wikipedia page disambiguation:

	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-hipe-multilingual")
	model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-hipe-multilingual").eval()

	# The mention to link is wrapped in [START] ... [END] markers.
	sentences = [
	    "[START] United Press [END] - On the home front, the British populace remains steadfast in the face of ongoing air raids.",
	    "In [START] London [END], trotz der Zerstörung, ist der Geist der Menschen ungebrochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.",
	    "Les rapports des correspondants de la [START] AFP [END] mettent en lumière la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire.",
	]

	for sentence in sentences:
	    # Generate five candidates per sentence with beam search.
	    outputs = model.generate(
	        **tokenizer([sentence], return_tensors="pt"),
	        num_beams=5,
	        num_return_sequences=5,
	    )
	    # Decode the candidates into "entity name >> language" strings.
	    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
	```
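	This prints the top-5 predictions for each sentence. Note that the snippet runs plain beam search: no `prefix_allowed_tokens_fn` is passed to `generate`, so candidates are not restricted to valid identifiers.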
	```
	['United Press International >> en ', 'The United Press International >> en ', 'United Press International >> de ', 'United Press >> en ', 'Associated Press >> en ']
	['London >> de ', 'London >> de ', 'London >> de ', 'Stadt London >> de ', 'Londonderry >> de ']
	['Agence France-Presse >> fr ', 'Agence France-Presse >> fr ', 'Agence France-Presse de la Presse écrite >> fr ', 'Agence France-Presse de la porte de Vincennes >> fr ', 'Agence France-Presse de la porte océanique >> fr ']
	```
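
Each prediction has the form `entity name >> language`, and the same candidate can appear in several beams. The following sketch shows one way to post-process the decoded strings. The helper names are hypothetical, and the URL pattern `https://<lang>.wikipedia.org/wiki/<Title>` is an assumption about how a predicted name maps to a Wikipedia page. A generated name is not guaranteed to correspond to an existing page.

```python
from typing import List, Tuple
from urllib.parse import quote


def parse_prediction(prediction: str) -> Tuple[str, str]:
    """Split 'Title >> lang ' into (title, lang)."""
    title, sep, lang = prediction.rpartition(">>")
    if not sep:
        raise ValueError(f"unexpected prediction format: {prediction!r}")
    return title.strip(), lang.strip()


def unique_candidates(predictions: List[str]) -> List[Tuple[str, str]]:
    """Parse predictions and drop duplicates while keeping the beam order."""
    seen = set()
    result = []
    for prediction in predictions:
        candidate = parse_prediction(prediction)
        if candidate not in seen:
            seen.add(candidate)
            result.append(candidate)
    return result


def wikipedia_url(title: str, lang: str) -> str:
    """Build a URL under the assumed scheme; spaces become underscores."""
    return f"https://{lang}.wikipedia.org/wiki/{quote(title.replace(' ', '_'))}"


beams = ['London >> de ', 'London >> de ', 'London >> de ', 'Stadt London >> de ', 'Londonderry >> de ']
candidates = unique_candidates(beams)
print(candidates)
# [('London', 'de'), ('Stadt London', 'de'), ('Londonderry', 'de')]
print(wikipedia_url(*candidates[0]))
# https://de.wikipedia.org/wiki/London
```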

	Example with simulated OCR noise:
	```python
	from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

	tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-hipe-multilingual")
	model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-hipe-multilingual").eval()

	# Same sentences as above, with character substitutions that mimic OCR errors.
	sentences = [
	    "[START] Un1ted Press [END] - On the h0me fr0nt, the British p0pulace remains steadfast in the f4ce of 0ngoing air raids.",
	    "In [START] Lon6on [END], trotz d3r Zerstörung, ist der Geist der M3nschen ungeb4ochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.",
	    "Les rapports des correspondants de la [START] AFP [END] mettent en lumiére la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire.",
	]

	for sentence in sentences:
	    # Generate five candidates per sentence with beam search.
	    outputs = model.generate(
	        **tokenizer([sentence], return_tensors="pt"),
	        num_beams=5,
	        num_return_sequences=5,
	    )
	    # Decode the candidates into "entity name >> language" strings.
	    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
	```
	This prints the following top-5 predictions:

	```
	['United Press International >> en ', 'Un1ted Press >> en ', 'Joseph Bradley Varnum >> en ', 'The Press >> en ', 'The Unused Press >> en ']
	['London >> de ', 'Longbourne >> de ', 'Longbon >> de ', 'Longston >> de ', 'Lyon >> de ']
	['Agence France-Presse >> fr ', 'Agence France-Presse >> fr ', 'Agence France-Presse de la Presse écrite >> fr ', 'Agence France-Presse de la porte de Vincennes >> fr ', 'Agence France-Presse de la porte océanique >> fr ']
	```
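
In both examples the mention to link is wrapped in `[START]` and `[END]` markers. When mentions come from an upstream NER step as character offsets, a small helper can insert the markers. This is a sketch only: `mark_mention` is a hypothetical function, and the spacing around the markers is copied from the example sentences above.

```python
def mark_mention(text: str, start: int, end: int) -> str:
    """Wrap text[start:end] in the [START] ... [END] markers used in the examples."""
    if not 0 <= start < end <= len(text):
        raise ValueError("offsets must satisfy 0 <= start < end <= len(text)")
    return f"{text[:start]}[START] {text[start:end]} [END]{text[end:]}"


sentence = "In London, trotz der Zerstörung, ist der Geist der Menschen ungebrochen."
start = sentence.index("London")
marked = mark_mention(sentence, start, start + len("London"))
print(marked)
# In [START] London [END], trotz der Zerstörung, ist der Geist der Menschen ungebrochen.
```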

	## License

	agpl-3.0