---
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bm
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- ff
- fi
- fr
- fy
- ga
- gd
- gl
- gn
- gu
- ha
- he
- hi
- hr
- ht
- hu
- hy
- id
- ig
- is
- it
- ja
- jv
- ka
- kg
- kk
- km
- kn
- ko
- ku
- ky
- la
- lg
- ln
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- qu
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- ss
- su
- sv
- sw
- ta
- te
- th
- ti
- tl
- tn
- tr
- uk
- ur
- uz
- vi
- wo
- xh
- yo
- zh
tags:
- retrieval
- entity-retrieval
- named-entity-disambiguation
- entity-disambiguation
- named-entity-linking
- entity-linking
- text2text-generation
---

# mGENRE

This historical multilingual named entity linking (NEL) model is based on the mGENRE (multilingual Generative ENtity REtrieval) system presented in [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528). mGENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on a fine-tuned [mBART](https://arxiv.org/abs/2001.08210) architecture. GENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search so that only valid identifiers are generated.

This model was fine-tuned on the [HIPE-2022 dataset](https://github.com/hipe-eval/HIPE-2022-data), which is composed of the following datasets.

| Dataset alias | README | Document type | Languages | Suitable for | Project | License |
|---------|---------|---------------|-----------|---------------|---------------|---------------|
| ajmc | [link](documentation/README-ajmc.md) | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | [AjMC](https://mromanello.github.io/ajax-multi-commentary/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |
| hipe2020 | [link](documentation/README-hipe2020.md) | historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | [CLEF-HIPE-2020](https://impresso.github.io/CLEF-HIPE-2020) | [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
| topres19th | [link](documentation/README-topres19th.md) | historical newspapers | en | NERC-Coarse, EL | [Living with Machines](https://livingwithmachines.ac.uk/) | [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
| newseye | [link](documentation/README-newseye.md) | historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | [NewsEye](https://www.newseye.eu/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |
| sonar | [link](documentation/README-sonar.md) | historical newspapers | de | NERC-Coarse, EL | [SoNAR](https://sonar.fh-potsdam.de/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |

## BibTeX entry and citation info

## Usage

Here is an example of generation for Wikipedia page disambiguation. The entity mention to link is wrapped in `[START]` and `[END]` markers:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the fine-tuned seq2seq model, and switch to inference mode
tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-hipe-multilingual")
model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-hipe-multilingual").eval()

# Each sentence marks exactly one mention with [START] ... [END]
sentences = [
    "[START] United Press [END] - On the home front, the British populace remains steadfast in the face of ongoing air raids.",
    "In [START] London [END], trotz der Zerstörung, ist der Geist der Menschen ungebrochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.",
    "Les rapports des correspondants de la [START] AFP [END] mettent en lumière la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire.",
]

for sentence in sentences:
    # Generate the 5 best candidate entity names with beam search
    outputs = model.generate(
        **tokenizer([sentence], return_tensors="pt"),
        num_beams=5,
        num_return_sequences=5,
    )
    print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

This outputs the following top-5 predictions for each sentence (using constrained beam search):

```
['United Press International >> en ', 'The United Press International >> en ', 'United Press International >> de ', 'United Press >> en ', 'Associated Press >> en ']
['London >> de ', 'London >> de ', 'London >> de ', 'Stadt London >> de ', 'Londonderry >> de ']
['Agence France-Presse >> fr ', 'Agence France-Presse >> fr ', 'Agence France-Presse de la Presse écrite >> fr ', 'Agence France-Presse de la porte de Vincennes >> fr ', 'Agence France-Presse de la porte océanique >> fr ']
```

Example with simulated OCR noise:

```python
from transformers import AutoTokenizer, pipeline

# Model page: https://huggingface.co/impresso-project/nel-mgenre-multilingual
NEL_MODEL_NAME = "impresso-project/nel-mgenre-multilingual"

# Load the tokenizer from the specified pre-trained model name
nel_tokenizer = AutoTokenizer.from_pretrained(NEL_MODEL_NAME)

# Sentences with simulated OCR errors; the mention is still wrapped in [START] ... [END]
sentences = [
    "[START] Un1ted Press [END] - On the h0me fr0nt, the British p0pulace remains steadfast in the f4ce of 0ngoing air raids.",
    "In [START] Lon6on [END], trotz d3r Zerstörung, ist der Geist der M3nschen ungeb4ochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.",
    "Les rapports des correspondants de la [START] AFP [END] mettent en lumiére la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire.",
]

# "generic-nel" is a custom pipeline shipped with the model repository,
# which is why trust_remote_code=True is required
nel_pipeline = pipeline(
    "generic-nel",
    model=NEL_MODEL_NAME,
    tokenizer=nel_tokenizer,
    trust_remote_code=True,
    device="cpu",
)

for sentence in sentences:
    print(sentence)
    linked_entity = nel_pipeline(sentence)
    print(linked_entity)
```

```
[{'title': 'United Press International', 'qid': 'Q493845', 'url': 'https://en.wikipedia.org/wiki/United_Press_International'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'Joseph Bradley Varnum', 'qid': 'Q1706673', 'url': 'https://en.wikipedia.org/wiki/Joseph_Bradley_Varnum'}, {'title': 'The Press', 'qid': 'Q2413590', 'url': 'https://en.wikipedia.org/wiki/The_Press'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}]
[{'title': 'London', 'qid': 'Q84', 'url': 'https://en.wikipedia.org/wiki/London'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'Lyon', 'qid': 'Q456', 'url': 'https://en.wikipedia.org/wiki/Lyon'}]
[{'title': 'Agence France-Presse', 'qid': 'Q40464', 'url': 'https://en.wikipedia.org/wiki/Agence_France-Presse'}, {'title': 'Agence France-Presse', 'qid': 'Q40464', 'url': 'https://en.wikipedia.org/wiki/Agence_France-Presse'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}, {'title': 'NIL', 'qid': 'NIL', 'url': 'None'}]
```

---
license: agpl-3.0
---
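
The model description mentions constrained beam search that only generates valid identifiers. One common way to express such a constraint is a prefix tree (trie) over the token sequences of all valid entity names, queried at each decoding step for the tokens that may come next. The sketch below is only an illustration of that idea and not the code used to train or serve this model: `TokenTrie`, the toy `VOCAB`, and the `encode` helper are hypothetical names, and the token ids are invented rather than taken from the mBART vocabulary.

```python
from typing import Dict, Iterable, List, Sequence


class TokenTrie:
    """Prefix tree over token-id sequences of valid entity identifiers (illustrative)."""

    def __init__(self, sequences: Iterable[Sequence[int]]) -> None:
        self.root: Dict[int, dict] = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed_next(self, prefix: Sequence[int]) -> List[int]:
        """Return the token ids that can extend `prefix` toward a valid identifier."""
        node = self.root
        for tok in prefix:
            if tok not in node:
                return []  # the prefix is not part of any valid identifier
            node = node[tok]
        return sorted(node)


# Toy vocabulary with invented ids (assumption: NOT the real tokenizer vocabulary).
VOCAB = {"London": 1, "Londonderry": 2, ">>": 3, "en": 4, "de": 5, "</s>": 6}


def encode(tokens: Sequence[str]) -> List[int]:
    """Map toy tokens to their toy ids."""
    return [VOCAB[t] for t in tokens]


# Valid identifiers follow the "title >> language" shape seen in the predictions.
trie = TokenTrie([
    encode(["London", ">>", "en", "</s>"]),
    encode(["London", ">>", "de", "</s>"]),
    encode(["Londonderry", ">>", "en", "</s>"]),
])


def prefix_allowed_tokens_fn(batch_id: int, input_ids: Sequence[int]) -> List[int]:
    """Callback with a (batch_id, input_ids) shape; here input_ids is a plain list."""
    return trie.allowed_next(list(input_ids))


print(prefix_allowed_tokens_fn(0, []))                        # [1, 2]
print(prefix_allowed_tokens_fn(0, encode(["London", ">>"])))  # [4, 5]
```

The `generate` method in `transformers` accepts a `prefix_allowed_tokens_fn` argument with this `(batch_id, input_ids)` call shape, where `input_ids` is a tensor of the tokens generated so far. Connecting a trie to the real tokenizer (decoder start tokens, language codes, the full set of entity names) is deliberately left out of this sketch.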