emanuelaboros committed on
Commit 9446d10
1 Parent(s): 43b1307

Update README.md

Files changed (1): README.md (+26 -36)
README.md CHANGED
@@ -123,33 +123,22 @@ tags:
  # mGENRE


- The mGENRE (multilingual Generative ENtity REtrieval) system as presented in [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528), implemented in PyTorch.

- In a nutshell, mGENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on a fine-tuned [mBART](https://arxiv.org/abs/2001.08210) architecture. GENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search to generate only valid identifiers. The model was first released in the [facebookresearch/GENRE](https://github.com/facebookresearch/GENRE) repository using `fairseq` (the `transformers` models are obtained with a conversion script similar to [this one](https://github.com/huggingface/transformers/blob/master/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py)).

- This model was trained on 105 languages from Wikipedia.
  ## BibTeX entry and citation info

- **Please consider citing our works if you use code from this repository.**
-
- ```bibtex
- @article{decao2020multilingual,
-     author = {De Cao, Nicola and Wu, Ledell and Popat, Kashyap and Artetxe, Mikel
-               and Goyal, Naman and Plekhanov, Mikhail and Zettlemoyer, Luke
-               and Cancedda, Nicola and Riedel, Sebastian and Petroni, Fabio},
-     title = "{Multilingual Autoregressive Entity Linking}",
-     journal = {Transactions of the Association for Computational Linguistics},
-     volume = {10},
-     pages = {274-290},
-     year = {2022},
-     month = {03},
-     issn = {2307-387X},
-     doi = {10.1162/tacl_a_00460},
-     url = {https://doi.org/10.1162/tacl\_a\_00460},
-     eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00460/2004070/tacl\_a\_00460.pdf},
- }
- ```
 
  ## Usage
 
@@ -161,23 +150,24 @@ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
  tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-historic-multilingual")
  model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-historic-multilingual").eval()

- sentences = ["[START] United Press [END] - On the home front, the British populace remains steadfast in the face of ongoing air raids. In [START] London [END], despite the destruction, the spirit of the people is unbroken, with volunteers and civil defense units working tirelessly to support the war effort. Reports from [START] BUP [END] correspondents highlight the nationwide push for increased production in factories, essential for supplying the front lines with the materials needed for victory."]
-
- outputs = model.generate(
-     **tokenizer(sentences, return_tensors="pt"),
-     num_beams=5,
-     num_return_sequences=5
- )
-
- tokenizer.batch_decode(outputs, skip_special_tokens=True)
  ```
  which outputs the following top-5 predictions (using constrained beam search):
  ```
- ['Albert Einstein >> it',
-  'Albert Einstein (disambiguation) >> en',
-  'Alfred Einstein >> it',
-  'Alberto Einstein >> it',
-  'Einstein >> it']
  ```

  ---
 
  # mGENRE


+ The historical multilingual named entity linking (NEL) model is based on the mGENRE (multilingual Generative ENtity REtrieval) system presented in [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528). mGENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on a fine-tuned [mBART](https://arxiv.org/abs/2001.08210) architecture.
+ mGENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search to generate only valid identifiers.
+ This model was fine-tuned on the following datasets:
+
+ | Dataset alias | README | Document type | Languages | Suitable for | Project | License |
+ |---------------|--------|---------------|-----------|--------------|---------|---------|
+ | ajmc | [link](documentation/README-ajmc.md) | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | [AjMC](https://mromanello.github.io/ajax-multi-commentary/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |
+ | hipe2020 | [link](documentation/README-hipe2020.md) | historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | [CLEF-HIPE-2020](https://impresso.github.io/CLEF-HIPE-2020) | [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
+ | topres19th | [link](documentation/README-topres19th.md) | historical newspapers | en | NERC-Coarse, EL | [Living with Machines](https://livingwithmachines.ac.uk/) | [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
+ | newseye | [link](documentation/README-newseye.md) | historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | [NewsEye](https://www.newseye.eu/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |
+ | sonar | [link](documentation/README-sonar.md) | historical newspapers | de | NERC-Coarse, EL | [SoNAR](https://sonar.fh-potsdam.de/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |

  ## BibTeX entry and citation info
  ## Usage

  tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-historic-multilingual")
  model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-historic-multilingual").eval()

+ sentences = ["[START] United Press [END] - On the home front, the British populace remains steadfast in the face of ongoing air raids.",
+              "In [START] London [END], trotz der Zerstörung, ist der Geist der Menschen ungebrochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.",
+              "Les rapports des correspondants de la [START] AFP [END] mettent en lumière la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire."]
+
+ for sentence in sentences:
+     outputs = model.generate(
+         **tokenizer([sentence], return_tensors="pt"),
+         num_beams=5,
+         num_return_sequences=5
+     )
+
+     print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
  ```
  which outputs the following top-5 predictions (using constrained beam search):
  ```
+ ['United Press International >> en ', 'The United Press International >> en ', 'United Press International >> de ', 'United Press >> en ', 'Associated Press >> en ']
+ ['London >> de ', 'London >> de ', 'London >> de ', 'Stadt London >> de ', 'Londonderry >> de ']
+ ['Agence France-Presse >> fr ', 'Agence France-Presse >> fr ', 'Agence France-Presse de la Presse écrite >> fr ', 'Agence France-Presse de la porte de Vincennes >> fr ', 'Agence France-Presse de la porte océanique >> fr ']
  ```
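
+ Each prediction pairs a Wikipedia title with a language code, separated by `>>`. If you need the two parts separately, a small helper (hypothetical, not part of the model card or the model's API) can split the raw strings into `(title, language)` tuples:

```python
# Split mGENRE-style predictions like 'London >> de ' into (title, language).
# `parse_prediction` is an illustrative helper, not part of the model's API.

def parse_prediction(pred: str) -> tuple[str, str]:
    title, lang = pred.rsplit(">>", 1)  # split on the last '>>' only
    return title.strip(), lang.strip()

preds = ['United Press International >> en ', 'London >> de ']
print([parse_prediction(p) for p in preds])
# -> [('United Press International', 'en'), ('London', 'de')]
```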
  ---