emanuelaboros committed on
Commit 9446d10
1 Parent(s): 43b1307

Update README.md

Files changed (1): README.md (+26 -36)
README.md CHANGED
@@ -123,33 +123,22 @@ tags:
  # mGENRE


- The mGENRE (multilingual Generative ENtity REtrieval) system as presented in [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528), implemented in PyTorch.

- In a nutshell, mGENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on a fine-tuned [mBART](https://arxiv.org/abs/2001.08210) architecture. GENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search to generate only valid identifiers. The model was first released in the [facebookresearch/GENRE](https://github.com/facebookresearch/GENRE) repository using `fairseq` (the `transformers` models are obtained with a conversion script similar to [this one](https://github.com/huggingface/transformers/blob/master/src/transformers/models/bart/convert_bart_original_pytorch_checkpoint_to_pytorch.py)).

- This model was trained on 105 languages from Wikipedia.
  ## BibTeX entry and citation info

- **Please consider citing our works if you use code from this repository.**
-
- ```bibtex
- @article{decao2020multilingual,
-     author = {De Cao, Nicola and Wu, Ledell and Popat, Kashyap and Artetxe, Mikel
-               and Goyal, Naman and Plekhanov, Mikhail and Zettlemoyer, Luke
-               and Cancedda, Nicola and Riedel, Sebastian and Petroni, Fabio},
-     title = "{Multilingual Autoregressive Entity Linking}",
-     journal = {Transactions of the Association for Computational Linguistics},
-     volume = {10},
-     pages = {274-290},
-     year = {2022},
-     month = {03},
-     issn = {2307-387X},
-     doi = {10.1162/tacl_a_00460},
-     url = {https://doi.org/10.1162/tacl\_a\_00460},
-     eprint = {https://direct.mit.edu/tacl/article-pdf/doi/10.1162/tacl\_a\_00460/2004070/tacl\_a\_00460.pdf},
- }
- ```
 
  ## Usage
 
@@ -161,23 +150,24 @@ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
  tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-historic-multilingual")
  model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-historic-multilingual").eval()

- sentences = ["[START] United Press [END] - On the home front, the British populace remains steadfast in the face of ongoing air raids. In [START] London [END], despite the destruction, the spirit of the people is unbroken, with volunteers and civil defense units working tirelessly to support the war effort. Reports from [START] BUP [END] correspondents highlight the nationwide push for increased production in factories, essential for supplying the front lines with the materials needed for victory."]
-
- outputs = model.generate(
-     **tokenizer(sentences, return_tensors="pt"),
-     num_beams=5,
-     num_return_sequences=5
- )
-
- tokenizer.batch_decode(outputs, skip_special_tokens=True)
  ```
  which outputs the following top-5 predictions (using constrained beam search):
  ```
- ['Albert Einstein >> it',
-  'Albert Einstein (disambiguation) >> en',
-  'Alfred Einstein >> it',
-  'Alberto Einstein >> it',
-  'Einstein >> it']
  ```

  ---
 
  # mGENRE


+ The historical multilingual named entity linking (NEL) model is based on the mGENRE (multilingual Generative ENtity REtrieval) system presented in [Multilingual Autoregressive Entity Linking](https://arxiv.org/abs/2103.12528). mGENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on a fine-tuned [mBART](https://arxiv.org/abs/2001.08210) architecture.
+ mGENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search to generate only valid identifiers.
+ This model was fine-tuned on the following datasets:
+
+ | Dataset alias | README | Document type | Languages | Suitable for | Project | License |
+ |---------------|--------|---------------|-----------|--------------|---------|---------|
+ | ajmc | [link](documentation/README-ajmc.md) | classical commentaries | de, fr, en | NERC-Coarse, NERC-Fine, EL | [AjMC](https://mromanello.github.io/ajax-multi-commentary/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |
+ | hipe2020 | [link](documentation/README-hipe2020.md) | historical newspapers | de, fr, en | NERC-Coarse, NERC-Fine, EL | [CLEF-HIPE-2020](https://impresso.github.io/CLEF-HIPE-2020) | [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
+ | topres19th | [link](documentation/README-topres19th.md) | historical newspapers | en | NERC-Coarse, EL | [Living with Machines](https://livingwithmachines.ac.uk/) | [![License: CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
+ | newseye | [link](documentation/README-newseye.md) | historical newspapers | de, fi, fr, sv | NERC-Coarse, NERC-Fine, EL | [NewsEye](https://www.newseye.eu/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |
+ | sonar | [link](documentation/README-sonar.md) | historical newspapers | de | NERC-Coarse, EL | [SoNAR](https://sonar.fh-potsdam.de/) | [![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/) |

  ## BibTeX entry and citation info
  ## Usage

  tokenizer = AutoTokenizer.from_pretrained("impresso-project/nel-historic-multilingual")
  model = AutoModelForSeq2SeqLM.from_pretrained("impresso-project/nel-historic-multilingual").eval()

+ sentences = ["[START] United Press [END] - On the home front, the British populace remains steadfast in the face of ongoing air raids.",
+              "In [START] London [END], trotz der Zerstörung, ist der Geist der Menschen ungebrochen, mit Freiwilligen und zivilen Verteidigungseinheiten, die unermüdlich arbeiten, um die Kriegsanstrengungen zu unterstützen.",
+              "Les rapports des correspondants de la [START] AFP [END] mettent en lumière la poussée nationale pour augmenter la production dans les usines, essentielle pour fournir au front les matériaux nécessaires à la victoire."]
+
+ for sentence in sentences:
+     outputs = model.generate(
+         **tokenizer([sentence], return_tensors="pt"),
+         num_beams=5,
+         num_return_sequences=5
+     )
+
+     print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
  ```
  which outputs the following top-5 predictions (using constrained beam search):
  ```
+ ['United Press International >> en ', 'The United Press International >> en ', 'United Press International >> de ', 'United Press >> en ', 'Associated Press >> en ']
+ ['London >> de ', 'London >> de ', 'London >> de ', 'Stadt London >> de ', 'Londonderry >> de ']
+ ['Agence France-Presse >> fr ', 'Agence France-Presse >> fr ', 'Agence France-Presse de la Presse écrite >> fr ', 'Agence France-Presse de la porte de Vincennes >> fr ', 'Agence France-Presse de la porte océanique >> fr ']
  ```
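
+ Each prediction pairs a Wikipedia title with a language code, separated by `>>`. If you need the two parts separately, a small helper (hypothetical, not part of the model card or the model's API) can split the raw strings into `(title, language)` tuples:

```python
# Split mGENRE-style predictions like 'London >> de ' into (title, language).
# `parse_prediction` is an illustrative helper, not part of the model's API.

def parse_prediction(pred: str) -> tuple[str, str]:
    title, lang = pred.rsplit(">>", 1)  # split on the last '>>' only
    return title.strip(), lang.strip()

preds = ['United Press International >> en ', 'London >> de ']
print([parse_prediction(p) for p in preds])
# -> [('United Press International', 'en'), ('London', 'de')]
```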
  ---