---
language:
- multilingual
- pl
- ru
- uk
- bg
- cs
- sl
datasets:
- SlavicNER
license: apache-2.0
library_name: transformers
pipeline_tag: text2text-generation
tags:
- entity linking
widget:
- text: pl:Polsce
  example_title: Polish
- text: cs:Velké Británii
  example_title: Czech
- text: bg:българите
  example_title: Bulgarian
- text: ru:Великобританию
  example_title: Russian
- text: sl:evropske komisije
  example_title: Slovene
- text: uk:Європейського агентства лікарських засобів
  example_title: Ukrainian
---

# Model description

This is a baseline model for named entity **linking** trained on the cross-topic split of the
[SlavicNER corpus](https://github.com/SlavicNLP/SlavicNER).
Given an entity mention prefixed with its language code, it generates the entity's cross-lingual identifier.

# Resources and Technical Documentation

- Paper: [Cross-lingual Named Entity Corpus for Slavic Languages](https://arxiv.org/pdf/2404.00482), published at LREC-COLING 2024.
- Annotation guidelines: https://arxiv.org/pdf/2404.00482
- SlavicNER Corpus: https://github.com/SlavicNLP/SlavicNER

# Evaluation

*Will appear soon*
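
# Usage

You can use this model directly with a `text2text-generation` pipeline. Each input is an entity mention prefixed with its language code (`lang:mention`), and each output is the predicted entity identifier: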

```python
from transformers import pipeline

model_name = "SlavicNLP/slavicner-linking-cross-topic-large"
pipe = pipeline("text2text-generation", model=model_name)

# Each input is "<language code>:<entity mention>".
texts = ["pl:Polsce", "cs:Velké Británii", "bg:българите", "ru:Великобританию",
         "sl:evropske komisije", "uk:Європейського агентства лікарських засобів"]

outputs = pipe(texts)

# The pipeline returns one dict per input; the identifier is in "generated_text".
ids = [o["generated_text"] for o in outputs]
print(ids)
# ['GPE-Poland', 'GPE-Great-Britain', 'GPE-Bulgaria', 'GPE-Great-Britain',
#  'ORG-European-Commission', 'ORG-EMA-European-Medicines-Agency']
```
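The inputs and outputs above appear to follow two simple conventions: a `lang:mention` prefix on the input, and an identifier of the form `CATEGORY-Name` on the output. The helpers below are a minimal sketch of building inputs and splitting identifiers under that assumption. The names `format_input` and `split_entity_id` are hypothetical and not part of the model or of `transformers`. The set of language codes is taken from this card's metadata, and the split on the first hyphen is inferred only from the sample outputs.

```python
from typing import Tuple

# Language codes listed in this model card's metadata.
SUPPORTED_LANGS = {"pl", "ru", "uk", "bg", "cs", "sl"}


def format_input(lang: str, mention: str) -> str:
    """Build a model input string in the "lang:mention" form used in the examples."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"Unsupported language code: {lang!r}")
    return f"{lang}:{mention}"


def split_entity_id(entity_id: str) -> Tuple[str, str]:
    """Split an identifier such as "GPE-Great-Britain" into (category, name).

    Assumption: the category is everything before the first hyphen.
    """
    category, sep, name = entity_id.partition("-")
    if not sep or not category or not name:
        raise ValueError(f"Unexpected identifier format: {entity_id!r}")
    return category, name


if __name__ == "__main__":
    print(format_input("pl", "Polsce"))
    print(split_entity_id("ORG-EMA-European-Medicines-Agency"))
```

Strings built with `format_input` can be passed to `pipe(...)` as in the example above, and `split_entity_id` can be applied to each `generated_text` value.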

# Citation

```bibtex
@inproceedings{piskorski-etal-2024-cross-lingual,
    title = "Cross-lingual Named Entity Corpus for {S}lavic Languages",
    author = "Piskorski, Jakub and
      Marci{\'n}czuk, Micha{\l} and
      Yangarber, Roman",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.369",
    pages = "4143--4157",
    abstract = "This paper presents a corpus manually annotated with named entities for six Slavic languages {---} Bulgarian, Czech, Polish, Slovenian, Russian,
      and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017{--}2023 as a part of the Workshops on Slavic Natural
      Language Processing. The corpus consists of 5,017 documents on seven topics. The documents are annotated with five classes of named entities.
      Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits
      {---} single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture
      with the pre-trained multilingual models {---} XLM-RoBERTa-large for named entity mention recognition and categorization,
      and mT5-large for named entity lemmatization and linking.",
}
```

# Contact

Michał Marcińczuk (marcinczuk@gmail.com)