---
language:
- multilingual
- pl
- ru
- uk
- bg
- cs
- sl
datasets:
- SlavicNER
license: apache-2.0
library_name: transformers
pipeline_tag: token-classification
tags:
- ner
- named entity recognition
widget:
- text: Nie jest za późno, aby powstrzymać Brexit, a Wielka Brytania wciąż może zmienić zdanie - powiedział przewodniczący Rady Europejskiej eurodeputowanym w Strasburgu
  example_title: Polish
---

# Model description

This is a baseline model for named entity **recognition**, trained on the cross-topic split of the
[SlavicNER corpus](https://github.com/SlavicNLP/SlavicNER).

# Resources and Technical Documentation

- Paper: [Cross-lingual Named Entity Corpus for Slavic Languages](https://arxiv.org/pdf/2404.00482), published in the proceedings of LREC-COLING 2024.
- Annotation guidelines: https://arxiv.org/pdf/2404.00482
- SlavicNER Corpus: https://github.com/SlavicNLP/SlavicNER

# Evaluation

*Will appear soon*

# Usage

```python
from transformers import pipeline

model = "SlavicNLP/slavicner-ner-cross-topic-large"

text = """Nie jest za późno, aby powstrzymać Brexit, a Wielka Brytania wciąż
może zmienić zdanie - powiedział przewodniczący Rady Europejskiej
eurodeputowanym w Strasburgu"""

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans.
pipe = pipeline("ner", model, aggregation_strategy="simple")

# Each result is a dict with the entity type, confidence score, surface form,
# and character offsets into `text`.
entities = pipe(text)

print(*entities, sep="\n")
# {'entity_group': 'EVT', 'score': 0.99720407, 'word': 'Brexit', 'start': 35, 'end': 41}
# {'entity_group': 'LOC', 'score': 0.9656372, 'word': 'Wielka Brytania', 'start': 45, 'end': 60}
# {'entity_group': 'ORG', 'score': 0.9977708, 'word': 'Rady Europejskiej', 'start': 115, 'end': 132}
# {'entity_group': 'LOC', 'score': 0.95184135, 'word': 'Strasburgu', 'start': 151, 'end': 161}
```
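
The pipeline returns a flat list of dictionaries, so downstream code usually needs a little post-processing. The sketch below is one possible way to do that, not part of the model's API: it reuses the sample output shown in the usage example, filters mentions by confidence, groups them by entity type, and checks that `start` and `end` are character offsets into the input text. The helper name `group_entities` and the `min_score` threshold of 0.96 are illustrative assumptions.

```python
from collections import defaultdict

# Input text and sample pipeline output copied from the usage example,
# so this sketch runs without downloading the model.
text = (
    "Nie jest za późno, aby powstrzymać Brexit, a Wielka Brytania wciąż\n"
    "może zmienić zdanie - powiedział przewodniczący Rady Europejskiej\n"
    "eurodeputowanym w Strasburgu"
)
entities = [
    {"entity_group": "EVT", "score": 0.99720407, "word": "Brexit", "start": 35, "end": 41},
    {"entity_group": "LOC", "score": 0.9656372, "word": "Wielka Brytania", "start": 45, "end": 60},
    {"entity_group": "ORG", "score": 0.9977708, "word": "Rady Europejskiej", "start": 115, "end": 132},
    {"entity_group": "LOC", "score": 0.95184135, "word": "Strasburgu", "start": 151, "end": 161},
]


def group_entities(entities, min_score=0.0):
    """Hypothetical helper: group mention strings by entity type,
    keeping only mentions whose score is at least min_score."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)


# The 0.96 threshold is an arbitrary value chosen for illustration.
grouped = group_entities(entities, min_score=0.96)
print(grouped)

# The offsets index into the original string, so slicing recovers each mention.
spans = [text[ent["start"]:ent["end"]] for ent in entities]
print(spans)
```

With real pipeline output, the same helper can be applied directly to the list returned by `pipe(text)`; a suitable threshold depends on the application and should be tuned on held-out data.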

# Citation

```bibtex
@inproceedings{piskorski-etal-2024-cross-lingual,
    title = "Cross-lingual Named Entity Corpus for {S}lavic Languages",
    author = "Piskorski, Jakub and
      Marci{\'n}czuk, Micha{\l} and
      Yangarber, Roman",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.369",
    pages = "4143--4157",
    abstract = "This paper presents a corpus manually annotated with named entities for six Slavic languages {---} Bulgarian, Czech, Polish, Slovenian, Russian,
      and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017{--}2023 as a part of the Workshops on Slavic Natural
      Language Processing. The corpus consists of 5,017 documents on seven topics. The documents are annotated with five classes of named entities.
      Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits
      {---} single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture
      with the pre-trained multilingual models {---} XLM-RoBERTa-large for named entity mention recognition and categorization,
      and mT5-large for named entity lemmatization and linking.",
}
```