---
language:
- multilingual
- pl
- ru
- uk
- bg
- cs
- sl
datasets:
- SlavicNER
license: apache-2.0
library_name: transformers
pipeline_tag: text2text-generation
tags:
- lemmatization
widget:
  - text: "pl:Polsce"
    example_title: "Polish"
  - text: "cs:Velké Británii"
    example_title: "Czech"
  - text: "bg:българите"
    example_title: "Bulgarian"
  - text: "ru:Великобританию"
    example_title: "Russian"
  - text: "sl:evropske komisije"
    example_title: "Slovene"
  - text: "uk:Європейського агентства лікарських засобів"
    example_title: "Ukrainian"
---

# Model description

This is a baseline model for named entity **lemmatization**, trained on the single topic out split of the
[SlavicNER corpus](https://github.com/SlavicNLP/SlavicNER).


# Resources and Technical Documentation

- Paper: [Cross-lingual Named Entity Corpus for Slavic Languages](https://arxiv.org/pdf/2404.00482), published in the proceedings of LREC-COLING 2024.
- Annotation guidelines: https://arxiv.org/pdf/2404.00482
- SlavicNER Corpus: https://github.com/SlavicNLP/SlavicNER


# Evaluation

*Will appear soon*


# Usage

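You can use this model directly with a `text2text-generation` pipeline. As in the examples below, each input is a language code, a colon, and the inflected entity mention (for example `pl:Polsce`); the model generates the corresponding lemma: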

```python
from transformers import pipeline

model_name = "SlavicNLP/slavicner-lemma-single-out-large"
# Load the model and tokenizer from the Hugging Face Hub.
pipe = pipeline("text2text-generation", model=model_name)

# Each input has the form "<language code>:<inflected mention>".
texts = ["pl:Polsce", "cs:Velké Británii", "bg:българите", "ru:Великобританию", "sl:evropske komisije",
         "uk:Європейського агентства лікарських засобів"]

# The pipeline returns one dict per input, with the lemma under 'generated_text'.
outputs = pipe(texts)

lemmas = [o['generated_text'] for o in outputs]
print(lemmas)
# ['Polska', 'Velká Británie', 'българи', 'Великобритания', 'evropska komisija', 'Європейське агентство лікарських засобів']
```
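
If your mentions and their language codes are stored separately, a small helper can build the input strings. The sketch below is not part of the model's API: `format_input` and `SUPPORTED_LANGUAGES` are hypothetical names, the `<lang>:<mention>` format is inferred from the examples in this card, and the set of codes is taken from the card's metadata.

```python
# Language codes listed in this model card's metadata (assumed to be the full supported set).
SUPPORTED_LANGUAGES = {"pl", "ru", "uk", "bg", "cs", "sl"}


def format_input(lang: str, mention: str) -> str:
    """Build a '<lang>:<mention>' input string (hypothetical helper)."""
    lang = lang.strip().lower()
    if lang not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {lang!r}")
    mention = mention.strip()
    if not mention:
        raise ValueError("Mention must not be empty")
    return f"{lang}:{mention}"


mentions = [("pl", "Polsce"), ("cs", "Velké Británii"), ("sl", "evropske komisije")]
texts = [format_input(lang, mention) for lang, mention in mentions]
print(texts)
# ['pl:Polsce', 'cs:Velké Británii', 'sl:evropske komisije']
```

The resulting `texts` list can be passed to the pipeline in the same way as the hand-written list above. Rejecting unknown codes early is a design choice of this sketch; how the model behaves with other prefixes is not documented here.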

# Citation

```bibtex
@inproceedings{piskorski-etal-2024-cross-lingual,
    title = "Cross-lingual Named Entity Corpus for {S}lavic Languages",
    author = "Piskorski, Jakub  and
      Marci{\'n}czuk, Micha{\l}  and
      Yangarber, Roman",
    editor = "Calzolari, Nicoletta  and
      Kan, Min-Yen  and
      Hoste, Veronique  and
      Lenci, Alessandro  and
      Sakti, Sakriani  and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.369",
    pages = "4143--4157",
    abstract = "This paper presents a corpus manually annotated with named entities for six Slavic languages {---} Bulgarian, Czech, Polish, Slovenian, Russian,
                and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017{--}2023 as a part of the Workshops on Slavic Natural
                Language Processing. The corpus consists of 5,017 documents on seven topics. The documents are annotated with five classes of named entities.
                Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits
                {---} single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture
                with the pre-trained multilingual models {---} XLM-RoBERTa-large for named entity mention recognition and categorization,
                and mT5-large for named entity lemmatization and linking.",
}
```

# Contact

Michał Marcińczuk (marcinczuk@gmail.com)