czuk committed on
Commit
f7949ab
1 Parent(s): 19eb25f

Update README.md

  1. README.md +95 -67
README.md CHANGED
@@ -1,67 +1,95 @@
1
- ---
2
- language:
3
- - multilingual
4
- - pl
5
- - ru
6
- - uk
7
- - bg
8
- - cs
9
- - sl
10
- datasets:
11
- - SlavicNER
12
- license: apache-2.0
13
- library_name: transformers
14
- pipeline_tag: text2text-generation
15
- tags:
16
- - lemmatization
17
- widget:
18
- - text: "pl:Polsce"
19
- - text: "cs:Velké Británii"
20
- - text: "bg:българите"
21
- - text: "ru:Великобританию"
22
- - text: "sl:evropske komisije"
23
- - text: "uk:Європейського агентства лікарських засобів"
24
- ---
25
-
26
- # Model description
27
-
28
- This is a baseline model for named entity **lemmatization** trained on the single-out topic split of the
29
- [SlavicNER corpus](https://github.com/SlavicNLP/SlavicNER).
30
-
31
-
32
- # Resources and Technical Documentation
33
-
34
- - Paper: [Cross-lingual Named Entity Corpus for Slavic Languages](https://arxiv.org/pdf/2404.00482), to appear in LREC-COLING 2024.
35
- - Annotation guidelines: https://arxiv.org/pdf/2404.00482
36
- - SlavicNER Corpus: https://github.com/SlavicNLP/SlavicNER
37
-
38
-
39
- # Evaluation
40
-
41
- *Will appear soon*
42
-
43
-
44
- # Usage
45
-
46
- You can use this model directly with a pipeline for text2text generation:
47
-
48
- ```python
49
- from transformers import pipeline
50
-
51
- model_name = "SlavicNLP/slavicner-lemma-cross-topic-large"
52
- pipe = pipeline("text2text-generation", model_name)
53
-
54
- texts = ["pl:Polsce", "cs:Velké Británii", "bg:българите", "ru:Великобританию",
55
- "sl:evropske komisije", "uk:Європейського агентства лікарських засобів"]
56
-
57
- outputs = pipe(texts)
58
-
59
- ids = [o['generated_text'] for o in outputs]
60
- print(ids)
61
- # ['GPE-Poland', 'GPE-Great-Britain', 'GPE-Bulgaria', 'GPE-Great-Britain',
62
- # 'ORG-European-Commission', 'ORG-EMA-European-Medicines-Agency']
63
- ```
64
-
65
- # Citation
66
-
67
- *Will appear soon*
1
+ ---
2
+ language:
3
+ - multilingual
4
+ - pl
5
+ - ru
6
+ - uk
7
+ - bg
8
+ - cs
9
+ - sl
10
+ datasets:
11
+ - SlavicNER
12
+ license: apache-2.0
13
+ library_name: transformers
14
+ pipeline_tag: text2text-generation
15
+ tags:
16
+ - lemmatization
17
+ widget:
18
+ - text: "pl:Polsce"
19
+ - text: "cs:Velké Británii"
20
+ - text: "bg:българите"
21
+ - text: "ru:Великобританию"
22
+ - text: "sl:evropske komisije"
23
+ - text: "uk:Європейського агентства лікарських засобів"
24
+ ---
25
+
26
+ # Model description
27
+
28
+ This is a baseline model for named entity **lemmatization** trained on the single-out topic split of the
29
+ [SlavicNER corpus](https://github.com/SlavicNLP/SlavicNER).
30
+
31
+
32
+ # Resources and Technical Documentation
33
+
34
+ - Paper: [Cross-lingual Named Entity Corpus for Slavic Languages](https://arxiv.org/pdf/2404.00482), to appear in LREC-COLING 2024.
35
+ - Annotation guidelines: https://arxiv.org/pdf/2404.00482
36
+ - SlavicNER Corpus: https://github.com/SlavicNLP/SlavicNER
37
+
38
+
39
+ # Evaluation
40
+
41
+ *Will appear soon*
42
+
43
+
44
+ # Usage
45
+
46
+ You can use this model directly with a pipeline for text2text generation:
47
+
48
+ ```python
49
+ from transformers import pipeline
50
+
51
+ model_name = "SlavicNLP/slavicner-lemma-cross-topic-large"
52
+ pipe = pipeline("text2text-generation", model_name)
53
+
54
+ texts = ["pl:Polsce", "cs:Velké Británii", "bg:българите", "ru:Великобританию",
55
+ "sl:evropske komisije", "uk:Європейського агентства лікарських засобів"]
56
+
57
+ outputs = pipe(texts)
58
+
59
+ ids = [o['generated_text'] for o in outputs]
60
+ print(ids)
61
+ # ['GPE-Poland', 'GPE-Great-Britain', 'GPE-Bulgaria', 'GPE-Great-Britain',
62
+ # 'ORG-European-Commission', 'ORG-EMA-European-Medicines-Agency']
63
+ ```
64
+
65
+ # Citation
66
+
67
+ ```latex
68
+ @inproceedings{piskorski-etal-2024-cross-lingual,
69
+ title = "Cross-lingual Named Entity Corpus for {S}lavic Languages",
70
+ author = "Piskorski, Jakub and
71
+ Marci{\'n}czuk, Micha{\l} and
72
+ Yangarber, Roman",
73
+ editor = "Calzolari, Nicoletta and
74
+ Kan, Min-Yen and
75
+ Hoste, Veronique and
76
+ Lenci, Alessandro and
77
+ Sakti, Sakriani and
78
+ Xue, Nianwen",
79
+ booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
80
+ month = may,
81
+ year = "2024",
82
+ address = "Torino, Italy",
83
+ publisher = "ELRA and ICCL",
84
+ url = "https://aclanthology.org/2024.lrec-main.369",
85
+ pages = "4143--4157",
86
+ abstract = "This paper presents a corpus manually annotated with named entities for six Slavic languages {---} Bulgarian, Czech, Polish, Slovenian, Russian,
87
+ and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017{--}2023 as a part of the Workshops on Slavic Natural
88
+ Language Processing. The corpus consists of 5,017 documents on seven topics. The documents are annotated with five classes of named entities.
89
+ Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits
90
+ {---} single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture
91
+ with the pre-trained multilingual models {---} XLM-RoBERTa-large for named entity mention recognition and categorization,
92
+ and mT5-large for named entity lemmatization and linking.",
93
+ }
94
+ ```
95
+
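The usage snippet in the updated README passes every input as a language code, a colon, and the entity mention (for example `pl:Polsce`). A minimal sketch of a helper that builds such inputs and collects the generated strings might look like the following. The names `SUPPORTED_LANGS`, `format_input`, and `run_batch` are hypothetical and not part of the model card, and rejecting language codes outside the card's metadata list is an assumption rather than documented model behavior.

```python
from typing import Callable, Iterable, List, Tuple

# Language codes listed in the model card metadata (the `language:` field,
# excluding the generic "multilingual" entry).
SUPPORTED_LANGS = {"pl", "ru", "uk", "bg", "cs", "sl"}


def format_input(lang: str, mention: str) -> str:
    """Build the "<lang>:<mention>" string used in the README examples."""
    # Assumption: only the six listed languages are meaningful inputs.
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    return f"{lang}:{mention.strip()}"


def run_batch(pipe: Callable, pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Format (lang, mention) pairs, call the pipeline, return generated texts.

    `pipe` is any callable that accepts a list of strings and returns a list
    of dicts with a "generated_text" key, as the text2text-generation
    pipeline in the README does.
    """
    texts = [format_input(lang, mention) for lang, mention in pairs]
    outputs = pipe(texts)
    return [o["generated_text"] for o in outputs]
```

With the pipeline from the README (`pipe = pipeline("text2text-generation", model_name)`), a call such as `run_batch(pipe, [("pl", "Polsce"), ("cs", "Velké Británii")])` would send the same prefixed strings as the original snippet. Keeping the formatting in one function avoids hand-typing the prefix for each mention.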