	---
	language:
	- multilingual
	- pl
	- ru
	- uk
	- bg
	- cs
	- sl
	datasets:
	- SlavicNER
	license: apache-2.0
	library_name: transformers
	pipeline_tag: text2text-generation
	tags:
	- entity linking
	widget:
	- text: pl:Polsce
	  example_title: Polish
	- text: cs:Velké Británii
	  example_title: Czech
	- text: bg:българите
	  example_title: Bulgarian
	- text: ru:Великобританию
	  example_title: Russian
	- text: sl:evropske komisije
	  example_title: Slovene
	- text: uk:Європейського агентства лікарських засобів
	  example_title: Ukrainian
	---

	# Model description

	This is a baseline model for named entity linking, trained on the single topic out split of the
	[SlavicNER corpus](https://github.com/SlavicNLP/SlavicNER). Given a mention prefixed with its language code,
	it generates a cross-lingual entity identifier.


	# Resources and Technical Documentation

	- Paper: [Cross-lingual Named Entity Corpus for Slavic Languages](https://arxiv.org/pdf/2404.00482), published in the Proceedings of LREC-COLING 2024.
	- Annotation guidelines: https://arxiv.org/pdf/2404.00482
	- SlavicNER Corpus: https://github.com/SlavicNLP/SlavicNER


	# Evaluation

	| Language | Seq2seq | Support |
	|:--------:|:-------:|--------:|
	| PL       | 75.13   | 2 549   |
	| CS       | 77.92   | 1 137   |
	| RU       | 67.56   | 18 018  |
	| BG       | 63.60   | 6 085   |
	| SL       | 76.81   | 7 082   |
	| UK       | 58.94   | 3 085   |
	| All      | 68.75   | 37 956  |
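As a quick sanity check on the table, the sketch below sums the per-language support and computes a support-weighted mean of the per-language scores. The card does not state how the "All" score is aggregated, so the weighted mean here is only an illustrative assumption, not the authors' method.

```python
# (score, support) per language, copied from the evaluation table.
results = {
    "PL": (75.13, 2549),
    "CS": (77.92, 1137),
    "RU": (67.56, 18018),
    "BG": (63.60, 6085),
    "SL": (76.81, 7082),
    "UK": (58.94, 3085),
}

# Total number of evaluated mentions across all six languages.
total_support = sum(support for _, support in results.values())

# Support-weighted mean of the per-language scores (assumed aggregation).
weighted_mean = (
    sum(score * support for score, support in results.values()) / total_support
)

print(total_support)
print(round(weighted_mean, 2))
```

The support values add up to the 37 956 reported in the "All" row. The weighted mean comes out at about 68.77, close to the reported 68.75; since the aggregation is not documented here, the small gap should not be read as an error.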


	# Usage

	You can use this model directly with a pipeline for text2text generation:

	```python
	from transformers import pipeline

	model_name = "SlavicNLP/slavicner-linking-single-out-large"
	pipe = pipeline("text2text-generation", model=model_name)

	# Each input is a mention prefixed with its language code: "<lang>:<mention>".
	texts = [
	    "pl:Polsce",
	    "cs:Velké Británii",
	    "bg:българите",
	    "ru:Великобританию",
	    "sl:evropske komisije",
	    "uk:Європейського агентства лікарських засобів",
	]

	outputs = pipe(texts)

	# The pipeline returns one dict per input; the entity ID is under "generated_text".
	ids = [o["generated_text"] for o in outputs]
	print(ids)
	# ['GPE-Poland', 'GPE-Great-Britain', 'GPE-Bulgaria', 'GPE-Great-Britain',
	#  'ORG-European-Commission', 'ORG-EMA-European-Medicines-Agency']
	```
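The inputs in the example above follow a `<lang>:<mention>` pattern, and the sample outputs look like a category prefix followed by a hyphenated name. The helpers below are a minimal sketch built on those two observations. `build_query` and `split_entity_id` are hypothetical names, not part of the model or of `transformers`, and splitting at the first hyphen is an assumption drawn only from the sample outputs.

```python
from typing import Tuple

# Language codes listed in this card's metadata.
SUPPORTED_LANGS = {"pl", "ru", "uk", "bg", "cs", "sl"}


def build_query(lang: str, mention: str) -> str:
    """Return the "<lang>:<mention>" string used in the usage example."""
    lang = lang.strip().lower()
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language code: {lang!r}")
    mention = mention.strip()
    if not mention:
        raise ValueError("mention must not be empty")
    return f"{lang}:{mention}"


def split_entity_id(entity_id: str) -> Tuple[str, str]:
    """Split an ID such as "GPE-Poland" at the first hyphen (assumed layout)."""
    category, sep, name = entity_id.partition("-")
    if not sep:
        raise ValueError(f"no category prefix found in {entity_id!r}")
    return category, name


print(build_query("pl", "Polsce"))
print(split_entity_id("ORG-EMA-European-Medicines-Agency"))
```

Strings produced by `build_query` can be passed to `pipe(...)` in place of the hand-written list. Generated IDs that do not follow the assumed layout raise a `ValueError` rather than being split silently.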


	# Citation

	```bibtex
	@inproceedings{piskorski-etal-2024-cross-lingual,
	title = "Cross-lingual Named Entity Corpus for {S}lavic Languages",
	author = "Piskorski, Jakub and
	Marci{\'n}czuk, Micha{\l} and
	Yangarber, Roman",
	editor = "Calzolari, Nicoletta and
	Kan, Min-Yen and
	Hoste, Veronique and
	Lenci, Alessandro and
	Sakti, Sakriani and
	Xue, Nianwen",
	booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
	month = may,
	year = "2024",
	address = "Torino, Italy",
	publisher = "ELRA and ICCL",
	url = "https://aclanthology.org/2024.lrec-main.369",
	pages = "4143--4157",
	abstract = "This paper presents a corpus manually annotated with named entities for six Slavic languages {---} Bulgarian, Czech, Polish, Slovenian, Russian,
	and Ukrainian. This work is the result of a series of shared tasks, conducted in 2017{--}2023 as a part of the Workshops on Slavic Natural
	Language Processing. The corpus consists of 5,017 documents on seven topics. The documents are annotated with five classes of named entities.
	Each entity is described by a category, a lemma, and a unique cross-lingual identifier. We provide two train-tune dataset splits
	{---} single topic out and cross topics. For each split, we set benchmarks using a transformer-based neural network architecture
	with the pre-trained multilingual models {---} XLM-RoBERTa-large for named entity mention recognition and categorization,
	and mT5-large for named entity lemmatization and linking.",
	}
	```

	# Contact

	Michał Marcińczuk (marcinczuk@gmail.com)