	---
	license: mit
	language:
	- it
	widget:
	- text: "mi chiamo marco rossi, vivo a roma e lavoro per l'agenzia spaziale italiana"
	  example_title: "Example 1"
	---

	--------------------------------------------------------------------------------------------------

	<body>
	<span class="vertical-text" style="background-color:lightgreen;border-radius: 3px;padding: 3px;"> </span>
	<br>
	<span class="vertical-text" style="background-color:orange;border-radius: 3px;padding: 3px;"> Task: Named Entity Recognition</span>
	<br>
	<span class="vertical-text" style="background-color:lightblue;border-radius: 3px;padding: 3px;"> Model: DeBERTa</span>
	<br>
	<span class="vertical-text" style="background-color:tomato;border-radius: 3px;padding: 3px;"> Lang: IT</span>
	<br>
	<span class="vertical-text" style="background-color:lightgrey;border-radius: 3px;padding: 3px;"> Type: Uncased</span>
	<br>
	<span class="vertical-text" style="background-color:#CF9FFF;border-radius: 3px;padding: 3px;"> </span>
	</body>

	--------------------------------------------------------------------------------------------------

	<h3>Model description</h3>

	This is a <b>DeBERTa</b> <b>[1]</b> uncased model for the <b>Italian</b> language, fine-tuned for <b>Named Entity Recognition</b> (<b>Person</b>, <b>Location</b>, <b>Organization</b> and <b>Miscellaneous</b> classes) on the [WikiNER](https://figshare.com/articles/dataset/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500) dataset <b>[2]</b>, using [mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base) as the pre-trained base model.


	<h3>Training and Performance</h3>

	The model is trained to perform entity recognition over 4 classes: <b>PER</b> (persons), <b>LOC</b> (locations), <b>ORG</b> (organizations) and <b>MISC</b> (miscellaneous, mainly events, products and services). It has been fine-tuned for Named Entity Recognition using the WikiNER Italian dataset plus an additional custom dataset of manually annotated Wikipedia paragraphs.
	The WikiNER dataset was split into 102,352 training instances and 25,588 test instances, and the model was trained for 1 epoch with a constant learning rate of 1e-5.
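
As a quick sanity check on the figures above, the two instance counts correspond to an 80/20 train/test split. The snippet below only redoes that arithmetic; the variable names are illustrative and are not taken from the training code.

```python
# Instance counts as reported in this section
train_instances = 102_352
test_instances = 25_588

# Share of the whole dataset that goes to each split
total = train_instances + test_instances
train_share = train_instances / total
test_share = test_instances / total

print(f"total instances: {total}")
print(f"train share: {train_share:.2%}")
print(f"test share: {test_share:.2%}")
```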

	The model was first fine-tuned on WikiNER, then restricted to the Italian language and made uncased by modifying the embedding layer (as in [3], computing document-level frequencies over the Wikipedia dataset), and finally fine-tuned on an additional dataset of ~3,500 manually annotated lowercase paragraphs.
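
The embedding modification can be pictured as a vocabulary-reduction step in the spirit of [3]: count how many documents each lowercased token appears in, keep the tokens that are frequent enough, and copy only their rows into a smaller embedding matrix. The sketch below is a toy illustration in plain Python under stated assumptions. Whitespace tokenization, the `min_doc_freq` threshold, the special-token list and the helper names `document_frequencies` and `reduce_embeddings` are all hypothetical, and this is not necessarily the procedure used to build this model.

```python
from collections import Counter


def document_frequencies(documents):
    """Count, for each lowercased token, the number of documents containing it."""
    doc_freq = Counter()
    for doc in documents:
        # set() so that a token is counted at most once per document
        doc_freq.update(set(doc.lower().split()))
    return doc_freq


def reduce_embeddings(vocab, embeddings, doc_freq, min_doc_freq=2,
                      special_tokens=("[PAD]", "[UNK]")):
    """Keep special tokens plus tokens seen in at least `min_doc_freq` documents.

    Returns the re-indexed vocabulary and the matching embedding rows.
    """
    new_vocab = {}
    new_embeddings = []
    # Walk the old vocabulary in id order so the relative order is preserved
    for token, old_id in sorted(vocab.items(), key=lambda item: item[1]):
        if token in special_tokens or doc_freq[token] >= min_doc_freq:
            new_vocab[token] = len(new_vocab)
            new_embeddings.append(embeddings[old_id])
    return new_vocab, new_embeddings


# Toy corpus and toy 2-dimensional embedding matrix (illustrative values only)
documents = ["Roma è la capitale", "vivo a Roma", "la missione prisma"]
vocab = {"[PAD]": 0, "[UNK]": 1, "roma": 2, "Roma": 3, "la": 4, "prisma": 5}
embeddings = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7], [0.8, 0.9]]

doc_freq = document_frequencies(documents)
new_vocab, new_embeddings = reduce_embeddings(vocab, embeddings, doc_freq)
print(new_vocab)
print(len(embeddings), "->", len(new_embeddings), "embedding rows")
```

In this toy setup the cased entry `Roma` is dropped because the corpus is lowercased before counting, which is the intuition behind turning a cased vocabulary into an uncased one.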

	<h3>Quick usage</h3>

	```python
	from transformers import AutoModelForTokenClassification, AutoTokenizer
	from transformers import pipeline
	import string

	# Load the tokenizer and the fine-tuned token-classification model
	tokenizer = AutoTokenizer.from_pretrained("osiria/deberta-base-italian-uncased-ner")
	model = AutoModelForTokenClassification.from_pretrained("osiria/deberta-base-italian-uncased-ner", num_labels = 5)

	text = "mi chiamo marco rossi, vivo a roma e lavoro per l'agenzia spaziale italiana nella missione prisma"

	# Surround each punctuation character with spaces before running the model
	for p in string.punctuation:
	    text = text.replace(p, " " + p + " ")

	ner = pipeline("ner", model=model, tokenizer=tokenizer)
	ner(text, aggregation_strategy="simple")

	# Output:
	[{'entity_group': 'PER',
	'score': 0.9929623,
	'word': 'marco rossi',
	'start': 9,
	'end': 21},
	{'entity_group': 'LOC',
	'score': 0.9898509,
	'word': 'roma',
	'start': 31,
	'end': 36},
	{'entity_group': 'ORG',
	'score': 0.9905911,
	'word': 'agenzia spaziale italiana',
	'start': 53,
	'end': 79},
	{'entity_group': 'MISC',
	'score': 0.92474234,
	'word': 'missione prisma',
	'start': 85,
	'end': 101}]
	```
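
Since the pipeline returns a flat list of dictionaries, it can be handy to wrap the punctuation padding in a function and to group the recognized entities by class. The helpers below are a minimal sketch: `pad_punctuation`, `group_entities` and the optional `min_score` threshold are hypothetical names introduced here, and the sample `entities` list is copied from the output shown above so that the snippet runs without downloading the model.

```python
import string
from collections import defaultdict


def pad_punctuation(text):
    """Surround every ASCII punctuation character with spaces."""
    for p in string.punctuation:
        text = text.replace(p, " " + p + " ")
    return text


def group_entities(entities, min_score=0.0):
    """Map each entity class to the words recognized with score >= min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Sample pipeline output, copied from the quick-usage example
entities = [
    {"entity_group": "PER", "score": 0.9929623, "word": "marco rossi", "start": 9, "end": 21},
    {"entity_group": "LOC", "score": 0.9898509, "word": "roma", "start": 31, "end": 36},
    {"entity_group": "ORG", "score": 0.9905911, "word": "agenzia spaziale italiana", "start": 53, "end": 79},
    {"entity_group": "MISC", "score": 0.92474234, "word": "missione prisma", "start": 85, "end": 101},
]

print(pad_punctuation("vivo a roma, in italia"))
print(group_entities(entities))
# Keep only entities the model is more confident about (threshold is arbitrary)
print(group_entities(entities, min_score=0.95))
```

With a real pipeline, `entities` would simply be the return value of `ner(pad_punctuation(text), aggregation_strategy="simple")`.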

	<h3>References</h3>

	[1] https://arxiv.org/abs/2111.09543

	[2] https://www.sciencedirect.com/science/article/pii/S0004370212000276

	[3] https://arxiv.org/abs/2010.05609

	<h3>Limitations</h3>

	This model is mainly trained on Wikipedia, so it is particularly suitable for natively digital text from the web written in a correct and fluent form (like wikis, web pages, news, etc.). However, it may show limitations on chaotic text containing errors and slang expressions (like social media posts) or on domain-specific text (like medical, financial or legal content).

	<h3>License</h3>

	The model is released under the <b>MIT</b> license.