	---
	metrics:
	- accuracy
	pipeline_tag: token-classification
	tags:
	- code
	- map
	- News
	- Customer Support
	- chatbot
	language:
	- de
	- en
	---


	# XLM-RoBERTa Token Classification for Named Entity Recognition (NER)

	### Model Description
	This model is a fine-tuned version of XLM-RoBERTa (xlm-roberta-base) for Named Entity Recognition (NER) tasks. It was trained on the German portion of the PAN-X subset of the XTREME dataset. The model identifies the following entity types:

	- PER: Person names
	- ORG: Organization names
	- LOC: Location names



	![image/png](https://cdn-uploads.huggingface.co/production/uploads/6625c89b3b64b5270e95bbe9/ef0A5MMJ-NTXTCTcQRmIW.png)













	## Uses
	This model is suited to multilingual NER tasks, particularly when person, organization, and location names must be extracted and classified in text across different languages.

	Applications:
	- Information extraction
	- Multilingual NER tasks
	- Automated text analysis for businesses




	## Training Details
	Base Model: xlm-roberta-base

	Training Dataset: The model is trained on the PAN-X subset of the XTREME dataset, which includes labeled NER data for multiple languages.

	Training Framework: Hugging Face transformers library with PyTorch backend.

	Data Preprocessing: Tokenization was performed with the XLM-RoBERTa tokenizer, and the word-level labels were aligned to the resulting subword tokens.
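The label alignment mentioned above can be sketched in plain Python. This is a minimal sketch, not the author's exact code: it assumes the common convention of labelling only the first subword of each word and masking the remaining subwords and the special tokens with -100, the index that PyTorch's cross-entropy loss ignores by default. The `word_ids` list stands in for what a fast tokenizer's `word_ids()` method returns; the function name and the example values are hypothetical.

```python
IGNORE_INDEX = -100  # label id skipped by PyTorch's cross-entropy loss by default


def align_labels_with_subwords(word_ids, word_labels):
    """Map word-level label ids onto subword positions.

    word_ids: one entry per subword token; None marks a special token,
              an int gives the index of the word the subword belongs to.
    word_labels: one label id per original word.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None or word_id == previous_word:
            # Special token or continuation subword: mask it out.
            aligned.append(IGNORE_INDEX)
        else:
            # First subword of a new word keeps the word's label.
            aligned.append(word_labels[word_id])
        previous_word = word_id
    return aligned


# Hypothetical example: three words, the second split into two subwords,
# wrapped in <s> ... </s> special tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 5, 6]
print(align_labels_with_subwords(word_ids, word_labels))
# prints [-100, 0, 5, -100, 6, -100]
```

Masking continuation subwords keeps one prediction per word, which makes the later comparison against word-level gold labels straightforward.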



	### Training Procedure

	Here's a brief overview of the training procedure for the XLM-RoBERTa model for NER:

	**Set Up Environment:**
	- Clone the repository and set up dependencies.
	- Import necessary libraries and modules.

	**Load Data:**
	- Load the PAN-X subset from the XTREME dataset.
	- Shuffle and sample data subsets for training and evaluation.
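A loading step along these lines could look roughly as follows. This is a sketch under assumptions: the Hub dataset id (`"xtreme"` here, possibly `"google/xtreme"` or an extra `trust_remote_code` argument depending on your `datasets` version), the `PAN-X.<language>` configuration name, and the helper name `load_panx_subset` are not stated in this README. The import is done lazily, so defining the function does not require `datasets` to be installed.

```python
def load_panx_subset(language="de", n_train=None, seed=0, dataset_id="xtreme"):
    """Load one PAN-X language configuration and optionally subsample the training split.

    dataset_id is an assumption; adjust it to the id your datasets version expects.
    """
    # Imported lazily; requires `pip install datasets` and network access when called.
    from datasets import load_dataset

    ds = load_dataset(dataset_id, name=f"PAN-X.{language}")
    if n_train is not None:
        n = min(n_train, ds["train"].num_rows)
        # Shuffle with a fixed seed so the sampled subset is reproducible.
        ds["train"] = ds["train"].shuffle(seed=seed).select(range(n))
    return ds
```

Calling `load_panx_subset("de", n_train=5000)` would return a `DatasetDict` whose training split has at most 5000 shuffled examples; the number is only an illustration.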

	**Data Preparation:**
	- Convert raw dataset into a format suitable for token classification.
	- Define a mapping for entity tags and apply tokenization.
	- Align NER tags with tokenized inputs.
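The tag mapping can be illustrated with two small dictionaries. This sketch assumes the seven-tag IOB2 scheme used by PAN-X for the PER, ORG, and LOC types, in the order shown; in practice the names are better read from the dataset's `ner_tags` feature than hard-coded. `PANX_TAGS` and `ids_to_tags` are hypothetical names.

```python
# Assumed tag order; verify against the dataset's `ner_tags` feature names.
PANX_TAGS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

index2tag = dict(enumerate(PANX_TAGS))
tag2index = {tag: idx for idx, tag in index2tag.items()}


def ids_to_tags(tag_ids):
    """Convert a sequence of integer tag ids into their string names."""
    return [index2tag[i] for i in tag_ids]


print(ids_to_tags([0, 5, 6, 0]))
# prints ['O', 'B-LOC', 'I-LOC', 'O']
```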

	**Define Model:**
	- Initialize the XLM-RoBERTa model for token classification.
	- Configure the model with the number of labels based on the dataset.
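One way to wire the label set into the model is through the configuration object. The sketch below is an assumption about how this might be done, not the author's code: it uses `AutoConfig` and `AutoModelForTokenClassification` from `transformers`, and the helper names `label_maps` and `build_token_classifier` are hypothetical. The `transformers` import happens inside the function, so the label helper can be used on its own.

```python
def label_maps(tags):
    """Build the id -> tag and tag -> id dictionaries for a list of tag names."""
    index2tag = dict(enumerate(tags))
    tag2index = {tag: idx for idx, tag in index2tag.items()}
    return index2tag, tag2index


def build_token_classifier(tags, base_checkpoint="xlm-roberta-base"):
    """Create a token-classification model whose head has one output per tag."""
    # Imported lazily; requires transformers, PyTorch, and network access when called.
    from transformers import AutoConfig, AutoModelForTokenClassification

    index2tag, tag2index = label_maps(tags)
    config = AutoConfig.from_pretrained(
        base_checkpoint,
        num_labels=len(tags),
        id2label=index2tag,
        label2id=tag2index,
    )
    return AutoModelForTokenClassification.from_pretrained(base_checkpoint, config=config)
```

Storing `id2label` and `label2id` in the configuration means the saved model reports tag names such as `B-LOC` at inference time instead of generic `LABEL_n` ids.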

	**Set Up Training Arguments:**
	- Define hyperparameters such as learning rate, batch size, number of epochs, and evaluation strategy.
	- Configure logging and checkpointing.
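These settings might be collected as follows, written as a plain dictionary so the sketch runs without any dependencies. Every value is an illustrative placeholder and not the configuration used for this model; `build_training_kwargs` and the output directory name are hypothetical. The keys are `TrainingArguments` parameter names. The evaluation setting is called `eval_strategy` in recent `transformers` releases and `evaluation_strategy` in older ones, so the key is passed in as an argument.

```python
def build_training_kwargs(n_train_examples, batch_size=24, num_epochs=3,
                          learning_rate=5e-5, eval_key="eval_strategy"):
    """Collect keyword arguments intended for transformers.TrainingArguments.

    All defaults are placeholders chosen for illustration only.
    """
    # Log roughly once per epoch.
    logging_steps = max(1, n_train_examples // batch_size)
    return {
        "output_dir": "xlm-roberta-base-finetuned-panx-de",  # hypothetical name
        "learning_rate": learning_rate,
        "per_device_train_batch_size": batch_size,
        "per_device_eval_batch_size": batch_size,
        "num_train_epochs": num_epochs,
        "weight_decay": 0.01,
        eval_key: "epoch",         # evaluate at the end of every epoch
        "save_strategy": "epoch",  # write a checkpoint at the end of every epoch
        "logging_steps": logging_steps,
    }


kwargs = build_training_kwargs(n_train_examples=12_000)
print(kwargs["logging_steps"])
# With transformers installed: training_args = TrainingArguments(**kwargs)
```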

	**Initialize Trainer:**
	- Create a Trainer instance with the model, training arguments, datasets, and data collator.
	- Specify evaluation metrics to monitor performance.

	**Train the Model:**
	- Start the training process using the Trainer.
	- Monitor training progress and metrics.

	**Evaluation and Results:**
	- Evaluate the model on the validation set.
	- Compute metrics like F1 score for performance assessment.
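For NER, the F1 score is usually computed over whole entity spans rather than individual tokens. The following dependency-free sketch shows one way to do that for IOB2 tag sequences. It is an illustration only: libraries such as `seqeval` are commonly used for this and may treat edge cases (for example a stray `I-` tag) differently, and the README does not say which implementation was used. The function names and the example sequences are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (entity_type, start, end) spans in one IOB2 tag sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current
        if current is not None and not continues:
            spans.add((current, start, i))
            start, current = None, None
        if prefix == "B" or (prefix == "I" and current is None):
            # A stray I- tag is treated as the start of a new span.
            start, current = i, label
    return spans


def entity_f1(true_seqs, pred_seqs):
    """Micro-averaged F1 over exact-match entity spans."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold, pred = extract_spans(true_tags), extract_spans(pred_tags)
        tp += len(gold & pred)
        n_true += len(gold)
        n_pred += len(pred)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [["O", "B-LOC", "I-LOC", "O", "B-PER"]]
pred = [["O", "B-LOC", "I-LOC", "O", "O"]]
print(round(entity_f1(gold, pred), 3))
# prints 0.667 (precision 1.0, recall 0.5)
```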

	**Save and Push Model:**
	- Save the fine-tuned model locally or push to a model hub for sharing and further use.




	#### Metrics
	The model's performance is evaluated using the F1 score. Predictions are aligned with the gold-standard labels, and predictions for subword tokens that carry no label are ignored.
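The alignment of predictions with gold labels can be sketched with NumPy. This assumes that ignored positions are marked with the label id -100 (the usual convention with PyTorch losses, not confirmed by this README) and that `logits` has shape `(batch, sequence, num_labels)`. The function name and the toy arrays are hypothetical.

```python
import numpy as np


def align_predictions(logits, label_ids, index2tag, ignore_index=-100):
    """Turn logits and padded label ids into lists of tag strings, dropping ignored positions."""
    pred_ids = np.argmax(np.asarray(logits), axis=-1)
    label_ids = np.asarray(label_ids)
    true_tags, pred_tags = [], []
    for pred_row, label_row in zip(pred_ids, label_ids):
        keep = label_row != ignore_index  # boolean mask of labelled positions
        true_tags.append([index2tag[int(i)] for i in label_row[keep]])
        pred_tags.append([index2tag[int(i)] for i in pred_row[keep]])
    return true_tags, pred_tags


# Toy example: one sequence of three positions, the first one ignored.
index2tag = {0: "O", 1: "B-LOC"}
logits = [[[2.0, 0.0], [0.0, 3.0], [1.0, 0.0]]]
label_ids = [[-100, 1, 0]]
print(align_predictions(logits, label_ids, index2tag))
# prints ([['B-LOC', 'O']], [['B-LOC', 'O']])
```

The two returned lists have one entry per sequence and can be passed to a sequence-labelling metric.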






	## Evaluation

	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
	import pandas as pd

	model_checkpoint = "MassMin/Multilingual-NER-tagging"

	# Run on the GPU when one is available, otherwise on the CPU
	device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

	tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
	model = AutoModelForTokenClassification.from_pretrained(model_checkpoint).to(device)

	ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, framework="pt", device=0 if torch.cuda.is_available() else -1)

	def tag_text_with_pipeline(text, ner_pipeline):
	    # Use the NER pipeline to get predictions
	    results = ner_pipeline(text)

	    # Convert results to a DataFrame for easy viewing
	    df = pd.DataFrame(results)
	    df = df[['word', 'entity', 'score']]
	    df.columns = ['Tokens', 'Tags', 'Score']  # Rename columns for clarity
	    return df

	text = "2000 Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern ."
	result = tag_text_with_pipeline(text, ner_pipeline)
	print(result)
	```







	#### Testing Data








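	| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
	|---|---|---|---|---|---|---|---|---|---|---|---|---|
	| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | . |
	| Tags | O | O | O | O | B-LOC | I-LOC | O | O | B-LOC | B-LOC | I-LOC | O |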
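To read entity mentions off a tagged sentence like the one above, consecutive `B-`/`I-` tags can be merged into spans. The sketch below is a hypothetical post-processing helper, not part of the model or of the `transformers` pipeline; it assumes word-level tokens with IOB2 tags and reuses the tokens and tags from the example.

```python
def group_entities(tokens, tags):
    """Merge IOB2-tagged tokens into (text, entity_type) pairs."""
    entities, words, label = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and tag_label == label:
            # Continuation of the entity currently being collected.
            words.append(token)
            continue
        if words:
            # Any other tag closes the open entity.
            entities.append((" ".join(words), label))
        if prefix in ("B", "I"):
            words, label = [token], tag_label
        else:
            words, label = [], None
    if words:
        entities.append((" ".join(words), label))
    return entities


tokens = ["2.000", "Einwohnern", "an", "der", "Danziger", "Bucht", "in", "der",
          "polnischen", "Woiwodschaft", "Pommern", "."]
tags = ["O", "O", "O", "O", "B-LOC", "I-LOC", "O", "O",
        "B-LOC", "B-LOC", "I-LOC", "O"]
print(group_entities(tokens, tags))
# prints [('Danziger Bucht', 'LOC'), ('polnischen', 'LOC'), ('Woiwodschaft Pommern', 'LOC')]
```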