---
metrics:
- accuracy
pipeline_tag: token-classification
tags:
- code
- map
- News
- Customer Support
- chatbot
language:
- de
- en
---


# XLM-RoBERTa Token Classification for Named Entity Recognition (NER)

## Model Description
This model is a fine-tuned version of XLM-RoBERTa (`xlm-roberta-base`) for Named Entity Recognition (NER). It was trained on the German portion of the PAN-X subset of the XTREME dataset. The model identifies the following entity types:

- **PER**: person names
- **ORG**: organization names
- **LOC**: location names
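The labels follow the standard IOB2 scheme (`B-` marks the first token of an entity, `I-` a continuation, `O` a non-entity token). A minimal sketch of the resulting label set — the index order is an assumption, so check `model.config.id2label` for the authoritative mapping:

```python
# Assumed PAN-X-style IOB2 label set; verify against model.config.id2label
index2tag = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
}
tag2index = {tag: idx for idx, tag in index2tag.items()}
```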



![image/png](https://cdn-uploads.huggingface.co/production/uploads/6625c89b3b64b5270e95bbe9/ef0A5MMJ-NTXTCTcQRmIW.png)

## Uses
This model is suited to multilingual NER, in particular extracting and classifying person, organization, and location names in text across languages.

Applications:
- Information extraction
- Multilingual NER tasks
- Automated text analysis for businesses




## Training Details
- **Base model:** `xlm-roberta-base`
- **Training dataset:** the PAN-X subset of the XTREME dataset, which provides labeled NER data for multiple languages.
- **Training framework:** Hugging Face `transformers` with the PyTorch backend.
- **Data preprocessing:** tokenization with the XLM-RoBERTa tokenizer, with attention paid to aligning the NER labels to the resulting subword tokens.
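Subword alignment is the step that most often trips people up: XLM-RoBERTa splits words into subword pieces, and only the first piece of each word should carry the word's label. A minimal sketch of that alignment, following the standard `transformers` token-classification recipe (the exact preprocessing code for this model is not published):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize_and_align_labels(batch):
    # Tokenize pre-split words; is_split_into_words preserves word boundaries
    tokenized = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, word_labels in enumerate(batch["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)                  # special tokens: ignored by the loss
            elif word_id != previous_word:
                label_ids.append(word_labels[word_id])  # first subword carries the word's label
            else:
                label_ids.append(-100)                  # later subwords are masked out
            previous_word = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized
```

Applied with `dataset.map(tokenize_and_align_labels, batched=True)`, this yields inputs whose `labels` line up one-to-one with the subword tokens.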



### Training Procedure

Here's a brief overview of the training procedure for the XLM-RoBERTa NER model; a condensed code sketch follows the list.

1. **Set up the environment:** clone the repository, install dependencies, and import the required libraries and modules.
2. **Load data:** load the PAN-X subset of the XTREME dataset; shuffle and sample subsets for training and evaluation.
3. **Prepare the data:** convert the raw dataset into a format suitable for token classification, define the entity-tag mapping, apply tokenization, and align the NER tags with the tokenized inputs.
4. **Define the model:** initialize XLM-RoBERTa for token classification, configured with the number of labels in the dataset.
5. **Set up training arguments:** define hyperparameters such as learning rate, batch size, number of epochs, and evaluation strategy; configure logging and checkpointing.
6. **Initialize the Trainer:** create a `Trainer` instance with the model, training arguments, datasets, and data collator, and specify the evaluation metrics to monitor.
7. **Train the model:** start training with the `Trainer` and monitor progress and metrics.
8. **Evaluate:** evaluate the model on the validation set and compute metrics such as the F1 score.
9. **Save and push:** save the fine-tuned model locally or push it to a model hub for sharing and further use.
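A condensed sketch of steps 4–9, assuming the standard `transformers` `Trainer` recipe. The hyperparameter values below are illustrative, not the published settings, and `panx_de` stands in for the encoded dataset splits produced in step 3:

```python
from transformers import (AutoModelForTokenClassification, TrainingArguments,
                          Trainer, DataCollatorForTokenClassification)

num_labels = 7  # O plus B/I tags for PER, ORG, LOC
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base", num_labels=num_labels)

args = TrainingArguments(
    output_dir="xlm-roberta-base-panx-de",
    learning_rate=2e-5,               # illustrative values, not the published ones
    per_device_train_batch_size=24,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=panx_de["train"],        # assumed encoded PAN-X splits
    eval_dataset=panx_de["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    compute_metrics=compute_metrics,       # entity-level F1, sketched below
)

trainer.train()
trainer.evaluate()
trainer.push_to_hub()  # optional: share the fine-tuned model on the Hub
```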




#### Metrics
Model performance is evaluated using the F1 score for NER. Predictions are aligned with the gold-standard labels, and sub-token predictions (positions labeled `-100`) are ignored where appropriate.
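A sketch of such a metric function, assuming `seqeval` (the library commonly used for entity-level F1; the exact implementation used for this model is not published):

```python
import numpy as np
from seqeval.metrics import f1_score

index2tag = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG",
             4: "I-ORG", 5: "B-LOC", 6: "I-LOC"}  # assumed label order

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Keep only positions with a real label (-100 marks special tokens
    # and non-first subwords, which are excluded from scoring)
    true_labels = [[index2tag[l] for l in row if l != -100] for row in labels]
    true_preds = [[index2tag[p] for p, l in zip(p_row, l_row) if l != -100]
                  for p_row, l_row in zip(predictions, labels)]
    return {"f1": f1_score(true_labels, true_preds)}
```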






## Evaluation

The snippet below loads the model from the Hub and runs it through the `transformers` NER pipeline:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import pandas as pd

model_checkpoint = "MassMin/Multilingual-NER-tagging"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint).to(device)

# Build an NER pipeline; device=0 selects the first GPU, -1 falls back to CPU
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, framework="pt",
                        device=0 if torch.cuda.is_available() else -1)

def tag_text_with_pipeline(text, ner_pipeline):
    # Use the NER pipeline to get predictions
    results = ner_pipeline(text)
    
    # Convert results to a DataFrame for easy viewing
    df = pd.DataFrame(results)
    df = df[['word', 'entity', 'score']]
    df.columns = ['Tokens', 'Tags', 'Score']  # Rename columns for clarity
    return df

text = "2.000 Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern ."
result = tag_text_with_pipeline(text, ner_pipeline)
print(result)
```

#### Example Output

Running the snippet above on the example sentence produces:

|        | 0     | 1          | 2  | 3   | 4        | 5     | 6  | 7   | 8          | 9            | 10      | 11 |
|--------|-------|------------|----|-----|----------|-------|----|-----|------------|--------------|---------|----|
| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | .  |
| Tags   | O     | O          | O  | O   | B-LOC    | I-LOC | O  | O   | B-LOC      | B-LOC        | I-LOC   | O  |