---
metrics:
- accuracy
pipeline_tag: token-classification
tags:
- code
- map
- News
- Customer Support
- chatbot
language:
- de
- en
---
# XLM-RoBERTa Token Classification for Named Entity Recognition (NER)
### Model Description
This model is a fine-tuned version of XLM-RoBERTa (xlm-roberta-base) for Named Entity Recognition (NER). It has been trained on the PAN-X subset of the XTREME dataset for the German language. The model identifies the following entity types:
- **PER:** person names
- **ORG:** organization names
- **LOC:** location names
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6625c89b3b64b5270e95bbe9/ef0A5MMJ-NTXTCTcQRmIW.png)
## Uses
This model is suitable for multilingual NER tasks, especially in scenarios where person, organization, and location names must be extracted and classified across different languages.

Applications:
- Information extraction
- Multilingual NER tasks
- Automated text analysis for businesses
## Training Details
- **Base model:** xlm-roberta-base
- **Training dataset:** PAN-X subset of the XTREME dataset, which includes labeled NER data for multiple languages.
- **Training framework:** Hugging Face `transformers` library with a PyTorch backend.
- **Data preprocessing:** Tokenization with the XLM-RoBERTa tokenizer, with attention paid to aligning entity labels with sub-word tokens.
### Training Procedure
Here is a brief overview of the procedure used to fine-tune XLM-RoBERTa for NER, with a minimal code sketch after each step:
1. **Setup environment**
   - Clone the repository and set up dependencies.
   - Import the necessary libraries and modules.
2. **Load data**
   - Load the PAN-X subset of the XTREME dataset.
   - Shuffle and sample data subsets for training and evaluation, as in the sketch below.
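A minimal sketch of this step with the `datasets` library, assuming the standard `PAN-X.de` configuration of XTREME; the sample sizes are placeholders, not the exact sizes used for this checkpoint:

```python
from datasets import load_dataset

# Load the German PAN-X split of XTREME (other languages use e.g. "PAN-X.en").
panx_de = load_dataset("xtreme", name="PAN-X.de")

# Shuffle and sample subsets for training and evaluation.
# The sample sizes below are illustrative placeholders.
train_ds = panx_de["train"].shuffle(seed=42).select(range(10_000))
eval_ds = panx_de["validation"].shuffle(seed=42).select(range(2_000))

print(train_ds[0])  # {'tokens': [...], 'ner_tags': [...], 'langs': [...]}
```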
3. **Data preparation**
   - Convert the raw dataset into a format suitable for token classification.
   - Define a mapping for the entity tags and apply tokenization.
   - Align the NER tags with the tokenized (sub-word) inputs, as sketched below.
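A sketch of the usual tokenize-and-align pattern, assuming the PAN-X feature names `tokens` and `ner_tags`; sub-word continuations and special tokens get the label `-100` so the loss ignores them:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize_and_align_labels(batch):
    # Tokenize pre-split words so word_ids() can map sub-words back to words.
    tokenized = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, ner_tags in enumerate(batch["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word_id = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None or word_id == previous_word_id:
                # Special tokens and sub-word continuations are ignored by the loss.
                label_ids.append(-100)
            else:
                label_ids.append(ner_tags[word_id])
            previous_word_id = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

# Reusing train_ds / eval_ds from the loading sketch above.
train_encoded = train_ds.map(tokenize_and_align_labels, batched=True, remove_columns=train_ds.column_names)
eval_encoded = eval_ds.map(tokenize_and_align_labels, batched=True, remove_columns=eval_ds.column_names)
```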
4. **Define the model**
   - Initialize the XLM-RoBERTa model for token classification.
   - Configure the model with the number of labels from the dataset, as in the sketch below.
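A sketch of the model initialization; the tag set matches the PER/ORG/LOC scheme described above, but the exact index order is an assumption:

```python
from transformers import AutoModelForTokenClassification

# IOB2 tags for the three PAN-X entity types.
tags = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
id2label = {i: tag for i, tag in enumerate(tags)}
label2id = {tag: i for i, tag in enumerate(tags)}

model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(tags),
    id2label=id2label,
    label2id=label2id,
)
```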
5. **Set up training arguments**
   - Define hyperparameters such as learning rate, batch size, number of epochs, and evaluation strategy.
   - Configure logging and checkpointing (see the sketch below).
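A sketch of the training arguments; every hyperparameter value below is an illustrative placeholder, not the value used to train this checkpoint:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-base-finetuned-panx-de",  # placeholder output path
    learning_rate=5e-5,               # placeholder
    per_device_train_batch_size=16,   # placeholder
    per_device_eval_batch_size=16,    # placeholder
    num_train_epochs=3,               # placeholder
    evaluation_strategy="epoch",      # evaluate at the end of each epoch
    save_strategy="epoch",            # checkpoint at the end of each epoch
    logging_steps=100,
)
```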
6. **Initialize the Trainer**
   - Create a `Trainer` instance with the model, training arguments, datasets, and data collator.
   - Specify evaluation metrics to monitor performance.
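A sketch of the `Trainer` setup, reusing the objects from the sketches above; `compute_metrics` is the F1 function sketched in the Metrics section below:

```python
from transformers import Trainer, DataCollatorForTokenClassification

# Pads inputs and labels dynamically within each batch.
data_collator = DataCollatorForTokenClassification(tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_encoded,
    eval_dataset=eval_encoded,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # defined in the Metrics sketch below
)
```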
7. **Train the model**
   - Start the training process using the `Trainer`.
   - Monitor training progress and metrics.
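Continuing from the sketches above, training is started with a single call:

```python
# Fine-tune the model; loss and metrics are logged according to training_args.
train_result = trainer.train()
print(train_result.metrics)
```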
8. **Evaluation and results**
   - Evaluate the model on the validation set.
   - Compute metrics such as the F1 score for performance assessment.
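A sketch of evaluation on the validation split, again reusing the trainer; the `eval_f1` key assumes the `compute_metrics` sketch shown in the Metrics section below:

```python
# Metric keys returned by Trainer.evaluate() are prefixed with "eval_".
eval_metrics = trainer.evaluate()
print(eval_metrics.get("eval_f1"))
```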
9. **Save and push the model**
   - Save the fine-tuned model locally or push it to the Hugging Face Hub for sharing and further use.
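Finally, a sketch of saving and sharing the fine-tuned model; the directory and commit names are illustrative:

```python
# Save the model and tokenizer locally.
trainer.save_model("xlm-roberta-base-finetuned-panx-de")
tokenizer.save_pretrained("xlm-roberta-base-finetuned-panx-de")

# Or push to the Hugging Face Hub (requires a prior `huggingface-cli login`).
# trainer.push_to_hub(commit_message="Add fine-tuned XLM-R NER model")
```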
#### Metrics
The model's performance is evaluated using the F1 score for NER. Predictions are aligned with the gold-standard labels, and sub-token predictions are ignored where appropriate (see the sketch below).
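A sketch of how such a metric function can be written with `seqeval`; the exact implementation used for this checkpoint may differ. Positions labelled `-100` (special tokens and sub-word continuations) are skipped, and `id2label` is the mapping from the model sketch above:

```python
import numpy as np
from seqeval.metrics import f1_score

def align_predictions(predictions, label_ids):
    # Convert logits to tag strings, skipping positions labelled -100.
    preds = np.argmax(predictions, axis=2)
    true_labels, pred_labels = [], []
    for pred_seq, label_seq in zip(preds, label_ids):
        true_seq, pred_seq_tags = [], []
        for p, l in zip(pred_seq, label_seq):
            if l != -100:
                true_seq.append(id2label[int(l)])
                pred_seq_tags.append(id2label[int(p)])
        true_labels.append(true_seq)
        pred_labels.append(pred_seq_tags)
    return pred_labels, true_labels

def compute_metrics(eval_pred):
    # eval_pred.predictions are logits; eval_pred.label_ids are the gold labels.
    pred_labels, true_labels = align_predictions(eval_pred.predictions, eval_pred.label_ids)
    return {"f1": f1_score(true_labels, pred_labels)}
```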
## Evaluation
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import pandas as pd
model_checkpoint = "MassMin/Multilingual-NER-tagging"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint).to(device)
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, framework="pt", device=0 if torch.cuda.is_available() else -1)
def tag_text_with_pipeline(text, ner_pipeline):
    # Use the NER pipeline to get predictions
    results = ner_pipeline(text)
    # Convert results to a DataFrame for easy viewing
    df = pd.DataFrame(results)
    df = df[['word', 'entity', 'score']]
    df.columns = ['Tokens', 'Tags', 'Score']  # Rename columns for clarity
    return df

text = "2000 Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern ."
result = tag_text_with_pipeline(text, ner_pipeline)
print(result)
```
#### Example Output

|        | 0     | 1          | 2  | 3   | 4        | 5     | 6  | 7   | 8          | 9            | 10      | 11 |
|--------|-------|------------|----|-----|----------|-------|----|-----|------------|--------------|---------|----|
| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | .  |
| Tags   | O     | O          | O  | O   | B-LOC    | I-LOC | O  | O   | B-LOC      | B-LOC        | I-LOC   | O  |