---
metrics:
- accuracy
pipeline_tag: token-classification
tags:
- code
- map
- News
- Customer Support
- chatbot
language:
- de
- en
---

# XLM-RoBERTa Token Classification for Named Entity Recognition (NER)

### Model Description

This model is a fine-tuned version of XLM-RoBERTa (xlm-roberta-base) for Named Entity Recognition (NER). It was trained on the German portion of the PAN-X subset of the XTREME dataset. The model identifies the following entity types:

- PER: Person names
- ORG: Organization names
- LOC: Location names
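
The three entity types above are usually expanded into per-token tags with a B- (beginning) and I- (inside) prefix plus an O tag for everything else, which matches the B-LOC and I-LOC tags in the sample output of this card. The snippet below is a minimal sketch of such a mapping. The helper name `build_label_maps` and the label order are assumptions made for illustration; the order stored in the checkpoint's own configuration (`model.config.id2label`) takes precedence.

```python
# Entity types listed in this model card.
ENTITY_TYPES = ["PER", "ORG", "LOC"]


def build_label_maps(entity_types):
    """Build id <-> label mappings for an IOB2 tag set (hypothetical helper)."""
    labels = ["O"]  # tag for tokens outside any entity
    for entity in entity_types:
        labels += [f"B-{entity}", f"I-{entity}"]
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


# Assumed order only; compare with model.config.id2label before relying on it.
id2label, label2id = build_label_maps(ENTITY_TYPES)
print(id2label)
```

With three entity types this yields seven labels in total.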

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6625c89b3b64b5270e95bbe9/ef0A5MMJ-NTXTCTcQRmIW.png)

## Uses

This model is suited to multilingual NER tasks, in particular those that require extracting and classifying person, organization, and location names in text across different languages.

Applications:

- Information extraction
- Multilingual NER tasks
- Automated text analysis for businesses

## Training Details

- Base Model: xlm-roberta-base
- Training Dataset: the PAN-X subset of the XTREME dataset, which provides labeled NER data for multiple languages. The German portion was used for fine-tuning.
- Training Framework: Hugging Face transformers library with a PyTorch backend.
- Data Preprocessing: text was tokenized with the XLM-RoBERTa tokenizer, and the word-level labels were aligned to the resulting subword tokens.
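
Label alignment is needed because the tokenizer can split one word into several subword tokens while the dataset has one label per word. A minimal sketch of one common scheme follows: only the first subword of each word keeps the word's label, and special tokens and continuation subwords receive -100, the default `ignore_index` of PyTorch's cross-entropy loss. The function name and the example word ids are illustrative assumptions; the card does not state which alignment variant was used.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels_with_tokens(word_ids, word_labels):
    """Map word-level label ids onto subword tokens.

    word_ids has one entry per subword token: the index of the word the token
    came from, or None for special tokens (the format returned by the
    word_ids() method of fast tokenizers).
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token such as <s> or </s>
        elif word_id == previous_word:
            aligned.append(IGNORE_INDEX)  # continuation subword of the same word
        else:
            aligned.append(word_labels[word_id])  # first subword keeps the label
        previous_word = word_id
    return aligned


# Hypothetical split of two words into three subwords plus two special tokens.
word_ids = [None, 0, 0, 1, None]
word_labels = [5, 6]  # for example B-LOC and I-LOC under an assumed mapping
print(align_labels_with_tokens(word_ids, word_labels))  # [-100, 5, -100, 6, -100]
```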

### Training Procedure

A brief overview of the training procedure for the XLM-RoBERTa NER model:

1. Set up the environment:
   - Clone the repository and set up dependencies.
   - Import necessary libraries and modules.
2. Load the data:
   - Load the PAN-X subset from the XTREME dataset.
   - Shuffle and sample data subsets for training and evaluation.
3. Prepare the data:
   - Convert the raw dataset into a format suitable for token classification.
   - Define a mapping for entity tags and apply tokenization.
   - Align NER tags with tokenized inputs.
4. Define the model:
   - Initialize the XLM-RoBERTa model for token classification.
   - Configure the model with the number of labels based on the dataset.
5. Set up the training arguments:
   - Define hyperparameters such as learning rate, batch size, number of epochs, and evaluation strategy.
   - Configure logging and checkpointing.
6. Initialize the Trainer:
   - Create a Trainer instance with the model, training arguments, datasets, and data collator.
   - Specify evaluation metrics to monitor performance.
7. Train the model:
   - Start the training process using the Trainer.
   - Monitor training progress and metrics.
8. Evaluate:
   - Evaluate the model on the validation set.
   - Compute metrics such as the F1 score for performance assessment.
9. Save and push the model:
   - Save the fine-tuned model locally or push it to a model hub for sharing and further use.
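
The model, training-argument, and Trainer steps above can be sketched roughly as follows. This is an illustration under assumptions, not the script used for this checkpoint: the function name `build_trainer`, the output directory, and every hyperparameter value are placeholders, and the datasets are expected to be already tokenized with aligned labels. The evaluation-schedule argument is left out because its name has changed between transformers releases.

```python
def build_trainer(train_dataset, eval_dataset, label_names, output_dir="xlmr-panx-de"):
    """Assemble a Trainer for token classification (hypothetical helper).

    Calling this function downloads xlm-roberta-base from the Hugging Face Hub.
    """
    # Imported lazily so that defining the function has no heavy side effects.
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        DataCollatorForTokenClassification,
        Trainer,
        TrainingArguments,
    )

    id2label = dict(enumerate(label_names))
    label2id = {label: idx for idx, label in id2label.items()}

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    # The number of labels comes from the dataset's tag set.
    model = AutoModelForTokenClassification.from_pretrained(
        "xlm-roberta-base",
        num_labels=len(label_names),
        id2label=id2label,
        label2id=label2id,
    )

    # Placeholder hyperparameters, NOT the values used to train this model.
    args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=5e-5,
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
        logging_steps=100,
        save_strategy="epoch",
    )

    # The collator pads input ids and labels to a common length per batch.
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=DataCollatorForTokenClassification(tokenizer),
    )
```

Training would then start with `build_trainer(...).train()`, and a `compute_metrics` function can be passed to `Trainer` to report F1 during evaluation.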

#### Training Hyperparameters

Model performance is measured with the F1 score for NER. Before scoring, predictions are aligned with the gold-standard labels, and predictions for sub-tokens that carry no gold label are ignored.
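
The scoring step can be illustrated with a small, dependency-free sketch. It assumes that ignored positions are marked with the label id -100 and that F1 is computed at the entity level, where a predicted span counts only if its type and boundaries both match the gold span. Libraries such as seqeval offer this kind of metric; the function names and the toy data below are made up for the example.

```python
IGNORE_INDEX = -100  # gold label id marking positions that are not scored


def drop_ignored(pred_ids, gold_ids, id2label):
    """Keep positions with a real gold label and map ids to tag names."""
    preds, golds = [], []
    for pred, gold in zip(pred_ids, gold_ids):
        if gold != IGNORE_INDEX:
            preds.append(id2label[pred])
            golds.append(id2label[gold])
    return preds, golds


def extract_spans(tags):
    """Return the set of (entity_type, start, end) spans in an IOB2 sequence."""
    spans, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and tag[2:] == current
        if current is not None and not inside:
            spans.add((current, start, i))
            current = None
        if tag.startswith("B-") or (tag.startswith("I-") and current is None):
            start, current = i, tag[2:]
    return spans


def entity_f1(pred_sentences, gold_sentences):
    """Micro-averaged entity-level F1 over a list of tagged sentences."""
    tp = n_pred = n_gold = 0
    for pred_tags, gold_tags in zip(pred_sentences, gold_sentences):
        pred, gold = extract_spans(pred_tags), extract_spans(gold_tags)
        tp += len(pred & gold)
        n_pred += len(pred)
        n_gold += len(gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


id2label = {0: "O", 1: "B-LOC", 2: "I-LOC"}  # toy mapping for this example
gold_ids = [-100, 1, -100, 2, 0, 1, -100]
pred_ids = [0, 1, 2, 2, 0, 0, 0]
pred_tags, gold_tags = drop_ignored(pred_ids, gold_ids, id2label)
f1 = entity_f1([pred_tags], [gold_tags])
print(pred_tags, gold_tags, round(f1, 3))
```

In this toy case one of the two gold spans is recovered, so precision is 1.0, recall is 0.5, and F1 is about 0.667.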

## Evaluation

```python
import pandas as pd
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_checkpoint = "MassMin/Multilingual-NER-tagging"

# Run on the GPU when one is available, otherwise on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint).to(device)

ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    framework="pt",
    device=0 if torch.cuda.is_available() else -1,
)


def tag_text_with_pipeline(text, ner_pipeline):
    # Use the NER pipeline to get predictions
    results = ner_pipeline(text)

    # Convert results to a DataFrame for easy viewing
    df = pd.DataFrame(results)
    df = df[["word", "entity", "score"]]
    df.columns = ["Tokens", "Tags", "Score"]  # Rename columns for clarity
    return df


text = "2000 Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern ."
result = tag_text_with_pipeline(text, ner_pipeline)
print(result)
```

#### Testing Data

|        | 0     | 1          | 2  | 3   | 4        | 5     | 6  | 7   | 8          | 9            | 10      | 11 |
|--------|-------|------------|----|-----|----------|-------|----|-----|------------|--------------|---------|----|
| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | .  |
| Tags   | O     | O          | O  | O   | B-LOC    | I-LOC | O  | O   | B-LOC      | B-LOC        | I-LOC   | O  |