XLM-RoBERTa Token Classification for Named Entity Recognition (NER)
Model Description
This model is a fine-tuned version of XLM-RoBERTa (xlm-roberta-base) for Named Entity Recognition (NER). It was trained on the German portion of the PAN-X subset of the XTREME dataset. The model identifies the following entity types, tagged in IOB2 format (see the mapping sketch below):
- PER: Person names
- ORG: Organization names
- LOC: Location names
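For reference, PAN-X annotates entities with the IOB2 scheme: B- opens an entity span, I- continues it, and O marks tokens outside any entity. A minimal id-to-tag mapping, assuming the dataset's standard label ordering, looks like this:

```python
# IOB2 label set used by PAN-X; the integer ids follow the dataset's
# ClassLabel ordering (an assumption worth verifying against your copy).
index2tag = {
    0: "O",
    1: "B-PER", 2: "I-PER",
    3: "B-ORG", 4: "I-ORG",
    5: "B-LOC", 6: "I-LOC",
}
tag2index = {tag: idx for idx, tag in index2tag.items()}
```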
Uses
This model is suited to multilingual NER tasks, especially scenarios that require extracting and classifying person, organization, and location names across languages.
Applications:
- Information extraction
- Multilingual NER tasks
- Automated text analysis for businesses
Training Details
- Base Model: xlm-roberta-base
- Training Dataset: the PAN-X subset of the XTREME dataset, which provides labeled NER data for many languages; this model uses the German (PAN-X.de) split.
- Training Framework: Hugging Face transformers library with a PyTorch backend.
- Data Preprocessing: tokenization with the XLM-RoBERTa tokenizer, with NER labels aligned to the resulting subword tokens (a sketch of this alignment follows).
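The label-alignment step is the subtle part: XLM-RoBERTa splits words into subword pieces, and only the first piece of each word should carry the word's label; the rest are masked with -100 so the loss ignores them. A minimal sketch (the helper name tokenize_and_align_labels is illustrative, not taken from the original training script):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize_and_align_labels(examples):
    # Tokenize pre-split words; is_split_into_words=True preserves word boundaries.
    tokenized = tokenizer(examples["tokens"], truncation=True,
                          is_split_into_words=True)
    all_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                # Special tokens (<s>, </s>) get -100 so the loss ignores them.
                label_ids.append(-100)
            elif word_id != previous_word:
                # The first subword of each word carries the word's label.
                label_ids.append(labels[word_id])
            else:
                # Remaining subwords of the same word are masked out.
                label_ids.append(-100)
            previous_word = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized
```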
Training Procedure
Here's a brief overview of the training procedure for the XLM-RoBERTa NER model (a condensed code sketch follows the list):
1. Setup Environment:
   - Clone the repository and set up dependencies.
   - Import the necessary libraries and modules.
2. Load Data:
   - Load the PAN-X subset from the XTREME dataset.
   - Shuffle and sample data subsets for training and evaluation.
3. Data Preparation:
   - Convert the raw dataset into a format suitable for token classification.
   - Define a mapping for entity tags and apply tokenization.
   - Align NER tags with the tokenized inputs.
4. Define Model:
   - Initialize the XLM-RoBERTa model for token classification.
   - Configure the model with the number of labels in the dataset.
5. Setup Training Arguments:
   - Define hyperparameters such as learning rate, batch size, number of epochs, and evaluation strategy.
   - Configure logging and checkpointing.
6. Initialize Trainer:
   - Create a Trainer instance with the model, training arguments, datasets, and data collator.
   - Specify the evaluation metrics used to monitor performance.
7. Train the Model:
   - Start training with the Trainer.
   - Monitor training progress and metrics.
8. Evaluation and Results:
   - Evaluate the model on the validation set.
   - Compute metrics such as the F1 score to assess performance.
9. Save and Push Model:
   - Save the fine-tuned model locally or push it to a model hub for sharing and reuse.
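Condensing these steps, a minimal end-to-end sketch might look like the following. It reuses the tokenize_and_align_labels helper sketched above; the hyperparameter values are illustrative assumptions, not the exact settings used for this model.

```python
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

# Load the German split of the PAN-X subset of XTREME.
panx_de = load_dataset("xtreme", name="PAN-X.de")
tags = panx_de["train"].features["ner_tags"].feature  # ClassLabel with 7 tags

# Tokenize and align labels, reusing the helper sketched above.
encoded = panx_de.map(tokenize_and_align_labels, batched=True,
                      remove_columns=["tokens", "ner_tags", "langs"])

model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=tags.num_classes,
    id2label={i: tags.int2str(i) for i in range(tags.num_classes)},
    label2id={tags.int2str(i): i for i in range(tags.num_classes)},
)

args = TrainingArguments(
    output_dir="xlm-roberta-base-panx-de",
    learning_rate=5e-5,              # illustrative, not the card's setting
    per_device_train_batch_size=24,  # illustrative
    num_train_epochs=3,              # illustrative
    evaluation_strategy="epoch",     # renamed to eval_strategy in newer transformers
    save_strategy="epoch",
    logging_steps=100,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # see the Evaluation section below
)

trainer.train()
trainer.evaluate()
trainer.push_to_hub()  # optional: share the fine-tuned model on the Hub
```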
Evaluation
The model's performance is evaluated using the span-level F1 score for NER. Predictions are aligned with the gold-standard labels, and sub-token predictions are ignored where appropriate, as in the sketch below.
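A compute_metrics function along these lines is a common way to implement this with the Trainer; this sketch assumes the seqeval package is installed and uses the index2tag mapping shown earlier. The -100 values come from the label-alignment step above.

```python
import numpy as np
from seqeval.metrics import f1_score  # assumes seqeval is installed

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=-1)
    y_true, y_pred = [], []
    for pred_row, label_row in zip(predictions, labels):
        true_tags, pred_tags = [], []
        for p, l in zip(pred_row, label_row):
            if l != -100:  # skip special tokens and non-initial subwords
                true_tags.append(index2tag[int(l)])
                pred_tags.append(index2tag[int(p)])
        y_true.append(true_tags)
        y_pred.append(pred_tags)
    # seqeval scores complete entity spans, not individual tokens.
    return {"f1": f1_score(y_true, y_pred)}
```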
Usage
```python
import torch
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_checkpoint = "MassMin/Multilingual-NER-tagging"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint).to(device)

ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    framework="pt",
    device=0 if torch.cuda.is_available() else -1,
)

def tag_text_with_pipeline(text, ner_pipeline):
    # Use the NER pipeline to get per-token predictions.
    results = ner_pipeline(text)
    # Convert the results to a DataFrame for easy viewing.
    df = pd.DataFrame(results)
    df = df[["word", "entity", "score"]]
    df.columns = ["Tokens", "Tags", "Score"]  # rename columns for clarity
    return df

text = "2000 Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern ."
result = tag_text_with_pipeline(text, ner_pipeline)
print(result)
```
#### Example Output

|        | 0     | 1          | 2  | 3   | 4        | 5     | 6  | 7   | 8          | 9            | 10      | 11 |
|--------|-------|------------|----|-----|----------|-------|----|-----|------------|--------------|---------|----|
| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | .  |
| Tags   | O     | O          | O  | O   | B-LOC    | I-LOC | O  | O   | B-LOC      | B-LOC        | I-LOC   | O  |
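Note that the pipeline reports one prediction per subword-initial token. If whole entity spans are preferred (e.g. "Danziger Bucht" as a single LOC), transformers' pipeline() accepts aggregation_strategy="simple", which merges adjacent B-/I- tagged tokens into one span.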