---
metrics:
- accuracy
pipeline_tag: token-classification
tags:
- code
- map
- News
- Customer Support
- chatbot
language:
- de
- en
---
# XLM-RoBERTa Token Classification for Named Entity Recognition (NER)
## Model Description
This model is a fine-tuned version of XLM-RoBERTa (`xlm-roberta-base`) for Named Entity Recognition (NER). It was trained on the German portion of the PAN-X subset of the XTREME dataset and identifies the following entity types:
- **PER**: person names
- **ORG**: organization names
- **LOC**: location names
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6625c89b3b64b5270e95bbe9/ef0A5MMJ-NTXTCTcQRmIW.png)
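PAN-X annotates these entities in the IOB2 scheme, so the model predicts seven labels in total. A minimal sketch of the corresponding label maps (the standard PAN-X tag set, assumed to match this checkpoint's config):

```python
# Standard PAN-X / WikiANN IOB2 label set (assumed to match this model's config).
tags = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
index2tag = {idx: tag for idx, tag in enumerate(tags)}
tag2index = {tag: idx for idx, tag in enumerate(tags)}
```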
## Uses
This model is suitable for multilingual NER tasks, especially where person, organization, and location names must be extracted and classified from text in different languages.

Applications:
- Information extraction
- Multilingual NER tasks
- Automated text analysis for businesses
## Training Details
- **Base model:** xlm-roberta-base
- **Training dataset:** the PAN-X subset of the XTREME dataset, which includes labeled NER data for multiple languages.
- **Training framework:** Hugging Face `transformers` library with a PyTorch backend.
- **Data preprocessing:** tokenization with the XLM-RoBERTa tokenizer, taking care to align the entity labels with the resulting sub-word tokens (see the sketch below).
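Because XLM-RoBERTa splits words into sub-word pieces, each word-level NER tag has to be propagated to the first piece of its word while the remaining pieces are masked out. A minimal sketch of that alignment step (PAN-X examples provide `tokens` and `ner_tags` columns; function and variable names are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize_and_align_labels(batch):
    # Tokenize pre-split words; one word may become several sub-word pieces.
    tokenized = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, ner_tags in enumerate(batch["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word_id = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None or word_id == previous_word_id:
                # Special tokens and continuation pieces get -100 so that
                # the loss and metrics ignore them.
                label_ids.append(-100)
            else:
                label_ids.append(ner_tags[word_id])
            previous_word_id = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized
```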
### Training Procedure
A brief overview of the training procedure (minimal sketches of the data-loading and training steps follow this list):

1. **Set up the environment:** clone the repository, set up dependencies, and import the necessary libraries and modules.
2. **Load the data:** load the PAN-X subset of the XTREME dataset, then shuffle and sample data subsets for training and evaluation.
3. **Prepare the data:** convert the raw dataset into a format suitable for token classification, define a mapping for the entity tags, apply tokenization, and align the NER tags with the tokenized inputs.
4. **Define the model:** initialize the XLM-RoBERTa model for token classification, configured with the number of labels in the dataset.
5. **Set up the training arguments:** define hyperparameters such as learning rate, batch size, number of epochs, and evaluation strategy; configure logging and checkpointing.
6. **Initialize the Trainer:** create a `Trainer` instance with the model, training arguments, datasets, and data collator, and specify the evaluation metrics to monitor.
7. **Train the model:** start the training process with the `Trainer` and monitor training progress and metrics.
8. **Evaluate:** evaluate the model on the validation set and compute metrics such as the F1 score.
9. **Save and push:** save the fine-tuned model locally or push it to the Hub for sharing and further use.
#### Evaluation Metric
The model's performance is evaluated with the F1 score for NER. Predictions are aligned with the gold-standard labels, and predictions on sub-word pieces are ignored where appropriate; a minimal sketch of such a metric function follows.
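A sketch of the alignment and F1 computation, using `seqeval` for span-level scoring (function names are illustrative; `index2tag` is the label map defined above):

```python
import numpy as np
from seqeval.metrics import f1_score

def align_predictions(predictions, label_ids):
    # predictions: (batch, seq_len, num_labels) logits; label_ids: (batch, seq_len)
    preds = np.argmax(predictions, axis=2)
    batch_size, seq_len = preds.shape
    labels_list, preds_list = [], []
    for b in range(batch_size):
        example_labels, example_preds = [], []
        for s in range(seq_len):
            # Skip positions labelled -100 (special tokens and sub-word pieces).
            if label_ids[b, s] != -100:
                example_labels.append(index2tag[label_ids[b, s]])
                example_preds.append(index2tag[preds[b, s]])
        labels_list.append(example_labels)
        preds_list.append(example_preds)
    return preds_list, labels_list

def compute_metrics(eval_pred):
    preds_list, labels_list = align_predictions(
        eval_pred.predictions, eval_pred.label_ids
    )
    return {"f1": f1_score(labels_list, preds_list)}
```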
## Evaluation
The example below loads the fine-tuned model from the Hub and tags a German sentence with the `transformers` NER pipeline:
```python
import torch
import pandas as pd
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_checkpoint = "MassMin/Multilingual-NER-tagging"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint).to(device)

ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    framework="pt",
    device=0 if torch.cuda.is_available() else -1,
)

def tag_text_with_pipeline(text, ner_pipeline):
    # Use the NER pipeline to get per-token predictions.
    results = ner_pipeline(text)
    # Convert the results to a DataFrame for easy viewing.
    df = pd.DataFrame(results)
    df = df[["word", "entity", "score"]]
    df.columns = ["Tokens", "Tags", "Score"]  # Rename columns for clarity
    return df

text = "2000 Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern ."
result = tag_text_with_pipeline(text, ner_pipeline)
print(result)
```
#### Example Output

|        | 0     | 1          | 2  | 3   | 4        | 5     | 6  | 7   | 8          | 9            | 10      | 11 |
|--------|-------|------------|----|-----|----------|-------|----|-----|------------|--------------|---------|----|
| Tokens | 2.000 | Einwohnern | an | der | Danziger | Bucht | in | der | polnischen | Woiwodschaft | Pommern | .  |
| Tags   | O     | O          | O  | O   | B-LOC    | I-LOC | O  | O   | B-LOC      | B-LOC        | I-LOC   | O  |
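If whole entity spans are preferred over per-piece tags, recent `transformers` versions can merge sub-word predictions directly in the pipeline via `aggregation_strategy` (a hedged variant of the example above, reusing the `model` and `tokenizer` already loaded):

```python
# Group sub-word pieces into whole entity spans.
grouped_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=0 if torch.cuda.is_available() else -1,
)
print(grouped_pipeline(text))
```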