MassMin committed (verified) · Commit a17943d · Parent: 7ce1555

Update README.md

Files changed (1): README.md (+21 -1)

README.md CHANGED
@@ -56,8 +56,11 @@ Automated text analysis for businesses
 
 ## Training Details
 Base Model: xlm-roberta-base
+
 Training Dataset: The model is trained on the PAN-X subset of the XTREME dataset, which includes labeled NER data for multiple languages.
+
 Training Framework: Hugging Face transformers library with a PyTorch backend.
+
 Data Preprocessing: Tokenization was performed with the XLM-RoBERTa tokenizer, with attention paid to aligning token labels to subword tokens.
 
@@ -69,36 +72,53 @@ Here's a brief overview of the training procedure for the XLM-RoBERTa model for
 Setup Environment:
 
 Clone the repository and set up dependencies.
+
 Import necessary libraries and modules.
+
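
A minimal sketch of this step, assuming the standard Hugging Face stack the card names (exact package versions are not pinned by the card):

```python
# Assumed dependencies: transformers and a PyTorch backend are named by the
# card; datasets is implied by loading XTREME.
#   pip install transformers datasets torch

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)
```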
 
 Load Data:
 
 Load the PAN-X subset from the XTREME dataset.
+
 Shuffle and sample data subsets for training and evaluation.
+
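
A sketch of loading one PAN-X language config; "PAN-X.de" (German) is an assumption based on the German example sentence later in the card:

```python
# XTREME exposes PAN-X as per-language configs, e.g. "PAN-X.de".
panx = load_dataset("xtreme", name="PAN-X.de")

# Shuffle, then sample smaller subsets; the sizes here are illustrative.
train_ds = panx["train"].shuffle(seed=42).select(range(10_000))
eval_ds = panx["validation"].shuffle(seed=42).select(range(2_000))
```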
 
 Data Preparation:
 
 Convert raw dataset into a format suitable for token classification.
+
 Define a mapping for entity tags and apply tokenization.
+
 Align NER tags with tokenized inputs.
+
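
The alignment described here is typically done with the fast tokenizer's word_ids(); a sketch, assuming the usual convention of labeling only the first subword of each word and masking the rest with -100 (the loss's ignore index):

```python
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")  # fast tokenizer

# PAN-X rows carry pre-split "tokens" and integer "ner_tags".
tags = panx["train"].features["ner_tags"].feature  # ClassLabel with IOB2 names
index2tag = dict(enumerate(tags.names))

def tokenize_and_align_labels(batch):
    tokenized = tokenizer(batch["tokens"], truncation=True,
                          is_split_into_words=True)
    labels = []
    for i, ner_tags in enumerate(batch["ner_tags"]):
        prev, row = None, []
        for wid in tokenized.word_ids(batch_index=i):
            # Special tokens and continuation subwords get -100 (ignored by the loss).
            row.append(-100 if wid is None or wid == prev else ner_tags[wid])
            prev = wid
        labels.append(row)
    tokenized["labels"] = labels
    return tokenized

train_tok = train_ds.map(tokenize_and_align_labels, batched=True,
                         remove_columns=train_ds.column_names)
eval_tok = eval_ds.map(tokenize_and_align_labels, batched=True,
                       remove_columns=eval_ds.column_names)
```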
 
 Define Model:
 
 Initialize the XLM-RoBERTa model for token classification.
+
 Configure the model with the number of labels based on the dataset.
+
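
Continuing the sketch (tags and index2tag come from the preparation block above):

```python
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=tags.num_classes,  # 7 IOB2 tags for PAN-X
    id2label=index2tag,
    label2id={t: i for i, t in index2tag.items()},
)
```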
 
 Setup Training Arguments:
 
 Define hyperparameters such as learning rate, batch size, number of epochs, and evaluation strategy.
+
 Configure logging and checkpointing.
+
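
The card does not pin hyperparameter values, so the numbers below are placeholders:

```python
training_args = TrainingArguments(
    output_dir="xlm-roberta-base-panx",  # hypothetical output path
    learning_rate=2e-5,                  # illustrative, not the card's values
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
    save_strategy="epoch",  # checkpoint alongside each evaluation
    logging_steps=50,
)
```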
 
 Initialize Trainer:
 
 Create a Trainer instance with the model, training arguments, datasets, and data collator.
+
 Specify evaluation metrics to monitor performance.
+
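
A sketch of the Trainer wiring; seqeval as the F1 backend is an assumption (the card only says an F1 score is computed):

```python
import numpy as np
from seqeval.metrics import f1_score  # assumed metric backend

data_collator = DataCollatorForTokenClassification(tokenizer)

def compute_metrics(eval_pred):
    logits, label_ids = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Keep only positions with real labels (drop the -100 ignore index) and
    # map ids back to tag strings for seqeval.
    y_true, y_pred = [], []
    for p_row, l_row in zip(preds, label_ids):
        t, p = [], []
        for pi, li in zip(p_row, l_row):
            if li != -100:
                t.append(index2tag[int(li)])
                p.append(index2tag[int(pi)])
        y_true.append(t)
        y_pred.append(p)
    return {"f1": f1_score(y_true, y_pred)}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tok,
    eval_dataset=eval_tok,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
```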
 
 Train the Model:
 
 Start the training process using the Trainer.
+
 Monitor training progress and metrics.
+
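
Training and monitoring then reduce to:

```python
trainer.train()

# Loss and metric records accumulate here as training runs.
for entry in trainer.state.log_history[-3:]:
    print(entry)
```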
 
 Evaluation and Results:
 
 Evaluate the model on the validation set.
+
 Compute metrics like F1 score for performance assessment.
+
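
With the compute_metrics function above, evaluation reports the F1 under an eval_ prefix:

```python
metrics = trainer.evaluate()  # runs on eval_dataset (the validation split)
print(f"validation F1: {metrics['eval_f1']:.4f}")
```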
 
 Save and Push Model:
 
 Save the fine-tuned model locally or push to a model hub for sharing and further use.
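
Both options in one sketch (the output path and commit message are hypothetical):

```python
trainer.save_model("xlm-roberta-base-panx")        # local: config + weights
tokenizer.save_pretrained("xlm-roberta-base-panx")

# Or share via the Hugging Face Hub (requires a logged-in token):
# trainer.push_to_hub(commit_message="Training complete")
```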

@@ -140,7 +160,7 @@ def tag_text_with_pipeline(text, ner_pipeline):
     df.columns = ['Tokens', 'Tags', 'Score']  # Rename columns for clarity
     return df
 
-text = "Jeff Dean works at Google in California."
+text = "Einwohnern an der Danziger Bucht in der polnischen Woiwodschaft Pommern ."
 result = tag_text_with_pipeline(text, ner_pipeline)
 print(result)
 
 