---
license: mit
language:
- multilingual
tags:
- zero-shot-classification
- text-classification
- pytorch
metrics:
- accuracy
- f1-score
---

# xlm-roberta-large-hungarian-publicopinion-cap-v3

## Model description

An `xlm-roberta-large` model fine-tuned on multilingual training data containing texts of the `publicopinion` domain labelled with [major topic codes](https://www.comparativeagendas.net/pages/master-codebook) from the [Comparative Agendas Project](https://www.comparativeagendas.net/).

## How to use the model

#### Loading and tokenizing input data

```python
import pandas as pd
import numpy as np
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Maps the model's output indices to CAP major topic codes.
CAP_NUM_DICT = {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6',
                6: '7', 7: '8', 8: '9', 9: '10', 10: '12', 11: '13', 12: '14',
                13: '15', 14: '16', 15: '17', 16: '18', 17: '19', 18: '20',
                19: '21', 20: '23', 21: '999'}

# MAXLEN is not defined in the original snippet; 256 is an assumed value.
# Adjust it to your texts (XLM-RoBERTa accepts at most 512 tokens).
MAXLEN = 256

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-large')
num_labels = len(CAP_NUM_DICT)

def tokenize_dataset(batch: dict):
    # With batched=True, batch["text"] is a list of strings.
    tokenized = tokenizer(batch["text"],
                          max_length=MAXLEN,
                          truncation=True,
                          padding="max_length")
    return tokenized

# `data` is assumed to be your own pandas DataFrame with a "text" column;
# the single row below is only a placeholder.
data = pd.DataFrame({"text": ["Example sentence to classify."]})

hg_data = Dataset.from_pandas(data)
dataset = hg_data.map(tokenize_dataset, batched=True, remove_columns=hg_data.column_names)
```

#### Inference using the Trainer class

```python
model = AutoModelForSequenceClassification.from_pretrained(
    'poltextlab/xlm-roberta-large-hungarian-publicopinion-cap-v3',
    num_labels=num_labels,
    problem_type="multi_label_classification",
    ignore_mismatched_sizes=True
)

training_args = TrainingArguments(
    output_dir='.',
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8
)

trainer = Trainer(
    model=model,
    args=training_args
)

# `predictions` holds the raw model outputs (logits), one row per input text.
logits = trainer.predict(test_dataset=dataset).predictions

# Take the highest-scoring class per row and map its index to a CAP code.
predicted = pd.DataFrame(np.argmax(logits, axis=1)).replace({0: CAP_NUM_DICT}).rename(
    columns={0: 'predicted'}).reset_index(drop=True)
```

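The array returned by `trainer.predict(...).predictions` holds raw scores (logits) rather than normalized probabilities. If a confidence value per prediction is useful, the scores can be normalized first. The following is a minimal sketch, not part of the original card: the helper name `logits_to_predictions` and the toy input are hypothetical, and softmax is assumed because the snippet above picks a single class with `argmax` (under a multi-label reading, a per-class sigmoid would be the natural alternative).

```python
import numpy as np
import pandas as pd

# Same index-to-CAP-code mapping as in the loading snippet.
CAP_NUM_DICT = {0: '1', 1: '2', 2: '3', 3: '4', 4: '5', 5: '6',
                6: '7', 7: '8', 8: '9', 9: '10', 10: '12', 11: '13', 12: '14',
                13: '15', 14: '16', 15: '17', 16: '18', 17: '19', 18: '20',
                19: '21', 20: '23', 21: '999'}


def logits_to_predictions(logits: np.ndarray) -> pd.DataFrame:
    """Hypothetical helper: softmax over classes, then top class and its probability."""
    # Subtract the row-wise maximum for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    top = probs.argmax(axis=1)
    return pd.DataFrame({
        "predicted": [CAP_NUM_DICT[int(i)] for i in top],
        "confidence": probs.max(axis=1),
    })


# Toy logits standing in for `trainer.predict(...).predictions`.
rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(3, len(CAP_NUM_DICT)))
toy_logits[0, 2] = 10.0  # make class index 2 (CAP code '3') dominate the first row
result = logits_to_predictions(toy_logits)
print(result)
```

Softmax does not change which class has the highest score, so the `predicted` column agrees with a plain `argmax` over the logits.
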
### Fine-tuning procedure

`xlm-roberta-large-hungarian-publicopinion-cap-v3` was fine-tuned using the Hugging Face Trainer class with the following hyperparameters:

```python
# `model_dir` is defined elsewhere in the training script (not shown here).
training_args = TrainingArguments(
    output_dir=f"../model/{model_dir}/tmp/",
    logging_dir=f"../logs/{model_dir}/",
    logging_strategy='epoch',
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-06,
    seed=42,
    save_strategy='epoch',
    evaluation_strategy='epoch',
    save_total_limit=1,
    load_best_model_at_end=True
)
```

We also used an `EarlyStoppingCallback` with a patience of 2 epochs.

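To make the effect of a patience of 2 concrete, here is a minimal plain-Python sketch of patience-based early stopping. It is an illustration only, not the Trainer's internal code: the function name `stopping_epoch` and the loss values are hypothetical, and it assumes the monitored metric is the evaluation loss (lower is better), which the card does not state. In the Trainer API the callback is typically attached with `callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]`.

```python
def stopping_epoch(eval_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    Training stops once the evaluation loss has failed to improve on the
    best value seen so far for `patience` consecutive evaluations.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(eval_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# Hypothetical per-epoch evaluation losses.
losses = [0.90, 0.60, 0.45, 0.47, 0.46, 0.44]
stopped_at = stopping_epoch(losses, patience=2)
print(stopped_at)
```

With these made-up losses the best value is reached at epoch 3, and epochs 4 and 5 both fail to improve on it, so training would stop at epoch 5 even though `num_train_epochs` allows 10.
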
## Model performance

The model was evaluated on a test set of 586 examples (10% of the available data).<br>
Model accuracy is **0.94**.

| label        |   precision |   recall |   f1-score |   support |
|:-------------|------------:|---------:|-----------:|----------:|
| 0            |        0.98 |     0.96 |       0.97 |       101 |
| 1            |        0.92 |     0.9  |       0.91 |        39 |
| 2            |        0.96 |     1    |       0.98 |        75 |
| 3            |        0.9  |     0.9  |       0.9  |        21 |
| 4            |        0.86 |     1    |       0.92 |        24 |
| 5            |        1    |     0.82 |       0.9  |        11 |
| 6            |        1    |     1    |       1    |        23 |
| 7            |        0.93 |     1    |       0.97 |        28 |
| 8            |        0    |     0    |       0    |         2 |
| 9            |        0.92 |     0.94 |       0.93 |        48 |
| 10           |        0.92 |     0.97 |       0.94 |        67 |
| 11           |        0    |     0    |       0    |         4 |
| 12           |        0.83 |     0.83 |       0.83 |         6 |
| 13           |        0.91 |     0.84 |       0.87 |        25 |
| 14           |        0.94 |     0.94 |       0.94 |        18 |
| 15           |        1    |     0.57 |       0.73 |         7 |
| 16           |        0    |     0    |       0    |         0 |
| 17           |        0.93 |     0.98 |       0.95 |        51 |
| 18           |        0.97 |     1    |       0.99 |        34 |
| 19           |        0    |     0    |       0    |         2 |
| macro avg    |        0.75 |     0.73 |       0.74 |       586 |
| weighted avg |        0.93 |     0.94 |       0.93 |       586 |

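The gap between the macro and weighted averages in the table comes from how each is computed: the macro average treats every label equally, while the weighted average weights each label by its support. The short sketch below recomputes both F1 averages from the per-label rows of the table (values copied from it; the variable names are illustrative).

```python
# Per-label F1 scores and supports, copied from the table (labels 0 to 19).
f1 = [0.97, 0.91, 0.98, 0.90, 0.92, 0.90, 1.00, 0.97, 0.00, 0.93,
      0.94, 0.00, 0.83, 0.87, 0.94, 0.73, 0.00, 0.95, 0.99, 0.00]
support = [101, 39, 75, 21, 24, 11, 23, 28, 2, 48,
           67, 4, 6, 25, 18, 7, 0, 51, 34, 2]

# Macro average: unweighted mean over labels.
macro_f1 = sum(f1) / len(f1)

# Weighted average: mean over labels, weighted by the number of test examples.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print(round(macro_f1, 2), round(weighted_f1, 2))
```

The four labels with an F1 of 0 (labels 8, 11, 16 and 19) together account for only 8 of the 586 test examples, so they pull the macro average down to 0.74 while barely affecting the weighted average of 0.93.
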
## Inference platform

This model is used by the [CAP Babel Machine](https://babel.poltextlab.com), an open-source and free natural language processing tool, designed to simplify and speed up projects for comparative research.

## Cooperation

Model performance can be significantly improved by extending our training sets. We appreciate every submission of CAP-coded corpora (of any domain and language) at poltextlab{at}poltextlab{dot}com or by using the [CAP Babel Machine](https://babel.poltextlab.com).

## Debugging and issues

This architecture uses the `sentencepiece` tokenizer. To run the model with `transformers` versions earlier than 4.27, you need to install `sentencepiece` manually.

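A quick way to check whether `sentencepiece` is already available before loading the tokenizer is sketched below. This is an optional convenience, not part of the original instructions; the helper name `has_sentencepiece` is hypothetical, and the install command in the message assumes a standard `pip` setup.

```python
import importlib.util


def has_sentencepiece() -> bool:
    """Return True if the `sentencepiece` package can be imported in this environment."""
    return importlib.util.find_spec("sentencepiece") is not None


if not has_sentencepiece():
    # Assumes pip is the package manager in use.
    print("sentencepiece is missing; install it with: pip install sentencepiece")
```
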
If you encounter a `RuntimeError` when loading the model using the `from_pretrained()` method, adding `ignore_mismatched_sizes=True` should solve the issue.