MyPoliBERT-ver02

Model Overview

MyPoliBERT-ver02 is a fine-tuned version of bert-base-uncased designed for multi-label and multi-class classification of political texts in Malaysia. It predicts 12 political topics (Democracy, Economy, Race, Leadership, Development, Corruption, Instability, Safety, Administration, Education, Religion, Environment) and their associated sentiments (Unknown: 0, Negative: 1, Neutral: 2, Positive: 3). The model is optimized for texts sourced from Malaysian contexts, including social media, news articles, and political discussions.

Evaluation and Performance

The model achieves the following results on the evaluation set (the figures match the epoch 9 row of the training results):

  • Overall Metrics:

    • Loss: 0.2953
    • F1 Score: 0.9255
    • Accuracy: 0.9256
  • Topic-Specific Metrics:

Topic | F1 Score | Accuracy
Democracy | 0.9401 | 0.9392
Economy | 0.9191 | 0.9182
Race | 0.9521 | 0.9516
Leadership | 0.8198 | 0.8196
Development | 0.8877 | 0.8869
Corruption | 0.9487 | 0.9498
Instability | 0.9254 | 0.9283
Safety | 0.9209 | 0.9207
Administration | 0.8993 | 0.9019
Education | 0.9632 | 0.9632
Religion | 0.9557 | 0.9554
Environment | 0.9734 | 0.9729
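
The overall figures appear consistent with an unweighted mean of the per-topic figures. This is an inference from the reported numbers, not something the card states, so the sketch below should be read as a sanity check rather than as the authors' metric code. The `per_topic` dictionary simply restates the table above.

```python
# Per-topic (F1, accuracy) pairs copied from the topic-specific metrics table.
per_topic = {
    "Democracy": (0.9401, 0.9392),
    "Economy": (0.9191, 0.9182),
    "Race": (0.9521, 0.9516),
    "Leadership": (0.8198, 0.8196),
    "Development": (0.8877, 0.8869),
    "Corruption": (0.9487, 0.9498),
    "Instability": (0.9254, 0.9283),
    "Safety": (0.9209, 0.9207),
    "Administration": (0.8993, 0.9019),
    "Education": (0.9632, 0.9632),
    "Religion": (0.9557, 0.9554),
    "Environment": (0.9734, 0.9729),
}

# Assumption: "overall" means the unweighted average over the 12 topics.
mean_f1 = sum(f1 for f1, _ in per_topic.values()) / len(per_topic)
mean_acc = sum(acc for _, acc in per_topic.values()) / len(per_topic)

print(f"mean F1: {mean_f1:.4f}, mean accuracy: {mean_acc:.4f}")
```

Both averages land within rounding distance of the reported overall F1 (0.9255) and accuracy (0.9256).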

Dataset

  • Data Sources:

    • tnwei/ms-newspapers dataset
    • Malaysian political posts from Reddit
    • Malaysian political posts from Instagram
    • Malaysian political posts from Facebook

    These sources were combined into a single dataset of approximately 30,268 records, of which 80% was used for training and 20% was reserved for validation.

  • Task:
    The model performs multi-task learning, simultaneously predicting 12 topics and their respective sentiment classes.
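
To make the target layout concrete, the sketch below encodes one record's annotations as a 12-element vector of sentiment ids, one slot per topic. The topic order and sentiment ids follow the lists in this card. The `encode_labels` helper, the input format, and the choice to fill unannotated topics with Unknown (0) are assumptions made for illustration, not the authors' confirmed preprocessing.

```python
# Topic order as listed in the model overview.
TOPICS = [
    "Democracy", "Economy", "Race", "Leadership", "Development", "Corruption",
    "Instability", "Safety", "Administration", "Education", "Religion", "Environment",
]
SENTIMENT_IDS = {"Unknown": 0, "Negative": 1, "Neutral": 2, "Positive": 3}


def encode_labels(annotations):
    """Map {topic: sentiment name} to a list of 12 sentiment ids.

    Hypothetical helper: topics missing from `annotations` are assumed
    to be labelled Unknown (0).
    """
    unexpected = set(annotations) - set(TOPICS)
    if unexpected:
        raise ValueError(f"unrecognised topics: {sorted(unexpected)}")
    return [SENTIMENT_IDS[annotations.get(topic, "Unknown")] for topic in TOPICS]


example = encode_labels(
    {"Economy": "Negative", "Leadership": "Positive", "Corruption": "Negative"}
)
print(example)
```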

Model Architecture

  • Base Model: bert-base-uncased
  • Output Layer: The model generates logits for 12 topics, each with four sentiment classes (Unknown, Negative, Neutral, Positive).
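
As a rough illustration of how such an output could be read, the sketch below takes the highest-scoring sentiment class for each topic. It assumes the logits for one text arrive as 12 rows of 4 scores in the topic order listed in this card; the actual tensor layout is not specified here, and `decode_predictions` is a hypothetical helper written in plain Python so it runs without the model.

```python
TOPICS = [
    "Democracy", "Economy", "Race", "Leadership", "Development", "Corruption",
    "Instability", "Safety", "Administration", "Education", "Religion", "Environment",
]
SENTIMENTS = ["Unknown", "Negative", "Neutral", "Positive"]  # index = class id


def decode_predictions(logits):
    """Return {topic: sentiment name} by taking the argmax of each row.

    Assumes `logits` is a sequence of 12 rows with 4 scores each.
    """
    if len(logits) != len(TOPICS):
        raise ValueError("expected one row of logits per topic")
    decoded = {}
    for topic, scores in zip(TOPICS, logits):
        if len(scores) != len(SENTIMENTS):
            raise ValueError("expected four sentiment scores per topic")
        best = max(range(len(scores)), key=lambda i: scores[i])
        decoded[topic] = SENTIMENTS[best]
    return decoded


# Toy logits: every topic leans towards Unknown except Economy.
toy_logits = [[2.0, 0.1, 0.1, 0.1] for _ in TOPICS]
toy_logits[TOPICS.index("Economy")] = [0.1, 3.0, 0.2, 0.1]

predictions = decode_predictions(toy_logits)
print(predictions["Economy"])
```

With real model outputs, the same argmax would typically be applied along the last dimension of a tensor reshaped to (batch, 12, 4), if that is how the head is laid out.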

Training procedure

  • Training hyperparameters

    • learning_rate: 5e-05
    • train_batch_size: 16
    • eval_batch_size: 16
    • seed: 42
    • gradient_accumulation_steps: 2
    • total_train_batch_size: 32
    • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
    • lr_scheduler_type: linear
    • num_epochs: 16
    • mixed_precision_training: Native AMP
  • Training Configuration (TrainingArguments):

    • evaluation_strategy: "epoch"
    • save_strategy: "epoch"
    • load_best_model_at_end: True
    • metric_for_best_model: "overall_f1"
    • greater_is_better: True
  • Custom Trainer:
    The compute_loss method calculates a cross-entropy loss for each of the 12 topic labels and averages these losses across all topics.
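
The averaging described for the custom trainer can be sketched without any framework. The code below is not the authors' `compute_loss`; it is a plain-Python illustration for a single example, assuming 12 rows of 4 logits and one integer target per topic. The function names are hypothetical. In a PyTorch trainer the per-topic term would more likely come from `torch.nn.functional.cross_entropy`.

```python
import math


def cross_entropy(scores, target):
    """Cross-entropy of one softmax distribution against an integer class id."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    log_normaliser = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return log_normaliser - scores[target]


def multi_topic_loss(logits, targets):
    """Average the per-topic cross-entropy losses for one example."""
    if len(logits) != len(targets):
        raise ValueError("need one target per topic")
    per_topic = [cross_entropy(s, t) for s, t in zip(logits, targets)]
    return sum(per_topic) / len(per_topic)


# With uniform logits every class has probability 1/4, so the loss is log(4).
uniform_loss = multi_topic_loss([[0.0] * 4 for _ in range(12)], [0] * 12)
print(uniform_loss)
```

Averaging rather than summing keeps the loss on the same scale as a single four-class problem, regardless of the number of topics.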

Training results

Training Loss | Epoch | Step | Validation Loss | Democracy F1 | Democracy Accuracy | Economy F1 | Economy Accuracy | Race F1 | Race Accuracy | Leadership F1 | Leadership Accuracy | Development F1 | Development Accuracy | Corruption F1 | Corruption Accuracy | Instability F1 | Instability Accuracy | Safety F1 | Safety Accuracy | Administration F1 | Administration Accuracy | Education F1 | Education Accuracy | Religion F1 | Religion Accuracy | Environment F1 | Environment Accuracy | Overall F1 | Overall Accuracy
0.4126 | 1.0 | 757 | 0.2492 | 0.9175 | 0.9283 | 0.9010 | 0.9096 | 0.9374 | 0.9433 | 0.7897 | 0.7896 | 0.8463 | 0.8655 | 0.9349 | 0.9422 | 0.9088 | 0.9201 | 0.9165 | 0.9199 | 0.8760 | 0.8918 | 0.9546 | 0.9592 | 0.9480 | 0.9493 | 0.9704 | 0.9713 | 0.9084 | 0.9158
0.2014 | 2.0 | 1514 | 0.2217 | 0.9346 | 0.9389 | 0.9180 | 0.9191 | 0.9508 | 0.9528 | 0.8148 | 0.8195 | 0.8814 | 0.8905 | 0.9433 | 0.9427 | 0.9133 | 0.9106 | 0.9155 | 0.9153 | 0.8913 | 0.9052 | 0.9641 | 0.9658 | 0.9513 | 0.9519 | 0.9723 | 0.9741 | 0.9209 | 0.9239
0.1422 | 3.0 | 2271 | 0.2244 | 0.9369 | 0.9397 | 0.9202 | 0.9224 | 0.9503 | 0.9500 | 0.8140 | 0.8188 | 0.8824 | 0.8888 | 0.9453 | 0.9447 | 0.9227 | 0.9250 | 0.9182 | 0.9172 | 0.8926 | 0.9014 | 0.9646 | 0.9658 | 0.9529 | 0.9549 | 0.9745 | 0.9746 | 0.9229 | 0.9253
0.098 | 4.0 | 3028 | 0.2310 | 0.9373 | 0.9405 | 0.9239 | 0.9257 | 0.9532 | 0.9541 | 0.8112 | 0.8142 | 0.8894 | 0.8941 | 0.9470 | 0.9466 | 0.9233 | 0.9243 | 0.9180 | 0.9196 | 0.8967 | 0.9068 | 0.9619 | 0.9630 | 0.9527 | 0.9534 | 0.9742 | 0.9751 | 0.9241 | 0.9265
0.0722 | 5.0 | 3785 | 0.2507 | 0.9368 | 0.9379 | 0.9254 | 0.9277 | 0.9531 | 0.9536 | 0.8115 | 0.8117 | 0.8862 | 0.8880 | 0.9424 | 0.9405 | 0.9262 | 0.9285 | 0.9139 | 0.9126 | 0.8961 | 0.8987 | 0.9631 | 0.9642 | 0.9548 | 0.9552 | 0.9750 | 0.9746 | 0.9237 | 0.9244
0.053 | 6.0 | 4542 | 0.2619 | 0.9405 | 0.9424 | 0.9216 | 0.9220 | 0.9536 | 0.9546 | 0.8155 | 0.8132 | 0.8877 | 0.8907 | 0.9522 | 0.9542 | 0.9215 | 0.9179 | 0.9210 | 0.9207 | 0.8976 | 0.8999 | 0.9611 | 0.9605 | 0.9547 | 0.9544 | 0.9750 | 0.9751 | 0.9252 | 0.9255
0.0413 | 7.0 | 5299 | 0.2727 | 0.9414 | 0.9440 | 0.9236 | 0.9235 | 0.9534 | 0.9541 | 0.8247 | 0.8244 | 0.8869 | 0.8890 | 0.9491 | 0.9495 | 0.9264 | 0.9275 | 0.9224 | 0.9227 | 0.8977 | 0.9057 | 0.9635 | 0.9642 | 0.9527 | 0.9523 | 0.9738 | 0.9739 | 0.9263 | 0.9276
0.0322 | 8.0 | 6056 | 0.2880 | 0.9389 | 0.9410 | 0.9198 | 0.9234 | 0.9544 | 0.9547 | 0.8142 | 0.8099 | 0.8872 | 0.8878 | 0.9522 | 0.9549 | 0.9274 | 0.9288 | 0.9208 | 0.9214 | 0.9027 | 0.9068 | 0.9632 | 0.9640 | 0.9534 | 0.9536 | 0.9745 | 0.9744 | 0.9257 | 0.9267
0.0256 | 9.0 | 6813 | 0.2953 | 0.9401 | 0.9392 | 0.9191 | 0.9182 | 0.9521 | 0.9516 | 0.8198 | 0.8196 | 0.8877 | 0.8869 | 0.9487 | 0.9498 | 0.9254 | 0.9283 | 0.9209 | 0.9207 | 0.8993 | 0.9019 | 0.9632 | 0.9632 | 0.9557 | 0.9554 | 0.9734 | 0.9729 | 0.9255 | 0.9256
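
The Step column can be cross-checked against the dataset size and hyperparameters given earlier in this card. The arithmetic below is a sketch under three assumptions: the record count of 30,268 is exact, the 80% training split is rounded down, and one step corresponds to one optimizer update after 2 accumulated micro-batches of 16 examples.

```python
import math

total_records = 30_268            # dataset size reported in this card
train_batch_size = 16             # per-device batch size
gradient_accumulation_steps = 2   # micro-batches per optimizer update

# Assumption: the 80/20 split rounds the training portion down.
train_records = total_records * 80 // 100

# Micro-batches per epoch, counting the final partial batch.
micro_batches = math.ceil(train_records / train_batch_size)

# Optimizer updates per epoch (assumed to be an integer division).
steps_per_epoch = max(micro_batches // gradient_accumulation_steps, 1)

print(train_records, micro_batches, steps_per_epoch)
for epoch in (1, 2, 9):
    print(epoch, steps_per_epoch * epoch)
```

Under these assumptions the result is 757 steps per epoch, which agrees with the logged steps of 757, 1514, and 6813 for epochs 1, 2, and 9.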

Framework versions

  • Transformers 4.18.0
  • PyTorch 2.5.1+cu121
  • Datasets 3.2.0
  • Tokenizers 0.12.1