5CD-AI/visobert-14gb-corpus

Overview

We continue pretraining uitnlp/visobert on a merged 14GB dataset. The training data includes:

Crawled data (100M comments and 15M posts from Facebook)
UIT data, the corpus originally used to pretrain uitnlp/visobert
MC4 e-commerce data

The table below reports results on 4 downstream tasks over Vietnamese social media text: Emotion Recognition (UIT-VSMEC), Hate Speech Detection (UIT-HSD), Spam Reviews Detection (ViSpamReviews), and Hate Speech Spans Detection (ViHOS). Acc is accuracy, WF1 is weighted F1, and MF1 is macro F1.

Model	Avg MF1	Emotion Recognition			Hate Speech Detection			Spam Reviews Detection			Hate Speech Spans Detection
Model	Avg MF1	Acc	WF1	MF1	Acc	WF1	MF1	Acc	WF1	MF1	Acc	WF1	MF1
viBERT	78.16	61.91	61.98	59.7	85.34	85.01	62.07	89.93	89.79	76.8	90.42	90.45	84.55
vELECTRA	79.23	64.79	64.71	61.95	86.96	86.37	63.95	89.83	89.68	76.23	90.59	90.58	85.12
PhoBERT-Base	79.3	63.49	63.36	61.41	87.12	86.81	65.01	89.83	89.75	76.18	91.32	91.38	85.92
PhoBERT-Large	79.82	64.71	64.66	62.55	87.32	86.98	65.14	90.12	90.03	76.88	91.44	91.46	86.56
ViSoBERT	81.58	68.1	68.37	65.88	88.51	88.31	68.77	90.99	90.92	79.06	91.62	91.57	86.8
visobert-14gb-corpus	82.2	68.69	68.75	66.03	88.79	88.6	69.57	91.02	90.88	77.13	93.69	93.63	89.66
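
The three per-task columns can be illustrated with a small, dependency-free sketch. It assumes that Acc, WF1, and MF1 denote accuracy, support-weighted F1, and macro-averaged F1. The function name `classification_scores` and the toy labels are hypothetical, and the card does not state how its authors computed the reported numbers. Scores are returned as fractions, whereas the table lists percentages.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def classification_scores(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable]
) -> Dict[str, float]:
    """Compute accuracy, weighted F1, and macro F1 for single-label predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    labels = set(y_true) | set(y_pred)
    support = Counter(y_true)  # number of gold examples per label
    pairs = list(zip(y_true, y_pred))

    f1_per_label = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs
        f1_per_label[label] = 2 * tp / denom if denom else 0.0

    n = len(pairs)
    acc = sum(1 for t, p in pairs if t == p) / n
    # Macro F1: every label counts equally, regardless of frequency
    mf1 = sum(f1_per_label.values()) / len(labels)
    # Weighted F1: each label's F1 is weighted by its share of gold examples
    wf1 = sum(f1_per_label[label] * support[label] for label in labels) / n
    return {"acc": acc, "wf1": wf1, "mf1": mf1}


# Toy example (illustrative labels, not taken from any dataset in the table)
scores = classification_scores([0, 0, 0, 1], [0, 0, 1, 1])
print(scores)
```

Macro F1 drops faster than weighted F1 when a rare class is predicted poorly, which is one reason imbalanced tasks such as hate speech detection tend to show a larger gap between the two columns.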

Usage (HuggingFace Transformers)

Install the transformers package:

pip install transformers

Then use the model for the fill-mask task:

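from transformers import pipeline

# Load a fill-mask pipeline backed by the pretrained checkpoint
model_path = "5CD-AI/visobert-14gb-corpus"
mask_filler = pipeline("fill-mask", model=model_path)

# Predict the 10 most likely tokens for the <mask> position
predictions = mask_filler("shop làm ăn như cái <mask>", top_k=10)
print(predictions)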
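
For a single masked position, the fill-mask pipeline returns a list of dictionaries with keys such as `score`, `token_str`, and `sequence`. The sketch below shows one way to reduce that list to the candidate tokens above a score threshold. The helper `top_tokens` and the sample entries are hypothetical placeholders, not real model output.

```python
from typing import Dict, List


def top_tokens(predictions: List[Dict], min_score: float = 0.0) -> List[str]:
    """Return candidate tokens sorted by descending score, dropping low scores."""
    kept = [p for p in predictions if p["score"] >= min_score]
    kept.sort(key=lambda p: p["score"], reverse=True)
    return [p["token_str"].strip() for p in kept]


# Placeholder entries shaped like fill-mask output (values are made up)
example_output = [
    {"score": 0.10, "token_str": "tok_b", "sequence": "... tok_b"},
    {"score": 0.55, "token_str": "tok_a", "sequence": "... tok_a"},
    {"score": 0.02, "token_str": "tok_c", "sequence": "... tok_c"},
]

print(top_tokens(example_output, min_score=0.05))
```

With a real pipeline, the return value of the `mask_filler(...)` call would be passed in place of `example_output`.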

Fine-tune Configuration

We fine-tune 5CD-AI/visobert-14gb-corpus on the 4 downstream tasks using the transformers library with the following shared configuration:

seed: 42
gradient_accumulation_steps: 1
weight_decay: 0.01
optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
training_epochs: 30
model_max_length: 128
learning_rate: 1e-5
metric_for_best_model: wf1
strategy: epoch

Each task also uses the following additional settings:

Task	train_batch_size	lr_scheduler_type
Emotion Recognition	64	linear
Hate Speech Detection	32	linear
Spam Reviews Detection	32	cosine
Hate Speech Spans Detection	32	cosine
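
The shared and per-task settings above can be combined into one keyword dictionary per task. The following is a minimal sketch, not the authors' training script. The key names follow `transformers.TrainingArguments` parameters. Mapping `train_batch_size` to `per_device_train_batch_size` assumes single-device training, and the task identifiers and the `build_config` helper are hypothetical.

```python
from typing import Any, Dict

# Shared settings, renamed to TrainingArguments parameter names
SHARED_CONFIG: Dict[str, Any] = {
    "seed": 42,
    "gradient_accumulation_steps": 1,
    "weight_decay": 0.01,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "num_train_epochs": 30,
    "learning_rate": 1e-5,
    "metric_for_best_model": "wf1",
}

# model_max_length is a tokenizer setting, so it is kept separate
MODEL_MAX_LENGTH = 128

# "strategy: epoch" most likely refers to the evaluation and save strategy;
# which TrainingArguments fields it maps to is an assumption, so it stays separate
STRATEGY = "epoch"

# Per-task overrides (assumes a single device for the batch size mapping)
TASK_CONFIG: Dict[str, Dict[str, Any]] = {
    "emotion_recognition": {"per_device_train_batch_size": 64, "lr_scheduler_type": "linear"},
    "hate_speech_detection": {"per_device_train_batch_size": 32, "lr_scheduler_type": "linear"},
    "spam_reviews_detection": {"per_device_train_batch_size": 32, "lr_scheduler_type": "cosine"},
    "hate_speech_spans_detection": {"per_device_train_batch_size": 32, "lr_scheduler_type": "cosine"},
}


def build_config(task: str) -> Dict[str, Any]:
    """Merge the shared settings with the overrides for one task."""
    if task not in TASK_CONFIG:
        raise KeyError(f"Unknown task: {task!r}")
    return {**SHARED_CONFIG, **TASK_CONFIG[task]}


config = build_config("emotion_recognition")
print(config)
```

The resulting dictionary could be unpacked into `TrainingArguments(output_dir=..., **config)`, where `output_dir` is a path of your choosing. A `compute_metrics` function that returns a `wf1` key is also needed for `metric_for_best_model` to take effect.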