5CD-AI/visobert-14gb-corpus
Overview
We continually pretrain uitnlp/visobert on a merged 14GB dataset. The training data includes:
- Crawled data (100M comments and 15M posts on Facebook)
- UIT data, which was used to pretrain uitnlp/visobert
- MC4 ecommerce
Here are the results on 4 downstream tasks for Vietnamese social media text: Emotion Recognition (UIT-VSMEC), Hate Speech Detection (UIT-HSD), Spam Reviews Detection (ViSpamReviews), and Hate Speech Spans Detection (ViHOS). Acc is accuracy, WF1 is weighted F1, and MF1 is macro F1.
Model | Avg MF1 | Emotion Recognition Acc | Emotion Recognition WF1 | Emotion Recognition MF1 | Hate Speech Detection Acc | Hate Speech Detection WF1 | Hate Speech Detection MF1 | Spam Reviews Detection Acc | Spam Reviews Detection WF1 | Spam Reviews Detection MF1 | Hate Speech Spans Detection Acc | Hate Speech Spans Detection WF1 | Hate Speech Spans Detection MF1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
viBERT | 78.16 | 61.91 | 61.98 | 59.7 | 85.34 | 85.01 | 62.07 | 89.93 | 89.79 | 76.8 | 90.42 | 90.45 | 84.55 |
vELECTRA | 79.23 | 64.79 | 64.71 | 61.95 | 86.96 | 86.37 | 63.95 | 89.83 | 89.68 | 76.23 | 90.59 | 90.58 | 85.12 |
PhoBERT-Base | 79.3 | 63.49 | 63.36 | 61.41 | 87.12 | 86.81 | 65.01 | 89.83 | 89.75 | 76.18 | 91.32 | 91.38 | 85.92 |
PhoBERT-Large | 79.82 | 64.71 | 64.66 | 62.55 | 87.32 | 86.98 | 65.14 | 90.12 | 90.03 | 76.88 | 91.44 | 91.46 | 86.56 |
ViSoBERT | 81.58 | 68.1 | 68.37 | 65.88 | 88.51 | 88.31 | 68.77 | 90.99 | 90.92 | 79.06 | 91.62 | 91.57 | 86.8 |
visobert-14gb-corpus | 82.2 | 68.69 | 68.75 | 66.03 | 88.79 | 88.6 | 69.57 | 91.02 | 90.88 | 77.13 | 93.69 | 93.63 | 89.66 |
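The Avg MF1 column can be cross-checked against the other columns. In this table it seems to match the mean of all twelve per-task scores in a row, rather than the mean of the four MF1 values alone. That reading is an inference from the numbers, not something the model card states. A minimal sketch of the check, with the helper names `mean_of_all_scores` and `mean_of_mf1` being illustrative:

```python
# Scores copied from the results table. Each model has the reported average
# followed by 12 per-task values: Acc, WF1, MF1 for each of the 4 tasks.
rows = {
    "viBERT": (78.16, [61.91, 61.98, 59.7, 85.34, 85.01, 62.07,
                       89.93, 89.79, 76.8, 90.42, 90.45, 84.55]),
    "vELECTRA": (79.23, [64.79, 64.71, 61.95, 86.96, 86.37, 63.95,
                         89.83, 89.68, 76.23, 90.59, 90.58, 85.12]),
    "PhoBERT-Base": (79.3, [63.49, 63.36, 61.41, 87.12, 86.81, 65.01,
                            89.83, 89.75, 76.18, 91.32, 91.38, 85.92]),
    "PhoBERT-Large": (79.82, [64.71, 64.66, 62.55, 87.32, 86.98, 65.14,
                              90.12, 90.03, 76.88, 91.44, 91.46, 86.56]),
    "ViSoBERT": (81.58, [68.1, 68.37, 65.88, 88.51, 88.31, 68.77,
                         90.99, 90.92, 79.06, 91.62, 91.57, 86.8]),
    "visobert-14gb-corpus": (82.2, [68.69, 68.75, 66.03, 88.79, 88.6, 69.57,
                                    91.02, 90.88, 77.13, 93.69, 93.63, 89.66]),
}


def mean_of_all_scores(scores):
    """Mean over all 12 per-task scores (Acc, WF1, MF1 for each task)."""
    return sum(scores) / len(scores)


def mean_of_mf1(scores):
    """Mean over the 4 MF1 scores only (every third value, starting at index 2)."""
    mf1 = scores[2::3]
    return sum(mf1) / len(mf1)


for model, (reported, scores) in rows.items():
    print(f"{model}: reported={reported}, "
          f"all-scores mean={mean_of_all_scores(scores):.2f}, "
          f"MF1-only mean={mean_of_mf1(scores):.2f}")
```

For every row, the all-scores mean agrees with the reported average to within rounding, while the MF1-only mean is several points lower.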
Usage (HuggingFace Transformers)
Install the transformers package:
pip install transformers
Then you can use this model for the fill-mask task like this:
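from transformers import pipeline
# Load the model from the Hugging Face Hub into a fill-mask pipeline
model_path = "5CD-AI/visobert-14gb-corpus"
mask_filler = pipeline("fill-mask", model=model_path)
# Return the 10 most likely replacements for <mask>
mask_filler("shop làm ăn như cái <mask>", top_k=10)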
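For an input with a single mask, the fill-mask pipeline returns a list of dicts with the keys `score`, `token`, `token_str`, and `sequence`. The sketch below shows one way to reduce that list to readable (token, score) pairs. The helper `format_predictions` and its `min_score` parameter are hypothetical, and the sample list is placeholder data shaped like pipeline output, not real predictions from this model.

```python
def format_predictions(predictions, min_score=0.0):
    """Return (token, score) pairs sorted by descending score.

    `predictions` is expected to look like fill-mask pipeline output for a
    single-mask input: a list of dicts with "token_str" and "score" keys.
    Entries scoring below `min_score` are dropped.
    """
    kept = [
        (p["token_str"].strip(), round(p["score"], 4))
        for p in predictions
        if p["score"] >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Placeholder data only: NOT real output from 5CD-AI/visobert-14gb-corpus.
fake_predictions = [
    {"score": 0.10, "token": 2, "token_str": "token_b", "sequence": "..."},
    {"score": 0.50, "token": 1, "token_str": "token_a", "sequence": "..."},
    {"score": 0.01, "token": 3, "token_str": "token_c", "sequence": "..."},
]

print(format_predictions(fake_predictions, min_score=0.05))
# prints [('token_a', 0.5), ('token_b', 0.1)]
```

With the real pipeline, the same helper could be applied as `format_predictions(mask_filler("shop làm ăn như cái <mask>", top_k=10))`.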
Fine-tune Configuration
We fine-tune 5CD-AI/visobert-14gb-corpus on the 4 downstream tasks using the transformers library with the following configuration:
- seed: 42
- gradient_accumulation_steps: 1
- weight_decay: 0.01
- optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- training_epochs: 30
- model_max_length: 128
- learning_rate: 1e-5
- metric_for_best_model: wf1
- strategy: epoch
Each task also uses its own additional configuration:
Emotion Recognition | Hate Speech Detection | Spam Reviews Detection | Hate Speech Spans Detection |
---|---|---|---|
train_batch_size: 64, lr_scheduler_type: linear | train_batch_size: 32, lr_scheduler_type: linear | train_batch_size: 32, lr_scheduler_type: cosine | train_batch_size: 32, lr_scheduler_type: cosine |
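The shared and per-task settings above could be merged into keyword arguments for `transformers.TrainingArguments` roughly as follows. This is a sketch under assumptions, not the authors' training script: the task keys and the `training_kwargs` helper are hypothetical, `train_batch_size` is assumed to mean `per_device_train_batch_size` on a single device, and "strategy: epoch" is assumed to apply to checkpoint saving (and likely evaluation as well).

```python
# Shared settings, keyed by TrainingArguments parameter names.
COMMON = {
    "seed": 42,
    "gradient_accumulation_steps": 1,
    "weight_decay": 0.01,
    "adam_beta1": 0.9,       # AdamW betas=(0.9, 0.999)
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "num_train_epochs": 30,
    "learning_rate": 1e-5,
    "metric_for_best_model": "wf1",
    "save_strategy": "epoch",  # assumption: "strategy: epoch" covers saving
}

# Per-task overrides from the table (task keys are hypothetical names).
PER_TASK = {
    "emotion_recognition": {
        "per_device_train_batch_size": 64, "lr_scheduler_type": "linear"},
    "hate_speech_detection": {
        "per_device_train_batch_size": 32, "lr_scheduler_type": "linear"},
    "spam_reviews_detection": {
        "per_device_train_batch_size": 32, "lr_scheduler_type": "cosine"},
    "hate_speech_spans_detection": {
        "per_device_train_batch_size": 32, "lr_scheduler_type": "cosine"},
}

# model_max_length is a tokenizer setting, not a TrainingArguments field.
MODEL_MAX_LENGTH = 128


def training_kwargs(task):
    """Merge the shared settings with the overrides for one task."""
    if task not in PER_TASK:
        raise KeyError(f"unknown task: {task!r}")
    return {**COMMON, **PER_TASK[task]}


print(training_kwargs("emotion_recognition"))
```

The result could then be passed as `TrainingArguments(output_dir="out", **training_kwargs("emotion_recognition"))`, where `output_dir` is a placeholder. The per-epoch evaluation setting is left out because its parameter name depends on the installed version (`eval_strategy` in recent releases, `evaluation_strategy` in older ones), and `metric_for_best_model="wf1"` only has an effect if the metrics function reports a value under that name.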