5CD-AI/visobert-14gb-corpus
Overview
We continually pretrain uitnlp/visobert on a merged 14GB dataset. The training data includes:
- Crawled data (100M comments and 15M posts on Facebook)
- UIT data, which was used to pretrain uitnlp/visobert
- MC4 ecommerce
Results on 4 downstream tasks on Vietnamese social media text: Emotion Recognition (UIT-VSMEC), Hate Speech Detection (UIT-HSD), Spam Reviews Detection (ViSpamReviews), and Hate Speech Spans Detection (ViHOS):
Model | Avg MF1 | Emotion Recognition Acc | Emotion Recognition WF1 | Emotion Recognition MF1 | Hate Speech Detection Acc | Hate Speech Detection WF1 | Hate Speech Detection MF1 | Spam Reviews Detection Acc | Spam Reviews Detection WF1 | Spam Reviews Detection MF1 | Hate Speech Spans Detection Acc | Hate Speech Spans Detection WF1 | Hate Speech Spans Detection MF1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
viBERT | 78.16 | 61.91 | 61.98 | 59.7 | 85.34 | 85.01 | 62.07 | 89.93 | 89.79 | 76.8 | 90.42 | 90.45 | 84.55 |
vELECTRA | 79.23 | 64.79 | 64.71 | 61.95 | 86.96 | 86.37 | 63.95 | 89.83 | 89.68 | 76.23 | 90.59 | 90.58 | 85.12 |
PhoBERT-Base | 79.3 | 63.49 | 63.36 | 61.41 | 87.12 | 86.81 | 65.01 | 89.83 | 89.75 | 76.18 | 91.32 | 91.38 | 85.92 |
PhoBERT-Large | 79.82 | 64.71 | 64.66 | 62.55 | 87.32 | 86.98 | 65.14 | 90.12 | 90.03 | 76.88 | 91.44 | 91.46 | 86.56 |
ViSoBERT | 81.58 | 68.1 | 68.37 | 65.88 | 88.51 | 88.31 | 68.77 | 90.99 | 90.92 | 79.06 | 91.62 | 91.57 | 86.8 |
visobert-14gb-corpus | 82.2 | 68.69 | 68.75 | 66.03 | 88.79 | 88.6 | 69.57 | 91.02 | 90.88 | 77.13 | 93.69 | 93.63 | 89.66 |
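One detail worth checking when reading the table: the values in the Avg MF1 column appear to match the mean of all twelve per-task scores in a row (Acc, WF1, and MF1 for each of the 4 tasks), not the mean of the four MF1 values alone. The sketch below recomputes both means from the rows above so the two can be compared. The names `ROWS`, `mean_of_all_scores`, and `mean_of_mf1` are introduced here for illustration only.

```python
# Scores copied from the results table.
# Each entry: (reported average, [Acc, WF1, MF1] for each of the 4 tasks in order).
ROWS = {
    "viBERT": (78.16, [61.91, 61.98, 59.7, 85.34, 85.01, 62.07,
                       89.93, 89.79, 76.8, 90.42, 90.45, 84.55]),
    "vELECTRA": (79.23, [64.79, 64.71, 61.95, 86.96, 86.37, 63.95,
                         89.83, 89.68, 76.23, 90.59, 90.58, 85.12]),
    "PhoBERT-Base": (79.3, [63.49, 63.36, 61.41, 87.12, 86.81, 65.01,
                            89.83, 89.75, 76.18, 91.32, 91.38, 85.92]),
    "PhoBERT-Large": (79.82, [64.71, 64.66, 62.55, 87.32, 86.98, 65.14,
                              90.12, 90.03, 76.88, 91.44, 91.46, 86.56]),
    "ViSoBERT": (81.58, [68.1, 68.37, 65.88, 88.51, 88.31, 68.77,
                         90.99, 90.92, 79.06, 91.62, 91.57, 86.8]),
    "visobert-14gb-corpus": (82.2, [68.69, 68.75, 66.03, 88.79, 88.6, 69.57,
                                    91.02, 90.88, 77.13, 93.69, 93.63, 89.66]),
}


def mean_of_all_scores(scores):
    """Mean over all twelve reported scores in a row."""
    return sum(scores) / len(scores)


def mean_of_mf1(scores):
    """Mean over the MF1 values only (every third score, starting at index 2)."""
    mf1_values = scores[2::3]
    return sum(mf1_values) / len(mf1_values)


for model, (reported, scores) in ROWS.items():
    print(
        f"{model}: reported={reported}, "
        f"mean of 12 scores={mean_of_all_scores(scores):.2f}, "
        f"mean of 4 MF1={mean_of_mf1(scores):.2f}"
    )
```

If the comparison across models should be based on macro F1 only, the `mean_of_mf1` value is the one to use.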
Usage (HuggingFace Transformers)
Install the transformers package:
pip install transformers
Then use the model for the fill-mask task like this:
from transformers import pipeline
model_path = "5CD-AI/visobert-14gb-corpus"
mask_filler = pipeline("fill-mask", model_path)
mask_filler("shop làm ăn như cái <mask>", top_k=10)
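For an input with a single `<mask>`, the fill-mask pipeline returns a list of dictionaries, one per candidate, with `score`, `token`, `token_str`, and `sequence` keys. A minimal sketch of reducing that output to token and probability pairs follows. `top_tokens` and `fill_mask` are hypothetical helper names, and the model is loaded inside the function so nothing is downloaded until it is called.

```python
def top_tokens(predictions):
    """Reduce fill-mask output to (token, score) pairs, highest score first.

    Assumes the input text contained exactly one <mask>, so `predictions`
    is a flat list of dictionaries.
    """
    pairs = [(p["token_str"].strip(), round(p["score"], 4)) for p in predictions]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def fill_mask(text, top_k=10, model_path="5CD-AI/visobert-14gb-corpus"):
    """Run the fill-mask pipeline on `text` and return (token, score) pairs."""
    # Imported lazily: requires the transformers package and a backend such as PyTorch.
    from transformers import pipeline

    mask_filler = pipeline("fill-mask", model=model_path)
    return top_tokens(mask_filler(text, top_k=top_k))
```

Calling `fill_mask("shop làm ăn như cái <mask>")` would then return up to ten candidate tokens with their scores.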
Fine-tune Configuration
We fine-tune 5CD-AI/visobert-14gb-corpus on the 4 downstream tasks with the transformers library, using the following configuration:
- seed: 42
- gradient_accumulation_steps: 1
- weight_decay: 0.01
- optimizer: AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- training_epochs: 30
- model_max_length: 128
- learning_rate: 1e-5
- metric_for_best_model: wf1
- strategy: epoch
Each task also uses its own additional configuration:
Setting | Emotion Recognition | Hate Speech Detection | Spam Reviews Detection | Hate Speech Spans Detection |
---|---|---|---|---|
train_batch_size | 64 | 32 | 32 | 32 |
lr_scheduler_type | linear | linear | cosine | cosine |
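As a rough illustration, the shared and per-task settings can be merged into keyword arguments for `transformers.TrainingArguments`. This is a sketch under stated assumptions, not the authors' training script: it assumes `train_batch_size` means the per-device batch size and that `strategy: epoch` refers to checkpoint saving. The names `SHARED`, `PER_TASK`, and `training_kwargs` are hypothetical.

```python
# Shared settings from the configuration list, keyed by TrainingArguments parameter names.
SHARED = {
    "seed": 42,
    "gradient_accumulation_steps": 1,
    "weight_decay": 0.01,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "num_train_epochs": 30,
    "learning_rate": 1e-5,
    "metric_for_best_model": "wf1",
    # Assumption: "strategy: epoch" covers checkpoint saving. The matching
    # evaluation-strategy argument is named differently across transformers versions.
    "save_strategy": "epoch",
}

# Per-task settings. Assumption: train_batch_size is the per-device batch size.
PER_TASK = {
    "emotion_recognition": {"per_device_train_batch_size": 64, "lr_scheduler_type": "linear"},
    "hate_speech_detection": {"per_device_train_batch_size": 32, "lr_scheduler_type": "linear"},
    "spam_reviews_detection": {"per_device_train_batch_size": 32, "lr_scheduler_type": "cosine"},
    "hate_speech_spans_detection": {"per_device_train_batch_size": 32, "lr_scheduler_type": "cosine"},
}


def training_kwargs(task):
    """Merge the shared settings with the settings for one task."""
    return {**SHARED, **PER_TASK[task]}
```

The result could be passed as `TrainingArguments(output_dir=..., **training_kwargs("emotion_recognition"))`. The `model_max_length: 128` setting belongs to the tokenizer instead, for example `AutoTokenizer.from_pretrained(model_path, model_max_length=128)`.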