---
library_name: transformers
tags: []
pipeline_tag: text2text-generation
widget:
- text: "Dành cho hàng th iết khi mua xe tay ga và Super Cub (khách hàng mua xe 1/2017). 🍓 Mua góp lã ất dẫn c từ 🍓 Mua góp nhận vẹt gốc"
  example_title: Example 1
---

# 5CD-AI/visocial-T5-base

## Overview

We trimmed the vocabulary to 50,589 tokens and continually pretrained `google/mt5-base` [1] on a merged 20GB dataset. The training data includes:
- Crawled data (100M comments and 15M posts from Facebook)
- The UIT data [2] used to pretrain `uitnlp/visobert` [2]
- The e-commerce portion of MC4
- 10.7M comments on the VOZ Forum, from `tarudesu/VOZ-HSD` [7]
- 3.6M Amazon reviews [3] translated into Vietnamese, from `5CD-AI/Vietnamese-amazon_polarity-gg-translated`

Here are the results on 3 downstream tasks on Vietnamese social media texts: Hate Speech Detection (UIT-HSD), Toxic Speech Detection (ViCTSD), and Hate Spans Detection (ViHOS).
<table>
  <thead>
    <tr>
      <th rowspan="2">Model</th>
      <th rowspan="2">Average MF1</th>
      <th colspan="3">Hate Speech Detection</th>
      <th colspan="3">Toxic Speech Detection</th>
      <th colspan="3">Hate Spans Detection</th>
    </tr>
    <tr>
      <th>Acc</th><th>WF1</th><th>MF1</th>
      <th>Acc</th><th>WF1</th><th>MF1</th>
      <th>Acc</th><th>WF1</th><th>MF1</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>PhoBERT [4]</td><td>69.63</td><td>86.75</td><td>86.52</td><td>64.76</td><td>90.78</td><td>90.27</td><td>71.31</td><td>84.65</td><td>81.12</td><td>72.81</td></tr>
    <tr><td>PhoBERT_v2 [4]</td><td>70.50</td><td>87.42</td><td>87.33</td><td>66.60</td><td>90.23</td><td>89.78</td><td>71.39</td><td>84.92</td><td>81.51</td><td>73.51</td></tr>
    <tr><td>viBERT [5]</td><td>67.80</td><td>86.33</td><td>85.79</td><td>62.85</td><td>88.81</td><td>88.17</td><td>67.65</td><td>84.63</td><td>81.28</td><td>72.91</td></tr>
    <tr><td>ViSoBERT [6]</td><td>75.07</td><td>88.17</td><td>87.86</td><td>67.71</td><td>90.35</td><td>90.16</td><td>71.45</td><td>90.16</td><td>90.07</td><td>86.04</td></tr>
    <tr><td>ViHateT5 [7]</td><td>75.56</td><td>88.76</td><td>89.14</td><td>68.67</td><td>90.80</td><td>91.78</td><td>71.63</td><td>91.00</td><td>90.20</td><td>86.37</td></tr>
    <tr><td>visocial-T5-base (Ours)</td><td>78.01</td><td>89.51</td><td>89.78</td><td>71.19</td><td>92.20</td><td>93.47</td><td>73.81</td><td>92.57</td><td>92.20</td><td>89.04</td></tr>
  </tbody>
</table>

(Acc: accuracy; WF1: weighted F1-score; MF1: macro F1-score.)
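Acc, WF1, and MF1 are the standard accuracy, weighted F1, and macro F1 scores. As a quick illustrative reference (the labels below are invented for demonstration, not taken from any of the benchmarks), they can be computed with scikit-learn as follows:

```python
# Illustrative computation of Acc, WF1, and MF1 with scikit-learn.
# The label arrays are made up for demonstration only.
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 1, 0, 2]  # e.g. class ids for CLEAN / OFFENSIVE / HATE
y_pred = [0, 1, 1, 1, 0, 2]

acc = accuracy_score(y_true, y_pred)
wf1 = f1_score(y_true, y_pred, average="weighted")  # WF1
mf1 = f1_score(y_true, y_pred, average="macro")     # MF1
print(f"Acc={acc:.2%}  WF1={wf1:.2%}  MF1={mf1:.2%}")
```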
visocial-T5-base versus other T5-based models on the same Vietnamese HSD-related tasks, compared by Macro F1-score (MF1):
| Model | Hate Speech Detection | Toxic Speech Detection | Hate Spans Detection |
|---|---|---|---|
| mT5 [1] | 66.76 | 69.93 | 86.60 |
| ViT5 [8] | 66.95 | 64.82 | 86.90 |
| ViHateT5 [7] | 68.67 | 71.63 | 86.37 |
| visocial-T5-base (Ours) | 71.90 | 73.81 | 89.04 |
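To try the pretrained checkpoint directly, here is a minimal loading-and-inference sketch using the standard `transformers` seq2seq API. Since this is a continually pretrained (not instruction-tuned) model, the example uses mT5-style span infilling and assumes the `<extra_id_*>` sentinel tokens survived the vocabulary trimming; the input sentence is illustrative:

```python
# Minimal inference sketch for the pretrained checkpoint (not a fine-tuned task model).
# Assumption: mT5 sentinel tokens such as <extra_id_0> remain in the trimmed vocabulary.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "5CD-AI/visocial-T5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Span-corruption input: the model should generate text for the masked span.
text = "Hôm nay trời <extra_id_0> quá, đi chơi thôi!"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```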
## Fine-tuning Configuration

We fine-tuned `5CD-AI/visocial-T5-base` on the 3 downstream tasks with the `transformers` library, using the following configuration (see the `Seq2SeqTrainer` sketch after this list):

- seed: 42
- training_epochs: 4
- train_batch_size: 4
- gradient_accumulation_steps: 8
- learning_rate: 3e-4
- lr_scheduler_type: linear
- model_max_length: 256
- metric_for_best_model: eval_loss
- evaluation_strategy: steps
- eval_steps: 0.1
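A minimal sketch of how this configuration maps onto the `Seq2SeqTrainer` API. The output path, task prefix, and tiny in-memory dataset are placeholders for illustration, not the authors' actual pipeline:

```python
# Illustrative mapping of the fine-tuning configuration above onto Seq2SeqTrainer.
# Dataset, task prefix, and output_dir are placeholders, not the original setup.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_id = "5CD-AI/visocial-T5-base"
tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=256)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Placeholder text-to-text data; a real run would use e.g. UIT-HSD examples.
raw = Dataset.from_dict({
    "input": ["hate-speech-detection: bình luận ví dụ"],
    "target": ["CLEAN"],
})

def preprocess(batch):
    enc = tokenizer(batch["input"], truncation=True, max_length=256)
    enc["labels"] = tokenizer(
        text_target=batch["target"], truncation=True, max_length=256
    )["input_ids"]
    return enc

tokenized = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="visocial-t5-finetuned",  # hypothetical path
    seed=42,
    num_train_epochs=4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=3e-4,
    lr_scheduler_type="linear",
    evaluation_strategy="steps",  # renamed to eval_strategy in newer releases
    eval_steps=0.1,               # fraction of total training steps
    save_strategy="steps",
    save_steps=0.1,               # aligned with eval_steps for best-model tracking
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    eval_dataset=tokenized,  # placeholder; use a held-out split in practice
    tokenizer=tokenizer,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```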
## References

[1] [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934)
[2] [ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing](https://aclanthology.org/2023.emnlp-main.315/)
[3] [The Amazon Polarity dataset](https://paperswithcode.com/dataset/amazon-polarity-1)
[4] [PhoBERT: Pre-trained language models for Vietnamese](https://aclanthology.org/2020.findings-emnlp.92/)
[5] [Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models](https://arxiv.org/abs/2006.15994)
[6] [ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing](https://aclanthology.org/2023.emnlp-main.315/)
[7] [ViHateT5: Enhancing Hate Speech Detection in Vietnamese With A Unified Text-to-Text Transformer Model](https://arxiv.org/abs/2405.14141)
[8] [ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation](https://aclanthology.org/2022.naacl-srw.18/)