ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining

ViHealthBERT is a strong baseline language model for Vietnamese in the healthcare domain.

We empirically investigate our model with different training strategies, achieving state-of-the-art (SOTA) performance on three downstream tasks: NER (COVID-19 & ViMQ), acronym disambiguation, and summarization.

We introduce two Vietnamese datasets in the healthcare domain: the acronym dataset (acrDrAid) and the FAQ summarization dataset. The acrDrAid dataset is annotated with 135 sets of keywords. The general approaches and experimental results of ViHealthBERT can be found in our LREC 2022 poster paper (to be updated soon):

@article{vihealthbert,
title     = {{ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining}},
author    = {Minh Phuc Nguyen and Vu Hoang Tran and Vu Hoang and Ta Duc Huy and Trung H. Bui and Steven Q. H. Truong},
journal   = {13th Edition of its Language Resources and Evaluation Conference},
year      = {2022}
}

Installation

Requires Python 3.6+ and PyTorch >= 1.6.
Install transformers:
pip install transformers==4.2.0

Pre-trained models

Model	#params	Arch.	Tokenizer
`demdecuong/vihealthbert-base-word`	135M	base	Word-level
`demdecuong/vihealthbert-base-syllable`	135M	base	Syllable-level
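
The two checkpoints differ in the tokenizer granularity listed in the table, so picking one can be reduced to a small lookup. The sketch below is illustrative only: `MODEL_IDS`, `resolve_model_id`, and `load_vihealthbert` are hypothetical helper names, not part of any released package. The `transformers` import is deferred so that the lookup itself runs without the library installed.

```python
# Hypothetical helper: maps the "Tokenizer" column of the table to a checkpoint name.
MODEL_IDS = {
    "word": "demdecuong/vihealthbert-base-word",
    "syllable": "demdecuong/vihealthbert-base-syllable",
}


def resolve_model_id(granularity):
    """Return the checkpoint name for 'word' or 'syllable' tokenization."""
    key = granularity.strip().lower()
    if key not in MODEL_IDS:
        raise ValueError(
            f"unknown granularity {granularity!r}; expected one of {sorted(MODEL_IDS)}"
        )
    return MODEL_IDS[key]


def load_vihealthbert(granularity="word"):
    """Load (model, tokenizer) for the chosen checkpoint; downloads weights on first use."""
    # Deferred import: only needed when a model is actually loaded.
    from transformers import AutoModel, AutoTokenizer

    model_id = resolve_model_id(granularity)
    return AutoModel.from_pretrained(model_id), AutoTokenizer.from_pretrained(model_id)
```

Calling `load_vihealthbert("word")` would then return the same pair of objects that the usage example constructs by hand.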

Example usage

import torch
from transformers import AutoModel, AutoTokenizer

vihealthbert = AutoModel.from_pretrained("demdecuong/vihealthbert-base-word")
tokenizer = AutoTokenizer.from_pretrained("demdecuong/vihealthbert-base-word")

# INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
line = "Tôi là sinh_viên trường đại_học Công_nghệ ."

input_ids = torch.tensor([tokenizer.encode(line)])
with torch.no_grad():
    features = vihealthbert(input_ids)  # features[0] is the last hidden state

Example usage for raw text

Since ViHealthBERT used the RDRSegmenter from VnCoreNLP to pre-process the pre-training data, we highly recommend using the same word segmenter for ViHealthBERT downstream applications.

Installation

# Install the vncorenlp python wrapper
pip3 install vncorenlp

# Download VnCoreNLP-1.1.1.jar & its word segmentation component (i.e. RDRSegmenter) 
mkdir -p vncorenlp/models/wordsegmenter
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/VnCoreNLP-1.1.1.jar
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/vi-vocab
wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/wordsegmenter.rdr
mv VnCoreNLP-1.1.1.jar vncorenlp/ 
mv vi-vocab vncorenlp/models/wordsegmenter/
mv wordsegmenter.rdr vncorenlp/models/wordsegmenter/

VnCoreNLP-1.1.1.jar (27MB) and the models/ folder must be placed in the same working folder.
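
Before starting the segmenter, it can help to confirm that the download commands produced the expected layout. The following is a minimal sketch using only the standard library. `EXPECTED_FILES` and `missing_vncorenlp_files` are hypothetical names introduced here, and the paths assume the `vncorenlp/` folder created by the commands above.

```python
from pathlib import Path

# Paths relative to the vncorenlp/ folder, as created by the shell commands above.
EXPECTED_FILES = (
    "VnCoreNLP-1.1.1.jar",
    "models/wordsegmenter/vi-vocab",
    "models/wordsegmenter/wordsegmenter.rdr",
)


def missing_vncorenlp_files(root):
    """Return the expected files that are absent under `root`, in a fixed order."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_vncorenlp_files("vncorenlp")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("VnCoreNLP layout looks complete.")
```

An empty list means the jar and both word-segmenter files are in place.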

Example usage

# See more details at: https://github.com/vncorenlp/VnCoreNLP

# Load rdrsegmenter from VnCoreNLP
from vncorenlp import VnCoreNLP
rdrsegmenter = VnCoreNLP("/Absolute-path-to/vncorenlp/VnCoreNLP-1.1.1.jar", annotators="wseg", max_heap_size='-Xmx500m') 

# Input 
text = "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."

# To perform word (and sentence) segmentation
sentences = rdrsegmenter.tokenize(text) 
for sentence in sentences:
    print(" ".join(sentence))
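
`rdrsegmenter.tokenize` returns one token list per sentence, while the word-level checkpoint expects one space-separated string per sentence. A minimal sketch of that glue step follows. `to_model_lines` is a hypothetical helper name. The sample tokens are copied from the word-segmented example sentence earlier in this README; they were not produced by running the segmenter.

```python
def to_model_lines(sentences):
    """Join each non-empty token list into one space-separated, word-segmented line."""
    return [" ".join(tokens) for tokens in sentences if tokens]


# Hand-written sample in the shape returned by rdrsegmenter.tokenize(text):
# a list of sentences, each a list of tokens with "_" joining multi-syllable words.
sample = [["Tôi", "là", "sinh_viên", "trường", "đại_học", "Công_nghệ", "."]]
lines = to_model_lines(sample)
print(lines[0])  # Tôi là sinh_viên trường đại_học Công_nghệ .

# Each line can then be encoded as in the first usage example, e.g.
#   input_ids = torch.tensor([tokenizer.encode(lines[0])])
```

Encoding each sentence separately keeps it as one input sequence.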