---
license: mit
datasets:
  - scb_mt_enth_2020
  - oscar
  - wikipedia
  - best2009
language:
  - th
library_name: transformers
---

HoogBERTa

This repository contains the Thai pretrained language representation model (HoogBERTa_base) and the model fine-tuned for multi-task sequence labeling.

Documentation

Prerequisite

Since we use subword-nmt BPE encoding, the input must be pre-tokenized following the BEST standard (here via AttaCut) before being fed into HoogBERTa:

pip install attacut
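
A quick way to check the pre-tokenization step is to run AttaCut's word segmenter directly; this is a minimal illustrative snippet (the exact segmentation may vary with the AttaCut version):

from attacut import tokenize

print(tokenize("ฉันจะไปเที่ยววัดพระแก้ว"))  # e.g. ['ฉัน', 'จะ', 'ไป', 'เที่ยว', 'วัด', 'พระแก้ว']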

Getting Started

To initialize the model from the Hugging Face Hub, use the following commands:

import torch
from transformers import AutoTokenizer, AutoModel
from attacut import tokenize

tokenizer = AutoTokenizer.from_pretrained("new5558/HoogBERTa")
model = AutoModel.from_pretrained("new5558/HoogBERTa")

To annotate POS tags, named entities, and clause boundaries, use the model fine-tuned for multi-task sequence labeling (a hedged sketch follows below).

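A minimal sketch of how such tagging could be wired up with transformers, assuming a fine-tuned token-classification checkpoint is available. The checkpoint name "new5558/HoogBERTa-NER-lst20" is only a hypothetical placeholder; substitute the actual fine-tuned HoogBERTa tagger you intend to use.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
from attacut import tokenize

# Hypothetical checkpoint name, used here for illustration only.
tagger_name = "new5558/HoogBERTa-NER-lst20"
tagger_tokenizer = AutoTokenizer.from_pretrained(tagger_name)
tagger_model = AutoModelForTokenClassification.from_pretrained(tagger_name)
tagger = pipeline("token-classification", model=tagger_model, tokenizer=tagger_tokenizer)

sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"
# Pre-tokenize with AttaCut and mark spaces, exactly as in the feature-extraction example below.
prepared = " _ ".join(" ".join(tokenize(s)).replace("_", "[!und:]") for s in sentence.split(" "))
print(tagger(prepared))
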

To extract token features based on the RoBERTa architecture, use the following commands:

with torch.no_grad():
    model.eval()
    sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"
    all_sent = []
    sentences = sentence.split(" ")
    for sent in sentences:
        # Word-segment each chunk with AttaCut and escape literal underscores,
        # since "_" is reserved below as the explicit space marker.
        all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))

    # Re-join the chunks with the special "_" token marking the original spaces.
    sentence = " _ ".join(all_sent)
    tokenized_text = tokenizer(sentence, return_tensors='pt')
    token_ids = tokenized_text['input_ids']
    features = model(**tokenized_text)
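
The returned features is a standard transformers model output; if you need the per-token embeddings as a tensor, they can be read from last_hidden_state (a short illustrative addition):

last_hidden = features.last_hidden_state  # shape: (batch_size, sequence_length, hidden_size)
print(last_hidden.shape)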

For batch processing, use the following commands:

with torch.no_grad():
    model.eval()
    sentenceL = ["วันที่ 12 มีนาคมนี้","ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
    inputList = []
    for sentX in sentenceL:
        sentences = sentX.split(" ")
        all_sent = []
        for sent in sentences:
            all_sent.append(" ".join(tokenize(sent)).replace("_","[!und:]"))

        sentence = " _ ".join(all_sent)
        inputList.append(sentence)
    tokenized_text = tokenizer(inputList, padding = True, return_tensors = 'pt')
    token_ids = tokenized_text['input_ids']
    features = model(**tokenized_text)
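
When batching, padded positions should typically be excluded from any pooling over the token features. A minimal mean-pooling sketch using the attention mask (an illustrative addition under that assumption):

mask = tokenized_text['attention_mask'].unsqueeze(-1).float()  # (batch, seq_len, 1)
summed = (features.last_hidden_state * mask).sum(dim=1)
sentence_embeddings = summed / mask.sum(dim=1)                 # (batch, hidden_size)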

To use HoogBERTa as an embedding layer, use:

with torch.no_grad():
    features = model(token_ids)  # where token_ids is a tensor of type long, e.g. tokenized_text['input_ids'] from above
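
If you want to treat the encoder as a frozen embedding layer inside a larger model, a minimal sketch could look like the following (an illustrative wrapper, assuming the token features from last_hidden_state are what you want to expose):

import torch
import torch.nn as nn

class HoogBERTaEmbedding(nn.Module):
    """Wraps the pretrained encoder and exposes its token features as embeddings."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep the pretrained weights frozen

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state  # (batch, seq_len, hidden_size)

embedder = HoogBERTaEmbedding(model)
with torch.no_grad():
    embeddings = embedder(token_ids)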

Citation

Please cite as:

@inproceedings{porkaew2021hoogberta,
  title = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
  author = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
  booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
  year = {2021},
  address = {Online}
}

Download full-text PDF
Check out the code on GitHub