HoogBERTa

This repository contains the Thai pretrained language representation (HoogBERTa_base), fine-tuned for the Sentence Boundary Classification task.

Documentation

Prerequisite

Because HoogBERTa uses subword-nmt BPE encoding, input text must be pre-tokenized according to the BEST standard before it is passed to the model. The examples below use AttaCut for this step:

pip install attacut

Getting Started

To initialize the model from the Hugging Face Hub, use the following commands:

from transformers import RobertaTokenizerFast, RobertaForTokenClassification
from attacut import tokenize
import torch

tokenizer = RobertaTokenizerFast.from_pretrained("lst-nectec/HoogBERTa-SENTENCE-lst20")
model = RobertaForTokenClassification.from_pretrained("lst-nectec/HoogBERTa-SENTENCE-lst20")

To run Sentence Boundary Classification, use the following commands:

from transformers import pipeline

nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")

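sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"

# Word-segment each space-delimited chunk with AttaCut and escape literal underscores.
all_sent = []
sentences = sentence.split(" ")
for sent in sentences:
    all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))

# Rejoin the chunks, marking each original space with " _ ".
sentence = " _ ".join(all_sent)

print(nlp(sentence))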
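
The preprocessing above can be wrapped in a small helper so that the single-text and batch examples share one code path. This is a minimal sketch, not part of the HoogBERTa repository: the name `prepare_input` is hypothetical, and the word segmenter is passed in as a parameter so the helper can be tried without AttaCut installed. The `fake_tokenize` function is a stand-in for demonstration only; in real use you would pass `attacut.tokenize`.

```python
from typing import Callable, List


def prepare_input(text: str, word_tokenize: Callable[[str], List[str]]) -> str:
    """Convert raw text into the space-separated form used in the example above.

    `word_tokenize` is any word segmenter that maps a string to a list of
    words (for example `attacut.tokenize`).
    """
    chunks = []
    for chunk in text.split(" "):
        words = word_tokenize(chunk)
        # Escape literal underscores so they cannot be confused with the
        # " _ " marker inserted between chunks below.
        chunks.append(" ".join(words).replace("_", "[!und:]"))
    # Each original space becomes a standalone "_" token.
    return " _ ".join(chunks)


# Stand-in segmenter (hypothetical, NOT AttaCut): two-character pieces.
def fake_tokenize(chunk: str) -> List[str]:
    return [chunk[i:i + 2] for i in range(0, len(chunk), 2)]


example = prepare_input("abcd e_f", fake_tokenize)
print(example)
```

With AttaCut installed, `nlp(prepare_input(sentence, tokenize))` should be equivalent to the example above.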

For batch processing:

from transformers import pipeline

nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")

sentenceL = ["วันที่ 12 มีนาคมนี้", "ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
inputList = []
for sentX in sentenceL:
    # Apply the same preprocessing as above to each input text.
    sentences = sentX.split(" ")
    all_sent = []
    for sent in sentences:
        all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))

    sentence = " _ ".join(all_sent)
    inputList.append(sentence)

# A list of strings is processed as a batch.
print(nlp(inputList))
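
With `aggregation_strategy="none"`, the pipeline returns one dictionary per predicted token, with keys such as `entity`, `word`, `start`, and `end`. The sketch below shows one possible way to cut the preprocessed input string at predicted boundaries. It rests on assumptions that this README does not confirm: the label name `"SENT_END"` is a placeholder (the real label set can be read from `model.config.id2label`), and the boundary label is assumed to mark the last token of a sentence. The helper name `split_text_at` is also hypothetical.

```python
from typing import Dict, List


def split_text_at(text: str, predictions: List[Dict], boundary_label: str) -> List[str]:
    """Cut `text` after every token whose predicted label is `boundary_label`.

    Uses the character offsets ("end") reported by the pipeline, so it does
    not depend on every token appearing in the output.
    """
    cuts = sorted(
        p["end"]
        for p in predictions
        if p["entity"] == boundary_label and p.get("end") is not None
    )
    pieces, prev = [], 0
    for cut in cuts:
        piece = text[prev:cut].strip()
        if piece:
            pieces.append(piece)
        prev = cut
    tail = text[prev:].strip()
    if tail:
        pieces.append(tail)
    return pieces


# Hand-written prediction for illustration; the label name is an assumption.
text = "ab cd _ ef gh"
fake_predictions = [
    {"entity": "SENT_END", "score": 0.9, "index": 3, "word": "_", "start": 6, "end": 7},
]
pieces = split_text_at(text, fake_predictions, "SENT_END")
print(pieces)
```

For the batch example, the same function can be applied to each pair of input string and prediction list.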

Hugging Face Models

  1. HoogBERTaEncoder
     • HoogBERTa: Feature Extraction and Masked Language Modeling
  2. HoogBERTaMuliTaskTagger:

Citation

Please cite as:

@inproceedings{porkaew2021hoogberta,
  title = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
  author = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
  booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
  year = {2021},
  address = {Online}
}

Download full-text PDF
Check out the code on GitHub
