---
license: mit
datasets:
- best2009
- scb_mt_enth_2020
- oscar
- wikipedia
language:
- th
widget:
  - text: วัน ที่ _ 12 _ มีนาคม นี้ _ ฉัน จะ ไป <mask> วัดพระแก้ว _ ที่ กรุงเทพ
library_name: transformers
---
# HoogBERTa

This repository contains the Thai pretrained language representation model (HoogBERTa_base) and a fine-tuned model for multi-task sequence labeling.


# Documentation


## Prerequisite
Since we use subword-nmt BPE encoding, input text needs to be pre-tokenized according to the [BEST](https://huggingface.co/datasets/best2009) standard before being passed to HoogBERTa. We use AttaCut for this step:
```
pip install attacut
```
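
For reference, here is a minimal sketch of that pre-tokenization step using AttaCut (the printed segmentation is illustrative and may vary with the AttaCut version):

```python
from attacut import tokenize

# Word-segment raw Thai text with AttaCut. HoogBERTa expects tokens
# joined by spaces, with literal underscores escaped beforehand.
text = "ฉันจะไปเที่ยววัดพระแก้ว"
pretokenized = " ".join(tokenize(text)).replace("_", "[!und:]")
print(pretokenized)  # e.g. "ฉัน จะ ไป เที่ยว วัดพระแก้ว"
```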

## Getting Started
To initialize the model from the hub, use the following code:
```python
from transformers import AutoTokenizer, AutoModel
from attacut import tokenize
import torch

tokenizer = AutoTokenizer.from_pretrained("new5558/HoogBERTa")
model = AutoModel.from_pretrained("new5558/HoogBERTa")
```

To annotate POS tags, named entities, and clause boundaries, use the multi-task tagger from the companion library; see the [Github repository](https://github.com/lstnlp/HoogBERTa) for the tagging API.

To extract token features from the RoBERTa-based encoder, use the following code:

```python
model.eval()
sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"

# Split on spaces, word-segment each chunk with AttaCut, and escape
# literal underscores; the original spaces are then encoded as "_" tokens.
all_sent = []
for sent in sentence.split(" "):
    all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))

sentence = " _ ".join(all_sent)
tokenized_text = tokenizer(sentence, return_tensors='pt')
token_ids = tokenized_text['input_ids']

# The last-layer hidden states are the token features.
with torch.no_grad():
    features = model(**tokenized_text, output_hidden_states=True).hidden_states[-1]
```
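
Here `features` has shape `(batch_size, sequence_length, hidden_size)`. If a single vector per sentence is needed, one generic option (a sketch, not an official HoogBERTa recipe) is to mean-pool the token features over non-padding positions:

```python
# Mean-pool last-layer states over real (non-padding) tokens to obtain
# one fixed-size vector per sentence; generic pooling, not from the card.
mask = tokenized_text['attention_mask'].unsqueeze(-1).float()   # (batch, seq, 1)
sentence_vec = (features * mask).sum(dim=1) / mask.sum(dim=1)   # (batch, hidden)
```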

For batch processing, pad the inputs to a common length:

```python
model.eval()
sentenceL = ["วันที่ 12 มีนาคมนี้", "ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]

# Apply the same pre-processing to each sentence in the batch.
inputList = []
for sentX in sentenceL:
    all_sent = []
    for sent in sentX.split(" "):
        all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))
    inputList.append(" _ ".join(all_sent))

# padding=True aligns the batch to a common length and returns the
# matching attention mask alongside the input IDs.
tokenized_text = tokenizer(inputList, padding=True, return_tensors='pt')
token_ids = tokenized_text['input_ids']

with torch.no_grad():
    features = model(**tokenized_text, output_hidden_states=True).hidden_states[-1]
```
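
Since this pre-processing is shared by both snippets above, it can be factored into a small helper (a hypothetical convenience function, not part of the released package):

```python
from attacut import tokenize

def preprocess(sentence: str) -> str:
    """Hypothetical helper: convert raw Thai text into the space-separated,
    underscore-escaped form that the HoogBERTa tokenizer expects."""
    segmented = [" ".join(tokenize(chunk)).replace("_", "[!und:]")
                 for chunk in sentence.split(" ")]
    return " _ ".join(segmented)

inputList = [preprocess(s) for s in sentenceL]
tokenized_text = tokenizer(inputList, padding=True, return_tensors='pt')
```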

To use HoogBERTa as a frozen embedding layer, pass the token IDs directly:

```python
# token_ids is a LongTensor of input IDs, e.g. from the tokenizer above.
with torch.no_grad():
    features = model(token_ids, output_hidden_states=True).hidden_states[-1]
```
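
As an illustration (a hedged sketch, not from the original model card), the frozen features can feed a downstream head such as a token-classification layer; `num_labels` below is an assumed label-set size:

```python
import torch.nn as nn

# Hypothetical downstream head: keep HoogBERTa frozen and classify each
# token from its last-layer representation (e.g., for sequence labeling).
num_labels = 5  # assumed label-set size, for illustration only
head = nn.Linear(model.config.hidden_size, num_labels)

with torch.no_grad():  # encoder stays frozen
    hidden = model(token_ids, output_hidden_states=True).hidden_states[-1]
logits = head(hidden)  # (batch_size, seq_len, num_labels)
```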


## Conversion Code
If you are interested in how the Fairseq and subword-nmt RoBERTa checkpoint was converted to the Huggingface format, the conversion code and parity tests are available here:
https://www.kaggle.com/norapatbuppodom/hoogberta-conversion


# Citation

Please cite as:

```bibtex
@inproceedings{porkaew2021hoogberta,
  title     = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
  author    = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
  booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
  year      = {2021},
  address   = {Online}
}
```

Download full-text [PDF](https://drive.google.com/file/d/1hwdyIssR5U_knhPE2HJigrc0rlkqWeLF/view?usp=sharing)  
Check out the code on [Github](https://github.com/lstnlp/HoogBERTa)