new5558 committed on
Commit bc06b07 · 1 Parent(s): b6305b4

Update README.md

Files changed (1)
  1. README.md +90 -1
README.md CHANGED
@@ -6,4 +6,93 @@ language:
  widget:
  - text: วัน ที่ _ 12 _ มีนาคม นี้ _ ฉัน จะ ไป เที่ยว วัดพระแก้ว _ ที่ กรุงเทพ
  library_name: transformers
- ---
+ ---
+ # HoogBERTa
+
+ This repository contains the Thai pretrained language representation (HoogBERTa_base) fine-tuned for the sentence boundary classification task.
+
+
+ # Documentation
+
+
+ ## Prerequisite
+ Because HoogBERTa uses subword-nmt BPE encoding, input text must be pre-tokenized into words following the [BEST](https://huggingface.co/datasets/best2009) standard before it is passed to the model. The examples in this README use AttaCut for this step:
+ ```
+ pip install attacut
+ ```
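The pre-tokenization described above can be sketched as a small helper. This is a minimal illustration rather than part of the HoogBERTa code base: the names `to_hoogberta_input` and `toy_tokenize` are hypothetical, and the toy tokenizer only stands in for a real word tokenizer such as `attacut.tokenize` so that the snippet runs without extra dependencies. The `" _ "` space marker and the `[!und:]` escape follow the usage examples in this README.

```python
from typing import Callable, List


def to_hoogberta_input(text: str, tokenize: Callable[[str], List[str]]) -> str:
    """Word-tokenize `text` and mark the original spaces with ' _ '."""
    chunks = []
    for chunk in text.split(" "):
        words = " ".join(tokenize(chunk))
        # Escape literal underscores so they are not mistaken for the space marker.
        chunks.append(words.replace("_", "[!und:]"))
    return " _ ".join(chunks)


def toy_tokenize(chunk: str) -> List[str]:
    # Hypothetical stand-in tokenizer: word boundaries are given by "|".
    return chunk.split("|")


print(to_hoogberta_input("วัน|ที่ 12 มีนาคม|นี้", toy_tokenize))
```

In practice, `toy_tokenize` would be replaced by a real word tokenizer, and the returned string is what gets passed to the model or pipeline.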
+
+ ## Getting Started
+ To initialize the model from the Hub, use the following commands:
+ ```python
+ from transformers import RobertaTokenizerFast, RobertaForTokenClassification
+ from attacut import tokenize  # word tokenizer used for pre-tokenization
+ import torch
+
+ # Sentence boundary checkpoint (HoogBERTa-SENTENCE-lst20)
+ tokenizer = RobertaTokenizerFast.from_pretrained("new5558/HoogBERTa-SENTENCE-lst20")
+ model = RobertaForTokenClassification.from_pretrained("new5558/HoogBERTa-SENTENCE-lst20")
+ ```
+
+ To run sentence boundary classification, use the following commands:
+
+ ```python
+ from transformers import pipeline
+
+ nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")
+
+ sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"
+ all_sent = []
+ sentences = sentence.split(" ")
+ for sent in sentences:
+     # Word-tokenize each space-separated chunk and escape literal underscores
+     all_sent.append(" ".join(tokenize(sent)).replace("_","[!und:]"))
+
+ # Mark the original spaces with " _ "
+ sentence = " _ ".join(all_sent)
+
+ print(nlp(sentence))
+ ```
+
+ For batch processing:
+
+ ```python
+ from transformers import pipeline
+
+ nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")
+
+ sentenceL = ["วันที่ 12 มีนาคมนี้","ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
+ inputList = []
+ for sentX in sentenceL:
+     sentences = sentX.split(" ")
+     all_sent = []
+     for sent in sentences:
+         # Word-tokenize each space-separated chunk and escape literal underscores
+         all_sent.append(" ".join(tokenize(sent)).replace("_","[!und:]"))
+
+     # Mark the original spaces with " _ " and collect the prepared input
+     sentence = " _ ".join(all_sent)
+     inputList.append(sentence)
+
+ print(nlp(inputList))
+ ```
+
+ # Hugging Face Models
+ 1. `HoogBERTaEncoder`
+    - [HoogBERTa](https://huggingface.co/new5558/HoogBERTa): `Feature Extraction` and `Masked Language Modeling`
+ 2. `HoogBERTaMuliTaskTagger`:
+    - [HoogBERTa-NER-lst20](https://huggingface.co/new5558/HoogBERTa-NER-lst20): `Named-entity recognition (NER)` based on LST20
+    - [HoogBERTa-POS-lst20](https://huggingface.co/new5558/HoogBERTa-POS-lst20): `Part-of-speech tagging (POS)` based on LST20
+    - [HoogBERTa-SENTENCE-lst20](https://huggingface.co/new5558/HoogBERTa-SENTENCE-lst20): `Clause Boundary Classification` based on LST20
+
+
+ # Citation
+
+ Please cite as:
+
+ ```bibtex
+ @inproceedings{porkaew2021hoogberta,
+   title = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
+   author = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
+   booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
+   year = {2021},
+   address = {Online}
+ }
+ ```
+
+ Download the full-text [PDF](https://drive.google.com/file/d/1hwdyIssR5U_knhPE2HJigrc0rlkqWeLF/view?usp=sharing).
+ Check out the code on [GitHub](https://github.com/lstnlp/HoogBERTa).