Commit f89e80b by Vinbrain · 1 Parent(s): 9f01ea1

init commit

Files changed (5)
  1. README.md +82 -0
  2. bpe.codes +0 -0
  3. config.json +30 -0
  4. pytorch_model.bin +3 -0
  5. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,82 @@
+ # <a name="introduction"></a> ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining
+
+ ViHealthBERT is a strong baseline language model for Vietnamese in the healthcare domain.
+
+ We empirically investigate our model with different training strategies, achieving state-of-the-art (SOTA) performance on 3 downstream tasks: NER (COVID-19 & ViMQ), Acronym Disambiguation, and Summarization.
+
+ We introduce two Vietnamese datasets: the acronym dataset (acrDrAid) and the FAQ summarization dataset in the healthcare domain. Our acrDrAid dataset is annotated with 135 sets of keywords.
+ The general approaches and experimental results of ViHealthBERT can be found in our LREC-2022 poster [paper]() (updated soon):
+
+ @article{vihealthbert,
+ title = {{ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining}},
+ author = {Minh Phuc Nguyen and Vu Hoang Tran and Vu Hoang and Ta Duc Huy and Trung H. Bui and Steven Q. H. Truong},
+ journal = {13th Edition of the Language Resources and Evaluation Conference},
+ year = {2022}
+ }
+
+ ### Installation <a name="install2"></a>
+ - Python 3.6+, and PyTorch >= 1.6
+ - Install `transformers`:
+ `pip install transformers==4.2.0`
+
+ ### Pre-trained models <a name="models2"></a>
+
+ Model | #params | Arch. | Tokenizer
+ ---|---|---|---
+ `demdecuong/vihealthbert-base-word` | 135M | base | Word-level
+ `demdecuong/vihealthbert-base-syllable` | 135M | base | Syllable-level
+
+ ### Example usage <a name="usage1"></a>
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ vihealthbert = AutoModel.from_pretrained("demdecuong/vihealthbert-base-word")
+ tokenizer = AutoTokenizer.from_pretrained("demdecuong/vihealthbert-base-word")
+
+ # INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
+ line = "Tôi là sinh_viên trường đại_học Công_nghệ ."
+
+ # Encode the sentence into a batch of one sequence of token ids
+ input_ids = torch.tensor([tokenizer.encode(line)])
+ with torch.no_grad():
+     features = vihealthbert(input_ids)  # features[0] holds the last-layer hidden states
+ ```
+
+ ### Example usage for raw text <a name="usage2"></a>
+ Since ViHealthBERT used the [RDRSegmenter](https://github.com/datquocnguyen/RDRsegmenter) from [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) to pre-process the pre-training data, we highly recommend using the same word segmenter for ViHealthBERT downstream applications.
+
+ #### Installation
+ ```
+ # Install the vncorenlp python wrapper
+ pip3 install vncorenlp
+
+ # Download VnCoreNLP-1.1.1.jar & its word segmentation component (i.e. RDRSegmenter)
+ mkdir -p vncorenlp/models/wordsegmenter
+ wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/VnCoreNLP-1.1.1.jar
+ wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/vi-vocab
+ wget https://raw.githubusercontent.com/vncorenlp/VnCoreNLP/master/models/wordsegmenter/wordsegmenter.rdr
+ mv VnCoreNLP-1.1.1.jar vncorenlp/
+ mv vi-vocab vncorenlp/models/wordsegmenter/
+ mv wordsegmenter.rdr vncorenlp/models/wordsegmenter/
+ ```
+
+ `VnCoreNLP-1.1.1.jar` (27MB) and the `models/` folder must be placed in the same working folder.
+
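The layout requirement above can be checked before starting the segmenter. The following is a minimal sketch, assuming the three files named in the download commands are the only ones needed; `REQUIRED_FILES` and `missing_vncorenlp_files` are hypothetical names introduced here and are not part of the `vncorenlp` package.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the download commands in the installation step.
REQUIRED_FILES = [
    "VnCoreNLP-1.1.1.jar",
    "models/wordsegmenter/vi-vocab",
    "models/wordsegmenter/wordsegmenter.rdr",
]


def missing_vncorenlp_files(root: str) -> List[str]:
    """Return the required files that are absent under `root` (hypothetical helper)."""
    base = Path(root)
    return [rel for rel in REQUIRED_FILES if not (base / rel).is_file()]
```

Calling `missing_vncorenlp_files("vncorenlp")` after the installation step should return an empty list if the jar and the `models/` folder sit side by side.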
+ #### Example usage
+ ```python
+ # See more details at: https://github.com/vncorenlp/VnCoreNLP
+
+ # Load rdrsegmenter from VnCoreNLP
+ from vncorenlp import VnCoreNLP
+ rdrsegmenter = VnCoreNLP("/Absolute-path-to/vncorenlp/VnCoreNLP-1.1.1.jar", annotators="wseg", max_heap_size='-Xmx500m')
+
+ # Input
+ text = "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây."
+
+ # To perform word (and sentence) segmentation
+ sentences = rdrsegmenter.tokenize(text)
+ for sentence in sentences:
+     print(" ".join(sentence))
+ ```
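To connect the segmenter output to the word-level model, each segmented sentence can be joined into one space-separated string, which is the input format the word-level example expects. The sketch below assumes a segmenter callable that returns one list of tokens per sentence, as `rdrsegmenter.tokenize` does; `segment_to_lines` and `fake_tokenize` are hypothetical names used only for illustration, and the stand-in segmenter lets the snippet run without Java or the jar.

```python
from typing import Callable, Iterable, List


def segment_to_lines(
    text: str, tokenize: Callable[[str], Iterable[List[str]]]
) -> List[str]:
    """Return one space-joined, word-segmented string per sentence."""
    return [" ".join(tokens) for tokens in tokenize(text)]


def fake_tokenize(text: str) -> List[List[str]]:
    """Stand-in for rdrsegmenter.tokenize; returns a fixed segmentation."""
    return [["Tôi", "là", "sinh_viên", "trường", "đại_học", "Công_nghệ", "."]]


lines = segment_to_lines("Tôi là sinh viên trường đại học Công nghệ .", fake_tokenize)
# Each entry in `lines` can then be passed to tokenizer.encode(...)
```

In practice `rdrsegmenter.tokenize` would be passed in place of `fake_tokenize`.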
bpe.codes ADDED
The diff for this file is too large to render. See raw diff
 
config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "_name_or_path": "/workspace/vinbrain/minhnp/pretrainedLM/phobert-base",
+   "architectures": [
+     "RobertaModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "finetuning_task": "word-level",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 258,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "tokenizer_class": "PhobertTokenizer",
+   "torch_dtype": "float32",
+   "transformers_version": "4.11.3",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 64001
+ }
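As a rough cross-check of the 135M figure in the README table, the parameter count can be derived from the config values above. This is a sketch that assumes a standard RoBERTa encoder layout (word, position, and token-type embeddings, 12 identical layers, and a pooler, with no language-modelling head); `roberta_param_count` is a hypothetical helper, not a `transformers` function.

```python
def roberta_param_count(
    vocab_size: int,
    hidden_size: int,
    num_layers: int,
    intermediate_size: int,
    max_positions: int,
    type_vocab_size: int,
    with_pooler: bool = True,
) -> int:
    """Count weights and biases of a RoBERTa-style encoder (assumed layout)."""
    layer_norm = 2 * hidden_size  # scale and bias
    embeddings = (
        vocab_size + max_positions + type_vocab_size
    ) * hidden_size + layer_norm
    # Query, key, value, and output projections, plus the attention LayerNorm
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    # Two feed-forward projections, plus the output LayerNorm
    feed_forward = (
        (hidden_size * intermediate_size + intermediate_size)
        + (intermediate_size * hidden_size + hidden_size)
        + layer_norm
    )
    pooler = hidden_size * hidden_size + hidden_size if with_pooler else 0
    return embeddings + num_layers * (attention + feed_forward) + pooler


total = roberta_param_count(
    vocab_size=64001,
    hidden_size=768,
    num_layers=12,
    intermediate_size=3072,
    max_positions=258,
    type_vocab_size=1,
)
print(total)  # 134998272, i.e. about 135M
```

At 4 bytes per `float32` parameter this gives 539,993,088 bytes, close to the 540072433-byte size recorded for `pytorch_model.bin`; the small remainder is presumably serialization overhead.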
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3259d96cb56cb1c85b0a5a894e782a6a4ee66d783333d83368fba36d23be2d06
+ size 540072433
vocab.txt ADDED
The diff for this file is too large to render. See raw diff