cosmoquester committed on
Commit 808ffb8
• 1 Parent(s): 249bf6f

feat: Add Model

README.md ADDED
@@ -0,0 +1,41 @@
+ ---
+ language: ko
+ ---
+
+ # Pretrained BART in Korean
+
+ This is a BART model pretrained on multiple Korean datasets.
+
+ I used multiple datasets so that the model generalizes to both colloquial and written text.
+
+ The training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.
+
+ The script used to pre-train the model is [here](https://github.com/cosmoquester/transformers-bart-pretrain).
+
+ When you use the inference API, you must wrap the sentence with `[BOS]` and `[EOS]`, as in the example below.
+
+ ```
+ [BOS] 안녕하세요? 반가워요~~ [EOS]
+ ```
+
+ You can also test mask-filling performance using the `[MASK]` token like this.
+ ```
+ [BOS] [MASK] 먹었어? [EOS]
+ ```
+
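The two input formats above can be prepared programmatically. A minimal sketch: the `wrap_sentence` helper is hypothetical (not part of this repo), and the model id `cosmoquester/bart-ko-mini` in the comments is an assumption inferred from the commit author and `_name_or_path` in config.json.

```python
# Sketch of preparing input for this model, per the README's requirement
# that sentences be wrapped with the [BOS]/[EOS] special tokens
# (see special_tokens_map.json).

def wrap_sentence(sentence: str) -> str:
    """Wrap a raw sentence with the model's [BOS] and [EOS] tokens."""
    return f"[BOS] {sentence} [EOS]"

print(wrap_sentence("안녕하세요? 반가워요~~"))  # [BOS] 안녕하세요? 반가워요~~ [EOS]
print(wrap_sentence("[MASK] 먹었어?"))          # mask-filling probe

# Actually running the model requires downloading the weights, e.g.
# (model id is an assumption):
# from transformers import AutoTokenizer, BartForConditionalGeneration
# tokenizer = AutoTokenizer.from_pretrained("cosmoquester/bart-ko-mini")
# model = BartForConditionalGeneration.from_pretrained("cosmoquester/bart-ko-mini")
```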
+ ## Used Datasets
+
+ ### [모두의 말뭉치 (Modu Corpus)](https://corpus.korean.go.kr/)
+ - 일상 대화 말뭉치 2020 (Everyday Conversation Corpus 2020)
+ - 구어 말뭉치 (Spoken Corpus)
+ - 문어 말뭉치 (Written Corpus)
+ - 신문 말뭉치 (Newspaper Corpus)
+
+ ### AIhub
+ - [개방데이터 전문분야말뭉치 (Specialized-Domain Corpus)](https://aihub.or.kr/aidata/30717)
+ - [개방데이터 한국어대화요약 (Korean Dialogue Summarization)](https://aihub.or.kr/aidata/30714)
+ - [개방데이터 감성 대화 말뭉치 (Emotional Dialogue Corpus)](https://aihub.or.kr/aidata/7978)
+ - [개방데이터 한국어 음성 (Korean Speech)](https://aihub.or.kr/aidata/105)
+ - [개방데이터 한국어 SNS (Korean SNS)](https://aihub.or.kr/aidata/30718)
+
+ ### [세종 말뭉치 (Sejong Corpus)](https://ithub.korean.go.kr/)
config.json ADDED
@@ -0,0 +1,45 @@
+ {
+ "_name_or_path": "bart-ko-mini",
+ "activation_dropout": 0.1,
+ "activation_function": "gelu",
+ "architectures": [
+ "BartForConditionalGeneration"
+ ],
+ "attention_dropout": 0.1,
+ "bos_token_id": 2,
+ "classifier_dropout": 0.0,
+ "d_model": 256,
+ "decoder_attention_heads": 4,
+ "decoder_ffn_dim": 1024,
+ "decoder_layerdrop": 0.0,
+ "decoder_layers": 2,
+ "decoder_start_token_id": 2,
+ "dropout": 0.1,
+ "encoder_attention_heads": 4,
+ "encoder_ffn_dim": 1024,
+ "encoder_layerdrop": 0.0,
+ "encoder_layers": 2,
+ "eos_token_id": 3,
+ "forced_eos_token_id": 3,
+ "gradient_checkpointing": false,
+ "id2label": {
+ "0": "LABEL_0",
+ "1": "LABEL_1",
+ "2": "LABEL_2"
+ },
+ "init_std": 0.02,
+ "is_encoder_decoder": true,
+ "label2id": {
+ "LABEL_0": 0,
+ "LABEL_1": 1,
+ "LABEL_2": 2
+ },
+ "max_position_embeddings": 2048,
+ "model_type": "bart",
+ "num_hidden_layers": 2,
+ "pad_token_id": 0,
+ "scale_embedding": false,
+ "transformers_version": "4.7.0",
+ "use_cache": false,
+ "vocab_size": 32000
+ }
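The config describes a small BART (hence the `bart-ko-mini` name): a 2-layer encoder and decoder with `d_model=256` and 4 attention heads. The shape arithmetic can be sanity-checked offline from the values shown above; this is a sketch using only the standard library (loading the real config would go through `transformers.BartConfig.from_pretrained` instead).

```python
import json

# Architecture-relevant fields reproduced verbatim from the config.json above.
config = json.loads("""
{
  "d_model": 256,
  "encoder_layers": 2,
  "decoder_layers": 2,
  "encoder_attention_heads": 4,
  "decoder_attention_heads": 4,
  "encoder_ffn_dim": 1024,
  "decoder_ffn_dim": 1024,
  "max_position_embeddings": 2048,
  "vocab_size": 32000
}
""")

# Each attention head attends over d_model / num_heads dimensions.
head_dim = config["d_model"] // config["encoder_attention_heads"]
print(head_dim)  # 64

# The token embedding table alone holds vocab_size * d_model parameters.
embedding_params = config["vocab_size"] * config["d_model"]
print(embedding_params)  # 8192000
```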
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e4aa890dd0c2c6b4108db89c3e46826fa02731b4bdf0560df4e344b1e1db9041
+ size 51876625
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "[BOS]", "eos_token": "[EOS]", "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "mask_token": "[MASK]"}
tf_model.h5 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2a7863617a9c4d1495b4ffda42552de2a96784969d8662bd9f9129735cc5754e
+ size 51823224
tokenizer.json ADDED
The diff for this file is too large to render.
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"bos_token": "[BOS]", "eos_token": "[EOS]", "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "mask_token": "[MASK]"}