TakSung committed on
Commit f30b737 · 1 Parent(s): baf42a4

Update README.md

Files changed (1)
  1. README.md +40 -1
README.md CHANGED
@@ -2,4 +2,43 @@
  license: mit
  language:
  - ko
- ---
+ ---
+
+ # Kconvo-roberta: Korean conversation RoBERTa ([github](https://github.com/HeoTaksung/Domain-Robust-Retraining-of-Pretrained-Language-Model))
+ - There are many PLMs (Pretrained Language Models) for Korean, but most of them target written language.
+ - Here, we introduce a PLM retrained for prediction on Korean conversation data.
+
+ ## Usage
+ ```python
+ # Kconvo-roberta
+ from transformers import RobertaTokenizerFast, RobertaModel
+
+ # Load the tokenizer and the encoder weights from the Hugging Face Hub
+ tokenizer_roberta = RobertaTokenizerFast.from_pretrained("yeongjoon/Kconvo-roberta")
+ model_roberta = RobertaModel.from_pretrained("yeongjoon/Kconvo-roberta")
+ ```
+
+ -----------------
+ ## Domain Robust Retraining of Pretrained Language Model
+
+ - Kconvo-roberta uses [klue/roberta-base](https://huggingface.co/klue/roberta-base) as its base model and further trains it on conversation datasets.
+ - The retraining data was collected from the [National Institute of the Korean Language](https://corpus.korean.go.kr/request/corpusRegist.do) and [AI-Hub](https://www.aihub.or.kr/aihubdata/data/list.do?pageIndex=1&currMenu=115&topMenu=100&dataSetSn=&srchdataClCode=DATACL001&srchOrder=&SrchdataClCode=DATACL002&searchKeyword=&srchDataRealmCode=REALM002&srchDataTy=DATA003); the collected datasets are listed below.
+
+ ```
+ - National Institute of the Korean Language
+ * 온라인 대화 말뭉치 2021 (Online Conversation Corpus 2021)
+ * 일상 대화 말뭉치 2020 (Daily Conversation Corpus 2020)
+ * 구어 말뭉치 (Spoken Language Corpus)
+ * 메신저 말뭉치 (Messenger Corpus)
+
+ - AI-Hub
+ * 온라인 구어체 말뭉치 데이터 (Online Colloquial Corpus Data)
+ * 상담 음성 (Counseling Speech)
+ * 한국어 음성 (Korean Speech)
+ * 자유대화 음성(일반남여) (Free Conversation Speech, General Men and Women)
+ * 일상생활 및 구어체 한-영 번역 병렬 말뭉치 데이터 (Daily Life and Colloquial Korean-English Translation Parallel Corpus Data)
+ * 한국인 대화음성 (Korean Conversational Speech)
+ * 감성 대화 말뭉치 (Emotional Conversation Corpus)
+ * 주제별 텍스트 일상 대화 데이터 (Topic-specific Text Daily Conversation Data)
+ * 용도별 목적대화 데이터 (Purpose-specific Goal-oriented Dialogue Data)
+ * 한국어 SNS (Korean SNS)
+ ```
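
The Usage snippet in this README stops at loading the tokenizer and model. As a rough sketch that is not part of this commit, the loaded encoder could be used to get one vector per utterance by mean-pooling the final hidden states over non-padding tokens. The helper names `mean_pool` and `embed` are hypothetical, and mean pooling is an assumption here; the README does not say how the model's outputs are meant to be consumed.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is non-zero."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed(sentences: List[str],
          model_name: str = "yeongjoon/Kconvo-roberta") -> List[List[float]]:
    """Hypothetical helper: encode sentences and mean-pool the last layer.

    Heavy imports live inside the function so that mean_pool stays usable
    without torch or transformers installed. Calling this downloads the
    checkpoint named in the README from the Hugging Face Hub.
    """
    import torch
    from transformers import RobertaModel, RobertaTokenizerFast

    tokenizer = RobertaTokenizerFast.from_pretrained(model_name)
    model = RobertaModel.from_pretrained(model_name)
    model.eval()

    # Pad to the longest sentence in the batch and return PyTorch tensors
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)

    hidden = outputs.last_hidden_state.tolist()   # [batch, seq_len, hidden]
    masks = batch["attention_mask"].tolist()      # [batch, seq_len]
    return [mean_pool(h, m) for h, m in zip(hidden, masks)]
```

Something like `embed(["오늘 뭐 먹을까?", "ㅋㅋ 나도 몰라"])` would then return one list of floats per input sentence. Converting tensors to Python lists keeps the pooling step easy to inspect, at the cost of speed; for larger batches the same masking and averaging would be better done directly on tensors.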