Commit
•
34b9653
1
Parent(s):
812449f
Build out the model card (#2)
- Build out the model card (0f1e25c8e3f80163d0e39636262cb3d70ccc3f00)
Co-authored-by: Marissa Gerchick <Marissa@users.noreply.huggingface.co>
README.md
CHANGED
@@ -1,5 +1,6 @@
|
|
1 |
---
|
2 |
language: ko
|
|
|
3 |
tags:
|
4 |
- korean
|
5 |
- klue
|
@@ -10,9 +11,32 @@ widget:
|
|
10 |
|
11 |
# KLUE BERT base
|
12 |
|
13 |
-
|
14 |
|
15 |
-
##
|
16 |
|
17 |
```python
|
18 |
from transformers import AutoModel, AutoTokenizer
|
@@ -21,7 +45,102 @@ model = AutoModel.from_pretrained("klue/bert-base")
|
|
21 |
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
|
22 |
```
|
23 |
|
24 |
-
##
|
25 |
|
26 |
```bibtex
|
27 |
@misc{park2021klue,
|
|
|
1 |
---
|
2 |
language: ko
|
3 |
+
license: cc-by-sa-4.0
|
4 |
tags:
|
5 |
- korean
|
6 |
- klue
|
|
|
11 |
|
12 |
# KLUE BERT base
|
13 |
|
14 |
+
## Table of Contents
|
15 |
+
- [Model Details](#model-details)
|
16 |
+
- [How to Get Started With the Model](#how-to-get-started-with-the-model)
|
17 |
+
- [Uses](#uses)
|
18 |
+
- [Risks, Limitations and Biases](#risks-limitations-and-biases)
|
19 |
+
- [Training](#training)
|
20 |
+
- [Evaluation](#evaluation)
|
21 |
+
- [Environmental Impact](#environmental-impact)
|
22 |
+
- [Technical Specifications](#technical-specifications)
|
23 |
+
- [Citation Information](#citation-information)
|
24 |
+
- [Model Card Authors](#model-card-authors)
|
25 |
|
26 |
+
## Model Details
|
27 |
+
|
28 |
+
**Model Description:** KLUE BERT base is a BERT model pre-trained on Korean text. The developers built it as part of their work on the [Korean Language Understanding Evaluation (KLUE) Benchmark](https://arxiv.org/pdf/2105.09680.pdf).
|
29 |
+
|
30 |
+
- **Developed by:** See [GitHub Repo](https://github.com/KLUE-benchmark/KLUE) for model developers
|
31 |
+
- **Model Type:** Transformer-based language model
|
32 |
+
- **Language(s):** Korean
|
33 |
+
- **License:** cc-by-sa-4.0
|
34 |
+
- **Parent Model:** See the [BERT base uncased model](https://huggingface.co/bert-base-uncased) for more information about the BERT base model.
|
35 |
+
- **Resources for more information:**
|
36 |
+
- [Research Paper](https://arxiv.org/abs/2105.09680)
|
37 |
+
- [GitHub Repo](https://github.com/KLUE-benchmark/KLUE)
|
38 |
+
|
39 |
+
## How to Get Started With the Model
|
40 |
|
41 |
```python
|
42 |
from transformers import AutoModel, AutoTokenizer
|
|
|
45 |
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
|
46 |
```
|
47 |
|
48 |
+
## Uses
|
49 |
+
|
50 |
+
#### Direct Use
|
51 |
+
|
52 |
+
The model can be used for tasks including topic classification, semantic textual similarity, natural language inference, named entity recognition, and other tasks outlined in the [KLUE Benchmark](https://github.com/KLUE-benchmark/KLUE).
|
53 |
+
|
54 |
+
#### Misuse and Out-of-scope Use
|
55 |
+
|
56 |
+
The model should not be used to intentionally create hostile or alienating environments for people. In addition, the model was not trained to produce factual or true representations of people or events, so using it to generate such content is out of scope for its abilities.
|
57 |
+
|
58 |
+
## Risks, Limitations and Biases
|
59 |
+
|
60 |
+
Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). The model developers discuss several ethical considerations related to the model in the [paper](https://arxiv.org/pdf/2105.09680.pdf), including:
|
61 |
+
|
62 |
+
- Bias issues with the publicly available data used in the pretraining corpora (and considerations related to filtering)
|
63 |
+
- PII in the data used in the pretraining corpora (and efforts to pseudonymize the data)
|
64 |
+
|
65 |
+
For ethical considerations related to the KLUE Benchmark, also see the [paper](https://arxiv.org/pdf/2105.09680.pdf).
|
66 |
+
|
67 |
+
## Training
|
68 |
+
|
69 |
+
#### Training Data
|
70 |
+
|
71 |
+
The authors use the following pretraining corpora for the model, described in the [associated paper](https://arxiv.org/pdf/2105.09680.pdf):
|
72 |
+
|
73 |
+
> We gather the following five publicly available Korean corpora from diverse sources to cover a broad set of topics and many different styles. We combine these corpora to build the final pretraining corpus of size approximately 62GB.
|
74 |
+
>
|
75 |
+
> - **MODU:** [Modu Corpus](https://corpus.korean.go.kr) is a collection of Korean corpora distributed by [National Institute of Korean Languages](https://corpus.korean.go.kr/). It includes both formal articles (news and books) and colloquial text (dialogues).
|
76 |
+
> - **CC-100-Kor:** [CC-100](https://data.statmt.org/cc-100/) is the large-scale multilingual web crawled corpora by using CC-Net ([Wenzek et al., 2020](https://www.aclweb.org/anthology/2020.lrec-1.494)). This is used for training XLM-R ([Conneau et al., 2020](https://aclanthology.org/2020.acl-main.747/)). We use the Korean portion from this corpora.
|
77 |
+
> - **NAMUWIKI:** NAMUWIKI is a Korean web-based encyclopedia, similar to Wikipedia, but known to be less formal. Specifically, we download [the dump](http://dump.thewiki.kr) created on March 2nd, 2020.
|
78 |
+
> - **NEWSCRAWL:** NEWSCRAWL consists of 12,800,000 news articles published from 2011 to 2020, collected from a news aggregation platform.
|
79 |
+
> - **PETITION:** Petition is a collection of public petitions posted to the Blue House asking for administrative actions on social issues. We use the articles in the [Blue House National Petition](https://www1.president.go.kr/petitions) published from [August 2017 to March 2019](https://ko-nlp.github.io/Korpora/en-docs/corpuslist/korean_petitions.html).
|
80 |
+
|
81 |
+
The authors also describe ethical considerations related to the pretraining corpora in the [associated paper](https://arxiv.org/pdf/2105.09680.pdf).
|
82 |
+
|
83 |
+
#### Training Procedure
|
84 |
+
|
85 |
+
##### Preprocessing
|
86 |
+
|
87 |
+
The authors describe their preprocessing procedure in the [associated paper](https://arxiv.org/pdf/2105.09680.pdf):
|
88 |
+
|
89 |
+
> We filter noisy text and non-Korean text using the same methods from Section 2.3 (of the paper). Each document in the corpus is split into sentences using C++ implementation (v1.3.1.) of rule-based [Korean Sentence Splitter (KSS)](https://github.com/likejazz/korean-sentence-splitter). For CC-100-Kor and NEWSCRAWL, we keep sentences of length greater than equal to 200 characters, as a heuristics to keep well-formed sentences. We then remove sentences included in our benchmark task datasets, using BM25 as a sentence similarity metric ([reference](https://www.microsoft.com/en-us/research/publication/okapi-at-trec-3/)).
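The two filtering steps in the quote above (a minimum sentence length, then removal of sentences that resemble benchmark data) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names, whitespace tokenization, the `k1` and `b` values, and the `threshold` cut-off are all assumptions, since the quoted passage only names BM25 and the 200-character limit.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of every tokenized document against one query.

    k1 and b are common defaults, assumed here; the source does not give them.
    """
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


def filter_sentences(sentences, benchmark_sentences, threshold, min_chars=200):
    """Keep long sentences that do not score highly against benchmark text.

    `threshold` is a hypothetical cut-off; whitespace splitting stands in
    for whatever tokenization the authors actually used.
    """
    kept = [s for s in sentences if len(s) >= min_chars]
    if not kept or not benchmark_sentences:
        return kept
    corpus = [s.split() for s in kept]
    flagged = set()
    for bench in benchmark_sentences:
        for i, score in enumerate(bm25_scores(bench.split(), corpus)):
            if score >= threshold:
                flagged.add(i)
    return [s for i, s in enumerate(kept) if i not in flagged]
```

Calling `filter_sentences(corpus_sentences, benchmark_sentences, threshold=...)` returns the sentences that pass both checks; a suitable threshold would have to be tuned on real data.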
|
90 |
+
|
91 |
+
###### Tokenization
|
92 |
+
|
93 |
+
The authors describe their tokenization procedure in the [associated paper](https://arxiv.org/pdf/2105.09680.pdf):
|
94 |
+
|
95 |
+
> We design and use a new tokenization method, morpheme-based subword tokenization. When building a vocabulary, we pre-tokenize a raw text into morphemes using a morphological analyzer, and then we apply byte pair encoding (BPE) ([Sennrich et al., 2016](https://aclanthology.org/P16-1162/)) to get the final vocabulary. For morpheme segmentation, we use [Mecab-ko](https://bitbucket.org/eunjeon/mecab-ko), MeCab ([Kudo, 2006](https://taku910.github.io/mecab/)) adapted for Korean, and for BPE segmentation, we use the wordpiece tokenizer from [Huggingface Tokenizers library](https://github.com/huggingface/tokenizers). We specify the vocabulary size to 32k. After building the vocabulary, we only use the BPE model during inference, which allows us to tokenize a word sequence by reflecting morphemes without a morphological analyzer. This improves both usability and speed.
|
96 |
+
|
97 |
+
The training configurations are further described in the [paper](https://arxiv.org/pdf/2105.09680.pdf).
|
98 |
+
|
99 |
+
## Evaluation
|
100 |
+
|
101 |
+
#### Testing Data, Factors and Metrics
|
102 |
+
|
103 |
+
The model was evaluated on the [KLUE Benchmark](https://github.com/KLUE-benchmark/KLUE). The tasks and metrics from the KLUE Benchmark that were used to evaluate this model are described briefly below. For more information about the KLUE Benchmark, see the [data card](https://huggingface.co/datasets/klue), [GitHub repository](https://github.com/KLUE-benchmark/KLUE), and [associated paper](https://arxiv.org/pdf/2105.09680.pdf).
|
104 |
+
|
105 |
+
- **Task:** Topic Classification (TC) - Yonhap News Agency Topic Classification (YNAT), **Metrics:** Macro F1 score, defined as the mean of topic-wise F1 scores, giving the same importance to each topic.
|
106 |
+
|
107 |
+
- **Task:** Semantic Textual Similarity (STS), **Metrics:** Pearson's correlation coefficient (Pearson's r) and F1 score
|
108 |
+
|
109 |
+
- **Task:** Natural Language Inference (NLI), **Metrics:** Accuracy
|
110 |
+
|
111 |
+
- **Task:** Named Entity Recognition (NER), **Metrics:** Entity-level macro F1 (Entity F1) and character-level macro F1 (Char F1) scores
|
112 |
+
|
113 |
+
- **Task:** Relation Extraction (RE), **Metrics:** Micro F1 score on relation-existing cases and area under the precision-recall curve (AUPRC) on all classes
|
114 |
+
|
115 |
+
- **Task:** Dependency Parsing (DP), **Metrics:** Unlabeled attachment score (UAS) and labeled attachment score (LAS)
|
116 |
+
|
117 |
+
- **Task:** Machine Reading Comprehension (MRC), **Metrics:** Exact match (EM) and character-level ROUGE-W (ROUGE), which can be viewed as a longest common consecutive subsequence (LCCS)-based F1 score.
|
118 |
+
|
119 |
+
- **Task:** Dialogue State Tracking (DST), **Metrics:** Joint goal accuracy (JGA) and slot micro F1 score (Slot F1)
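To make two of the metrics above concrete, here is a minimal pure-Python sketch of macro F1 (as used for YNAT) and Pearson's r (as used for STS). The function names are illustrative and this is not the official KLUE evaluation code; it assumes labels and scores arrive as plain Python lists.

```python
from math import sqrt


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores, so every label counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Because macro F1 averages over labels rather than examples, a rare topic that is always misclassified pulls the score down as much as a frequent one would.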
|
120 |
+
|
121 |
+
#### Results
|
122 |
+
|
123 |
+
| Task | TC | STS | | NLI | NER | | RE | | DP | | MRC | | DST | |
|
124 |
+
| :---: |:---: | :---: | :---: |:---:| :---: | :---: |:---:| :---:| :---: |:---: | :---: | :---:| :---: | :---: |
|
125 |
+
| Metric | F1 | Pearson's r| F1 | ACC | Entity F1 | Char F1 | F1 | AUPRC| UAS | LAS | EM | ROUGE| JGA |Slot F1 |
|
126 |
+
| | 85.73| 90.85 | 82.84 |81.63| 83.97 | 91.39 |66.44| 66.17| 89.96 |88.05 | 62.32 | 68.51| 46.64 | 91.61 |
|
127 |
+
|
128 |
+
|
129 |
+
## Environmental Impact
|
130 |
+
|
131 |
+
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). We present the hardware type based on the [associated paper](https://arxiv.org/pdf/2105.09680.pdf).
|
132 |
+
|
133 |
+
- **Hardware Type:** TPU v3-8
|
134 |
+
- **Hours used:** Unknown
|
135 |
+
- **Cloud Provider:** Unknown
|
136 |
+
- **Compute Region:** Unknown
|
137 |
+
- **Carbon Emitted:** Unknown
|
138 |
+
|
139 |
+
## Technical Specifications
|
140 |
+
|
141 |
+
See the [associated paper](https://arxiv.org/pdf/2105.09680.pdf) for details on the modeling architecture (BERT), objective, compute infrastructure, and training details.
|
142 |
+
|
143 |
+
## Citation Information
|
144 |
|
145 |
```bibtex
|
146 |
@misc{park2021klue,
|