---
language:
- ko
---

# KR-FinBert & KR-FinBert-SC

Much progress has been made in natural language processing (NLP), with numerous studies showing that domain adaptation on a small-scale corpus, followed by fine-tuning on labeled data, improves overall performance.

We propose KR-FinBert for the financial domain, obtained by further pre-training on a financial corpus, and fine-tune it for sentiment analysis. As in those studies, domain adaptation followed by training on the downstream task brought a clear performance improvement in this experiment.

![KR-FinBert](https://huggingface.co/snunlp/KR-FinBert/resolve/main/images/KR-FinBert.png)

## Data

The training data for this model extends that of **[KR-BERT-MEDIUM](https://huggingface.co/snunlp/KR-Medium)**: texts from Korean Wikipedia, general news articles, legal texts crawled from the National Law Information Center, and the [Korean Comments dataset](https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments). For transfer learning, we added **corporate-related economic news articles from 72 media sources**, such as the Financial Times and The Korea Economic Daily, and **analyst reports from 16 securities companies**, such as Kiwoom Securities and Samsung Securities. The dataset includes 440,067 news titles with their content and 11,237 analyst reports. **The total data size is about 13.22GB.** For masked language model (MLM) training, we split the data line by line, and **the total number of lines is 6,379,315.**

KR-FinBert was trained for 5.5M steps with a maximum sequence length of 512, a training batch size of 32, and a learning rate of 5e-5. Training took 67.48 hours using NVIDIA TITAN XP.

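
To put the training figures above in perspective, the following minimal sketch derives how many examples the run consumed and roughly how many passes over the corpus that represents. It assumes each optimizer step uses exactly one batch of 32 examples (no gradient accumulation) and that each line of the corpus is one training example. Neither assumption is stated in this card, and the variable names are illustrative only.

```python
# Figures quoted in the Data section.
TRAIN_STEPS = 5_500_000   # "5.5M steps"
BATCH_SIZE = 32           # training batch size
MAX_LEN = 512             # maximum sequence length
NUM_LINES = 6_379_315     # lines in the MLM training corpus

# Assumption: one step consumes exactly BATCH_SIZE examples.
examples_seen = TRAIN_STEPS * BATCH_SIZE

# Assumption: one corpus line corresponds to one training example.
approx_passes = examples_seen / NUM_LINES

# Upper bound only: real sequences are often shorter than MAX_LEN.
max_tokens_seen = examples_seen * MAX_LEN

print(f"examples seen:        {examples_seen:,}")
print(f"approx. corpus passes: {approx_passes:.2f}")
print(f"token upper bound:    {max_tokens_seen:,}")
```

If either assumption does not hold for the actual run, the number of passes changes accordingly.
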
## Downstream tasks

### Sentiment Classification model

Downstream task performance with 50,000 labeled examples.

|Model|Accuracy|
|-|-|
|KR-FinBert|0.963|
|KR-BERT-MEDIUM|0.958|
|KcBert-large|0.955|
|KcBert-base|0.953|
|KoBert|0.817|
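
One way to read the accuracy table is in terms of error rate rather than raw accuracy. The sketch below is a hypothetical helper, not part of the released code. It computes the gap in accuracy points between KR-FinBert and each baseline, and the fraction of each baseline's errors that KR-FinBert removes. The accuracies are copied from the table. The size of the evaluation split is not stated here, so no error counts are derived.

```python
# Accuracy values copied from the table above.
accuracy = {
    "KR-FinBert": 0.963,
    "KR-BERT-MEDIUM": 0.958,
    "KcBert-large": 0.955,
    "KcBert-base": 0.953,
    "KoBert": 0.817,
}


def relative_error_reduction(baseline_acc: float, model_acc: float) -> float:
    """Fraction of the baseline's errors that the model removes."""
    baseline_err = 1.0 - baseline_acc
    model_err = 1.0 - model_acc
    return (baseline_err - model_err) / baseline_err


best = accuracy["KR-FinBert"]
for name, acc in accuracy.items():
    if name == "KR-FinBert":
        continue
    gap_points = (best - acc) * 100  # difference in accuracy points
    reduction = relative_error_reduction(acc, best)
    print(f"{name}: +{gap_points:.1f} points, {reduction:.1%} fewer errors")
```
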
|
|
|
### Inference sample |
|
|
|
|Positive|Negative| |
|
|-|-| |
|
|ํ๋๋ฐ์ด์ค, 'ํด๋ฆฌํ์
' ์ฝ๋ก๋19 ์น๋ฃ ๊ฐ๋ฅ์ฑ์ 19% ๊ธ๋ฑ | ์ํ๊ดๆ ช '์ฝ๋ก๋ ๋นํ๊ธฐ' ์ธ์ ๋๋๋โฆ"CJ CGV ์ฌ 4000์ต ์์ค ๋ ์๋"ย | |
|
|์ด์ํํ, 3๋ถ๊ธฐย ์์
์ตย 176์ตโฆ์ ๋
ๆฏย 80%โ | C์ผํฌ์ย ๋ฉ์ถย ํ์๋นํโฆ๋ํํญ๊ณตย 1๋ถ๊ธฐย ์์
์ ์ย 566์ตย | |
|
|"GKL, 7๋
ย ๋ง์ย ๋ย ์๋ฆฟ์ย ๋งค์ถ์ฑ์ฅย ์์" | '1000์ต๋ย ํก๋ นยท๋ฐฐ์'ย ์ต์ ์ย ํ์ฅ ๊ตฌ์โฆย SK๋คํธ์์คย "๊ฒฝ์ ๊ณต๋ฐฑ ๋ฐฉ์ง ์ต์ "ย | |
|
|์์ง์
์คํ๋์ค, ์ฝํ
์ธ ํ์ฝ์ ์ฌ์ ์ฒซ ๋งค์ถ 1000์ต์ ๋ํ | ๋ถํ ๊ณต๊ธ ์ฐจ์ง์โฆ๊ธฐ์์ฐจย ๊ด์ฃผ๊ณต์ฅ ์ ๋ฉด ๊ฐ๋ ์ค๋จย | |
|
|์ผ์ฑ์ ์, 2๋
๋ง์ ์ธ๋ ์ค๋งํธํฐ ์์ฅ ์ ์ ์จ 1์ '์์ข ํํ' | ํ๋์ ์ฒ , ์ง๋ํดย ์์
์ตย 3,313์ต์ยทยทยท์ ๋
ๆฏย 67.7%ย ๊ฐ์ย | |
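
Headlines like those in the table above could be scored with the Hugging Face Transformers `pipeline` API, as in the minimal sketch below. It assumes the fine-tuned classifier is published as `snunlp/KR-FinBert-SC` (the repository named in the citation URL) and that `transformers` and a backend such as PyTorch are installed. The helper names `load_classifier`, `summarize`, and `main` are hypothetical. The label strings come from the model's own configuration and are not assumed here.

```python
from typing import Dict, List

# Repository name taken from the citation URL; assumed to host the fine-tuned classifier.
MODEL_ID = "snunlp/KR-FinBert-SC"


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline (downloads weights on first use)."""
    # Lazy import so the formatting helper below works without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def summarize(headlines: List[str], predictions: List[Dict]) -> List[str]:
    """Pair each headline with its predicted label and confidence score."""
    return [
        f"{pred['label']} ({pred['score']:.3f}): {text}"
        for text, pred in zip(headlines, predictions)
    ]


def main() -> None:
    headlines = [
        "현대바이오, '폴리탁셀' 코로나19 치료 가능성에 19% 급등",
        "부품 공급 차질에…기아차 광주공장 전면 가동 중단",
    ]
    classifier = load_classifier()
    # For a list of strings, the pipeline returns one {"label", "score"} dict per input.
    for line in summarize(headlines, classifier(headlines)):
        print(line)
```

Calling `main()` needs network access the first time, since the model weights are fetched from the Hub. `summarize` only formats the results and can be reused with any list of `{"label", "score"}` dictionaries.
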

### Citation

```
@misc{kr-FinBert-SC,
  author = {Kim, Eunhee and Hyopil Shin},
  title = {KR-FinBert: Fine-tuning KR-FinBert for Sentiment Analysis},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://huggingface.co/snunlp/KR-FinBert-SC}}
}
```