language: ja
license: cc-by-sa-4.0
tags:
- finance
datasets:
- securities reports
- summaries of financial results
widget:
- text: 流動[MASK]は、1億円となりました。
Additional pretrained BERT base Japanese finance
This is a BERT model pretrained on texts in the Japanese language.
The code for the pretraining is available at retarfi/language-pretraining.
Model architecture
The model architecture is the same as BERT base in the original BERT paper: 12 layers, 768 dimensions of hidden states, and 12 attention heads.
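As a quick check, these dimensions can be read off the model configuration with the transformers library; the identifier below is a placeholder, since this card does not state the model's Hub name.
from transformers import AutoConfig

# Placeholder identifier; replace with the Hub name of this model.
config = AutoConfig.from_pretrained('path/to/this-model')
print(config.num_hidden_layers)    # 12 layers
print(config.hidden_size)          # 768-dimensional hidden states
print(config.num_attention_heads)  # 12 attention heads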
Training Data
The model is additionally pretrained on a financial corpus, starting from Tohoku University's BERT base Japanese model (cl-tohoku/bert-base-japanese).
The financial corpus consists of two corpora:
- Summaries of financial results from October 9, 2012, to December 31, 2020
- Securities reports from February 8, 2018, to December 31, 2020
The financial corpus file consists of approximately 27M sentences.
Tokenization
You can use the tokenizer from Tohoku University's BERT base Japanese model (cl-tohoku/bert-base-japanese):
import transformers

tokenizer = transformers.BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese')
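A minimal usage sketch follows, pairing this tokenizer with the additionally pretrained model for masked-token prediction on the widget example above. The model identifier is a placeholder, since this card does not spell out the Hub name, and the example assumes the fill-mask pipeline from the transformers library (the tokenizer also requires the fugashi MeCab wrapper to be installed).
import transformers

tokenizer = transformers.BertJapaneseTokenizer.from_pretrained('cl-tohoku/bert-base-japanese')
# Placeholder identifier; replace with the Hub name of this additionally pretrained model.
model = transformers.BertForMaskedLM.from_pretrained('path/to/this-model')

# Predict candidates for the [MASK] token in a financial sentence.
fill_mask = transformers.pipeline('fill-mask', model=model, tokenizer=tokenizer)
for candidate in fill_mask('流動[MASK]は、1億円となりました。'):
    print(candidate['token_str'], candidate['score'])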
Training
The model is trained with the same configuration as BERT base in the original BERT paper: 512 tokens per instance, 256 instances per batch, and 1M training steps.
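For reference, the hyperparameters above can be summarized as a plain configuration dictionary; the key names here are illustrative and are not taken from the retarfi/language-pretraining repository.
# Illustrative key names only; the values follow the configuration described above.
pretraining_config = {
    'max_seq_length': 512,         # tokens per instance
    'batch_size': 256,             # instances per batch
    'num_train_steps': 1_000_000,  # total training steps
}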
Citation
Another paper on this pretrained model is forthcoming. Please check here again before citing.
@inproceedings{suzuki2022additional-fin-bert,
title={事前学習と追加事前学習による金融言語モデルの構築と検証},
% title={Construction and Validation of a Pre-Training and Additional Pre-Training Financial Language Model},
author={鈴木 雅弘 and 坂地 泰紀 and 平野 正徳 and 和泉 潔},
% author={Masahiro Suzuki and Hiroki Sakaji and Masanori Hirano and Kiyoshi Izumi},
booktitle={人工知能学会第28回金融情報学研究会(SIG-FIN)},
% booktitle={Proceedings of JSAI Special Interest Group on Financial Informatics (SIG-FIN) 28},
pages={132--137},
year={2022}
}
Licenses
The pretrained models are distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
Acknowledgments
This work was supported by JSPS KAKENHI Grant Number JP21K12010 and JST-Mirai Program Grant Number JPMJMI20B1.