language: ja
license: mit
datasets:
- mC4-ja
electra-base-japanese-discriminator (sudachitra-wordpiece, mC4 Japanese) - SHINOBU
This is an ELECTRA model pretrained on approximately 200M Japanese sentences.
The input text is tokenized by SudachiTra with the WordPiece subword tokenizer.
See tokenizer_config.json
for the setting details.
How to use
Please install SudachiTra
in advance.
$ pip install -U torch transformers sudachitra
You can load the model and the tokenizer via AutoModel and AutoTokenizer, respectively.
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained("megagonlabs/electra-base-japanese-discriminator")
tokenizer = AutoTokenizer.from_pretrained("megagonlabs/electra-base-japanese-discriminator", trust_remote_code=True)
model(**tokenizer("まさにオールマイティーな商品だ。", return_tensors="pt")).last_hidden_state
tensor([[[-0.0498, -0.0285, 0.1042, ..., 0.0062, -0.1253, 0.0338],
[-0.0686, 0.0071, 0.0087, ..., -0.0210, -0.1042, -0.0320],
[-0.0636, 0.1465, 0.0263, ..., 0.0309, -0.1841, 0.0182],
...,
[-0.1500, -0.0368, -0.0816, ..., -0.0303, -0.1653, 0.0650],
[-0.0457, 0.0770, -0.0183, ..., -0.0108, -0.1903, 0.0694],
[-0.0981, -0.0387, 0.1009, ..., -0.0150, -0.0702, 0.0455]]],
grad_fn=<NativeLayerNormBackward>)
Model architecture
The model architecture is the same as the original ELECTRA base model; 12 layers, 768 dimensions of hidden states, and 12 attention heads.
Training data and libraries
This model is trained on the Japanese texts extracted from the mC4 Common Crawl's multilingual web crawl corpus. We used the Sudachi to split texts into sentences, and also applied a simple rule-based filter to remove nonlinguistic segments of mC4 multilingual corpus. The extracted texts contains over 600M sentences in total, and we used approximately 200M sentences for pretraining.
We used NVIDIA's TensorFlow2-based ELECTRA implementation for pretraining. The time required for the pretrainig was about 110 hours using GCP DGX A100 8gpu instance with enabling Automatic Mixed Precision.
Licenses
The pretrained models are distributed under the terms of the MIT License.
Citations
- mC4
Contains information from mC4
which is made available under the ODC Attribution License.
@article{2019t5,
author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
journal = {arXiv e-prints},
year = {2019},
archivePrefix = {arXiv},
eprint = {1910.10683},
}