metadata

license: cc-by-sa-4.0
language:
  - en
  - ja
programming_language:
  - C
  - C++
  - C#
  - Go
  - Java
  - JavaScript
  - Lua
  - PHP
  - Python
  - Ruby
  - Rust
  - Scala
  - TypeScript
library_name: transformers
tags:
  - deberta
  - deberta-v3
  - fill-mask
datasets:
  - wikipedia
  - EleutherAI/pile
  - bigcode/the-stack
  - mc4
metrics:
  - accuracy
mask_token: '[MASK]'
widget:
  - text: 京都大学で自然言語処理を[MASK]する。

Model Card for Japanese DeBERTa V3 base

Model description

This is a Japanese DeBERTa V3 base model pre-trained on the LLM-jp corpus v1.0.

How to use

You can use this model for masked language modeling as follows:

from transformers import AutoTokenizer, AutoModelForMaskedLM
# Load the tokenizer and masked-LM weights for this model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('ku-nlp/deberta-v3-base-japanese')
model = AutoModelForMaskedLM.from_pretrained('ku-nlp/deberta-v3-base-japanese')

sentences = [
    "京都大学で自然言語処理を[MASK]する。",
    "I [MASK] NLP at Kyoto University.",
    'int main() { printf("Hello, [MASK]!"); return 0; }',
]
# padding=True lets sentences of different lengths be batched into one tensor
encodings = tokenizer(sentences, return_tensors='pt', padding=True)
...
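
The snippet above stops after tokenization. As a minimal sketch of one way to continue, assuming PyTorch is installed, the helpers below run the model and return the highest-scoring tokens at each [MASK] position. The names `top_k_indices` and `fill_masks` are hypothetical and not part of transformers; the authors' elided code may differ.

```python
def top_k_indices(scores, k):
    """Return the indices of the k largest values in `scores`, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def fill_masks(model_name, sentences, k=5):
    """Return, for every [MASK] in `sentences`, the k most likely tokens.

    Hypothetical helper: downloads `model_name` from the Hugging Face Hub.
    """
    # Imported here so that top_k_indices stays usable without these packages.
    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()

    # Pad so that sentences of different lengths fit in one batch.
    encodings = tokenizer(sentences, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**encodings).logits  # shape: (batch, seq_len, vocab)

    predictions = []
    for input_ids, sentence_logits in zip(encodings["input_ids"], logits):
        # Positions in this sentence that hold the mask token.
        is_mask = input_ids == tokenizer.mask_token_id
        for pos in is_mask.nonzero(as_tuple=True)[0].tolist():
            best_ids = top_k_indices(sentence_logits[pos].tolist(), k)
            predictions.append(tokenizer.convert_ids_to_tokens(best_ids))
    return predictions
```

Calling `fill_masks` with the same model identifier and `sentences` list used above would return one list of candidate tokens per [MASK].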

You can also fine-tune this model on downstream tasks.

Tokenization

This model's tokenizer is based on the Unigram byte-fallback model from huggingface/tokenizers. The vocabulary entries were converted from llm-jp-tokenizer v2.2 (100k). For details on how the vocabulary was constructed, see the README.md of llm-jp/llm-ja-tokenizer.

Note that unlike ku-nlp/deberta-v2-base-japanese, pre-segmentation by a morphological analyzer (e.g., Juman++) is no longer required for this model.
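
To illustrate what byte fallback means in the tokenizer described above, here is a toy sketch. It is not the actual Unigram algorithm: it works character by character with a hypothetical two-entry vocabulary, and it assumes the SentencePiece-style `<0xXX>` naming for byte pieces. It only shows that text outside the vocabulary can still be represented as UTF-8 bytes instead of an unknown token.

```python
def byte_fallback_pieces(text, vocab):
    """Keep characters found in `vocab`; split all others into UTF-8 byte pieces."""
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            # One piece per UTF-8 byte, written as <0xXX> (assumed naming).
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


# Hypothetical toy vocabulary; the real one has about 100k entries.
toy_vocab = {"京", "都"}
pieces = byte_fallback_pieces("京都大", toy_vocab)
print(pieces)  # ['京', '都', '<0xE5>', '<0xA4>', '<0xA7>']
```

Here "大" is missing from the toy vocabulary, so it is emitted as its three UTF-8 bytes.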

Training data

We used the LLM-jp corpus v1.0.1 for pre-training. The corpus consists of the following corpora:

- Japanese
  - Wikipedia (1B tokens)
  - mC4 (129B tokens)
- English
  - Wikipedia (4B tokens)
  - The Pile (126B tokens)
- Code
  - The Stack (10B tokens)

We shuffled the corpora, which contain 270B tokens in total, and trained the model for 2 epochs. The total number of tokens fed to the model was therefore 540B.
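
The totals above can be reproduced from the per-corpus figures. The short sketch below uses only the numbers listed in this section; the variable names are illustrative.

```python
# Token counts in billions, as listed in this section.
corpus_tokens_b = {
    "Japanese": {"Wikipedia": 1, "mC4": 129},
    "English": {"Wikipedia": 4, "The Pile": 126},
    "Code": {"The Stack": 10},
}
EPOCHS = 2

# Sum the sub-corpora per group, then across groups.
per_group_b = {name: sum(parts.values()) for name, parts in corpus_tokens_b.items()}
total_b = sum(per_group_b.values())
tokens_seen_b = total_b * EPOCHS

for name, count in per_group_b.items():
    print(f"{name}: {count}B tokens ({count / total_b:.1%} of the corpus)")
print(f"Total: {total_b}B tokens; over {EPOCHS} epochs: {tokens_seen_b}B tokens")
```

By these figures, Japanese and English contribute 130B tokens each and code contributes 10B.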

Training procedure

We slightly modified the official implementation of DeBERTa V3 and followed the official training procedure. The modified code is available at nobu-g/DeBERTa.

The following hyperparameters were used during pre-training:

learning_rate: 1e-4
per_device_train_batch_size: 800
num_devices: 8
gradient_accumulation_steps: 3
total_train_batch_size: 2400
max_seq_length: 512
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-06
lr_scheduler_type: linear schedule with warmup
training_steps: 475,000
warmup_steps: 10,000
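
As a rough sketch of what these settings imply, the code below evaluates a linear schedule with warmup at a few steps and computes the largest number of token positions the run could have processed. The schedule shape (linear increase from zero, then linear decay to zero at the final step) is the common definition and is an assumption here; the authors' training code may differ in detail. The function name `linear_warmup_lr` is hypothetical.

```python
BASE_LR = 1e-4
WARMUP_STEPS = 10_000
TRAINING_STEPS = 475_000
TOTAL_TRAIN_BATCH_SIZE = 2400
MAX_SEQ_LENGTH = 512


def linear_warmup_lr(step, base_lr=BASE_LR, warmup_steps=WARMUP_STEPS,
                     total_steps=TRAINING_STEPS):
    """Learning rate at `step`: linear warmup, then linear decay to zero (assumed)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 5_000, 10_000, 242_500, 475_000):
    print(f"step {step:>7}: lr = {linear_warmup_lr(step):.2e}")

# Upper bound on token positions: every sequence padded or packed to 512 tokens.
max_token_positions = TOTAL_TRAIN_BATCH_SIZE * MAX_SEQ_LENGTH * TRAINING_STEPS
print(f"at most {max_token_positions / 1e9:.2f}B token positions")
```

The bound evaluates to 583.68B token positions, the same order as the 540B tokens reported in the training-data section; sequences shorter than max_seq_length would account for a lower actual count.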

Acknowledgments

This work was supported by the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN) through General Collaboration Project no. jh221004, "Developing a Platform for Constructing and Sharing of Large-Scale Japanese Language Models". We trained the models on mdx: a platform for the data-driven future.