|
---
license: apache-2.0
language:
- ja
---
|
|
|
|
|
|
|
|
|
|
|
|
|
# Model Card for japanese-spoken-language-bert |
|
|
|
The Japanese README is available [here](./README_JA.md).
|
|
|
<!-- Provide a quick summary of what the model is/does. [Optional] --> |
|
These BERT models were pre-trained on written Japanese (Wikipedia) and then fine-tuned on spoken Japanese.

We used CSJ and the Japanese National Diet records.

CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/).

We provide only the model parameters; to use these models, you also need to download the corresponding config and vocab files (see [How to Get Started with the Model](#how-to-get-started-with-the-model)).
|
|
|
We provide the following three models:

- **1-6 layer-wise** (folder name: models/1-6_layer-wise)

  Only the 1st-6th encoder layers were fine-tuned on CSJ.

- **TAPT512 60K** (folder name: models/tapt512_60k)

  Fine-tuned on CSJ.

- **DAPT128-TAPT512** (folder name: models/dapt128-tap512)

  Fine-tuned on the National Diet records and CSJ.
|
|
|
# Table of Contents |
|
|
|
- [Model Card for japanese-spoken-language-bert](#model-card-for-japanese-spoken-language-bert)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
  - [Testing Data, Factors & Metrics](#testing-data-factors--metrics)
    - [Testing Data](#testing-data)
    - [Factors](#factors)
    - [Metrics](#metrics)
  - [Results](#results)
- [Citation](#citation)
- [More Information](#more-information)
- [Model Card Authors](#model-card-authors)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
|
|
|
|
|
# Model Details |
|
|
|
## Model Description |
|
|
|
<!-- Provide a longer summary of what this model is/does. --> |
|
These BERT models were pre-trained on written Japanese (Wikipedia) and then fine-tuned on spoken Japanese.

We used CSJ and the Japanese National Diet records.

CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/).

We provide only the model parameters; to use these models, you also need to download the corresponding config and vocab files (see [How to Get Started with the Model](#how-to-get-started-with-the-model)).
|
|
|
We provide the following three models:

- 1-6 layer-wise (folder name: models/1-6_layer-wise)

  Only the 1st-6th encoder layers were fine-tuned on CSJ.

- TAPT512 60K (folder name: models/tapt512_60k)

  Fine-tuned on CSJ.

- DAPT128-TAPT512 (folder name: models/dapt128-tap512)

  Fine-tuned on the National Diet records and CSJ.
|
|
|
**Model Information** |
|
- **Model type:** Language model |
|
- **Language(s) (NLP):** ja |
|
- **License:** Copyright (c) 2021 National Institute for Japanese Language and Linguistics and Retrieva, Inc. Licensed under the Apache License, Version 2.0 (the “License”) |
|
|
|
|
|
# Training Details |
|
|
|
## Training Data |
|
|
|
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. --> |
|
|
|
- 1-6 layer-wise: CSJ |
|
- TAPT512 60K: CSJ |
|
- DAPT128-TAPT512: The Japanese National Diet records and CSJ
|
|
|
|
|
## Training Procedure |
|
|
|
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. --> |
|
|
|
We continued training the pre-trained Japanese BERT model ([cl-tohoku/bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking); referred to below as "written BERT") on the spoken-language corpora.
|
|
|
For details, see the [Japanese blog post](https://tech.retrieva.jp/entry/2021/04/01/114943) or the [Japanese paper](https://www.anlp.jp/proceedings/annual_meeting/2021/pdf_dir/P4-17.pdf).
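
The training code itself is not included in this repository. As an illustration only, a minimal sketch of this kind of continued masked-language-model training with the Hugging Face `transformers` Trainer might look like the following; the corpus file name, batch size, and other hyperparameters are placeholders, not the settings actually used for the released models.

```python
# Illustrative sketch of continued (task-adaptive) MLM training starting from the
# written-Japanese BERT. All paths and hyperparameters below are placeholders.
# Requires: transformers, datasets, torch, fugashi, ipadic.
from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertJapaneseTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "cl-tohoku/bert-base-japanese-whole-word-masking"  # the "written BERT"
tokenizer = BertJapaneseTokenizer.from_pretrained(base)
model = BertForMaskedLM.from_pretrained(base)

# One sentence per line; placeholder file standing in for the spoken-Japanese text.
raw = load_dataset("text", data_files={"train": "spoken_corpus.txt"})
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard BERT-style masked language modeling objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="bert-spoken-ja",
    per_device_train_batch_size=8,  # placeholder
    max_steps=60_000,               # placeholder
    save_steps=10_000,
)
Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```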
|
|
|
# Evaluation |
|
|
|
<!-- This section describes the evaluation protocols and provides the results. --> |
|
|
|
## Testing Data, Factors & Metrics |
|
|
|
### Testing Data |
|
|
|
<!-- This should link to a Data Card if possible. --> |
|
|
|
We used CSJ for the evaluation.
|
|
|
|
|
### Factors |
|
|
|
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. --> |
|
|
|
We evaluated the following tasks on CSJ:
|
- Dependency Parsing |
|
- Sentence Boundary |
|
- Important Sentence Extraction |
|
|
|
### Metrics |
|
|
|
<!-- These are the evaluation metrics being used, ideally with a description of why. --> |
|
|
|
- Dependency Parsing: Undirected Unlabeled Attachment Score (UUAS; a short illustrative sketch follows this list)
|
- Sentence Boundary: F1 Score |
|
- Important Sentence Extraction: F1 Score |
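
For reference, UUAS treats a dependency tree as a set of undirected, unlabeled edges and measures the fraction of gold edges that the predicted tree recovers. Below is a minimal sketch; the `(head, dependent)` pair representation is an assumption made for illustration.

```python
# Illustrative sketch: Undirected Unlabeled Attachment Score (UUAS).
# Each tree is given as (head, dependent) index pairs; direction and labels are
# ignored, so every edge is normalized to an unordered pair before comparison.
def uuas(gold_edges, pred_edges):
    gold = {frozenset(edge) for edge in gold_edges}
    pred = {frozenset(edge) for edge in pred_edges}
    return len(gold & pred) / len(gold) if gold else 0.0

# Example: 3 of the 4 gold edges are recovered (direction ignored) -> 0.75
print(uuas([(0, 1), (1, 2), (2, 3), (3, 4)],
           [(1, 0), (1, 2), (2, 3), (4, 0)]))
```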
|
|
|
## Results |
|
|
|
Scores are UUAS for Dependency Parsing and F1 for the other two tasks.

| | Dependency Parsing | Sentence Boundary | Important Sentence Extraction |
| :--- | ---: | ---: | ---: |
| written BERT | 39.4 | 61.6 | 36.8 |
| 1-6 layer-wise | 44.6 | 64.8 | 35.4 |
| TAPT512 60K | - | - | 40.2 |
| DAPT128-TAPT512 | 42.9 | 64.0 | 39.7 |
|
|
|
|
|
# Citation |
|
|
|
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. --> |
|
|
|
**BibTeX:** |
|
|
|
```bibtex
@inproceedings{csjbert2021,
  title = {CSJを用いた日本語話し言葉BERTの作成},
  author = {勝又智 and 坂田大直},
  booktitle = {言語処理学会第27回年次大会},
  year = {2021},
}
```
|
|
|
|
|
# More Information |
|
|
|
https://tech.retrieva.jp/entry/2021/04/01/114943 (In Japanese) |
|
|
|
# Model Card Authors |
|
|
|
<!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. --> |
|
|
|
Satoru Katsumata |
|
|
|
# Model Card Contact |
|
|
|
pr@retrieva.jp |
|
|
|
# How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
<details> |
|
<summary> Click to expand </summary> |
|
|
|
1. Run download_wikipedia_bert.py to download the BERT model that was trained on Wikipedia.
|
|
|
```bash
python download_wikipedia_bert.py
```
|
|
|
This script downloads the config files and the vocab file provided by the Inui Laboratory of Tohoku University from the Hugging Face Model Hub.
|
https://github.com/cl-tohoku/bert-japanese |
|
|
|
2. Run sample_mlm.py to confirm you can use our models. |
|
|
|
```bash
python sample_mlm.py
```
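
If you want to load one of the models directly in your own code instead of running sample_mlm.py, a minimal sketch is shown below. It assumes that the config and vocab files downloaded in step 1 and the provided parameter file sit together in models/1-6_layer-wise, and that the parameters are stored in the standard pytorch_model.bin format; the exact directory layout and file names are assumptions, so adjust them to this repository's actual layout.

```python
# Illustrative sketch: masked-token prediction with one of the spoken-Japanese BERT
# checkpoints. Assumes config.json, vocab.txt, and pytorch_model.bin are all in
# models/1-6_layer-wise (adjust to the actual layout). Requires fugashi and ipadic.
import torch
from transformers import BertForMaskedLM, BertJapaneseTokenizer

model_dir = "models/1-6_layer-wise"
tokenizer = BertJapaneseTokenizer.from_pretrained(model_dir)
model = BertForMaskedLM.from_pretrained(model_dir)
model.eval()

# Predict the [MASK] token in a short spoken-style sentence.
text = "今日は[MASK]について話します。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```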
|
|
|
</details> |