|
---
license: apache-2.0
language:
- ja
---
|
|
|
|
|
|
|
|
|
|
|
|
|
# Model Card for japanese-spoken-language-bert |
|
|
|
The Japanese README is available [here](./README_JA.md).
|
|
|
<!-- Provide a quick summary of what the model is/does. [Optional] --> |
|
These BERT models were pre-trained on written Japanese (Wikipedia) and then fine-tuned on spoken Japanese.

We used CSJ and the Japanese National Diet records.

CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/).

We provide only the model parameters; to use these models, you also need to download the corresponding config and vocab files (see [How to Get Started with the Model](#how-to-get-started-with-the-model)).
|
|
|
We provide the following three models:

- **1-6 layer-wise** (folder name: models/1-6_layer-wise)

  Only the 1st-6th encoder layers were fine-tuned on CSJ.

- **TAPT512 60K** (folder name: models/tapt512_60k)

  Fine-tuned on CSJ.

- **DAPT128-TAPT512** (folder name: models/dapt128-tap512)

  Fine-tuned on the National Diet records and CSJ.
|
|
|
# Table of Contents |
|
|
|
- [Model Card for japanese-spoken-language-bert](#model-card-for-japanese-spoken-language-bert)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
  - [Model Description](#model-description)
- [Training Details](#training-details)
  - [Training Data](#training-data)
  - [Training Procedure](#training-procedure)
- [Evaluation](#evaluation)
  - [Testing Data, Factors & Metrics](#testing-data-factors--metrics)
    - [Testing Data](#testing-data)
    - [Factors](#factors)
    - [Metrics](#metrics)
  - [Results](#results)
- [Citation](#citation)
- [More Information](#more-information)
- [Model Card Authors](#model-card-authors)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
|
|
|
|
|
# Model Details |
|
|
|
## Model Description |
|
|
|
<!-- Provide a longer summary of what this model is/does. --> |
|
These BERT models were pre-trained on written Japanese (Wikipedia) and then fine-tuned on spoken Japanese.

We used CSJ and the Japanese National Diet records.

CSJ (Corpus of Spontaneous Japanese) is provided by NINJAL (https://www.ninjal.ac.jp/).

We provide only the model parameters; to use these models, you also need to download the corresponding config and vocab files (see [How to Get Started with the Model](#how-to-get-started-with-the-model)).
|
|
|
We provide the following three models:

- 1-6 layer-wise (folder name: models/1-6_layer-wise)

  Only the 1st-6th encoder layers were fine-tuned on CSJ.

- TAPT512 60K (folder name: models/tapt512_60k)

  Fine-tuned on CSJ.

- DAPT128-TAPT512 (folder name: models/dapt128-tap512)

  Fine-tuned on the National Diet records and CSJ.
|
|
|
**Model Information** |
|
- **Model type:** Language model |
|
- **Language(s) (NLP):** ja |
|
- **License:** Copyright (c) 2021 National Institute for Japanese Language and Linguistics and Retrieva, Inc. Licensed under the Apache License, Version 2.0 (the “License”) |
|
|
|
|
|
# Training Details |
|
|
|
## Training Data |
|
|
|
<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. --> |
|
|
|
- 1-6 layer-wise: CSJ |
|
- TAPT512 60K: CSJ |
|
- DAPT128-TAPT512: The Japanese National Diet records and CSJ
|
|
|
|
|
## Training Procedure |
|
|
|
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. --> |
|
|
|
We continued training the pre-trained Japanese BERT model ([cl-tohoku/bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking); referred to below as "written BERT") on the spoken-language corpora.
|
|
|
For details, see the [Japanese blog post](https://tech.retrieva.jp/entry/2021/04/01/114943) or the [Japanese paper](https://www.anlp.jp/proceedings/annual_meeting/2021/pdf_dir/P4-17.pdf).
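
The training code itself is not included in this repository. As an illustration only, a minimal sketch of this kind of continued masked-language-model training with the Hugging Face `transformers` Trainer might look like the following; the corpus file name, batch size, and other hyperparameters are placeholders, not the settings actually used for the released models.

```python
# Illustrative sketch of continued (task-adaptive) MLM training starting from the
# written-Japanese BERT. All paths and hyperparameters below are placeholders.
# Requires: transformers, datasets, torch, fugashi, ipadic.
from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertJapaneseTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "cl-tohoku/bert-base-japanese-whole-word-masking"  # the "written BERT"
tokenizer = BertJapaneseTokenizer.from_pretrained(base)
model = BertForMaskedLM.from_pretrained(base)

# One sentence per line; placeholder file standing in for the spoken-Japanese text.
raw = load_dataset("text", data_files={"train": "spoken_corpus.txt"})
tokenized = raw["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard BERT-style masked language modeling objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="bert-spoken-ja",
    per_device_train_batch_size=8,  # placeholder
    max_steps=60_000,               # placeholder
    save_steps=10_000,
)
Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```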
|
|
|
# Evaluation |
|
|
|
<!-- This section describes the evaluation protocols and provides the results. --> |
|
|
|
## Testing Data, Factors & Metrics |
|
|
|
### Testing Data |
|
|
|
<!-- This should link to a Data Card if possible. --> |
|
|
|
We used CSJ for the evaluation.
|
|
|
|
|
### Factors |
|
|
|
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. --> |
|
|
|
We evaluated the following tasks on CSJ:
|
- Dependency Parsing |
|
- Sentence Boundary |
|
- Important Sentence Extraction |
|
|
|
### Metrics |
|
|
|
<!-- These are the evaluation metrics being used, ideally with a description of why. --> |
|
|
|
- Dependency Parsing: Undirected Unlabeled Attachment Score (UUAS; a short illustrative sketch follows this list)
|
- Sentence Boundary: F1 Score |
|
- Important Sentence Extraction: F1 Score |
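
For reference, UUAS treats a dependency tree as a set of undirected, unlabeled edges and measures the fraction of gold edges that the predicted tree recovers. Below is a minimal sketch; the `(head, dependent)` pair representation is an assumption made for illustration.

```python
# Illustrative sketch: Undirected Unlabeled Attachment Score (UUAS).
# Each tree is given as (head, dependent) index pairs; direction and labels are
# ignored, so every edge is normalized to an unordered pair before comparison.
def uuas(gold_edges, pred_edges):
    gold = {frozenset(edge) for edge in gold_edges}
    pred = {frozenset(edge) for edge in pred_edges}
    return len(gold & pred) / len(gold) if gold else 0.0

# Example: 3 of the 4 gold edges are recovered (direction ignored) -> 0.75
print(uuas([(0, 1), (1, 2), (2, 3), (3, 4)],
           [(1, 0), (1, 2), (2, 3), (4, 0)]))
```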
|
|
|
## Results |
|
|
|
Scores are UUAS for Dependency Parsing and F1 for the other two tasks.

| | Dependency Parsing | Sentence Boundary | Important Sentence Extraction |
| :--- | ---: | ---: | ---: |
| written BERT | 39.4 | 61.6 | 36.8 |
| 1-6 layer-wise | 44.6 | 64.8 | 35.4 |
| TAPT512 60K | - | - | 40.2 |
| DAPT128-TAPT512 | 42.9 | 64.0 | 39.7 |
|
|
|
|
|
# Citation |
|
|
|
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. --> |
|
|
|
**BibTeX:** |
|
|
|
```bibtex
@inproceedings{csjbert2021,
  title = {CSJを用いた日本語話し言葉BERTの作成},
  author = {勝又智 and 坂田大直},
  booktitle = {言語処理学会第27回年次大会},
  year = {2021},
}
```
|
|
|
|
|
# More Information |
|
|
|
https://tech.retrieva.jp/entry/2021/04/01/114943 (In Japanese) |
|
|
|
# Model Card Authors |
|
|
|
<!-- This section provides another layer of transparency and accountability. Whose views is this model card representing? How many voices were included in its construction? Etc. --> |
|
|
|
Satoru Katsumata |
|
|
|
# Model Card Contact |
|
|
|
pr@retrieva.jp |
|
|
|
# How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
<details> |
|
<summary> Click to expand </summary> |
|
|
|
1. Run download_wikipedia_bert.py to download the BERT model that was trained on Wikipedia.
|
|
|
```bash
python download_wikipedia_bert.py
```
|
|
|
This script downloads the config files and the vocab file provided by the Inui Laboratory of Tohoku University from the Hugging Face Model Hub.
|
https://github.com/cl-tohoku/bert-japanese |
|
|
|
2. Run sample_mlm.py to confirm you can use our models. |
|
|
|
```bash
python sample_mlm.py
```
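
If you want to load one of the models directly in your own code instead of running sample_mlm.py, a minimal sketch is shown below. It assumes that the config and vocab files downloaded in step 1 and the provided parameter file sit together in models/1-6_layer-wise, and that the parameters are stored in the standard pytorch_model.bin format; the exact directory layout and file names are assumptions, so adjust them to this repository's actual layout.

```python
# Illustrative sketch: masked-token prediction with one of the spoken-Japanese BERT
# checkpoints. Assumes config.json, vocab.txt, and pytorch_model.bin are all in
# models/1-6_layer-wise (adjust to the actual layout). Requires fugashi and ipadic.
import torch
from transformers import BertForMaskedLM, BertJapaneseTokenizer

model_dir = "models/1-6_layer-wise"
tokenizer = BertJapaneseTokenizer.from_pretrained(model_dir)
model = BertForMaskedLM.from_pretrained(model_dir)
model.eval()

# Predict the [MASK] token in a short spoken-style sentence.
text = "今日は[MASK]について話します。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_ids = logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(predicted_ids))
```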
|
|
|
</details> |