bart-base-japanese-news(base-sized model)

This repository provides a Japanese BART model. The model was trained by Stockmark Inc.

An introductory article on the model can be found at the following URL.

https://tech.stockmark.co.jp/blog/bart-japanese-base-news/

Model description

BART is a transformer encoder-decoder (seq2seq) model with a bidirectional (BERT-like) encoder and an autoregressive (GPT-like) decoder. BART is pre-trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text.

BART is particularly effective when fine-tuned for text generation (e.g. summarization, translation) but also works well for comprehension tasks (e.g. text classification, question answering).

Intended uses & limitations

You can use the raw model for text infilling. However, the model is mostly meant to be fine-tuned on a supervised dataset.

How to use the model

NOTE: Since we are using a custom tokenizer, please use trust_remote_code=True to initialize the tokenizer.

Simple use

from transformers import AutoTokenizer, BartModel

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartModel.from_pretrained(model_name)

inputs = tokenizer("今日は良い天気です。", return_tensors="pt")
outputs = model(**inputs)

last_hidden_states = outputs.last_hidden_state

Sentence Permutation

import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartForConditionalGeneration.from_pretrained(model_name)

if torch.cuda.is_available():
    model = model.to("cuda")

# correct order text is "明日は大雨です。電車は止まる可能性があります。ですから、自宅から働きます。"
text = "電車は止まる可能性があります。ですから、自宅から働きます。明日は大雨です。"

inputs = tokenizer([text], max_length=128, return_tensors="pt", truncation=True)
text_ids = model.generate(inputs["input_ids"].to(model.device), num_beams=3, max_length=128)
output = tokenizer.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output)
# sample output: 明日は大雨です。電車は止まる可能性があります。ですから、自宅から働きます。

Mask filling

import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartForConditionalGeneration.from_pretrained(model_name)

if torch.cuda.is_available():
    model = model.to("cuda")

text = "今日の天気は<mask>のため、傘が必要でしょう。"

inputs = tokenizer([text], max_length=128, return_tensors="pt", truncation=True)
text_ids = model.generate(inputs["input_ids"].to(model.device), num_beams=3, max_length=128)
output = tokenizer.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output)
# sample output: 今日の天気は、雨のため、傘が必要でしょう。

Text generation

NOTE: You can use the raw model for text generation. However, the model is mostly meant to be fine-tuned on a supervised dataset.

import torch
from transformers import AutoTokenizer, BartForConditionalGeneration

model_name = "stockmark/bart-base-japanese-news"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = BartForConditionalGeneration.from_pretrained(model_name)

if torch.cuda.is_available():
   model = model.to("cuda")

text = "自然言語処理（しぜんげんごしょり、略称：NLP）は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、人工知能と言語学の一分野である。「計算言語学」（computational linguistics）との類似もあるが、自然言語処理は工学的な視点からの言語処理をさすのに対して、計算言語学は言語学的視点を重視する手法をさす事が多い。"

inputs = tokenizer([text], max_length=512, return_tensors="pt", truncation=True)
text_ids = model.generate(inputs["input_ids"].to(model.device), num_beams=3, min_length=0, max_length=40)
output = tokenizer.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

print(output)
# sample output: 自然言語処理(しぜんげんごしょり、略称:NLP)は、人間が日常的に使っている自然言語をコンピュータに処理させる一連の技術であり、言語学の一分野である。

Training

The model was trained on Japanese News Articles.

Tokenization

The model uses a sentencepiece-based tokenizer. The vocabulary was first trained on a selected subset from the training data using the official sentencepiece training script.

Licenses

The pretrained models are distributed under the terms of the MIT License.

NOTE: Only tokenization_bart_japanese_news.py is Apache License, Version 2.0. Please see tokenization_bart_japanese_news.py for license details.

Contact

If you have any questions, please contact us using our contact form.

Acknowledgement

This comparison study supported with Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC).