thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
license: llama2
datasets:
- mc4
- wikipedia
- EleutherAI/pile
- oscar-corpus/colossal-oscar-1.0
- cc100
language:
- ja
- en
inference: false
rinna/youri-7b
Overview
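We conduct continual pre-training of llama2-7b on about 40B tokens from a mixture of Japanese and English datasets. The continual pre-training substantially improves performance on Japanese tasks: in the benchmark table below, rinna/youri-7b reaches an 8-task score of 58.87, compared with 42.97 for meta/llama2-7b.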
The name youri comes from the Japanese word 妖狸/ようり/Youri, which is a kind of Japanese mythical creature (妖怪/ようかい/Youkai).
Library
The model was trained using code based on EleutherAI/gpt-neox.
Model architecture
A 32-layer, 4096-hidden-size transformer-based language model. Refer to the llama2 paper for architecture details.
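As a rough sanity check on the "7b" in the model name, the parameter count can be worked out from the shape given above. The sketch below assumes the standard llama2-7b values for the quantities this card does not state (vocabulary size 32000, feed-forward size 11008, untied input and output embeddings, no bias terms). Treat those as assumptions rather than values read from this checkpoint.

```python
# Back-of-the-envelope parameter count for a llama2-7b-shaped model.
NUM_LAYERS = 32        # from this card
HIDDEN_SIZE = 4096     # from this card
VOCAB_SIZE = 32000     # assumption: llama2 tokenizer vocabulary size
FFN_SIZE = 11008       # assumption: llama2-7b feed-forward width


def count_parameters(num_layers, hidden, vocab, ffn):
    """Count weights of a llama-style decoder (no biases, untied embeddings)."""
    attention = 4 * hidden * hidden      # q, k, v, and output projections
    mlp = 3 * hidden * ffn               # gate, up, and down projections
    norms = 2 * hidden                   # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = 2 * vocab * hidden      # input embedding + output head
    final_norm = hidden                  # RMSNorm before the output head
    return num_layers * per_layer + embeddings + final_norm


total = count_parameters(NUM_LAYERS, HIDDEN_SIZE, VOCAB_SIZE, FFN_SIZE)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the count comes to about 6.74B, which is consistent with the "7b" label.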
Continual pre-training
The model was initialized with the llama2-7b model and continually trained on around 40B tokens from a mixture of the following corpora:
- Japanese CC-100
- Japanese C4
- Japanese OSCAR
- The Pile
- Wikipedia
- rinna curated Japanese dataset
Authors
- Tianyu Zhao
- Akio Kaga
- Kei Sawada
Benchmarking
In our evaluation runs, rinna's youri-7b series outperforms other open-source Japanese LLMs on Japanese tasks.
Model | Model type | 4-task score | 6-task score | 8-task score |
---|---|---|---|---|
rinna/youri-7b-instruction | SFT | 83.88 | 80.93 | 63.63 |
rinna/youri-7b-chat | SFT | 78.29 | 78.47 | 62.18 |
matsuo-lab/weblab-10b-instruction-sft | SFT | 78.75 | 75.05 | 59.11 |
rinna/youri-7b | pre-trained | 73.32 | 74.58 | 58.87 |
stabilityai/japanese-stablelm-instruct-alpha-7b | SFT | 70.10 | 71.32 | 54.71 |
elyza/ELYZA-japanese-Llama-2-7b | pre-trained | 71.72 | 69.28 | 53.17 |
elyza/ELYZA-japanese-Llama-2-7b-instruct | SFT | 70.57 | 68.12 | 53.14 |
stabilityai/japanese-stablelm-base-alpha-7b | pre-trained | 61.03 | 65.83 | 51.05 |
matsuo-lab/weblab-10b | pre-trained | 66.33 | 65.58 | 50.74 |
meta/llama2-7b | pre-trained | 56.33 | 54.80 | 42.97 |
rinna/japanese-gpt-neox-3.6b | pre-trained | 47.20 | 54.68 | 41.80 |
rinna/bilingual-gpt-neox-4b | pre-trained | 46.60 | 52.04 | 40.03 |
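To make the effect of continual pre-training concrete, the gain of rinna/youri-7b over its base model meta/llama2-7b can be read off the table. A minimal sketch using only the two rows copied from above (the dictionary layout and variable names are illustrative):

```python
# Scores copied from the benchmark table above.
scores = {
    "rinna/youri-7b": {"4-task": 73.32, "6-task": 74.58, "8-task": 58.87},
    "meta/llama2-7b": {"4-task": 56.33, "6-task": 54.80, "8-task": 42.97},
}

base = scores["meta/llama2-7b"]
tuned = scores["rinna/youri-7b"]

# Absolute improvement in points for each task suite.
gains = {suite: round(tuned[suite] - base[suite], 2) for suite in base}

for suite, gain in gains.items():
    print(f"{suite}: +{gain:.2f} points")
```

The same pattern extends to any other pair of rows, for example comparing an SFT model with its pre-trained counterpart.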
How to use the model
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the model weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("rinna/youri-7b")
model = AutoModelForCausalLM.from_pretrained("rinna/youri-7b")

# Move the model to the GPU when one is available.
if torch.cuda.is_available():
    model = model.to("cuda")

# Japanese prompt for the model to continue; no special tokens are added.
text = "西田幾多郎は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

# Sample exactly 200 new tokens (min_new_tokens equals max_new_tokens).
with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=200,
        min_new_tokens=200,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

# Decode the first (and only) generated sequence back to text.
output = tokenizer.decode(output_ids.tolist()[0])
print(output)
"""
西田幾多郎は、プラトンの復権を主張し、対する従来の西洋哲学は、近代の合理主義哲学に委ね、「従来の哲学は破 壊されてしまった」と述べている。 西田幾多郎は、西洋近代哲学の「徹底的な検討」を拒んだ。それは、「現代的理解の脆弱性を補う筈の、従来のヨーロッパに伝わる哲学的な方法では到底それができなかったからである」とい
"""
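The `do_sample`, `temperature`, and `top_p` arguments above select nucleus (top-p) sampling. The sketch below illustrates the idea on a toy distribution in plain Python. It is not the transformers implementation, and the function name `top_p_filter` and the toy logits are hypothetical.

```python
import math
import random


def top_p_filter(logits, temperature=1.0, top_p=0.95):
    """Return {token_index: probability} restricted to the top-p nucleus."""
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [4.0, 3.0, 1.0, -2.0]  # hypothetical scores for a 4-token vocabulary
nucleus = top_p_filter(toy_logits, temperature=1.0, top_p=0.95)

# Draw the next token from the renormalized nucleus.
next_token = random.choices(list(nucleus), weights=list(nucleus.values()))[0]
print(sorted(nucleus), next_token)
```

With `temperature=1.0` the logits are left unscaled, so the only filtering comes from `top_p`: low-probability tokens in the tail are never sampled.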
Tokenization
The model uses the original llama-2 tokenizer.
How to cite
@misc{RinnaYouri7b,
url={https://huggingface.co/rinna/youri-7b},
title={rinna/youri-7b},
author={Zhao, Tianyu and Kaga, Akio and Sawada, Kei}
}
Citations
@software{gpt-neox-library,
title = {{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},
author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
url = {https://www.github.com/eleutherai/gpt-neox},
doi = {10.5281/zenodo.5879544},
month = {8},
year = {2021},
version = {0.0.1},
}