File size: 7,581 Bytes
254bcbe |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 |
---
language:
- ja
tags:
- japanese-stablelm
- causal-lm
pipeline_tag: text-generation
license: apache-2.0
---
# Japanese-StableLM-Instruct-Alpha-7B-v2
![japanese-stablelm-icon](https://huggingface.co/stabilityai/japanese-stablelm-base-alpha-7b/resolve/main/japanese-stablelm-parrot.jpg)
> "A parrot able to speak Japanese, ukiyoe, edo period" โ [Stable Diffusion XL](https://clipdrop.co/stable-diffusion)
## Model Description
`japanese-stablelm-instruct-alpha-7b-v2` is a 7B parameter decoder-only language models pre-trained built on top of the [`Japanese-StableLM-Base-Alpha-7B`](https://huggingface.co/stabilityai/japanese-stablelm-base-alpha-7b) model and further fine-tuned on various instruction-following datasets.
## Usage
First install additional dependencies in [requirements.txt](https://huggingface.co/stabilityai/japanese-stablelm-base-alpha-7b/blob/main/requirements.txt):
```sh
pip install sentencepiece einops
```
Then start generating text with `japanese-stablelm-instruct-alpha-7b-v2` by using the following code snippet:
```python
import torch
from transformers import LlamaTokenizer, AutoModelForCausalLM
tokenizer = LlamaTokenizer.from_pretrained(
"novelai/nerdstash-tokenizer-v1", additional_special_tokens=["โโ"]
)
model = AutoModelForCausalLM.from_pretrained(
"stabilityai/japanese-stablelm-instruct-alpha-7b-v2",
trust_remote_code=True,
torch_dtype=torch.float16,
variant="fp16",
)
model.eval()
if torch.cuda.is_available():
model = model.to("cuda")
def build_prompt(user_query, inputs="", sep="\n\n### "):
sys_msg = "ไปฅไธใฏใใฟในใฏใ่ชฌๆใใๆ็คบใจใๆ่ใฎใใๅ
ฅๅใฎ็ตใฟๅใใใงใใ่ฆๆฑใ้ฉๅใซๆบใใๅฟ็ญใๆธใใชใใใ"
p = sys_msg
roles = ["ๆ็คบ", "ๅฟ็ญ"]
msgs = [": \n" + user_query, ": \n"]
if inputs:
roles.insert(1, "ๅ
ฅๅ")
msgs.insert(1, ": \n" + inputs)
for role, msg in zip(roles, msgs):
p += sep + role + msg
return p
# Infer with prompt without any additional input
user_inputs = {
"user_query": "ไธใใใใใใจใใใฎๆๅณใๅฐๅญฆ็ใงใๅใใใใใซๆใใฆใใ ใใใ",
"inputs": "ๆ
ใใฏไบบใฎใใใชใใ"
}
prompt = build_prompt(**user_inputs)
input_ids = tokenizer.encode(
prompt,
add_special_tokens=False,
return_tensors="pt"
)
tokens = model.generate(
input_ids.to(device=model.device),
max_new_tokens=256,
temperature=1,
top_p=0.95,
do_sample=True,
)
out = tokenizer.decode(tokens[0][input_ids.shape[1]:], skip_special_tokens=True).strip()
print(out)
"""
ใๆ
ใใฏไบบใฎใใใชใใใใฏใใๆ
ใใใใใใจใใฎไบบใฎใใใซใชใใชใใใจใใๆๅณใงใฏใใใพใใใ
ใใฎใใจใใใฏใใใจใใจใ่ชฐใใฎใใใซ่กๅใใใจใใฎ่กๅใๅใๅใฃใฆ่ชๅใซ่ฟใฃใฆใใใใจใใใใจใ่ชฌใใใใจใใใงใใ
"""
```
## Model Details
* **Model type**: `japanese-stablelm-instruct-alpha-7b-v2` is an auto-regressive language model based on the NeoX transformer architecture.
* **Language(s)**: Japanese
* **Library**: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
* **License**: This model is licensed under [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).
## Training
| Parameters | Hidden Size | Layers | Heads | Sequence Length |
|------------|-------------|--------|-------|-----------------|
| 7B | 4096 | 32 | 32 | 1024 |
### Training Dataset
`japanese-stablelm-instruct-alpha-7b-v2` is fine-tuned on a combination of following datasets:
- [Japanese translation of the Databricks Dolly-15k dataset](https://huggingface.co/datasets/kunishou/databricks-dolly-15k-ja)
- [Japanese translation of the subset of the Anthropic HH dataset](https://huggingface.co/datasets/fujiki/japanese_hh-rlhf-49k)
- [Wikinews](https://ja.wikinews.org/wi) [subset](https://huggingface.co/datasets/fujiki/llm-japanese-dataset_wikinews) of the [izumi-lab/llm-japanese-dataset](https://huggingface.co/datasets/izumi-lab/llm-japanese-dataset)
## Use and Limitations
### Intended Use
This model is intended to be used by the open-source community in chat-like applications in adherence with [Apache-2.0 license](https://www.apache.org/licenses/LICENSE-2.0).
### Limitations and bias
Although the aforementioned datasets help to steer the base language models into "safer" distributions of text, not all biases and toxicity can be mitigated through fine-tuning. We ask that users be mindful of such potential issues that can arise in generated responses. Do not treat model outputs as substitutes for human judgment or as sources of truth. Please use responsibly.
## Authors
- [Meng Lee](https://huggingface.co/leemeng)
- [Fujiki Nakamura](https://huggingface.co/fujiki)
- [Makoto Shing](https://huggingface.co/mkshing)
- [Paul McCann](https://huggingface.co/polm-stability)
- [Takuya Akiba](https://huggingface.co/iwiwi)
- [Naoki Orii](https://huggingface.co/mrorii)
## Acknowledgements
We are utilizing the v1 version of the [novelai-tokenizer](https://github.com/NovelAI/novelai-tokenizer), introduced by [NovelAI](https://novelai.net/), because it processes both Japanese and English text both effectively and efficiently. We extend our gratitude to NovelAI for allowing us to use their remarkable work. For more details about the tokenizer, please refer to their [blog post](https://blog.novelai.net/novelais-new-llm-tokenizer-5bc140e17642).
We are grateful for the contributions of the EleutherAI Polyglot-JA team in helping us to collect a large amount of pre-training data in Japanese. Polyglot-JA members includes Hyunwoong Ko (Project Lead), Fujiki Nakamura (originally started this project when he committed to the Polyglot team), Yunho Mo, Minji Jung, KeunSeok Im, and Su-Kyeong Jang.
We are also appreciative of [AI Novelist/Sta (Bit192, Inc.)](https://ai-novel.com/index.php) and the numerous contributors from [Stable Community Japan](https://discord.gg/VPrcE475HB) for assisting us in gathering a large amount of high-quality Japanese textual data for model training.
## How to cite
```bibtext
@misc{JapaneseStableLMInstructAlpha7Bv2,
url={[https://huggingface.co/stabilityai/japanese-stablelm-instruct-alpha-7b-v2](https://huggingface.co/stabilityai/japanese-stablelm-instruct-alpha-7b-v2)},
title={Japanese StableLM Instruct Alpha 7B v2},
author={Lee, Meng and Nakamura, Fujiki and Shing, Makoto and McCann, Paul and Akiba, Takuya and Orii, Naoki}
}
```
## Citations
```bibtex
@misc{alpaca,
author = {Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto },
title = {Stanford Alpaca: An Instruction-following LLaMA model},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/tatsu-lab/stanford_alpaca}},
}
```
```bibtext
@software{gpt-neox-library,
title = {{GPT-NeoX: Large Scale Autoregressive Language Modeling in PyTorch}},
author = {Andonian, Alex and Anthony, Quentin and Biderman, Stella and Black, Sid and Gali, Preetham and Gao, Leo and Hallahan, Eric and Levy-Kramer, Josh and Leahy, Connor and Nestler, Lucas and Parker, Kip and Pieler, Michael and Purohit, Shivanshu and Songz, Tri and Phil, Wang and Weinbach, Samuel},
url = {https://www.github.com/eleutherai/gpt-neox},
doi = {10.5281/zenodo.5879544},
month = {8},
year = {2021},
version = {0.0.1},
}
``` |