---
license: mit
language:
- ja
library_name: transformers
pipeline_tag: text-generation
tags:
- japanese
- llama-2
- Powered by AWS Trainium
---

# stockmark/stockmark-13b

Stockmark-13b is a 13-billion-parameter LLM pretrained from scratch on a Japanese corpus of about 220B tokens. The model was developed by [Stockmark Inc.](https://stockmark.co.jp/)

Please see our [blog](https://tech.stockmark.co.jp/blog/202310_stockmark_13b/) for more details.

This project is supported by the [AWS LLM development support program](https://aws.amazon.com/jp/local/llm-development-support-program/).

We also provide [stockmark-13b-instruct](https://huggingface.co/stockmark/stockmark-13b-instruct), the instruction-tuned version of stockmark-13b.

## How to use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# For A100 or H100 GPU
model = AutoModelForCausalLM.from_pretrained("stockmark/stockmark-13b", device_map="auto", torch_dtype=torch.bfloat16)

# If you use a T4 or V100 GPU, load the model in 8-bit with the code below.
# This requires `bitsandbytes` (`pip install bitsandbytes`).
# model = AutoModelForCausalLM.from_pretrained("stockmark/stockmark-13b", device_map={"": 0}, load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("stockmark/stockmark-13b")

# Prompt: "Natural language processing is"
inputs = tokenizer("自然言語処理とは", return_tensors="pt").to(model.device)
with torch.no_grad():
    tokens = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7
    )

output = tokenizer.decode(tokens[0], skip_special_tokens=True)
print(output)
```
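
The GPU-specific loading options in the snippet follow largely from the size of the weights. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes exactly 13 billion parameters (the nominal count) and counts weights only, ignoring activations, the KV cache, and framework overhead, so real usage will be higher. The helper `weight_memory_gb` is a hypothetical name introduced for illustration.

```python
# Rough weight-only memory estimate.
# Assumption: exactly 13e9 parameters (nominal count from the model name).
# Activations, KV cache, and framework overhead are NOT included.
N_PARAMS = 13e9

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "float32": 4,
    "bfloat16": 2,
    "int8": 1,
}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Return the approximate size of the weights in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(N_PARAMS, dtype):.0f} GB")
```

Under these assumptions the bfloat16 weights alone come to roughly 26 GB, more than a 16 GB card such as the T4 can hold, while 8-bit weights come to roughly 13 GB.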

## Examples

- LoRA tuning: https://huggingface.co/stockmark/stockmark-13b/blob/main/notebooks/LoRA.ipynb

## Training dataset

We used a Japanese corpus totaling about 220 billion tokens.

|corpus|tokens after preprocessing|
|:---:|:---:|
|Stockmark Web Corpus (This dataset will not be released)|9.1 billion|
|Patent|34.8 billion|
|Wikipedia|1.0 billion|
|CC100|10.9 billion|
|mC4|53.2 billion|
|CommonCrawl (snapshot: 2023-23, 2022-49, 2022-21, 2021-21)|112.9 billion|
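
As a quick check of the "about 220 billion" figure, the table rows can be summed and turned into shares of the pretraining mix. This is only an arithmetic sketch over the numbers listed above; the variable names are illustrative, and the shares say nothing about sampling weights used during training, which the table does not state.

```python
# Token counts in billions, copied from the table above.
corpus_tokens_b = {
    "Stockmark Web Corpus": 9.1,
    "Patent": 34.8,
    "Wikipedia": 1.0,
    "CC100": 10.9,
    "mC4": 53.2,
    "CommonCrawl": 112.9,
}

# Total size of the corpus after preprocessing.
total_b = sum(corpus_tokens_b.values())
print(f"total: {total_b:.1f} billion tokens")

# Share of each corpus by token count, largest first.
shares = {name: count / total_b for name, count in corpus_tokens_b.items()}
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.1%}")
```

The rows add up to 221.9 billion tokens, consistent with the rounded figure, and the CommonCrawl snapshots account for about half of the total.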

## Accelerator and Library
- Accelerator: AWS Trainium
  - https://aws.amazon.com/machine-learning/trainium/
- Library for distributed training: neuronx-nemo-megatron
  - https://github.com/aws-neuron/neuronx-nemo-megatron

## License
[MIT](https://opensource.org/licenses/MIT)

## Developed by
[Stockmark Inc.](https://stockmark.co.jp/)

## Author
[Takahiro Omi](https://huggingface.co/omitakahiro)