metadata
license: apache-2.0
datasets:
- graelo/wikipedia
- uonlp/CulturaX
- HuggingFaceH4/ultrachat_200k
language:
- ja
- en
Evaluation
How to use
Hugging Face
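from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

# Load the tokenizer and model (bfloat16 weights, placed automatically on available devices).
tokenizer = AutoTokenizer.from_pretrained("lightblue/karasu-7B")
model = AutoModelForCausalLM.from_pretrained("lightblue/karasu-7B", torch_dtype=torch.bfloat16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# System prompt: "You are an AI assistant."
messages = [{"role": "system", "content": "あなたはAIアシスタントです。"}]
# User prompt: "Who is the Prime Minister of the United Kingdom?"
messages.append({"role": "user", "content": "イギリスの首相は誰ですか?"})

# Render the conversation into a single prompt string using the model's chat template.
prompt = tokenizer.apply_chat_template(conversation=messages, add_generation_prompt=True, tokenize=False)
# Greedy decoding: with do_sample=False the temperature is unused, so it is omitted.
pipe(prompt, max_new_tokens=100, do_sample=False, return_full_text=False)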
vLLM
from vllm import LLM, SamplingParams

# temperature=0.0 selects greedy decoding in vLLM.
sampling_params = SamplingParams(temperature=0.0, max_tokens=100)
llm = LLM(model="lightblue/karasu-7B")

# System prompt: "You are an AI assistant."
messages = [{"role": "system", "content": "あなたはAIアシスタントです。"}]
# User prompt: "Who is the Prime Minister of the United Kingdom?"
messages.append({"role": "user", "content": "イギリスの首相は誰ですか?"})

# llm.get_tokenizer() returns the underlying Hugging Face tokenizer, which owns the chat template.
prompt = llm.get_tokenizer().apply_chat_template(conversation=messages, add_generation_prompt=True, tokenize=False)
prompts = [prompt]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
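Both snippets above build the same system-plus-user message list by hand. As a minimal sketch, a small helper can build that list for several questions or for a multi-turn conversation. The names `build_messages`, `history`, and `batch` are hypothetical and not part of the model card, the second question is an invented example, and whether the model's chat template accepts earlier assistant turns is an assumption to verify before relying on it.

```python
def build_messages(user_prompt, system_prompt="あなたはAIアシスタントです。", history=None):
    """Build a chat message list in the role/content format used above.

    `history` is an optional list of (user_text, assistant_text) pairs
    from earlier turns (hypothetical parameter, for illustration only).
    """
    messages = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in history or []:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": user_prompt})
    return messages


# One message list per question.
# "Who is the Prime Minister of the United Kingdom?" / "What is the capital of Japan?"
questions = ["イギリスの首相は誰ですか?", "日本の首都はどこですか?"]
batch = [build_messages(question) for question in questions]
```

Each list in `batch` can then be rendered with `apply_chat_template` as in the snippets above, and the resulting prompt strings passed together to `llm.generate`.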
Base checkpoint
augmxnt/shisa-7b-v1
- Mistral-7B base
- Pre-trained on 8B tokens of MADLAD-Ja
- Fine-tuned on Japanese instructions
- Highest-scoring 7B model on the JA MT-Bench conversation benchmark
Training datasets (total ~7B tokens)
- Aozora Bunko
- Japanese Law Precedent Dataset
- Japanese Wikipedia
- Web scrapes of .lg.jp, .go.jp, and .ac.jp domains from CulturaX (documents sharing the same first 25 characters were de-duplicated)
- English Ultrachat200K-gen (so that the model does not forget the English and chat abilities learned in the base checkpoint)
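The domain filter and prefix-based de-duplication described for the CulturaX scrapes could look roughly like the sketch below. It is an illustration, not the authors' actual pipeline: the function names are hypothetical, and the records are assumed to be dicts with `url` and `text` keys.

```python
from urllib.parse import urlparse

# Domain suffixes named in the dataset list above.
ALLOWED_SUFFIXES = (".lg.jp", ".go.jp", ".ac.jp")
PREFIX_LEN = 25  # number of leading characters used as the de-duplication key


def is_allowed_domain(url):
    """Return True if the URL's host ends with one of the allowed suffixes."""
    host = urlparse(url).hostname or ""
    return host.endswith(ALLOWED_SUFFIXES)


def filter_and_dedup(docs):
    """Keep allowed-domain documents, dropping repeats of an already-seen 25-character prefix.

    `docs` is an iterable of dicts with "url" and "text" keys (assumed schema).
    The first document seen with a given prefix is the one that is kept.
    """
    seen_prefixes = set()
    kept = []
    for doc in docs:
        if not is_allowed_domain(doc["url"]):
            continue
        key = doc["text"][:PREFIX_LEN]
        if key in seen_prefixes:
            continue
        seen_prefixes.add(key)
        kept.append(doc)
    return kept
```

Keying on a short prefix is a cheap way to drop pages that share boilerplate openings, at the cost of occasionally discarding distinct documents that happen to start identically.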
Developed by
Engineers
Peter Devine
Sho Higuchi
Advisors
Yuuki Yamanaka
Atom Sonoda
Project manager
Shunichi Taniguchi
Dataset evaluator
Renju Aoki