|
---
library_name: transformers
license: cc-by-4.0
datasets:
- DeL-TaiseiOzaki/Tengentoppa-sft-v1.0
- llm-jp/magpie-sft-v1.0
- ntotsuka123/clean3-ultraboros-20k-ja-filter
language:
- ja
- en
base_model:
- llm-jp/llm-jp-3-13b
---
|
|
|
# Model Card for Yuto-24/llm-jp-3-13B-Tengentoppa_magpie
|
|
|
This is a full-parameter fine-tuned model based on `llm-jp/llm-jp-3-13b`.

See the base model's details [here](https://huggingface.co/llm-jp/llm-jp-3-13b).
|
|
|
Built for the `elyza-tasks-100-TV` task, which Matsuo Lab created for a class.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
<!-- Provide a longer summary of what this model is. --> |
|
|
|
This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
|
|
|
- **Developed by:** [Yuto-24](https://github.com/Yuto-24/) |
|
- **Model type:** Text Generation |
|
- **Language(s) (NLP):** Japanese, English |
|
- **License:** CC-BY-4.0 |
|
- **Finetuned from model:** [llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b)
|
|
|
### Model Sources [optional] |
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **Repository:** coming soon... |
|
|
|
## Uses |
|
|
|
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. --> |
|
|
|
### Direct Use |
|
|
|
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. --> |
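The full pipeline used to answer `elyza-tasks-100-TV` is reproduced below: the dependencies, then a script that embeds each task input with BGE-M3, retrieves the most similar ELYZA-tasks-100 examples as few-shot context, and generates an answer with the model.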
|
|
|
```txt:requirements.txt
numpy
torch>=2.3.0
datasets
transformers>=4.40.1
accelerate>=0.29.3
flash-attn>=2.5.8
FlagEmbedding
# used by the inference script below
pandas
tqdm
peft
```
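Install them with `pip install -r requirements.txt`; note that building `flash-attn` typically requires a CUDA toolkit matching your PyTorch build.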
|
|
|
~~~python
import json
import re
from os.path import abspath, dirname, isfile, join

import numpy as np
import pandas as pd
import torch
from datasets import load_dataset
from FlagEmbedding import BGEM3FlagModel
from peft import PeftModel
from tqdm import tqdm
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TextStreamer,
)

# Few-shot examples are retrieved from the original ELYZA-tasks-100 dataset.
elyza_tasks_datasets = load_dataset("elyza/ELYZA-tasks-100")

# BGE-M3 dense embeddings are used to find the tasks most similar to each input.
embedder = BGEM3FlagModel("BAAI/bge-m3")
target_texts = list(elyza_tasks_datasets["test"]["input"])
target_embeds = embedder.encode(target_texts)["dense_vecs"]


def retrieve(input_text: str, n: int = 1) -> list:
    """Return the indices of the n ELYZA-tasks-100 inputs most similar to input_text."""
    input_embeds = embedder.encode([input_text])["dense_vecs"]

    # Compute the similarity.
    similarity = (input_embeds @ target_embeds.T)[0]
    return np.argsort(similarity)[::-1][:n].tolist()


class CallLLM:
    def __init__(self, model_name_or_path: str) -> None:
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name_or_path,
            trust_remote_code=True,
            device_map="auto",
        ).eval()
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_name_or_path,
            trust_remote_code=True,
        )
        self.streamer = TextStreamer(self.tokenizer)
        self.call_type = None
        print(f"{self.model.device = }")

    def __call__(self, input_text: str, call_type: str = None, stream=False, **kwargs):
        self.call_type = call_type

        call_type_dict = {
            "chat_template": self.__call_chat_template,
        }

        if self.call_type not in call_type_dict:
            raise ValueError(
                f"Please set the call_type. You can select from {list(call_type_dict)}"
            )
        return call_type_dict[call_type](input_text.strip(), stream=stream, **kwargs)

    def merge_adapter(self, lora_adapter_path):
        # Merge a LoRA adapter into the base model via PEFT.
        self.model = PeftModel.from_pretrained(self.model, lora_adapter_path)
        self.model = self.model.merge_and_unload()

    def __call_chat_template(
        self,
        input_text: str = "",
        system_prompt: str = "あなたは、大塚商会の誠実で優秀なアシスタントです。",
        **kwargs,
    ):
        prompt = []
        if system_prompt:
            prompt.append({"role": "system", "content": system_prompt})
        if input_text:
            prompt.append({"role": "user", "content": input_text})

        tokenized_input = self.tokenizer.apply_chat_template(
            prompt,
            return_tensors="pt",
        )

        return self.__inference(tokenized_input, **kwargs)

    def __inference(self, tokenized_input, stream: bool = False, **kwargs):
        tokenized_input = tokenized_input.to(self.model.device)
        attention_mask = torch.ones_like(tokenized_input)

        default_inference_params = {
            "attention_mask": attention_mask,
            "max_new_tokens": 512,
            "do_sample": False,
            "repetition_penalty": 1.2,
            "eos_token_id": self.tokenizer.eos_token_id,
            "pad_token_id": self.tokenizer.eos_token_id,
            # "eos_token_id": self.tokenizer.encode("<|im_end|>"),
        }

        inference_params = default_inference_params.copy()
        inference_params.update(**kwargs)
        if stream:
            inference_params.update(streamer=self.streamer)

        # Inference
        with torch.no_grad():
            outputs = self.model.generate(
                tokenized_input,
                **inference_params,
            )[0]
        # Decode only the newly generated tokens.
        return self.tokenizer.decode(
            outputs[tokenized_input.size(1):],
            skip_special_tokens=True,
        )


model_path_or_id = "Yuto-24/llm-jp-3-13B-Tengentoppa_magpie"

# Loading model here.
llm = CallLLM(model_path_or_id)

# System prompt (in Japanese): the model must answer concisely and honestly,
# follow any user instructions, respect the evaluation aspects shown in the
# few-shot examples, and reason step by step.
SYSTEM_PROMPT = """
# あなたが必ず従うべき事項

## 役割

あなたは誠実で優秀なアシスタントです。
質問に対し、簡潔に答えます。
ハルシネーションをしません。
必ず正しい情報のみを答えます。

## 指示

- 評価観点に沿った出力を作成します。
- ユーザから特別な指示が与えられている場合には、必ず従います。
- 具体例には評価観点が含まれていますが、あなたが考える「出力」のみを回答してください。
- 評価観点は、人間があなたの出力を評価するために利用します。
- 論理的にステップバイステップで考えてください。

## 具体例

```markdown
{examples}
```
""".strip()

# Template for each retrieved few-shot example: input, evaluation aspect, output.
EXAMPLE_TEMPLATE = """
### 入力

{dataset_input}

### 評価観点

{dataset_eval_aspect}

### 出力

{dataset_answer}
""".strip()


# Load the task data.
# In the omnicampus development environment, drag and drop the task jsonl
# into the left pane before running.
tasks = []
with open(f"{dirname(abspath('__file__'))}/workspace/elyza-tasks-100-TV_0.jsonl", "r") as f:
    item = ""
    for line in f:
        line = line.strip()
        item += line
        if item.endswith("}"):
            tasks.append(json.loads(item))
            item = ""

# Run inference on the tasks with the model.
results = []
n = 2  # number of few-shot examples to retrieve per task


for data in tqdm(tasks, smoothing=0.0):
    input_text = data["input"]
    dataset_index_list = retrieve(input_text, n)

    examples = ""
    for dataset_index in dataset_index_list:
        examples += EXAMPLE_TEMPLATE.format(
            dataset_input=elyza_tasks_datasets["test"]["input"][dataset_index].strip(),
            dataset_eval_aspect=elyza_tasks_datasets["test"]["eval_aspect"][dataset_index].strip(),
            dataset_answer=elyza_tasks_datasets["test"]["output"][dataset_index].strip(),
        ) + "\n\n"

    system_prompt = SYSTEM_PROMPT.format(
        examples=examples.strip(),
    )

    output = llm(
        input_text=input_text,
        system_prompt=system_prompt,
        call_type="chat_template",
        repetition_penalty=1.15,
        # stream=True,
    ).strip()

    print(output.strip())
    print("=" * 131)
    # Strip any leaked "### 出力" / "**出力**:" headers from the answer.
    print(re.sub(r"^[\s\S]*?### 出力", "", re.sub(r"^[\s\S]*?\*\*出力\*\*:", "", output)).strip())
    print("-" * 131)

    results.append({
        "task_id": data["task_id"],
        "input": input_text,
        "output_org": output.strip(),
        "output": re.sub(r"^[\s\S]*?### 出力", "", output).strip(),
        # NOTE: the fields below record only the last retrieved example.
        "elyza_tasks_id": dataset_index,
        "dataset_input": elyza_tasks_datasets["test"]["input"][dataset_index],
        "dataset_eval_aspect": elyza_tasks_datasets["test"]["eval_aspect"][dataset_index],
        "dataset_answer": elyza_tasks_datasets["test"]["output"][dataset_index],
    })


# `results` now holds the answers for the tasks.

# pandas display settings for inspecting the results.
pd.set_option("display.max_columns", 0)    # maximum number of columns to display
pd.set_option("display.max_rows", 100)     # maximum number of rows to display
pd.set_option("display.max_colwidth", 550)

json4df = {
    "task_id": [],
    "input": [],
    "output": [],
    "output_org": [],
    # "elyza_tasks_id": [],
    # "dataset_input": [],
    # "dataset_eval_aspect": [],
    # "dataset_answer": [],
}

for result in results:
    json4df["task_id"].append(result["task_id"])
    json4df["input"].append(result["input"])
    json4df["output_org"].append(result["output_org"])
    json4df["output"].append(result["output"])

JSON_FILE_NAME = "llm-jp-3-13B-Tengentoppa-FPFT-magpie-FPFT-elyza-RAG_v2"


# Write the outputs to jsonl. This code includes input and eval_aspect as
# well, but they are not required; only task_id and output are mandatory.
result4out = results.copy()

WD = dirname(abspath("__file__"))
json_dir = join(
    WD,
    "..",
    "jsonl",
)

if JSON_FILE_NAME != "":
    file_path = join(json_dir, f"{JSON_FILE_NAME}.jsonl")
else:
    jsonl_id = re.sub(".*/", "", model_path_or_id)
    file_path = join(json_dir, f"{jsonl_id}-outputs.jsonl")

assert not isfile(file_path), f"Error: File `{file_path}` already exists."

drop_keys = {"elyza_tasks_id", "dataset_input", "dataset_eval_aspect", "dataset_answer"}
with open(file_path, "w", encoding="utf-8") as f:
    for result in result4out:
        result = {k: v for k, v in result.items() if k not in drop_keys}
        json.dump(
            result, f, ensure_ascii=False
        )  # ensure_ascii=False for handling non-ASCII characters
        f.write("\n")
~~~
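Run end to end, the script prints each model answer and writes one JSON record per task (`task_id`, `input`, `output_org`, `output`) to `../jsonl/{JSON_FILE_NAME}.jsonl`; the assert guards against overwriting an existing file.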
|
|
|
### Downstream Use [optional] |
|
|
|
<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app --> |
|
|
|
[More Information Needed] |
|
|
|
### Out-of-Scope Use |
|
|
|
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. --> |
|
|
|
[More Information Needed] |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
<!-- This section is meant to convey both technical and sociotechnical limitations. --> |
|
|
|
[More Information Needed] |
|
|
|
### Recommendations |
|
|
|
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. --> |
|
|
|
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations. |
|
|
|
## How to Get Started with the Model |
|
|
|
Below is a minimal sketch for loading the model and chatting with it via the tokenizer's chat template. The generation parameters mirror the defaults used in the evaluation script above; the example question is illustrative.
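```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yuto-24/llm-jp-3-13B-Tengentoppa_magpie"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
).eval()

messages = [
    {"role": "system", "content": "あなたは誠実で優秀なアシスタントです。"},
    {"role": "user", "content": "日本で一番高い山は?"},  # illustrative question
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        max_new_tokens=512,
        do_sample=False,
        repetition_penalty=1.2,
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.size(1):], skip_special_tokens=True))
```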
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. --> |
|
|
|
- [DeL-TaiseiOzaki/Tengentoppa-sft-v1.0](https://huggingface.co/datasets/DeL-TaiseiOzaki/Tengentoppa-sft-v1.0) |
|
- [llm-jp/magpie-sft-v1.0](https://huggingface.co/datasets/llm-jp/magpie-sft-v1.0) |
|
- [ntotsuka123/clean3-ultraboros-20k-ja-filter](https://huggingface.co/datasets/ntotsuka123/clean3-ultraboros-20k-ja-filter) |
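As the configs below show, the first training stage uses Tengentoppa-sft-v1.0 (alpaca format), and the second stage continues on magpie-sft-v1.0 and clean3-ultraboros-20k-ja-filter (ChatML chat-template format).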
|
|
|
### Training Procedure |
|
|
|
Fine-tuned with Axolotl in two stages, using the YAML configs below.
|
|
|
For the first training:

```yaml
base_model: llm-jp/llm-jp-3-13b
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

# domain_yyyymmdd
output_dir: outputs/matsuo/llm-jp/3/13B/FPFT_20241213

chat_template: chatml
default_system_message: あなたは、大塚商会の誠実で優秀なアシスタントです。

shuffle_merged_datasets: true
datasets:
  # # General
  # - path: data/general/magpie-sft-v1.0.jsonl
  #   ds_type: json
  #   type: chat_template
  #   chat_template: chatml
  #   field_messages: conversations
  #   message_field_role: role
  #   message_field_content: content
  #   roles:
  #     user:
  #       - user
  #     assistant:
  #       - assistant
  #     system:
  #       - system
  - path: data/general/Tengentoppa-sft-v1.0.jsonl
    ds_type: json
    type: alpaca
  # - path: data/general/clean3-ultraboros-20k-ja-filter_train.jsonl
  #   ds_type: json
  #   type: chat_template
  #   # chat_template: chatml
  #   field_messages: conversations
  #   message_field_role: role
  #   message_field_content: value
  #   roles:
  #     user:
  #       - human
  #     assistant:
  #       - gpt
  #     system:
  #       - system
  #   train_on_eos: turn

val_set_size: 0.05

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 2
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.00002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

# warmup_steps: 100
warmup_ratio: 0.1
evals_per_epoch: 1
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero3.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  eos_token: <|im_end|>
```
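Note that `base_model` in the second config points at the `output_dir` of the first, so the second stage fine-tunes the stage-one checkpoint rather than the original base model.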
|
|
|
For the second training:

```yaml
base_model: outputs/matsuo/llm-jp/3/13B/FPFT_20241213
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

# domain_yyyymmdd
output_dir: outputs/matsuo/llm-jp/3/13B/FPFT_20241215

chat_template: chatml
default_system_message: あなたは、大塚商会の誠実で優秀なアシスタントです。

shuffle_merged_datasets: true
datasets:
  - path: data/general/magpie-sft-v1.0.jsonl
    ds_type: json
    type: chat_template
    chat_template: chatml
    field_messages: conversations
    message_field_role: role
    message_field_content: content
    roles:
      user:
        - user
      assistant:
        - assistant
      system:
        - system
  # - path: data/general/Tengentoppa-sft-v1.0.jsonl
  #   ds_type: json
  #   type: alpaca
  - path: data/general/clean3-ultraboros-20k-ja-filter_train.jsonl
    ds_type: json
    type: chat_template
    chat_template: chatml
    field_messages: conversations
    message_field_role: role
    message_field_content: value
    roles:
      user:
        - human
      assistant:
        - gpt
      system:
        - system
    ## NOTE: Leaving the below empty will default to using the simple legacy tokenization strategy where only the last message is trained on.
    # Optional[List[str]]. Roles to train on. The tokens from these roles will be considered for the loss.
    roles_to_train: ["gpt", "assistant"]
    # Optional[str]. Which EOS tokens to train on in the conversation. Possible values are:
    # - all: train on all EOS tokens
    # - turn: train on the EOS token at the end of each trainable turn
    # - last: train on the last EOS token in the conversation
    train_on_eos: last

val_set_size: 0.05

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 2
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.00002

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: true

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: true

# warmup_steps: 100
warmup_ratio: 0.1
evals_per_epoch: 1
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed: deepspeed_configs/zero3.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  eos_token: <|im_end|>
```
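Each stage was run with the Axolotl trainer; a typical invocation (an assumption, not recorded with the original card) would be `accelerate launch -m axolotl.cli.train <config>.yaml` with the corresponding config.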
|
|
|
## Evaluation |
|
|
|
<!-- This section describes the evaluation protocols and provides the results. --> |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
<!-- This should link to a Dataset Card if possible. --> |
|
|
|
[More Information Needed] |
|
|
|
#### Factors |
|
|
|
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. --> |
|
|
|
[More Information Needed] |
|
|
|
#### Metrics |
|
|
|
<!-- These are the evaluation metrics being used, ideally with a description of why. --> |
|
|
|
[More Information Needed] |
|
|
|
### Results |
|
|
|
[More Information Needed] |
|
|
|
#### Summary |
|
|
|
## Model Examination [optional] |
|
|
|
<!-- Relevant interpretability work for the model goes here --> |
|
|
|
[More Information Needed] |
|
|
|
## Environmental Impact |
|
|
|
<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly --> |
|
|
|
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700). |
|
|
|
- **Hardware Type:** [More Information Needed] |
|
- **Hours used:** [More Information Needed] |
|
- **Cloud Provider:** [More Information Needed] |
|
- **Compute Region:** [More Information Needed] |
|
- **Carbon Emitted:** [More Information Needed] |
|
|
|
## Technical Specifications [optional] |
|
|
|
### Model Architecture and Objective |
|
|
|
[More Information Needed] |
|
|
|
### Compute Infrastructure |
|
|
|
[More Information Needed] |
|
|
|
#### Hardware |
|
|
|
[More Information Needed] |
|
|
|
#### Software |
|
|
|
[More Information Needed] |
|
|
|
## Citation [optional] |
|
|
|
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. --> |
|
|
|
**BibTeX:** |
|
|
|
[More Information Needed] |
|
|
|
**APA:** |
|
|
|
[More Information Needed] |
|
|
|
## Glossary [optional] |
|
|
|
<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. --> |
|
|
|
[More Information Needed] |
|
|
|
## More Information [optional] |
|
|
|
[More Information Needed] |
|
|
|
## Model Card Authors [optional] |
|
|
|
[More Information Needed] |
|
|
|
## Model Card Contact |
|
|
|
[More Information Needed] |
|
|