Text Generation · Transformers · Safetensors · Japanese · English · llama · conversational · text-generation-inference · Inference Endpoints
Yuto-24 committed · commit 20246bd · verified · 1 parent: 94e91ba

Update README.md

Files changed (1): README.md (+303 −50)
README.md CHANGED
@@ -14,9 +14,10 @@ base_model:

# Model Card for Model ID

- <!-- Provide a quick summary of what the model is/does. -->
-

## Model Details

@@ -26,21 +27,17 @@ base_model:

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]

## Uses
 
@@ -52,18 +49,20 @@ This is the model card of a 🤗 transformers model that has been pushed on the

```txt:requirements.txt
numpy
- torch
datasets
- transformers
FlagEmbedding
```

~~~python
- from FlagEmbedding import BGEM3FlagModel
-
import torch
import numpy as np
from datasets import Dataset, load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
@@ -83,16 +82,15 @@ def retrieve(input_text):

    input_texts = [input_text]
    input_embeds = model.encode(input_texts)["dense_vecs"]
-     # print(input_embeds)

    # compute the similarity
    similarity = input_embeds @ target_embeds.T
    most_similar_text = target_texts[np.argmax(similarity)]

    target_index = target_texts.index(most_similar_text)
-
    return target_index

class CallLLM:
    def __init__(self, model_name_or_path: str) -> None:
        self.model = AutoModelForCausalLM.from_pretrained(
@@ -185,44 +183,53 @@ model_path_or_id = "Yuto-24/llm-jp-3-13B-Tengentoppa_magpie"
llm = CallLLM(model_path_or_id)

SYSTEM_PROMPT = """
- # 役割

あなたは誠実で優秀なアシスタントです。
ハルシネーションをしません。
必ず正しい情報のみを答えます。

## 指示

- ユーザから特別な指示が与えられている場合には、必ず従います。
- 具体例には評価観点が含まれていますが、あなたが考える「出力」のみを回答してください。
- 評価観点は、人間があなたの出力を評価するために利用します。

- ### 具体例

```markdown
- 入力

{dataset_input}

- 評価観点

{dataset_eval_aspect}

- 出力

{dataset_answer}
- ```
-
""".strip()


- # Load the task data.
- # In the omnicampus dev environment, drag and drop the task jsonl into the left pane, then run.
import os
import json

datasets = []
- with open(f"{os.path.dirname(os.path.abspath('__file__'))}/workspace/elyza-tasks-100-TV_0.jsonl", "r") as f:
    item = ""
    for line in f:
        line = line.strip()
@@ -264,10 +271,10 @@ for data in tqdm(datasets, smoothing=0.0):
        # stream=True,
    ).strip()
    # print("-----------------------------------------------------------------------------------------------------------------------------------")
-     # print(output.strip())
-     # print("===================================================================================================================================")
-     # print(re.sub(r"^[\s\S]*?### 出力", "", re.sub(r"^[\s\S]*?\*\*出力\*\*:", "", output)).strip())
-     # print("-----------------------------------------------------------------------------------------------------------------------------------")

    results.append({
        "task_id": data["task_id"],
@@ -275,16 +282,85 @@ for data in tqdm(datasets, smoothing=0.0):
        "output_org": output.strip(),
        "output": re.sub(r"^[\s\S]*?### 出力", "", output).strip(),
        "elyza_tasks_id": dataset_index,
    })


- # results now holds the answers for the tasks

- ~~~

- [More Information Needed]

### Downstream Use [optional]
 
@@ -322,26 +398,205 @@ Use the code below to get started with the model.

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- [More Information Needed]

### Training Procedure

- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

- #### Preprocessing [optional]

- [More Information Needed]

- #### Training Hyperparameters

- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

- #### Speeds, Sizes, Times [optional]

- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

- [More Information Needed]

## Evaluation
 
@@ -373,8 +628,6 @@ Use the code below to get started with the model.

#### Summary

-
-
## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->
@@ -439,4 +692,4 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]

## Model Card Contact

- [More Information Needed]
 
README.md (after the change):

# Model Card for Model ID

+ This is a full-parameter fine-tuned model based on `llm-jp/llm-jp-3-13B`.
+ See the base model details [here](https://huggingface.co/llm-jp/llm-jp-3-13b).
+
+ Built for the `elyza-tasks-100-TV` task, which Matsuo Lab created for a class.

## Model Details
 
 

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

+ - **Developed by:** [Yuto-24](https://github.com/Yuto-24/)
+ - **Model type:** Text Generation
+ - **Language(s) (NLP):** Japanese, English
+ - **License:** CC-BY-4.0
+ - **Finetuned from model:** [llm-jp/llm-jp-3-13B](https://huggingface.co/llm-jp/llm-jp-3-13b)

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

+ - **Repository:** coming soon...

## Uses
 
 

```txt:requirements.txt
numpy
+ torch>=2.3.0
datasets
+ transformers>=4.40.1
+ accelerate>=0.29.3
+ flash-attn>=2.5.8
FlagEmbedding
```
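
A practical note on installation: `flash-attn` compiles against an already-installed `torch`, so installing `torch` first and the remaining requirements afterwards usually avoids build failures.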

~~~python
import torch
import numpy as np
+
from datasets import Dataset, load_dataset
+ from FlagEmbedding import BGEM3FlagModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
 
# ... (unchanged lines omitted) ...

    input_texts = [input_text]
    input_embeds = model.encode(input_texts)["dense_vecs"]

    # compute the similarity
    similarity = input_embeds @ target_embeds.T
    most_similar_text = target_texts[np.argmax(similarity)]

    target_index = target_texts.index(most_similar_text)
    return target_index
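
# The diff elides the embedder setup. A minimal sketch (identifiers assumed,
# not shown in this commit) of how `model`, `target_texts`, and `target_embeds`
# used above can be prepared with BGE-M3:
# model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
# target_texts = elyza_tasks_datasets["test"]["input"]
# target_embeds = model.encode(target_texts)["dense_vecs"]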

class CallLLM:
    def __init__(self, model_name_or_path: str) -> None:
        self.model = AutoModelForCausalLM.from_pretrained(
 
# ... (unchanged lines omitted) ...

llm = CallLLM(model_path_or_id)

SYSTEM_PROMPT = """
+ # あなたが必ず従うべき事項
+
+ ## 役割

あなたは誠実で優秀なアシスタントです。
+ 質問に対し、簡潔に答えます。
ハルシネーションをしません。
必ず正しい情報のみを答えます。

## 指示

+ - 評価観点に沿った出力を作成します。
+ - ユーザから特別な指示が与えられている場合には、必ず従います。
+ - 具体例には評価観点が含まれていますが、あなたが考える「出力」のみを回答してください。
+ - 評価観点は、人間があなたの出力を評価するために利用します。
+ - 論理的にステップバイステップで考えてください。

+ ## 具体例

```markdown
+ {examples}
+ ```
+ """.strip()
+
+ EXAMPLE_TEMPLATE = """
+ ### 入力

{dataset_input}

+ ### 評価観点

{dataset_eval_aspect}

+ ### 出力

{dataset_answer}
""".strip()


+ # Load the task data
+ # In the omnicampus dev environment, drag and drop the task jsonl into the left pane, then run
+
import os
import json

datasets = []
+ with open(f"{os.path.dirname(os.path.abspath('__file__'))}/workspace/elyza-tasks-100-TV_0.jsonl", "r") as f:
    item = ""
    for line in f:
        line = line.strip()
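        # (loop body truncated by the diff) the usual pattern accumulates lines
        # until a complete JSON object parses; plausibly (assumed):
        # item += line
        # if item.endswith("}"):
        #     datasets.append(json.loads(item))
        #     item = ""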
 
# ... (unchanged lines omitted) ...

        # stream=True,
    ).strip()
    # print("-----------------------------------------------------------------------------------------------------------------------------------")
+     print(output.strip())
+     print("===================================================================================================================================")
+     print(re.sub(r"^[\s\S]*?### 出力", "", re.sub(r"^[\s\S]*?\*\*出力\*\*:", "", output)).strip())
+     print("-----------------------------------------------------------------------------------------------------------------------------------")
 

    results.append({
        "task_id": data["task_id"],
        "output_org": output.strip(),
        "output": re.sub(r"^[\s\S]*?### 出力", "", output).strip(),
        "elyza_tasks_id": dataset_index,
+         "dataset_input": elyza_tasks_datasets["test"]["input"][dataset_index],
+         "dataset_eval_aspect": elyza_tasks_datasets["test"]["eval_aspect"][dataset_index],
+         "dataset_answer": elyza_tasks_datasets["test"]["output"][dataset_index],
    })

+ # results now holds the answers for the tasks

+ from pprint import pprint
+ import pandas as pd

+ # set the maximum number of columns to display
+ pd.set_option("display.max_columns", 0)
+ # set the maximum number of rows to display
+ pd.set_option("display.max_rows", 100)
+ pd.set_option("display.max_colwidth", 550)
 

+ json4df = {
+     "task_id": [],
+     "input": [],
+     "output": [],
+     "output_org": [],
+     # "elyza_tasks_id": [],
+     # "dataset_input": [],
+     # "dataset_eval_aspect": [],
+     # "dataset_answer": [],
+ }
+
+ for result in results:
+     json4df["task_id"].append(result["task_id"])
+     json4df["input"].append(result["input"])
+     json4df["output_org"].append(result["output_org"])
+     json4df["output"].append(result["output"])
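
# The display options above suggest the dict is then viewed as a table;
# a plausible next step (not shown in this commit) is:
# pd.DataFrame(json4df)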
+
+ JSON_FILE_NAME = "llm-jp-3-13B-Tengentoppa-FPFT-magpie-FPFT-elyza-RAG_v2"
+
+ result4out = results.copy()
+ results  # bare expression; in a notebook cell this displays the list
+
+
+ # This code keeps input and eval_aspect as well, but they are optional.
+ # Only task_id and output are required.
+
+ import re
+ import sys
+ from os.path import dirname, abspath, join, isfile
+
+
+ result4out = results.copy()
+
+
+ WD = dirname(abspath("__file__"))
+ json_dir = join(
+     WD,
+     "..",
+     "jsonl",
+ )
+
+
+ if JSON_FILE_NAME != "":
+     file_path = join(json_dir, f"{JSON_FILE_NAME}.jsonl")
+ else:
+     # `merged_model_path` is assumed to be defined earlier in the notebook
+     jsonl_id = re.sub(".*/", "", merged_model_path)
+     file_path = join(json_dir, f"{jsonl_id}-outputs.jsonl")
+
+ assert not isfile(file_path), f"Error: File `{file_path}` already exists."
+
+ with open(file_path, "w", encoding="utf-8") as f:
+     for result in result4out:
+         # drop the reference-only fields; keep task_id, input, output_org, output
+         result = {
+             k: v
+             for k, v in result.items()
+             if k not in ("elyza_tasks_id", "dataset_input", "dataset_eval_aspect", "dataset_answer")
+         }
+         json.dump(
+             result, f, ensure_ascii=False
+         )  # ensure_ascii=False for handling non-ASCII characters
+         f.write("\n")
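        # each written line then looks like (illustrative values):
        # {"task_id": 0, "input": "...", "output_org": "...", "output": "..."}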


~~~
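
For plain generation outside the evaluation pipeline above, a minimal sketch (assumed rather than taken from the card; it presumes the tokenizer ships the ChatML chat template used during fine-tuning):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Yuto-24/llm-jp-3-13B-Tengentoppa_magpie"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# "What is the highest mountain in Japan?"
messages = [{"role": "user", "content": "日本で一番高い山は何ですか？"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256)

# decode only the newly generated tokens
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```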

### Downstream Use [optional]
 
 

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

+ - [DeL-TaiseiOzaki/Tengentoppa-sft-v1.0](https://huggingface.co/datasets/DeL-TaiseiOzaki/Tengentoppa-sft-v1.0)
+ - [llm-jp/magpie-sft-v1.0](https://huggingface.co/datasets/llm-jp/magpie-sft-v1.0)
+ - [ntotsuka123/clean3-ultraboros-20k-ja-filter](https://huggingface.co/datasets/ntotsuka123/clean3-ultraboros-20k-ja-filter)

### Training Procedure

+ Trained in two stages with Axolotl, using the YAML configs below.

+ ```yaml: For the first training
+ base_model: llm-jp/llm-jp-3-13b
+ model_type: AutoModelForCausalLM
+ tokenizer_type: AutoTokenizer
+
+ load_in_8bit: false
+ load_in_4bit: false
+ strict: false
+
+ # domain_yyyymmdd
+ output_dir: outputs/matsuo/llm-jp/3/13B/FPFT_20241213
+
+ chat_template: chatml
+ default_system_message: あなたは、大塚商会の誠実で優秀なアシスタントです。  # "You are Otsuka Shokai's sincere and capable assistant."
+
+ shuffle_merged_datasets: true
+ datasets:
+   # # General
+   # - path: data/general/magpie-sft-v1.0.jsonl
+   #   ds_type: json
+   #   type: chat_template
+   #   chat_template: chatml
+   #   field_messages: conversations
+   #   message_field_role: role
+   #   message_field_content: content
+   #   roles:
+   #     user:
+   #       - user
+   #     assistant:
+   #       - assistant
+   #     system:
+   #       - system
+   - path: data/general/Tengentoppa-sft-v1.0.jsonl
+     ds_type: json
+     type: alpaca
+   # - path: data/general/clean3-ultraboros-20k-ja-filter_train.jsonl
+   #   ds_type: json
+   #   type: chat_template
+   #   # chat_template: chatml
+   #   field_messages: conversations
+   #   message_field_role: role
+   #   message_field_content: value
+   #   roles:
+   #     user:
+   #       - human
+   #     assistant:
+   #       - gpt
+   #     system:
+   #       - system
+   #   train_on_eos: turn
+
+ val_set_size: 0.05
+
+ sequence_len: 4096
+ sample_packing: true
+ pad_to_sequence_len: true
+
+ gradient_accumulation_steps: 4
+ micro_batch_size: 1
+ num_epochs: 2
+ optimizer: paged_adamw_8bit
+ lr_scheduler: cosine
+ learning_rate: 0.00002
+
+ train_on_inputs: false
+ group_by_length: false
+ bf16: auto
+ fp16:
+ tf32: true
+
+ gradient_checkpointing: true
+ gradient_checkpointing_kwargs:
+   use_reentrant: false
+ early_stopping_patience:
+ resume_from_checkpoint:
+ logging_steps: 1
+ xformers_attention:
+ flash_attention: true
+
+ # warmup_steps: 100
+ warmup_ratio: 0.1
+ evals_per_epoch: 1
+ eval_table_size:
+ saves_per_epoch: 1
+ debug:
+ deepspeed: deepspeed_configs/zero3.json
+ weight_decay: 0.0
+ fsdp:
+ fsdp_config:
+ special_tokens:
+   eos_token: <|im_end|>
+ ```
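
The second stage below resumes from the first stage's `output_dir` (`outputs/matsuo/llm-jp/3/13B/FPFT_20241213`) as its `base_model`, and swaps the data mix to `magpie-sft-v1.0` plus `clean3-ultraboros-20k-ja-filter`, computing loss only on assistant turns (`roles_to_train: ["gpt", "assistant"]`) and the final EOS token (`train_on_eos: last`).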

+ ```yaml: For the second training
+ base_model: outputs/matsuo/llm-jp/3/13B/FPFT_20241213
+ model_type: AutoModelForCausalLM
+ tokenizer_type: AutoTokenizer
+
+ load_in_8bit: false
+ load_in_4bit: false
+ strict: false
+
+ # domain_yyyymmdd
+ output_dir: outputs/matsuo/llm-jp/3/13B/FPFT_20241215
+
+ chat_template: chatml
+ default_system_message: あなたは、大塚商会の誠実で優秀なアシスタントです。  # "You are Otsuka Shokai's sincere and capable assistant."
+
+ shuffle_merged_datasets: true
+ datasets:
+   - path: data/general/magpie-sft-v1.0.jsonl
+     ds_type: json
+     type: chat_template
+     chat_template: chatml
+     field_messages: conversations
+     message_field_role: role
+     message_field_content: content
+     roles:
+       user:
+         - user
+       assistant:
+         - assistant
+       system:
+         - system
+   # - path: data/general/Tengentoppa-sft-v1.0.jsonl
+   #   ds_type: json
+   #   type: alpaca
+   - path: data/general/clean3-ultraboros-20k-ja-filter_train.jsonl
+     ds_type: json
+     type: chat_template
+     chat_template: chatml
+     field_messages: conversations
+     message_field_role: role
+     message_field_content: value
+     roles:
+       user:
+         - human
+       assistant:
+         - gpt
+       system:
+         - system
+     ## NOTE: Leaving the below empty will default to using the simple legacy tokenization strategy where only last message is trained on.
+     # Optional[List[str]]. Roles to train on. The tokens from these roles will be considered for the loss.
+     roles_to_train: ["gpt", "assistant"]
+     # Optional[str]. Which EOS tokens to train on in the conversation. Possible values are:
+     # - all: train on all EOS tokens
+     # - turn: train on the EOS token at the end of each trainable turn
+     # - last: train on the last EOS token in the conversation
+     train_on_eos: last
+
+ val_set_size: 0.05
+
+ sequence_len: 4096
+ sample_packing: true
+ pad_to_sequence_len: true
+
+ gradient_accumulation_steps: 4
+ micro_batch_size: 1
+ num_epochs: 2
+ optimizer: paged_adamw_8bit
+ lr_scheduler: cosine
+ learning_rate: 0.00002
+
+ train_on_inputs: false
+ group_by_length: false
+ bf16: auto
+ fp16:
+ tf32: true
+
+ gradient_checkpointing: true
+ gradient_checkpointing_kwargs:
+   use_reentrant: false
+ early_stopping_patience:
+ resume_from_checkpoint:
+ logging_steps: 1
+ xformers_attention:
+ flash_attention: true
+
+ # warmup_steps: 100
+ warmup_ratio: 0.1
+ evals_per_epoch: 1
+ eval_table_size:
+ saves_per_epoch: 1
+ debug:
+ deepspeed: deepspeed_configs/zero3.json
+ weight_decay: 0.0
+ fsdp:
+ fsdp_config:
+ special_tokens:
+   eos_token: <|im_end|>
+ ```
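
Assuming a standard Axolotl setup (the exact invocation is not given in the card), each stage can be launched along the lines of `accelerate launch -m axolotl.cli.train <config>.yaml`; DeepSpeed ZeRO-3 is picked up from the `deepspeed` key in both configs.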

## Evaluation

#### Summary

## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

## Model Card Contact

+ [More Information Needed]