See axolotl config

axolotl version: 0.4.1

base_model: Dans-DiscountModels/Meta-Llama-3.1-8B-ChatML
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

trust_remote_code:

# wandb configuration
wandb_project: l3.1-8b-dans-instruct
wandb_watch:
wandb_run_id:
wandb_log_model: 

# where to save the finished model to
output_dir: ./l3.1-8b-dans-instruct

# dataset settings (local or huggingface repo)
datasets:
  - path: PocketDoc/Dans-MemoryCore-CoreCurriculum-Small
    type: sharegpt
    conversation: chatml
  - path: AquaV/Energetic-Materials-Sharegpt
    type: sharegpt
    conversation: chatml
  - path: AquaV/Chemical-Biological-Safety-Applications-Sharegpt
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Mathmaxx
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Benchmaxx
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Codemaxx
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Taskmaxx
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-ASCIIMaxx-Wordart
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Prosemaxx
    type: sharegpt
    conversation: chatml
  - path: PocketDoc/Dans-Toolmaxx
    type: sharegpt
    conversation: chatml

chat_template: chatml

plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: true

load_in_8bit: false
load_in_4bit: false
strict: false

dataset_prepared_path: ./l3.1-8b-dans-instruct-data
val_set_size: 0.03

lora_model_dir: 

sequence_len: 8192

# Use efficient multi-packing with block-diagonal attention and per-sequence position_ids. Setting this to 'true' is recommended.
sample_packing: true
eval_sample_packing: true

# You can set these packing optimizations after running training at least once.
# The trainer will then report recommended values for them.

pad_to_sequence_len: true

#rope_scaling:
  #type:  # linear | dynamic
  #factor:  # float (2 for 2x)

adapter: # blank for full finetune
lora_r: 64
lora_alpha: 64
lora_dropout: 0.2
lora_target_linear: True
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - down_proj
  - up_proj
lora_modules_to_save:
  - embed_tokens
  - lm_head
lora_fan_in_fan_out:

gradient_accumulation_steps: 32
micro_batch_size: 1
num_epochs: 3
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0000015
cosine_min_lr_ratio: 

train_on_inputs: false
group_by_length: true
bf16: true
fp16: false
tf32: false

gradient_checkpointing: unsloth
early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: false
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 15
eval_steps: 25
# save_steps: 100
saves_per_epoch: 3
debug: false
deepspeed:
weight_decay: 0.1
fsdp:
fsdp_config:


special_tokens:
  pad_token: <|finetune_right_pad_id|>
  eos_token: <|im_end|>

l3.1-8b-dans-instruct

This model is a fine-tuned version of Dans-DiscountModels/Meta-Llama-3.1-8B-ChatML, trained on the datasets listed in the axolotl config above. It achieves the following results on the evaluation set:

Loss: 0.7432

Model description

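A full fine-tune (the adapter field in the config above is left blank) of Dans-DiscountModels/Meta-Llama-3.1-8B-ChatML, trained in bf16 with the ChatML chat template at a sequence length of 8192.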

Intended uses & limitations

More information needed
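
The config above sets chat_template: chatml and uses <|im_end|> as the EOS token, so prompts presumably need to follow the ChatML layout. Below is a minimal sketch of that layout in plain Python. The helper name to_chatml and the sample messages are illustrative and not part of the model repository; if the chat template bundled with the tokenizer differs, treat that template as authoritative.

```python
def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML prompt string.

    Hypothetical helper for illustration only.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped as <|im_start|>role\ncontent<|im_end|>\n
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = to_chatml(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize ChatML in one sentence."},
    ]
)
print(prompt)
```

When working with Transformers, calling tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) is usually the safer route, since it uses whatever template ships with the tokenizer.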

Training and evaluation data

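Training used the ten ShareGPT-format datasets listed under datasets in the axolotl config above, rendered with the ChatML conversation template. A 3% split (val_set_size: 0.03) was held out for evaluation.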
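
Every dataset in the config above is declared as type: sharegpt with conversation: chatml. As a rough illustration of what that implies, the sketch below maps a ShareGPT-style record onto role/content messages. It assumes the common ShareGPT layout (a conversations list whose turns carry from and value fields); the actual field names in these datasets were not verified, and the function name and sample record are hypothetical.

```python
# Assumed ShareGPT speaker tags mapped to ChatML role names.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}


def sharegpt_to_messages(record):
    """Convert one ShareGPT-style record into a list of role/content dicts."""
    messages = []
    for turn in record["conversations"]:
        role = ROLE_MAP.get(turn["from"])
        if role is None:
            # Fail loudly on tags this sketch does not know about.
            raise ValueError(f"unknown speaker tag: {turn['from']!r}")
        messages.append({"role": role, "content": turn["value"]})
    return messages


# Made-up record, for illustration only.
example = {
    "conversations": [
        {"from": "human", "value": "What is 2 + 2?"},
        {"from": "gpt", "value": "4"},
    ]
}
print(sharegpt_to_messages(example))
```

With train_on_inputs: false in the config, only the assistant turns would be expected to contribute to the loss; this sketch does not model that masking.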

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1.5e-06
train_batch_size: 1
eval_batch_size: 1
seed: 42
gradient_accumulation_steps: 32
total_train_batch_size: 32
optimizer: AdamW (adamw_torch in the config) with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 15
num_epochs: 3
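
The batch-size and scheduler entries above can be checked with a little arithmetic. The sketch below is an approximation under stated assumptions, not the trainer's actual code: a single device is inferred from total_train_batch_size being 32, total_steps=388 is a rough estimate (about 130 optimizer steps per epoch over 3 epochs), and the schedule is the common linear-warmup-then-cosine-decay formula with no minimum learning rate. The function names are illustrative.

```python
import math


def effective_batch_size(micro_batch_size, grad_accum_steps, num_devices=1):
    """Sequences contributing to each optimizer step."""
    return micro_batch_size * grad_accum_steps * num_devices


def cosine_lr(step, base_lr=1.5e-6, warmup_steps=15, total_steps=388):
    """Learning rate at a given optimizer step.

    Linear warmup from 0 to base_lr, then cosine decay to 0.
    total_steps is an assumed value, not taken from the training logs.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


batch = effective_batch_size(micro_batch_size=1, grad_accum_steps=32)
# With sequence_len 8192 and pad_to_sequence_len enabled, each row is 8192 tokens.
tokens_per_step = batch * 8192
print(f"effective batch size: {batch}, tokens per optimizer step: {tokens_per_step}")

for step in (0, 15, 100, 200, 300, 388):
    print(f"step {step:>3}: lr = {cosine_lr(step):.3e}")
```

Under these assumptions the learning rate peaks at 1.5e-06 once warmup ends at step 15 and falls toward zero by the final step.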

Training results

Training Loss	Epoch	Step	Validation Loss
1.0783	0.0077	1	1.0298
0.8528	0.1931	25	0.8603
0.7776	0.3862	50	0.7925
0.7089	0.5793	75	0.7697
0.6868	0.7724	100	0.7584
0.7158	0.9655	125	0.7524
0.6938	1.1566	150	0.7488
0.7330	1.3499	175	0.7464
0.7956	1.5433	200	0.7450
0.6886	1.7366	225	0.7442
0.9065	1.9299	250	0.7437
0.7851	2.1210	275	0.7434
0.7256	2.3142	300	0.7433
0.7832	2.5074	325	0.7432
0.7317	2.7006	350	0.7432
0.7112	2.8937	375	0.7432
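
For readers who want to work with the table above programmatically, here is a small sketch that finds the best validation loss and estimates the number of optimizer steps per epoch from the logged step and epoch columns. The rows are copied from the table; the function names are illustrative, and the steps-per-epoch figure is only an estimate because the logged epoch values are rounded.

```python
# Rows copied from the results table:
# (training loss, epoch, step, validation loss)
ROWS = [
    (1.0783, 0.0077, 1, 1.0298),
    (0.8528, 0.1931, 25, 0.8603),
    (0.7776, 0.3862, 50, 0.7925),
    (0.7089, 0.5793, 75, 0.7697),
    (0.6868, 0.7724, 100, 0.7584),
    (0.7158, 0.9655, 125, 0.7524),
    (0.6938, 1.1566, 150, 0.7488),
    (0.7330, 1.3499, 175, 0.7464),
    (0.7956, 1.5433, 200, 0.7450),
    (0.6886, 1.7366, 225, 0.7442),
    (0.9065, 1.9299, 250, 0.7437),
    (0.7851, 2.1210, 275, 0.7434),
    (0.7256, 2.3142, 300, 0.7433),
    (0.7832, 2.5074, 325, 0.7432),
    (0.7317, 2.7006, 350, 0.7432),
    (0.7112, 2.8937, 375, 0.7432),
]


def best_eval(rows):
    """Return the first row with the lowest validation loss."""
    return min(rows, key=lambda row: row[3])


def steps_per_epoch(rows):
    """Estimate optimizer steps per epoch from the last logged row."""
    _, epoch, step, _ = rows[-1]
    return step / epoch


best = best_eval(ROWS)
print(f"best validation loss {best[3]:.4f}, first reached at step {best[2]}")
print(f"approx. optimizer steps per epoch: {steps_per_epoch(ROWS):.1f}")
```

The evaluation rows fall every 25 steps, matching eval_steps: 25 in the config, and the validation loss stops changing at four decimal places from step 325 onward.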

Framework versions

Transformers 4.44.2
Pytorch 2.4.0+cu121
Datasets 2.20.0
Tokenizers 0.19.1

