Qwen2.5-0.5B-Instruct-ITA

This model is a fine-tuned version of Qwen/Qwen2.5-0.5B-Instruct on the ReDiX/DataForge dataset. It achieves the following results on the evaluation set:

Loss: 1.4100

Model description

This model is an example of fine-tuning a small LLM (sLLM). Results on the Italian evaluation benchmarks improved, and the model learned as expected from the training data.
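As a minimal sketch of how the model might be used for Italian chat, assuming the standard transformers causal-LM API and that the repository ships a chat template: the helper names `build_messages` and `generate` and the default `max_new_tokens=128` are illustrative choices, not part of the original card.

```python
MODEL_ID = "ReDiX/Qwen2.5-0.5B-Instruct-ITA"


def build_messages(user_prompt):
    """Wrap a user prompt in the role/content message format used by chat templates."""
    return [{"role": "user", "content": user_prompt}]


def generate(user_prompt, max_new_tokens=128):
    """Generate a reply to a single user prompt (hypothetical helper)."""
    # Imported lazily so build_messages can be used without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Render the conversation into the prompt string the model expects.
    text = tokenizer.apply_chat_template(
        build_messages(user_prompt), tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated ones.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate("Qual è la capitale d'Italia?")` would download the weights on first use and return the model's reply as a string.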

Intended uses & limitations

More information needed

Training and evaluation data

Tasks	Version	Filter	n-shot	Metric		Value		Stderr
arc_it	2	none	0	acc	↑	0.2378	±	0.0125
		none	0	acc_norm	↑	0.2823	±	0.0132
hellaswag_it	1	none	0	acc	↑	0.3163	±	0.0049
		none	0	acc_norm	↑	0.3800	±	0.0051
m_mmlu_it	0	none	5	acc	↑	0.381	±	0.0042
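To read the Stderr column, one option is to turn each score into an approximate 95% confidence interval. The sketch below assumes a normal approximation (value ± 1.96 × stderr); the dictionary layout and function name are illustrative.

```python
# (task, metric) -> (value, stderr), copied from the table above.
RESULTS = {
    ("arc_it", "acc"): (0.2378, 0.0125),
    ("arc_it", "acc_norm"): (0.2823, 0.0132),
    ("hellaswag_it", "acc"): (0.3163, 0.0049),
    ("hellaswag_it", "acc_norm"): (0.3800, 0.0051),
    ("m_mmlu_it", "acc"): (0.381, 0.0042),
}


def confidence_interval(value, stderr, z=1.96):
    """Return (low, high) bounds under a normal approximation."""
    return value - z * stderr, value + z * stderr


for (task, metric), (value, stderr) in RESULTS.items():
    low, high = confidence_interval(value, stderr)
    print(f"{task:13s} {metric:9s} {value:.4f}  [{low:.4f}, {high:.4f}]")
```

The intervals are noticeably wider for arc_it than for hellaswag_it or m_mmlu_it, which is what the larger standard errors in the table imply.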

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 4
eval_batch_size: 4
seed: 42
gradient_accumulation_steps: 4
total_train_batch_size: 16
optimizer: adamw_bnb_8bit with betas=(0.9,0.999) and epsilon=1e-08 (no additional optimizer arguments)
lr_scheduler_type: cosine
lr_scheduler_warmup_steps: 10
num_epochs: 2
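The relationship between the batch-size values above can be checked with a little arithmetic. This sketch assumes a single training device, which is what the reported total of 16 implies; `num_devices` is not stated in the card, and the token figure is only an upper bound derived from `sequence_len: 4096` in the axolotl config.

```python
train_batch_size = 4             # per-device micro batch size
gradient_accumulation_steps = 4
num_devices = 1                  # assumption: not stated in the card

# Sequences contributing to one optimizer step.
total_train_batch_size = (
    train_batch_size * gradient_accumulation_steps * num_devices
)

# With sample packing and padding to sequence_len, each sequence holds
# at most sequence_len tokens, so this is an upper bound per optimizer step.
sequence_len = 4096
max_tokens_per_step = total_train_batch_size * sequence_len

print(total_train_batch_size)  # 16
print(max_tokens_per_step)     # 65536
```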

See axolotl config

axolotl version: 0.5.0

base_model: Qwen/Qwen2.5-0.5B-Instruct

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: ./dataforge
    type: chat_template

    field_messages: conversations
    message_field_role: from
    message_field_content: value

# chat_template: chatml
dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./outputs/qwen05B

unfrozen_parameters:
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# mlp.down_proj layers
- model.layers.0.mlp.down_proj
- model.layers.23.mlp.down_proj
- model.layers.1.mlp.down_proj
- model.layers.16.mlp.down_proj
- model.layers.4.mlp.down_proj
- model.layers.17.mlp.down_proj
# mlp.gate_proj layers
- model.layers.0.mlp.gate_proj
- model.layers.1.mlp.gate_proj
- model.layers.2.mlp.gate_proj
- model.layers.3.mlp.gate_proj
- model.layers.4.mlp.gate_proj
- model.layers.7.mlp.gate_proj
# mlp.up_proj layers
- model.layers.1.mlp.up_proj
- model.layers.0.mlp.up_proj
- model.layers.3.mlp.up_proj
- model.layers.4.mlp.up_proj
- model.layers.7.mlp.up_proj
- model.layers.9.mlp.up_proj
# self_attn.k_proj layers
- model.layers.18.self_attn.k_proj
- model.layers.7.self_attn.k_proj
- model.layers.19.self_attn.k_proj
- model.layers.2.self_attn.k_proj
- model.layers.6.self_attn.k_proj
- model.layers.9.self_attn.k_proj
# self_attn.o_proj layers
- model.layers.16.self_attn.o_proj
- model.layers.19.self_attn.o_proj
- model.layers.0.self_attn.o_proj
- model.layers.20.self_attn.o_proj
- model.layers.4.self_attn.o_proj
- model.layers.3.self_attn.o_proj
# self_attn.q_proj layers
- model.layers.13.self_attn.q_proj
- model.layers.16.self_attn.q_proj
- model.layers.21.self_attn.q_proj
- model.layers.11.self_attn.q_proj
- model.layers.15.self_attn.q_proj
- model.layers.6.self_attn.q_proj
# self_attn.v_proj layers
- model.layers.2.self_attn.v_proj
- model.layers.3.self_attn.v_proj
- model.layers.4.self_attn.v_proj
- model.layers.5.self_attn.v_proj
- model.layers.7.self_attn.v_proj
- model.layers.8.self_attn.v_proj



sequence_len: 4096
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true


wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_name: qwen2.5-0.5B
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 4
num_epochs: 2
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 1.0e-04

train_on_inputs: false
group_by_length: false
bf16: true
fp16: 
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 5
xformers_attention:
flash_attention: true


warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 128
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|im_end|>"
  eos_token: "<|im_end|>"
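The `unfrozen_parameters` list restricts training to a subset of weights. The sketch below illustrates how such patterns could select parameters, assuming each entry is treated as a regular expression matched from the start of the parameter name; this is an illustration, not axolotl's actual implementation, and the parameter names are hypothetical examples following the Qwen2 naming scheme.

```python
import re

# A few patterns copied from the unfrozen_parameters list in the config.
UNFROZEN_PATTERNS = [
    r"^lm_head.weight$",
    r"^model.embed_tokens.weight$",
    r"model.layers.1.mlp.down_proj",
    r"model.layers.16.mlp.down_proj",
    r"model.layers.2.self_attn.v_proj",
]


def is_unfrozen(param_name, patterns=UNFROZEN_PATTERNS):
    """Return True if any pattern matches from the start of the name."""
    return any(re.match(pattern, param_name) for pattern in patterns)


# Hypothetical parameter names used only to exercise the patterns.
names = [
    "model.embed_tokens.weight",
    "model.layers.1.mlp.down_proj.weight",
    "model.layers.10.mlp.down_proj.weight",
    "model.layers.2.self_attn.v_proj.bias",
    "model.layers.2.self_attn.q_proj.weight",
]

trainable = [name for name in names if is_unfrozen(name)]
frozen = [name for name in names if not is_unfrozen(name)]
```

Under this assumption, the layer 1 pattern does not select layer 10, because the character after `1` must be followed directly by `mlp`; both the weight and the bias of a listed projection are selected, since the patterns are not anchored at the end.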

Training results

Training Loss	Epoch	Step	Validation Loss
No log	0.0013	1	1.7855
1.2567	0.2504	194	1.5639
1.2551	0.5008	388	1.4980
1.1845	0.7512	582	1.4501
1.3178	1.0019	776	1.4252
1.06	1.2523	970	1.4187
1.0697	1.5027	1164	1.4116
1.0362	1.7531	1358	1.4100
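One way to interpret the validation losses above is to convert them to perplexity. The sketch assumes the reported loss is a mean per-token cross-entropy in nats, so that perplexity is `exp(loss)`; the variable and function names are illustrative.

```python
import math

# step -> validation loss, copied from the table above.
VAL_LOSS = {
    1: 1.7855,
    194: 1.5639,
    388: 1.4980,
    582: 1.4501,
    776: 1.4252,
    970: 1.4187,
    1164: 1.4116,
    1358: 1.4100,
}


def perplexity(loss):
    """Perplexity for a mean per-token cross-entropy loss in nats."""
    return math.exp(loss)


first_loss = VAL_LOSS[1]
final_loss = VAL_LOSS[1358]

# Fractional reduction in validation loss over the run.
relative_drop = (first_loss - final_loss) / first_loss

for step, loss in VAL_LOSS.items():
    print(f"step {step:5d}  loss {loss:.4f}  perplexity {perplexity(loss):.3f}")
```

Most of the reduction happens within the first epoch; between steps 970 and 1358 the validation loss changes by less than 0.01.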

Framework versions

Transformers 4.46.2
Pytorch 2.5.1+cu124
Datasets 3.1.0
Tokenizers 0.20.3
