---
license: apache-2.0
base_model:
  - meta-llama/Llama-3.1-8B-Instruct
---

A preview version of FuseChat-3.0, currently under testing.

## Training configs

```yaml
# Model arguments
model_name_or_path: AALF/FuseChat-Llama-3.1-8B-SFT
torch_dtype: null
attn_implementation: flash_attention_2

# Data training arguments
dataset_mixer: FuseChat-Mixture-v3-DPO
dataset_splits:
- train
- test
preprocessing_num_workers: 12

# DPOTrainer arguments
bf16: true
beta: 10
avg_logp: true
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: False
hub_model_id: wrpo-models
learning_rate: 8.0e-7
log_level: info
logging_steps: 5
lr_scheduler_type: cosine
max_length: 2048
max_prompt_length: 1800
num_train_epochs: 1
optim: adamw_torch
output_dir: outputs/FuseChat-Llama-3.1-8B-Instruct
run_name: FuseChat-Llama-3.1-8B-Instruct
per_device_train_batch_size: 2
per_device_eval_batch_size: 4
push_to_hub: false
save_strategy: "steps"
save_steps: 101
save_total_limit: 20
seed: 42
warmup_ratio: 0.1
save_only_model: true
```
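The `beta: 10` and `avg_logp: true` arguments suggest a length-averaged variant of the DPO objective, where summed sequence log-probabilities are divided by sequence length before the preference margin is computed (which makes a beta far above the usual ~0.1 workable). A minimal sketch under that assumption, with hypothetical argument names:

```python
import math

def avg_dpo_loss(policy_chosen_logp, policy_rejected_logp,
                 ref_chosen_logp, ref_rejected_logp,
                 chosen_len, rejected_len, beta=10.0):
    """Sketch of a length-averaged DPO loss for one preference pair.

    Inputs are token-summed log-probabilities; `avg_logp: true` is
    assumed to mean each sum is divided by its sequence length
    before entering the standard DPO margin.
    """
    # Per-token average log-prob margins for policy and reference models.
    pi_margin = policy_chosen_logp / chosen_len - policy_rejected_logp / rejected_len
    ref_margin = ref_chosen_logp / chosen_len - ref_rejected_logp / rejected_len
    logits = beta * (pi_margin - ref_margin)
    # Standard DPO objective: -log sigmoid(logits) = log(1 + exp(-logits)).
    return math.log1p(math.exp(-logits))
```

At a zero margin the loss is `log 2`; as the policy separates chosen from rejected responses faster than the reference model, the loss decays toward zero.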

## Evaluation Results

| Datasets | Llama-3.1-8B-Instruct | FuseChat-Llama-3.1-8B-SFT | FuseChat-Llama-3.1-8B-Instruct |
|---|---|---|---|
| AlpacaEval-2 (LC/WR) | 28.3/28.7 | 41.3/37.7 | 65.4/63.3 |
| Arena-Hard (WR/SC) | 28.1/23.8 | 38.7/29 | 58.2/46.4 |
| MT-Bench | 8.38 | 8.54 | 9 |
| AlignBench v1.1 | 4.61 | 6.25 | 6.69 |
| LiveBench 0831 | 27.6 | 30.2 | 32 |
| GSM8K | 85.9 | 87 | 88 |
| MATH | 50.7 | 54.7 | 55.2 |
| AMC 23 | 25 | 30 | 37.5 |
| MMLU-Pro | 50 | 47.8 | 49.2 |
| MMLU-redux | 67.2 | 68.4 | 69.2 |
| GPQA-Diamond | 33.8 | 37.9 | 34.9 |
| HumanEval | 69.5 | 69.5 | 71.3 |
| MBPP | 75.4 | 71.4 | 72 |
| LiveCodeBench 2408-2411 (all/easy) | 12.3/40.5 | 12.6/39 | 13.1/43.2 |