Training script #13
opened by ccdv
hey, thanks for reaching out! I don't have anything on hand at the moment. I'll let you know if I dig through and find it, but essentially I used a variant of the Longformer training notebook, the key enabler being DeepSpeed.
DeepSpeed JSON
Typically I use ZeRO-2 and roll with something like:
{
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "round_robin_gradients": true,
    "contiguous_gradients": true
  },
  "bfloat16": {
    "enabled": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 4000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
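For context on how a config like this plugs in (a minimal sketch, not the exact script used here): when the JSON is passed via the deepspeed argument of the Hugging Face TrainingArguments, the Trainer resolves every "auto" field from its own arguments. The output dir and hyperparameter values below are illustrative placeholders.

    # minimal sketch: the HF Trainer fills in each "auto" field
    # of the DeepSpeed JSON from the matching argument here.
    from transformers import Seq2SeqTrainingArguments

    args = Seq2SeqTrainingArguments(
        output_dir="./out",              # placeholder
        per_device_train_batch_size=1,   # -> train_micro_batch_size_per_gpu
        gradient_accumulation_steps=16,  # -> gradient_accumulation_steps
        learning_rate=3e-5,              # -> optimizer lr
        weight_decay=0.01,               # -> optimizer weight_decay
        max_grad_norm=1.0,               # -> gradient_clipping
        bf16=True,                       # -> bfloat16.enabled
        deepspeed="ds_config.json",      # path to the JSON above
    )
    # build a Seq2SeqTrainer(model=..., args=args, ...) as usual, then
    # launch with the deepspeed launcher, e.g.: deepspeed train.py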
ok thanks
Got the training done at 4096 sequence length; will try up to 16384 tokens now.
Hey, let me know if you have any other questions or issues with training. Feel free to comment here or reopen, or message me on Discord: mrshadow773#0840 :)
pszemraj changed discussion status to closed