
About hyper-parameters

#1
by ohwi - opened

Hello, and thanks for open-sourcing these great models.

I have a question regarding the hyperparameters used for instruction tuning.

Did you employ the hyperparameters outlined in the Qwen paper for instruction tuning?

The model’s training process utilizes the AdamW optimizer, with the following hyperparameters: β₁ set to 0.9, β₂ set to 0.95, and ε set to 10⁻⁸. The sequence length is limited to 2048, and the batch size is 128. The model undergoes a total of 4000 steps, with the learning rate gradually increased over the first 1430 steps, reaching a peak of 2 × 10⁻⁶. To prevent overfitting, weight decay is applied with a value of 0.1, dropout is set to 0.1, and gradient clipping is enforced with a limit of 1.0.
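
In case it helps make my question concrete, here is a minimal sketch of how I would map those quoted values onto Hugging Face `TrainingArguments`. The argument names are standard Transformers parameters, but the mapping (single device, linear warmup schedule, output path) is my own assumption rather than anything stated in the paper.

```python
from transformers import TrainingArguments

# Sketch only: maps the quoted Qwen-paper hyperparameters onto
# TrainingArguments under assumptions noted in the comments.
training_args = TrainingArguments(
    output_dir="qwen-instruction-tuning",  # hypothetical output path
    max_steps=4000,                        # total optimizer steps
    per_device_train_batch_size=128,       # assumes the batch size of 128 fits on one device
    learning_rate=2e-6,                    # peak learning rate
    lr_scheduler_type="linear",            # exact decay schedule is not stated in the paper
    warmup_steps=1430,                     # learning rate increased over the first 1430 steps
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-8,
    weight_decay=0.1,
    max_grad_norm=1.0,                     # gradient clipping at 1.0
)
# Note: the sequence length (2048) is enforced at tokenization time,
# and dropout (0.1) belongs in the model config, not in TrainingArguments.
```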

Thanks!
