distily_experiments_loss_reverse_kl

This student model is distilled from the teacher model Qwen/Qwen2-0.5B-Instruct using the dataset (unspecified).

The Distily library was used for this distillation.

It achieves the following results on the evaluation set:

Training procedure

The following hyperparameters were used during training:

Peak GPU Memory: 19.8832 GB

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	zhwikippl
teacher eval		13.0697	11.6518					21.6262
0	0	180187.8438	182062.6875	131.8108	90.6539	11.031	2.758	181762.375
500	0.0808	14699.2041	52797.9922	6.0418	90.8884	11.003	2.751	371252.0312
1000	0.1616	8812.4561	47709.9297	4.9882	90.8533	11.007	2.752	384212.3438
1500	0.2424	7321.3081	44922.375	4.6195	90.7179	11.023	2.756	400192.5625
2000	0.3232	6277.4165	42254.6719	4.2012	90.8257	11.01	2.753	423631.0938
2500	0.4040	5452.0264	39927.7812	3.9955	90.7803	11.016	2.754	445022.5938
3000	0.4848	4708.5049	37660.8359	3.7784	90.8232	11.01	2.753	447453.4375
3500	0.5657	4329.6147	35350.4805	3.6816	90.8654	11.005	2.751	455292.8125
4000	0.6465	3840.0864	33493.6836	3.5800	90.7858	11.015	2.754	446474.3125
4500	0.7273	3495.4482	31764.3340	3.4447	90.8083	11.012	2.753	447611.3438
5000	0.8081	3245.5376	30812.8379	3.3323	90.7976	11.014	2.753	448982.8438
5500	0.8889	3057.9595	29516.0742	3.2926	90.7385	11.021	2.755	459842.8125
6000	0.9697	2831.3643	28517.0625	3.1956	90.7677	11.017	2.754	441979.4375
6187	0.9999	2760.3779	28158.2578	3.1654	90.8509	11.007	2.752	441247.4688