---
license: apache-2.0
datasets:
- euclaise/SuperMC
- euclaise/prm800k_preferences
---

Experiments in large-scale small-scale preference learning.

falcon-rw-1b trained with PRO (Preference Ranking Optimization, see https://arxiv.org/abs/2306.17492) on SuperMC and PRM800K for 3 epochs, using my supertrainer2000 framework.

This is an experimental model. Benchmarks coming soon.

Hyperparameters:
- AdamW, weight decay of 0.01, otherwise default hyperparameters
- Maximum LR of 1e-5
- Cosine schedule with a warmup of 5400 steps
- Batch size of 4 (2 real x 2 accumulated)
- Maximum of 5 epochs with early stopping (by visual observation); stopped after 3
- Gradient clipping norm of 1.0
- PRO beta of 4

Training prompt format:
```
### Query
[insert instruction here]

### Answer
[insert response here]
```
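For intuition, the PRO objective treats each prefix of a ranked response list as a softmax classification: the k-th best response should outscore all lower-ranked ones. A minimal sketch over scalar scores follows; the exact way beta enters the loss here is an assumption about this training setup, not taken from the paper or supertrainer2000:

```python
import math

def pro_loss(scores, beta=4.0):
    # scores: model scores (e.g. sequence log-probs) for responses,
    # ordered best-first according to the preference ranking.
    # For each rank k, compute -log softmax(beta * scores[k:])[0],
    # i.e. the top remaining response should beat all worse ones.
    loss = 0.0
    for k in range(len(scores) - 1):
        scaled = [beta * s for s in scores[k:]]
        m = max(scaled)  # subtract max for numerical stability
        log_denom = m + math.log(sum(math.exp(s - m) for s in scaled))
        loss += -(scaled[0] - log_denom)
    return loss / (len(scores) - 1)
```

Larger margins between ranked responses drive the loss toward zero, which is the behavior the ranking objective rewards.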
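At inference time, prompts should match the training format above. A small helper like the following (the function name is illustrative, not part of the model) builds such a prompt:

```python
def format_prompt(instruction: str) -> str:
    # Wrap a user instruction in the model's training prompt format,
    # leaving the "### Answer" section open for the model to complete.
    return f"### Query\n{instruction}\n\n### Answer\n"
```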