Tulu 3 Datasets Collection All datasets released with Tulu 3 -- state of the art open post-training recipes. • 32 items • Updated 2 days ago • 48
ProX Refining Models Collection Adapted small language models used to generate data refining programs • 5 items • Updated Oct 10 • 2
Magpie-Qwen2 Datasets Collection Dataset built with Qwen2 72B and Qwen2 7B. • 6 items • Updated Sep 14 • 10
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale Paper • 2406.17557 • Published Jun 25 • 86
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence Paper • 2406.11931 • Published Jun 17 • 57
Instruction Pre-Training: Language Models are Supervised Multitask Learners Paper • 2406.14491 • Published Jun 20 • 85
view article Article Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 67
AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct Paper • 2405.14906 • Published May 23 • 23