leonardlin's Collections
data
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Paper • 2305.13169 • Published • 3
A Survey on Data Selection for Language Models
Paper • 2402.16827 • Published • 4
HuggingFaceFW/fineweb-edu
Viewer • Updated • 3.24B • 163k • 591
Updated • 29.1k • 132
Viewer • Updated • 7.18B • 8.46k • 491
Best Practices and Lessons Learned on Synthetic Data for Language Models
Paper • 2404.07503 • Published • 29
Scaling Synthetic Data Creation with 1,000,000,000 Personas
Paper • 2406.20094 • Published • 98
DDK: Distilling Domain Knowledge for Efficient Large Language Models
Paper • 2407.16154 • Published • 22
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Paper • 2408.02085 • Published • 17
Better Alignment with Instruction Back-and-Forth Translation
Paper • 2408.04614 • Published • 15
The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community
Paper • 2408.08291 • Published • 11