Collections

Collections including paper arxiv:2404.07503

Synthetic Data Generation

Textbooks Are All You Need

Paper • 2306.11644 • Published Jun 20, 2023 • 142
Textbooks Are All You Need II: phi-1.5 technical report

Paper • 2309.05463 • Published Sep 11, 2023 • 87
TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Paper • 2305.07759 • Published May 12, 2023 • 33
Scaling Synthetic Data Creation with 1,000,000,000 Personas

Paper • 2406.20094 • Published Jun 28 • 95

LLM Synthetic Data

Best Practices and Lessons Learned on Synthetic Data for Language Models

Paper • 2404.07503 • Published Apr 11 • 29

Surveys - Literature Reviews

A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models

Paper • 2406.11289 • Published Jun 17 • 5
Best Practices and Lessons Learned on Synthetic Data for Language Models

Paper • 2404.07503 • Published Apr 11 • 29
Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models

Paper • 2407.12327 • Published Jul 17 • 77
Authorship Attribution in the Era of LLMs: Problems, Methodologies, and Challenges

Paper • 2408.08946 • Published Aug 16 • 11

Best Practices and Lessons Learned on Synthetic Data for Language Models

Paper • 2404.07503 • Published Apr 11 • 29
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

Paper • 2404.03715 • Published Apr 4 • 60
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

Paper • 2406.08464 • Published Jun 12 • 65
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Paper • 2402.13064 • Published Feb 20 • 47

A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

Paper • 2305.13169 • Published May 22, 2023 • 3
A Survey on Data Selection for Language Models

Paper • 2402.16827 • Published Feb 26 • 4
HuggingFaceFW/fineweb-edu

Viewer • Updated Oct 11 • 3B • 619k • 542
allenai/MADLAD-400

Updated Sep 9 • 83.7k • 127

Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

Paper • 2401.16380 • Published Jan 29 • 48
Best Practices and Lessons Learned on Synthetic Data for Language Models

Paper • 2404.07503 • Published Apr 11 • 29
WizardLM: Empowering Large Language Models to Follow Complex Instructions

Paper • 2304.12244 • Published Apr 24, 2023 • 13
Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Paper • 2402.13064 • Published Feb 20 • 47

Synthetic (text) Dataset Generation

Papers about synthetic dataset generation

Better Synthetic Data by Retrieving and Transforming Existing Datasets

Paper • 2404.14361 • Published Apr 22 • 1
Generative AI for Synthetic Data Generation: Methods, Challenges and the Future

Paper • 2403.04190 • Published Mar 7
Best Practices and Lessons Learned on Synthetic Data for Language Models

Paper • 2404.07503 • Published Apr 11 • 29
A Multi-Faceted Evaluation Framework for Assessing Synthetic Data Generated by Large Language Models

Paper • 2404.14445 • Published Apr 20

Synthetic Data and Self-Improvement

Self-Rewarding Language Models

Paper • 2401.10020 • Published Jan 18 • 144
Self-Discover: Large Language Models Self-Compose Reasoning Structures

Paper • 2402.03620 • Published Feb 6 • 109
OS-Copilot: Towards Generalist Computer Agents with Self-Improvement

Paper • 2402.07456 • Published Feb 12 • 41
Learning From Mistakes Makes LLM Better Reasoner

Paper • 2310.20689 • Published Oct 31, 2023 • 28

HuggingFaceTB/cosmopedia

Viewer • Updated Aug 12 • 31.1M • 8.76k • 563
Best Practices and Lessons Learned on Synthetic Data for Language Models

Paper • 2404.07503 • Published Apr 11 • 29

Best Practices and Lessons Learned on Synthetic Data for Language Models

Paper • 2404.07503 • Published Apr 11 • 29
Better Synthetic Data by Retrieving and Transforming Existing Datasets

Paper • 2404.14361 • Published Apr 22 • 1
Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Paper • 2409.08239 • Published Sep 12 • 16
