Optimizing Pretraining Data Mixes with LLM-Estimated Utility

Published January 22, 2025

Figure 1: Scaling comparison of our proposed data mixing method with popular baselines.

Why Does Pretraining Data Mixing Matter?

Training Large Language Models (LLMs) requires massive datasets. This often means combining diverse sources—such as web data, academic papers, and programming code. However, not all datasets contribute equally to model performance. The challenge is: How do we decide the optimal mix of data to train on, given computational constraints?

Previous works have explored data mixing through manual curation, heuristic-based sampling, or learned data mixing models. In our extensive baseline evaluations, we find that the simplest heuristic-based sampling methods surprisingly outperform prior work, despite assuming all data sources are equally useful. In this work, we introduce UtiliMax and Model Estimated Data Utility (MEDU), methods that instead automatically estimate data utility and use it to compute improved data mixes.


The Problem: How to Choose the Best Data Mix?

Given varied datasets, a key challenge is allocating the training budget across datasets to maximize model performance. For example, in this work we use Dolma V1.7, the dataset used to train OLMo, which comprises 15 sources from various domains.

In addressing the data mixing problem, prior works have explored:

  • Manual curation – Experts decide how much data to include from each source.
  • Heuristic-based sampling – Methods like proportional token allocation.
  • Learned data mixing models – Data distribution is dynamically adjusted during training.

However, prior methods have not been compared in a controlled setup, so it is unclear how they stack up! To address this, we first run extensive baseline comparisons. We find that the simplest approach, UniMax, which is derived purely from dataset sizes, surprisingly outperforms all other prior methods!

Figure 2: Comparison of baseline data mixing methods. UniMax outperforms other baselines in both compute- and data-constrained settings.

Notably, UniMax does not distinguish between the quality or domains of different datasets. In the rest of our work, we explore whether incorporating these factors, summarized as the dataset utility, leads to even stronger results.
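
To make the baseline concrete, below is a minimal sketch of a UniMax-style allocation: split the token budget uniformly across datasets, cap each dataset at a fixed number of epochs, and redistribute the freed budget to the larger datasets. The `max_epochs` value and the dataset names are illustrative assumptions, not settings from the paper.

```python
def unimax(sizes, budget, max_epochs=4):
    """Sketch of a UniMax-style token allocation.

    Splits a total token budget uniformly across datasets, capping each
    dataset at `max_epochs` repetitions and redistributing any leftover
    budget among the remaining (larger) datasets.

    Args:
        sizes: dict mapping dataset name -> dataset size in tokens.
        budget: total training budget in tokens.
        max_epochs: repetition cap per dataset (illustrative default).

    Returns:
        dict mapping dataset name -> allocated training tokens.
    """
    allocation = {}
    remaining_budget = budget
    # Visit datasets from smallest to largest so repetition caps bind
    # early and the freed budget flows to the larger datasets.
    pending = sorted(sizes, key=sizes.get)
    for i, name in enumerate(pending):
        uniform_share = remaining_budget / (len(pending) - i)
        allocation[name] = min(uniform_share, max_epochs * sizes[name])
        remaining_budget -= allocation[name]
    return allocation

# Toy example: one small curated corpus and two larger corpora.
print(unimax({"papers": 5e9, "code": 50e9, "web": 200e9}, budget=300e9))
```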


Our Approach: UtiliMax and MEDU

UtiliMax: Balancing Utility, Diversity, and Scale

UtiliMax extends heuristic-based data mixing by incorporating utility estimates from small-scale experiments on individual datasets. We then frame finding an optimal data mix as a portfolio optimization problem. In financial portfolios, an asset's utility is its expected return, and a portfolio's risk is a function of how diversified it is! Using the convex optimization tools designed for portfolio optimization, we derive a method that balances sampling all datasets, favoring high-utility datasets, and avoiding excessive repetition of small datasets.
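
As a rough illustration of the portfolio framing, the sketch below solves a Markowitz-style problem with `cvxpy`: reward high-utility datasets, penalize concentrated mixes to encourage diversity, and cap repetition of small datasets. The utility values, the `risk_aversion` weight, and the exact objective are assumptions for illustration, not the paper's precise formulation.

```python
import cvxpy as cp
import numpy as np

# Hypothetical inputs: per-dataset utility estimates (e.g., from small-scale
# ablations) and dataset sizes in tokens. Values are illustrative only.
utility = np.array([0.8, 0.5, 0.3])   # estimated utility per dataset
sizes = np.array([5e9, 50e9, 200e9])  # dataset sizes in tokens
budget = 100e9                        # total training tokens
max_epochs = 4                        # repetition cap per dataset
risk_aversion = 1.0                   # strength of the diversity penalty

w = cp.Variable(len(utility))         # sampling weights over datasets

# Markowitz-style objective: favor high-utility datasets (expected return)
# while penalizing concentrated mixes (the quadratic term rewards diversity).
objective = cp.Maximize(utility @ w - risk_aversion * cp.sum_squares(w))

constraints = [
    w >= 0,
    cp.sum(w) == 1,                   # weights form a distribution
    budget * w <= max_epochs * sizes, # avoid over-repeating small datasets
]

cp.Problem(objective, constraints).solve()
print(dict(zip(["papers", "code", "web"], np.round(w.value, 3))))
```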

We find that this formulation consistently outperforms alternative optimization procedures such as greedy sampling of high-utility data or UniMax.

Figure 3: UtiliMax vs. alternative optimization procedures. UtiliMax consistently outperforms other approaches.

MEDU: LLM-Based Utility Estimation

While UtiliMax improves efficiency, running ablation studies on every dataset is computationally expensive. MEDU leverages existing LLMs to remove this cost by estimating the usefulness of training data without additional training runs, leading to ~200× lower computational cost than ablation-based methods.

Figure 4: MEDU compared with more costly data mixes derived directly from ablations.

MEDU first uses an LLM to describe the high-level skills and knowledge a domain requires, based on benchmark questions. It then uses this description to classify documents from individual datasets into utility categories (Great, Good, Okay, Poor, Useless). Because only a small sample of documents is needed, this lets us estimate the utility of each dataset without training new models!
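
The sketch below shows one plausible shape for this pipeline. The prompt wording, the `llm` callable, the 0-to-1 score mapping, and the sample size are all hypothetical stand-ins rather than the paper's actual implementation.

```python
import random

# Ordinal utility categories MEDU asks the LLM to assign to documents.
# The numeric mapping to [0, 1] is an assumption made for illustration.
CATEGORY_SCORES = {"Great": 1.0, "Good": 0.75, "Okay": 0.5,
                   "Poor": 0.25, "Useless": 0.0}

def classify_document(llm, benchmark_description, document):
    """Ask the LLM to rate one document against the benchmark description.

    `llm` is a hypothetical callable mapping a prompt string to a
    completion string (e.g., a thin wrapper around any chat API).
    """
    prompt = (
        "Skills and knowledge needed for the target benchmarks:\n"
        f"{benchmark_description}\n\n"
        f"Document:\n{document}\n\n"
        "Rate how useful this document would be as training data for those "
        "skills. Answer with exactly one of: Great, Good, Okay, Poor, Useless."
    )
    return llm(prompt).strip()

def estimate_dataset_utility(llm, benchmark_description, documents,
                             sample_size=500):
    """Estimate a dataset's utility from a small random sample of documents."""
    sample = random.sample(documents, min(sample_size, len(documents)))
    labels = [classify_document(llm, benchmark_description, d) for d in sample]
    scores = [CATEGORY_SCORES.get(label, 0.0) for label in labels]
    return sum(scores) / len(scores)
```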

Figure 5: A high-level view of the MEDU pipeline.


Key Findings

1. Simple Heuristics Often Outperform Complex Methods

UniMax, which only balances data diversity and repetition constraints, outperforms many manual and learned data mixing models. This suggests that many more complex data mixing methods do not capture true training dynamics!

2. UtiliMax Provides Significant Compute Savings

By using small-scale utility estimates, UtiliMax allows for data allocations that lead to better models in fewer FLOPs.

3. LLMs Can Estimate Data Utility Effectively

MEDU replaces costly ablation studies while achieving comparable performance, making data selection faster and cheaper.

4. Diversity and Scale Matter for Generalization

A mix that prioritizes dataset diversity and size leads to better results than focusing solely on utility scores.


Implications and Future Work

These findings lay the groundwork for automated, compute-efficient data mixing that adapts to both compute- and data-constrained training settings and can be computed in advance of training.

We anticipate that future research will uncover new signals for high-quality data, such as loss correlation across open-source models from Thrush et al. The UtiliMax optimization procedure offers a principled way to incorporate these signals for even better results.

For more details, check out our full research paper here.

