LLM Routing for Batched Instructions

Community Article Published March 19, 2025

Introduction

Large Language Models (LLMs) have become indispensable for numerous AI applications. However, using LLMs at scale presents a critical tradeoff between performance and cost. No single LLM is optimal for all tasks, and selecting the most appropriate model per instruction while managing budget constraints is a fundamental challenge.

In this report, we describe a predictive routing approach to assign instructions to LLMs that are both cost-effective and highly performant. We present three allocation strategies:

  1. Threshold-based assignment: Selecting the cheapest LLMs that satisfy a target performance threshold.
  2. Fixed budget allocation using a non-decreasing convex hull (NDCH): Determining the best mixture of LLMs to maximize performance under a cost constraint.
  3. Optimized budget allocation using a Mixed Integer Program (MIP): Formulating the allocation as a constrained optimization problem to maximize average precision.

We evaluate these strategies on real-world datasets and provide insights into their efficacy.

Problem Definition

Given a batch of instructions $I = \{i_1, i_2, ..., i_N\}$, our goal is to assign each instruction $i_j$ to an LLM $M_k$ from a pool of available models $M = \{M_1, M_2, ..., M_K\}$, while optimizing for either:

  • Performance threshold: Selecting the cheapest models that meet a desired accuracy.
  • Fixed budget: Maximizing performance within a given budget.

For this purpose, we use a multi-label classifier trained to predict success probabilities for each LLM per instruction, calibrated using temperature scaling and histogram bucketing.

Predicting LLM Performances

Given a dataset of instructions and recorded LLM performance metrics, we can frame the task of predicting an LLM's probability of success for a given instruction as a multi-label classification problem. This is analogous to training a separate binary classifier for each LLM, but with most model parameters shared across classifiers, enabling efficient learning.

Our classifier is built on an embedding model, such as a BERT-style masked language model fine-tuned for sentence embeddings. A classification head is added on top to estimate the probability of success for each LLM.
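
For concreteness, here is a minimal sketch of such a classifier in PyTorch. The backbone name, pooling choice, and head dimensions are illustrative placeholders rather than the exact configuration used in our experiments:

import torch
import torch.nn as nn
from transformers import AutoModel

class LLMSuccessClassifier(nn.Module):
    """Multi-label classifier: one sigmoid output per LLM in the pool."""
    def __init__(self, backbone_name: str, num_llms: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone_name)  # shared sentence encoder
        self.head = nn.Linear(self.encoder.config.hidden_size, num_llms)  # one logit per LLM

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # mean-pool token embeddings into a single sentence embedding
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.head(pooled)  # raw logits; apply sigmoid for per-LLM success probabilities

# Training treats each LLM as an independent binary label sharing the same encoder:
# loss = nn.BCEWithLogitsLoss()(model(ids, mask), success_labels)  # labels: (batch, num_llms) in {0, 1}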

To improve the reliability of probability estimates, a portion of the training set is reserved as a calibration dataset. Temperature scaling is applied to adjust the model's confidence scores, ensuring they better align with real-world performance. This step reduces overconfidence and enhances interpretability.

Once calibrated, the range of success probabilities is divided into discrete buckets. For each bucket, we compute the expected success rate of each LLM. This process transforms raw probability scores into more realistic and interpretable success estimates, allowing for better-informed routing decisions.
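
As an illustrative sketch of this calibration step (the bucket count, optimizer, and fallback for empty buckets are our assumptions, not a prescribed setup), we first fit a single temperature on the held-out calibration split and then map each calibrated probability to the observed success rate of its bucket:

import numpy as np
import torch

def fit_temperature(logits, labels, lr=0.01, steps=200):
    """Fit a single temperature on the calibration split by minimizing binary cross-entropy.
    logits, labels: float tensors of the same shape (instructions x LLMs)."""
    temperature = torch.ones(1, requires_grad=True)
    optimizer = torch.optim.Adam([temperature], lr=lr)
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = loss_fn(logits / temperature, labels)
        loss.backward()
        optimizer.step()
    return temperature.item()

def bucket_success_rates(probs, labels, num_buckets=10):
    """Compute the observed success rate per probability bucket.
    probs, labels: 1-D NumPy arrays for a single LLM on the calibration split."""
    edges = np.linspace(0.0, 1.0, num_buckets + 1)
    rates = np.zeros(num_buckets)
    for b in range(num_buckets):
        in_bucket = (probs >= edges[b]) & (probs < edges[b + 1])
        # fall back to the bucket midpoint when no calibration sample lands in it
        rates[b] = labels[in_bucket].mean() if in_bucket.any() else (edges[b] + edges[b + 1]) / 2
    return edges, rates  # at routing time, look up a probability's bucket and use its rate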

Allocation Strategies

Once we have a classifier that estimates success probabilities for each LLM on a given instruction, we can leverage these probabilities to guide the allocation of instructions to different LLMs using the following strategies.

1. Assignment based on Performance Thresholds

Given a target performance threshold (e.g., 0.75 accuracy), we use the calibrated classifier to identify the probability buckets whose expected success rate meets the target. For each instruction, we then select the cheapest LLM whose predicted success rate clears the threshold.

Pseudo-Code:

for instruction in batch:
    # models whose calibrated success probability meets the target threshold
    candidate_models = [M_k for M_k in models if success_prob(instruction, M_k) >= threshold]
    # if no model clears the threshold, fall back to the most confident one
    if not candidate_models:
        candidate_models = [max(models, key=lambda m: success_prob(instruction, m))]
    best_model = min(candidate_models, key=lambda m: cost[m])  # cheapest qualifying model
    assign(instruction, best_model)

2. Fixed Budget Allocation using NDCH

A non-decreasing convex hull (NDCH) is a method for constructing an optimal cost-performance tradeoff curve when selecting among different models. It ensures that as cost increases, performance does not decrease, forming a monotonically increasing frontier of achievable tradeoffs. Given a set of LLMs with varying costs and success rates, NDCH identifies the best possible allocation by linearly interpolating between models, enabling efficient budget allocation while maximizing performance.

NDCH works by filtering out models that are strictly dominated—those that are both more expensive and less accurate than another available model. It then constructs a convex hull over the remaining cost-performance points, ensuring that every point on the curve represents an optimal tradeoff between cost and accuracy. When given a fixed budget, NDCH determines the ideal proportion of instructions to be assigned to different models, balancing low-cost, lower-performance models with more expensive, high-accuracy ones. This method guarantees that any additional spending leads to meaningful improvements in performance, avoiding inefficient allocations.

Using a non-decreasing convex hull approach, we:

  • Construct a cost-performance tradeoff curve for all LLMs.
  • Identify the best mix of cheap and expensive LLMs that maximizes performance while satisfying the budget constraint.

Pseudo-Code:

convex_hull = compute_non_decreasing_convex_hull(cost_performance_points)
budget_allocation = determine_mixture(convex_hull, total_budget)
for instruction in batch:
    model = sample_from_allocation(budget_allocation)
    assign(instruction, model)
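
For reference, here is a minimal sketch of the two helpers used above. The names mirror the pseudo-code, but the details (per-instruction costs, an extra batch-size argument to convert the total budget into a per-instruction budget, two-point interpolation) are our illustration rather than a definitive specification:

def compute_non_decreasing_convex_hull(points):
    """points: iterable of (cost, success_rate, model) tuples with per-instruction costs.
    Returns the non-decreasing upper frontier: performance never drops as cost grows."""
    pts = sorted(points, key=lambda p: (p[0], -p[1]))   # ascending cost, best accuracy first
    # 1) drop dominated models (not more accurate than a cheaper or equally priced one)
    frontier = []
    for cost_k, acc_k, model_k in pts:
        if not frontier or acc_k > frontier[-1][1]:
            frontier.append((cost_k, acc_k, model_k))
    # 2) keep only points on the concave envelope: chord slopes must be decreasing
    hull = []
    for p in frontier:
        while len(hull) >= 2:
            (c1, a1, _), (c2, a2, _) = hull[-2], hull[-1]
            if (p[1] - a2) / (p[0] - c2) >= (a2 - a1) / (c2 - c1):
                hull.pop()                              # middle point lies under the chord
            else:
                break
        hull.append(p)
    return hull

def determine_mixture(hull, total_budget, n_instructions):
    """Return [(model, fraction)] so the expected per-instruction cost matches the budget."""
    per_instruction = total_budget / n_instructions
    if per_instruction <= hull[0][0]:
        return [(hull[0][2], 1.0)]                      # can only afford the cheapest hull model
    if per_instruction >= hull[-1][0]:
        return [(hull[-1][2], 1.0)]                     # budget covers the best hull model everywhere
    for (c_lo, _, m_lo), (c_hi, _, m_hi) in zip(hull, hull[1:]):
        if c_lo <= per_instruction <= c_hi:
            frac_hi = (per_instruction - c_lo) / (c_hi - c_lo)
            return [(m_lo, 1.0 - frac_hi), (m_hi, frac_hi)]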

The original NDCH approach does not take into account the success probabilities of individual instructions for each LLM. Instead, it randomly assigns instructions to models based on a calculated ratio. However, since we have a classifier that predicts success probabilities for each instruction across different LLMs, we introduce a refined variant called NDCH(P). This approach prioritizes routing instructions to cheaper LLMs when their predicted success probabilities are high, ensuring a more informed allocation. By leveraging these probabilities, NDCH(P) improves cost efficiency by assigning instructions to the least expensive models that are still likely to perform well, leading to better overall budget utilization while maintaining accuracy.
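
One plausible way to realize this idea, sketched under the assumption that the NDCH mixture contains exactly one cheaper and one more expensive model, is to keep the NDCH proportions but fill the cheaper model's share with the instructions it is most likely to answer correctly:

def ndch_p_assign(batch, mixture, success_prob, cost):
    """mixture: [(model, fraction)] pairs from the NDCH step (assumed two entries).
    success_prob(instruction, model) -> calibrated probability from the classifier."""
    (cheap, frac_cheap), (expensive, _) = sorted(mixture, key=lambda t: cost[t[0]])
    n_cheap = round(frac_cheap * len(batch))
    # rank instructions by how confident the classifier is that the cheap model will succeed
    ranked = sorted(batch, key=lambda inst: success_prob(inst, cheap), reverse=True)
    return {inst: (cheap if idx < n_cheap else expensive) for idx, inst in enumerate(ranked)}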

3. Budget Allocation using Mixed Integer Programming

In scenarios where a fixed budget must be strictly adhered to, yet performance needs to be maximized, we frame the allocation problem as a Mixed Integer Programming (MIP) optimization task. MIP provides a structured way to make globally optimal assignments by considering both cost and performance simultaneously.

The key objective of this approach is to maximize the average precision across all assigned instructions while staying within the allocated budget. Unlike heuristic-based approaches, MIP considers all possible assignments holistically, ensuring that every instruction is routed to the most suitable LLM in a way that balances cost-effectiveness and performance.

To achieve this, we define a set of constraints:

  1. Budget Constraint: The total cost of all selected model assignments must not exceed the given budget.
  2. Exclusive Assignment: Each instruction must be assigned to exactly one LLM.
  3. Performance Awareness: The predicted probability of success for each instruction-LLM pair should guide the selection process, ensuring that the highest-performing assignments are prioritized given the budget limitations.

By leveraging MIP, we effectively determine an optimal routing strategy that dynamically adjusts the proportion of instructions assigned to different models. This ensures that budget is utilized in a way that maximizes overall performance rather than being arbitrarily divided. Furthermore, MIP allows for incorporating additional constraints, such as latency requirements or fairness considerations across different LLMs, making it a highly flexible optimization framework.
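
Concretely, writing $x_{jk} \in \{0,1\}$ for the decision to route instruction $i_j$ to model $M_k$, $p_{jk}$ for the calibrated success probability, $c_k$ for the per-instruction cost of $M_k$, and $B$ for the total budget, a standard formulation consistent with these constraints is:

$$
\max_{x}\ \sum_{j=1}^{N}\sum_{k=1}^{K} p_{jk}\, x_{jk}
\quad \text{s.t.} \quad
\sum_{j=1}^{N}\sum_{k=1}^{K} c_{k}\, x_{jk} \le B,
\qquad
\sum_{k=1}^{K} x_{jk} = 1 \ \ \forall j,
\qquad
x_{jk} \in \{0,1\}.
$$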

Pseudo-Code:

m = create_optimization_model()
for i, instruction in enumerate(batch):
    for M_k in models:
        x[i, M_k] = add_binary_variable(m)  # Decision variable

m.setObjective(sum(x[i, M_k] * get_precision(M_k, success_prob(i, M_k)) for i in batch for M_k in models), MAXIMIZE)

m.addConstr(sum(x[i, M_k] * cost[M_k] for i in batch for M_k in models) <= budget)

for i in batch:
    m.addConstr(sum(x[i, M_k] for M_k in models) == 1)

solve(m)
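
For completeness, the same program can be written with an off-the-shelf open-source solver. The sketch below uses PuLP with its bundled CBC solver and, for simplicity, maximizes the calibrated success probabilities directly rather than the bucketed precision values referenced in the pseudo-code; probs and costs are placeholder inputs:

import pulp

def solve_routing(probs, costs, budget):
    """probs: N x K matrix of calibrated success probabilities (instruction x model).
    costs: length-K list of per-instruction model costs. budget: total allowed spend.
    Returns the index of the model chosen for each instruction."""
    N, K = len(probs), len(costs)
    mip = pulp.LpProblem("llm_routing", pulp.LpMaximize)
    x = pulp.LpVariable.dicts("x", [(j, k) for j in range(N) for k in range(K)], cat="Binary")

    # objective: expected number of successful instructions
    mip += pulp.lpSum(probs[j][k] * x[j, k] for j in range(N) for k in range(K))
    # budget constraint: total cost of the chosen assignments
    mip += pulp.lpSum(costs[k] * x[j, k] for j in range(N) for k in range(K)) <= budget
    # each instruction is assigned to exactly one LLM
    for j in range(N):
        mip += pulp.lpSum(x[j, k] for k in range(K)) == 1

    mip.solve(pulp.PULP_CBC_CMD(msg=False))
    return [max(range(K), key=lambda k: pulp.value(x[j, k])) for j in range(N)]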

Results and Performance Comparison

To evaluate the effectiveness of our routing strategies, we conducted experiments on two datasets, LLM-Uncertainty-Bench and Router R1 Math Reasoning, comparing the accuracy-cost tradeoffs across different allocation methods. We used 70% of each dataset for training and 20% for testing; the remaining samples were used for calibration and validation. We repeated each experiment five times with different random seeds and report the variance of the results in the figures.

LLM-Uncertainty-Bench Dataset

This dataset consists of 50,000 samples covering five key NLP tasks, with 10,000 samples per task. Each task is designed to assess different aspects of model reasoning and uncertainty quantification. The tasks include:

  1. Question Answering (QA): Evaluates general knowledge retrieval and reasoning, based on MMLU.
  2. Reading Comprehension (RC): Assesses contextual understanding using CosmosQA.
  3. Commonsense Inference (CI): Tests logical reasoning using HellaSwag.
  4. Dialogue Response Selection (DRS): Measures conversational context awareness using HaluEval.
  5. Document Summarization (DS): Evaluates summarization capabilities, also derived from HaluEval.

This dataset includes labels for a diverse set of 16 LLMs spanning the Qwen, Yi, MPT, Llama, InternLM, Mistral, Falcon, and DeepSeek model families. Success rates and costs of these models are presented in the table below. We set the cost of these models based on their sizes, ensuring they are generally comparable to the public API costs of models of similar size. However, for Yi-34B, we intentionally increased its cost beyond that of the largest model in the set, as it is the most performant model. This adjustment introduces a more challenging allocation problem for the router, forcing it to make more nuanced trade-offs between cost and performance.

Model                   Success Rate   Cost (per 1M tokens)
Yi-34B                  0.824          2.20
Qwen-72B                0.785          2.19
Qwen-14B                0.740          0.90
Llama-2-70b-hf          0.730          2.16
deepseek-llm-67b-base   0.728          2.00
Yi-6B                   0.694          0.60
Mistral-7B-v0.1         0.643          0.69
Llama-2-13b-hf          0.606          0.90
Qwen-7B                 0.603          0.69
internlm-7b             0.485          0.69
Llama-2-7b-hf           0.463          0.69
deepseek-llm-7b-base    0.440          0.69
Qwen-1_8B               0.418          0.18
falcon-40b              0.345          1.24
mpt-7b                  0.263          0.69
falcon-7b               0.250          0.69

Our experimental results for this dataset are presented in the figure below, where we express the budget as a fraction of the cost of the most expensive model available, based on the values in the table above. The findings indicate that the MIP-based routing approach achieves 95% of the best model’s performance while using only 50% of its cost. While MIP significantly outperforms NDCH and NDCH(P), these methods also prove effective in reducing costs, achieving up to 90% of the best model’s performance within the same budget. This highlights that even heuristic-based approaches can yield substantial cost savings while maintaining high performance.

[Figure: accuracy vs. budget (as a fraction of the most expensive model's cost) for each routing strategy on LLM-Uncertainty-Bench]

Router R1 Math Reasoning Dataset

The Router R1 Math Reasoning dataset is a publicly available dataset consisting of 3,234 samples. It is specifically curated to test mathematical reasoning and judgment tasks, playing a crucial role in evaluating synthetic data generation strategies and model robustness. The dataset includes labels for five models: R1, R1-32B, R1-14B, R1-7B, and R1-1.5B. The success rates and costs of these models are presented below.

Model     Success Rate   Cost (per 1M tokens)
R1        0.951          2.19
R1-32B    0.936          1.18
R1-14B    0.915          0.90
R1-7B     0.900          0.69
R1-1.5B   0.819          0.18

Our experimental results for this dataset are presented in the figure below, where we observe that all routing approaches perform similarly well, with NDCH(P) demonstrating a slight advantage. It consistently outperforms the other approaches across different budget ratios relative to the best model. For instance, at 50% of the best model's budget, the other routing methods achieve approximately 97% of its accuracy, whereas NDCH(P) reaches nearly 99%. This suggests that leveraging predicted success probabilities for allocation provides a meaningful improvement in cost efficiency without sacrificing performance. Notably, routing with predicted probabilities allows the overall success rate to exceed that of the best model at only 70% of its cost.

[Figure: accuracy vs. budget ratio (relative to the best model) for each routing strategy on the Router R1 Math Reasoning dataset]


Conclusions and Future Work

In this work, we explored different strategies for routing instructions to LLMs under budget constraints. While we introduced threshold-based assignment as a straightforward approach, it requires a predefined performance threshold, which is often impractical due to the lack of knowledge about the upper bound of achievable accuracy. In most real-world scenarios, budget constraints are more relevant than predefined performance targets, motivating our focus on budget-aware allocation strategies.

To address this, we evaluated NDCH, NDCH(P), and MIP-based routing, each designed to optimize LLM selection while staying within a fixed budget. NDCH-based allocation constructs an efficient cost-performance tradeoff curve, ensuring that resources are distributed optimally among available models. NDCH(P) refines this approach by incorporating predicted success probabilities, further improving efficiency. MIP-based optimization, on the other hand, formulates the allocation problem as a constrained optimization task, yielding the best possible performance within a given budget.

Our results demonstrate that MIP routing consistently delivers the highest efficiency, achieving near-optimal accuracy while significantly reducing costs. Meanwhile, NDCH and NDCH(P) provide competitive alternatives, especially in settings where computational efficiency is a priority.

We will release the code to reproduce the presented results in the coming weeks.

Future research will explore additional datasets and incorporate new constraints, such as latency-aware routing and cascaded decision-making. In latency-aware routing, models will be selected based on both performance and inference speed to meet latency requirements. Cascaded routing, on the other hand, introduces an adaptive selection process where an initial response is generated using a cheaper LLM, and its quality is assessed using a separate verification mechanism. If the response is deemed insufficient, the instruction is escalated to a more capable but costlier model. This approach ensures efficient resource utilization while maintaining high-quality outputs.
