Power steering: Squeeze massive power from small LLMs
Feeling FOMO because you don't have the iron to run that hot new 405B, or even 70B, LLM? If the task involves structured output, start by making sure you can't get a nice little 12B or 13B to do the job, with a bit of nudging. You may not need as massive an LLM as you think, if you give it a helping hand on the steering wheel.
I had a very interesting Discord conversation a few days ago with someone chasing the largest LLM they could run, as a sort of brute-force problem solver. I've had other occasions since then to observe that this is a common approach among people exploring DIY with LLMs, but in many cases it needn't be.
How does the LLM secret weapon get thrown in the waste bin?
OP, as we'll call them, started things off by saying their 64GB RAM M1 Mac was useless because they needed to run Llama 70B, and a 4 bit quant was giving "bad results" while an 8 bit quant wouldn't fit. It was an unclear problem statement, and I had to be patient teasing out what OP's actual problem was. I'll spare you the back and forth and just summarize as follows:
- They were using an LLM to analyze a bunch of internet posts, with structured output
- This was using Ollama on their 64GB Mac
- They could get the results they wanted with Ollama on Nvidia (dual 3090s)
- They tried basic generation with MLX (note: Ollama doesn't support MLX), but they said the generation was hanging
In effect, OP had gone through something most of us do: they had started with a problem, and kept having to use larger and larger models and quants before they started to get good results. It was then easy to consider size and benchmarks as the main factors for all their LLM workloads. They had bought an M1 Mac because it's well reputed to be an excellent way to run such workloads, and they were bringing along the mindset of chasing the biggest models they could.
MLX was in particular the main context for our conversation. I've often said that MLX is the secret weapon of AI DIY, for folks who like to tinker for themselves and follow along this GenAI revolution hands on. It's a large part of the reason why Apple Silicon is so great for AI: fast and developing rapidly, with very well engineered code—a bit of a rarity, sadly, in GenAI projects. On top of all that, the hardware is very energy efficient.
For my own part I build on MLX with Toolio, a package for LLM generation with guaranteed structured output, which I'll more specifically call schema-steered structured output (3SO), inspired by llm-structured-output, another open-source LLM-sampler-based structured output tool. I knew that Ollama didn't support 3SO (this has now changed), and asked OP whether they had tried that in any form.
I'd already guessed that the reason they needed a 70B parameter model with 8 bit or higher quant was not because the underlying problem was that complex, but because larger models can more readily deal with unguided requests for structured output. I told OP that if they instead used a guided request, they could probably use a much smaller model.
I've done a lot of LLMOps work for data extraction—extracting structured data from unstructured content. I've found that small LLMs are remarkably good at this, if you can find tools to steer them through the structural part so they can "focus" (yes, anthropomorphizing here) on the pure language and logic. Toolio came about at first to scratch my own such itches.
It was clear from OP's response that not only had they not considered or tried 3SO, but that they were assuming that a large enough LLM would guarantee the structure. I've been learning that this is a common misapprehension, and I even added the following to the Toolio README a couple of weeks ago:
There is sometimes confusion over the various ways to constrain LLM output:
- You can basically beg the model through prompt engineering (detailed instructions, few-shot, etc.), then attempt generation, check the results, and retry if it doesn't conform (perhaps with further LLM begging in the re-prompt). This gives uneven results, is slow and wasteful, and ends up requiring much more powerful LLMs.
- Toolio's approach, which we call schema-steered structured output (3SO), is to convert the input format of the grammar (JSON schema in this case) into a state machine which applies those rules as hard constraints on the output sampler. Rather than begging the LLM, we steer it.
In either case you get better results if you've trained or fine-tuned the model with a lot of examples of the desired output syntax and structure, but the LLM's size, power and training are only part of the picture with 3SO.
What does schema-steered structured output (3SO) really mean?
The distinction I added to the Toolio README is worth elaborating on, because there is so much fundamental misunderstanding out there on the topic; there are many tools offering different approaches to the problem, and they don't always make clear which approach they take.
Imagine the LLM is a smart programmer friend. You give them pen and paper and ask them to write out directions from their house to the local post office, but in a strict output format such as XML or JSON. Many programmers will be able to do this, with zero flaws in syntax, and functionally useful travel directions, most of the time. This is essentially two problems in one, tapping into two separate knowledge sources in their head.
- Their knowledge and memory of where they live, and the route to the post office
- Their knowledge of XML or JSON syntax (let's just stick with JSON from here on)
Here's the thing, though: every now and then, they'll slip up. Sure they might accidentally mention a left turn where they mean a right, but more likely they'll mess up something syntactic. They may forget a comma in a JSON object, or forget to escape a double quote within a string. Our brains are really not specialized for such strict syntax. The most experienced programmer might make such errors rarely enough, however, that you're fooled into thinking correct output is guaranteed from them, which of course is not the case.
Think of most LLMs as extremely smart programmers, for this purpose. If you tell them to generate JSON, they can be very good at doing so, but they can and will slip up; often in different ways than a person. The most common slip-up is for them to begin the output with some introductory text such as "Sure here is your JSON output", perhaps with added backticks—just trying to be friendly and helpful, as their training encourages. Unfortunately, this extra chaff ends up breaking JSON parsing, gumming up any automation which expects a syntactically perfect response.
There are many prompt engineering techniques which can help with this. You can add explicit instructions for the LLM not to add any preamble. You can include few-shot examples. You can even fine-tune the LLM to reinforce the sort of output you want. This will improve your outcomes by some margin, but crucially, none of it will guarantee valid, structured output.
Many tools for structured output handle this through validation and retries. This is like checking your programmer friend's work and maybe saying "oops, you missed a comma; try again." With enough retries and prompt engineering tricks you can maybe get close to a guarantee of the output, but clearly it would be better to just have that guaranteed in one shot, as with 3SO.
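To make that contrast concrete, here's a minimal sketch of the validate-and-retry pattern in Python. The call_llm function is a hypothetical stand-in for whatever client you use; the point is that even with the retry loop, nothing guarantees you ever get valid JSON.

import json

def call_llm(prompt):
    '''Hypothetical stand-in for your LLM client of choice'''
    raise NotImplementedError

def extract_json(prompt, max_retries=3):
    '''Beg, check, retry: prompt for JSON, parse it, re-prompt on failure'''
    for attempt in range(max_retries):
        raw = call_llm(prompt)
        try:
            return json.loads(raw)  # Hope the model obeyed
        except json.JSONDecodeError as e:
            # Re-beg, pointing at the error; still no guarantee the next attempt is better
            prompt += f'\nYour previous reply was not valid JSON ({e}). Reply with JSON only, no preamble.'
    raise RuntimeError('Model never produced valid JSON')

Each failed attempt burns a full generation pass, which is where the slowness and waste come from.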
Helper hand on the wheel
What do most programmers do, in practice, in order to make sure they at least don't have to worry about getting the syntax wrong, and that they can focus on the underlying problems? They use syntax helpers, such as text editor auto-complete. This way they can pretty much be assured of valid output. These wizards are basically helping the user steer correctly.
We can give an LLM the same help, except we can go even further and strictly steer the output to ensure it is, for example, syntactically correct JSON, and even that it conforms to a specific pattern of JSON, i.e. a JSON schema. In OP's case, we want to make sure it generates something like:
[
  {
    "summary": "ABC 123 Do re mi",
    "author": "Jackson 5",
    "link": "https://example.com/post/2312"
  },
  {
    "summary": "Stop! The love you save may be your own",
    "author": "Jackson 5",
    "link": "https://example.com/post/9970"
  }
]
We want to make sure it doesn't make up field names such as url rather than link, or maybe add an unspecified timestamp field, or maybe invent some sort of extra structure such as an outer object rather than array.
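This is exactly the sort of thing a JSON schema can pin down. The following is a minimal illustration, not the schema used later in this article; the additionalProperties false constraint is my own addition here, to show how invented fields get rejected, and the jsonschema library is used only to demonstrate what conformance means.

from jsonschema import validate, ValidationError  # pip install jsonschema

ITEM_SCHEMA = {
    'type': 'object',
    'required': ['summary', 'author', 'link'],
    'additionalProperties': False,  # Assumption added for illustration: no invented fields
    'properties': {
        'summary': {'type': 'string'},
        'author': {'type': 'string'},
        'link': {'type': 'string'},
    },
}

good = {'summary': 'ABC 123 Do re mi', 'author': 'Jackson 5', 'link': 'https://example.com/post/2312'}
bad = {'summary': 'ABC 123 Do re mi', 'author': 'Jackson 5', 'url': 'https://example.com/post/2312'}

validate(good, ITEM_SCHEMA)  # Passes silently
try:
    validate(bad, ITEM_SCHEMA)
except ValidationError as e:
    print('Rejected:', e.message)  # Reports the violation (missing link / unexpected url)

Validation like this is what the retry loops above depend on; 3SO instead enforces the schema during generation, so there's nothing left to reject.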
A well-known example of such schema-guided generation is OpenAI's recent announcement, Introducing Structured Outputs in the API, in which they also clarify what's new about schema-aware steering.
While JSON mode [released by OpenAI in November, 2023] improves model reliability for generating valid JSON outputs, it does not guarantee that the model’s response will conform to a particular schema. Today [August, 2024] we’re introducing Structured Outputs in the API, a new feature designed to ensure model-generated outputs will exactly match JSON Schemas provided by developers.
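For reference, here's roughly what that looks like with the OpenAI Python client, sketched from their announcement; the model name, schema, and prompt are placeholders, and note that their strict mode requires additionalProperties false.

from openai import OpenAI

client = OpenAI()  # Assumes OPENAI_API_KEY is set in the environment

schema = {
    'type': 'object',
    'required': ['summary', 'author', 'link'],
    'additionalProperties': False,
    'properties': {
        'summary': {'type': 'string'},
        'author': {'type': 'string'},
        'link': {'type': 'string'},
    },
}

resp = client.chat.completions.create(
    model='gpt-4o-2024-08-06',  # First model announced with Structured Outputs support
    messages=[{'role': 'user', 'content': 'Summarize the following post ...'}],
    response_format={
        'type': 'json_schema',
        'json_schema': {'name': 'post_summary', 'schema': schema, 'strict': True},
    },
)
print(resp.choices[0].message.content)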
The earliest example of this facility of which I became aware was llama.cpp grammars, which emerged in 2023. llama.cpp is an open-source library for LLM inference, and GBNF is its format for defining formal grammars to constrain LLM outputs.
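If you're running llama.cpp locally, the same idea looks roughly like this through the llama-cpp-python bindings; the tiny GBNF grammar, model path, and prompt are illustrative only, and I'm treating the exact grammar API shape as an assumption.

from llama_cpp import Llama, LlamaGrammar

# Toy GBNF grammar: output must be a JSON-style array of "yes"/"no" strings
GRAMMAR = r'''
root ::= "[" ws item (ws "," ws item)* ws "]"
item ::= "\"yes\"" | "\"no\""
ws   ::= [ \t\n]*
'''

llm = Llama(model_path='path/to/model.gguf')  # Illustrative path
out = llm(
    'For each of the three posts below, answer yes or no as to whether it mentions AI: ...',
    grammar=LlamaGrammar.from_string(GRAMMAR),
    max_tokens=64,
)
print(out['choices'][0]['text'])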
Meanwhile, just this past Friday Ollama joined the party, announcing 3SO.
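Ollama's take looks roughly like this in their Python client, sketched from their announcement; the model name and schema are placeholders, and I'm treating the exact call shape as an assumption.

from ollama import chat

SCHEMA = {
    'type': 'object',
    'required': ['summary', 'author', 'link'],
    'properties': {
        'summary': {'type': 'string'},
        'author': {'type': 'string'},
        'link': {'type': 'string'},
    },
}

response = chat(
    model='llama3.1',  # Placeholder model name
    messages=[{'role': 'user', 'content': 'Summarize the following post ...'}],
    format=SCHEMA,  # Per the announcement, format accepts a JSON schema
)
print(response['message']['content'])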
That covers the most mainstream, the earliest and most recent examples I'm aware of. Another is my own project, Toolio. Further examples of steered output generation are rather thin on the ground, but I believe adding 3SO to one's toolkit is key to effective use of GenAI.
Power steering makes the difference
Back to the JSON navigation output example, remember that we had two problem spaces in one:
- Knowledge and memory of an area, and travel directions
- Knowledge of JSON syntax
The way GenAI works, as the response is being generated there is a huge statistical frame (in effect, a probability distribution) for predicting each next token. In the naïve approach without steering, the LLM reaches into both streams of knowledge at the same time, both contributing to the statistics for the possible next token. The process through which these statistics are turned into a selection, i.e. the sampling, is relatively unsophisticated, so there is some friction on the core knowledge (1) in getting through. This not only sometimes increases the likelihood of errors, but also makes the entire system less efficient.
With 3SO, you let the sampling layer take care of (2). In effect, the sampler restricts the statistics for the next token according to the output schema. For example, if it's controlled by a JSON schema which says the outer structure is a list, the sampler adjusts things so that the probability of [ as the first token is close to 100%.
This not only ensures that you get valid output, but the tweaks to the token statistics have the added effect of reducing the friction on the main knowledge stream (1). It also means that if the LLM is a bit less powerful, and might struggle with the output structure requirements, we're giving it enough of a helping hand for it to be successful, as long as it's smart enough to handle the main knowledge stream (1).
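Here's a toy illustration of that sampler-level steering. It is not Toolio's actual implementation; the tiny vocabulary and hard-coded allowed set stand in for what a real schema-driven state machine would compute at each step.

import math
import random

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def steered_sample(logits, vocab, allowed):
    '''Mask every token the schema's state machine disallows, then sample as usual'''
    masked = [
        logit if tok in allowed else float('-inf')  # -inf becomes probability 0 after softmax
        for logit, tok in zip(logits, vocab)
    ]
    return random.choices(vocab, weights=softmax(masked), k=1)[0]

# Toy example: the schema says the outer structure is an array, so at the first step
# only "[" is allowed, no matter what the raw statistics prefer
vocab = ['Sure', '{', '[', '"summary"']
logits = [3.1, 1.2, 0.4, 0.2]  # Unsteered, the chatty "Sure" would usually win
print(steered_sample(logits, vocab, allowed={'['}))  # Always prints [

A real implementation recomputes the allowed set after every emitted token, walking a state machine derived from the schema as generation proceeds.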
In my experience, and (spoiler alert) as turned out to be the case for OP, people often end up looking for more and more powerful LLMs (Llama 70B in OP's case) not because the underlying problem (1) requires it, but because the combination of (1) and (2) requires a lot of LLM horsepower to get a high enough rate of correctness. OP thought they were getting 3SO, because they never happened to witness a failure, but without steering support from the inference layer, they were always gambling, and also having to chase over-powered and resource-hungry LLMs.
Yes, back to our friend OP
OP was skeptical when I tried to explain the above, but to their credit, they did agree to give Toolio a try. I took a look at their code, and after helping them navigate a few quirks, such as malformed tokenization, I whipped up a quick schema for their desired output format, with a helping hand from Claude. I was then able to simplify their main prompt a great deal, because I could remove all the bits begging the LLM to behave, the few-shot examples, and all that, none of which is necessary with Toolio's 3SO.
OP had been trying to use the mlx-community/Llama-3-70B-Instruct-Gradient-262k-4bit model, but I wanted to try it with my current favorite go-to small(ish) model, mlx-community/Mistral-Nemo-Instruct-2407-4bit, so 12B rather than 70B. The resulting code, using Toolio's 3SO, is as follows.
import sys
import asyncio
from toolio.llm_helper import model_manager
from toolio.common import response_text

# JSON schema describing the desired output: an array of post summary objects
SCHEMA = '''\
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "array",
  "items": {
    "type": "object",
    "required": ["summary", "author", "link"],
    "properties": {
      "summary": {
        "type": "string",
        "description": "Concise summary of the post"
      },
      "author": {
        "type": "string",
        "description": "Name of the person who shared or authored the response"
      },
      "link": {
        "type": "string",
        "description": "URL source of the original post or reference"
      }
    }
  }
}
'''

# The concatenated posts to analyze are piped in via standard input
UPROMPT = '''\
You are a news analyst. Read the following material and extract information according to the provided schema.
''' + sys.stdin.read()

toolio_mm = model_manager('mlx-community/Mistral-Nemo-Instruct-2407-4bit')

async def main(tmm):
    msgs = [{"role": "user", "content": UPROMPT}]
    # json_schema=SCHEMA is what engages schema-steered structured output (3SO)
    print(await response_text(tmm.complete(msgs, json_schema=SCHEMA, max_tokens=8192)))

asyncio.run(main(toolio_mm))
The json_schema=SCHEMA parameter is what sets up 3SO. The code expects the concatenated posts to be summarized to be piped in via standard input.
OP was astonished that such a small LLM handled this task like a champ, so a happy ending to the discussion.
Wrap up
I should 'fess up to one over-simplification I've been making. Even 3SO may not be 100% guaranteed, in some cases. The OpenAI implementation might refuse to produce output counter to its content standards and guardrails, and these would manifest as a structured refusal response. Toolio and most state-machine approaches can, in rare cases, run into such a degree of schema ambiguity that they enter an undecidable state. This is very rare, though, and nowhere near the sort of gamble that comes with unsteered structured output.
3SO won't help you if you need to produce creative prose or poetry, or for many other common LLM use-cases, but if what you're doing is some sort of data processing within a larger code or API pipeline, 3SO is a crucial tool in the LLMOps kit.
Feature images generated using Recraft.ai