Engineering

Route the cheap model first

Cost is not a setting you pick for your app. It is a decision you make per step, a small model to triage, a large one to judge, and most steps never reach the large one at all.

ASR

Apollo Space Research

Apollo Space

· 12 min read

A founder asks an agent a question. The agent has to decide one thing before it does anything else: is this a request to do something, or a request to know something? That’s a yes/no. And in a lot of systems, that yes/no is decided by the most expensive model on the menu, the one you’d want writing your contract, reasoning over your books, holding a hard argument in its head. It just answered a coin-flip. Then it answers the next one. And the one after that.

Most of what an agent does all day is coin-flips wearing the costume of hard problems.

The key idea of this post is simple: cost is a routing decision you make per step, not a setting you pick per app. Pick a small model to triage, save the large one for the steps that actually need a large one, and let most steps never reach it. That sentence is the whole architecture. The rest of this post takes it apart, step by step.

The naive way: one model, every step, every time

Here is how almost everyone starts, and it’s a completely reasonable place to start. You pick a model, usually the best one you can afford, and you point every step of the agent at it. Classification: best model. Tool selection: best model. Writing the reply: best model. Deciding whether the reply was good enough: best model again, grading the first one.

It works. The demo is clean. Every answer is sharp because the smartest thing in the building touched it. So you ship it.

Then the bill arrives, and it’s strange. You look at where the money went, and it didn’t go where you’d expect. It didn’t all go to the hard reasoning. A huge slice of it went to a model that’s brilliant at law and medicine and multi-step planning, spending its talent on the question “did the user ask me to do something, yes or no?” You paid genius rates for a turnstile.

The pain isn’t only money. It’s latency. The big model is slower, and now your agent is slow on every step, including the ten trivial ones it does before it reaches the one that’s actually hard. The user waits the same long beat for “let me check” as they do for the answer. You’ve made the cheap parts expensive and the fast parts slow, and you did it uniformly, because the model was a property of the app instead of a property of the step.

That’s the trap: you chose one altitude of intelligence and applied it to a problem that has many altitudes.

The reframe: a step has a difficulty, and difficulty is cheap to detect

Walk through what an agent actually does in a single turn. It is not one act of thinking. It’s a little assembly line, and the stations have wildly different difficulty.

First it has to understand the request, sort it into a bucket. Question or command? About money, or people, or the calendar? That’s pattern-matching. A small, fast model does it about as well as a large one, because the answer space is tiny and the signal is right there in the words.

Then, if it’s a command, it has to plan, pick the tool, fill the arguments, sequence the steps. Sometimes trivial (“record this expense”), sometimes genuinely hard (“close the books, but only after reconciling the two ledgers that disagree”). The difficulty here is variable, and that variability is the whole point.

Then it acts, calls the tool, gets a result. The model barely thinks here; the tool does the work.

Then someone has to judge whether the result is actually good, grounded, complete, not hallucinated, safe to show. This is the station where you want the careful mind. Judging is harder than doing, the way reviewing a proof is harder than following one.

The naive way runs all four stations on the same model. But look at them. The first is trivial. The middle two are variable. The last one is where the real intelligence earns its keep. Each station has a difficulty, and a step’s difficulty is far cheaper to detect than the step is to solve. A small model can almost always tell you “this one’s easy” or “this one needs the big brain” without being the big brain itself.

So you stop asking “which model runs the agent?” and start asking “which model runs this step?”, and you let a small, fast triager decide.

A single agent turn is an assembly line of four stations, understand, plan, act, judge, each with its own difficulty. The naive system runs all four on one expensive model; the routed system sends the easy understand-and-act stations to a small fast model and reserves the large model for the steps that are genuinely hard, so most steps never reach it.

The cascade: small triages, large judges, most steps never escalate

Here’s the shape that replaces “one model everywhere.” It’s an old idea borrowed from a place that has nothing to do with language models, and it’s worth saying where it comes from, because the lineage is the argument.

A spam filter does not run a heavyweight model on every email. It runs a cheap test first, a few keywords, a sender reputation, a fast classifier. The overwhelming majority of mail is decided right there, in microseconds, for almost nothing. Only the genuinely ambiguous messages, the ones the cheap test can’t call confidently, get escalated to the expensive analysis. The cheap stage doesn’t have to be right about the hard cases. It only has to be right about which cases are hard.

That’s a cascade. And it’s exactly the structure an agent wants.

In a cascade, the small model goes first on every step. It does the triage, the routing, the easy understanding, the obvious tool pick. When it’s confident, and on routine work it’s confident most of the time, the step is done, fast and cheap, and nothing else runs. When it’s not confident, or when the step is one we’ve marked as high-stakes by its very nature, it escalates: hand this one up to the large model. The large model isn’t the default. It’s the exception you reach for, on purpose, when the cheap stage raised its hand and said I’m not sure, and being unsure here is expensive.

The naive system pays the large-model price on every step. The cascade pays it only on the steps that escalate, and on a normal day’s work, most steps don’t.

There’s a sharp objection here, and it’s the right one to raise: isn’t this just going to be dumber? You’re letting a small model make decisions. No, and the reason is the second half of the cascade. The small model never gets the final word on anything that matters. The large model still judges. The cheap stage decides the easy things and triages the hard ones; the expensive stage decides the hard things and checks the consequential ones. You didn’t replace the smart model. You stopped wasting it on turnstiles so it has budget left for the courtroom.

The cheap model’s job isn’t to be right about the hard cases. It’s to be right about which cases are hard.

Where it breaks, and the rule that fixes it

The honest version of this includes the place it goes wrong, because a cascade has a failure mode and pretending it doesn’t is how you ship it badly.

The failure is a confident small model that’s confidently wrong. It looks at a step, decides it’s easy, handles it itself, and never escalates, except the step wasn’t easy, it just looked easy. A subtle request misread as a simple one. A high-stakes action that pattern-matched to a low-stakes template. The cascade’s whole economy depends on the cheap stage knowing its limits, and a cheap model that doesn’t know its limits will route a courtroom case to the turnstile.

So the cascade needs one rule, and it’s the rule that makes the whole thing safe to run: escalation is not only about confidence, it’s about consequence. Some steps go to the large model no matter how confident the small one feels, because the cost of being wrong is too high to gamble on a cheap stage’s self-assessment. Anything that spends money. Anything that touches a customer. Anything irreversible. Those aren’t routed by confidence; they’re routed by category, straight to the careful mind and through the judge, every time.

That gives you two escalation triggers, and you want both. The first is the cheap model is unsure, escalate the ambiguous. The second is the step is consequential, escalate the dangerous, even when the cheap model was sure. Confidence handles the hard-but-harmless. Consequence handles the easy-looking-but-costly. Together they close the gap where a fast model’s false confidence would otherwise leak a bad decision into the world.

There’s a second, quieter failure worth naming: the temptation to make the router smart. If your triager needs a large model to decide where to route, you’ve reinvented the problem one level up, now you’re paying genius rates to decide whether to pay genius rates. The router has to stay cheap. Its job is the easy meta-question “is this easy?”, and that question is, almost always, itself easy. When it isn’t, the safe move is to escalate, a cascade that errs toward the big model on doubt is still vastly cheaper than one that runs the big model on everything, and it never gets dumber, only slightly less thrifty.

Two ways to spend on a turn. On the left, every step routes to one large model, so the bill is uniform and the trivial steps cost as much as the hard ones. On the right, a cheap triager runs first and clears most steps itself, escalating to the large model only on low confidence or high consequence, so spend tracks difficulty instead of volume.

Why “per step” is the load-bearing word

It would be easy to read all of this as “use a cheaper model.” That’s not the lesson, and reading it that way throws away the actual idea.

“Use a cheaper model” is a per-app decision, you pick one model, a smaller one, and now you’re cheaper and dumber everywhere, including on the steps that needed the smart one. That’s the same mistake as the naive way, just pointed the other direction. You traded a uniform overpay for a uniform underperform. Uniform was never the right answer, in either direction.

The whole move is that cost stops being a property of the app and becomes a property of the step. The same agent, on the same turn, might run a tiny model to understand the request, a mid model to plan the action, no model at all to call the tool, and the largest model you have to judge whether the output is safe to show. One turn, four altitudes of intelligence, each matched to what its station actually demands. The bill tracks difficulty instead of volume. And the user feels it as speed: the easy parts return instantly because nothing heavy ran, and the hard parts are worth the wait because something heavy should have.

Say a typical turn touches a dozen model calls. If eleven of them are easy and one is hard, and on real work the ratio is lopsided like that far more often than not, then routing per step means you pay the high rate once instead of twelve times, and you pay it on the call that earned it. The savings aren’t a discount. They’re the difference between spending on thinking and spending on turnstiles.

This is also why you can’t bolt it on at the end. Per-step routing is a decision the system makes, not the model, which means the system has to be built to make it. Something has to sit above the models, look at each step, know its difficulty and its consequence, and pick. That “something” is the part most agent stacks are missing, because they were built as one model with a prompt, and one model with a prompt has nowhere to put a routing decision.

The turn: thrift is taste, not stinginess

Strip away the models and the cascades and what’s left is an old habit of anyone who’s run a real operation: you don’t put your most expensive resource on your cheapest problem.

You don’t send the senior partner to photocopy. You don’t wake the on-call engineer for a typo. You don’t ask the person whose judgment you trust most to spend their morning sorting which emails are spam, not because their judgment is wasted on it, but because every minute they spend on the turnstile is a minute they’re not spending on the case that actually needs them. Good operations have always routed work to the right altitude. It’s not stinginess. It’s respect for where the scarce thing should go.

That’s what routing the cheap model first really is. It’s the same discipline, applied to a workforce made of models instead of people. The small model isn’t a downgrade you tolerate to save money. It’s the right tool for the eleven steps that are genuinely easy, and the reason the large model has the budget, and the speed, and the attention left to be brilliant on the one step that’s genuinely hard.

The agents will keep getting smarter and the prices will keep moving, and none of that changes the shape. There will always be cheap problems and dear ones in the same breath, and the system that wins is the one that knows the difference before it spends, that meets each step at its own altitude instead of charging full fare for the whole stack.


That’s part of what we’re building at Apollo Space, not a smarter chatbot, but an operating system that decides, step by step, how much intelligence a moment is worth, and spends accordingly. The next time an agent answers a yes/no for you, the interesting question isn’t how smart the answer was. It’s how little it should have cost to know it.

Apollo runs your company's repetitive ops so your team doesn't.

Join the waitlist for early access, founding-user pricing, and a front-row seat as we ship.

Join the waitlist