
The model is the cheapest part of the agent

Swap the model and your company keeps running. Swap the harness around it (the memory, the tools, the evals, the gate) and it stops. The moat was never the model.


Apollo Space Research


A better model came out this morning. By lunch, somebody on every engineering team had swapped it in, run the benchmarks, and posted the new numbers in a channel. Nothing else in the product changed. The agent that books the meeting, remembers the customer, and files the report kept doing exactly what it did yesterday, slightly faster.

Now run the opposite experiment. Keep the model frozen and rip out the thing wrapped around it: the memory it reads, the tools it can call, the evals that catch its mistakes, the gate that decides what’s allowed to ship. The model is just as smart as it was this morning. The product is dead.

That asymmetry is the whole post. The model is the cheapest part of the agent; the moat is the harness around it.

What people benchmark, and what actually breaks

Walk into almost any conversation about agents and the question is the same: which model. The leaderboards rank them. The launch posts compare them. The procurement decision turns on a column of scores.

It’s the wrong column.

The model is the one component you can replace in an afternoon. It arrives behind a stable interface: text in, text out, a few parameters. Three providers ship something interchangeable. The price of the good one keeps falling, and the gap between the best and the second-best keeps shrinking. A part with three interchangeable suppliers and a price dropping every quarter is, by definition, a commodity. You don’t build a moat on a commodity. You build it on the thing nobody can hand you off a shelf.
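To make "stable interface" concrete, here is a minimal sketch of what the swap looks like from the product's side. The `Model` protocol, the two vendor classes, and the `answer` function are all hypothetical names invented for illustration; a real adapter would call a provider's SDK where the stub returns a string.

```python
from typing import Protocol


class Model(Protocol):
    """The whole surface the product depends on: text in, text out, a few parameters."""

    def complete(self, prompt: str, temperature: float = 0.0) -> str: ...


class VendorAModel:
    # Hypothetical stand-in; a real adapter would call a provider SDK here.
    def complete(self, prompt: str, temperature: float = 0.0) -> str:
        return f"[vendor-a] {prompt}"


class VendorBModel:
    # A second hypothetical provider behind the same interface.
    def complete(self, prompt: str, temperature: float = 0.0) -> str:
        return f"[vendor-b] {prompt}"


def answer(model: Model, question: str) -> str:
    # Product code only ever sees the Model protocol, never a vendor.
    return model.complete(question)
```

Under these assumptions, changing providers means changing the one argument passed to `answer`, which is why the swap fits in an afternoon.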

Here is the naive mental model, the one the leaderboards encourage: an agent is a model with a nice prompt. Make the model smarter, the agent gets better. So you chase the smartest model, and when it disappoints you, you wait for the next one.

Then you watch a real agent fail in production, and the diagnosis is never “the model wasn’t smart enough.” The model answered the question it was asked, correctly, in isolation. What broke was everything around the answer. It didn’t know the customer had already complained twice, because it couldn’t read the history. It tried to act and had no hands, because no tool was wired in. It produced a confident wrong answer and nothing caught it, because there was no test for that case. It shipped a regression because the only thing grading it was the same mind that wrote it.

None of those are model problems. The model is the cheapest part of the agent; the moat is the harness around it. The harness is the four things the leaderboard never measures: what the agent remembers, what it can touch, what catches it when it’s wrong, and what’s allowed to ship. Let me take them one at a time.

Memory: the part that doesn’t reset at midnight

Start with the failure everyone has felt. You tell an assistant something on Monday. On Tuesday it has no idea you ever spoke. Every conversation begins at zero, so every conversation re-litigates the basics: who you are, what you’re working on, what you already decided. The model is brilliant and amnesiac, which is a strange and useless combination, like hiring a genius who gets a head injury every night.

The naive fix is to stuff everything into the prompt. Paste the history, the documents, the last ten conversations, and let the model sort it out. This works for a toy and collapses for a company. There’s more context than any window holds. Most of it is irrelevant to this question. And dumping it all in doesn’t make the agent remember; it makes it skim, expensively, and miss the one line that mattered.

The real fix is a system that decides what to keep and how to fetch it: a durable store the agent writes to, an index that finds the three relevant facts out of ten thousand, and a discipline about what’s worth remembering versus what’s noise. That last part is the hard part. Remembering everything is the same as remembering nothing: the signal drowns. A useful memory is mostly a good forgetting policy.
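The three pieces (a store, an index, a forgetting policy) can be sketched in a few lines. This is a toy under loud assumptions: `MemoryStore`, `Fact`, and the `importance` score are hypothetical, the store is an in-memory list instead of something durable, and word overlap stands in for whatever retrieval a real system would use.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    text: str
    importance: float  # assumed score in [0, 1], supplied by the caller


@dataclass
class MemoryStore:
    min_importance: float = 0.5  # forgetting policy: below this, don't keep it
    facts: list[Fact] = field(default_factory=list)

    def remember(self, text: str, importance: float) -> bool:
        """Write a fact only if it clears the forgetting policy."""
        if importance < self.min_importance:
            return False
        self.facts.append(Fact(text, importance))
        return True

    def recall(self, query: str, k: int = 3) -> list[str]:
        """Return up to k facts that share the most words with the query."""
        words = set(query.lower().split())
        scored = []
        for fact in self.facts:
            overlap = len(words & set(fact.text.lower().split()))
            if overlap:
                scored.append((overlap, fact.importance, fact.text))
        scored.sort(reverse=True)  # most overlap first, importance breaks ties
        return [text for _, _, text in scored[:k]]
```

The interesting decision is in `remember`, not `recall`: anything the policy rejects never competes for the agent's attention later.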

Swap the model and the memory survives. Swap the memory and the smartest model in the world starts every day as a stranger.

Notice what that means for the swap test. The conversations you had, the decisions you logged, the relationships the system learned: none of it lives in the model. It lives in the store beside it. Upgrade the model and that history is untouched; the new model wakes up already knowing your company. Lose the store and no upgrade saves you. The model is the cheapest part of the agent; the moat is the harness around it, and memory is the first wall of it.

Figure: A model on its own is brilliant but resets every midnight, so each day starts as a stranger. Wrapped in a durable memory store with an index that fetches the few relevant facts, the same model wakes up already knowing the company, and that memory survives the model being swapped for a newer one.

Tools: a mind with no hands is a chatbot

A model can reason about sending the email. It cannot send it. That sentence is the entire difference between a chatbot and a coworker, and it has nothing to do with how smart the model is.

The naive version of tools is a list of functions bolted onto the prompt: here are forty things you can call, good luck. It demos beautifully and fails quietly. The agent picks the wrong tool. It calls the right tool with garbage arguments. It tries to write when the user only asked it to read. It fires an action that touches real money or real customers with no brake in front of it. The model isn’t broken; the wiring is. A list of functions is not a set of hands. It’s a pile of unlabeled levers.

A real tool layer is mostly the unglamorous part around each function. It’s the guardrail that won’t let a read-only question trigger a write. It’s the brake that pauses any action with consequences and asks a human first. It’s the recovery path for when a tool returns an error instead of a result. It’s the discipline that the agent gets a few sharp, well-described tools instead of forty fuzzy ones, because every extra lever is another way to pull the wrong one.
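Here is one way the guardrail, the brake, and the recovery path might wrap a single tool call. Everything named below (`Tool`, `call_tool`, the `writes` and `consequential` flags, the `approve` callback) is a hypothetical sketch, not a real framework's API; in practice `approve` would block on an actual human.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tool:
    name: str
    run: Callable[[dict], str]
    writes: bool = False         # does it change anything outside the agent?
    consequential: bool = False  # does it touch money or customers?


def call_tool(tool: Tool, args: dict, *, read_only: bool,
              approve: Callable[[Tool, dict], bool]) -> str:
    # Guardrail: a read-only request can never trigger a write.
    if read_only and tool.writes:
        return f"blocked: {tool.name} writes, but the request is read-only"
    # Brake: consequential actions wait for a human decision.
    if tool.consequential and not approve(tool, args):
        return f"paused: {tool.name} needs human approval"
    # Recovery path: a tool error becomes a result the agent can react to.
    try:
        return tool.run(args)
    except Exception as exc:
        return f"error: {tool.name} failed ({exc})"
```

Note that none of the three checks asks the model anything. They are plain conditionals, which is what lets them hold regardless of how persuasive the model's reasoning is.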

And here’s the part that survives the model swap: the integrations you built, the connection to the calendar, the CRM, the billing system, the guardrails around each, are yours, not the model’s. A new model inherits every one of them on day one and is immediately more useful than a smarter model with no hands. The reach of an agent is set by what it can touch, and what it can touch is something you build, once, carefully, and keep.

Evals: the only honest definition of “it works”

Now the part teams skip first and regret most.

Ask a writer agent if its change works and it will say yes. Of course it will: it just wrote it, it ran a test, and the test was green. But it wrote the test too, so the test asserts what the code already does. It defined “done,” so “done” is whatever it happened to finish. A mind grading its own work is the least reliable judge there is, and “it works” from the author is a feeling wearing a checkmark.

The naive trust model is: the model said it’s done, so it’s done. Replace one word and the whole thing falls apart. “Done” and “looks done” are different claims, and the gap between them is exactly where bugs live.

An eval is the fix, and an eval is not a model. It’s a corpus of real flows, the things customers actually do, run against the agent on every change, graded by a judge that has no stake in the work. Did the agent route this request correctly? Did it remember the fact it should have remembered? Did it refuse the action it should have refused? Each flow is a question with a known good answer, and the eval is the difference between “I believe it works” and “here are two hundred flows that passed and the three that didn’t, by name.”
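A corpus of flows with known good answers can be sketched very simply. The `Flow` record, the three example flows, and `toy_agent` are all made up for illustration, and exact string equality stands in for the judge, which in a real system would usually be more forgiving than `==`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Flow:
    name: str
    request: str
    expected: str  # the known good answer for this flow


def run_evals(agent: Callable[[str], str], corpus: list[Flow]) -> dict:
    """Run every flow and report failures by name, not just as a count."""
    failed = [f.name for f in corpus if agent(f.request) != f.expected]
    return {"total": len(corpus), "passed": len(corpus) - len(failed), "failed": failed}


# Hypothetical corpus: one routing flow per queue, plus one refusal.
CORPUS = [
    Flow("route-billing", "I was charged twice", "billing"),
    Flow("route-support", "The app crashes on login", "support"),
    Flow("refuse-delete", "Delete every customer record", "refuse"),
]


def toy_agent(request: str) -> str:
    # Hypothetical stand-in for the real agent.
    text = request.lower()
    if "charged" in text:
        return "billing"
    if "delete" in text:
        return "refuse"
    return "billing"  # deliberate bug: support requests are misrouted


report = run_evals(toy_agent, CORPUS)
```

With this toy agent, `report["failed"]` names `route-support` as the one failing flow, which is the "by name" part: the output points at a specific behavior to fix instead of a score to argue about.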

“It works” is a feeling. An eval is the only honest definition of done.

The reason this is harness and not model: the corpus is yours. It encodes what your product is supposed to do, gathered from how your users actually behave, including every weird edge they found that no benchmark would ever contain. Swap the model and you re-run the same corpus and instantly know whether the new one is better or just newer. A leaderboard tells you a model is smart in general. Your evals tell you whether it’s right at your job, and only one of those two numbers can be wrong in a way that costs you a customer.
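"Better or just newer" is a comparison you can compute once the corpus exists. The sketch below assumes a tiny hypothetical corpus and treats each candidate as a plain function from request to answer; `failures` and `compare` are illustrative names, and the rule that a candidate counts as better only if it regresses nothing is one possible policy, not the only reasonable one.

```python
from typing import Callable

# Hypothetical corpus: (flow name, request, known good answer).
CORPUS = [
    ("route-billing", "I was charged twice", "billing"),
    ("route-support", "The app crashes on login", "support"),
    ("refuse-delete", "Delete every customer record", "refuse"),
]


def failures(agent: Callable[[str], str]) -> set[str]:
    """Names of the flows this agent gets wrong."""
    return {name for name, request, expected in CORPUS if agent(request) != expected}


def compare(old: Callable[[str], str], new: Callable[[str], str]) -> dict:
    old_fail, new_fail = failures(old), failures(new)
    return {
        "fixed": sorted(old_fail - new_fail),      # failed before, passes now
        "regressed": sorted(new_fail - old_fail),  # passed before, fails now
        # Assumed policy: better means fewer failures and no regressions.
        "better": len(new_fail) < len(old_fail) and not (new_fail - old_fail),
    }
```

A candidate that fixes one flow and breaks another comes back with `better` set to false here, which is the "just newer" case.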

The gate: the cheapest part says yes, the moat says no

Stack up memory, tools, and evals and you still need one more thing, the one that ties them together: something that decides what’s actually allowed to ship.

The naive pipeline has no gate. The agent does the work, says it’s done, and the work merges, or worse, reaches a customer, on the strength of its own say-so. We’ve all seen where that goes. The agent that confidently set a reminder that never fired. The change that passed every local test and broke the second it touched the real world. The “done” that nobody re-checked against what done was supposed to mean.

A gate is a deterministic checkpoint that the smartest agent on the team does not get to talk its way past. It is not another model asked “does this look right?”; that one rubber-stamps. It is a gate that runs the evals, checks the proof, and refuses the change unless a real flow ran and a real result came back. It says yes by default to nothing. It is, on purpose, the most conservative thing in the building.
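"Deterministic" and "yes by default to nothing" translate into code fairly directly. In this sketch the `Evidence` record and its fields are assumptions about what proof might look like; the point is only that every path except the last one returns a refusal, and that the author's own claim is recorded but never consulted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    flows_run: int           # how many real flows were executed
    flows_failed: list[str]  # names of the flows that failed
    author_says_done: bool   # recorded, but never sufficient on its own


def gate(evidence: Optional[Evidence]) -> tuple[bool, str]:
    """Deterministic ship/no-ship decision; the default answer is no."""
    if evidence is None:
        return False, "not yet: no evidence submitted"
    if evidence.flows_run == 0:
        return False, "not yet: no real flow was run"
    if evidence.flows_failed:
        return False, "not yet: failing flows: " + ", ".join(evidence.flows_failed)
    return True, "ship: every flow ran and passed"
```

The same input always produces the same verdict, so there is nothing for an agent to argue with: the only way to change the answer is to change the evidence.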

And it is the most valuable thing in the building, which is the part that surprises people. The component that produces the least (no answers, no code, no clever output) is the one that makes everything else trustworthy. The model is the part that says “yes, I can do that.” The gate is the part that says “not yet, you didn’t prove it.” One of those is a commodity you can buy from three vendors. The other is a standard you have to build and defend, and it’s the reason the output can be trusted without a human re-reading every line.

Figure: Two pipelines for shipping an agent's work. On the left the model self-grades and its output ships on its own say-so, so the bug reaches the customer. On the right the same output passes through the harness (memory checked, tools guarded, the real eval corpus run, and a deterministic gate that refuses anything unproven), and only a genuine pass reaches the customer.

The swap test, run both directions

Put the four walls together and you can run one clean experiment that settles the whole argument.

Swap the model. Keep the harness. Your memory still holds every conversation. Your tools still reach every system. Your evals still grade against your real flows. Your gate still refuses what isn’t proven. The new model drops into all of it and, if your evals say so, the product gets better by Tuesday. Nothing about the company that the agent serves had to change.

Now swap the harness. Keep the model. Same brilliant model, but it remembers nothing, can touch nothing, is graded by no one, and ships on its own word. That isn’t a worse product. It’s not a product at all. It’s a very smart text box, exactly like the one your competitor also has, because they bought the same model from the same vendor.

That’s the test, and it only points one way. The model is the cheapest part of the agent; the moat is the harness around it. The thing you can replace in an afternoon is not where your advantage lives. The advantage lives in the four things you can’t buy off a shelf: what the agent remembers, what it can do, what catches it, and what holds the line on what ships.

The turn: build the part that’s still yours next year

Here’s the part that isn’t about architecture.

If you’re choosing where to spend the scarce, expensive hours of a small team, the leaderboard is pulling you toward the one decision that will make itself. The best model a year from now is not the one shipping today, and whatever you pick, you’ll swap it, cheaply, in an afternoon, behind a stable interface, the same way you swapped the last one. Every hour spent agonizing over that column is an hour spent on the part of the system that was always going to be replaced.

The hours that compound are the other ones. The memory you taught your company’s facts to. The tools you wired and guarded so an agent could actually act. The corpus of real flows that tells you, honestly, whether anything works. The gate you’re willing to defend on the day everyone’s tired and the deadline is tonight. None of that comes in a model release. All of it is still yours next year, and still working, no matter how many times the model underneath it changes.

The smartest model in the world is a part you rent. The harness is the thing you own.


That’s what we’re building at Apollo: not a bet on which model wins, but the company-shaped harness around whatever model wins: a brain that remembers, hands that act, evals that tell the truth, and a gate that says “not yet.” The model will keep getting cheaper and smarter, and that’s fine. We were never building the cheap part.
