Determinism is a feature: the boring parts are code, not prompts
Anything that has to be exact, auth, money math, routing, belongs in code; only judgment belongs in a prompt. A reliable agent is a thin language core inside a deterministic shell.
Apollo Space Research
Apollo Space
Ask a language model to add two columns of currency a thousand times and it will get one of them wrong. Not because the model is bad, because that is not what a language model is for. It predicts the next token; it does not carry the one across a decimal place the way a four-line function does, every time, forever. The function is boring. The boring part is exactly the part you cannot afford to get wrong.
So here is the question that decides whether an agent is a demo or a coworker: which parts of its work are allowed to be a prompt, and which parts have to be code?
Anything that has to be exact, auth, money math, routing, belongs in code; only judgment belongs in a prompt.
That one line is the whole post. The rest is the mechanism: why the obvious approach (prompt everything) feels great for a week and then quietly betrays you, where the seam actually goes, and what a system looks like once you stop asking a probabilistic thing to be deterministic.
The naive version: prompt everything, hope it holds
The first version everyone builds, including us, including the demos you’ve seen, is the maximalist one. One capable model, one big prompt, and you let it do all of it. Read the request. Decide who’s allowed. Pick the tool. Do the arithmetic. Format the result. Send it.
It is genuinely thrilling for an afternoon. The model reads a messy email, figures out the customer means the invoice from last month, totals the line items, and drafts a reply, all in one pass, no glue code, no schema. You think: why was software ever harder than this?
Then you run it a thousand times.
The failure isn’t dramatic. There’s no crash, no stack trace, no red. The model just drifts. The total is right nine hundred and ninety times and off by a rounding error ten times, and those ten land in front of a customer. The same request that routed to “billing” yesterday routes to “support” today, because you swapped two sentences in the prompt and the model’s mood moved with them. Once, on a phrasing nobody anticipated, it decides a read-only request is a permission to act and does the thing it should have refused.
The model didn’t break. It was never deterministic to begin with, you just asked it to pretend.
That’s the trap. A prompt is a probability distribution wearing the costume of a function. Most of the time the most-likely output is also the correct one, so it looks like a function. The gap only shows up at scale, on the inputs you didn’t test, in the dimensions you can’t see, and those are precisely the inputs production is made of. The bottleneck never disappears. It moves from “can the model do this once in the demo” to “does it do this every single time in front of a real person,” and the second question has a much worse answer.
You feel the pain as flakiness. Same input, different output. A test that’s green on Tuesday and red on Friday with no code change. An audit you can’t run because there’s no logic to point at, just a prompt and a shrug.
The seam: exactness is code, judgment is a prompt
The fix is not a better prompt. It’s a line drawn through the work itself.
Look at any agent task and you’ll find it’s actually two kinds of work braided together. One kind has a right answer that does not depend on phrasing: this user either has permission or doesn’t, these numbers either sum to this total or they don’t, this request either belongs in this queue or that one. The other kind has no single right answer and lives entirely in interpretation: what did this person actually mean, which of three reasonable replies fits the situation, is this email worth flagging at all.
The first kind is exactness. The second kind is judgment. And the entire reliability of an agent comes down to refusing to confuse them.
Anything that has to be exact, auth, money math, routing, belongs in code; only judgment belongs in a prompt.
Take the three from the title, because they’re the three that hurt most when they drift.
Auth. Whether someone can do a thing is never a judgment call. It’s a lookup against a rule that either matches or doesn’t. If a model is the thing deciding access, then your security boundary has a temperature setting, and a phrasing nobody tested can talk it open. The model can read the request and tell you what the user is trying to do. Whether they’re allowed to do it is a function with a boolean answer, run before the model’s intent is ever acted on. Same input, same answer, every time, and a log line that says why.
Money math. Totals, taxes, proration, the running balance, these have exactly one correct value and an auditor who will check it. A model that’s right 99.9% of the time is, on money, wrong. The model’s job is to understand that the customer is asking about last month’s invoice. Computing what last month’s invoice actually came to is arithmetic, and arithmetic is code that can’t round differently on a bad day.
Routing. Where work goes next, which queue, which agent, which approval step, is the skeleton of the whole system. If routing is a vibe, the system has no shape; the same case wanders to a different place each run and nobody can reason about flow. The model can classify, “this reads like a refund request” is a judgment, and a fine one for a model to make. But turning that classification into “therefore it enters the refund workflow, which requires this approval” is a deterministic mapping. Classification is soft; the consequence is hard.
Notice the pattern. In every one, the model still does real work, it reads, it interprets, it classifies. What it never does is decide the exact thing. It hands its interpretation to code, and code does the part that has to be the same on Friday as it was on Tuesday.
The shape: a thin language core in a deterministic shell
Once you’ve drawn the seam, the architecture draws itself, and it’s the inverse of the naive one.
The naive version is a thick model wrapped in thin glue, the model does everything, the code just passes strings around. The reliable version is the opposite: a thin language core doing only the part that genuinely needs language, wrapped in a thick deterministic shell that owns every exact decision around it.
The shell is the boring code you trust. It checks the permission before the model’s intent is allowed to matter. It runs the arithmetic on real numbers, not on tokens. It validates that what the model returned is shaped the way the rest of the system expects, and if it isn’t, it rejects it rather than passing malformed judgment downstream. It maps the model’s soft classification onto a hard route. It writes the audit line. None of that ever varies, because none of it is a prompt.
The core is the model, asked one narrow thing at a time: what does this person mean, which of these options fits, write this in a human voice. Questions with no single right answer, the ones where a probabilistic, fluent, context-reading thing is genuinely the best tool on earth. You give it the judgment and nothing else. You do not give it the checkbook, the keys, or the routing table.
Why this inversion buys reliability is almost arithmetic. Every place you let the model make an exact decision is a place that can drift. Shrink the model’s surface to only the questions that have no exact answer, and you’ve shrunk the drift to exactly the places where drift doesn’t matter, because there was no single right answer to drift away from. A different-but-reasonable interpretation of a vague email is fine; that’s judgment doing its job. A different total on the same invoice is a bug; that’s exactness leaking into a prompt.
There’s a test you can apply to any line in the system. Ask: if you run this twice on the same input, do you need the same answer? If yes, it’s code. If a reasonable person could give two different answers and both be acceptable, it’s a candidate for a prompt. Auth, money, routing, same answer required, every time. Tone, intent, which-of-three, variation is allowed, even welcome. The seam isn’t a matter of taste. It’s a property of the work.
And the boundary is also where trust comes from. A deterministic shell is auditable, you can read the function that decided access, replay the math, point at the rule that routed the case. A prompt is not auditable in the same way; you can read it, but you can’t prove what it’ll do on the input you haven’t seen. Putting the exact decisions in code isn’t just about correctness. It’s the only way to ever answer “why did the system do that” with a line of logic instead of a shrug.
The turn: the boring parts are where the trust lives
Here’s the part that isn’t about architecture.
There’s a romance in the idea that the model does everything, that you describe the company in English and the intelligence handles the rest. It’s a seductive picture, and it’s why the maximalist version keeps getting built. But a coworker you can actually rely on is not the one who’s brilliant on the demo and unpredictable in production. It’s the one whose boring parts are boring every single time. The colleague who always checks whether you’re allowed before they hand over the keys. Who never gets the total wrong, even at the end of a long day. Who sends the work to the right place without being reminded.
That reliability was never going to come from a smarter model, because the thing you need from auth and money and routing isn’t intelligence, it’s sameness. The same answer on Friday as on Tuesday. The thing you need from intent and tone and judgment is the opposite, fluency, context, a reading of what a person actually meant. The mistake is asking one component to be both. Determinism is a feature, and you get it by giving the exact work to the part of the system that can’t have a bad day, and giving only the judgment to the part that can.
The model will keep getting smarter. That’s good, and it changes the judgment half, richer interpretation, better drafts, sharper classification. It does not change the other half. Money will still have one correct total. Permission will still be a boolean. The right place for a case will still be a fact, not a feeling. The agents that companies actually trust with real work will be the ones whose builders knew which half was which.
That’s how we build agents at Apollo, a thin language core for the judgment, a deterministic shell for everything that has to be exact, so the boring parts stay boring under load. If you’ve ever watched a brilliant demo fall apart the moment it met a thousand real requests, you already know the model was never the part that needed to be reliable. The arithmetic was.
Apollo runs your company's repetitive ops so your team doesn't.
Join the waitlist for early access, founding-user pricing, and a front-row seat as we ship.
Join the waitlistThe hidden tax of parallel agents is a migration diamond
Six agents writing to one schema conflict in the database, not the code, and CI dies at "multiple heads."
EngineeringAn orchestrator that can't survive its own crash isn't one
A crash that erases the orchestrator's reasoning loses the one thing you can't rebuild.
EngineeringPut a deterministic gate in front of your smartest reviewer
The cheapest defect-catch is a dumb script that checks two merged branches still boot before any judgment.