Report two numbers or you're reporting none

A fast local eval is a proxy and the deployed surface is the truth. Collapse them and you ship the proxy.

Apollo Space Research

Apollo Space

· 10 min read

A local test suite goes green in twenty-five seconds. An engineer reads the green, writes “works” in the pull request, and the change merges. Three days later a customer clicks the exact button the green test was supposed to cover, and it does the wrong thing. The test wasn’t lying. It was answering a different question than the one anybody cared about.

The green meant the proxy passed. Nobody ever checked the truth.

This post is about a single discipline we wired into how we build agents at Apollo: never let one number stand in for both. The fast number tells you the change is plausible. The slow number tells you it’s real. Report one of them as “done” and you’ve quietly reported neither.

The naive version: one number, and it’s the fast one

The obvious way to know whether an agent change works is to run the cheap check and trust it. You have a local harness: it spins up the runtime in-process, replays a scenario, and prints a pass. It runs in seconds. It costs pennies. You can run it on every commit, fan it out across a hundred scenarios, A/B two models before lunch. It is, genuinely, one of the best tools on the bench.

So it’s tempting to let it be the whole answer. Green here, ship there.

It works beautifully right up until it doesn’t, and the failure is specific. The local harness runs the agent in a clean room. The model is loaded the way the test loaded it. The memory is whatever the fixture seeded. The tools return whatever the harness wired them to return. Every variable that isn’t the change is pinned to a friendly value, which is exactly what makes the harness fast, and exactly what makes its green untrustworthy as a final answer.

A clean-room pass tells you the change is plausible. It does not tell you it survives contact with the world.

The deployed surface has none of those pins. Real auth. Real org boundaries. The memory the user actually accumulated over four turns of conversation. The tool that times out under load instead of returning a tidy fixture. A pass in the clean room and a pass on the deployed surface are not the same claim, and the moment you treat the first as if it were the second, you’ve shipped the proxy and called it the product.

This is the trap, stated plainly. The fast eval is so good that you stop asking what it can’t see. And what it can’t see is the only thing the customer touches.

Two numbers, and why neither one is optional

The fix is almost embarrassingly simple to say: every behavioral claim carries two numbers, not one.

The first is the inner number, the local eval. It runs the real runtime in-process, replays the scenario, and reports a pass rate fast and cheap enough to run on every change. The second is the outer number, the same behavior exercised on the deployed surface a customer would actually use, with real auth, real memory, real tools, real latency. Inner is the fast proxy. Outer is the truth. You report both, always, side by side, and you never let one of them disappear into the word “done.”

Figure: Two numbers run as one pipeline. A fast inner eval replays the scenario in a clean room as a cheap proxy, and a slower outer check exercises the same behavior on the deployed surface as the truth. Reporting only one number ships the proxy.
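One way to make "both, always, side by side" hard to skip is to give the report a shape that has no slot for a single number. The sketch below is illustrative only: `BehaviorReport`, its field names, and the example claim are hypothetical, not a description of Apollo's tooling. It shows a record that always prints the inner and outer results together, and says so loudly when the outer check has not run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BehaviorReport:
    """One behavioral claim carrying both numbers (hypothetical structure)."""

    claim: str
    inner_pass_rate: float            # local eval in the clean room (the proxy)
    outer_pass_rate: Optional[float]  # deployed surface (the truth); None = not run

    def summary(self) -> str:
        # Always render both slots, so a missing outer number is visible
        # instead of silently disappearing into the word "done".
        inner = f"inner {self.inner_pass_rate:.0%}"
        if self.outer_pass_rate is None:
            outer = "outer NOT RUN"
        else:
            outer = f"outer {self.outer_pass_rate:.0%}"
        return f"{self.claim}: {inner} | {outer}"


report = BehaviorReport("button triggers the expected action", 1.0, None)
print(report.summary())
```

The design choice is that the outer slot is required but nullable: a report can exist before the deployed check has run, yet it cannot be rendered without admitting that.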

The reason you can’t drop the inner number is speed. If the only signal you trust is the deployed one, every check costs a deploy, and a discipline that’s expensive on every change is a discipline that gets skipped on the change that matters. The inner loop is what lets you catch the obvious break in seconds instead of waiting on a pipeline to tell you what a local run would have screamed.

The reason you can’t drop the outer number is honesty. The inner loop, by construction, can’t see what it pinned. It cannot fail on the auth boundary it stubbed, the memory it didn’t accumulate, the tool it faked. So an inner pass is a real, useful, load-bearing signal, and it is also structurally blind to a whole class of failure. The outer number is the only one earned in the same conditions the user lives in.

Here’s the rule we actually follow: inner-green is “behaviorally green, pending the deployed check,” never “done.” The fast local eval is a proxy and the deployed surface is the truth: two numbers, reported as two numbers. The instant they collapse into one, you’ve lost the only information that told you which kind of green you were looking at.
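The rule can be written down as a small status function. This is a minimal sketch under assumptions: the `Status` enum and `claim_status` are hypothetical names, and the third state for anything that is not green is an addition made here for completeness. What it illustrates is that an inner pass with no deployed result can only ever map to the pending status.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    NOT_GREEN = "not green"
    PENDING_DEPLOYED_CHECK = "behaviorally green, pending the deployed check"
    DONE = "done"


def claim_status(inner_green: bool, outer_green: Optional[bool]) -> Status:
    """Map the two results to a status.

    outer_green is None when the deployed check has not been run yet.
    """
    if inner_green and outer_green is True:
        # Only both-green counts as done.
        return Status.DONE
    if inner_green and outer_green is None:
        # Inner-green alone is a pending state, never "done".
        return Status.PENDING_DEPLOYED_CHECK
    # Any red result, inner or outer, needs investigation first.
    return Status.NOT_GREEN


print(claim_status(True, None).value)
```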

Why collapsing them feels safe and isn’t

There’s a softer version of this mistake that sounds responsible, and it’s the one most teams reach for, so it’s worth naming.

The soft version is: run the fast eval and trust it, because it runs the real runtime. Not mocks, but the actual agent loop. That feels rigorous. It is more rigorous than a unit test that asserts a stub returned a stub. So the reasoning goes: this isn’t a toy check, it exercises the real code, therefore green here is green everywhere.

But “runs the real runtime” and “runs in the real conditions” are different claims, and the gap between them is precisely where things break. A flight simulator runs the real avionics software. It does not run the real weather, the real ice on the wing, the real bird. You would not certify a pilot on simulator hours alone and call them flight-ready, not because the simulator is fake, but because it pins the variables that bite hardest in the air. The local eval is the simulator. It’s not fake. It’s just not the sky.

So the danger isn’t a bad eval. It’s a good eval trusted past its evidence. The better your inner loop gets, the more convincing its green becomes, and the stronger the pull to let it be the only number. A weak proxy gets double-checked out of suspicion. A strong proxy gets believed, and a believed proxy is how the clean-room pass becomes the production surprise.

The discipline isn’t “distrust the fast number.” The fast number is excellent. The discipline is to refuse to let it answer a question it structurally can’t see. It can tell you the change is plausible. It cannot tell you the customer’s button does the right thing. Those are two questions, and you owe two answers.

What the two numbers actually tell you

Once you commit to reporting both, the two numbers stop being redundant and start being a diagnosis. The interesting information is in how they disagree.

Figure: A two-by-two of inner and outer results. Both green means ship, both red means a located bug, inner-green with outer-red exposes the condition the clean room pinned away, and inner-red with outer-green means the eval itself is wrong and must be fixed.

When both are green, you have something close to earned confidence: the change is plausible and it survives the deployed surface. That’s the only state that counts as done, and it’s worth noticing how rarely a single number could ever have told you that.

When both are red, you have a clean diagnosis: a real, reproducible gap, located. That’s the easy case. The proxy and the truth agree, and you go fix the thing they agree is broken.

The two interesting states are the splits.

Inner-green, outer-red is the whole reason the discipline exists. The clean room passed and the world failed, which means the failure lives in exactly the thing the inner loop pinned: the auth boundary, the accumulated memory, the tool under real load. The disagreement doesn’t just tell you something broke. It tells you where: in the gap between the simulator and the sky. That’s a found bug with an address, not a vague feeling that “the deployed version seems off.”

Inner-red, outer-green is the one people forget, and it’s a finding too. If the deployed surface does the right thing while the local eval insists it’s broken, the eval is wrong: a stale fixture, a scenario that drifted from real behavior, a pin that no longer matches reality. A red inner number isn’t always a code bug. Sometimes it’s a bug in the proxy itself, and the only way you’d ever catch it is by having a second number to disagree with. A team running one number would “fix” working code to satisfy a broken test.

One number can be green or red. Two numbers can tell you which kind, and which kind is the entire question.
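The four states above fit in a lookup table. The sketch below is one possible encoding; the function name `diagnose` and the wording of each message are illustrative choices, not part of any existing tool. It takes the two results and returns the diagnosis the disagreement (or agreement) points to.

```python
def diagnose(inner_green: bool, outer_green: bool) -> str:
    """Return what the pair of results says, keyed on (inner, outer)."""
    table = {
        (True, True): "ship: plausible and survives the deployed surface",
        (False, False): "located bug: proxy and truth agree, fix the code",
        (True, False): (
            "gap bug: look at what the clean room pinned "
            "(auth boundary, accumulated memory, tools under load)"
        ),
        (False, True): (
            "eval bug: fix the stale fixture or drifted scenario, "
            "not the working code"
        ),
    }
    return table[(inner_green, outer_green)]


# Print the full two-by-two.
for inner in (True, False):
    for outer in (True, False):
        print(f"inner={inner!s:5} outer={outer!s:5} -> {diagnose(inner, outer)}")
```

A single boolean could only ever index two of these four rows, which is the whole argument for carrying the second number.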

The turn: a number is a promise about a question

What is a person actually doing when they paste a green checkmark into a pull request and write “works”?

They’re making a promise to the reviewer, to the next engineer, and to the customer downstream who will never read the test. The promise is: I checked the thing you care about, and it’s good. And the quiet tragedy of the single fast number is that it lets a sincere, careful engineer make that promise about a question nobody asked. The clean room passed. The button was never touched. The promise was kept to the letter and broken in spirit, and the only person who finds out is the customer.

Two numbers is, underneath the tooling, just intellectual honesty made into a habit. It’s the engineer who, asked “does it work?”, refuses to answer with the cheap green alone, who says, “it passes locally, and here’s what the deployed surface did,” because they know those are two different things and they respect you too much to blur them. That refusal isn’t something you can install. It’s a posture: a working distrust of your own good news, strong enough to survive a tight deadline and a green screen at 6pm.

The fast number will keep getting faster. The proxies will keep getting better, closer to the truth, more convincing in their green. None of that retires the second number. The better the simulator, the more disciplined you have to be about still flying the plane, because the failures that reach a customer are, by definition, the ones the simulator couldn’t show you.

A fast local eval is a proxy and the deployed surface is the truth. Collapse them, and the green you ship is a promise about a question nobody asked.


That’s the discipline we build into Apollo Space: agents whose every “it works” carries the fast proof and the real one, because the surface a customer touches is the only one that gets a vote. If you’ve ever shipped a green that turned red the moment a real person clicked it, you already know which of the two numbers you forgot to read.