It passed every test locally. It was broken the second it deployed.

Local-green is a fast proxy for the truth, never the truth. The deployed surface is the only judge.

Apollo Space Research

Apollo Space

April 22, 2026 · 10 min read

The test suite was green. Every unit test, every integration test, the local end-to-end run: all of it passed on the engineer’s machine, twice, with the kind of clean output that makes you reach for the deploy button. So we deployed. And the very first real person who touched the live surface hit a wall that the laptop had never once produced.

Nothing was wrong with the code. Nothing was wrong with the tests either. What was wrong was the belief that one had proven the other.

Local-green is a fast proxy for the truth, never the truth. The deployed surface is the only judge.

This post is about that gap: why a machine full of passing tests can ship something broken, why the answer is not “write more tests,” and the discipline we built so that local-green stops pretending to be done.

The naive version: green on my machine means done

The obvious mental model is the one every test framework quietly teaches you. You write the code. You write the tests. The tests go green. Green means correct. Correct means shippable. The progress bar fills, the checkmarks line up, and the satisfying part of your brain files the task under finished.

It is a wonderful feeling and it is, surprisingly often, a lie.

Not because the tests are bad. The tests are doing exactly what they were asked to do: they are confirming that the code behaves the way the code’s author expected, inside the world the author’s machine constructed. That world is the problem. The laptop has its own database that someone seeded by hand, its own configuration with a few values quietly tweaked months ago, its own clock, its own network where nothing is ever far away and nothing is ever slow. The tests pass because the code and its home environment grew up together and agree with each other.

Green tells you the code agrees with its test. It does not tell you the code agrees with the world.

The deployed surface is a different world. The database has a row the seed script never created. A configuration value that was present locally is absent in the running environment. A call that returned in a millisecond on the loopback now crosses a real network and sometimes does not return at all. The user clicks in an order no test imagined, because the user did not read the test. Every one of those differences is a place where green-on-my-machine and works-in-production quietly disagree, and the disagreement is invisible until someone real walks into it.
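The missing-configuration case is the easiest of these to picture in code. What follows is a minimal sketch, not our actual setup: the setting name `PAYMENTS_URL`, the two environment dictionaries, and both reader functions are hypothetical, chosen only to show how a lenient read lets the local and deployed worlds disagree silently while a strict read makes the disagreement loud at startup.

```python
# Hypothetical setting name and environments, for illustration only.
LOCAL_ENV = {"PAYMENTS_URL": "http://localhost:9000"}  # tweaked by hand months ago
DEPLOYED_ENV = {}                                      # nobody pictured needing it


class MissingSetting(Exception):
    """Raised when a required setting is absent from the environment."""


def lenient_setting(env, name):
    # Silent fallback: hides the absence, so the failure surfaces later,
    # somewhere else, in front of a real user.
    return env.get(name, "")


def strict_setting(env, name):
    # Fail immediately and name the setting, instead of failing mid-flow.
    if not env.get(name):
        raise MissingSetting(name)
    return env[name]


def report(env):
    """Show what each reader sees in the given environment."""
    lenient = lenient_setting(env, "PAYMENTS_URL")
    try:
        strict_setting(env, "PAYMENTS_URL")
        strict = "ok"
    except MissingSetting as exc:
        strict = f"missing: {exc}"
    return {"lenient": lenient, "strict": strict}


if __name__ == "__main__":
    print("local   ", report(LOCAL_ENV))
    print("deployed", report(DEPLOYED_ENV))
```

Locally, both readers agree and everything looks green. Only in the deployed dictionary does the strict reader report the gap, and even then only for a setting someone already thought to require.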

This is the trap, and it is worth stating plainly so we can stop falling into it. The naive version isn’t “the engineer was careless.” The engineer ran the tests and the tests passed. The naive version is believing that a passing test on a local machine is the same kind of fact as a working flow on the deployed one. It is not. One is a proxy. The other is the truth.

Two worlds that look identical until they disagree: the local machine with its hand-seeded database, tweaked config, and instant loopback all reporting green, versus the deployed surface with its real data, missing config, and slow network where the actual user finally walks in and hits the wall.

Why “write more tests” doesn’t close the gap

The instinct, once you’ve been burned, is to write more tests. If the local suite missed the production failure, the suite must be incomplete, so fill the holes. Add a case for the missing config. Add a case for the slow call. Add a case for the row the seed script forgot.

This is real progress and it is also a treadmill you cannot finish running.

The reason is simple once you say it out loud. Every test you write encodes a failure you already imagined. You can only test for the production difference you thought of, and the entire category of bug we’re talking about is the difference you didn’t think of. The seed data you forgot is, by definition, the one you didn’t know to add. The config that was missing in production was missing precisely because nobody pictured needing it. You cannot write a test for the assumption you don’t know you’re making, and the local environment’s whole job, helpfully and treacherously, is to make a hundred assumptions for you so you don’t have to think about them.

Local-green is a fast proxy for the truth, never the truth. The deployed surface is the only judge. More tests make the proxy better. They do not make it the truth. A richer local suite still runs in the same convenient world, still inherits the same hidden agreements between the code and its home, and still cannot tell you what happens when the code is somewhere that does not agree with it.

So we stopped trying to make the proxy perfect. A perfect proxy is still a proxy. We did something else instead: we made the proxy fast and cheap and ran it constantly, and then we put a second, slower judge after it that runs on the only environment whose verdict actually counts.

The two-loop discipline: a fast proxy, then the real judge

Here is the shape we landed on, and the whole idea fits in one sentence: run the cheap check early and often, and never let it speak for the expensive one.

We call them the inner loop and the outer loop, and the distinction between them is the entire point.

The inner loop is local. It is everything that runs in seconds on a developer’s machine or in a fast harness: the unit tests, the integration tests, a local run of the actual behavior against a real-ish environment. Its job is speed. When you change something that could break the thing, you want to know in seconds, not after a deploy, because a failure you catch in seconds costs almost nothing and a failure you catch after shipping costs attention, a rollback, and the trust of whoever hit it. The inner loop’s verdict has a precise name: behaviorally green, pending the real test. Not done. Pending.

The outer loop is the deployed surface. It is the live environment, with the real data and the real network and the real configuration, exercised the way a real person would exercise it: a flow clicked through to the end on the thing that’s actually running. Its verdict is the only one that gets to say done. Not because the inner loop is untrustworthy, but because the inner loop, by construction, cannot see the world the outer loop lives in.

The discipline is to keep these two verdicts as two numbers and never collapse them into one. Inner-green is a count of how many fast checks pass. Outer-green is a count of how many real flows work on the deployed surface. The day you let the first number stand in for the second is the day a green machine ships a broken product, which is exactly the day this whole post is about.
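One way to keep the two numbers from collapsing is to make the data structure refuse to do it. This is a rough sketch under our own assumptions: the `Verdict` class, its field names, and the "inner red" and "outer red" labels are hypothetical, and only the phrases "behaviorally green, pending the real test" and "done" come from the discipline described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Verdict:
    inner_passed: int                   # fast local checks that passed
    inner_total: int                    # fast local checks that ran
    outer_passed: Optional[int] = None  # real flows that worked when deployed
    outer_total: Optional[int] = None   # None means the outer loop has not run

    def status(self) -> str:
        if self.inner_passed < self.inner_total:
            return "inner red"
        # Inner-green alone can never return "done".
        if self.outer_passed is None or self.outer_total is None:
            return "behaviorally green, pending the real test"
        # Zero exercised flows is not a pass: something real has to be walked.
        if self.outer_total > 0 and self.outer_passed == self.outer_total:
            return "done"
        return "outer red"


if __name__ == "__main__":
    print(Verdict(inner_passed=42, inner_total=42).status())
    print(Verdict(42, 42, outer_passed=3, outer_total=3).status())
```

The design choice that matters is the `None` default: a release with no outer-loop result has no path to "done", however many inner checks pass.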

The two-loop discipline as a sequence: a fast local inner loop reports behaviorally-green in seconds and feeds the deploy, then the slow outer loop walks a real flow on the deployed surface and returns the only verdict that says done, and a failure there flows straight back to mint a new inner check, so the proxy learns what it missed.

The inner loop earns its keep by being fast, not by being right

It would be easy to read this as “local tests don’t matter.” The opposite is true. The inner loop is the most-used tool we have, and it earns that by being relentlessly cheap.

Consider the alternative every team has lived through. A subtle bug in the behavior of a running system, the kind where the logic is almost right and goes wrong only on the second interaction, gets caught only after deploy, by watching the live surface, one slow cycle at a time. Each cycle is a push, a wait for the environment to come up, a click-through, a squint at the output. Minutes per attempt. Suppose a single elusive bug takes a dozen attempts to pin down; that is a chunk of an afternoon spent waiting on infrastructure to tell you something a local run could have told you before lunch.
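To put rough numbers on that afternoon: the dozen attempts come from the scenario above, while the five minutes per deployed cycle and ten seconds per local cycle are assumptions picked for illustration, not measurements.

```python
ATTEMPTS = 12                 # "a dozen attempts", from the scenario above
OUTER_CYCLE_SECONDS = 5 * 60  # assumed: push, wait, click through, squint
INNER_CYCLE_SECONDS = 10      # assumed: one local run


def total_minutes(attempts, seconds_per_cycle):
    """Total time spent cycling, in minutes."""
    return attempts * seconds_per_cycle / 60


if __name__ == "__main__":
    outer = total_minutes(ATTEMPTS, OUTER_CYCLE_SECONDS)
    inner = total_minutes(ATTEMPTS, INNER_CYCLE_SECONDS)
    print(f"outer loop: {outer:.0f} min, inner loop: {inner:.0f} min")
```

Under those assumed figures, the same twelve attempts cost an hour on the deployed surface and two minutes locally.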

So we push as much of the finding as possible into the inner loop, where a cycle is seconds, and reserve the outer loop for the judging, where the verdict is real. Find it cheap. Judge it true. The mistake is never that we ran the fast check. The mistake is ever letting the fast check sign the release.

This is also why the failure, when the outer loop catches one, is not a lament; it is an address. A flow that breaks on the deployed surface is a located gap: this environment, this step, this difference from local. And the most valuable thing we do with that address is turn it back into an inner-loop check, so the proxy that missed it once can never miss it again. The outer loop doesn’t just judge. It teaches the inner loop what the world looks like.
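Here is one narrow way that feedback step could look, sketched for a single kind of difference: a setting present locally but absent where the flow broke. Everything named here (`OuterFailure`, `mint_inner_check`, the two example flows, the `PAYMENTS_URL` setting) is hypothetical, and a real version would need to cover data and network differences as well.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class OuterFailure:
    environment: str               # where it broke
    step: str                      # which step of the flow broke
    missing_keys: Tuple[str, ...]  # settings present locally, absent there


def mint_inner_check(failure: OuterFailure, local_env: Dict[str, str]):
    """Turn a located outer-loop failure into a fast local check."""
    def check(flow: Callable[[Dict[str, str]], str]) -> bool:
        # Rebuild the local environment with the deployed difference applied.
        env = {k: v for k, v in local_env.items()
               if k not in failure.missing_keys}
        try:
            flow(env)
        except Exception:
            return False
        return True

    check.__name__ = f"regression_{failure.environment}_{failure.step}"
    return check


def fragile_flow(env):
    # Assumes the setting always exists, as it did on the laptop.
    return "charging via " + env["PAYMENTS_URL"]


def hardened_flow(env):
    url = env.get("PAYMENTS_URL")
    if url is None:
        return "payments unavailable, order saved for retry"
    return "charging via " + url


if __name__ == "__main__":
    failure = OuterFailure("production", "checkout", ("PAYMENTS_URL",))
    check = mint_inner_check(failure, {"PAYMENTS_URL": "http://localhost:9000"})
    print(check.__name__, check(fragile_flow), check(hardened_flow))
```

The minted check runs in the inner loop from then on, so the same difference is caught in seconds on every change instead of after the next deploy.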

The turn: the only verdict that matters is a person, not a checkmark

I want to be honest about what’s underneath all of this, because it is not really about tests.

A checkmark is a promise that something will work. A working deployment is the promise kept. Every engineering culture has to decide which of those it celebrates, and the quiet tragedy of a lot of software is that it celebrates the promise. Green builds, passing pipelines, dashboards full of reassuring color: all of it is the feeling of done, manufactured by tools that have never once met your user. The feeling is real. The doneness is not. And the distance between them is measured in the exact number of people who clicked the live thing and got the wall instead of the result.

The reason we hold the line between the two loops so stubbornly is that someone is standing on the other side of the outer loop. Not a suite. A person, with a task they needed finished, who does not care how green the laptop was. They will judge the work in the only court that has ever mattered, the one where the software is actually running and they are actually using it, and they will be right, because that is the only place the truth has ever lived. A passing test is us telling ourselves the work is done. A working deployment is the world agreeing.

Local-green is a fast proxy for the truth, never the truth. The deployed surface is the only judge. We built our whole way of shipping around refusing to forget that, because the moment a green machine gets to declare victory is the moment the person on the other side starts paying for our confidence.

That’s what we’re building at Apollo Space: an operating system that treats the deployed surface as the judge and a passing test as the polite opinion it really is. If you have ever pushed something green and watched it break in front of the first real person who touched it, you already know which of those two verdicts you’d trust with your name.
