Engineering

The hardest reviewer on the team isn't human

The best contributor to a codebase is the one who says "not yet, you didn't test this."

ASR

Apollo Space Research

Apollo Space

March 14, 2026 · 11 min read

The most productive thing that happened in our codebase last week was a rejection. An agent had written a clean change, the tests were green, the diff was small and readable, and another agent read all of it and sent it back with one line: not yet, you didn’t test this. No new code. No clever insight. Just a refusal, delivered the instant the work looked finished.

That refusal is worth more than the code it blocked. And the agent that wrote it produces almost nothing all day.

Here is the idea this post is about, stated plainly so you can carry it out the door: the best contributor to a codebase is the one who says “not yet, you didn’t test this.”

Most teams treat review as the tax you pay before shipping. We treat it as the only place quality is actually decided. This is the story of why we gave that job to a machine, why the machine is harder to satisfy than any human we could hire, and what that bought us that “smarter writers” never could.

The naive version: ship on green

The obvious way to build software with agents is the way the demos show it. One capable agent takes a task, plans it, writes the code, runs the test suite, sees green, and reports back: done. If the tests pass, you ship. It feels like the future, and for an afternoon it is.

Then you hit the thing nobody puts in the demo.

The failure isn’t that the agent writes bad code, the code is usually fine. The failure is subtler and worse: the agent that wrote the code is the same one grading it. It wrote the test, so the test asserts exactly what the code already does. It decided what “done” meant, so “done” became whatever it happened to finish. It never tested the input it didn’t imagine, because if it had imagined that input, it would have handled it. Green tests on a self-graded change don’t prove the change is correct. They prove the author agreed with themselves.

Every engineering team learned this lesson the hard way and built the same fix. Not smarter engineers, a second pair of eyes with no stake in the work. Code review exists because the author is the single worst person to certify their own code. The pride that made the work good is the exact pride that makes the grade unreliable.

So the naive fix everyone reaches for next is to add a reviewer agent. Sensible. It also doesn’t work, and it’s worth seeing why before the real version makes sense.

Why a polite second agent is worthless

Add an agent that double-checks the writer. Ask it, “does this look correct?” Watch what happens.

It agrees. Almost every time. The change is plausible, the tests are green, the writer’s explanation is coherent and confident, so the reviewer, asked a yes-or-no question with a clearly preferred answer, says yes, looks correct. You have not added a second opinion. You’ve added a second signature on the same blind spot. Two agents now believe the work is done. Neither of them tried to prove it isn’t.

This is the trap that makes most “AI reviewer” features theater. A reviewer with no adversarial mandate is a rubber stamp with extra latency.

The fix is not a better reviewer. The fix is a different question.

We don’t ask our reviewer “is this right?” We ask it “how would this fail in front of a real user, and where is the flow that proves it doesn’t?” That reframe is the entire mechanism. An agent told to hunt for the flaw goes and finds the flaw, the unhandled input, the path no test exercised, the two things that happen at once. And an agent told to demand proof stops accepting explanations and starts demanding a flow that actually ran. The reviewer isn’t being difficult. It’s structurally capable of the one thing the author can’t be: skeptical of work it has no pride in.

On the left, an agent writes code, runs its own test, sees green, and merges its own work, so the bug ships; on the right, a reviewer that never wrote the code asks which real flow proved the change, runs it, attaches the trace, and merges only on a result.

The best contributor to a codebase is the one who says “not yet, you didn’t test this.” Not because it’s smarter than the writer. Because it’s the only one in the room not trying to be done.

A claim is not a result

Here is the line our reviewer is built to hold, and it’s the whole standard compressed to one sentence: “it works” is a feeling, and a feeling is not a result.

“It works” is something you believe. “This exact flow ran end to end, and here is the trace” is something you can check.

Watch how often the two get confused. A change arrives wearing every signal of doneness, green checks, a tidy diff, a confident summary that says the feature is complete. None of that is evidence the feature does what a user needs. It’s evidence the author thinks it does. The gap between those two is exactly where shipped bugs live, and it’s invisible to anyone who reads the summary instead of running the flow.

So our reviewer refuses the summary and demands the run. Not “the tests pass,” but which test, against what input, proving what behavior an actual person would care about. A claim with no runnable flow behind it is treated as no claim at all, the same as silence. This sounds harsh. It is the cheapest harshness money can buy.

Think about where a bug actually costs you. Caught by a reviewer that runs in seconds, it’s a sentence of feedback. Caught in production, it’s an incident, a rollback, a user who learned not to trust the thing, and an afternoon nobody got back. The reviewer that insists on a real run before merge isn’t slowing you down. It’s moving the catch from the most expensive place to the cheapest one. That trade only looks like overhead if you’ve never paid the production price, and everyone reading this has paid it.

The discipline has a name we use internally and it’s not glamorous: a missing test isn’t a missing feature, it’s a missing proof. The behavior might be perfect. You just haven’t shown it, and “I’m sure it’s fine” has shipped more outages than malice ever did.

The reviewer that runs the flow, not the test

The naive reviewer reads the diff. Ours runs the software.

Reading a diff tells you the code is plausible, that it compiles in your head, that the logic looks sound, that the author was careful. It cannot tell you the thing you actually need to know, which is what happens when the real system executes this change against a real input on a real surface. Those are different questions, and only one of them is the truth.

So the strongest reviewer doesn’t grade the description of the work. It reproduces the work. It takes the change, runs the genuine flow a user would trigger, the multi-step path, the one where step two has to remember what step one did, and watches what the system actually does. Pass means the flow ran and behaved. Not “the author says it behaves.” It behaved, and there’s a trace to prove it.

A claim of "it works" is interrogated, not trusted: the reviewer demands the real flow be run on a real surface, the run produces a trace instead of an opinion, and the verdict is either a pass or "not yet, you didn't test this," fed straight back to the writer.

This is the difference between a reviewer that can be talked into a yes and a reviewer that cannot. You can write a beautiful explanation that convinces a diff-reader. You cannot talk a flow into running green. The software either did the thing or it didn’t, and the reviewer that insists on watching it do the thing is immune to a persuasive author in a way no human under deadline pressure reliably is.

That immunity is the point. Humans get tired. Humans want to be kind. Humans have a deadline tonight and a writer who’s clearly worked hard and a quiet voice that says it’s probably fine, just merge it. The machine reviewer has no deadline, no fatigue, and no wish to be liked. At 2am on the night before a demo, when every human instinct is to wave the change through, it asks the same flat question it asked at noon: where is the flow that proves this? The best contributor to a codebase is the one who says “not yet, you didn’t test this,” and it never once gets tired of saying it.

What it costs, honestly

This isn’t free, and pretending it is would be its own kind of dishonesty.

It costs compute. Running a reviewer that reproduces the real flow, every time, on every change, burns more than a single agent that ships on green. We spend it deliberately. The expensive thing in software was never the typing, it was the regression that hid for a week, the looks done that wasn’t, the trust a user spends once and doesn’t refund. An adversary that runs in seconds is the cheapest place on earth to catch that, and we’d rather pay it there than in an incident channel.

What it buys is a property that’s genuinely hard to get another way: the output is trustworthy without a human grading every line. Not because the writers are flawless, they aren’t, and they don’t need to be. The trust doesn’t live in any one agent’s reliability. It lives in the structure, in the fact that nothing reaches the main branch without surviving an agent built to break it and an agent that refuses to accept a feeling as a result. That’s the same reason a serious open-source project can take code from a stranger: it doesn’t trust the stranger, it trusts the review.

And the writers keep getting better, which changes nothing about why this matters. A smarter writer produces more convincing claims. It does not produce more proven ones. The smarter the author, the more you need a reviewer that grades the run instead of the rhetoric.

The turn: a standard is something you build

Strip the agents away and what’s left is the oldest true thing about building anything well.

Quality was never a property of the people doing the work. It’s a property of the process that decides what’s allowed to count as done, and the best teams have always known the standard doesn’t live in anyone’s head, it lives in the review every change has to clear, the one that holds even on the night everyone’s exhausted and the deadline is now. That standard is the thing you can’t install from a package manager. You can buy faster writers and smarter models all day; neither one will, at 2am, look at brilliant work and say not yet.

We didn’t invent that discipline. We noticed it transfers. An agent can write code, an agent can attack code, and an agent can hold the line on what gets to merge, and when you give the holding job to a reviewer that never tires and never wants to be liked, the line holds on the worst night, which is the only night it ever mattered.

The thing that makes a codebase good was never the cleverest contribution. It was the rejection that kept the cleverest contribution honest. The day we let the reviewer get tired is the day the rot starts, and the whole reason to make that reviewer a machine is that a machine is never tired.

That’s what we’re building at Apollo Space: an operating system where agents don’t just do the work, they hold each other to a standard no one is too tired to defend. If you’ve ever merged a looks done that wasn’t and paid for it on a Friday afternoon, you already know why the hardest reviewer on the team should never be the one who wrote the code.

Apollo runs your company's repetitive ops so your team doesn't.

Join the waitlist for early access, founding-user pricing, and a front-row seat as we ship.

Join the waitlist