We build Apollo with a council of agents
Several agents write code. Two refuse to let bad work merge. One keeps them all honest. The hardest reviewer on the team is not human.
Apollo Space Research
Apollo Space
A writer agent finishes a feature, runs the tests, sees green, and writes the words every reviewer dreads: “looks done.” In most setups, that sentence is the end of the story: the change merges. In ours, it’s the start of an argument. A second agent that never touched the code pulls the branch, ignores the green tests, and tries to break the thing on purpose. A third agent reads both of them and decides who’s right.
That argument is how every line of Apollo gets built.
This post is about why we run it that way, and why the most useful idea we borrowed wasn’t from a research paper on agents. It was from the way a good open-source project says no.
The naive version: one agent, one prompt, one merge
The obvious way to build software with agents is the way the demos show it. You give one capable agent a task. It plans, it writes, it runs the tests, it reports back. If the tests pass, you ship.
It works astonishingly well for an afternoon. Then it doesn’t.
The failure isn’t that the agent writes bad code. The agents write good code. The failure is that the same mind that wrote the code is grading it, and a mind grading its own work is the least reliable judge there is. It wrote the test, so the test asserts what the code already does. It defined “done,” so “done” is whatever it happened to finish. It never imagined the input it didn’t handle, because if it had imagined it, it would have handled it. Green tests on a self-graded change tell you the author and the grader agreed. They don’t tell you the change is correct.
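A small sketch makes the self-grading trap concrete. Everything in it is hypothetical: `parse_quantity`, the rule that an order quantity must be positive, and the probe inputs are illustrations, not code from Apollo. The author's test passes because it checks only the cases the author already had in mind; a reader with no stake in the change tries the inputs the author never imagined.

```python
def parse_quantity(text: str) -> int:
    """Hypothetical writer's code: parse an order quantity from text."""
    return int(text)


def author_test() -> bool:
    # The author's test asserts what the code already does.
    return parse_quantity("3") == 3 and parse_quantity("10") == 10


def outsider_probe() -> list[str]:
    """Return inputs that were accepted even though, under the assumed
    rule that quantities must be positive, they should be rejected."""
    accepted_but_invalid = []
    for raw in ["0", "-1", "-250"]:
        if parse_quantity(raw) <= 0:
            accepted_but_invalid.append(raw)
    return accepted_but_invalid


if __name__ == "__main__":
    print("author's tests green:", author_test())
    print("accepted but invalid:", outsider_probe())
```

Both functions run against the same code. Only the second one says anything about inputs the author did not think of.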
A human team solved this problem a long time ago, and not with better engineers. With a second pair of eyes that has no stake in the work being done. Code review exists precisely because the author is the worst person to certify their own code. The fix for self-grading agents is the same fix: take the grading away from the one who did the work.
So we did. We gave the grading to agents whose entire job is to disbelieve the claim.
The council: writers, gatekeepers, and one master
Picture a great open-source codebase. There are contributors, lots of them, sending pull requests all day. There are maintainers, far fewer, who read those pull requests with a cold eye. And somewhere there’s the one person whose name is on the project, who keeps the whole thing coherent and occasionally says “not yet” to a change that everyone else liked.
We rebuilt that structure out of agents.
The writers are the contributors. Several run at once, each in its own isolated copy of the repository, each on its own task. They plan, they implement, they write their tests, they make the case that their change is ready. Parallelism is the point: a company codebase has more work than any single agent can hold in its head at once, so we fan it out.
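The fan-out can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of how Apollo isolates its writers: the post does not say how isolation is done, so a plain directory copy stands in for it, and `run_writer` and `fan_out` are hypothetical names. The writer's "work" is a placeholder file.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_writer(repo: Path, task: str, workspace_root: Path) -> Path:
    """Hypothetical writer: works only inside its own copy of the repo."""
    workspace = workspace_root / task
    shutil.copytree(repo, workspace)  # isolated copy for this writer
    # Stand-in for the agent's real work: record the task in the copy.
    (workspace / f"{task}.txt").write_text(f"change for {task}\n")
    return workspace


def fan_out(repo: Path, tasks: list[str]) -> list[Path]:
    """Run one writer per task in parallel; the source repo is never edited."""
    workspace_root = Path(tempfile.mkdtemp(prefix="writers-"))
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = [pool.submit(run_writer, repo, t, workspace_root) for t in tasks]
        # Results come back in task order, one workspace per writer.
        return [f.result() for f in futures]
```

Because each writer touches only its own copy, one writer's half-finished change cannot break another writer's tests. The changes meet only at review time.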
The gatekeepers are the maintainers, and there are deliberately few of them. One is an adversarial reviewer whose only instinct is to break the change, to find the input nobody tested, the path that was never exercised, the edge case the writer waved away. The other guards the standard: it refuses work that doesn’t carry real evidence. Not “the tests pass,” but which tests, against what, proving what behavior. A claim with no runnable proof behind it is, to this agent, the same as no claim at all.
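One way to picture the standards gatekeeper's rule is as a check on the shape of the evidence. The sketch below is an assumption-laden illustration: the `Evidence` and `Claim` types, their field names, and `standards_gate` are all hypothetical, chosen to mirror "which tests, against what, proving what behavior." A real gatekeeper would also have to judge whether the proof is any good, which a shape check cannot do.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    test: str      # which test ran
    target: str    # against what (commit, build, environment)
    behavior: str  # what behavior it proves
    trace: str     # runnable proof: command output, log, or trace


@dataclass
class Claim:
    summary: str
    evidence: list[Evidence] = field(default_factory=list)


def standards_gate(claim: Claim) -> list[str]:
    """Return the reasons to refuse; an empty list means the claim may proceed."""
    if not claim.evidence:
        return ["no evidence attached: a bare claim counts as no claim"]
    reasons = []
    for i, item in enumerate(claim.evidence):
        for name in ("test", "target", "behavior", "trace"):
            if not getattr(item, name).strip():
                reasons.append(f"evidence[{i}] is missing '{name}'")
    return reasons
```

Under this sketch, `Claim("the tests pass")` is refused outright, and a claim whose evidence names a test but carries no trace is refused with the missing field spelled out.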
The master is the maintainer-of-record. It reads the writer’s case and the gatekeepers’ objections, re-checks the claim against what the change is actually supposed to do, and makes the call. It is the agent that says “not yet, you did not test this,” and means it.
Several agents write code. Two refuse to let bad work merge. One keeps them all honest. The roles aren’t decoration. They’re the reason the output can be trusted without a human reading every line.
Why adversarial, not just “a second agent”
There’s a softer version of this idea that doesn’t work, and it’s worth naming, because it’s the one most people reach for first.
The soft version is: add a reviewer agent that double-checks the writer. Nice in theory. In practice, a reviewer with no adversarial mandate becomes a rubber stamp. Ask an agent “does this look correct?” and it is strongly inclined to agree: the change is plausible, the tests are green, the writer’s explanation is coherent, so sure, looks correct. You’ve added a second signature to the same blind spot. Two agents now agree the work is done. Neither one tried to prove it isn’t.
The fix is to change the question. Our gatekeepers are not asked “is this right?” They’re asked “how would you break this?” and “what would make this fail in front of a real user?” That reframe is the whole mechanism. An agent told to find the flaw goes looking for the flaw, and an agent looking for the flaw enumerates the hostile input, the missing test, the concurrency the writer ignored, the case where two things happen at once. The adversary isn’t being mean. It’s being the thing the author structurally cannot be: skeptical of work it has no pride in.
The other half of the standard is just as plain. A change doesn’t pass because it says it’s complete.
A claim is not a result.
“It works” is a feeling; “this exact flow ran end to end and here is the trace” is a result. The standards gatekeeper exists to keep that distinction sharp, to downgrade “looks done” back to “not done” every single time the evidence is a feeling wearing a checkmark.
The master agent is the one that says no
Here’s the part that surprises people: the most valuable agent on the team is the one that produces the least code.
The master writes almost nothing. Its whole contribution is judgment: the cold re-read, the cross-check, the refusal. When a writer reports success, the master treats that report as a claim to interrogate, not a fact to forward. It walks the change against what it was actually supposed to do, in order, and asks at each step: was this part proven, or just asserted? One missing proof and the change goes back, no matter how good the rest looks.
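The walk described above reduces to a short loop. This is a sketch under assumptions, not Apollo's master agent: `Verdict`, `master_review`, and the idea of representing proofs as a mapping from requirement to trace are all hypothetical, and the real judgment of whether a proof is convincing is left out. What the sketch does capture is the ordering and the rule that a single unproven step sends the change back.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    merge: bool
    reason: str


def master_review(requirements: list[str], proven: dict[str, str]) -> Verdict:
    """Walk the requirements in order; the first one with no proof
    behind it sends the whole change back."""
    for step in requirements:
        # An absent or blank trace means the step was asserted, not proven.
        if not proven.get(step, "").strip():
            return Verdict(False, f"not yet, you did not test this: {step}")
    return Verdict(True, "every step proven")


if __name__ == "__main__":
    steps = ["validates input", "writes the record", "emits the audit event"]
    proofs = {"validates input": "trace-1", "emits the audit event": "trace-3"}
    print(master_review(steps, proofs))
```

In the usage example, two of three steps carry a trace and the change is still refused, which is the point: the verdict depends on the weakest step, not on how good the rest looks.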
Anyone who has maintained a serious open-source project recognizes this person immediately. They’re the maintainer who reviews your beautiful pull request and replies with two words and a link: “not yet.” It is, in the moment, maddening. It is also the single reason the codebase didn’t rot. A project is only as healthy as the standard its maintainer is willing to defend when defending it is inconvenient.
That’s the role we couldn’t skip. The whole council collapses without it, because parallel writers and adversarial reviewers still need someone to resolve the argument, and that someone has to be unmovable about one thing.
The work is done when it’s done, not when it looks done.
The hardest reviewer on the team is not human, and that’s exactly why it never gets tired of saying no.
What this buys, and what it costs
The honest accounting matters, because this isn’t free.
It costs more. Running several writers, two gatekeepers, and a master on one change burns more compute than a single agent that ships on green. We pay it on purpose. The expensive thing in software was never the writing; it was the bug that reached a user, the regression that hid for a week, the “looks done” that wasn’t. Catching that before it merges is cheap compared to catching it in production, and an adversary that runs in seconds is the cheapest place to catch it.
What it buys is a property that’s hard to get any other way: the output is trustworthy without a human grading every line. Not because the writers are infallible (they aren’t), but because nothing they produce reaches the main branch without surviving an agent built to tear it apart and an agent built to refuse it. The trust lives in the structure, not in any one agent’s reliability. That’s the same reason a good open-source project can accept code from strangers: not because it trusts the strangers, but because it trusts the review.
The turn: a standard is a thing you build, not a thing you have
Strip away the agents and what’s left is an old truth about building anything well.
Quality is not a property of the people doing the work. It’s a property of the process that decides what’s allowed to count as done. The best engineering teams have always known this: the standard isn’t in anyone’s head, it’s in the review that every change has to pass, and it holds even on the day everyone’s tired and the deadline’s tonight. We didn’t invent that. We just noticed it transfers. An agent can write code, an agent can attack code, and an agent can hold the line on what merges, and when you arrange them the way a great codebase arranges its people, you get a team that’s hard on the work and easy on the speed.
The writers will keep getting faster. The model under them will keep getting smarter. None of that is the thing that makes the result good. The thing that makes it good is the agent that reads a brilliant change and still says “not yet, you did not test this.” The moment we stop letting it say that is the moment the whole thing starts to rot.
We build Apollo this way because this is what we’re building it for: an AI-native operating system where agents don’t just do the work, they hold each other to a standard. If you’ve ever shipped a “looks done” that wasn’t, you already know why the hardest reviewer on the team should never be the one who wrote the code.