Engineering

Three skeptics beat one judge

A single reviewer asked 'is this right?' nods along to the plausible-but-wrong. Three independent reviewers each told to prove it wrong, kill on a majority, catch what one judge waves through.

ASR

Apollo Space Research

Apollo Space

February 2, 2026 · 10 min read

Ask one careful reviewer a single question, “does this finding look right?”, and watch what happens to the answer that is plausible and wrong. The reasoning is fluent. The numbers add up to something. The conclusion is the kind of thing that is usually true. So the reviewer nods. Not because it checked, but because nothing about the claim raised a flag, and “looks right” is the path of least resistance when a claim is built to look right.

That nod is where most bad findings get through. Not the obvious wrong ones, those get caught. The dangerous ones are the wrong answers wearing the costume of a right one.

Three skeptics beat one judge. A single reviewer asked to confirm a finding will confirm the confident lie; three independent reviewers each told to refute it, killing the finding on a majority vote, will not. This post is about why we stopped asking agents “is this right?” and started asking them to prove each other wrong.

The naive version: one grader, one thumbs-up

The obvious way to check an agent’s work is to add a checker. The agent produces a finding, a fact pulled from a document, a conclusion drawn from data, a claim that something is true, and a second agent reads it and grades it. Pass or fail. One in, one out. It feels like rigor. You added a step. There is now a reviewer between the work and the world.

It works on the answers that don’t need checking. A grossly wrong claim gets caught because it’s grossly wrong. The reviewer reads “the contract renews in 2019” against a document dated this year and rejects it. Good. That’s the easy half.

The hard half is the claim that is wrong in a way that reads as right. Suppose an agent researches a market and concludes a competitor “shut down its enterprise tier last quarter.” The phrasing is crisp. The reasoning cites a real announcement. The conclusion is the kind of thing that happens. The only problem is that the announcement was about a pricing change, not a shutdown, and the agent compressed the distinction away. Now ask a single grader: “is this finding correct?” The grader is in exactly the position you are when a confident colleague states something firmly in a meeting, the claim is coherent, the source is real, pushing back is work, and agreement is free. So it agrees.

The bottleneck isn’t that the grader is dumb. It’s the question. “Is this right?” is a question whose easiest honest-sounding answer is yes. You have built a reviewer whose default is to approve, and then handed it the one class of error specifically engineered to survive approval.

So a single confirming grader gives you a number that feels like safety and isn’t. It catches the errors you’d have caught anyway and waves through the ones you wouldn’t.

Why “is this right?” is the wrong question

There’s a softer failure hiding inside the obvious one, and it’s worth naming because it’s the trap everyone walks into second.

When the single grader doesn’t work, the instinct is to make it smarter. A better model. A longer prompt. More context. And it does help, a little, a sharper grader catches a few more of the costumed wrongs. But it’s still answering “is this right?”, and that question has a gravity that no amount of intelligence escapes. A reviewer asked to confirm is looking for reasons to confirm. It reads the claim charitably, because charity is what “confirm” rewards. It fills small gaps with assumptions that favor the answer, because the answer is the thing it was pointed at. Smart or not, it’s searching the half of the space where the finding survives.

The bottleneck never disappears. It just moves to a more expensive grader that’s still nodding.

The fix isn’t a better confirmer. It’s to invert the question. Don’t ask “is this right?” Ask “how is this wrong?” An agent told to confirm goes looking for confirmation and finds it. An agent told to refute goes looking for the flaw, and an agent looking for the flaw reads the source again, notices “pricing change” where the finding said “shutdown,” checks the date the claim quietly skipped, and tests the one input that breaks it. Same model. Opposite gravity. The refuter is searching the half of the space where the finding dies, which is exactly the half a confirmer never visits.

This is the same move a good newsroom makes when it tells a second reporter to kill the story rather than check the story. “Check it” invites a skim. “Kill it” invites an investigation. The wording isn’t a mood. It’s the search direction.

A confirming reviewer asked is-this-right searches only the region where the finding survives and returns a thumbs-up, while a refuting reviewer asked how-is-this-wrong searches the region where the finding breaks and surfaces the flaw the confirmer never looked for.

One refuter is better, and still not enough

Flip the question and you’ve fixed the gravity. You haven’t yet fixed the dice.

A single refuter is one agent on one run. It has a blind spot, every model does, and a finding that happens to sit inside that blind spot survives, not because it’s true but because this particular skeptic, on this particular pass, didn’t see the seam. You’ve traded a reviewer that’s biased toward yes for a reviewer that’s correct most of the time and silently, occasionally, wrong. Better odds. Still a coin, just a fairer one.

And there’s a quieter failure: a lone refuter can be too eager. Told to find a flaw, it sometimes manufactures one, a nitpick dressed as a defect, a “this could be wrong if you assume X” where nobody assumed X. One skeptic gives you no way to tell a real kill from an overreach. Its objection is the only vote, so its objection wins.

The fix for a noisy judge is the fix juries have used for centuries: don’t ask one. Ask several, independently, and let them vote.

Independence is the load-bearing word. Three refuters that share the same context, the same prior steps, the same framing, will share the same blind spot, and three agents with one blind spot are just one agent that costs three times as much. So each skeptic gets the finding cold. No access to the others’ verdicts. No shared scratchpad where the first one’s doubt anchors the next. Different framings of the same refute-it task, so they probe from different angles. Each one asked, alone: here is a claim, and your job is to prove it false.

Then you count. If a clear majority of independent skeptics each found an honest reason the finding is wrong, the finding dies. If they each tried to break it and mostly couldn’t, it lives, not because it “looked right,” but because it survived a panel built to take it down. A single overeager refuter’s manufactured nitpick gets outvoted by the two who found nothing real. A genuine defect that one skeptic missed gets caught by the two who saw it. The vote launders out both the blind spot and the overreach, because a flaw real enough to matter is a flaw most independent skeptics find, and a “flaw” only one of them sees is usually not one.

That’s the whole mechanism. Refute-by-default kills the confirmation bias. Independence kills the shared blind spot. The majority vote kills the noise. Each layer covers the failure the layer before it leaves open.

A finding enters and fans out to three independent skeptics, each given the claim cold and told to refute it; their verdicts converge on a majority vote that kills the finding only when most skeptics independently found it false, otherwise the finding passes.

Why majority, and not unanimity

It’s tempting to set the bar at unanimity, let any skeptic veto, kill the finding if even one objects. It sounds safer. It’s the opposite.

Unanimity hands the whole decision to your most error-prone juror. The single skeptic that manufactures a nitpick now has a veto, so every overreach becomes a kill, and good findings die at the rate of your noisiest reviewer’s bad days. Push the bar all the way up and you don’t get rigor, you get a panel that rejects almost everything, which is the same as having no panel, because a verifier that fails everything carries no information. A reviewer that always says no is exactly as useless as one that always says yes.

Majority is the setting that uses the independence you paid for. It treats each skeptic’s objection as evidence, not as a trigger, and asks whether the evidence converges. One dissent in three is noise, the system notes it and moves on. Two in three is a signal, the system stops. The threshold is a dial you turn toward the cost of the mistake: where a wrong finding is merely embarrassing, a simple majority is plenty; where a wrong finding moves real money or touches a customer, you raise the bar and add skeptics, because the price of a false pass is high enough to pay for more dice.

The point is that the threshold is a choice you make on purpose, with the stakes in front of you, not an accident of how agreeable a single grader happened to be that morning.

The turn: the standard is in the structure, not the genius

Strip the agents away and what’s left is an old truth about getting to the truth.

You don’t reach a reliable verdict by finding one infallible judge. There is no such judge, not a person, not a model, not the next one either. You reach it by building a process whose structure is hard to fool: by pointing reviewers at the flaw instead of the finding, by keeping them independent so they don’t share a blind spot, by counting their honest objections instead of trusting any single voice. Courts learned this. Science learned it, a result isn’t true because one brilliant lab said so, it’s trusted because independent skeptics tried to knock it down and mostly couldn’t. The reliability was never in the genius of the reviewer. It was in the adversarial, independent, counted shape of the review.

That’s the part that matters for your company, whether or not an agent is anywhere near it. The findings your business runs on, what a customer wants, why a number moved, whether a thing is safe to ship, are mostly checked today by one tired person asking one easy question: does this look right? And the answers that look right are precisely the ones that don’t get a second look. The cost isn’t the obvious mistake. It’s the confident wrong one that nobody was structurally able to catch, because the only check in its path was built to nod.

Three skeptics beat one judge. Not because three minds are smarter than one, but because three minds told to disagree, kept apart, and counted are harder to fool than any single mind told to agree.

That’s one of the loops we’re building at Apollo Space, verification that earns its trust from the shape of the argument, not from any single reviewer’s confidence. If you’ve ever signed off on something that looked right and bit you later, you already know why the question should never be “is this right?”

Apollo runs your company's repetitive ops so your team doesn't.

Join the waitlist for early access, founding-user pricing, and a front-row seat as we ship.

Join the waitlist