Never let a model grade its own homework

A judge model scores its own family higher, so a same-family eval measures loyalty, not quality.

Apollo Space Research

Apollo Space

Run the same answer past two judge models and you can get two different verdicts, not because the answer changed, but because one judge was scoring a sibling. Swap the judge to a rival model family and the score that looked like quality quietly turns into something else. The answer never moved. Only the loyalty of the grader did.

That gap is the most expensive blind spot in agent evaluation, and almost nobody checks for it.

This post is about why that happens, why a green dashboard can be measuring exactly the wrong thing, and the small discipline that turns an eval back into a measurement instead of a vote.

The naive version: one model writes, the same model grades

The standard way to evaluate an AI system at scale is to use another AI to grade it. You can’t put a human on every output once you’re running thousands of conversations a day, so you reach for the obvious tool: a strong model, handed a rubric, asked to score each answer one to ten. This is the LLM-as-judge pattern, and it is genuinely useful. It is also where the trap is set.
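As a minimal sketch of the pattern, assume the judge is any callable that takes a prompt string and returns the model’s reply as text. The function names, the prompt wording, and the `SCORE:` reply format are illustrative assumptions, not a real provider API.

```python
import re
from typing import Callable


def build_judge_prompt(rubric: str, question: str, answer: str) -> str:
    """Assemble the grading prompt: rubric first, then the item under test."""
    return (
        "You are grading an answer against a rubric.\n"
        f"Rubric:\n{rubric}\n\n"
        f"Question:\n{question}\n\n"
        f"Answer:\n{answer}\n\n"
        "Reply with a single line: SCORE: <integer from 1 to 10>"
    )


def parse_score(reply: str) -> int:
    """Extract the 1-10 score; raise if the judge ignored the format."""
    match = re.search(r"SCORE:\s*(\d+)", reply)
    if match is None:
        raise ValueError("judge reply has no SCORE line")
    score = int(match.group(1))
    if not 1 <= score <= 10:
        raise ValueError(f"score {score} is outside 1..10")
    return score


def grade(judge: Callable[[str], str], rubric: str, question: str, answer: str) -> int:
    """Run one judge over one answer and return its score."""
    return parse_score(judge(build_judge_prompt(rubric, question, answer)))
```

Nothing in this loop knows which family the judge belongs to, which is why the trap described next is so easy to walk into.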

Here is how the trap springs, and it springs quietly. The team building the agent is usually standardized on one model family: it’s what the runtime uses, it’s what the API key is for, it’s what everyone already trusts. So the writer is, say, a model from family A. And when it comes time to grade, the nearest capable judge is also from family A, because that’s the key on the shelf. The writer and the grader are cousins.

It works beautifully right up until it doesn’t, and the failure is invisible from inside.

The scores come back high. The dashboard goes green. Nothing in the numbers tells you that the judge has a thumb on the scale: it rates answers written in its own family’s style, with its own family’s hedging and its own family’s idea of a good explanation, a notch higher than it should. The eval isn’t broken in any way you can see. It’s just no longer measuring what you think it’s measuring.

A green eval where the writer and the judge are cousins is a measurement of agreement, not a measurement of quality.

This isn’t a hypothetical we invented to make a point. The effect has a name in the research literature (self-preference, or self-enhancement, bias), and it has been measured. A widely cited 2024 study from researchers at NYU and Anthropic (“LLM Evaluators Recognize and Favor Their Own Generations,” Panickssery et al.) found that judge models give measurably higher scores to text they generated themselves, even when human annotators rated that text no better than the alternatives. The judge isn’t lying. It genuinely prefers its own voice. And that preference is exactly the thing a quality score is supposed to be free of.
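One rough way to check your own setup for this gap is to score the same answers twice, once with a same-family judge and once with a cross-family judge, and compare. The sketch below assumes you already have both score lists in matching order. The function name is hypothetical, and a positive gap is only a hint, since the cross-family judge could simply be a harsher grader.

```python
from statistics import mean


def self_preference_gap(same_family_scores, cross_family_scores):
    """Mean per-answer difference: same-family score minus cross-family score.

    Both lists must score the same answers in the same order.
    """
    if len(same_family_scores) != len(cross_family_scores):
        raise ValueError("score lists must cover the same answers")
    if not same_family_scores:
        raise ValueError("need at least one scored answer")
    return mean(s - c for s, c in zip(same_family_scores, cross_family_scores))


# Made-up scores for illustration only, not measured data.
gap = self_preference_gap([9, 8, 9], [7, 7, 8])
print(f"same-family judge scores higher by {gap:.2f} points on average")
```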

Two evaluation setups side by side: in the naive lane a writer and judge from the same model family produce a green score that secretly measures family loyalty, while in the honest lane a cross-family judge produces a lower but trustworthy score that measures quality.

Why a second model isn’t automatically a second opinion

The first instinct, once you’ve felt this pain, is reassuring and wrong. Fine, we’ll add a judge. You bring in a grading model, point it at the outputs, and feel safer because now a different process is doing the checking.

But a different process is not the same as a different perspective. If the judge shares the writer’s lineage, you haven’t added a second opinion. You’ve added a second signature on the same blind spot. The judge nods at the writer’s choices because they are, at a deep level, its own choices: the same tokenizer instincts, the same training data fingerprints, the same sense of what a confident, well-formed answer sounds like. Two cousins agreeing that the work is good is not evidence the work is good. It’s evidence they’re related.

The naive fix relocates the problem instead of solving it. You feel evaluated. You are not actually evaluated.

The mechanism is worth saying plainly, because once you see it you can’t unsee it. A judge model is not a neutral instrument like a ruler. It is a participant with a style, and it scores conformity to that style as if conformity were correctness. When the thing being graded conforms, because it was written by a relative, the score goes up for a reason that has nothing to do with whether a real user was served. The number is real. What it counts is wrong.

So the real fix is not “add a judge.” It’s “make the judge a stranger.” Take the grading away from the family that did the writing, exactly the way you’d never let a student grade their own exam, and never let their lab partner grade it either.

Our way: the judge is a stranger, and we cross-check the strangers

The rule we hold is short. The model that grades an answer must not be from the same family as the model that wrote it. Different lineage, different training, different instincts: a grader with no relative in the room.
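The rule is simple enough to enforce in code rather than by convention. Here is a minimal sketch, assuming each model is tagged with a family label that you maintain yourself. The `Model` type, the function names, and the example names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    family: str  # assumed label, e.g. the lineage the model comes from


def eligible_judges(writer: Model, candidates: list[Model]) -> list[Model]:
    """Keep only candidates that do not share the writer's family."""
    return [m for m in candidates if m.family != writer.family]


def assert_cross_family(writer: Model, judge: Model) -> None:
    """Fail loudly before any grading call if the judge is a relative."""
    if judge.family == writer.family:
        raise ValueError(
            f"judge {judge.name!r} shares family {judge.family!r} with the writer"
        )


writer = Model("writer-a1", "A")
pool = [Model("judge-a2", "A"), Model("judge-b1", "B"), Model("judge-c1", "C")]
print([m.name for m in eligible_judges(writer, pool)])
```

Raising an error, rather than logging a warning, keeps a same-family pairing from slipping through on the day the cross-family key is not at hand.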

That single constraint changes the meaning of every score that comes out. When a cross-family judge rates an answer highly, the rating can’t be loyalty, because the judge has no loyalty to spend. It isn’t recognizing its own voice; it’s reading the answer cold, the way a real user from outside the system would. The number gets lower and more honest at the same time. We’d rather have a true seven than a flattering nine, because the nine was never ours to keep.

But one stranger has its own taste, so we don’t stop at one. A single cross-family judge trades the writer’s bias for the judge’s bias: it might systematically dislike a style that’s actually fine, or over-reward verbosity that a real user would find tiring. The fix for that is the same fix one more time: more than one independent grader, from different families, and you look at where they agree. Consensus across unrelated judges is a far stronger signal than enthusiasm from one. When three strangers who share no lineage all mark the same answer down, that answer has a real problem, not a stylistic one.
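A sketch of that cross-check, assuming each judge has already returned a score and that scores are keyed by judge family so no family votes twice. The median as the consensus score and the `max_spread` threshold of 2 are illustrative choices, not values taken from this post.

```python
from statistics import median


def consensus(scores_by_family: dict[str, int], writer_family: str,
              max_spread: int = 2) -> dict:
    """Combine scores from unrelated judges into one score plus an agreement flag."""
    if writer_family in scores_by_family:
        raise ValueError("a judge from the writer's family is on the panel")
    if len(scores_by_family) < 2:
        raise ValueError("consensus needs at least two independent judges")
    values = list(scores_by_family.values())
    spread = max(values) - min(values)
    return {
        "score": median(values),         # robust to one judge with odd taste
        "spread": spread,                # how far apart the strangers are
        "agreed": spread <= max_spread,  # disagreement is worth a human look
    }


# Made-up scores for illustration only.
print(consensus({"B": 4, "C": 5, "D": 4}, writer_family="A"))
```

When `agreed` is false, the disagreement itself is the useful output: it marks an answer where one judge’s taste may be doing the scoring.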

And the strangers grade against the rubric, not against vibes. “Does this look good?” invites the judge to fall back on style, which is precisely the channel the bias travels through. So the question we hand the judge is specific: did this answer do the thing the user actually needed, against this checklist, with this evidence? A judge pinned to a concrete rubric has far less room to reward conformity, because the rubric, not the judge’s taste, is doing the measuring.
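One way to pin a judge to a rubric is to ask for a pass or fail on each checklist item, with quoted evidence, and compute the score yourself instead of accepting a free-form number. This is a sketch under that assumption. The `Check` type and the rule that a pass without evidence is rejected are hypothetical design choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    criterion: str  # one concrete rubric item
    passed: bool    # the judge's verdict on that item
    evidence: str   # text from the answer that supports the verdict


def rubric_score(checks: list[Check]) -> float:
    """Fraction of rubric items passed; a pass with no evidence is rejected."""
    if not checks:
        raise ValueError("rubric has no checks")
    unsupported = [c.criterion for c in checks if c.passed and not c.evidence.strip()]
    if unsupported:
        raise ValueError(f"passes without evidence: {unsupported}")
    return sum(c.passed for c in checks) / len(checks)
```

Because the score is computed from the checklist, a judge that simply likes the prose has no field in which to say so.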

A grading pipeline where one answer fans out to three independent judges from different model families, none of them the writer's family, and only their consensus against a fixed rubric becomes the trusted score.

There’s one more discipline that sits underneath all of it, and it’s the oldest one in evaluation: the judge never edits the work it grades. A grader that can rewrite the answer to match its own taste has stopped measuring and started authoring, and now the bias isn’t just in the score; it’s baked into the artifact. Grading is read-only. The judge reports a verdict and evidence; it does not touch the thing under test. Measurement and authorship are kept in separate hands on purpose, because the moment they merge, the eval becomes a mirror.
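The read-only rule can be expressed in types. In this sketch the answer is an immutable value and the only thing a judge can hand back is a verdict. The `Answer` and `Verdict` types and the `grade_read_only` helper are hypothetical, and freezing a dataclass only guards against in-process mutation, so treat it as an illustration of the separation rather than a security boundary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Answer:
    text: str  # the artifact under test; attribute assignment raises an error


@dataclass(frozen=True)
class Verdict:
    score: int
    evidence: str


def grade_read_only(judge: Callable[[Answer], Verdict], answer: Answer) -> Verdict:
    """Run a judge that can read the answer but can only return a verdict."""
    verdict = judge(answer)
    if not isinstance(verdict, Verdict):
        # A judge that returns rewritten text has started authoring.
        raise TypeError("judge must return a Verdict, not a revised answer")
    return verdict
```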

What this costs, and why we pay it

None of this is free, and pretending otherwise would be its own kind of dishonesty.

It costs more calls. Grading one answer with three independent judges instead of one cousin is roughly three times the grading compute per evaluation, and across a large corpus of test flows that adds up. It also costs us comfort: cross-family scores tend to come back lower than same-family scores, and a lower number is a harder thing to show a stakeholder than a green one. We pay both costs on purpose, because the alternative is paying a much larger one later.

The expensive failure in evaluation was never the extra grading call. It was the agent that shipped on a flattering score, met a real user, and fell apart in a way the eval had been structurally unable to see. A same-family eval is the cheapest possible way to feel safe and the most expensive possible way to actually be wrong, because it fails silently and it fails confidently. A cross-family panel that runs in seconds and tells you an uncomfortable truth is the cheapest place to catch that, long before the user does.

What we buy is a score we can act on. Not a vote of confidence from a relative, but a measurement from a room full of strangers who have no reason to flatter the work. When that score moves, it moved because the quality moved.

The turn: an eval is a culture, not a config

Forget the model families for a second: what’s underneath is an old idea about honesty that has nothing to do with AI.

The reason we don’t let students grade their own exams isn’t that students are dishonest. It’s that no one is a neutral judge of their own work, and the closer the grader sits to the work, the worse their judgment of it gets: not from malice, but from proximity. Every team that has ever built anything worth trusting eventually learns to put distance between the maker and the verdict. A second pair of eyes that has no stake. A reviewer who didn’t write the line. An examiner who never met the student. We didn’t discover this. We just noticed it transfers, cleanly, to a world where the makers and the graders are both models.

The hard part was never wiring up a judge. The hard part is the willingness to take the lower, truer number, to look at a green dashboard and ask the uncomfortable question of whether the judge was a stranger or a cousin, and to keep asking it on the day you most want the score to be real. A model will always prefer its own voice. The discipline is to never let it grade in a room full of relatives, and to mean it even when the honest number is the one you didn’t want.


That’s what we’re building at Apollo Space: an operating system where the agents do the work and the verdict on that work is kept honest by design, not by hope. If you’ve ever shipped something on a score that turned out to be your own reflection smiling back, you already know why the grader should be the one person in the room with no reason to like you.
