
Your eval set is a museum. It should be a drain.

A fixed test set rots the moment you start passing it. The fix is a flywheel: every real conversation that went wrong becomes a new test, so you can never overfit to a problem you already solved.


Apollo Space Research


A team writes two hundred test cases for their agent, runs them, gets a green dashboard, and ships. Three weeks later a customer asks the agent something none of the two hundred covered, the agent fumbles it, and the dashboard is still green. It was green the whole time. It will stay green while the product slowly stops working, because the test set stopped learning the day it was written.

That is the quiet way every eval suite dies. Not in a red build, in a green one that no longer means anything.

The fix is not a bigger test set. It’s a test set that grows from the work. Every conversation that went wrong should leave behind a test, so the eval can never overfit to a problem you already solved. This post is about how to build an eval that drains real failures out of production and turns each one into a permanent question you can never get wrong twice.

The naive eval is a museum, and museums don’t move

The obvious way to evaluate an agent is the way you’d evaluate a student: write the exam once, grade against it forever. You sit down at the start of the project, imagine the things users will ask, write a few hundred cases, and from then on a passing run means “the agent does what we imagined.”

It works beautifully for exactly as long as your imagination matched reality. Which is to say, for a week.

Here is the failure, and it’s structural, not a matter of laziness. The cases you wrote are the cases you already understood. You cannot write a test for a failure you haven’t imagined yet: if you’d imagined it, you’d have handled it, and there’d be nothing to test. So a fixed eval set measures one thing precisely: your foresight on the day you wrote it. It says nothing about the input that shows up next Tuesday, the phrasing no one on the team would ever use, or the edge of the product where the real users live. The set is a museum. The exhibits are the failures you saw coming. The ones that hurt you are the ones that walked in after closing.

And it gets worse, because you optimize against it. Every fix you make is aimed at the test set, so the set gets easier to pass with every sprint. Greenness climbs while the thing the greenness was supposed to predict (does it work for a real person?) drifts away underneath. You are not measuring quality anymore. You are measuring how well you’ve memorized your own exam.

The bottleneck never disappears. It just moves from “can we pass the test” to “is the test still about anything.”

The flywheel: a failure becomes a test, automatically

So we stopped treating the eval set as a thing you write and started treating it as a thing that fills itself.

The naive version says: a test is something an engineer authors. Our version says: a test is something a bad conversation deposits. When a real interaction goes wrong (the agent picks the wrong tool, forgets what it was told a minute ago, answers a question in the user’s language with an action in the wrong one, or claims it did something it didn’t), that conversation doesn’t just get logged and forgotten. It gets minted into a flow: a named, replayable case that reproduces exactly what went wrong, parked permanently in the corpus.

The mechanism is simple to state, and that’s the point. A real failure walks in. We capture the shape of it: the input, the context, what the right answer was, and what the agent did instead. We turn that shape into a case that runs against the real agent runtime, not a mock. From then on, that exact failure is a question the agent has to answer correctly forever. Fix it once, and the test stands guard so you can never quietly break it again.
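As a rough sketch of what minting could look like: the `Flow` fields, the `mint_flow` helper, and the JSON Lines serialization below are hypothetical names and choices made for illustration, not a description of any real schema. The only thing taken from the text is the shape being captured: the input, the context, the right answer, and what the agent did instead.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Flow:
    """One production failure captured as a replayable case (hypothetical schema)."""
    name: str        # stable identifier for the case
    user_input: str  # what the user actually typed
    context: dict    # state the agent had when it failed
    expected: str    # what the right answer was
    observed: str    # what the agent did instead


def mint_flow(corpus: list, name: str, user_input: str, context: dict,
              expected: str, observed: str) -> Flow:
    """Append a failed conversation to the corpus. Names must stay unique."""
    if any(f.name == name for f in corpus):
        raise ValueError(f"flow {name!r} already exists")
    flow = Flow(name, user_input, context, expected, observed)
    corpus.append(flow)
    return flow


def dump_corpus(corpus: list) -> str:
    """Serialize as JSON Lines so the corpus can be stored and diffed as text."""
    return "\n".join(json.dumps(asdict(f), sort_keys=True) for f in corpus)
```

Refusing duplicate names is a deliberate choice in this sketch: a flow is meant to be a permanent, addressable case, so a second failure on the same input should become a new named flow, not overwrite the old one.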


Figure: On the left, a fixed eval set written once: a sealed box of imagined cases that stays green while real failures route around it untouched. On the right, a flywheel: a real conversation goes wrong, the failure is minted into a named flow, the corpus grows, the agent is graded against it forever, and the next failure feeds back in.

The shape on the right is the whole idea. A conversation goes wrong. The failure becomes a flow. The corpus grows by one. The agent is now graded against that flow on every run, forever. And because the loop never closes, the corpus tracks reality instead of tracking your memory of reality. You cannot overfit to it, because it is fed by the exact thing overfitting hides from you: the failures you didn’t see coming.

This is the move that flips an eval from a museum into a drain. A museum displays what you chose to collect. A drain catches whatever actually flows through. You want the drain.

“Burro” is a missing test, not a missing feature

There’s a particular kind of complaint that a fixed eval set teaches you to ignore, and it’s the most valuable signal you have.

Picture a user typing the conversational equivalent of “this thing is dumb.” They asked for something reasonable and got something useless. On a fixed-eval team, that complaint has nowhere to go. It’s not in the two hundred cases. It doesn’t move the dashboard. So it becomes a Slack message, then a vague feeling that “the agent’s been a bit off lately,” then a lament in a retro that everyone nods at and no one can act on. The signal was real. It just had no shape, so it evaporated.

The naive instinct is to read that complaint as a missing feature (“we need to handle this category better”) and add it to a backlog, where it competes with everything else and loses.

The reframe is sharper: it’s not a missing feature, it’s a missing test. The user just handed you, for free, a real input that produced a real failure. That’s not a feeling to absorb; it’s a case to mint. You take the conversation, capture what went wrong, and drop it into the corpus as a located gap: not “the agent feels dumb” but “this exact flow, on this exact input, produced this exact wrong answer, and here is where it broke.”

The difference is enormous. A feeling is something you argue about in a meeting. A located gap is something you fix and then guard. One re-diagnoses the same vague malaise every week. The other turns a single angry message into a permanent improvement that can never silently regress. The complaint stops being noise and becomes the most honest test case you own, because no engineer’s imagination produced it; a real user’s real frustration did.
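One way to make “here is where it broke” concrete is to compare the steps the agent should have taken with the steps it actually took, and record the first point where they diverge. This is a minimal sketch under assumptions: `LocatedGap`, `locate_gap`, and the idea of representing a conversation as a flat list of step labels are all hypothetical, chosen only to illustrate the difference between a feeling and an address.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LocatedGap:
    flow_name: str
    user_input: str
    step: int                # index of the first step that diverged
    expected: Optional[str]  # None if the agent took an extra step
    observed: Optional[str]  # None if the agent stopped early


def locate_gap(flow_name, user_input, expected_steps, observed_steps):
    """Return the first divergence between the two traces, or None if they match."""
    longest = max(len(expected_steps), len(observed_steps))
    for i in range(longest):
        exp = expected_steps[i] if i < len(expected_steps) else None
        obs = observed_steps[i] if i < len(observed_steps) else None
        if exp != obs:
            return LocatedGap(flow_name, user_input, i, exp, obs)
    return None
```

Called with an expected trace of `["classify:billing", "tool:find_invoice", "reply"]` and an observed trace that reached for a different tool at the second step, this returns a gap at step index 1 naming both tools, which is something you can fix and then guard.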

A red case is not a lament. It’s an address.

Two numbers, never one: the fast proxy and the truth

Here’s where a growing eval set runs into a hard limit, and it’s worth being honest about it instead of papering over it.

You want the loop to run fast. A failure should become a test, the test should run in seconds, you should be able to fan out dozens of them in parallel, A/B them across model sizes, and get an answer back before you’ve lost the thread. That speed is what makes the flywheel spin. So you build a local harness: the real agent runtime, in-process, replaying flows against the corpus, printing what it classified, which tools it reached for, what it replied, and what it remembered. Fast, cheap, repeatable. The fast proxy.
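The fan-out and A/B parts of that harness can be sketched in a few lines. Everything here is an assumption made for illustration: `agent` is a stand-in callable for the real in-process runtime, a flow is reduced to a `(name, user_input, expected)` tuple, and grading is plain string equality, which a real harness would almost certainly replace with something richer.

```python
from concurrent.futures import ThreadPoolExecutor


def replay_corpus(agent, model, flows, max_workers=8):
    """Replay every flow against `agent` in parallel; return {flow_name: passed}."""
    def run_one(flow):
        name, user_input, expected = flow
        return name, agent(model, user_input) == expected

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, flows))


def ab_compare(agent, models, flows):
    """Inner pass rate per model, measured over the same corpus."""
    rates = {}
    for model in models:
        results = replay_corpus(agent, model, flows)
        rates[model] = sum(results.values()) / len(results) if results else 0.0
    return rates
```

Threads are used here on the assumption that each replay spends most of its time waiting on a model call. Because every model is scored on the identical list of flows, the rates from `ab_compare` are directly comparable.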

The naive move is to call green on the fast proxy “done.” It isn’t, and pretending it is recreates the original museum problem one level up. A local harness is still a controlled environment. It is your best guess at reality, run quickly, but it is a guess. The thing that’s actually true is whether a real person, on the real deployed product, clicking the real surface, got a result that worked.

So you report two numbers and you never merge them. The inner number is the fast proxy: how the agent does against the growing corpus, locally, in seconds. The outer number is the truth: how it does when a human uses the shipped thing. Inner-green means behaviorally promising, pending the real surface. Outer-green means it actually worked for someone. The moment you collapse those two into a single “it works,” you’ve told yourself a comforting lie: the same lie the green museum dashboard told, just faster.
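A small sketch of how that discipline could be enforced in code instead of by convention. The `EvalReport` type, its field names, and the default threshold of 1.0 are hypothetical; the point is only that the two numbers live in separate fields and there is deliberately no method that averages them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvalReport:
    inner_pass_rate: float                   # fast proxy: local corpus replay
    outer_pass_rate: Optional[float] = None  # truth: real surface; None until checked

    def status(self, threshold: float = 1.0) -> str:
        """Inner-green only earns the right to check outer; it never replaces it."""
        if self.inner_pass_rate < threshold:
            return "inner-red: fix locally before touching the real surface"
        if self.outer_pass_rate is None:
            return "inner-green: behaviorally promising, pending the real surface"
        if self.outer_pass_rate < threshold:
            return "outer-red: passed the proxy, failed on the shipped product"
        return "outer-green: worked on the shipped product"

    def __float__(self):
        # Assumption of this sketch: collapsing to one number is an error.
        raise TypeError("inner and outer pass rates must be reported separately")
```

With this shape, `EvalReport(1.0).status()` can only ever say “pending the real surface”; nothing short of an outer measurement produces an outer-green.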

Figure: Two readings of the same agent change. The fast inner loop replays the growing corpus locally in seconds and reports an inner pass rate, a quick proxy. The slow outer loop runs the real deployed surface a human can touch and reports an outer pass rate, the truth. The arrow flows one way: inner-green only earns the right to check outer; it never replaces it.

Both numbers matter, and they matter differently. The inner number is how you move fast without flying blind: it catches the regression in seconds, on your machine, before it ever reaches a person. The outer number is how you stay honest: it’s the only one that’s allowed to say “this is real.” Keeping them separate is the discipline that keeps a fast loop from quietly turning back into a museum that just runs more often.

The corpus is the asset, not the agent

Step back and notice what you’re actually building when you run the loop this way.

Every team thinks the agent is the valuable thing: the prompts, the tools, the orchestration. They’re not nothing. But they’re replaceable. The model underneath will change. The prompt will get rewritten. The tool layer will get swapped. None of that is the moat.

The moat is the corpus: every failure your product ever hit, minted into a case, replayable against whatever agent comes next. That set is a structured record of every way your specific product, in your specific domain, has been gotten wrong by real people. When a new model arrives, you run it against the corpus and learn, in an afternoon, whether it’s actually better at your job or just better on someone’s benchmark. If you rebuild the agent from scratch, the corpus tells you instantly which of the old failures came back. The agent is the candidate. The corpus is the interview, and it’s an interview that gets harder and more honest every single week, because it’s fed by the only source that can’t be gamed: what actually went wrong.
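Finding out “which of the old failures came back” reduces to a diff between two runs over the same corpus. The sketch below assumes each run has already been reduced to a mapping from flow name to pass/fail; `compare_runs` and its bucket names are hypothetical.

```python
def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Diff two runs over the same corpus. Each maps flow name -> passed (bool)."""
    shared = baseline.keys() & candidate.keys()
    return {
        # Guarded failures that the new agent reintroduced.
        "regressed": sorted(n for n in shared if baseline[n] and not candidate[n]),
        # Cases the old agent failed and the new one passes.
        "fixed": sorted(n for n in shared if not baseline[n] and candidate[n]),
        "still_failing": sorted(
            n for n in shared if not baseline[n] and not candidate[n]
        ),
        # Flows scored in only one run; the comparison says nothing about them.
        "unscored": sorted(baseline.keys() ^ candidate.keys()),
    }
```

An empty `regressed` list is the only sense in which a candidate agent can be said to have passed the interview; a higher aggregate pass rate with a non-empty `regressed` list means it traded old failures for new wins.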

That’s the asset that compounds. Imagine two teams shipping the same agent on day one. One writes a fixed test set and freezes it. The other drains every failure into a growing corpus. Six months later (say, a thousand real conversations in), the first team’s set still measures their day-one imagination. The second team’s set measures every real way the product failed across those thousand conversations. They are not the same company anymore, and the difference is not the model. It’s that one of them kept learning from being wrong.

The turn: the cost of a complaint that goes nowhere

Strip away the harness and the corpus and the two numbers, and what’s left is older than any of this.

Every company already has the flywheel’s raw material. It arrives every day, for free, in the form of users telling you exactly where the product let them down. The question was never whether you have the signal. You’re drowning in it. The question is whether anything catches it: whether a frustrated message becomes a permanent test, or becomes a feeling that fades by Friday.

A complaint that goes nowhere is the most expensive thing in a software company, because you paid the full price (a user’s trust) and got nothing back. No fix that holds, no test that guards, no lesson the system can’t forget. You absorbed the damage and discarded the data. The flywheel is, underneath all the machinery, just the decision to stop doing that: to treat every moment the product disappointed someone as a gift you refuse to waste.

That’s the part that isn’t about evals at all. It’s about whether your company gets smarter from being wrong, or just gets tired of it. The teams that win the next decade won’t be the ones with the cleverest agents. They’ll be the ones whose mistakes can only happen once.


That’s what we’re building at Apollo Space: not a test set you write and freeze, but one that fills itself from every conversation that didn’t go the way it should have, so the system is always graded on what actually went wrong. If you’ve ever watched a green dashboard reassure you while the product quietly stopped working, you already know why your eval set should be a drain, not a museum.
