Engineering

An eval is the only honest definition of "it works"

"It works" is a feeling; a flow that runs the real agent is a result.

ASR

Apollo Space Research

Apollo Space

· 12 min read

Someone writes a tiny test for an AI agent: a scripted chat where a made-up customer says “I’m José from accounting, who am I?” and the agent answers “you’re José from accounting.” The test goes green. The dashboard goes green. And in a standup, someone says the words that should make any team lead’s stomach drop: memory works now.

It doesn’t. What just happened is that a string went in and the same string came back. No real customer, no real session, no real recall under pressure, a puppet show graded by the puppeteer. But the checkmark is bright green, and green is contagious. By Friday “memory works” has become a fact that three other features quietly depend on.

That single green checkmark is the most dangerous artifact in agent engineering. This post is about why, and about the only thing we’ve found that replaces it.

“It works” is a feeling; a flow that runs the real agent is a result.

The naive version: green is the goal

The obvious way to know your agent works is the way every dashboard trains you to think. You write tests. They pass. The bar fills up. Green means done.

This is fine, genuinely fine, for code that does one deterministic thing. A function that adds two numbers either adds them or doesn’t, and a passing test is a real proof, because there is exactly one right answer and the test checks for it.

An agent is not that. An agent decides. It reads an ambiguous request, picks a tool, recalls something from three turns ago, chooses what to say, and changes its mind based on what the last response was. There is no single right answer to assert against, there’s a behavior, unfolding over a conversation, that’s either good or bad in context. So when a team reaches for the only tool they know, the deterministic test, they end up testing the only thing they can pin down: a fixed string going in and a fixed string coming out.

And that’s the trap. The test you can write for an agent is almost never the test that matters.

A passing test on a scripted conversation proves the script ran. It says nothing about whether the agent can think.

The José-from-accounting test is the perfect specimen. It looks like a memory test. It has the shape of a memory test. But it never exercises memory, it feeds the answer in as the question, so even an agent with no memory at all sails through. The team didn’t lie. They wrote a real test, it really passed, and the thing it proved was worthless. That’s the failure mode, and it doesn’t come from carelessness. It comes from treating “green” as the goal when green was only ever a proxy.

On one side, a self-graded feeling, an agent reports it works, one happy-path demo, ship it. On the other, a result, a flow in the corpus runs the real agent and produces a trace with a pass or fail. The arrow between them is the gap an eval exists to close.

The fix is to change what “done” means

If a passing test can be worthless, you can’t let a passing test define done. You have to define done as something a worthless test can’t fake.

The key idea is simple. A behavioral claim only counts if it traces to a flow, a real scenario that runs the actual agent runtime, end to end, and produces a trace you can read. Not a mock of the agent. Not a stubbed tool returning a canned value. The same engine that talks to a real customer, run against a scenario built to make the behavior actually happen.

“It works” is a feeling; a flow that runs the real agent is a result.

Watch what that does to the memory claim. A flow that genuinely tests memory cannot feed the answer in as the question. It has to be a conversation with at least two turns, where turn one establishes something the agent must remember and turn two depends on recalling it without being reminded. Turn one: I run a small accounting practice and I only handle quarterly filings, never payroll. Five turns later, with other topics in between: can you take on my payroll? An agent with working memory says no, that’s not your area. An agent that’s faking it offers to help with payroll, cheerfully, having forgotten who it’s talking to. There is no string you can pipe through to fake that. The behavior either survives the distance between the two turns or it doesn’t.

That’s the difference between a test and a flow. A test asks did the code run. A flow asks did the agent behave, and the only way to answer is to make the behavior happen for real and watch.

A red flow is an address, not a complaint

Here’s the objection that comes next, and it’s a fair one. If you stop trusting the green checkmark and start running hundreds of real-behavior flows, a lot of them are going to fail. Isn’t that just trading one dashboard’s worth of false comfort for a different dashboard full of red?

It would be, if a failing flow told you the same useless thing a failing test does. But it doesn’t, and that’s the part worth slowing down on.

When a deterministic test fails, you get “expected X, got Y” and a line number. When a flow fails, you get a transcript, the entire conversation the agent actually had, the tools it reached for, the thing it recalled or didn’t, the moment it went wrong. A red flow is not a verdict. It’s a located gap with a return address. The agent forgets a constraint after four turns of unrelated chat is not a lament about the system feeling dumb. It’s a coordinate. You know which behavior, at which turn, under which pressure.

This is the quiet difference between an eval suite and a test suite. A test suite tells you something broke. An eval tells you what the agent can’t yet do, and where to stand to fix it.

“The system feels dumb” is a feeling. “It drops the payroll constraint on turn six” is a work order.

And once a red flow is an address, the loop almost writes itself. A failing flow points at a behavior; a fix targets that behavior; you re-run the flow to see if the behavior changed. No guessing whether you helped. The flow that found the gap is the same flow that certifies the gap is closed. You are never debugging a vibe, you’re debugging a transcript, against a scenario you can run again on demand.

The inner loop runs a corpus of flows against the real runtime; a red flow becomes a located gap that gets fixed in isolation and re-run. The outer loop is the deployed surface where a human actually uses it. Inner green is a fast proxy; the outer pass is the truth.

Two loops, and never confuse them

There’s a way to take this idea too far, and it fails in the opposite direction, so it’s worth naming.

Suppose you build the eval suite, you grow it to hundreds of flows, every one runs the real runtime, and they all go green. It’s tempting to call that done. It isn’t, and confusing it for done is just the green-checkmark mistake wearing better clothes. A flow runs the real agent, but it runs it in a harness, on a scenario you wrote, from a terminal. A real user runs it through a deployed product, on a surface they don’t understand, with intentions you never imagined.

So we hold two loops, and we never let one pretend to be the other.

The inner loop is fast. It’s the eval suite, hundreds of flows against the live runtime, runnable in seconds, parallel, cheap enough to run on every change. It’s where a red flow becomes a fix and a fix gets re-checked. It is the proxy: close enough to truth to steer by, fast enough to steer with.

The outer loop is slow, and it’s the truth. It’s a real person on the deployed surface, doing the thing they came to do, and having it work. No harness, no scenario file, no friendly scaffolding, the product, in the wild.

A passing inner loop earns one sentence and no more: behaviorally green, pending the real surface. It does not earn “done.” Reporting the two as one number, collapsing “the flows pass” into “it works”, is the José-from-accounting mistake one altitude up. The honest report is two numbers, kept apart: how many flows pass in the harness, and whether a human got what they came for on the live thing. Merge them and you’re back to lying with a checkmark, just a more expensive one.

This is the whole discipline. The inner loop lets you move fast without flying blind. The outer loop keeps the inner loop honest about what it actually proved.

The corpus is alive

One more piece, because a fixed eval suite rots the same way a fixed test suite does.

If the set of flows never changes, you eventually overfit to it. The agent gets good at exactly the scenarios you thought to write, and blind to every one you didn’t, which is most of them, because the situations that break an agent in production are the ones nobody imagined in a meeting. A frozen corpus becomes a comfort blanket: all green, all the time, all on the questions you already knew the answers to.

So the corpus has to grow, and it grows from reality. Every real conversation an agent has is raw material for a new flow. A user phrases a request in a way nobody anticipated; that phrasing becomes a flow. The agent mishandles an edge no test covered; that edge becomes a flow, tagged with where it came from. The set of things you measure expands to match the set of things that actually happen, so the eval can’t be gamed, because you didn’t write it in advance from your own imagination. You harvested it from the world.

That’s what turns evals from a one-time audit into an engine. The monologue of a frustrated user becomes a flow; the flow finds the gap; the gap gets fixed; the fixed behavior gets locked in by the flow that found it. Every conversation makes the next version harder to fool.

What an eval actually is

So back up and look at what we’ve actually been describing, because the word “eval” is doing a lot of quiet work.

An eval is not a test. A test asks whether code did what the author expected. An eval asks whether an agent behaved well in a situation that mattered, and it answers by making the situation real and watching what happens. It’s the difference between asking a pilot if they can land the plane and putting them in the simulator in a crosswind. One is a claim. The other is a flight.

That reframe is the whole post. “It works” is a feeling; a flow that runs the real agent is a result. Everything else, the inner and outer loops, the living corpus, the red-flow-as-address, is just the machinery that makes the second thing buildable at the speed the first thing is claimable.

The turn: honesty is a thing you engineer, not a thing you have

Notice what this is really about, underneath the loops and the corpus. It isn’t about agents. It’s about the oldest temptation in building anything: to believe the thing is good because you want it to be good, and to reach for the measurement that confirms it.

Every team that’s ever shipped knows this temptation by feel. The green dashboard before a launch. The demo that worked the three times you practiced it. The metric that went up because you defined it to. None of those are lies, exactly, they’re hopes wearing the costume of evidence, and the reason they’re so dangerous is that they feel like rigor. An eval discipline is, in the end, a way of refusing to be comforted by your own optimism. It’s a standard you build outside yourself, so that on the day you most want the thing to work, something other than your wanting gets to decide.

The hard part of this was never the engineering. The harness is a few hundred lines. The hard part is the human willingness to keep a number red on a dashboard you could so easily make green, to look at a real transcript where your agent failed a real person and write it down as a flow instead of explaining it away. That discipline doesn’t come from the tooling. It comes from a team that would rather know than feel good.

And that’s the only honest definition of “it works” anyone has ever had. Not the checkmark. The thing the checkmark was supposed to stand for, actually verified, on the real thing, in front of someone who needed it.


That’s what we’re building at Apollo Space: an operating system that earns the word “works” the hard way, by running the real agent against the real situation and reading what actually happened, instead of trusting the green light. If you’ve ever shipped something the dashboard swore was fine and a user proved otherwise, you already know which number you should have been watching.

Apollo runs your company's repetitive ops so your team doesn't.

Join the waitlist for early access, founding-user pricing, and a front-row seat as we ship.

Join the waitlist