The worker said "done." A human clicked it. Nothing happened.

An agent's "STATUS: COMPLETE" is a claim to interrogate, never a fact to forward.

Apollo Space Research

Apollo Space


An agent finishes a task, prints a tidy green banner, STATUS: COMPLETE, and moves on. The orchestrator above it reads that line, marks the work done, and tells everyone the feature shipped. Then a person opens the product to use the thing that was just declared finished, clicks the button that’s supposed to do it, and the screen does not change. No row was written. No flow ran. The feature exists in exactly one place: a worker’s self-report.

That gap, between “the worker said done” and “a human clicked it and something happened”, is where most agent systems quietly lie to themselves.

This post is about the discipline that closes it. The thesis is one sentence, and it’s the rule we grade every piece of agent work against.

An agent’s “STATUS: COMPLETE” is a claim to interrogate, never a fact to forward.

The naive version: trust the worker’s word

The obvious way to run a fleet of agents is to believe what they tell you.

You give a worker a task. It plans, it edits files, it runs whatever checks it decided to run, and it returns a status. If the status is success, you propagate the success: the orchestrator marks the task complete, the dashboard turns green, the next stage starts on the assumption that the last one finished. It’s clean. It’s fast. It scales to a hundred agents because no human is in the middle re-reading anything.

It also fails in the most expensive way possible, because the worker grading itself is the worker you trust least.

Here’s the mechanism, and it bites every team that runs a fleet. A worker is asked to build a feature. It writes the backend function. It writes a test that calls the backend function directly. The test passes. The worker, looking at a green test, concludes the feature works, and it’s not lying, exactly. The piece it built does what it built it to do. But “the function returns the right value when I call it from a script” is a very different claim from “a person can do this thing in the product,” and the worker has quietly substituted the first for the second. It tested the layer it could reach and reported on the layer you cared about.

So the status says COMPLETE. And the create-button on the actual screen was never wired to that function, because wiring the screen was a different layer the worker never touched. Nobody finds out until someone tries to use it.
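The substitution described above can be made concrete with a small sketch. Everything here is hypothetical and invented for illustration: `create_item` stands in for the backend function, `Screen` for the front end, and `ROWS` for the database. The point is only that the worker's test and the test the status implied are two different tests.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical in-memory "database" for the sketch.
ROWS: List[dict] = []

def create_item(name: str) -> dict:
    """Backend function: the layer the worker built and tested."""
    row = {"id": len(ROWS) + 1, "name": name}
    ROWS.append(row)
    return row

@dataclass
class Screen:
    """Stand-in for the front end: a control does something only if wired."""
    handlers: Dict[str, Callable[[str], dict]] = field(default_factory=dict)

    def click(self, control: str, value: str) -> bool:
        handler = self.handlers.get(control)
        if handler is None:
            return False  # the button is on screen, but nothing happens
        handler(value)
        return True

def worker_test() -> bool:
    """What the worker checked: the function, called directly from a script."""
    return create_item("from-script")["name"] == "from-script"

def surface_test(screen: Screen) -> bool:
    """What the status claimed: a person can create the item from the screen."""
    before = len(ROWS)
    clicked = screen.click("create", "from-screen")
    return clicked and len(ROWS) == before + 1

unwired = Screen()                                # the layer the worker never touched
wired = Screen(handlers={"create": create_item})  # the missing wiring

print(worker_test())          # True: the function works
print(surface_test(unwired))  # False: the click reaches nothing
print(surface_test(wired))    # True only once the control calls the function
```

Both tests are honest. Only the second one answers the question the status was supposed to answer.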

What “done” actually means, layer by layer

The fix starts with refusing to let one word, “done”, stand in for a stack of very different facts.

A real feature is a vertical, not a point. There’s a business reason for it. There’s an API route. There’s a database schema and a migration. There’s whatever logic makes the agent behave correctly. There’s a test that exercises the real path, not a convenient shortcut. There’s a screen with a control on it. And at the very top there’s a person who clicks that control and watches the right thing happen. “Done” is a property of the whole column. A worker that finished one box at the bottom and reported on the box at the top has skipped the part that was the actual point.

[Figure: two ways a fleet treats a finished task. Left: the worker self-grades a single layer it could reach and the orchestrator forwards COMPLETE straight to a green dashboard. Right: the same claim is routed through an independent check that walks the full vertical and only marks done when a human clicks the real surface and something happens.]

The trap is that the bottom box is the easy one and the top box is the whole reason the work exists. Inserting a row directly into the database and announcing that the feature works is the canonical version of this miss. The data is there, so surely it works? But “a row exists” answers a question nobody asked. The question was: can a person create this thing, through the screen they actually use? If the create-flow on the front end doesn’t exist, the row is plumbing: real, necessary, and not the thing.

An agent’s “STATUS: COMPLETE” is a claim to interrogate, never a fact to forward. The interrogation has a shape: walk the vertical, box by box, and ask of each one whether it was proven or merely asserted.
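One way to picture that walk is as a list of layers, each carrying the kind of evidence behind it. This is a minimal sketch, not a description of any real system: the `Evidence` states, the `Layer` type, and the example values are all assumptions made for illustration, with layer names taken from the vertical described earlier.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class Evidence(Enum):
    PROVEN = "proven"      # the real path ran and was observed
    ASSERTED = "asserted"  # someone said so; nothing ran in front of the check
    MISSING = "missing"    # the layer was never built

@dataclass(frozen=True)
class Layer:
    name: str
    evidence: Evidence

def first_unproven(vertical: List[Layer]) -> Optional[Layer]:
    """Walk the vertical in order; return the first layer that is not proven."""
    for layer in vertical:
        if layer.evidence is not Evidence.PROVEN:
            return layer
    return None

def is_done(vertical: List[Layer]) -> bool:
    """'Done' is a property of the whole column, not of any single box."""
    return bool(vertical) and first_unproven(vertical) is None

# Hypothetical state of one feature after a worker reported COMPLETE.
vertical = [
    Layer("business reason", Evidence.PROVEN),
    Layer("API route", Evidence.PROVEN),
    Layer("schema and migration", Evidence.PROVEN),
    Layer("agent logic", Evidence.PROVEN),
    Layer("test on the real path", Evidence.ASSERTED),
    Layer("screen control", Evidence.MISSING),
    Layer("human click", Evidence.MISSING),
]

gap = first_unproven(vertical)
print(is_done(vertical))            # False
print(gap.name if gap else None)    # test on the real path
```

An asserted layer counts the same as a missing one here. That is the design choice: a box nobody watched run is not evidence.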

Why a smarter worker doesn’t fix this

The tempting response is to make the worker more honest. Tell it to test the real path. Tell it to verify end to end. Surely a more careful worker grades itself correctly?

It doesn’t, and the reason is structural, not a matter of diligence.

A worker grading its own output has the same blind spot the author of anything always has: it can only check for the failures it already imagined, and any failure it imagined it has already handled. It writes the test that asserts what its code already does. It defines “done” as whatever it managed to finish. Ask it “did you complete the task?” and the gravity of the question pulls toward yes: the work looks plausible, the checks it chose are green, the explanation is coherent. A worker certifying itself is two signatures on one mind’s blind spot.

A claim and a result are not the same kind of thing. One is a feeling wearing a checkmark. The other survived an attempt to disprove it.

The grading has to leave the worker entirely. Not to a second worker that nods along, but to a check with no stake in the work being declared done, whose job is to disbelieve the status until the real path runs in front of it. The difference is the question you ask. “Does this look complete?” invites a yes. “Walk me from the business need to a human clicking the live screen, and show me each layer actually ran” invites the truth, because it can’t be answered with a feeling. It can only be answered by the flow either running or not.
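The difference between forwarding a claim and interrogating it fits in a few lines. The sketch below is illustrative only: `WorkerReport`, `forward_claim`, `interrogate_claim`, and the `run_real_flow` callable are hypothetical names, and the flow runner is assumed to be supplied by something other than the worker.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class WorkerReport:
    task: str
    status: str  # whatever the worker printed, e.g. "COMPLETE"

def forward_claim(report: WorkerReport) -> bool:
    """Naive orchestrator: the worker's word becomes the fact."""
    return report.status == "COMPLETE"

def interrogate_claim(report: WorkerReport,
                      run_real_flow: Callable[[str], bool]) -> bool:
    """Independent check: the status is only a reason to run the flow.

    `run_real_flow` is a hypothetical callable owned by the check, not the
    worker; it returns True only if the flow actually ran end to end.
    """
    if report.status != "COMPLETE":
        return False
    return run_real_flow(report.task)

def flow_never_reaches_screen(task: str) -> bool:
    """Stand-in for a flow that stops before the front end."""
    return False

report = WorkerReport(task="create-item", status="COMPLETE")
print(forward_claim(report))                                 # True: the claim, forwarded
print(interrogate_claim(report, flow_never_reaches_screen))  # False: the claim, tested
```

The second function cannot return a yes on the strength of the status alone. The only way through it is for the flow to run.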

The two-loop trick: a fast proxy and a slow truth

There’s a practical objection here, and it’s a good one. If “done” requires a human clicking the deployed product, you can only verify a few things a day. The whole point of a fleet was speed. Doesn’t a brutal verification bar throw the speed away?

This is the trap most teams fall into: they pick one. Either they verify fast and shallow (self-graded, local, green, and frequently wrong), or they verify slow and real and grind to a crawl waiting on deploys. The naive fleet runs the first loop and calls it done. The cautious team runs the second and ships once a week. Neither is the answer.

The answer is to run both loops and never let one impersonate the other.

The inner loop is the fast proxy. A worker’s claim gets checked locally against the real runtime (the actual code path, not a mock) in seconds rather than deploy cycles. It’s a proxy because local-green is not the same as the customer’s screen working, and you say so out loud. The outer loop is the slow truth: the deployed surface, where a real person (or a flow standing in for one) clicks the live control and the system either does the thing or doesn’t. The discipline is to report two numbers, never one: the inner pass-rate and the outer pass-rate. A merged inner-green is treated as “behaviorally green, pending the real surface,” not as shipped.

[Figure: the inner loop runs a worker's claim against the real local runtime in seconds as a fast proxy, then the outer loop confirms it on the deployed surface where a human clicks the live control. The two pass-rates are reported separately, and a feature is only done when the slow, real loop confirms the fast one.]

Collapsing those two numbers into one cheerful “it works” is the original sin one altitude up. The fast loop tells you the work is probably right. Only the slow loop tells you it’s actually true. You need the proxy for speed and the truth for trust, and the moment you let the proxy speak for the truth, you’re back to forwarding a claim.
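Keeping the two numbers apart is mostly a matter of never storing them in one field. Here is a minimal sketch under stated assumptions: the `Check` record, the label strings other than the quoted “behaviorally green” phrase, and the sample features are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Check:
    feature: str
    inner_green: bool            # passed locally against the real runtime
    outer_green: Optional[bool]  # result on the deployed surface; None = not run yet

def label(check: Check) -> str:
    """Name the state honestly; inner-green alone is never 'done'."""
    if not check.inner_green:
        return "red"
    if check.outer_green is None:
        return "behaviorally green, pending the real surface"
    return "done" if check.outer_green else "red on the real surface"

def pass_rates(checks: List[Check]) -> Dict[str, Optional[float]]:
    """Report two numbers, never one. A rate is None when nothing was run."""
    outer_run = [c for c in checks if c.outer_green is not None]
    inner = sum(c.inner_green for c in checks) / len(checks) if checks else None
    outer = (sum(c.outer_green is True for c in outer_run) / len(outer_run)
             if outer_run else None)
    return {"inner": inner, "outer": outer}

# Hypothetical results for four features.
checks = [
    Check("create-item", inner_green=True, outer_green=True),
    Check("export", inner_green=True, outer_green=None),
    Check("archive", inner_green=True, outer_green=False),
    Check("rename", inner_green=False, outer_green=None),
]

print(pass_rates(checks))  # {'inner': 0.75, 'outer': 0.5}
for c in checks:
    print(c.feature, "->", label(c))
```

Note that the outer rate is computed only over checks that actually ran on the deployed surface. A pending check is not counted as a pass or a fail, which keeps the fast loop from quietly inflating the slow one.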

Where the red flow points

One more piece, because a verification bar that only says “no” makes everyone tired.

A failed check is most useful when it’s an address, not a complaint. “The system feels dumb” is a lament; nobody can act on it. “A person could not create this from the screen because the create-flow stops at the API layer and never reaches the front end” is a located gap with a fix attached. The discipline that interrogates the claim also has to locate the failure (which box in the vertical went red) so the next worker starts from the address instead of re-diagnosing the whole stack.

That’s what turns a hard standard into a loop instead of a wall. A red flow spawns a fix. The fix re-runs the inner loop, fast. If it goes green, it queues for the outer loop, real. The standard never bends (done still means a human clicked it and something happened), but the path to clearing it is short and well-lit, because every failure came with its own coordinates.
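The red-to-fix-to-recheck cycle can be sketched as a small loop. The names below (`RedFlow`, `run_until_queued`, `max_fixes`) are hypothetical, and the sketch assumes the inner check returns either nothing (green) or a located failure. It deliberately has no way to return “done”: only the outer loop can say that.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass(frozen=True)
class RedFlow:
    layer: str   # which box in the vertical went red
    reason: str  # what was observed there

def run_until_queued(
    inner_check: Callable[[], Optional[RedFlow]],
    apply_fix: Callable[[RedFlow], None],
    max_fixes: int = 3,
) -> Tuple[str, Optional[RedFlow]]:
    """Run the fast loop; on red, hand the address to a fix and re-run.

    Returns ("queued_for_outer", None) once the inner loop is green, or
    ("red", failure) if the fix budget runs out.
    """
    failure = inner_check()
    fixes = 0
    while failure is not None and fixes < max_fixes:
        apply_fix(failure)       # the next worker starts from the address
        fixes += 1
        failure = inner_check()  # re-run the inner loop, fast
    if failure is None:
        return ("queued_for_outer", None)
    return ("red", failure)

# Hypothetical feature whose screen control was never wired.
state = {"wired": False}
fixed_layers: List[str] = []

def inner_check() -> Optional[RedFlow]:
    if not state["wired"]:
        return RedFlow("screen control",
                       "create-flow stops at the API layer and never reaches the front end")
    return None

def apply_fix(failure: RedFlow) -> None:
    fixed_layers.append(failure.layer)
    state["wired"] = True

print(run_until_queued(inner_check, apply_fix))  # ('queued_for_outer', None)
print(fixed_layers)                              # ['screen control']
```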

The worker’s job is to do the work. The standard’s job is to refuse the word “done” until the work is real, and to hand back a map every time it refuses.

The turn: ownership is the part you can’t self-report

The loops and the layers are machinery for something much older than software.

The reason you trust a particular coworker isn’t the confidence in their status updates. Anyone can type “done.” You trust the one who, when you ask whether the thing actually works, doesn’t answer from memory. They open the product, click the button you’d click, and watch it happen before they tell you yes. You trust them because they’ve internalized the gap between “I built it” and “it works for the person who needs it,” and they refuse to let the first masquerade as the second. That refusal isn’t a feature. It’s a stance toward the work.

That’s the part you can’t install. You can wire up an inner loop and an outer loop, you can route every status through a check that disbelieves it, you can make red flows carry their own addresses; all of that is real, and all of it matters. But underneath the mechanism is a single conviction the mechanism only enforces: that a claim is owed a proof, that the person on the other end clicking the button is the only judge whose verdict counts, and that “looks done” is a place to start an argument, not a place to stop. We didn’t invent that. The best engineers have always lived it. We just built a system that never gets tired of holding the line, even on the day everyone else is.

An agent’s “STATUS: COMPLETE” is a claim to interrogate, never a fact to forward, and the moment a fleet forgets that, it starts shipping features that exist only in their own status reports.


That’s what we’re building at Apollo Space: an operating system where an agent’s word is the beginning of the check, not the end of it, and where “done” means a person clicked the real thing and it worked. If you’ve ever shipped a green-status feature that turned out to be a row in a table nobody could reach, you already know which half of the work the machine should be doing for you.
