Can Apollo be your QA department? Yes, because QA was always an eval loop

QA is not a phase you bolt on before launch. It is a loop: define the flow, run the real product, grade the result with a judge that never wrote the code, file the gap as a task.

Apollo Space Research

Apollo Space

October 22, 2025 · 12 min read

A test suite goes green. The build ships. Two days later a customer can’t get past the second screen of the thing the suite swore was fine. Nobody lied. Every assertion the team wrote passed; it’s just that none of them covered what the customer actually did. The tests checked that the code did what the code does. They never checked that the product does what the user needs.

That gap, between “the tests pass” and “the flow works”, is the whole reason QA exists. And it’s the gap most QA never closes, because QA gets treated as a phase you survive instead of a loop you run.

Here’s the claim this post defends. QA is not a phase but a loop: define the flow, run the real product, grade the result with a judge that never wrote the code, file the gap as a task. Say that sentence out loud and you’ve described something an operating system can run continuously, not something a team does in a panic the week before launch.

So can Apollo be your QA department? Yes, but only because QA was never really a department. It was a loop wearing a department’s clothes.

The naive QA: a phase at the end, graded by the people who shipped it

The way most teams do QA is the way you’d expect if you’d never thought hard about it. You build the feature. When it feels done, you hand it to QA, or, in a small company, you are QA, clicking through your own work at 11pm before a release. You run the happy path, it looks right, you ship.

This fails in two specific, predictable ways, and naming them is the whole point.

The first failure is the phase problem. QA-at-the-end means QA runs exactly once per release, on the most depleted timeline, against the version you’re already emotionally committed to shipping. By the time the loop closes, the gap it found is the most expensive kind, the one a customer would have hit first. You didn’t run the loop too slowly. You ran it too late, and only once.

The second failure is worse, and quieter. The same mind that built the thing is grading whether it works. The engineer who wrote the flow clicks through the flow, and of course it works: they take the path they designed, fill the field with the value they expected, and never try the input they didn’t imagine, because if they’d imagined it they’d have handled it. A green check from the author is the author agreeing with the author. It is not evidence.

These two failures compound. A phase that runs once, self-graded by the builder, is how a product passes every internal check and falls over the moment a stranger touches it. The bug didn’t slip through QA. QA was structurally built to miss it.

The loop QA actually is: flow, run, judge, file

So strip QA back to what it’s doing underneath the org chart. Forget the department. What are the steps?

Step one: define the flow. Not a code path, but a user-shaped journey. “A new customer signs up, connects their calendar, and gets their first morning briefing.” That’s a flow. It has a beginning a real person starts at and an end a real person can see.

Step two: run the real product. That means the deployed thing, on the actual runtime, not a mock of it. A test that stubs the part that breaks is a test that passes for the wrong reason. The flow has to touch the same surfaces a customer touches.

Step three: grade the result with a judge that never wrote the code. This is the load-bearing step, and the one everyone skips. The grader has to be independent, structurally unable to give the builder the benefit of the doubt, and it has to grade the outcome the user wanted, not the line of code that ran.

Step four: file the gap as a task. A failed flow is not a vague feeling that “the product feels rough.” It’s a located gap with an address: this flow, this step, this is what was supposed to happen and didn’t. That address is a task someone, or something, can pick up and fix.

Then the loop closes and runs again. And again. That’s the trick the phase model misses entirely: a loop has no “once.” It runs every time the code changes, forever.
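
To make the four steps concrete, here is a minimal Python sketch of one pass of the loop. It is an illustration under assumptions, not Apollo’s implementation: `Flow`, `Observation`, `Task`, `qa_pass`, `fake_run`, and `strict_judge` are all hypothetical names, the runner is a toy stand-in for driving a deployed product, and the judge is reduced to an exact string match.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Flow:
    # Step one: a user-shaped journey, not a code path.
    name: str
    steps: Tuple[str, ...]
    expected_outcome: str


@dataclass(frozen=True)
class Observation:
    last_step: str  # how far the run got
    seen: str       # what the user would have seen at that point


@dataclass(frozen=True)
class Task:
    # Step four: a gap with an address (flow, step, expected, observed).
    flow: str
    step: str
    expected: str
    observed: str


def qa_pass(
    flows: Iterable[Flow],
    run: Callable[[Flow], Observation],  # step two: drives the product
    judge: Callable[[str, str], bool],   # step three: sees outcomes only, never code
) -> List[Task]:
    """Run every flow once and return one Task per failed flow."""
    tasks: List[Task] = []
    for flow in flows:
        obs = run(flow)
        if not judge(flow.expected_outcome, obs.seen):
            tasks.append(Task(flow.name, obs.last_step, flow.expected_outcome, obs.seen))
    return tasks


# Toy stand-ins so the sketch runs (assumptions, not a real runner or judge).
def fake_run(flow: Flow) -> Observation:
    return Observation(last_step="connect calendar", seen="error screen")


def strict_judge(expected: str, seen: str) -> bool:
    return expected == seen


briefing = Flow(
    name="first morning briefing",
    steps=("sign up", "connect calendar", "receive briefing"),
    expected_outcome="first briefing delivered",
)
tasks = qa_pass([briefing], fake_run, strict_judge)
```

Calling `qa_pass` again after every change is what turns this from a phase into a loop; the function itself holds no notion of “once.”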

Figure: QA reframed as a four-step loop instead of a one-time end phase. Define a user-shaped flow, run the real deployed product, hand the result to an independent judge that never wrote the code, then file each failure as a located task with an address; the loop runs again on the next change.

QA is not a phase but a loop: define the flow, run the real product, grade the result with a judge that never wrote the code, file the gap as a task. Notice that nothing in those four steps requires the loop to be run by a human at the end of a release. Each step is something a system can do, on every change, while the office is dark.

The judge problem: why “add a reviewer” isn’t enough

There’s a tempting shortcut at step three, and it’s worth killing it on the page because it’s the one most teams reach for first.

The shortcut is: have a second agent, or a second engineer, double-check the work. Sounds like independence. It isn’t. A reviewer with no adversarial mandate, asked “does this look right?”, is strongly inclined to say yes. The change is plausible, the screen renders, the demo path works. Sure, looks right. Now two minds have signed off on the same blind spot, and you’ve mistaken a second signature for a second judgment.

The naive fix doubles the graders. It doesn’t change the question.

The real fix changes the question. An honest QA judge isn’t asked “is this right?” It’s asked “what did the user actually want here, and did they get it?”, and it grades against the flow’s stated outcome, not against whatever the code happened to produce. If the flow said “the customer reaches their first briefing” and the customer reached an error screen, no amount of green tests upstream changes the grade. The judge reads the outcome the way a customer would, with no pride in the work and no memory of how hard it was to build.
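
One way to picture “changing the question” is a grader that is handed the upstream test status and deliberately ignores it. The sketch below is hypothetical: `RunReport` and `grade` are made-up names, and comparing outcomes by string equality is a simplifying assumption standing in for whatever judgment a real judge applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunReport:
    unit_tests_green: bool  # what the builder's own suite reported
    user_reached: str       # where the user actually ended up


def grade(expected_outcome: str, report: RunReport) -> bool:
    """Grade the flow's stated outcome, not what the code happened to produce."""
    # unit_tests_green is intentionally unused: green tests upstream
    # do not change the grade.
    return report.user_reached == expected_outcome


# Green suite, but the customer landed on an error screen: the flow fails.
verdict = grade("first briefing", RunReport(unit_tests_green=True, user_reached="error screen"))
```

Here `verdict` is `False` even though every upstream test passed, which is the whole point of grading against the flow’s outcome.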

Two rules make that judge trustworthy, and they’re both rules you can enforce in a system more reliably than in a tired human.

A claim is not a result. “It works” is a feeling; “this exact flow ran end to end and produced what the user needed” is a result.

And the second rule, the one that keeps the loop honest:

Never let the mind that did the work be the mind that grades it.

That’s not a slogan about agents. It’s the oldest principle in quality control, from manufacturing to code review to peer-reviewed science: the author is the worst possible certifier of their own output. QA’s entire reason to exist is to move the grading away from the builder. An operating system can enforce that separation by construction: the judge is a different process, with a different prompt, that literally never saw the code get written.
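
A small sketch of what enforcing that separation could look like in code. This is a runtime guard that approximates “by construction,” under the assumption that every writer and judge carries an identifier; `SelfGradingError`, `Verdict`, and `record_verdict` are hypothetical names, not an Apollo API.

```python
from dataclasses import dataclass


class SelfGradingError(Exception):
    """Raised when the author of a change tries to grade it."""


@dataclass(frozen=True)
class Verdict:
    flow: str
    passed: bool
    graded_by: str


def record_verdict(flow: str, passed: bool, author_id: str, grader_id: str) -> Verdict:
    """Accept a verdict only if the grader is not the author of the change."""
    if grader_id == author_id:
        raise SelfGradingError(f"{grader_id!r} wrote this change and cannot grade it")
    return Verdict(flow, passed, grader_id)


verdict = record_verdict("first morning briefing", False, author_id="writer-1", grader_id="judge-7")
```

A stronger version would give the judge a process that never receives the diff at all; the guard above only rejects the most direct form of self-grading.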

Why this is a job for an OS, not a tool

Now put the loop somewhere. Here’s where “can Apollo be your QA department” stops being a metaphor.

A QA tool runs when you open it. You point it at a flow, you click go, you read the report, you close the tab. It’s a box you query, which means the loop only runs when a human remembers to run it, which means it runs least often exactly when the team is busiest, which is exactly when bugs ship. The tool inherits the phase problem it was supposed to solve. The bottleneck never disappears; it just moves into the person who has to remember to open the tool.

An OS doesn’t wait to be opened. It has a scheduler. The same loop (define, run, judge, file) runs because the code changed, not because someone remembered. A flow that mattered yesterday gets re-run today, automatically, against the deployed product, and the gap it finds shows up as a task in the same place all the other work lives. The grading happens with an independent judge by default, because the system was built so the writer and the grader are never the same process.
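
The “runs because the code changed” part can be sketched as a trigger that a scheduler polls. Everything here is assumed for illustration: `ChangeTrigger`, `on_tick`, and the idea of identifying a deployment by a revision string are hypothetical, and `run_all_flows` stands in for a full define-run-judge-file pass that returns the gaps it found.

```python
from typing import Callable, List, Optional


class ChangeTrigger:
    """Re-runs the QA pass whenever the deployed revision changes."""

    def __init__(self, run_all_flows: Callable[[str], List[str]]) -> None:
        self._run_all_flows = run_all_flows
        self._last_seen: Optional[str] = None

    def on_tick(self, deployed_revision: str) -> Optional[List[str]]:
        # Called by a scheduler; no human has to remember to open anything.
        if deployed_revision == self._last_seen:
            return None  # nothing changed, nothing to re-grade
        self._last_seen = deployed_revision
        return self._run_all_flows(deployed_revision)


# Toy pass that reports one gap per revision, just to show the control flow.
trigger = ChangeTrigger(lambda rev: [f"gap found at {rev}"])
first = trigger.on_tick("rev-a")   # new revision: the loop runs
second = trigger.on_tick("rev-a")  # same revision: skipped
third = trigger.on_tick("rev-b")   # changed again: the loop runs again
```

The design choice worth noticing is that the trigger keys on the product changing, not on a person asking.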

That’s the difference between a QA tool and a QA department: one runs when summoned, the other runs continuously and tells you when something broke. The department was always the continuity, not the people.

Figure: Two ways to do QA. On the left, a QA tool waits to be opened, a human runs it at release time, self-grades the happy path, and the bug ships when the team is too busy to run it. On the right, an OS runs the same loop on every change: an independent judge grades each flow against the user’s outcome, and failures land as tasks automatically.

And because the gap arrives as a task, not a report, the loop doesn’t dead-end. A report gets read and forgotten. A task gets picked up (by an engineer, or by one of the writer agents already working in the codebase), fixed, and re-run through the same judge. The loop that found the gap is the loop that confirms the fix. Nothing closes the gap by asserting it’s closed; it closes when the same flow that failed now passes in front of the same independent judge.
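
As a rough sketch of that closing rule: a task exposes no way to be marked done except by re-running its flow through the judge. `GapTask` and `try_close` are hypothetical names, and the `rerun` and `judge` callables are assumed stand-ins for the real runner and the real independent judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GapTask:
    flow: str
    expected: str
    status: str = "open"


def try_close(
    task: GapTask,
    rerun: Callable[[str], str],        # re-runs the flow, returns what the user saw
    judge: Callable[[str, str], bool],  # the same judge that failed it the first time
) -> bool:
    """Close the task only if the failing flow now passes the judge."""
    seen = rerun(task.flow)
    if judge(task.expected, seen):
        task.status = "closed"
    return task.status == "closed"


def same_judge(expected: str, seen: str) -> bool:
    return expected == seen


task = GapTask(flow="first morning briefing", expected="first briefing delivered")
still_broken = try_close(task, lambda flow: "error screen", same_judge)
now_fixed = try_close(task, lambda flow: "first briefing delivered", same_judge)
```

An author saying “fixed” never touches `status`; only a passing re-run does.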

The corpus that won’t let you cheat

There’s one more piece, and it’s the piece that turns a QA loop into a QA department you can trust over months instead of one good afternoon.

A single flow is a test. A growing library of flows is coverage. The naive way to build that library is to write a fixed set of test cases once and call it done, which means your QA only ever checks the things you already thought to check, and never the thing a real user did that you never imagined. A frozen suite measures yesterday’s product against yesterday’s imagination.

The better way: the library grows from real usage. Every time a customer hits a path nobody anticipated (the input you didn’t handle, the flow you didn’t think was a flow), that path becomes a new entry in the corpus, and from then on the loop checks it forever. The product can’t silently regress on a real pain twice, because the first time it surfaced, it became a flow the judge now grades on every change. QA stops being a guess about what might break and becomes a record of everything that ever did.
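
A growing corpus can be sketched as an append-only collection keyed by the path the user took. The names `FlowCorpus` and `record_incident` are hypothetical, and treating the step sequence as the identity of a flow is an assumption made to keep the example small.

```python
from typing import Dict, List, Sequence, Tuple


class FlowCorpus:
    """Append-only library of flows; there is deliberately no remove method."""

    def __init__(self) -> None:
        self._flows: Dict[Tuple[str, ...], str] = {}

    def record_incident(self, steps: Sequence[str], expected_outcome: str) -> bool:
        """Add the path a real user took. Returns True if it was new."""
        key = tuple(steps)
        if key in self._flows:
            return False  # already covered; the loop grades it on every change
        self._flows[key] = expected_outcome
        return True

    def flows(self) -> List[Tuple[Tuple[str, ...], str]]:
        return list(self._flows.items())

    def __len__(self) -> int:
        return len(self._flows)


corpus = FlowCorpus()
corpus.record_incident(["sign up", "connect calendar", "receive briefing"], "first briefing delivered")
# A path nobody anticipated: the user skips the calendar step.
added = corpus.record_incident(["sign up", "skip calendar", "receive briefing"], "briefing without events")
repeat = corpus.record_incident(["sign up", "skip calendar", "receive briefing"], "briefing without events")
```

After these calls the corpus holds two flows, and the unanticipated one stays in every future pass.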

That’s the property a department of humans could never quite hold, because a department forgets. People leave, the institutional memory of “oh, watch out for this edge case” leaves with them, and the same bug ships again two years later. A loop with a growing corpus doesn’t forget. The flow is the memory.

The turn: QA was never the gate. It was the conscience.

Strip away the agents and the scheduler and what’s left is a question about what QA is for.

We tend to think of QA as a gate, the checkpoint a release has to clear before it reaches a customer. But a gate that opens once per release, staffed by the people who built the thing, isn’t really protecting the customer. It’s protecting the schedule. The deeper job QA was always doing, the one the best testers did instinctively and the worst process buried, is to be the company’s conscience about whether the thing actually works for the person using it. Not whether it compiled. Whether it worked.

That conscience is too important to run once, and far too important to let the builder grade. In most companies it lives in one or two exhausted people who carry the whole product’s reliability in their heads and click through it by hand the night before every launch. That’s not diligence. It’s a single point of failure wearing the costume of care.

The promise isn’t a faster test runner. It’s that the conscience never sleeps, never self-grades, and never forgets a flow it has seen break. The most reliable judgment in the company stops living in one tired person’s head the night before launch, and starts living in a loop that runs every time the product changes, graded by something that has no reason to give the work the benefit of the doubt.

That’s what we’re building at Apollo Space: not a QA tool you remember to open, but a loop that runs itself: define the flow, run the real product, grade it with a judge that never wrote the code, file the gap as a task. If you’ve ever shipped a green build that broke the second a stranger touched it, you already know why QA should never be the same mind that built the thing, and why it should never run just once.
