Engineering

"It’s a pre-existing failure" is how you ship your own regression

Red CI you didn’t cause hides red CI you did, unless you diff the exact failing set before and after.

ASR

Apollo Space Research

Apollo Space

· 12 min read

The test suite was red before you touched it. Forty-one failures, all flaky network teardown, all unrelated to your change. So you open the pull request, run the suite, see forty-three red, and your eye does the thing every tired engineer’s eye does: it rounds forty-three down to “still the same mess it was this morning.” You merge. Two of those forty-three are yours.

That is the most common way a careful engineer ships a regression. Not by ignoring the tests. By looking right at them and seeing noise where there was a new signal.

Red CI you didn’t cause hides red CI you did, unless you diff the exact failing set before and after.

This post is about that diff: why a green-or-red verdict is the wrong question, why a count is barely better, and what we built so an agent can never wave a new failure through by calling it pre-existing.

Why a red suite goes blind

A suite that has been red for a while stops being a test suite and becomes wallpaper.

The mechanism is purely human and it happens to every team that lets failures accumulate. The first time CI goes red on a change you didn’t make, you investigate. The fifth time, you skim. By the twentieth, the red X at the top of the page carries no information at all, it’s been red for weeks, everyone knows it’s “the flaky ones,” and the brain has quietly reclassified the signal as decoration. The light that’s supposed to mean stop has been on so long it means nothing.

Now drop a real regression into that environment. Your change breaks two assertions. The suite goes from forty-one red to forty-three red. There is no alarm, because the page looked exactly this alarming yesterday. The two failures you introduced are camouflaged by the forty-one you didn’t, same color, same wall of red, same shrug.

A failure hidden inside a known failure is the cheapest bug to introduce and the most expensive to find.

It’s expensive to find later precisely because nobody finds it now. It rides the merge, sits in the main branch for a week behind the standing redness, and surfaces when some unrelated effort finally cleans up the flakes and discovers two failures that don’t belong to anyone. Now it’s a forensic exercise: bisect, blame, reconstruct. The bug was visible the whole time. It was just standing in a crowd.

The naive verdict: green or red

The first instinct is the gate everyone builds first. Block the merge unless CI is green.

It is the right instinct and it works beautifully, right up until the suite is never green. The moment you have a single flaky test, or one slow-to-fix failure that’s “not worth blocking on,” the green gate has two settings and you’ve left the building. Either you enforce it and nothing merges, because something is always red somewhere, or you turn it off “just for now,” and now is forever. A binary gate against a non-binary reality gets disabled within a week. Every team that has run real CI at scale knows the feeling of the required check that everyone learned to override.

So the gate gets relaxed into something softer and worse: don’t make it worse. Red is fine, just don’t add red. That sounds reasonable. It is the trap.

Because “don’t add red” gets measured the lazy way, by the count. The suite had forty-one failures, it has forty-one or fewer now, ship it. And the count is a liar.

Why the count lies

Here is the failure mode that ships regressions through a count-based gate, and it is not exotic. It happens whenever two things move at once.

Suppose your change fixes one flaky test and breaks one real one. Before: forty-one red. After: forty-one red. The count is identical, the gate is satisfied, and you have just merged a genuine regression while deleting the evidence that would have caught it. The number didn’t move. The membership did. A test that was failing for noise went green, and a test that was passing went red in its place, a clean one-for-one swap that no counter can see.

This is not a corner case you can wave off. On any suite large enough to be flaky, the set of failing tests churns every run on its own, a timeout here heals, a race there flares, so the count is drifting plus or minus a few before your change does anything at all. Against that background drift, a single new real failure is statistically invisible. The count gives you a number that feels like rigor and contains none, because it answers “how many?” when the only question that catches a regression is “which ones?

Two ways to read a red suite. On the left, a green-or-red gate sees only a binary verdict and a count; a flaky test that healed cancels a real test that broke, the totals match, and the regression merges. On the right, the same two runs are compared as sets of named failures, the newly-red test stands out by name, and the merge is blocked.

The fix is not a better number. It’s a different unit. Stop counting failures and start naming them.

The diff is the whole idea

Red CI you didn’t cause hides red CI you did, unless you diff the exact failing set before and after. The defense is exactly as literal as the sentence: capture the set of failing test identities on the base branch, capture the set on your branch, and look at the difference between them.

Not the difference in count. The difference in membership. Two sets, named, compared.

Out of that comparison fall three buckets, and the whole verdict lives in which test lands where. There are the failures present on both base and branch, pre-existing, genuinely not yours, the ones you’re allowed to inherit. There are failures present on base but gone on your branch, tests you fixed, which is a nice thing to surface but never an excuse. And there is the one bucket that decides everything: failures present on your branch that were not failing on base. Those are new. Those are yours. There is exactly one of those buckets that can block a merge, and “the suite was already red” cannot empty it, because the comparison was made against precisely the red that was already there.

This is why “it’s a pre-existing failure” stops being a usable excuse. The phrase only worked when the evidence was a wall of undifferentiated red. Once each failure carries an identity and the base-branch set is on record, “pre-existing” is no longer a vibe you assert, it’s a membership test you either pass or fail. A failure is pre-existing if and only if it was failing on base. If it wasn’t, it’s a regression wearing the costume of one.

“Pre-existing” is a claim about a set, not a feeling about a color. Either the test was red on base or it wasn’t.

The base-branch snapshot is the part people skip, and skipping it is what reopens the whole hole. If you only ever look at your own branch’s failures, every red test looks equally guilty and equally innocent, and you’re back to arguing from the color. The snapshot is what turns the argument into arithmetic. Capture the named failing set on base first, that recording is the entire mechanism, and the diff has something true to compare against.

Why an agent needs this more than a human does

A human reviewer, on a good day, can hold the suspicion that the count is lying. An agent reviewing its own change cannot be trusted to hold it, and that’s exactly why the discipline has to live in the structure instead of the reviewer.

The reason is the one that runs underneath all of agent reliability: the mind that wrote the change is the worst-placed mind to judge whether it broke something. An agent that just produced a passing-looking change is motivated, in the soft statistical way these systems are motivated, to read the evidence charitably. Show it forty-three red tests and ask “did you introduce a regression?” and the available story, the suite was already a mess, these are the flaky ones, my change is fine, is fluent, plausible, and exactly the story a tired human tells too. “It’s a pre-existing failure” is the single most natural sentence for a self-grading reviewer to generate, because it resolves the tension and lets the work be done.

So we don’t ask the agent to judge. We hand it the diff and let the diff judge.

The gate as a sequence the agent cannot talk its way around. Capture the named failing set on base, run the branch, capture its set, subtract, and only the new-on-branch bucket controls the merge, so a pre-existing failure can never camouflage one the change introduced.

The agent records the failing set on the base branch before it does anything, runs its change, records the new failing set, and computes the membership difference. The merge decision reads from one bucket only: tests red on the branch that were green on base. If that bucket is empty, a red suite is no obstacle, every standing failure is accounted for against the base snapshot, and the change inherits the mess honestly. If that bucket has even one test in it, the change is blocked, by name, with the test identity printed, and no sentence about pre-existing failures can change the math. The excuse that ships the regression isn’t refuted by a smarter reviewer. It’s made unsayable by the unit of measurement.

That is the move we make everywhere we can: when a claim is the weak point, replace the claim with an artifact. “It works” becomes a trace. “It’s done” becomes a flow that ran. And “it was already broken” becomes a diff of two named sets that either contains your test or doesn’t.

What it costs to do it right

The honest price is that you have to run the suite twice, once on base, once on your branch, and you have to keep failures as identities, not as a tally. That’s more bookkeeping than a green light and more storage than a number.

It is also the only version that survives contact with a flaky suite, which is the only kind of suite a real codebase has. A binary gate dies the day the first test flakes. A count gate ships the first one-for-one swap. The set diff keeps working no matter how red the wallpaper gets, because it never asks the suite to be clean, it only asks the suite to be the same shade of red it was on base, minus nothing you added. The dirtier the baseline, the more valuable the diff, because the dirtier the baseline, the better a single new failure can hide.

There’s a quieter dividend, too. Once “pre-existing” is a recorded set instead of a spoken excuse, the pre-existing failures stop being load-bearing. You can see, run over run, the standing redness as a named list that isn’t growing, and a known, bounded, named set of flakes is a backlog you can actually burn down, instead of a fog everyone has agreed to ignore.

The turn

We’ve watched the smartest engineer in a room ship a regression through a red suite, and it had nothing to do with how good they were. It had to do with how the evidence was shaped. A wall of red asks you to judge, and judging while tired, late, and trusting that “those are the flaky ones” is a thing no amount of talent reliably survives. The bug didn’t get through because someone was careless. It got through because the verdict was a color and a count, and both of those can be true and lying at the same time.

What actually protects a codebase isn’t a more vigilant reviewer, human or agent. It’s an evidence shape that doesn’t depend on vigilance. A diff of two named sets doesn’t get tired at 11pm, doesn’t extend anyone the benefit of the doubt, and doesn’t find “it was already broken” more comforting than the truth. It just answers the one question that matters, which tests are red now that were green before?, and lets the answer stand whether or not anyone was paying attention. The discipline is small. The thing it removes is the most human and most expensive mistake there is: seeing a new problem and reading it as an old one.

Red CI you didn’t cause hides red CI you did, unless you diff the exact failing set before and after. Build the diff once, and the excuse that ships your own regression stops being available to you, on the night you’d most want to believe it.


That’s what we’re building at Apollo Space: an operating system where the verdicts are artifacts, not feelings, so the work is honest even when the people are exhausted. If you’ve ever merged a wall of red and found out a week later that two of those failures had your name on them, you already know why “it was already broken” deserves a machine that checks.

Apollo runs your company's repetitive ops so your team doesn't.

Join the waitlist for early access, founding-user pricing, and a front-row seat as we ship.

Join the waitlist