Two agents wrote to the same file. We almost shipped a database that wouldn’t boot.

Parallelism doesn’t fail loud. It fails in the one slot two writers both reached for.

Apollo Space Research

Apollo Space

· 11 min read

Two agents got the same migration number. Not on purpose: each was working alone, in its own copy of the repo, on its own feature, and each needed the next slot in the migrations folder. Both looked. Both saw the last file was 0041. Both wrote 0042. Two different 0042s, two different schemas, both green in isolation, both merged within the same hour. The database that came out the other side could not decide which 0042 was real: a fresh boot ran one of them, then choked on the other.

Nothing crashed when it should have. Each branch passed its own tests. The CI for each PR was a wall of green checkmarks. The failure only existed in the place no single branch could see: the seam where they met.

Parallelism doesn’t fail loud. It fails in the one slot two writers both reached for.

This post is about that seam: why running many agents in parallel creates a failure mode that no individual agent can catch, and the discipline that closes it. The fix isn’t smarter agents. It’s a rule about who is allowed to touch what.

Why we run many agents at once

Start with why the collision was even possible, because the instinct is to conclude that parallelism is the mistake. It isn’t.

A real codebase has more work in it than any single agent can hold in its head. Refactor the billing module, add a settings page, fix the flaky test, extend the schema: these are independent jobs, and forcing them through one agent, one at a time, turns a day of work into a week. So we fan them out. Each agent gets its own task and its own isolated checkout of the repository, then plans, writes, tests, and opens a change. Several run at once. The whole point is that they don’t wait on each other.

And for the work that’s genuinely independent, this is close to free speed. An agent editing the settings page never reads the same lines as an agent editing billing. They finish, they merge, nothing touches.

The trouble starts the moment “independent” turns out to be a lie. Two tasks that have nothing to do with each other can still both need the same shared resource, and a migrations folder is exactly that. It’s a queue with a single next slot, and every feature that touches the database reaches for it. Two agents, two unrelated features, one slot. Neither did anything wrong. They just both reached for the same thing while looking away from each other.

Figure: three agents work in isolated copies of one repo on unrelated tasks, yet all three reach into the same shared migrations folder for the next number. This is the contention that isolation hides.

That’s the shape of the whole problem. Isolation makes each agent fast and each agent blind. It guarantees they won’t step on each other’s lines, and quietly says nothing about the resources all of them share.

The naive version: just merge them and let tests catch it

The obvious answer is the one we started with, and it’s worth walking through exactly how it fails, because it’s the answer most teams reach for first.

You run the agents, each opens a PR, each PR has its own CI, and you trust the green checkmarks. If something’s broken, the tests will catch it. That’s the contract CI is supposed to honor.

Here’s why it doesn’t hold. Each PR’s CI builds that PR’s branch: the base plus one agent’s changes. On branch A, there is exactly one 0042, the schema migrates cleanly, and every test passes. On branch B, there is also exactly one 0042, a different one, and it too migrates cleanly and passes. Each branch is internally consistent. Each is correct against the world it was tested in. The conflict doesn’t live in either branch. It’s born at merge, when both 0042s land in the same tree, and the thing that tested fine is now a folder with two files claiming the same position.

Every branch was green. The bug was in the union of the branches, and no branch’s tests can see the union.

This is the part that catches people. We’re trained to trust a green test suite, and the suite wasn’t lying: it correctly reported that each change was fine on its own. The flaw was in what got tested. Nobody ran the suite against both changes at once, because at the moment each suite ran, the other change didn’t exist yet. A test can only fail on a state it actually evaluates, and the colliding state appears for the first time after both merges, when no test is watching.

Worse, git itself often won’t save you. A real merge conflict, two edits to the same lines, gets flagged and blocks the merge. But two new files, 0042_add_invoices.sql and 0042_add_audit_log.sql, don’t overlap by a single character. Git merges them happily. There is no conflict to resolve, because the conflict isn’t textual. It’s semantic: the folder has a rule git has never heard of, which is one migration per number, applied in order. The tool that’s supposed to catch collisions only catches the kind it can see.
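The rule git has never heard of is easy to write down, which makes it easier to see why each branch passed. The sketch below is illustrative only: the `colliding_numbers` helper, the file-name pattern, and the two base migrations are assumptions made for the example, not our actual tooling. It applies the “one migration per number” rule to each branch’s file list and then to their union.

```python
import re
from collections import defaultdict

# Assumed naming convention: four digits, underscore, description, ".sql".
MIGRATION_NAME = re.compile(r"^(\d{4})_.+\.sql$")


def colliding_numbers(filenames):
    """Return {number: [files]} for every migration number used more than once."""
    by_number = defaultdict(list)
    for name in filenames:
        match = MIGRATION_NAME.match(name)
        if match:
            by_number[match.group(1)].append(name)
    return {n: sorted(files) for n, files in by_number.items() if len(files) > 1}


# Hypothetical base history; only the two 0042 names come from the story above.
base = {"0040_add_users.sql", "0041_add_sessions.sql"}
branch_a = base | {"0042_add_invoices.sql"}
branch_b = base | {"0042_add_audit_log.sql"}
merged = branch_a | branch_b  # what the tree looks like after both merges

print(colliding_numbers(branch_a))  # {}
print(colliding_numbers(branch_b))  # {}
print(colliding_numbers(merged))
# {'0042': ['0042_add_audit_log.sql', '0042_add_invoices.sql']}
```

A check like this only helps if it runs against the merged tree. Run per branch, it reports clean both times, which is exactly what our CI did.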

So the loud checks stayed quiet. Parallelism didn’t fail loud here either. It failed in the one slot two writers both reached for, after every check that could have flagged it had already said yes.

Why “be more careful” isn’t the fix

The tempting next move is to make the agents check before they write. Have each agent, right before claiming 0042, re-scan the folder to confirm the number is still free.

It feels like it should work. It doesn’t, and the reason is old enough to have a name.

Between the instant an agent reads “the last migration is 0041” and the instant it writes 0042, another agent can do the exact same read and reach the exact same conclusion. Both checked. Both saw 0041. Both wrote 0042. The check passed for both because nothing changed the folder during either agent’s glance; the change happened in the gap between glancing and writing. This is a race condition, of the kind usually called check-then-act or time-of-check to time-of-use: the check is only safe if nothing can act in the gap, and with parallel writers, something always can.
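The gap can be reproduced on purpose. The sketch below is a simplification: two threads share one in-memory list standing in for the migrations folder, and a `threading.Barrier` forces the unlucky timing (both reads finish before either write). The function names and the two older file names are hypothetical. In a real fleet nothing forces the timing, and it just happens now and then.

```python
import threading


def next_number(folder):
    """The 'check': read the highest migration number and add one."""
    return max(int(name[:4]) for name in folder) + 1


def run_race():
    # Hypothetical shared state; a list stands in for the migrations folder.
    folder = ["0040_add_users.sql", "0041_add_sessions.sql"]
    both_have_read = threading.Barrier(2)
    written = []

    def agent(feature):
        number = next_number(folder)   # check: both agents see 0041, compute 42
        both_have_read.wait()          # the gap: neither has written yet
        name = f"{number:04d}_{feature}.sql"
        folder.append(name)            # act: both write 0042
        written.append(name)

    threads = [
        threading.Thread(target=agent, args=(feature,))
        for feature in ("add_invoices", "add_audit_log")
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(written)


print(run_race())
# ['0042_add_audit_log.sql', '0042_add_invoices.sql']
```

Adding a second `next_number` call before the write would not change the result, because the barrier could just as well sit after the second read.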

You cannot fix a race by checking harder. A second look is just one more read, and a read can’t reserve anything. Two agents politely checking the same unlocked door both walk through it. The flaw isn’t that they didn’t look. It’s that looking doesn’t claim, and without a claim, two careful agents collide exactly as reliably as two careless ones.

So the real fix has to do the one thing a check can’t: make the act of taking the slot atomic, meaning indivisible, so that between deciding to take it and having taken it, no one else can slip in. The agents don’t need to be more careful. They need a referee that can only hand the slot to one of them.

Our way: a claim, not a check

The discipline we settled on is small, and it’s the kind of thing that sounds obvious the moment you say it and is invisible until something burns you. Before an agent may touch a shared, single-slot resource, it has to claim it, and the claim is atomic, so only one claim can win.

Concretely: the next migration number isn’t something an agent reads off the folder and hopes is still true. It’s something an agent requests, and the request either grants it that number, uniquely, or tells it someone else just took it and hands it the next one. Two agents asking at the same instant don’t both get 0042. One gets 0042, the other gets 0043, and neither ever saw a free slot it wasn’t actually given. The read-then-write race is gone because there is no read-then-write; there’s a single indivisible “give me the next one.”
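One way to get a claim like that, sketched here as an assumption rather than as a description of our implementation, is exclusive file creation. `os.open` with `O_CREAT | O_EXCL` creates a file only if it does not already exist, and the existence test and the creation happen as one step. The `claim_next` helper, the `.claim` file naming, and the shared claims directory are all hypothetical, and the guarantee assumes a local filesystem; it may not hold on some network filesystems.

```python
import os
import tempfile
import threading


def claim_next(claims_dir, first=1):
    """Atomically claim the lowest unclaimed number at or above `first`."""
    number = first
    while True:
        path = os.path.join(claims_dir, f"{number:04d}.claim")
        try:
            # Create-if-absent in one indivisible step: exactly one caller
            # can succeed for any given number.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            number += 1  # someone else holds this number; ask for the next
            continue
        os.close(fd)
        return number


def demo(agents=8):
    """Eight agents ask at once; each gets a different number."""
    with tempfile.TemporaryDirectory() as claims_dir:
        results = []

        def agent():
            results.append(claim_next(claims_dir, first=42))

        threads = [threading.Thread(target=agent) for _ in range(agents)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return sorted(results)


print(demo())  # [42, 43, 44, 45, 46, 47, 48, 49]
```

There is still a loop here, but it is not a check followed by a write. Each attempt is itself the write, and losing an attempt costs a retry instead of a collision.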

The mental model that made this click for us: stop treating the agents like editors of a shared document and start treating them like callers to a single open line. Two people can read the same number off a wall. Only one can be handed the next ticket from a dispenser, because the dispenser advances the instant it gives one out. The shared resource needs a dispenser, not a wall.

Figure: a check-then-write race lets two agents both read slot 0042 and both write it; an atomic claim sends both through one dispenser that hands 0042 to one and 0043 to the other.

And the rule generalizes past migrations, which is why it earned a place in how we build rather than a one-off patch. Any single-slot shared resource is a collision waiting for two parallel writers: a sequence number, a lock file, a “next available port,” a singleton config the whole fleet reads, a queue with one head. The question to ask of every parallel job isn’t “are these tasks independent?” Tasks lie about that. It’s “does anything they both touch have exactly one slot?” Wherever the answer is yes, a check won’t save you and a claim will.

The fix for a collision is never a better look. It’s a referee that can only say yes once.

What this discipline actually costs

Here is the honest tradeoff, because the discipline isn’t free, and pretending it is would be the same self-graded optimism that caused the bug.

A claim is a small piece of coordination, and coordination is the thing parallelism was supposed to avoid. Every atomic claim is a moment where agents briefly line up single file at the dispenser instead of running fully independently. If you put a claim around everything, you’ve quietly rebuilt the one-at-a-time bottleneck you fanned out to escape, just with extra ceremony.

So the cost is real, and it sets the boundary on where this belongs. You claim the genuinely shared single slots (the migration number, the lock, the singleton) and you leave everything else wide open. The settings-page agent and the billing agent still never wait on each other, because they share nothing with one slot. The discipline isn’t “coordinate everything.” It’s “find the few places contention actually lives, and put a referee on exactly those.” A typical feature branch touches zero of them; the ones that do are the ones that would have silently corrupted the merge.

That targeting is the whole art. Coordinate too much and you’ve lost the speed. Coordinate too little and you ship a database that won’t boot. The skill is knowing which slots are dispensers in disguise, and that’s a question you answer once per resource, not once per task.

The turn

A database that won’t boot is a dramatic way to learn an undramatic lesson, so here’s the lesson without the drama.

When you put many capable workers on one shared thing, the work they do alone is almost never where it breaks. The breakage hides in the handoffs and the shared slots, the parts no one owns because everyone assumed someone else was watching. This is not an AI problem and it is not a new one. It’s the oldest problem in any team that grew past one person: the calendar everyone edits, the spreadsheet two people opened at once, the one parking space with two cars idling toward it. We just hit the agent-scale version of it, where the team is large, fast, and tireless, and the collisions arrive faster than a human could referee them by hand.

Which is the actual point. Parallelism doesn’t fail loud, and it never will. It fails in the one slot two writers both reached for, quietly, after everyone’s tests are green. You don’t make a fleet trustworthy by making each worker smarter; a smarter agent still can’t see the branch it isn’t on. You make it trustworthy by being honest about what they share, and putting a referee on exactly those seams. The intelligence lives in the workers. The reliability lives in the rules about what they’re allowed to touch at the same time. Get those rules right and parallelism stops being a gamble and starts being plain speed.


That’s what we’re building at Apollo Space: an operating system where many agents move at once without quietly overwriting each other. It is fast because they’re independent, and safe because we were honest about the few places they aren’t. The hard part of running a fleet was never teaching it to work. It was teaching it where two hands must not reach for the same thing at once, and we’d rather learn that on a migration number than on something you can’t undo.
