Engineering

An orchestrator that can't survive its own crash isn't one

A crash that erases the orchestrator's reasoning loses the one thing you can't rebuild.

Apollo Space Research

Apollo Space

An orchestrator is forty minutes into coordinating six workers. It has read the spec, split the work, dispatched five of the six, decided the sixth has to wait on the third, and is holding in its head a hundred small judgments nobody wrote down: which worker is trustworthy, which result looked off, what it was about to do next. Then the process dies. Out of memory, a dropped connection, a host reboot: the cause barely matters. What matters is what comes back. The workers are still running, still writing files, still spending money. The orchestrator is gone, and so is every reason it had for everything it set in motion.

The workers were never the fragile part. The orchestrator’s reasoning was.

This post is about why that one thing is the hardest to protect, why the obvious ways of protecting it quietly fail, and what we do instead so that a dead supervisor is a recoverable event and not a small disaster.

The orchestrator holds something the workers don’t

Start with the shape of the problem, because it’s not where people look.

In a fleet, the workers are the visible, expensive, dangerous part. They write code, hit databases, call out to the world. So that’s where the engineering attention goes: sandbox the worker, cap its spend, bound its blast radius. All correct. All necessary. And all of it leaves the actual fragile thing completely exposed.

The orchestrator doesn’t write much. Its whole job is judgment, the running model of what is happening and why. It knows that worker four came back with a result that smelled wrong and got sent to re-check. It knows worker two is blocked on worker five and must not be told it’s done. It knows the user asked for three things and only two are in flight. None of that is in a file. It’s in the orchestrator’s working state, the way the plan of a meeting lives in the head of the person running it.

The workers can be restarted. The reasoning that decided to start them cannot.

Lose a worker and you lose some compute. You re-dispatch it; the rest of the fleet doesn’t notice. Lose the orchestrator and you lose the only thing that knew the workers were for something. They keep running (that’s the cruel part), disconnected from the intent that justified them, spending real money on a plan no surviving process remembers. A crash that erases the orchestrator’s reasoning loses the one thing you can’t rebuild.

Figure: Two failures compared. A crashed worker drops only its own compute while the fleet continues, but a crashed orchestrator orphans every still-running worker because the reasoning that justified them is gone.

So the question isn’t how to make the orchestrator never crash. Everything crashes. The question is how a new orchestrator, started cold on a different host, recovers enough of the dead one’s mind to take the wheel without re-deciding everything from zero, and without two of them grabbing the wheel at once.

The naive fix: just restart it

The obvious answer is the one every supervisor ships with. The orchestrator dies, a watchdog notices, a fresh one starts. Process supervision solved this decades ago. Why would agents be different?

Because restarting the process is not restarting the reasoning.

The new orchestrator boots with an empty head. It doesn’t know six workers are mid-flight. It doesn’t know worker four was already flagged as suspect, so it either ignores a result it should have re-checked or re-checks one that was already cleared. It doesn’t know worker two is waiting, so it might tell it to proceed into a dependency that isn’t ready. Worst of all, it doesn’t know the old workers exist at all, so it may dispatch a second set against the same tasks, and now twelve workers are running where six were planned, half of them stepping on the other half’s files.

A restart with no memory isn’t recovery. It’s a stranger walking into a control room mid-launch and starting to flip switches.

The pain is sharpest exactly when you can least afford it: under load, deep into a long run, with the most state in flight. The afternoon demo works because nothing has crashed yet. The 2 a.m. overnight job is where the supervisor dies forty minutes in and the bare restart turns one failure into a pile-up. Anyone who has run a fleet through a real night knows this failure shape: the orchestrator didn’t crash because the work was hard; it crashed because the run was long, and “long” is precisely when losing its mind costs the most.

So a bare restart is the floor. It keeps the service alive. It does nothing for the reasoning, and the reasoning was the whole point.

Our way: checkpoint the reasoning, not just the result

Here’s the move, and the key idea is simple: the orchestrator checkpoints itself the same way it checkpoints its workers.

Most fleets already persist worker output: the artifact, the diff, the result. We persist the layer above it. As the orchestrator reasons, it writes a running heartbeat of its own state to durable storage outside its process: the plan it’s executing, which workers it dispatched and why, what each worker is for, which results it accepted or flagged, and what it was about to do next. Not a transcript dump. A structured account of the decisions, current as of seconds ago.
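To make “a structured account of the decisions” concrete, here is a minimal sketch of what such a checkpoint could look like. Everything in it is an illustrative assumption rather than our actual schema: the field names, the `Checkpoint` and `WorkerRecord` types, and the choice of a JSON file as the durable store. The one real technique on display is the write-to-temp-then-rename pattern, so a reader never sees a half-written checkpoint.

```python
import json
import os
import tempfile
import time
from dataclasses import asdict, dataclass, field


@dataclass
class WorkerRecord:
    # Hypothetical fields, chosen to mirror the list in the paragraph above.
    worker_id: str
    purpose: str                      # what this worker is for
    reason: str                       # why it was dispatched
    verdict: str = "pending"          # "pending", "accepted" or "flagged"
    blocked_on: list = field(default_factory=list)


@dataclass
class Checkpoint:
    run_id: str
    plan: list                        # the steps the orchestrator is executing
    workers: dict                     # worker_id -> WorkerRecord
    next_action: str                  # what it was about to do next
    written_at: float = 0.0


def write_checkpoint(cp: Checkpoint, path: str) -> None:
    """Persist the checkpoint so that readers see the old file or the new one, never a partial one."""
    cp.written_at = time.time()
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(asdict(cp), f)  # asdict also converts the nested WorkerRecords
            f.flush()
            os.fsync(f.fileno())      # push the bytes to disk before the rename
        os.replace(tmp_path, path)    # atomic when both paths are on the same filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
```

Calling `write_checkpoint` after every meaningful decision, not just at the end of the run, is what keeps the account “current as of seconds ago”. A local file only survives a process crash; surviving a host loss would need storage that lives off the host.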

The discipline is one sentence. The orchestrator is the one process in the system that is never allowed to keep its reasoning only in its own head.

When it dies, a fresh orchestrator doesn’t boot blind. It reads the last checkpoint and reconstructs the dead one’s mind: these six workers are live, worker four is suspect, worker two is blocked, the user is owed a third thing not yet started. It adopts the running fleet instead of orphaning it. The work in flight keeps its meaning, because the meaning was written down a beat before the crash. A crash that erases the orchestrator’s reasoning loses the one thing you can’t rebuild, so we make sure the crash can’t erase it.
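As a rough illustration of that reconstruction step, the sketch below turns a loaded checkpoint into the four facts from the paragraph above: which workers to adopt, which result to re-check, which worker to hold, and what the user is still owed. The checkpoint layout, the `plan_adoption` function, and the example data are all hypothetical, invented for this example.

```python
def plan_adoption(checkpoint: dict) -> dict:
    """Derive what a successor should do from the dead orchestrator's last checkpoint."""
    workers = checkpoint["workers"]
    covered = {rec["deliverable"] for rec in workers.values()}
    return {
        # Every recorded worker is already live: adopt it, never dispatch a duplicate.
        "adopt": sorted(workers),
        # Results the old orchestrator flagged still need their re-check.
        "recheck": sorted(w for w, rec in workers.items() if rec["verdict"] == "flagged"),
        # Workers waiting on a dependency must not be told to proceed.
        "hold": sorted(w for w, rec in workers.items() if rec["blocked_on"]),
        # Anything the user asked for that no worker covers has not been started.
        "not_started": [d for d in checkpoint["requested"] if d not in covered],
    }


# Hypothetical checkpoint: six live workers, w4 suspect, w2 blocked on w5,
# and a third deliverable that nobody has picked up yet.
EXAMPLE = {
    "requested": ["api", "docs", "migration"],
    "workers": {
        "w1": {"deliverable": "api", "verdict": "accepted", "blocked_on": []},
        "w2": {"deliverable": "api", "verdict": "pending", "blocked_on": ["w5"]},
        "w3": {"deliverable": "api", "verdict": "pending", "blocked_on": []},
        "w4": {"deliverable": "docs", "verdict": "flagged", "blocked_on": []},
        "w5": {"deliverable": "api", "verdict": "pending", "blocked_on": []},
        "w6": {"deliverable": "docs", "verdict": "pending", "blocked_on": []},
    },
}
```

The point of the sketch is that none of these answers are recomputed from scratch. They are read back from decisions the previous orchestrator already made and wrote down.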

This is the same lesson the workers already taught us, applied one level up. We never trusted a worker’s “done” that lived only in the worker’s own report; we wrote it down where a colder process could re-read it. The orchestrator gets the same treatment. Its judgment is too valuable to live anywhere a crash can reach.

Figure: A loop where the orchestrator continuously writes its plan, dispatches, and flags to a durable checkpoint outside its process, so a fresh orchestrator can read that checkpoint and adopt the still-running fleet instead of starting blind.

The part everyone forgets: two orchestrators is worse than none

There’s a trap waiting on the far side of recovery, and it’s the one that turns a clever fix into a fresh disaster.

If a fresh orchestrator can adopt a running fleet, what stops two of them from adopting it at once? The watchdog starts a replacement, but the original wasn’t dead, just slow, frozen on a stuck network call for ninety seconds. Now there are two orchestrators, both convinced they own the same six workers, both issuing instructions. They re-dispatch the same tasks. They contradict each other. The fleet gets two captains and obeys both. This is worse than a clean crash, because a clean crash at least leaves one coherent story; a split brain leaves two, fighting.

The naive instinct is to make recovery faster and hope the overlap window shrinks to nothing. It never shrinks to nothing. Hope is not a concurrency primitive.

What actually closes the gap is a single rule: adopting a fleet requires winning an atomic claim, and exactly one process can win. Before a fresh orchestrator touches a single worker, it has to take an exclusive lock on that run. The first one in gets it. The second one (the frozen original thawing out, or a second watchdog firing) asks for the same lock, is told no, taken, and stands down. Two orchestrators race; one wins; the loser doesn’t get a consolation fleet. The lock is what makes “adopt the running work” safe instead of reckless.
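One way to get an atomic claim, sketched here under simplifying assumptions, is exclusive file creation: opening a file with `O_CREAT | O_EXCL` either creates it or fails because it already exists, with no window in between. The `try_claim` and `race` helpers and the lock-file naming are hypothetical. The sketch assumes a local filesystem that every candidate orchestrator can see, and it does not handle releasing a stale claim, which a real system would need (for example, a lease with an expiry in a database or coordination service).

```python
import os
import tempfile
import threading


def try_claim(run_id: str, owner: str, lock_dir: str) -> bool:
    """Return True for the single caller that manages to create the lock file."""
    path = os.path.join(lock_dir, f"{run_id}.lock")
    try:
        # O_CREAT | O_EXCL: create the file, or fail if it is already there.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return False                  # someone else owns this run: stand down
    with os.fdopen(fd, "w") as f:
        f.write(owner)                # record the holder for humans and logs
    return True


def race(run_id: str, owners: list, lock_dir: str) -> list:
    """Let several would-be orchestrators claim the same run at once; return the winners."""
    winners = []

    def attempt(owner: str) -> None:
        if try_claim(run_id, owner, lock_dir):
            winners.append(owner)

    threads = [threading.Thread(target=attempt, args=(o,)) for o in owners]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners
```

However many candidates call `race`, the returned list has exactly one entry. Which candidate wins depends on timing; that exactly one wins does not.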

Recovery without a lock isn’t recovery. It’s a second captain on a ship that already has one.

That ordering matters: claim first, then adopt, then act. Win the lock to earn the right to act, rebuild the picture of the fleet from the checkpoint, and only then issue an instruction. Skip the lock and a beautifully reconstructed mind just becomes the second voice arguing over the same fleet.
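The ordering can be enforced in code instead of left to convention. The sketch below is a toy, in-memory model with hypothetical names (`RunStore`, `Orchestrator`, `NotOwner`): the store stands in for durable storage, and a mutex stands in for whatever makes the real claim atomic. What it shows is the shape of the guard: no adoption without the claim, and no instruction without both.

```python
import threading


class NotOwner(Exception):
    """Raised when a process tries to act on a run it has not claimed."""


class RunStore:
    """In-memory stand-in for durable storage: one checkpoint and one claim per run."""

    def __init__(self, checkpoint: dict):
        self.checkpoint = checkpoint
        self.owner = None
        self._mutex = threading.Lock()

    def try_claim(self, who: str) -> bool:
        with self._mutex:             # makes the check-and-set a single atomic step
            if self.owner is None:
                self.owner = who
                return True
            return False


class Orchestrator:
    def __init__(self, name: str, store: RunStore):
        self.name = name
        self.store = store
        self.fleet = None             # stays None until the fleet has been adopted
        self.sent = []                # instructions issued, kept for inspection

    def take_over(self) -> bool:
        # 1. Claim: if this fails, touch nothing and stand down.
        if not self.store.try_claim(self.name):
            return False
        # 2. Adopt: rebuild the picture of the fleet from the checkpoint.
        self.fleet = dict(self.store.checkpoint["workers"])
        return True

    def instruct(self, worker_id: str, message: str) -> None:
        # 3. Act: only the claim holder with an adopted fleet may issue instructions.
        if self.store.owner != self.name or self.fleet is None:
            raise NotOwner(f"{self.name} does not own this run")
        if worker_id not in self.fleet:
            raise KeyError(worker_id)
        self.sent.append((worker_id, message))
```

With this in place, a thawed original that lost the race gets `False` from `take_over()` and a `NotOwner` error if it tries to instruct a worker anyway, so the second captain never reaches the fleet.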

What this costs, and why we pay it

None of this is free, and pretending otherwise would be the kind of “looks done” we don’t trust.

Checkpointing the orchestrator’s reasoning costs writes: a steady trickle of state to durable storage on every meaningful decision, not just at the end. It costs the discipline of treating the checkpoint as a real artifact, structured enough that a cold process can actually read it, not a log nobody parses. And the claim lock costs a round-trip before anyone can act, plus the care to make it genuinely atomic, because a lock with a race in it is worse than no lock: it gives you false confidence while the split brain forms anyway.

We pay all of it on purpose, because the thing on the other side of the ledger is the expensive one. The cost of a lost orchestrator was never the orchestrator. It was the fleet it abandoned: workers running on a plan no one remembers, money spent toward an intent that died with the process, a long overnight run that has to start over from the top because the supervisor forgot why it began. Catching that with a checkpoint and a lock is cheap next to discovering it in the morning.

When the human walks in

It’s 2 a.m. and the supervisor died forty minutes into a six-hour run. Picture the two mornings.

In the first, the on-call person wakes to an alert, opens the laptop, and starts the archaeology: which workers were running, what were they doing, did any of them finish, what was the plan, why did it start, can any of it be salvaged or does the whole night go in the bin. An hour of reconstruction to recover a mind that was fully intact ninety minutes ago and simply wasn’t written down. They are doing, by hand and half-awake, the exact job the checkpoint exists to do.

In the second morning, there is no alert worth waking for. A fresh orchestrator read the last checkpoint, took the lock, adopted the six workers mid-stride, and carried the run to its end. The crash happened. It just wasn’t an event; it was a blip the system absorbed, the way a good team absorbs one person going home sick without the project stopping.

That second morning is the entire point. Not that nothing breaks (things break), but that when the coordinator breaks, no person has to become the coordinator at 2 a.m. to keep the work alive. The reasoning was never trapped in one fragile process, so no human has to rebuild it from a cold log. The system remembers why it was doing what it was doing, even across its own death, so the people who depend on it can keep sleeping.

That last part is the one you can’t buy with more reliable hardware. A machine that almost never crashes still forgets everything the instant it does. What earns trust is a system that assumes it will crash and arranges its mind so the crash can’t take the reasoning down with it.


That’s what we’re building at Apollo Space: an operating system whose coordinator treats its own death as a normal event, checkpoints its reasoning before it can be lost, and hands the running fleet to its successor without a person in the loop. If your worst on-call memories are the nights you woke up and had to become the thing that crashed, that’s the job we think the system should hold, so you don’t have to.
