The heartbeat is the contract
A long-running agent doesn't prove it's alive by finishing. It proves it by leaving a trail you can resume from: the checkpoint it writes before it crashes is what lets the work survive the crash.
Apollo Space Research
An agent starts a four-hour job and goes quiet. Two hours in, the box it runs on reboots. When it comes back, the question isn’t “did it finish” but “how much of those two hours still exists.” In a lot of systems, the answer is none. The work was alive in memory, the memory is gone, and the only proof the agent ever ran is a log line that stops mid-sentence.
That’s the failure we kept hitting, and it taught us something blunt. A long-running agent doesn’t prove it’s alive by finishing. It proves it by leaving a trail you can resume from.
Most of the engineering that makes agents trustworthy isn’t about making them smarter. It’s about making them legible while they run, and recoverable after they don’t. This post is about the smallest piece of that, the heartbeat, and why we treat it not as a status light but as a contract the agent signs with its future self.
Silence is the one signal you can’t read
Here’s the naive way, and it’s the way almost everyone starts.
You launch the agent. You wait. While it runs, it tells you nothing, because it’s busy doing the work, and telling you about the work would just slow it down. When it’s done, it reports. If it never reports, you assume it’s still going. So you wait longer.
This breaks the instant something goes wrong, and it breaks in the cruelest possible way: it looks identical to success. An agent that is grinding through a hard task and an agent that died forty minutes ago produce the exact same output, which is nothing. Silence means “working.” Silence also means “dead.” You cannot tell which from the outside, so you guess, and you guess wrong at the worst times: you kill a job that was about to finish, or you wait an hour on a process that crashed in the first minute.
The bottleneck never disappears. It just moves from “is the work done” to “is the worker alive,” and you have no instrument for the second question.
A human team solved this without thinking about it. You don’t ask a teammate every ten minutes whether they’re alive. They just are: they move, they talk, they show up. Liveness, for a person, is free and continuous. For a process running unattended on a machine you can’t see, liveness is the thing you have to manufacture. Nothing is free. If you want to know the worker is alive, the worker has to say so, on a rhythm, in a place you can read without interrupting it.
So that’s the first move. Stop treating silence as information. Make the agent emit a pulse.
The heartbeat: a small, regular promise
The heartbeat is the simplest reliability primitive there is, and it’s the one most agent setups skip. On a fixed interval, say once a minute, the running agent writes a tiny record to a place outside its own memory: I am still here, this is the step I’m on, this is when I last spoke. That’s the whole thing. A timestamp and a position.
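As a minimal sketch of that record, assuming the "place outside its own memory" is a JSON file on disk: the file name, the field names `step` and `ts`, and the helper names below are all illustrative choices, not something the post prescribes. The write goes through a temporary file and a rename so that a reader never sees a half-written beat.

```python
import json
import os
import tempfile
import time


def write_heartbeat(path, step, now=None):
    """Record 'still here, on this step, at this time' outside process memory.

    `path`, `step`, and the JSON field names are hypothetical; any durable
    store the watcher can read would serve the same purpose.
    """
    record = {"step": step, "ts": time.time() if now is None else now}
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the target,
    # so the heartbeat file is always either the old beat or the new one.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(record, f)
    os.replace(tmp, path)
    return record


def read_heartbeat(path):
    """Read the most recent beat; this is all the watcher ever needs to do."""
    with open(path) as f:
        return json.load(f)
```

In a real agent something would call `write_heartbeat` on the fixed interval (a timer or a background thread, for instance); that scheduling is left out here to keep the sketch to the record itself.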
What makes it work isn’t the record. It’s the rhythm. A single “I’m alive” message tells you nothing an hour later. A heartbeat every minute tells you something precise: if the last one is two minutes old, the agent is fine. If it’s twenty minutes old, the agent is gone, and you know roughly when it left. Liveness stops being a guess and becomes a subtraction: now minus last beat. Under a threshold, alive. Over it, dead, and recoverable.
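The subtraction can be written down directly. This is a sketch under one assumption the post doesn't state: a threshold of three minutes, chosen only so that the two-minute example counts as alive and the twenty-minute example counts as dead. The function names are hypothetical.

```python
def beat_age(last_beat_ts, now):
    """Seconds since the agent last spoke: now minus last beat."""
    return now - last_beat_ts


def is_alive(last_beat_ts, now, threshold_s=180.0):
    """Alive if the last beat is no older than the threshold.

    threshold_s=180.0 is an assumed value; in practice it would be tuned
    to the heartbeat interval and the tolerance for missed beats.
    """
    return beat_age(last_beat_ts, now) <= threshold_s
```

With a one-minute interval, a three-minute threshold tolerates a couple of missed beats before declaring death; a tighter threshold detects crashes sooner at the price of more false alarms.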
Notice what this quietly changes. The watcher no longer has to interrupt the worker to ask how it’s doing. It reads the heartbeat. The worker no longer has to choose between making progress and reporting progress: the report is a single cheap write, off to the side, that doesn’t touch the work. The two concerns come apart. The agent does the job; the heartbeat narrates that the job is being done; nobody has to poke anybody.
There’s a subtlety worth naming, because it’s where the naive heartbeat fails too. A heartbeat that only says “alive” is barely better than silence. If the agent has been beating happily for an hour but stuck on the same step the whole time, it’s alive and useless, a process spinning in place, faithfully reporting that it’s spinning. So the heartbeat carries position, not just pulse. Alive, on step seven, last advanced four minutes ago. Now a stalled agent and a working one look different again, and the watcher can act on the difference: a heartbeat that beats but never advances is a thing you restart, not a thing you keep waiting on.
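One way to make that difference visible is to carry two timestamps in the beat: when the agent last spoke and when its step last changed. The sketch below assumes that shape; the `Heartbeat` fields, the state names, and both thresholds are illustrative, not taken from the post.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    step: int            # position: which step the agent is on
    beat_ts: float       # pulse: when the agent last spoke
    advanced_ts: float   # when `step` last changed


def classify(hb, now, dead_after_s=180.0, stalled_after_s=600.0):
    """Separate a dead agent, a stalled one, and a working one.

    Both thresholds are assumed values. A sensible stall threshold depends
    on how long a single step is allowed to take.
    """
    if now - hb.beat_ts > dead_after_s:
        return "dead"      # no pulse: recover from the last durable point
    if now - hb.advanced_ts > stalled_after_s:
        return "stalled"   # pulse but no progress: restart, don't wait
    return "working"
```

The order of the checks matters: a dead agent has also, trivially, stopped advancing, so death is tested first and "stalled" is reserved for an agent that is still beating.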
That’s liveness handled. But liveness alone doesn’t save the work. Knowing precisely when an agent died doesn’t un-kill it. For the work to survive, the heartbeat has to carry one more thing.
The checkpoint is the part that survives the crash
Here’s the idea the whole post orbits, stated plainly: the checkpoint an agent writes before it crashes is what lets the work survive the crash. Everything alive in memory dies with the process. Only what was written down outside the process comes back.
The naive version assumes the crash won’t happen. The agent holds its progress in memory (the plan, the steps done, the partial result, the context it built up over two hours) and intends to write all of that down at the end, when it reports. This is the same mistake as a writer who never saves until the document is finished. It’s fine right up until the power flickers, and then two hours of irreplaceable state evaporates because it lived in exactly one place, and that place was volatile.
The pain is specific and it compounds. It’s not just the lost time. It’s that a long agent run accumulates expensive state: decisions made, dead ends ruled out, intermediate results that cost real work to produce. Lose that and the restart doesn’t resume; it begins again, from zero, repeating every dead end the first run already walked. The second attempt is as slow and as fragile as the first, so it’s just as likely to die before finishing. You can loop on that forever and never converge.
The fix is to invert the assumption. Assume the crash will happen, at the worst possible moment, and design so that it costs almost nothing when it does. Concretely: every time the agent finishes a meaningful unit of work, it writes a checkpoint, a durable snapshot of where it is and what it’s produced, to the same outside place the heartbeat lives. Not at the end. Along the way. The checkpoint and the heartbeat are the same habit at two grains: the heartbeat says “I’m alive and here”; the checkpoint says “and here is everything you’d need to continue if I’m not.”
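A rough sketch of checkpointing along the way, assuming the work can be expressed as an ordered list of named units and the state fits in JSON. The names `save_checkpoint` and `run_with_checkpoints`, the state layout, and the choice of a local file are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Durably replace the checkpoint file with a snapshot of `state`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    os.replace(tmp, path)     # swap in the new snapshot in one step


def run_with_checkpoints(units, path):
    """Run (name, fn) units in order, checkpointing after each one.

    `units` is a hypothetical list of (name, zero-argument callable) pairs
    whose results are JSON-serializable.
    """
    state = {"completed": [], "results": {}}
    for name, fn in units:
        state["results"][name] = fn()
        state["completed"].append(name)
        save_checkpoint(path, state)  # along the way, not at the end
    return state
```

If a unit raises partway through the run, the file still holds everything finished before it, which is the property the rest of the argument depends on.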
Now the reboot story is completely different. The agent dies two hours in. A watcher notices the heartbeat went stale, declares it dead, and starts a fresh agent, but the fresh one doesn’t start from zero. It reads the last checkpoint, sees the plan and the steps already completed and the partial result, and picks up from the last durable point. The crash cost one unit of work, the bit that was in flight when the lights went out. Not two hours. One step.
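The resume side might look like the following sketch. It assumes the checkpoint is a JSON file listing completed unit names and their results; that layout, and the names `load_checkpoint` and `resume`, are hypothetical. A fresh agent loads the file, skips whatever is already durable, and redoes only the unit that was in flight.

```python
import json
import os


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if nothing was written."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"completed": [], "results": {}}


def resume(units, path):
    """Run (name, fn) units, skipping any the checkpoint says are done."""
    state = load_checkpoint(path)
    for name, fn in units:
        if name in state["completed"]:
            continue  # already durable; do not redo it
        state["results"][name] = fn()
        state["completed"].append(name)
        # Write-then-rename keeps the previous checkpoint intact if we die here.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)
    return state
```

Calling `resume` a second time after a crash repeats only the interrupted unit, so the cost of the crash is bounded by the size of one unit. That also means each unit has to be safe to run again from its start.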
This is the whole game. The measure of a long-running agent isn’t how rarely it crashes. It will crash: the machine will reboot, the network will drop, the process will get killed by something it can’t control. The measure is how cheap its crashes are. And a crash is cheap exactly to the degree that the agent wrote things down before it happened.
Why “the heartbeat is the contract”
Put the two halves together and the phrase stops being a slogan.
A heartbeat without a checkpoint tells you precisely when your work died and gives you nothing to recover. A checkpoint without a heartbeat saves your work but never tells you it’s time to recover it; you sit waiting on a dead agent because nothing said it was dead. You need both, beating on the same rhythm, written to the same durable place. Together they form a promise the agent makes continuously: I am alive, I am here, and if I’m not, this is where you continue. That promise is a contract. The heartbeat is the agent signing it, once a minute, in ink that survives the agent.
And like any contract, it’s only worth anything if it’s enforced by someone other than the party who signed it. The agent can’t be trusted to declare its own death; a dead agent declares nothing. So the heartbeat is read by something outside the agent: a watcher whose entire job is to subtract, notice the stale pulse, and act. The contract is between the agent and the watcher. The agent promises to leave a trail; the watcher promises to read it and resume from it. Neither half works alone, which is exactly why a single self-contained agent, however capable, can’t be the thing you trust with a four-hour job. Trust lives in the structure, not in the worker.
This is also why we don’t bolt the heartbeat on at the end as a monitoring afterthought. It’s load-bearing. The decision to make the work resumable changes how the work is shaped, into units small enough to checkpoint between, with state explicit enough to write down. An agent designed to be recoverable is, almost as a side effect, an agent whose progress is legible the whole way through. You can watch it advance, step by step, because it was built to leave the trail. The reliability and the visibility turn out to be the same property, approached from two sides.
The turn: what it means to trust a worker you can’t see
Strip away the timestamps and the snapshots and what’s left is an old question, the one every manager of remote, asynchronous work eventually faces: how do you trust someone doing important work where you can’t watch them do it?
The bad answer is to demand they be online, to ping them constantly, to mistake their silence for slacking and their presence for progress. That answer scales terribly with people and not at all with software. The good answer is the one good distributed teams already learned. You don’t trust presence; you trust the trail. You don’t ask “are you working right now?”; you arrange things so the work itself reports, so that progress is written down as it happens, in a place anyone can read, and so that if a person disappears, the work doesn’t disappear with them. The teammate who leaves a clean trail is the one you can hand the four-hour job to and walk away.
That’s the standard we’re holding agents to, because it’s the only one that lets you actually leave them alone. An agent you have to babysit isn’t saving you anything; you’ve just traded doing the work for watching the work. The whole promise of handing a long job to software is that you get to stop thinking about it. You only get to stop thinking about it if the work can survive everything you’re not there to catch: the reboot, the crash, the silent two hours. It survives because it left a trail. The heartbeat is how it keeps leaving one.
We build Apollo so that every agent running an unattended job beats while it works and checkpoints as it goes, so the question stops being “is it still alive” and becomes “where did it leave off.” That’s the difference between a tool you have to supervise and a coworker you can trust with the night shift. If you’ve ever lost real work to a process that died in silence, you already know which one you’d rather hire.