Retries are a story you tell yourself about reliability

A blind retry doesn’t make an action reliable; it makes it happen twice. Real reliability is an idempotency key plus a check that the effect landed, not re-running until no exception comes back.

Apollo Space Research

An agent is asked to send one invoice. It calls the tool, the network hiccups on the way back, and the agent never hears the word “done.” So it does the sensible-looking thing: it tries again. The second call succeeds. The agent reports success. The customer receives two invoices.

Nobody wrote a bug. Every line did exactly what it was told. The retry, the move we reach for to make a system more reliable, is the thing that just billed someone twice.

A retry doesn’t make an action reliable. It makes it happen again. Whether “again” is safe or catastrophic depends entirely on a property the retry knows nothing about.

This post is about that property, why agents make the problem sharper than any system before them, and the two-part discipline that turns a retry from a loaded gun into an actual reliability tool.

The retry that lies to you

Here’s the move everyone writes first, because it looks like obvious good engineering. Wrap the risky call in a loop. If it throws, sleep a moment and try again. Give up after a few attempts. Ship it.

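import time

# The naive retry loop: any exception is read as "the invoice was not sent".
# send_invoice and customer stand in for the real tool call and its argument.
backoff = 1.0  # seconds to wait before the next attempt
attempt = 0
while attempt < 3:
    try:
        send_invoice(customer)
        break
    except Exception:
        attempt += 1
        time.sleep(backoff)
        backoff *= 2  # wait longer after each failure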

For a long time this works, and that’s the trap: it works for exactly the operations where repetition doesn’t matter. Reading a record. Fetching a price. Asking the weather. Run those twice and the worst you’ve done is waste a little time. The loop looks like a reliability pattern because the calls it first met were harmless to repeat.

Then someone wraps a write in the same loop, and the loop keeps its promise the only way it knows how: by doing the thing again.

The lie is hidden in the word “failed.” When send_invoice throws, the loop concludes the invoice was not sent. But an exception on your side doesn’t mean nothing happened on the other side. The request can land, the work can complete, the invoice can go out, and the acknowledgment can die on the way home. To the loop, a lost reply and a real failure look identical. So it retries a thing that already succeeded, and the customer pays for the difference.
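To make the lost-reply case concrete, here is a minimal sketch. `FlakyInvoiceService`, `naive_send`, and the `"acme"` customer are hypothetical stand-ins invented for illustration: the service performs the send on every call but loses the acknowledgment on the first one.

```python
class FlakyInvoiceService:
    """Hypothetical service: the first call does the work, then loses the reply."""

    def __init__(self):
        self.sent = []   # what actually happened on the far side
        self.calls = 0

    def send_invoice(self, customer):
        self.calls += 1
        self.sent.append(customer)  # the effect lands on every call
        if self.calls == 1:
            raise TimeoutError("acknowledgment lost in transit")
        return "ok"


def naive_send(service, customer, max_attempts=3):
    """The naive loop: any exception is read as 'nothing happened'."""
    for _ in range(max_attempts):
        try:
            return service.send_invoice(customer)
        except Exception:
            continue  # backoff omitted to keep the sketch short
    raise RuntimeError("gave up")


service = FlakyInvoiceService()
result = naive_send(service, "acme")
print(result, len(service.sent))  # prints "ok 2": one reported success, two invoices
```

From the caller's side the run looks clean: one exception, one retry, one success. Only the far side's `sent` list shows the duplicate.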

A retry doesn’t make an action reliable. It makes it happen again. The naive loop never asked the only question that matters: is doing this twice the same as doing it once?

The word for “safe to repeat” is idempotent

Split every action your system can take into two piles.

In the first pile, doing it twice is identical to doing it once. Set a customer’s status to “active.” Store a file at a fixed path. Tell a record its email is x. Run any of these a hundred times and the end state is the same as running it once. These actions are idempotent, and for them the naive retry loop is genuinely fine. Repeat away.

In the second pile, every repetition is a new event in the world. Send an email. Charge a card. Issue an invoice. Post a message. Create a ticket. These don’t converge on a state; they accumulate. Two charges aren’t one charge confirmed twice; they’re two charges. For this pile, the retry loop is not a safety net. It’s a duplicate generator with exponential backoff.
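The difference between the piles shows up in a few lines. This is an illustrative sketch only; `set_active`, `send_email`, and the in-memory `outbox` are hypothetical names, not part of any real system.

```python
customer = {"status": "pending"}
outbox = []


def set_active(record):
    """First pile: repeating this converges on the same state."""
    record["status"] = "active"


def send_email(box, message):
    """Second pile: every call is a new event in the world."""
    box.append(message)


# Run both actions three times, as a blind retry loop might.
for _ in range(3):
    set_active(customer)
    send_email(outbox, "Your invoice is ready")

print(customer)     # {'status': 'active'}
print(len(outbox))  # 3
```

Three runs of `set_active` leave one status; three runs of `send_email` leave three messages.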

Retries are free on the first pile and dangerous on the second. The whole problem is that the loop can’t tell which pile it’s in.

The naive loop treats both piles the same because it only ever looks at one signal: did an exception come back? That signal tells you about your side of the wire. It tells you nothing about whether the effect happened on the other side, and for the second pile, the effect is the entire point.

So the fix isn’t “retry less” or “retry more carefully.” Retrying carefully still sends two invoices. The fix is to make the dangerous pile behave like the first pile: give a non-idempotent action a memory, so that the second attempt knows the first one already counted.

Figure: Two piles of actions. On the left are idempotent ones (set status to active, store at a fixed path), where running twice lands the same state, so a blind retry is harmless. On the right are accumulating ones (send invoice, charge card, post message), where a blind retry produces two real events. The naive retry loop sits between them, unable to tell which side it is on, and treats both the same.

Part one: the idempotency key gives the action a memory

Here’s the elegant version, and the key idea is simple: make every dangerous action carry a name, and refuse to do the same named action twice.

Before the agent sends the invoice, it mints a stable identifier for this specific intent. This is not a fresh random number for each attempt but one value derived from the work itself: this customer, this billing period, this invoice. Call it the idempotency key. The key travels with every attempt of that same send.
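One way to derive such a key, sketched under assumptions: the three fields below (`customer_id`, `billing_period`, `invoice_number`) and their example values are hypothetical, and SHA-256 is simply one reasonable choice of fingerprint, not something the approach requires.

```python
import hashlib


def idempotency_key(customer_id: str, billing_period: str, invoice_number: str) -> str:
    """Derive the key from the intent itself, so every attempt gets the same value."""
    parts = [customer_id, billing_period, invoice_number]
    # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide.
    canonical = "|".join(f"{len(p)}:{p}" for p in parts)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first_attempt = idempotency_key("cust_42", "2024-05", "INV-1001")
retry = idempotency_key("cust_42", "2024-05", "INV-1001")
next_month = idempotency_key("cust_42", "2024-06", "INV-1002")

print(first_attempt == retry)       # True: same intent, same key
print(first_attempt == next_month)  # False: different intent, different key
```

Because the key is a pure function of the intent, a retry after a crash or restart recomputes the same value without having to remember anything.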

Now the operation that does the sending checks the key first. Has an invoice already been issued under this exact key? If yes, it doesn’t send a second one; it returns the result of the first. If no, it sends and records the key atomically, in the same breath as the effect. The first attempt writes “this key is done.” Every retry afterward reads “this key is done” and returns the original outcome instead of producing a new one.
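A minimal in-memory sketch of that check-the-key-first behavior follows. `InvoiceSender` and its fields are hypothetical, and a plain dictionary stands in for what would need to be durable storage in a real system.

```python
class InvoiceSender:
    """Hypothetical sender that remembers which keys it has already handled."""

    def __init__(self):
        self.results_by_key = {}  # idempotency key -> result of the first send
        self.sent_count = 0

    def send(self, key: str, customer: str) -> dict:
        if key in self.results_by_key:
            # Seen before: return the original outcome, do not send again.
            return self.results_by_key[key]
        self.sent_count += 1
        result = {"invoice_id": f"inv_{self.sent_count}", "customer": customer}
        # In a real system this write must commit together with the send.
        self.results_by_key[key] = result
        return result


sender = InvoiceSender()
outcomes = [sender.send("cust_42|2024-05", "cust_42") for _ in range(3)]

print(sender.sent_count)  # 1
print(outcomes[0] == outcomes[1] == outcomes[2])  # True
```

All three attempts get the same answer back, so the caller cannot tell a retry from a first try, and does not need to.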

The retry still happens. The loop still fires three times into the dark. But now only the first attempt has any effect, because the other two arrive carrying a name the system has already seen. We didn’t stop retrying. We made retrying safe: we moved the dangerous pile into the harmless one, on purpose, with a key.

Two things make or break this, and both are easy to get subtly wrong.

The key has to be stable across attempts and unique across intents. If each retry generates a new key, the system sees three different requests and sends three invoices; you’ve rebuilt the original bug with extra steps. If two genuinely different invoices collide on the same key, the second one silently never sends, and now you’ve lost a real one. The key must be a fingerprint of the intent: same intent, same key, every time; different intent, different key, always.
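Both failure modes are easy to reproduce. In this sketch `send_once` is a hypothetical helper that sends only the first time it sees a key, and the key formats are made up for illustration.

```python
import uuid

sent = {}  # key -> invoice payload; a key already present is never re-sent


def send_once(key, payload):
    """Send only the first time a key is seen; return True if it sent."""
    if key in sent:
        return False
    sent[key] = payload
    return True


# Failure 1: a fresh key per attempt. Three retries of one intent all send.
fresh = [send_once(str(uuid.uuid4()), "May invoice") for _ in range(3)]

# Failure 2: a key that is too coarse. Two different invoices share "cust_42".
coarse = [send_once("cust_42", "May invoice"), send_once("cust_42", "June invoice")]

# A fingerprint of the intent: stable across retries, distinct across invoices.
good = [
    send_once("cust_42|2024-05", "May invoice"),
    send_once("cust_42|2024-05", "May invoice"),   # retry, suppressed
    send_once("cust_42|2024-06", "June invoice"),  # new intent, sent
]

print(fresh)   # [True, True, True]
print(coarse)  # [True, False]
print(good)    # [True, False, True]
```

The first list is the triple send, the second is the June invoice that silently never went out, and the third is the behavior you want.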

And the record-the-key step has to be atomic with the effect. If you send the invoice and then, as a separate step, write down the key, and you crash in the gap, you’ve sent without remembering, and the next retry sends again. The “I did this” must commit together with the doing, or the whole guarantee leaks through the crack between them.
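When the effect and the key live in the same database, a single transaction closes that gap. The sketch below uses SQLite from the Python standard library; the table names and columns are assumptions made for illustration. An effect in an external system (an email provider, a card processor) cannot join a local transaction this way, and typically needs the remote side to accept the key itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE invoices (key TEXT, customer TEXT, amount_cents INTEGER)")
conn.commit()


def issue_invoice(conn, key, customer, amount_cents):
    """Record the key and the invoice in one transaction; return True if issued."""
    try:
        with conn:  # commits on success, rolls back if anything raises
            conn.execute("INSERT INTO idempotency_keys (key) VALUES (?)", (key,))
            conn.execute(
                "INSERT INTO invoices (key, customer, amount_cents) VALUES (?, ?, ?)",
                (key, customer, amount_cents),
            )
        return True
    except sqlite3.IntegrityError:
        # The key already exists, so an earlier attempt already issued this invoice.
        return False


first = issue_invoice(conn, "cust_42|2024-05", "cust_42", 12000)
retry = issue_invoice(conn, "cust_42|2024-05", "cust_42", 12000)
count = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]

print(first, retry, count)  # True False 1
```

The primary-key constraint does the remembering, and the transaction guarantees there is no state in which the invoice row exists without its key, or the key without its invoice.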

Part two: the receipt. Verify the effect, don’t infer it

The key stops the same attempt from doubling. But there’s a second, quieter failure the key alone doesn’t touch, and it’s the one the naive loop gets exactly backwards.

Reconsider the original scene. The agent sends, the work completes on the far side, and the acknowledgment is lost in transit. The agent never heard “done.” What does it conclude? It concludes the send failed. With an idempotency key in place, a retry is now safe, but the agent still doesn’t know the invoice went out. It’s guessing from the absence of a confirmation, and absence is not information.

Run that backwards and it’s just as bad. The agent gets back a cheerful 200 OK and concludes success, but the call only enqueued the work, and the queue later dropped it. The agent heard “done” and the thing never happened. A confirmation is a claim, and a claim is not a result.

“No exception” is not the same as “it happened.” “An exception” is not the same as “it didn’t.”

The naive loop conflates the return value with reality. It treats a successful return as proof the effect occurred and a thrown error as proof it didn’t, and both inferences are wrong in exactly the cases that bite. The honest move is to stop inferring the effect from how the call returned, and start checking for it directly.

So after the action, the system asks the world a separate question. Not “did the call succeed?” but “does the invoice now exist?” It reads back the effect from the source of truth: the invoice is in the ledger, the message has an ID, the charge shows in the processor. If the effect is there, the action is done, no matter what the original call returned or whether it returned at all. If the effect isn’t there, the action is not done, no matter how green the response looked.
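A read-back can be as small as one lookup. In this sketch `Ledger`, `find_invoice`, and `effect_landed` are hypothetical names, and a dictionary stands in for the real source of truth; the two cases mirror the two wrong inferences described above.

```python
class Ledger:
    """Hypothetical source of truth: where issued invoices are recorded."""

    def __init__(self):
        self.invoices = {}  # idempotency key -> invoice record

    def find_invoice(self, key):
        return self.invoices.get(key)


def effect_landed(ledger, key):
    """Ask the world, not the call: does the invoice exist now?"""
    return ledger.find_invoice(key) is not None


ledger = Ledger()

# Case 1: the call returned 200, but the queued work was dropped. Nothing landed.
response_status = 200
dropped = effect_landed(ledger, "cust_42|2024-05")

# Case 2: the call timed out, but the invoice was written before the reply died.
ledger.invoices["cust_42|2024-06"] = {"customer": "cust_42", "amount_cents": 12000}
landed = effect_landed(ledger, "cust_42|2024-06")

print(response_status, dropped)  # 200 False
print(landed)                    # True
```

Looking the invoice up by its idempotency key is a convenient design choice: the same value that suppresses duplicates also tells the verifier exactly which record to look for.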

This is the part people skip, because it feels redundant: you just did the thing, so why check that you did it? You check because the call that does the work and the reality of the work are two different facts, joined by a network that drops messages. The receipt is how you find out which world you’re in.

Figure: A sequence in two acts. First, the action carries a stable idempotency key, so a retried send finds the key already recorded and returns the first result instead of issuing a second invoice. Then a separate verification step reads the effect back from the source of truth (the invoice exists in the ledger), and only that read-back, not the call’s return value, is allowed to mark the action done.

Put the two together and the loop finally tells the truth

A reliable action is the key and the receipt working as a pair, and you can read the whole flow in one breath.

Mint a stable key from the intent. Try the action carrying that key. If the call returns cleanly, don’t believe it yet. Read the effect back; the receipt, not the response, decides. If the call errors or simply vanishes, retry safely, because the key guarantees the second attempt can’t double the first. Keep going until the read-back confirms the effect is real, or until you escalate a genuine, verified failure to a human. At no point did “no exception came back” get to stand in for “the work is done.”
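The whole flow fits in one small function. This is a sketch under stated assumptions, not a definitive implementation: `reliable_action`, `VerifiedFailure`, and the dictionary-backed `send_invoice` are hypothetical, backoff is omitted for brevity, and the loop reads the effect back after every attempt, including failed ones, so a lost reply costs no extra call.

```python
class VerifiedFailure(Exception):
    """The effect is confirmed absent after all attempts; escalate to a human."""


def reliable_action(key, attempt, verify, max_attempts=3):
    """Retry attempt(key) until verify(key) confirms the effect, or escalate."""
    for _ in range(max_attempts):
        try:
            attempt(key)   # safe to repeat: the key suppresses duplicates
        except Exception:
            pass           # an error is not evidence; the read-back decides
        if verify(key):    # the receipt, not the response, marks it done
            return True
    raise VerifiedFailure(f"effect for {key!r} not found after {max_attempts} attempts")


# Hypothetical far side: the first call lands the effect but loses the reply.
ledger = {}
calls = []


def send_invoice(key):
    calls.append(key)
    ledger.setdefault(key, {"invoice": "May"})  # keyed write: at most one per key
    if len(calls) == 1:
        raise TimeoutError("acknowledgment lost")


done = reliable_action("cust_42|2024-05", send_invoice, lambda k: k in ledger)
print(done, len(calls), len(ledger))  # True 1 1
```

The lost acknowledgment that fooled the naive loop is harmless here: the exception is ignored as evidence, the read-back finds the invoice, and no second send is made.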

Notice what each half is doing. The idempotency key makes it safe to try again: it removes the cost of being wrong about failure. The verification makes it safe to stop: it removes the cost of being wrong about success. Without the key, every retry is a gamble that the last attempt really failed. Without the receipt, every success is a guess that the call told the truth. You need both, because the two ways the naive loop lies are mirror images, and one guard catches each.

A retry doesn’t make an action reliable. The key and the receipt do; the retry just becomes safe to use once they’re in place.

Why agents make this urgent, not optional

Systems engineers have known about idempotency keys for years; payment processors were built on them. So why write this now? Because agents change the blast radius of getting it wrong.

A traditional integration retries a fixed set of operations a human wired up in advance. An engineer looked at each call, decided whether it was safe to repeat, and built the guard where it was needed. The set of dangerous actions was small, known, and reviewed.

An agent composes its actions at runtime. You don’t hand it three pre-vetted calls; you hand it tools and goals, and it decides, turn by turn, what to do and when to try again. Its retry instinct is the same naive loop, now applied to operations no engineer pre-classified. It will cheerfully re-run a send because the last one looked like it failed, and it has no built-in sense of which pile that send belongs to. The judgment that used to live in a careful engineer’s review now has to live in the system the agent runs on.

That’s the line we hold. A tool an agent can call to do something irreversible in the world is built around the key and the receipt before the agent is ever allowed to call it, so that “the agent retried” can never become “the customer was billed twice.” The agent gets to be eager. The platform underneath it is what makes eagerness safe.
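One way a platform could enforce this at the tool boundary is a wrapper that the agent cannot bypass. The decorator below is a hypothetical sketch, not a description of any particular product’s API: `guarded_tool`, `key_for`, and `effect_exists` are invented names, and the in-process `seen` dictionary stands in for durable storage.

```python
from typing import Callable


def guarded_tool(key_for: Callable[..., str], effect_exists: Callable[[str], bool]):
    """Wrap an irreversible tool so the platform enforces key and receipt."""
    def wrap(tool):
        seen = {}  # idempotency key -> first verified result

        def call(**intent):
            key = key_for(**intent)
            if key in seen:
                return seen[key]            # a repeat call has no new effect
            result = tool(key=key, **intent)
            if not effect_exists(key):      # the receipt decides, not the return
                raise RuntimeError(f"effect for {key!r} not verified")
            seen[key] = result
            return result
        return call
    return wrap


ledger = {}


@guarded_tool(
    key_for=lambda customer, period: f"{customer}|{period}",
    effect_exists=lambda key: key in ledger,
)
def send_invoice(key, customer, period):
    ledger[key] = {"customer": customer, "period": period}
    return {"status": "sent", "key": key}


# The agent may call as eagerly as it likes; only the first call has an effect.
a = send_invoice(customer="cust_42", period="2024-05")
b = send_invoice(customer="cust_42", period="2024-05")
print(a == b, len(ledger))  # True 1
```

The agent only ever sees `send_invoice(customer=..., period=...)`. The key derivation and the read-back live in the wrapper, which is where the judgment about “which pile” now has to sit.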

The turn: reliability was never the absence of errors

Strip away the keys and the receipts and what’s left is a quieter idea about what “reliable” even means.

We tend to picture a reliable system as one that doesn’t throw errors. So we chase the green, retry until the exception stops, treat a clean return as a clean conscience, and call it robust. But the invoice sent twice threw no error. The charge that vanished returned 200 OK. The most expensive failures in real systems are the ones that look, from the inside, exactly like success. A system that only knows how to avoid exceptions is blind to precisely the failures that cost the most.

Real reliability isn’t the absence of errors. It’s the presence of truth: the system knowing, at every step, what actually happened in the world, and refusing to mistake a hopeful return value for a real effect. That’s a harder thing to build than a retry loop. It’s also the only thing that lets you hand an eager agent a tool that touches a real customer and sleep at night. The retry was never the reliability. The reliability is the moment the system stops guessing and goes to look.


This is one of the boring, load-bearing things we build into Apollo so the interesting things are safe: agents that can act in the real world without turning a lost network packet into a double charge. If you’ve ever shipped a retry loop and felt a small cold dread the first time it wrapped a write, you already know which pile we’re talking about.
