
Our long-running agent started confidently making up its own history

When you compress an agent's memory to keep it alive, the thing it forgets first is what it actually did.


Apollo Space Research

Apollo Space

11 min read

Two hours into a long task, the agent told us, with total composure, that it had already sent the report. It hadn’t. It described the file it had written, with the wrong name and the wrong contents. It referred back to a decision “we agreed on earlier” that nobody had made. None of this read as a glitch. It read as a confident colleague recapping the morning. The unsettling part wasn’t that it was wrong. It was that it was sure, and the thing it was sure about was its own past.

We had hit the failure mode that ends every long-running agent that doesn’t plan for it.

When you compress an agent’s memory to keep it alive, the thing it forgets first is what it actually did.

This post is about why that happens to every team running an agent for longer than a single reply, and the one distinction that fixes it. The fix isn’t a bigger context window. It’s deciding which memories are allowed to be summarized and which ones can never be touched.

Why a long-running agent has to forget something

Start with the constraint everyone runs into, because the failure grows directly out of it.

An agent thinks inside a context window, a fixed budget of tokens that holds the conversation, the tool results, the instructions, everything it can currently “see.” That budget is finite. A short task never fills it. But run an agent across hours, dozens of tool calls, a back-and-forth with a person, and a few detours, and the transcript grows past what the window can hold. Something has to give.

The naive answer is the one every first version ships: when the history gets too long, summarize the old part and keep going. Take the first few thousand tokens of transcript, compress them into a tight paragraph, drop the originals, and free up room. It feels obviously correct. It’s how a human takes notes. It’s how you’d describe a meeting you sat through. And for a while, it works beautifully: the agent keeps the gist, loses the filler, and sails on.
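The naive approach can be sketched in a few lines. This is a minimal illustration, not our production code: `rough_token_count`, `naive_compact`, and `toy_summarizer` are hypothetical names, the word count is a crude stand-in for a real tokenizer, and the summarizer is a placeholder for a model call.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    # Assumption: one token per whitespace-separated word (a real tokenizer differs).
    return len(text.split())


def naive_compact(
    history: List[str],
    budget: int,
    summarize: Callable[[List[str]], str],
    keep_recent: int = 2,
) -> List[str]:
    """If the history is over budget, replace everything except the newest
    `keep_recent` messages with a single summary and drop the originals."""
    total = sum(rough_token_count(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return list(history)
    old, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(old)] + recent


def toy_summarizer(messages: List[str]) -> str:
    # Placeholder for a model-generated summary.
    return f"[summary of {len(messages)} earlier messages]"


history = [
    "I'll draft the report and send it for review.",
    "The report was sent to the review queue at 4:02, here is the receipt.",
    "Next, look at the migration checklist.",
    "Checklist item one is done.",
]
compacted = naive_compact(history, budget=10, summarize=toy_summarizer)
for line in compacted:
    print(line)
```

Note what the compacted history no longer contains: the receipt line with its timestamp went into the summary along with the plan that preceded it.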

Then it summarizes the wrong thing, and you get an agent that’s sure it sent a report it never sent.

The reason is subtle and worth staging carefully, because it’s the whole point.

The naive fix: summarize the old history, keep going

Picture what summarization actually does to a transcript. It is, by design, lossy. It keeps what looks important and discards the rest, and it has to decide “important” without knowing what you’ll ask next. That’s a fine bet for most content. A long explanation of a concept compresses cleanly: the idea survives, and the wording doesn’t matter. A digression you never returned to can vanish and nobody misses it.

But not all of the transcript is explanation. Some of it is record.

The line “I’ll draft the report and send it for review” is a plan. The line “the report was sent to the review queue at 4:02, here is the receipt” is a fact about what happened in the world. To a summarizer trying to save space, those two sentences look almost identical: the same words, the same topic, a few tokens apart. So when it compresses, it cheerfully blends them. The plan and the fact melt into one smooth paragraph: “worked on the report and handled the review.” Did the report actually get sent? The summary no longer knows. It kept the shape of the activity and lost the truth of it.

Now the damage compounds. The agent reads its own summary as ground truth (it has no other record, because the originals are gone) and acts on it. It “remembers” sending the report because the summary says it was handled. Ask it for the receipt and it doesn’t say “I’m not sure.” It generates a plausible one, because generating plausible text from a vague prompt is exactly what the model is built to do. The summary is vague. The model fills the gap, and it fills it with the same confidence it uses for everything else.

A summary of what you did is not a record of what you did.

That’s the trap. Lossy compression on a transcript doesn’t just lose detail; it quietly converts facts about actions into vibes about activity, and a model handed a vibe will manufacture the missing detail rather than admit the gap. The agent didn’t lie. It read a smudged note in its own handwriting and trusted it.

On the left, one memory bin holds everything (plans, explanations, and the hard record of what was done), and the summarizer melts the report-was-sent fact into a vague paragraph the agent later fills in with a fabricated receipt. On the right, the record of actions is split into its own ledger that summarization is never allowed to touch.

The distinction that fixes it: a ledger is not a transcript

The key idea is simple. Not all of an agent’s memory is the same kind of thing, and treating it as one undifferentiated stream is what lets the corruption in.

There are two kinds, and they have opposite needs.

One kind is reasoning: the agent’s chain of thought, its explanations, its exploration of an idea, the long discussion that got it to a decision. This kind is safe to compress. The conclusion is what matters; the path can be summarized to a sentence and nothing important is lost. If the agent decided to use approach B over approach A, you need to keep that it chose B and why, not the four paragraphs of deliberation. Summarize away.

The other kind is record: the discrete, factual events of what the agent actually did in the world. A file was written, here. A message was sent, to this queue, at this time. A tool returned this exact result. A person approved this specific thing. These are not opinions or explanations. They are facts with consequences, and a fact you’ve smudged is worse than a fact you’ve forgotten, because a smudged fact still answers questions, wrongly.

So we stopped keeping one memory and started keeping two.

The reasoning lives in the working transcript, and that transcript gets summarized freely when it grows. That’s fine; that’s what it’s for. The record lives somewhere else entirely: an append-only ledger of actions, each one a structured entry the agent writes the instant it does something real. Wrote file X. Sent message Y to queue Z at time T. Tool returned result R. This ledger is never summarized. It never gets compressed to make room. When the working memory fills up and the old reasoning gets squeezed into a paragraph, the ledger sits beside it, untouched, holding the literal truth of every action the agent has taken.
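Here is one way the two stores might look in code. It is a sketch under stated assumptions: `LedgerEntry`, `Ledger`, and `AgentMemory` are illustrative names, the entry fields are invented for the example, and "append-only" is enforced only by the class interface, not by durable storage.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from types import MappingProxyType
from typing import Callable, List, Mapping, Tuple


@dataclass(frozen=True)
class LedgerEntry:
    """One real-world action, stored verbatim. Field names are illustrative."""
    action: str
    details: Mapping[str, str]
    at: datetime


class Ledger:
    """Append-only by interface: entries can be added and read, not edited."""

    def __init__(self) -> None:
        self._entries: List[LedgerEntry] = []

    def append(self, action: str, **details: str) -> LedgerEntry:
        entry = LedgerEntry(
            action, MappingProxyType(dict(details)), datetime.now(timezone.utc)
        )
        self._entries.append(entry)
        return entry

    def entries(self) -> Tuple[LedgerEntry, ...]:
        # Return a tuple so callers cannot mutate the underlying list.
        return tuple(self._entries)


class AgentMemory:
    def __init__(self) -> None:
        self.transcript: List[str] = []  # reasoning: safe to compress
        self.ledger = Ledger()           # record: never compressed

    def think(self, text: str) -> None:
        self.transcript.append(text)

    def record(self, action: str, **details: str) -> LedgerEntry:
        return self.ledger.append(action, **details)

    def compact(self, summarize: Callable[[List[str]], str]) -> None:
        # Only the transcript is rewritten; the ledger is not touched here.
        if self.transcript:
            self.transcript = [summarize(self.transcript)]


memory = AgentMemory()
memory.think("Comparing approach A and approach B for the report layout.")
memory.think("B is simpler; going with B.")
memory.record("send_message", subject="report", queue="review")
memory.compact(lambda msgs: f"[summary of {len(msgs)} reasoning messages]")
```

After `compact`, the transcript is a single summary line, while the ledger still holds the send_message entry with its queue and timestamp exactly as written.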

Now ask the agent whether it sent the report. It doesn’t consult a vague paragraph and guess. It checks the ledger. Either there’s an entry that says the report was sent, with the time and the destination, or there isn’t. If there isn’t, the honest answer is “no, not yet,” and the agent can say it, because the absence of a record is itself a fact.
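The lookup described above might be sketched like this. The helper names `find_action` and `answer_did_it_happen`, and the plain-dict entry shape, are assumptions made for the example; the point is only that a missing entry produces an explicit "no" instead of a guess.

```python
from typing import Dict, List, Optional

Entry = Dict[str, str]


def find_action(ledger: List[Entry], action: str, **match: str) -> Optional[Entry]:
    """Return the most recent ledger entry for `action` whose fields equal
    every key/value in `match`, or None if there is no such entry."""
    for entry in reversed(ledger):
        if entry.get("action") == action and all(
            entry.get(key) == value for key, value in match.items()
        ):
            return entry
    return None


def answer_did_it_happen(ledger: List[Entry], action: str, **match: str) -> str:
    entry = find_action(ledger, action, **match)
    if entry is None:
        # The absence of a record is itself a fact, so say so.
        return "No, not yet."
    return f"Yes, at {entry.get('at', 'an unrecorded time')}."


ledger: List[Entry] = [
    {"action": "write_file", "path": "report.md", "at": "3:47"},
]
before = answer_did_it_happen(ledger, "send_message", subject="report")

ledger.append(
    {"action": "send_message", "subject": "report", "queue": "review", "at": "4:02"}
)
after = answer_did_it_happen(ledger, "send_message", subject="report")
print(before)
print(after)
```

The answer changes only when the ledger does. Nothing in this path reads a summary, so there is no vague text for the model to fill in.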

The naive version asks one memory to be both a thinking space and a system of record. Those jobs fight each other. A thinking space wants to be fluid and compressible; a system of record has to be rigid and exact. Forcing them into one store means every time you compress to protect the first, you corrupt the second.

Why “just don’t summarize” is not the answer

There’s a tempting shortcut here, and it’s worth killing it explicitly, because it’s the move most people reach for second.

If summarizing the history is what breaks the record, why summarize at all? Keep the full transcript forever. Buy a bigger window. Never compress anything. Problem solved.

It isn’t, for two reasons that show up fast in production. The first is mechanical: windows are finite no matter how large, and a genuinely long-running agent (one that works for a day, a week, indefinitely) will outgrow any fixed budget. “Just make it bigger” buys you hours and loses you the week. The second is worse, and it’s about attention rather than space. An agent reasoning over a giant undifferentiated transcript gets worse, not better. The signal it needs is buried in thousands of tokens of old deliberation it no longer cares about. Relevant facts get diluted. The model spends its attention re-reading exploration it already resolved, and the important record, the three actions that actually matter right now, drowns in the noise of everything it ever thought.

So you can’t keep everything, and you can’t summarize everything. The answer was never about how much to compress. It was about what to compress. Summarize the reasoning, because the reasoning’s value survives compression. Protect the record, because the record’s value is destroyed by it. The same operation that’s healthy for one kind of memory is poison for the other, which is why the only durable fix is to stop treating them as one kind.
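To make "what to compress, not how much" concrete, here is a small sketch of compaction that looks at the kind of each item before touching it. The `Item` type, the `kind` labels, and `selective_compact` are hypothetical, and word counts again stand in for tokens. In the two-store design the record would live in its own ledger; the single tagged list here is only to show the rule in one function.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Item:
    kind: str  # "reasoning" or "record" (labels assumed for this sketch)
    text: str


def selective_compact(
    items: List[Item],
    budget_words: int,
    summarize: Callable[[List[str]], str],
) -> List[Item]:
    """When over budget, squeeze all reasoning into one summary item and
    keep every record item verbatim, in its original order."""
    total = sum(len(item.text.split()) for item in items)
    if total <= budget_words:
        return list(items)
    reasoning = [item.text for item in items if item.kind == "reasoning"]
    records = [item for item in items if item.kind == "record"]
    compacted: List[Item] = []
    if reasoning:
        compacted.append(Item("reasoning", summarize(reasoning)))
    return compacted + records


items = [
    Item("reasoning", "Approach A needs a new dependency, approach B does not."),
    Item("reasoning", "Choosing B because it avoids the dependency."),
    Item("record", "Sent report to review queue at 4:02."),
    Item("reasoning", "Next step is the migration checklist."),
]
result = selective_compact(
    items,
    budget_words=12,
    summarize=lambda texts: f"[summary of {len(texts)} reasoning notes]",
)
for item in result:
    print(item.kind, "|", item.text)
```

The same over-budget condition triggers compaction as in the naive version; the difference is that the record line comes out byte-for-byte identical to how it went in.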

A loop showing the two-memory cycle: the agent reasons in a working transcript that summarizes freely as it grows, writes every real action as an entry into a protected ledger that is never compressed, and answers questions about its own past by reading the ledger instead of guessing from a summary, so the loop never drifts from the truth.

What this changes about trusting a long-running agent

Once the split is in place, a property emerges that you can’t get any other way: the agent can be honest about its own past, even when its memory of the conversation has been compressed a dozen times over.

That honesty is the whole game for any agent that runs real operations. A system that does work over hours and days is constantly being asked, implicitly, “did you do the thing?”, and the cost of a wrong answer is not embarrassment; it’s a duplicated action, a skipped step, a report sent twice or never. An agent that guesses about its own history is an agent you have to double-check, and an agent you have to double-check is one you haven’t actually delegated to. The ledger is what lets you stop checking. When the agent says it sent the report, it’s reading a receipt, not recalling a feeling.

There’s a discipline here that long outlives any one model.

Memory you can compress and memory you can trust are not the same store, and the day you merge them is the day your agent starts making things up.

This is true for software, and it’s quietly true for people too: the reason serious operations keep a logbook the recap can’t overwrite is the same reason we gave the agent one.

The turn: the part you can’t summarize

The ledger and the transcript are new words for an old, human thing.

Think about the colleague you trust with something that matters over a long stretch: a launch, a migration, a thing that takes weeks. It is never the one with the most impressive recall. It’s the one who, when you ask “did the thing get done?”, doesn’t perform an answer. They check. They go look at the record, and they tell you what’s actually there, even when what’s there is “not yet.” The trust isn’t in their memory being perfect. It’s in their refusal to fill a gap with a confident guess, and in their insistence on keeping the line between “I think I did that” and “here is the receipt” sharp, especially when smudging it would be easier and sound just as good.

That line is the thing we were really building. An agent that runs for a day will forget the texture of most of that day, and it should; that’s how it stays fast and clear. What it must never forget, and must never smudge, is the ledger of what it actually did to the world. A system that keeps those two kinds of memory in their right places isn’t impressive because its recall is large. It’s trustworthy because, asked about its own past, it would rather check and say “no” than guess and say “yes.”

When you compress an agent’s memory to keep it alive, the thing it forgets first is what it actually did, unless you decide, on purpose, that the record of what it did is the one thing compression is never allowed to touch.


That’s what we’re building at Apollo Space: agents you can hand a long, real job to and walk away from, because the one thing they will never do is invent a past to cover a gap. If you’ve ever caught yourself re-checking work you supposedly delegated, you already know why an agent that can honestly say “no, not yet” is worth more than one that always sounds sure.
