Compaction is a decision you make before the window fills, not after

Prepare the summary in the background so the boundary is crossed without a stall.

Apollo Space Research

Apollo Space

An agent is twelve tool calls into a long job, reading files, running checks, stitching a plan together, and the context window is nearly full. The next token won’t fit. So the runtime stops everything, hands the entire transcript back to the model, and waits for it to write a summary of what just happened. Forty seconds of nothing. The user, mid-thought, watches a spinner. The agent, mid-task, has gone dark to think about thinking.

That pause has a name in our codebase, and we spent a long time trying to make it shorter. We were solving the wrong problem.

The pause isn’t slow because summarizing is slow. It’s slow because we started summarizing at the worst possible moment, the exact instant the agent had no room left to do anything else. The fix isn’t a faster summary. It’s a summary that was already written.

This post is about why a long-running agent has to manage its own memory the way an operating system manages yours, and why the trick that makes it feel instant is the same trick your laptop has used for decades.

The naive version: compact when you’re out of room

The obvious way to handle a full context window is to wait until it’s full.

You let the agent run. You watch the token count climb. When the window is nearly used up and the next step won’t fit, you trigger compaction: you ask the model to read everything so far and produce a tight summary, you throw away the raw transcript, you keep the summary, and you resume with the freed-up room. It is correct. It loses nothing important if the summary is good. And it is the version almost every agent framework ships first, because it’s the version you reach for when you’re thinking about correctness and not yet thinking about the person waiting.
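
In code, the naive version is only a few lines, which is part of its appeal. What follows is a minimal sketch, not our runtime: `count_tokens` is a stand-in that counts words instead of real tokens, and `summarize` is a hypothetical callable that would wrap the model call.

```python
from typing import Callable, List


def count_tokens(messages: List[str]) -> int:
    # Stand-in for a real tokenizer (assumption): one token per word.
    return sum(len(m.split()) for m in messages)


def naive_step(history: List[str], new_message: str, window: int,
               summarize: Callable[[List[str]], str]) -> List[str]:
    """Append a message; if it would not fit, compact synchronously first."""
    if count_tokens(history + [new_message]) > window:
        # Blocking call: nothing else happens until the summary comes back.
        history = [summarize(history)]
    return history + [new_message]
```

The `summarize` call inside `naive_step` is the stall. It runs on the same path as the agent's next turn, so the user waits for all of it.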

It works fine in a benchmark, where nobody is watching the clock between turns. Then you put it in front of a real user on a real task and the seam shows.

The problem is timing, not technique. Compaction is expensive: you’re sending a large transcript through the model and waiting for a full summary to come back. And you’ve arranged for that expensive thing to happen at the one moment the agent is busiest and the user is most engaged: deep in a task, momentum high, the boundary hit without warning. The agent doesn’t slow down. It freezes. Everything it was doing stops while it goes away to compress its own history, and the person on the other end has no idea why the thing that was flying a second ago is now a spinner.

A naive compactor isn’t wrong about what to do. It’s wrong about when. It does the right work at the worst time.

Two ways to cross the context boundary. In the naive lane, the agent runs until the window is full, then stops dead to summarize while the user waits. In the prepared lane, a background summary is built ahead of the boundary, so when the limit arrives the agent swaps it in and keeps moving.

The idea we borrowed: prepare the page before you need it

We didn’t invent the fix. We stole it from the layer underneath everything.

Your operating system has the same problem, scaled to the whole machine. Physical memory is finite. Programs ask for more than fits. The naive answer would be to wait until memory is completely exhausted and then, in a panic, find something to evict and write it to disk while every program stalls. No operating system ships that way, because it would feel exactly like our spinner: the machine freezing at the worst moment to make room it should have made earlier.

Instead, the OS works ahead of the wall. It keeps a little memory free in reserve. A background process watches the pressure and, well before things are critical, quietly picks pages that aren’t being used and writes them out, so that when a program finally does need room, the room is already there. The eviction still happens. The cost is still paid. But it’s paid off the critical path, in the background, before the moment of need, so the program that asks for memory gets it instantly and never knows a thing.

That’s the whole move, and it transfers cleanly:

Don’t pay the cost when you hit the wall. Pay it earlier, in the background, so hitting the wall is free.

Compaction is the agent’s version of paging. The transcript is memory under pressure. The summary is the evicted page, written to a cheaper place. And the lesson from the layer below is that the summarizing was never the problem. Doing it synchronously, at the boundary, with the user waiting, was. So we moved it.

How the prepared summary actually works

The mechanism is less clever than it sounds, which is how you know it’s right.

While the agent works, a second job watches the same token count the naive version waits on, but instead of waiting for the wall, it sets a softer line well before it. Suppose the window holds room for a long conversation; the watcher draws its line at, say, two-thirds full, a threshold you’d tune to leave comfortable headroom. When the transcript crosses that softer line, nothing visible happens. In the background, off the path the user is watching, the runtime starts building the summary of the history so far. The agent keeps taking its turns. The user keeps getting responses. A summary is quietly assembled in reserve.
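
Here is a minimal sketch of that watcher, assuming a thread pool is an acceptable way to run the summary off the main path. The class name, `soft_fraction`, and the `summarize` callable are hypothetical, and a real runtime would likely use its own async machinery instead.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List, Optional


class SoftThresholdWatcher:
    """Starts one background summary when usage crosses a soft line."""

    def __init__(self, window: int, summarize: Callable[[List[str]], str],
                 soft_fraction: float = 2 / 3) -> None:
        # soft_fraction is an assumed tuning knob, not a recommended value.
        self.soft_limit = int(window * soft_fraction)
        self.summarize = summarize
        self.pool = ThreadPoolExecutor(max_workers=1)
        self.pending: Optional[Future] = None
        self.mark = 0  # number of messages the pending summary covers

    def observe(self, history: List[str], tokens_used: int) -> None:
        """Call after every turn. Returns at once and never blocks the agent."""
        if self.pending is None and tokens_used >= self.soft_limit:
            self.mark = len(history)
            snapshot = list(history)  # copy, so later appends do not leak in
            self.pending = self.pool.submit(self.summarize, snapshot)
```

`observe` only schedules work. It also records how many messages the snapshot covered, so the finished summary can later be matched to a fixed point in the history.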

Then one of two things happens, and both are fine.

If the agent finishes the task before the window actually fills, the prepared summary is never needed. We spent a little compute we didn’t have to. That’s the reserve we keep free on purpose, cheap insurance, paid whether or not the fire happens.

If the agent does reach the boundary, there is no pause. The summary already exists. The runtime swaps the raw transcript for the summary it prepared moments ago, the window has room again, and the next turn goes out as if nothing happened. The boundary that used to be a forty-second stall is now a thing that happens between two tokens. The work of compaction was identical. The experience of it went from a freeze to nothing, because the expensive part was finished before anyone needed it.

There’s a subtlety worth naming, because it’s where naive background work goes wrong. The summary has to be built from a consistent snapshot of the history. You can’t summarize a transcript that’s being appended to underneath you, or you get a summary that’s already stale by the time it lands. So the prepared summary is built against a marked point in the conversation, and if the agent adds a few more turns before the boundary, those last turns are simply carried forward on top of the summary. You summarize the settled past in the background and keep the fresh present raw. The boundary, when it comes, is a clean swap with no recompute.
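
The swap itself can be sketched in a few lines. This assumes the history is a plain list of messages and that the marked point is stored as an index into it. The function name and the `mark` parameter are illustrative, not taken from our codebase.

```python
from typing import List


def swap_in_summary(history: List[str], summary: str, mark: int) -> List[str]:
    """Replace history[:mark] with the prepared summary; keep later turns raw."""
    if not 0 <= mark <= len(history):
        raise ValueError("mark must point inside the current history")
    tail = history[mark:]    # turns added after the snapshot was taken
    return [summary] + tail  # settled past compressed, fresh present raw
```

For example, with five turns in the history and a summary prepared against the first three, `swap_in_summary(history, summary, 3)` returns the summary followed by turns four and five, untouched. No model call happens at swap time.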

Compaction as a background loop. The agent runs and the token count climbs; crossing a soft threshold quietly kicks off a summary off the critical path; the agent keeps responding; and when the hard boundary arrives the prepared summary is swapped in and the loop continues without a stall.

Why “just make the summary faster” was a trap

The instinct everyone has first, including us, is to attack the duration. If the pause is forty seconds, make the summary shorter, use a smaller model to write it, cache part of it, do anything to drag the number down.

It’s a trap, and it’s worth saying why, because it’s the kind of trap that feels like progress.

Every second left in a synchronous compaction is a second the user still waits: you’ve made the freeze shorter, not made it go away. And you’ve bought that shorter freeze with a worse summary: a smaller model or a tighter budget compresses the history less faithfully, so the agent resumes with a blurrier memory of what it was doing, and the quality you lose downstream is harder to see than the seconds you saved up front. You’re trading the thing the user feels for the thing the agent needs, and calling it an optimization.

The background approach refuses the trade entirely. Because the summary is built off the critical path, its duration almost doesn’t matter. As long as it finishes before the boundary arrives, it can take its time, use the full model, produce the faithful compression the agent will actually rely on, and still cost the user zero, because the user was never waiting on it. The naive version forces a choice between a fast pause and a good summary. The prepared version dissolves the choice: a good summary that the user never waits for at all.
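
The "almost" can be put as one line of arithmetic. This is a toy model under a stated assumption: `headroom_seconds` is the time between crossing the soft line and reaching the hard boundary, a quantity you would have to measure for your own workload.

```python
def visible_stall(summary_seconds: float, headroom_seconds: float) -> float:
    """Seconds the user waits at the hard boundary (toy model).

    headroom_seconds is an assumed, workload-dependent quantity: the time
    between starting the background summary and hitting the boundary.
    A headroom of 0 models the synchronous, naive case.
    """
    return max(0.0, summary_seconds - headroom_seconds)
```

With the forty-second summary from the opening, `visible_stall(40, 0)` is 40.0 and `visible_stall(40, 120)` is 0.0, where the 120 seconds of headroom is an invented figure for illustration. A slower, more faithful summary costs the user nothing until it outlasts the headroom, which is why the soft line is the number worth tuning.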

This is the difference between optimizing a number and moving it off the path. One makes the pain smaller. The other makes the pain happen where nobody is standing.

The turn: an agent that runs for hours has to manage its own forgetting

Ask what it actually takes for an agent to work on something for an hour instead of a minute, and you arrive somewhere quieter than the demos.

A short conversation never fills its window, so it never has to forget anything, so none of this matters. The whole problem only appears when the work is long, when an agent is genuinely living in a task across many turns, accumulating more history than any window can hold, and has to decide, continuously, what to keep sharp and what to compress into the gist. That decision is memory management, and a system that’s going to run real work for real durations cannot treat it as an afterthought it handles in a panic at the wall. It has to do what every long-running system before it learned to do: look ahead, make room before it’s out of room, and pay the cost where the cost can’t be felt. Prepare the summary in the background so the boundary is crossed without a stall.

The forgetting was always going to happen. A finite window guarantees it. The only question an agent gets to answer is when it forgets: in a stall the user watches, or in a quiet background pass they never see. We think the difference between those two answers is most of what separates a demo that’s impressive for five minutes from a system you’d trust to work all afternoon.

The window will always fill. What you decide before it does is whether the person on the other end ever has to know.


That’s the kind of thing we’re building at Apollo Space: agents that manage their own memory the way the OS under your fingers already manages yours, so the long jobs feel as smooth as the short ones. If you’ve ever watched something fast go suddenly, inexplicably still right when you needed it most, you already know which of those two moments we’re trying to delete.
