Engineering

Memory is not the context window

The window is RAM: fast, small, wiped between runs. Memory is disk: durable, curated, written on purpose. Stuffing the window full is not remembering; it is forgetting more expensively.

Apollo Space Research

· 11 min read

Two agents read the same thing on Monday. By Friday, one can tell you what it learned and why it matters. The other can’t tell you anything, because Friday is a different run, and the run that read it on Monday is gone. Both were given a “memory feature.” Only one of them actually remembers.

The difference isn’t model quality. It’s a confusion about what memory even is.

The most common mistake in agent design is treating the context window as memory. The window is RAM: fast, small, wiped between runs. Memory is disk: durable, curated, written on purpose. Stuffing the window full is not remembering; it is forgetting more expensively. This post is about why that one swapped definition breaks more agents than any bad prompt, and what you build instead.

The naive version: just put everything in the window

The obvious move, the one almost everyone makes first, is to treat the context window as the agent’s memory. It’s right there. It’s enormous now: hundreds of thousands of tokens, sometimes more. Why not just keep stuffing it?

So you do. Every turn, you take the whole conversation (every document the agent read, every tool result, every prior reply) and paste it all back in. The agent has “memory” in the sense that, for the length of this one run, it can see everything that happened. For an afternoon, it’s magic. The agent recalls the thing you said an hour ago. It feels like it knows you.

Then the run ends. And the next run starts cold, knowing nothing, because the window was never storage. It was a workspace, and the workspace gets wiped between sessions. So you compensate by making the window do double duty: at the start of every run, you reload everything, hoping the agent can re-derive what it “knew” by re-reading the transcript of what it once saw.

This fails three ways at once, and they compound.

It fails on cost, because you pay to re-read the same material on every single turn: the same onboarding documents, the same preferences, the same history, billed again and again for the privilege of pretending the agent remembers.
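To put a rough number on that cost, here is a small arithmetic sketch. It assumes every turn re-sends the full transcript so far, and the turn sizes are made up for illustration; real billing depends on the provider, tokenizer, and any caching in play.

```python
def replay_cost(turn_tokens: list[int]) -> int:
    """Total input tokens billed when every turn re-sends all prior turns."""
    total = 0
    history = 0
    for tokens in turn_tokens:
        history += tokens   # the transcript grows by this turn
        total += history    # and the whole transcript is billed again
    return total


turns = [500] * 40          # hypothetical: 40 turns of 500 tokens each
print(replay_cost(turns))   # 410000 tokens billed
print(sum(turns))           # 20000 tokens of actual new content
```

Under these assumptions the bill grows with the square of the number of turns, while the amount of new information grows only linearly.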

It fails on focus, because a window crammed with everything is a window where nothing stands out. The model has to find the one relevant fact inside a haystack you rebuilt out of every fact you own. Recall gets worse, not better, as you add more; the signal drowns. The window fills up, and the thing you actually needed is somewhere in the middle, where models tend to attend the least.

And it fails on truth, because eventually the haystack is bigger than the window. Something has to be dropped. Now the agent has partial history and no idea which part is missing, which is the exact setup for an agent that confidently fills the gap with something it invented. It doesn’t say “I don’t recall.” It says the wrong thing, fluently, because the missing piece used to be in the window and isn’t anymore.
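The dropping step is easy to reproduce. The sketch below is one common truncation strategy (keep the newest messages that fit), not the behavior of any particular framework. The function name `fit_to_window` and the word-count "tokenizer" are assumptions made for illustration.

```python
def fit_to_window(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages that fit; drop older ones without notice."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):
        cost = len(msg.split())   # crude stand-in for a real tokenizer
        if used + cost > budget:
            break                 # everything older than this is silently gone
        kept.append(msg)
        used += cost
    return list(reversed(kept))


history = [
    "customer is on the annual plan",
    "customer asked about invoices",
    "agent explained invoice schedule",
]
window = fit_to_window(history, budget=8)
print(window)
# ['customer asked about invoices', 'agent explained invoice schedule']
```

Nothing in the returned window records that the plan fact was ever there, so the model has no signal that it should say "I don't recall".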

The window was never memory. It was the desk. A desk piled to the ceiling is not a filing system. It’s a fire hazard.

The fix is a distinction every computer already makes

Here’s the thing: the operating system on the machine you’re reading this on solved exactly this problem fifty years ago. It did not solve it by making RAM infinite. It solved it by refusing to confuse two different things.

RAM is fast and small and volatile. It holds what a program is working on right now, and it’s gone the moment the program stops. Disk is slower and larger and durable. It holds what you decided to keep, and it survives the reboot. No one builds software by holding the entire disk in RAM at all times. You load what you need, you do the work, you write back what matters, you let the rest stay on disk where it’s cheap and safe.

The window is RAM: fast, small, wiped between runs. Memory is disk: durable, curated, written on purpose. Once you see agent memory through that lens, the whole design falls out of it.

The window’s job is to hold what this run is working on. The memory’s job is to hold what the agent decided was worth keeping, and to put the right small slice of it back into the window at the start of the next run. Not everything. The slice. The agent doesn’t reload its whole life every morning any more than your laptop reloads its entire disk into RAM at boot. It pages in what’s relevant to the task in front of it.

Figure: a two-lane contrast. On the left, the naive agent stuffs the whole transcript and every document back into the context window each run, which overflows and drops facts. On the right, durable memory on disk pages only the relevant slice into a clean window, and writes back what was learned when the run ends.

That swap, from “load everything” to “load the right slice, write back what’s worth keeping”, is the entire difference between an agent that performs memory and an agent that has it.
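As a minimal sketch of that loop (page in, work, write back), assume a hypothetical `MemoryStore` where a plain dict stands in for the disk and keys are prefixed with a topic. None of these names come from a real library, and the `work` callable is a placeholder for the model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MemoryStore:
    """Hypothetical durable store; a dict stands in for the disk."""
    facts: dict[str, str] = field(default_factory=dict)

    def page_in(self, topic: str) -> dict[str, str]:
        # Return only the facts filed under this topic, as a fresh dict.
        return {k: v for k, v in self.facts.items() if k.startswith(topic + "/")}

    def write_back(self, new_facts: dict[str, str]) -> None:
        self.facts.update(new_facts)


def run(store: MemoryStore, topic: str, task: str,
        work: Callable[[dict], dict[str, str]]) -> dict:
    window = {"task": task, "memories": store.page_in(topic)}  # small, per-run
    learned = work(window)       # the model call would go here
    store.write_back(learned)    # persist only what the run chose to return
    return window


store = MemoryStore({"acme/plan": "annual", "globex/plan": "monthly"})
window = run(store, "acme", "answer billing question",
             work=lambda w: {"acme/contact": "prefers email"})
print(window["memories"])    # {'acme/plan': 'annual'}
print(sorted(store.facts))   # ['acme/contact', 'acme/plan', 'globex/plan']
```

The window is rebuilt from scratch on each call and thrown away afterward; only the store outlives the run.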

Writing to memory is a decision, not a side effect

So if memory is disk, the next question is the one everyone skips: what gets written, and who decides?

The naive answer is everything: log every turn, dump every transcript, append it all. This is the same mistake wearing a different hat. A disk where you write every byte you ever touched isn’t a memory; it’s a landfill. You can store it, but you can never find anything in it, and the act of searching it costs as much as the original work. An agent that “remembers everything” remembers nothing usefully, because retrieval over an undifferentiated pile is just the haystack problem moved one layer down.

Real memory is curated. Something has to read the run and decide: this is worth keeping, that was noise. The customer’s name and what they care about: keep. The exact phrasing of a tool error that’s already resolved: drop. The decision the team made and the reason behind it: keep, and keep the reason, because a fact without its reason rots the first time the world changes. The forty intermediate steps the agent took to reach that decision: drop. The conclusion is the memory; the scratch work was RAM.
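Those keep/drop rules can be written down as an explicit predicate. This is a sketch under stated assumptions: the `Candidate` type, its `kind` labels, and the sample data are all hypothetical, and refusing to store a decision that has no reason attached is one strict reading of "keep the reason". In practice the judgment may well be a model call rather than a handful of `if` statements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    kind: str               # hypothetical labels: "fact", "decision", "error", "scratch"
    text: str
    reason: str = ""        # why a decision was made, if known
    resolved: bool = False  # only meaningful for errors


def worth_keeping(c: Candidate) -> bool:
    if c.kind == "fact":
        return True
    if c.kind == "decision":
        return bool(c.reason)   # assumption: a decision without its reason is not stored
    if c.kind == "error":
        return not c.resolved   # resolved errors are noise
    return False                # scratch work stays in the window


run_output = [
    Candidate("fact", "Customer is Dana; cares about uptime"),
    Candidate("error", "tool timeout on retry 2", resolved=True),
    Candidate("decision", "Move to annual billing", reason="cash flow in Q3"),
    Candidate("scratch", "step 17 of 40"),
]
kept = [c for c in run_output if worth_keeping(c)]
print([c.kind for c in kept])   # ['fact', 'decision']
```

The point of making it a function is that the write decision becomes something you can test and change, instead of a side effect of logging.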

This is exactly the move a good note-taker makes after a meeting. They don’t transcribe the hour. They write the three things that will matter next week, in a form their future self can act on. The transcript was the working memory of the meeting; the note is the durable memory of the company. Confusing the two is how you end up with a thousand transcripts nobody will ever read and zero notes anyone can use.

And the writer has to be deliberate, because two failure modes sit on either side of “keep the right things.”

Write too little and the agent is amnesiac: it re-learns the same fact every week, asks a question that was already answered, makes the customer repeat themselves. Write too much and the agent is a hoarder: its memory swells until retrieval is slow and noisy and the relevant fact is buried under a hundred near-duplicates of itself. The skill isn’t storage. Storage is cheap. The skill is the judgment about what crosses from working memory into durable memory, and that judgment is a real part of the system you have to build on purpose, not a side effect you get for free by logging.

Say an agent handles a hundred conversations a day. The landfill version stores a hundred transcripts and can answer nothing. The curated version stores maybe a handful of new durable facts (this customer switched plans, that one flagged a bug, this preference changed) and can answer everything that matters tomorrow. Same hundred conversations. The difference is entirely in what was thrown away.
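One simple guard against the hoarder failure is to key each fact by what it is about, so a restated fact updates a row instead of adding one. The sketch below assumes facts can be reduced to a (subject, attribute) pair, which is a simplification, and the `remember` helper is hypothetical.

```python
def remember(store: dict[tuple[str, str], str],
             subject: str, attribute: str, value: str) -> bool:
    """Upsert one fact; return True only if the store actually changed."""
    key = (subject.strip().lower(), attribute.strip().lower())
    if store.get(key) == value:
        return False        # already known: nothing to write
    store[key] = value      # new fact, or an old fact that changed
    return True


store: dict[tuple[str, str], str] = {}
remember(store, "Acme", "plan", "monthly")
remember(store, "acme ", "Plan", "monthly")   # same fact restated: no new row
remember(store, "Acme", "plan", "annual")     # the fact changed: overwrite
print(store)   # {('acme', 'plan'): 'annual'}
```

Three writes, one row. Near-duplicates that differ in wording rather than in key would need fuzzier matching than this.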

Figure: a funnel showing curation. A hundred raw conversation turns flow into a write-decision filter that keeps durable facts and decisions-with-reasons while dropping resolved errors and scratch steps, producing a small clean memory store that the next run reads.

Reading is retrieval, not reload

The last piece is the one the RAM/disk frame makes obvious in hindsight. If memory lives on disk, then starting a run is not “load the disk.” It’s “fetch the relevant slice.”

The naive agent, at the start of every run, pastes its entire memory back into the window, and now we’re right back where we started, with an overflowing desk, except the overflow is durable this time. Making memory persistent doesn’t help if you reload all of it every turn. You’ve just moved the landfill into the prompt.

The agent that works does what the operating system does: it queries. Given the task at hand (this customer, this question, this moment), it pulls the small set of facts that bear on it and leaves the rest on disk where it belongs. The window stays clean. It holds the task plus the handful of memories the task actually needs, and nothing else. The model isn’t hunting through everything the agent has ever known; it’s looking at the few things that matter now.
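Here is a deliberately simple sketch of "query, don't reload". It ranks memories by word overlap with the task and returns the top few. Word overlap is a stand-in chosen to keep the example self-contained; a production system would more likely use a search index or embeddings. The `retrieve` function and the sample memories are hypothetical.

```python
def retrieve(memories: list[str], task: str, k: int = 2) -> list[str]:
    """Return up to k memories sharing the most words with the task."""
    task_words = set(task.lower().split())
    scored: list[tuple[int, str]] = []
    for m in memories:
        overlap = len(task_words & set(m.lower().split()))
        if overlap:                      # memories with no overlap never load
            scored.append((overlap, m))
    # list.sort is stable, so ties keep their original store order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]


memories = [
    "acme switched to the annual plan in march",
    "globex flagged a bug in exports",
    "acme prefers email over phone",
    "initech renewal is due in june",
]
print(retrieve(memories, "acme asks about plan renewal"))
# ['acme switched to the annual plan in march', 'acme prefers email over phone']
```

The store can grow without the window growing: `k` bounds what gets paged in, regardless of how many rows sit on disk.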

This is why the right architecture makes the agent faster and cheaper as it learns more, not slower and more expensive. Under the naive model, every new fact the agent learns is a new tax on every future run, because every future run reloads it. Under the disk model, a new fact is just a new row on a disk that’s already huge and already cheap to ignore. The agent can know a million things and still run on a window that holds twenty, because it only ever pages in the twenty that count.
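The scaling claim can be checked with back-of-the-envelope arithmetic. The numbers below (30 tokens per fact, a 400-token task, a slice of twenty) are assumptions picked for illustration, and the sketch ignores the cost of the retrieval step itself.

```python
from typing import Optional


def prompt_tokens(n_facts: int, tokens_per_fact: int, task_tokens: int,
                  k: Optional[int] = None) -> int:
    """Prompt size per run: reload every fact (k=None) or page in at most k."""
    loaded = n_facts if k is None else min(k, n_facts)
    return task_tokens + loaded * tokens_per_fact


for n in (100, 10_000, 1_000_000):
    reload_all = prompt_tokens(n, tokens_per_fact=30, task_tokens=400)
    paged = prompt_tokens(n, tokens_per_fact=30, task_tokens=400, k=20)
    print(n, reload_all, paged)
# 100 3400 1000
# 10000 300400 1000
# 1000000 30000400 1000
```

Under the reload model the prompt grows with everything the agent knows, and at a million facts it no longer fits in any window. Under the paged model it stays flat at 1,000 tokens.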

The bottleneck never disappears. It just moves, from “how big is the window” to “how good is the retrieval.” And that’s the right place for it to live, because retrieval you can improve forever, while a window is always, eventually, full.

What this changes for a company

Step back from the machinery and here’s why this matters past the engineering.

A company’s memory is not a transcript of everything that was ever said. It’s the small set of things that have to survive a person leaving, a quarter ending, a tool being swapped out. The reason a customer churned. The decision behind a pricing change and what it was reacting to. The thing this account always cares about that nobody wrote down because the person who knew it was always in the room, until the day they weren’t.

When you build agents that confuse the window with memory, you build a company that confuses being busy with learning. Everything is processed, nothing is kept. The agent handled the conversation and forgot it the instant the run ended, the same way an overworked team handles a hundred tickets and retains nothing, solving the same problem for the fifth time because the first four solutions lived in a window that got wiped.

The promise of an agent that remembers isn’t that it can recite the transcript. It’s that the company stops paying the re-learning tax: the meeting that re-decides what was already decided, the question already answered, the lesson the org keeps re-buying because it never wrote it to disk. Memory done right is the difference between a company that compounds what it knows and one that resets to zero every Monday morning.

The window is where the work happens. The disk is where the company lives. The mistake of pouring one into the other isn’t just expensive; it’s the quiet reason an agent can read everything and still know nothing.


That distinction (workspace versus storage, the thing you’re holding versus the thing you keep) is one of the foundations we build on at Apollo: a company brain that decides what’s worth remembering and pages the right slice back in exactly when it’s needed. If you’ve ever watched a smart team solve the same problem twice because the first solution lived nowhere, you already understand why memory is not the context window, and why getting that one distinction right is most of the job.
