Llms.txt is the robots.txt of the agent era

Search crawlers obeyed a file at the root of your site for thirty years. The models reading your site for customers now have no such file, which means they read you wrong, and you never find out.

Apollo Space Research

Apollo Space

December 5, 2025 · 11 min read

For three decades, every serious web crawler asked one question before it touched your site: what does the file at the root say? It fetched /robots.txt, read the rules, and obeyed them. Whole industries were built on the assumption that crawlers would. Then a different kind of reader showed up, a language model answering a question for one of your customers, and it had no file to read. So it read everything, decided what your company was on its own, and cited whichever paragraph happened to rank.

You never saw the answer it gave. You just lost the customer who asked.

The models reading your site for customers have no equivalent of robots.txt to consult. This post is about the file that fixes that: what it is, why the naive version of it fails, and what it actually has to contain to change the answer a model gives about you.

The reader changed, and the contract didn’t

Start with what robots.txt actually was. It was never about security; anyone could ignore it. It was a contract of intent: a small, predictable file at a known location that told an automated reader how to behave on your site. Where it could go. Where it shouldn’t. Which crawler the rules applied to. The web agreed on the location and the format, and because everyone agreed, the file worked.
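Python's standard library still ships a parser for this contract, which makes those three questions concrete. The sketch below feeds a made-up robots.txt to urllib.robotparser and asks where each crawler may go. The bot names, paths, and example.com URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: one rule set for a named crawler, one for everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The same site gives different answers depending on which crawler is asking.
example_private = parser.can_fetch("ExampleBot", "https://example.com/private/report")
example_pricing = parser.can_fetch("ExampleBot", "https://example.com/pricing")
other_drafts = parser.can_fetch("OtherBot", "https://example.com/drafts/post")

print(example_private, example_pricing, other_drafts)
```

Nothing enforces these answers. A crawler that skips the check can fetch anything, which is the sense in which the file is a contract rather than a lock.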

The contract held because the reader was predictable. A search crawler had one job: index pages so a human could later find them and click. The human did the reading. The crawler was just a librarian deciding what went on the shelf.

That reader is being replaced. The new reader doesn’t index your page so a human can find it. It reads your page so a human never has to. Someone asks a model “who should I use for X,” and the model answers from whatever it absorbed about you (your pricing, your positioning, your competitor’s blog post about you), and the customer acts on that answer without ever loading your site. The click you spent thirty years optimizing for didn’t happen. The reading happened somewhere you can’t see, and the file that used to govern the reader governs nothing now, because it was written for a librarian and the librarian became a lawyer who answers on your behalf.

The bottleneck never disappeared. It moved. It used to be “can the crawler find my page?” Now it’s “can the model understand my company well enough to represent it correctly when I’m not in the room?”

The naive fix: just write more content

Here’s the first thing everyone reaches for, because it’s what worked last time. If models read your site and get it wrong, write more. Publish more pages. Stuff more keywords. Repeat your value proposition in more places so the model can’t miss it. This is search-engine optimization muscle memory, and it’s the wrong muscle.

It fails for a specific reason. A search engine rewarded volume because more indexed pages meant more chances to match a query. A model doesn’t work that way. A model reads your forty pages and finds that three of them contradict each other: the pricing page says one thing, the old blog post says another, the FAQ a third. It has no way to know which one is true now. So it averages. Or it picks the one that’s most confidently worded, which is often the most outdated. Or it cites the competitor who described you more clearly than you described yourself.

More content doesn’t make a model understand you. It gives it more ways to contradict you.

The naive fix treats the model like a crawler that just needs more to chew on. But the model isn’t starving for content. It’s drowning in unranked, undated, unattributed content with no signal about what’s canonical. You don’t have a volume problem. You have an authority problem: nothing on your site tells the reader which source to trust when your own sources disagree.

Figure: On the left, the old web: a crawler reads robots.txt, indexes pages, and a human later clicks a result to read the page themselves. On the right, the agent era: a model reads the whole site, finds contradictory pages with no canonical signal, and answers the customer directly so the page is never clicked.

What the file actually has to do

So we need a file again, but it can’t be the old file with a new name. robots.txt answered “where can you go?” The agent-era file has to answer a harder question: “when you represent my company, what is true, what is canonical, and what should you cite?” That’s the shape of llms.txt, and the idea behind it is simple even if the discipline isn’t.

The key idea is this. The file is a single, plain-text map at the root of your site that does three jobs at once. First, it names what your company is, in your words, dated and authoritative, so the model has one source it can trust over the scattered pages. Second, it points to the canonical version of every important thing: this is the real pricing, this is the current product list, this is the customer story, and these older pages are superseded. Third, it tells the reader what to cite: the URL you want attributed when a model answers about you, so the citation points somewhere you control instead of somewhere you’ve forgotten.
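To make the three jobs concrete, here is a minimal sketch in Python of a record that carries all three and a function that renders it as plain text. The field names, section headings, and example.com URLs are assumptions made up for illustration; they are not taken from any published format.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CompanyFile:
    """Hypothetical shape for the three jobs; not a published format."""
    name: str
    description: str            # job 1: who you are, in your words
    as_of: date                 # job 1: when that description was true
    canonical: dict[str, str]   # job 2: topic -> canonical URL
    superseded: dict[str, str] = field(default_factory=dict)  # job 2: old URL -> replacement
    cite: str = ""              # job 3: the URL you want attributed


def render(record: CompanyFile) -> str:
    """Render the record as plain text, one section per job."""
    lines = [f"# {record.name}", f"As of: {record.as_of.isoformat()}", ""]
    lines += [record.description, "", "## Canonical"]
    lines += [f"- {topic}: {url}" for topic, url in record.canonical.items()]
    if record.superseded:
        lines += ["", "## Superseded"]
        lines += [f"- {old} -> {new}" for old, new in record.superseded.items()]
    lines += ["", "## Cite", f"- {record.cite}"]
    return "\n".join(lines) + "\n"


example = CompanyFile(
    name="Example Co",
    description="Example Co does X for Y.",
    as_of=date(2025, 12, 5),
    canonical={"pricing": "https://example.com/pricing"},
    superseded={"https://example.com/blog/old-pricing": "https://example.com/pricing"},
    cite="https://example.com/about",
)
text = render(example)
print(text)
```

The exact layout matters less than the fact that each job has one unambiguous place in the file.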

Let’s explain why each of the three is load-bearing.

The self-description matters because the alternative is the model writing your description for you. If you don’t say plainly “we are a company that does X for Y,” the model infers it from headlines and meta tags and the tone of a launch post, and inference drifts. A model that has to guess what you are will guess, and it will guess differently for different customers asking slightly different questions.

The canonical pointer matters because it resolves the contradiction problem at the source. When your pricing page and your old blog post disagree, the file is the tiebreaker: this URL is canonical, that one is superseded. Now the model isn’t averaging two sources of unknown age. It has a declared winner. The contradiction that used to produce a wrong, confident answer produces the right one, because you told the reader which of your own voices to believe.
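The tiebreaker can be sketched from the reader's side. This is an illustration only: the function name, the data layout, and every price and URL below are invented, and nothing here describes how any particular model actually resolves conflicts.

```python
from typing import Optional

# Hypothetical declaration published by the site: topic -> canonical URL.
canonical = {"pricing": "https://example.com/pricing"}

# What a reader found on the site: (topic, source URL, claim). Values are made up.
found = [
    ("pricing", "https://example.com/blog/launch-post", "$49 per month"),
    ("pricing", "https://example.com/pricing", "$79 per month"),
    ("pricing", "https://example.com/faq", "$59 per month"),
]


def resolve(
    topic: str,
    claims: list[tuple[str, str, str]],
    declared: dict[str, str],
) -> Optional[str]:
    """Return the claim from the declared canonical URL, or None if there is none."""
    winner = declared.get(topic)
    for claim_topic, url, claim in claims:
        if claim_topic == topic and url == winner:
            return claim
    return None


pricing = resolve("pricing", found, canonical)
undeclared = resolve("headcount", found, canonical)
print(pricing, undeclared)
```

With a declaration, three conflicting claims collapse to one. Without it, the function has nothing to return, which is the position a reader is in today.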

The citation directive matters because attribution is the new click. When a model answers “use this company for X,” the valuable thing isn’t a visit. It’s the citation: the named source the customer can verify and the model can return to. If you don’t say where to point, the model points wherever its training happened to anchor: a review site, a competitor’s comparison page, a three-year-old press release. The file claims the citation back.

Canonical is a verb, not a tag

There’s a trap in the canonical pointer, and it’s the same trap that made the naive content fix fail. People treat “canonical” as a thing you declare once and forget: a tag you set, a line you write, done. It isn’t. Canonical is a thing you maintain, because the whole point is that it stays true as everything else changes.

Picture the failure. Suppose you write the file once, on a good day, and it’s perfect. It names the company, points to the real pricing, lists the current product. Then, over the next few months, the pricing changes, a product gets renamed, a story gets retired. Nobody updates the file. Now the file, the one source you told every model to trust above all others, is the most confidently wrong document on your site. You didn’t just fail to help the reader. You handed it a forgery with your signature on it.

This is why having the file is necessary, and why a static file written by hand is not sufficient. The file has to be generated from the truth, not transcribed from it. The canonical pricing should come from wherever the real price lives. The product list should come from the real product list. The “as of” date should be today, automatically, so a model can see how fresh the claim is and weight it. A file that’s hand-edited will rot the same way a CRM rots: the moment maintaining it depends on someone remembering to.
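“Generated, not transcribed” can be sketched as a renderer that takes lookups instead of values. The function name, the output layout, and the two lambdas standing in for a billing system and a product catalog are all hypothetical.

```python
from datetime import date
from typing import Callable, Optional


def render_pricing_block(
    get_price: Callable[[], str],
    get_products: Callable[[], list[str]],
    today: Optional[date] = None,
) -> str:
    """Build the text from live lookups so nothing is typed in by hand."""
    as_of = (today or date.today()).isoformat()  # defaults to the day of rendering
    lines = [f"As of: {as_of}", f"Price: {get_price()}", "Products:"]
    lines += [f"- {name}" for name in get_products()]
    return "\n".join(lines)


# Stand-ins for wherever the real price and product list live (hypothetical).
block = render_pricing_block(
    get_price=lambda: "$79 per month",
    get_products=lambda: ["Example Basic", "Example Pro"],
    today=date(2025, 12, 5),  # pinned only to keep the example reproducible
)
print(block)
```

Because the renderer never holds a price of its own, it cannot disagree with the system that does.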

A self-description nobody updates is worse than none. It’s a wrong answer the reader trusts more than the right one.

The discipline, then, isn’t “write an llms.txt.” It’s “make the file a render of your current canonical state, dated, and re-rendered whenever that state changes.” That’s the difference between a file that helps a model represent you and a file that quietly poisons every answer about you for months.
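One way to approximate “re-rendered whenever that state changes” is to fingerprint the canonical state and compare it with the fingerprint the published file was rendered from. This is a sketch under stated assumptions: the state is a JSON-serializable dict, and the names and values are invented.

```python
import hashlib
import json
from typing import Optional


def fingerprint(state: dict) -> str:
    """Hash the canonical state; key order does not affect the result."""
    payload = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_rerender(state: dict, last_rendered: Optional[str]) -> bool:
    """True when the state differs from what the published file was built from."""
    return fingerprint(state) != last_rendered


state = {"price": "$79 per month", "products": ["Example Basic", "Example Pro"]}
published = fingerprint(state)  # stored alongside the rendered file

unchanged = needs_rerender(state, published)
state["price"] = "$89 per month"  # the truth moves
changed = needs_rerender(state, published)
print(unchanged, changed)
```

A check like this could run on a schedule or on every write to the source of truth. Either way, nobody has to remember to update the file.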

Figure: The naive way: a human writes llms.txt once by hand, the real pricing and product change underneath it, and the file becomes a confidently wrong source the model trusts. The better way: the file is rendered from the canonical company state on every change, dated as-of today, so what the model reads is always current.

Who keeps the file true

Notice what this quietly requires. To render a citation-ready, dated, canonical description of your company on every change, something has to know what your company currently is, across pricing, products, positioning, the stories you tell, and notice when any of it moves.

That’s not a content task. It’s a systems task. The reason hand-written llms.txt files rot is the same reason hand-filled CRMs rot and hand-updated status pages lie: a fact that lives in someone’s memory of “I should update that” is a fact that’s already stale. The file stays true only if the company has a single place that holds what’s canonical (a company brain) and the file is a view onto it, not a copy of it.

That reframes the whole problem. The question was never “how do I write a good llms.txt.” It’s “does my company have a source of truth coherent enough that a model could read it and get me right.” If the answer is no, the file just exposes the incoherence to every reader that matters. If the answer is yes, the file practically writes itself, and re-writes itself, because it’s a projection of something already true.

The turn: you’re being read whether you publish or not

Here’s the part that isn’t about a file.

Right now, today, a model somewhere is answering a question about your company. It’s telling a prospective customer what you do, what you charge, whether you’re the right fit, how you compare. It’s doing this from whatever it could find, ranked however it ranks, dated never. You weren’t asked. You weren’t shown the answer. And the customer who heard it may have already decided.

That’s the uncomfortable shift. For thirty years, being read by a machine was an opt-in you controlled: a crawler you could allow or block, a result you could optimize. Now you’re being read by default, summarized by default, represented by default, and the only question is whether the reader has anything from you to read or has to make it up. Silence isn’t neutral. Silence is the model filling the gap with the most confident source it can find, which is rarely you.

The file is small. The discipline behind it is the whole company being legible to a reader you’ll never meet, kept current by something that doesn’t forget. Search crawlers obeyed a file at the root of your site for thirty years; the models reading your site for customers now have no such file, and writing it is the cheapest way to stop being described by strangers.

That’s part of what we’re building at Apollo Space: a company brain coherent enough that the file telling models who you are can be a render of the truth, dated and current, instead of a hand-typed snapshot that’s wrong by Friday. If you’ve ever read what a model says about your company and winced, you already know the problem isn’t the model. It’s that nobody handed it the right page.
