Your agent should be able to see the screen

The last mile of 'do everything' is the software with no API. An agent that can see the screen and click reaches the work locked behind UIs that integrations will never touch.

Apollo Space Research

Apollo Space

· 12 min read

There is a state portal a company files through every quarter. It has no API. It has no export. It has a login, a dropdown that doesn’t open until the one above it is set, a file picker, and a submit button that’s grey until three other fields are green. A person does it in twenty minutes, swears the whole time, and forgets how until next quarter.

Nearly every agent platform on the market can read that company’s email, update its CRM, and draft its contracts. Almost none of them can file that form, because there’s nothing to call. The work isn’t hard. It’s just locked behind a screen no integration was ever built for.

That portal is not an edge case. It’s the shape of a third of the work in most companies.

The thesis: an agent that can’t see the screen can’t do the work that has no API

Here’s the line this whole post orbits. The last mile of “do everything” is the software with no API, and an agent reaches it only when it can see the screen and click. Everything else is integrations, and integrations stop exactly where the connector ends.

The promise people sell when they say “AI will do the work” quietly assumes the work has a door the machine can open: a REST endpoint, a webhook, a Zapier tile. For a lot of modern SaaS, it does. But the moment the work lives in a county records site, a decade-old ERP, a supplier’s bespoke vendor portal, or a desktop app from 2014, the door isn’t locked. There’s no door. There’s only a window, the screen, and until the agent can look through it and reach in, that work stays manual.

The mechanism that fixes this isn’t a smarter model. It’s a sense the model didn’t have. Here’s why the obvious approach runs out of road, and what replaces it.

The naive version: wire up an API for everything

The clean way to make an agent act on the world is to give it APIs. You connect the email, the calendar, the CRM, the docs, the project tracker. Each connection is a contract: here are the things you can read, here are the things you can do, here’s the shape of the data. The agent calls a function, the system does the thing, a structured result comes back. It’s reliable, it’s auditable, it’s fast. When it works, nothing beats it.
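To make the idea of a connection as a contract concrete, here is a minimal sketch in Python. The `Connector` and `Action` classes, the `update_contact` action, and the field names are all hypothetical, invented for illustration; no real connector framework is being described.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class Action:
    """One operation a connector exposes: a name, required fields, a handler."""
    name: str
    required_fields: Tuple[str, ...]
    handler: Callable[[Dict[str, Any]], Dict[str, Any]]


class Connector:
    """Hypothetical connector: the agent can only do what is registered here."""

    def __init__(self, tool: str) -> None:
        self.tool = tool
        self._actions: Dict[str, Action] = {}

    def register(self, action: Action) -> None:
        self._actions[action.name] = action

    def call(self, name: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        if name not in self._actions:
            # No contract for this action means no way to perform it.
            return {"ok": False, "error": f"{self.tool} has no action '{name}'"}
        action = self._actions[name]
        missing = [f for f in action.required_fields if f not in payload]
        if missing:
            return {"ok": False, "error": f"missing fields: {missing}"}
        # A structured result comes back, which is what makes this auditable.
        return {"ok": True, "result": action.handler(payload)}


crm = Connector("crm")
crm.register(Action(
    name="update_contact",
    required_fields=("contact_id", "email"),
    handler=lambda p: {"contact_id": p["contact_id"], "email": p["email"]},
))

updated = crm.call("update_contact", {"contact_id": 7, "email": "a@example.com"})
unsupported = crm.call("renew_business_license", {})
```

The second call is the interesting one: the connector can only refuse, because nobody registered that action.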

So you connect the ten tools the company lives in, and for a glorious week it feels like you’ve automated everything.

Then someone asks the agent to renew the business license on the state portal. And there’s no connector. There was never going to be a connector. A government records site serving four counties does not ship an API, and no integrations team on earth is going to build and maintain one for it. The agent, the one that just drafted a flawless contract, sits there unable to click a button a temp could click on their first day.

This is the wall, and it’s worth being precise about why it’s a wall and not a gap you can close. An API has to be built for an agent by the owner of the software. That’s the dependency. The CRM exposes an API because the CRM vendor decided to. The state portal exposes nothing because nobody decided to, and you can’t make them decide. The integration model gives the agent reach exactly as wide as other people’s generosity. Most of the long tail of business software was never generous, and never will be.

Count the surfaces a real company actually touches in a week: the modern SaaS with clean APIs, yes, but also the legacy internal tool with a login and no docs, the supplier portal that’s different for every supplier, the desktop accounting app, the bank’s web interface, the regulator’s filing site. Imagine the integration approach covers the first bucket cleanly and a thin slice of the second. The rest, call it the part of the work that lives behind a UI and nothing else, is invisible to it. Not slow. Not flaky. Invisible.

The bottleneck didn’t disappear when you added connectors. It moved to everything the connectors don’t reach.

Figure: Two ways an agent reaches the world. On the left, the API lane: clean connectors to email, CRM, and modern SaaS, fast and reliable, but stopping dead at any tool the vendor never exposed. On the right, what lies past that dead end: a state portal, a legacy ERP, and a supplier portal with no API and no export, only a screen. This is the third of the work the integration model cannot see.

Why “just build the integration” loses the long tail

There’s a tempting answer to the wall: fine, we’ll build the integrations the vendors didn’t. Scrape the portal, reverse-engineer the form, write a custom adapter for each holdout. Teams do this. It works for the first few. Then the economics catch up.

Each custom adapter is a small piece of software that targets one site’s exact HTML, one form’s exact field order, one login flow’s exact quirks. Then the site redesigns and your adapter breaks silently. Unannounced redesigns are the norm for software with no API, because there’s no API contract to keep stable. You’re now maintaining a fleet of brittle scrapers, one per holdout, each a standing liability that fails the next time someone moves a button. The long tail is long precisely because each tool is used by too few companies to justify anyone maintaining a clean interface to it. Building one bespoke adapter per tool just relocates that same uneconomical maintenance onto you.
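A small sketch of what “breaks silently” can look like, using only Python’s standard `html.parser`. The two page snippets and the field names are made up for illustration; the adapter assumes, as brittle adapters tend to, that the license number is always the second input in the form.

```python
from html.parser import HTMLParser


class InputCollector(HTMLParser):
    """Collects the name attribute of every <input> in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.input_names = []

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            self.input_names.append(dict(attrs).get("name"))


def license_field_by_position(html: str) -> str:
    """Brittle adapter: assumes the license number is always the second input."""
    collector = InputCollector()
    collector.feed(html)
    return collector.input_names[1]


# Hypothetical markup before and after an unannounced redesign:
# same fields, different order.
OLD_PAGE = ('<form><input name="company"><input name="license_no">'
            '<input name="year"></form>')
NEW_PAGE = ('<form><input name="year"><input name="company">'
            '<input name="license_no"></form>')

before = license_field_by_position(OLD_PAGE)  # the intended field
after = license_field_by_position(NEW_PAGE)   # a different field, no exception
```

Nothing raises after the redesign. The adapter simply starts typing the license number into the company field, which is worse than a crash.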

The naive integration loses the long tail because the long tail is defined by being not worth integrating, one tool at a time.

So the question changes. You stop asking “how do we build an interface to each of these tools?” and start asking “what interface do all of these tools already share?” And there is one. Every one of them (the portal, the ERP, the desktop app, the bank site) already exposes the exact same interface, the one they built for the humans who use them every day. The screen. The pointer. The keyboard.

That’s the interface the agent should target. Not a thousand custom APIs. The one universal API that every piece of software on earth already ships: its own UI.

Our version: give the agent eyes

The idea is simple to state and most of the work is in taking it seriously. An agent that can see the screen and click reaches the work that has no API, because it stops needing the software’s permission and starts using the software the way a person does.

Concretely, the agent gets a sense it didn’t have. It can take a screenshot, look at it, and understand what’s on it: this is a login form, that’s the dropdown, this grey button is disabled, that red text is a validation error. From understanding, it acts: move the pointer here, click, type into that field, press tab, wait for the page to settle, look again. See, decide, act, see again. It’s the same loop a person runs without noticing they’re running it.

This is the part the naive approach structurally could not do, so it’s worth being concrete about what it unlocks. The agent no longer asks “does this tool have an endpoint I’m allowed to call?” It asks “can I see this screen?”, and the answer is yes for every tool a human can use, because if a human can use it, it renders to a screen. The dependency on the vendor’s generosity is gone. The interface is the one the vendor already shipped to its own users and has every reason to keep recognizable: the markup underneath may shift without notice, but users would revolt if the login form stopped looking like a login form.

Two properties make this work in practice rather than in a demo, and both come straight from the failures above.

The first is that vision degrades gracefully where adapters shatter. A scraper keyed to an exact HTML structure breaks completely when a field moves, because it was reading the page’s invisible skeleton. An agent that reads the rendered screen the way a person does sees that the “Submit” button moved three inches down and clicks it where it is now, the same way you would, without even noticing the redesign. It’s reading the meaning of the pixels, not the brittle markup underneath. The redesign that kills the adapter is a non-event to the eyes.
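One way to picture the difference is to look up a control by what it says rather than where it sits. The sketch below assumes some vision step has already turned a screenshot into a list of labeled elements with coordinates; that step, the `VisibleElement` type, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VisibleElement:
    """A control as it appears on screen: its visible label and position."""
    label: str
    x: int
    y: int


def find_by_label(elements: List[VisibleElement], label: str) -> Optional[VisibleElement]:
    """Locate a control by its visible text, wherever it currently sits."""
    wanted = label.strip().lower()
    for element in elements:
        if element.label.strip().lower() == wanted:
            return element
    return None


# Hypothetical readings of the same form before and after a redesign.
before_redesign = [
    VisibleElement("License number", 120, 200),
    VisibleElement("Submit", 120, 400),
]
after_redesign = [
    VisibleElement("Submit", 120, 640),
    VisibleElement("License number", 120, 200),
]

old_target = find_by_label(before_redesign, "Submit")
new_target = find_by_label(after_redesign, "Submit")
```

The lookup returns the button at its new position without any change to the code, which is the graceful degradation described above. A real system would need fuzzier matching than exact label equality.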

The second is that the agent operates under the same permissions as the person it works for. It doesn’t need a special integration credential the vendor never issued. It logs in the way the employee logs in, sees what the employee sees, can do what the employee is allowed to do, and nothing more. The boundary of its reach is the boundary of the account it’s using, which is exactly the boundary you already understand and already control.

So the loop, drawn plainly, is: the agent looks at the screen, decides the next action, takes it, and looks again to confirm what happened before deciding the next one. It doesn’t fire a blind sequence of clicks and hope. It checks its own work at every step, because the screen tells it whether the last action landed: the dropdown opened or it didn’t, the field turned green or threw an error. Acting and verifying rely on the same sense.

Figure: The see-decide-act-verify loop that lets an agent operate software with no API. The agent captures the screen, reads what is actually rendered, decides the next action, performs it as a human would, and then looks again to confirm the action landed before continuing, checking its own work at every step instead of firing blind clicks.
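The loop can be sketched in a few lines. To keep the example runnable without a browser or a vision model, the screen here is a toy in-memory form (`FakePortal`, a hypothetical stand-in) that behaves like the portal from the opening: the second dropdown stays disabled until the first is set, and submit stays grey until both are filled. The control names and values are placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FakePortal:
    """Toy stand-in for a real screen; not a real portal or library."""
    values: Dict[str, str] = field(default_factory=dict)

    def observe(self) -> Dict[str, str]:
        """What a screen reader would report: each control's visible state."""
        state = "set" if "state" in self.values else "empty"
        if "state" not in self.values:
            county = "disabled"
        else:
            county = "set" if "county" in self.values else "empty"
        submit = "enabled" if county == "set" else "disabled"
        return {"state": state, "county": county, "submit": submit}

    def act(self, control: str, value: str) -> None:
        """Click and type; silently does nothing if the control is disabled."""
        if self.observe().get(control) != "disabled":
            self.values[control] = value


def decide(screen: Dict[str, str], goal: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Pick the first goal control that is empty and currently usable."""
    for control, value in goal.items():
        if screen[control] == "empty":
            return control, value
    return None


def run(portal: FakePortal, goal: Dict[str, str], max_steps: int = 10) -> List[Tuple[str, bool]]:
    log = []
    for _ in range(max_steps):
        screen = portal.observe()                     # see
        step = decide(screen, goal)                   # decide
        if step is None:
            break
        control, value = step
        portal.act(control, value)                    # act
        landed = portal.observe()[control] == "set"   # verify by looking again
        log.append((control, landed))
        if not landed:
            break                                     # stop instead of clicking blind
    return log


portal = FakePortal()
# The goal lists county first on purpose; the loop still fills state first,
# because the screen shows county as disabled.
log = run(portal, {"county": "County A", "state": "State X"})
```

Note that the order of actions comes from what the screen shows, not from a script: the agent never tries the disabled dropdown, and it stops if an action fails to land.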

Eyes are a fallback, not a replacement

It would be a mistake to read this as “screens beat APIs.” They don’t. When a clean API exists, it wins every time: it’s faster, it returns structured data, it doesn’t have to read pixels, and it can’t misclick. An agent should always prefer the endpoint when there is one.

The point is the order of fallback. Reach for the API first. When there’s no API, reach for the export. When there’s no export, reach for the screen. Most agent platforms stop at step one and call the rest impossible. The whole argument of this post is that step three is where the locked work lives, and the screen is what unlocks it.
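The order of fallback is simple enough to write down directly. This is a sketch under assumptions: the `Channel` names and the tool registry are illustrative, and a real system would discover what each tool supports rather than hard-code it.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Channel(Enum):
    API = "api"
    EXPORT = "export"
    SCREEN = "screen"


# Order of preference from the text: API first, then export, then the screen.
PREFERENCE = (Channel.API, Channel.EXPORT, Channel.SCREEN)


def choose_channel(available: FrozenSet[Channel]) -> Channel:
    """Return the most preferred channel the tool actually offers."""
    for channel in PREFERENCE:
        if channel in available:
            return channel
    raise LookupError("no way to reach this tool")


# Hypothetical registry of what each tool exposes.
TOOLS: Dict[str, FrozenSet[Channel]] = {
    "crm": frozenset({Channel.API, Channel.EXPORT, Channel.SCREEN}),
    "legacy_erp": frozenset({Channel.EXPORT, Channel.SCREEN}),
    "state_portal": frozenset({Channel.SCREEN}),
}

plan = {tool: choose_channel(channels).value for tool, channels in TOOLS.items()}
```

A platform that stops at step one is, in effect, running this function with `PREFERENCE` truncated to a single entry, so the last two tools raise instead of resolving.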

Put the two senses together and the agent’s reach becomes the union, not the intersection: everything with an API, plus everything a human can operate. That union is what “do everything” actually requires. Without the eyes, “do everything” was always quietly “do everything that came with a connector”, which is a much smaller, much more comfortable promise.

This is also where the honest caution lives, because pixels are a less precise interface than a function call. An agent acting on a real screen, with real submit buttons, needs a brake: a moment of human confirmation before it files the form, sends the wire, or clicks the irreversible thing. Seeing the screen is what gives the agent reach. A human in the loop on the consequential click is what makes that reach safe to grant. The two ship together or neither should ship at all.
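A minimal sketch of that brake, assuming each proposed action carries a flag saying whether it can be undone. How an action gets classified as irreversible, and how the human is actually asked, are left out; `ProposedAction`, `execute`, and the action names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ProposedAction:
    name: str
    description: str
    irreversible: bool


def execute(
    action: ProposedAction,
    perform: Callable[[ProposedAction], None],
    confirm: Callable[[ProposedAction], bool],
) -> str:
    """Run reversible actions directly; hold irreversible ones for a human."""
    if action.irreversible and not confirm(action):
        return "held for review"
    perform(action)
    return "performed"


performed: List[str] = []


def perform(action: ProposedAction) -> None:
    performed.append(action.name)


fill = ProposedAction("fill_field", "Type the license number", irreversible=False)
submit = ProposedAction("submit_filing", "Click the final Submit", irreversible=True)

# Stand-ins for a human reviewer who declines, then approves.
r1 = execute(fill, perform, confirm=lambda a: False)
r2 = execute(submit, perform, confirm=lambda a: False)
r3 = execute(submit, perform, confirm=lambda a: True)
```

The agent fills fields freely, but the consequential click only happens once a person has said yes.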

The turn: the work nobody automates is the work nobody wants

Step back from the mechanism, because the mechanism isn’t the point. The point is who’s been doing this work, and what it’s been costing them.

Right now, in most companies, the screens with no API get operated by hand, and it’s almost always a person whose time is worth far more than the task. The operator who logs into the state portal every quarter. The finance person who re-keys numbers into the desktop app the auditor insists on. The account manager who fills out a different supplier portal for every supplier, each one a maze of dropdowns. This is the work that never made it into anyone’s automation roadmap, precisely because it had no API to automate against, so it fell to whoever was sitting closest, forever.

That work feels like the cost of doing business. It’s actually the most automatable work in the building, hiding behind the one barrier that made it look impossible. The barrier was never that the task was hard. A temp learns it in a morning. The barrier was that software could only act through doors other people built, and nobody built a door to the county records site. Give the software eyes and the barrier is gone. The task was always simple; it was just unreachable.

The promise isn’t that an agent will do clever work humans can’t. It’s that an agent will finally do the dull, locked, no-API work humans shouldn’t have to do (the quarterly filing, the portal slog, the re-keying), so the person who’s been stuck doing it gets their mornings back for the work that actually needed a human.


That’s what we’re building toward at Apollo Space: not an agent that’s only as capable as the connectors it was handed, but one that reaches the last mile, the software with no API, by seeing the screen and clicking, the same way the people there already do. If your company runs on even one portal that nobody ever built an integration for, you already know exactly which screen we mean.
