Engineering

Most of your company is trapped in pixels

The work that matters lives in screenshots, scanned PDFs, and charts no API exposes, and reading those pixels is a different job from clicking them, harder and far more useful.


Apollo Space Research

Apollo Space

· 11 min read

A vendor sends an invoice as a scanned PDF. The total you owe is right there, in 40-point type, and your software cannot read a single digit of it. To the machine it isn’t a number. It’s a rectangle of grey. The amount that decides whether you pay this week or next is sitting in plain sight, and every integration you own walks straight past it.

That gap, between what a human sees instantly and what software can address, is where most of the real work in a company hides.

Here’s the line this whole post orbits: the work you need is locked in the pixels, and no API will hand it to you. The naive fix is to wait for an integration that never comes. The better fix is an agent that does what you do: it looks at the screen and reads.

Two different jobs people keep confusing

There’s a version of “your agent can use the screen” that’s been getting a lot of attention, and it’s the wrong one to start with. That version is about clicking. Move the cursor, fill the form, press the button, drive the software the way a person’s hands would. It’s a real capability and it matters. But it is not the bottleneck.

The bottleneck is the other job: reading. Looking at a chart and knowing the line went down. Looking at a scanned contract and pulling the date out of clause nineteen. Looking at a dashboard screenshot pasted into a chat and answering the question someone asked about it.

Clicking is about acting on a surface. Reading is about extracting meaning from one. And reading is where the value is, because the meaning is the part nobody bothered to expose.

Think about how much of your company already lives this way. The invoice that arrives as an image. The competitor’s pricing page that’s a screenshot in a Slack thread. The quarterly chart in a slide deck that exists nowhere as a row of numbers. The handwritten note photographed on a phone. The legacy report your bank emails as a locked PDF. Each of these is a fact your business needs, sitting behind a wall of pixels, with no endpoint to call.

The work you need is locked in the pixels, and no API will hand it to you. So the first thing to build isn’t a better clicker. It’s a better reader.

The naive way: wait for the integration

The instinct, when a system can’t read something, is to go build the pipe. Find the vendor’s API. Negotiate the export. Write the connector that turns the document into structured data at the source, so the agent never has to look at a picture at all.

It’s a reasonable instinct, and for a small set of systems it’s the right one. If a clean API exists, use it; pixels are the fallback, not the goal.
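As a rough sketch of that preference order, here is one way it might look in Python. The names `get_fact`, `api_lookup`, and `pixel_reader` are hypothetical stand-ins invented for illustration, not a real Apollo interface: the first represents a system that does expose the fact, the second represents whatever reads the rendered view.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Fact:
    name: str
    value: str
    source: str  # "api" or "pixels", so downstream checks know where it came from


def get_fact(
    name: str,
    api_lookup: Optional[Callable[[str], Optional[str]]],
    pixel_reader: Callable[[str], Optional[str]],
) -> Optional[Fact]:
    """Prefer structured data when it exists; fall back to reading the view."""
    # Hypothetical: api_lookup is None when the system has no API at all.
    if api_lookup is not None:
        value = api_lookup(name)
        if value is not None:
            return Fact(name, value, "api")
    # No pipe, or the pipe does not carry this fact: read the rendered view.
    value = pixel_reader(name)
    if value is not None:
        return Fact(name, value, "pixels")
    # Neither source produced the fact; say so rather than inventing one.
    return None
```

Recording the source on every fact is a design choice of this sketch: it lets a later check treat a pixel-derived number with more suspicion than one that came straight from an export.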

But the instinct fails the moment you count how many systems don’t have that pipe. The invoice from a supplier whose accounting software is from 2009. The chart your own analytics tool renders but won’t let you export underneath. The contract a counterparty sends as a flattened scan precisely so you can’t easily extract from it. The internal dashboard that has no export button because nobody ever asked for one. There is no integration to wait for, because the data was never offered as data. It was offered as a view.

And the pain compounds. Every “we’ll integrate that later” is a fact your agents can’t see, which means a question they can’t answer, which means a human back in the loop doing the reading by hand: squinting at the PDF, retyping the total, eyeballing the chart, transcribing the photo. The integration backlog isn’t a list of nice-to-haves. It’s the exact shape of the work that fell back onto people.

An integration is a fact someone agreed to expose. Most facts, nobody agreed.

So waiting for the pipe doesn’t fail because it’s slow. It fails because for most of the screen, the pipe is never coming. The view is the interface. You either teach the machine to read the view, or you keep a human doing it forever.

Two lanes for getting a fact out of a document. The naive lane waits for an API that for most documents never arrives, so a human ends up reading the scan by hand. The Apollo lane treats the rendered view itself as the source, reads the pixels directly, and returns the same structured fact without a connector.

The Apollo way: treat the view as the source

So we flip the assumption. Instead of asking “what API exposes this fact,” we ask “what does a person see when they look at this, and can the agent see the same thing.” The rendered view (the screenshot, the page, the scan, the chart) becomes a first-class input, the same way a row from a database is.

The key idea is simple: a document the human can read is a document the agent should be able to read. It’s harder than it sounds, though, and it’s worth being precise about what makes it actually work.

Reading pixels well is not one capability. It’s three, stacked, and skipping any of them gives you a confident wrong answer.

First, see it faithfully. The agent has to take the image as it is (the skew of a phone photo, the compression of a screenshot, the two-column layout of a report) and resolve what’s actually on it. Not a guess from the file name. Not the first half before it got lazy. The whole surface, including the small print, because the small print is usually the part that matters.

Second, find the fact, not just the words. Reading the words off an invoice is the easy half. The hard half is knowing that this number is the total and that number is the tax and the third one is an account code you don’t care about. A chart isn’t a list of words at all; it’s a shape, and the fact is “the trend reversed in the third quarter,” which appears nowhere as text. Extraction is interpretation, and interpretation is where a naive reader fabricates.

Third, refuse to guess. This is the one everyone skips, and it’s the one that makes the whole thing safe to use. When the scan is too blurry to be sure whether the digit is a 3 or an 8, the only acceptable answer is “I can’t read that one, here’s the crop, you confirm.” A reader that always returns a number is worse than useless, because it returns a plausible number, and a plausible wrong number in an invoice is how you pay the wrong amount with full confidence. The brake matters more than the read.

Stack those three and you get something that behaves less like an OCR library and more like a careful assistant: it looks, it pulls the fact you needed, and it tells you when it isn’t sure instead of bluffing.
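The third layer, the refusal, is the easiest to show in code. The sketch below assumes a hypothetical upstream vision step has already produced labelled readings with a confidence score and a crop region; the `Reading` and `Result` types, the `extract` function, and the 0.9 threshold are all illustrative assumptions, not a description of how Apollo implements it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in image pixels


@dataclass
class Reading:
    label: str         # interpreted role, e.g. "total" or "tax"
    text: str          # what the reader believes the pixels say
    confidence: float  # 0.0 to 1.0, as reported by the hypothetical reader
    crop: Box          # where on the image the reading came from


@dataclass
class Result:
    label: str
    value: Optional[str]  # None means "could not read this one"
    needs_review: bool
    crop: Optional[Box]   # attached so a human can confirm quickly


def extract(readings: List[Reading], wanted: List[str],
            threshold: float = 0.9) -> List[Result]:
    """Return a value only when the reading is confident; otherwise flag it."""
    by_label = {r.label: r for r in readings}
    results = []
    for label in wanted:
        reading = by_label.get(label)
        if reading is None:
            # The fact was not found on the surface at all.
            results.append(Result(label, None, True, None))
        elif reading.confidence < threshold:
            # Too uncertain (a 3 or an 8?): hand back the crop, not a guess.
            results.append(Result(label, None, True, reading.crop))
        else:
            results.append(Result(label, reading.text, False, reading.crop))
    return results
```

Note that a low-confidence reading returns no value at all rather than a value marked as doubtful. That is deliberate in this sketch: a plausible number with a warning attached still tends to get used.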

Reading is not clicking, and conflating them is a trap

It’s worth being sharp about why these two jobs have to stay separate, because the temptation is to mash them into one “computer-use” capability and call it done.

The naive merge says: the agent has eyes and hands, so let it look at the screen and also operate the screen, all in one loop. It demos beautifully. Then it meets reality, and reality is that acting on a surface you’ve half-misread is how you do real damage. Misread a chart and your summary is wrong: embarrassing, but recoverable. Misread a chart and then click submit on a decision based on it, and that’s a wrong action taken with conviction, one the surface won’t undo for you.

Reading is observation. Worst case, it’s wrong and you catch it. Clicking is mutation. Worst case, it’s wrong and it already happened.

So we keep them as two ladders, not one. Reading comes first and stands alone: extract the fact, return it, let a human or a downstream check use it. Clicking is a separate, more guarded capability that only earns its place once the reading underneath it is trustworthy, and even then, the act of pressing a button that spends money or sends a message routes through a confirmation, never straight off a glance at the pixels.

The point isn’t that clicking is bad. It’s that reading is the foundation and clicking is the floor built on top, and a floor on no foundation falls through. Get an agent reading the screen reliably first. Earn the right to let it touch the screen second.
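One way to picture the separation is a gate that sits between an observation and any action built on it. The following is a minimal sketch under assumed names (`Observation`, `UnreadableSurface`, `act`, and the `confirm` callback are all hypothetical): reading produces an immutable observation, and the mutating step runs only if the reading was confident and a human confirmed it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Observation:
    """The output of the reading ladder: a fact, never an action."""
    fact: str
    value: str
    confident: bool


class UnreadableSurface(Exception):
    """Raised when an action is requested on top of an unconfirmed reading."""


def act(
    observation: Observation,
    action: Callable[[], str],
    confirm: Callable[[Observation], bool],
) -> str:
    """Run a mutating action only on a trusted reading and explicit confirmation."""
    if not observation.confident:
        # No foundation, no floor: refuse before anything can be clicked.
        raise UnreadableSurface(f"cannot act on unconfirmed {observation.fact}")
    if not confirm(observation):
        # The human looked at what was read and said no.
        return "declined"
    return action()
```

In this sketch the confirmation is not optional even when the reading is confident, which mirrors the rule that a button which spends money or sends a message never fires straight off a glance at the pixels.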

The reading ladder versus the clicking ladder. Reading is three safe steps, see the surface faithfully, find the fact inside it, and flag anything it cannot confirm. Clicking is mutation built on top, gated by a human confirmation, and it only stands if the reading beneath it is trustworthy.

Where this shows up the day you turn it on

This stops being abstract the first time someone drops a screenshot into a chat and asks a question about it.

Suppose an operator pastes a picture of last month’s revenue chart and types “why did this dip.” A reactive tool answers about the text in the message and ignores the image entirely. A reader looks at the chart, sees the dip is in the third week, cross-references the period against what else the company knows happened that week, and answers the actual question. The fact lived in the picture. Nobody had to retype it into a form first.

Or take the invoice. Say a stack of supplier PDFs lands, half of them clean exports and half of them scans of scans. The clean ones, the agent reads in a blink. The blurry ones, it reads what it can and hands back the three line items it couldn’t confirm with the crops attached, so a human spends thirty seconds on the genuinely ambiguous digits instead of thirty minutes retyping the whole batch. The human’s attention goes only to the part that needed a human.
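At the batch level, that triage could be sketched roughly as follows. The input shape (a dict of documents, each mapping a field name to a text and confidence pair), the `triage` function, and the 0.9 threshold are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical input: document id -> field name -> (text, confidence).
Batch = Dict[str, Dict[str, Tuple[str, float]]]


def triage(
    batch: Batch, threshold: float = 0.9
) -> Tuple[Dict[str, Dict[str, str]], List[Tuple[str, str]]]:
    """Split a batch into confidently read fields and a short human review queue."""
    accepted: Dict[str, Dict[str, str]] = {}
    review: List[Tuple[str, str]] = []
    for doc_id, fields in batch.items():
        accepted[doc_id] = {}
        for field, (text, confidence) in fields.items():
            if confidence >= threshold:
                accepted[doc_id][field] = text
            else:
                # Only the ambiguous field goes to a person, not the whole document.
                review.append((doc_id, field))
    return accepted, review
```

The review queue holds individual fields rather than whole documents, so the person confirming them sees only the handful of items the reader could not settle.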

Or the competitor’s pricing page someone screenshots into a thread. No API, no export, deliberately so. The reader treats the screenshot as the source, pulls the tiers and the numbers, and the fact enters the company brain as data, searchable, comparable, alive, instead of dying as an image nobody can query.

In each case the shape is identical: a fact the business needed was trapped in a view, and the only thing standing between the company and that fact was whether the software could read pixels the way a person does. The work you need is locked in the pixels, and no API will hand it to you, but the view was never actually locked. It was just waiting for a reader.

The turn: the squint tax nobody puts on a budget

Here’s the part that isn’t about models or pipelines.

In every company right now, somebody is doing this reading by hand. Someone is squinting at the scanned invoice and typing the total into a field. Someone is eyeballing a chart in a deck and writing “down about ten percent, I think” in an email. Someone is photographing a whiteboard after a meeting and then, later, transcribing it because the photo isn’t searchable. None of that work has a name. It doesn’t show up on a roadmap. It’s just the friction of a business whose most important facts arrive as pictures.

That friction has a cost, and it’s a cruel one, because it lands hardest on your most capable people: the ones whose judgment you’re paying for end up spending their time on transcription instead. The squint tax is invisible precisely because everyone assumes it’s just part of the job. It was never part of the job. It was the absence of a reader.

The promise here isn’t a smarter chatbot. It’s that the facts trapped in the pixels (the invoice total, the chart’s trend, the clause’s date, the photographed note) get read once, by the machine, faithfully, with a flag on anything it can’t be sure of. Then the human stops being the eyes of the company and gets to go back to being the judgment.


That’s part of what we’re building at Apollo Space: not just an agent with hands that can click your software, but one with eyes that can read it, the screenshots and scans and charts included, and the honesty to say “I can’t make out this one, you look.” The most valuable thing in your company isn’t behind an API. It’s in plain sight, in pixels, waiting for something that can finally read it.

Apollo runs your company's repetitive ops so your team doesn't.

Join the waitlist for early access, founding-user pricing, and a front-row seat as we ship.
