How is a QA agent different from automated tests?

Automated tests verify that code behaves as expected based on predefined test cases. A QA agent actively explores the application, checks business rules against actual UI behavior, and identifies edge cases that no one wrote tests for.

How long does it take the QA agent to test a PR?

Typically 4-8 minutes for a standard PR, depending on the scope of changes. It runs in parallel with other CI checks, so it doesn't add wall-clock time to the pipeline.

The QA agent that catches what humans miss

Does the QA agent replace human QA engineers?

It replaces the repetitive parts of QA: regression testing, edge case checking, and cross-browser validation. Human QA engineers are still essential for exploratory testing, usability evaluation, and testing novel features where the expected behavior isn't well-defined.

Automated tests check if your code works. QA agents check if your product works. Here are the kinds of bugs Apollo Space's QA agent catches that routinely pass human review, CI pipelines, and unit tests.

Apollo Space Research

Apollo Space

August 23, 2025 · 13 min read

The Gap Between Tests Passing and Product Working

Every engineering team has lived through this moment: CI is green. Every test passes. Code review approved by two senior engineers. PR merged. Deployed to staging. Then someone opens the app on their phone and the navigation is broken.

Tests passed. The product didn’t work.

This is the gap that QA agents fill. Not the gap between “code compiles” and “code runs.” The gap between “code runs” and “the user experience is correct.”

Apollo Space’s QA agent operates on a simple premise: on every commit that touches frontend code, business logic, or API contracts, spin up the application and verify it against a set of business rules. Not unit tests. Not integration tests. Business rules: the things that matter to users.

A QA agent pointed at a real engineering workflow will routinely flag issues that passed human code review and every automated test, and a handful of those would have been serious production incidents. The five below are representative of the classes of bug it catches, each one the kind that slips through CI precisely because CI never checks for it. They’re illustrative examples rather than a specific incident log, but every pattern here is one a QA agent surfaces in practice.

Bug 1: The Login Redirect Loop

The change: A refactor of the authentication middleware to add SAML-based SSO support. The setup: Clean, well-structured, well-documented, with new unit tests covering the SSO flow. Approved by human reviewers. Every existing test still green.

The pattern: A QA agent reviewing this kind of change will typically catch something the reviewers won’t — users authenticating with email/password (the original auth method) caught in an infinite redirect loop. The login page redirects to /auth/callback, which checks for an SSO token, doesn’t find one, and redirects back to /login. Loop.

Why humans miss it: Code review focuses on the new SSO paths, because that’s what the PR is about. The existing email/password flow isn’t modified directly — the redirect loop is caused by a change in middleware ordering that affects all auth paths, not just SSO. The existing auth tests still pass because they mock the middleware layer. They test “does the auth function return the right token?” but not “does the browser end up on the right page?”

What the QA agent does: Its business rule for authentication is straightforward: “After submitting valid credentials on the login page, the user should arrive at the dashboard within a few seconds.” It submits valid email/password credentials, observes that the browser is still on the login page (caught in the redirect loop), and flags the PR.
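To make the rule concrete, here is a minimal sketch of the check with the browser swapped out for a lookup table. The router functions, the `final_page` helper, and the hop limit are all hypothetical stand-ins, not Apollo Space's implementation; a real agent would drive an actual browser session against the preview deployment.

```python
from typing import Callable, Optional

# Hypothetical stand-in for the app's routing: maps a path to the path it
# redirects to, or None when the page renders without redirecting.
Router = Callable[[str], Optional[str]]


def final_page(start: str, next_hop: Router, max_hops: int = 10) -> str:
    """Follow redirects from `start`; raise if a path repeats (a loop)."""
    seen = [start]
    current = start
    for _ in range(max_hops):
        target = next_hop(current)
        if target is None:
            return current  # page rendered, no further redirect
        if target in seen:
            raise RuntimeError("redirect loop: " + " -> ".join(seen + [target]))
        seen.append(target)
        current = target
    raise RuntimeError(f"gave up after {max_hops} redirects")


def broken_router(path: str) -> Optional[str]:
    # Models the bug described above: the callback finds no SSO token
    # and bounces the user back to the login page.
    return {"/login": "/auth/callback", "/auth/callback": "/login"}.get(path)


def fixed_router(path: str) -> Optional[str]:
    # Models the intended behavior: the callback lands on the dashboard.
    return {"/login": "/auth/callback", "/auth/callback": "/dashboard"}.get(path)


def check_login_rule(next_hop: Router) -> str:
    """Return 'pass' or a failure message for the login business rule."""
    try:
        landed = final_page("/login", next_hop)
    except RuntimeError as err:
        return f"fail: {err}"
    return "pass" if landed == "/dashboard" else f"fail: landed on {landed}"
```

The point of the sketch is that the check asks where the user ends up, not what the auth function returns, which is why mocked middleware can't hide the loop from it.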

Business impact: Ship this, and every non-SSO user — for most products, the overwhelming majority of the active base — is locked out. The fix is usually a small change to middleware ordering, shipped in minutes. Authentication bugs are among the highest-impact failures precisely because they block all user activity: a single broken login path can take out everyone. This is the definition of a P0.

Bug 2: The Currency Formatting Edge Case

The change: Multi-currency support, adding EUR, GBP, and BRL display alongside the existing USD formatting. The setup: A solid implementation, well-tested for the common cases, with new unit tests covering currency formatting. Approved.

The pattern: The kind of thing a QA agent flags here is a rendering inconsistency no unit test looks for — prices displayed as “EUR 1.000,50” on the pricing page when the locale is set to pt-BR. That’s technically correct for Brazilian Portuguese formatting: the comma is the decimal separator, the period the thousands separator. But if the same page also shows “USD 1,000.50” nearby (American formatting), the two number formats side by side on one page are genuinely confusing — “1.000,50” and “1,000.50” read like contradictions.

Why humans miss it: The reviewer tests the change in one locale, usually en-US. The unit tests cover formatting functions with hardcoded locale strings. Nobody loads the actual rendered page with a browser locale set to pt-BR while viewing mixed currencies.

What the QA agent does: The agent has a business rule: “All currency amounts on a single page must use consistent formatting conventions.” It loads the pricing page across several locale configurations (en-US, pt-BR, de-DE, ja-JP) and compares the formatting patterns across currencies on the same page, flagging the locales where two currencies render with clashing separator conventions.
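One way to picture the comparison step, as a rough sketch: pull every grouped amount out of the rendered page text and collect the separator conventions in use. The regular expression and function names are illustrative assumptions, and the pattern only recognizes amounts with thousands grouping and two decimal places, so a production check would need to cover more formats.

```python
import re

# Matches amounts such as "1,000.50" or "1.000,50": grouped thousands
# followed by a two-digit decimal part. Ungrouped amounts are ignored.
AMOUNT = re.compile(r"(?<!\d)\d{1,3}(?:([.,])\d{3})+([.,])\d{2}\b")


def separator_conventions(page_text: str) -> set[tuple[str, str]]:
    """Return the (thousands, decimal) separator pairs found in the text."""
    return {(m.group(1), m.group(2)) for m in AMOUNT.finditer(page_text)}


def is_consistent(page_text: str) -> bool:
    """True when every grouped amount on the page uses one convention."""
    return len(separator_conventions(page_text)) <= 1
```

Run against text containing both "EUR 1.000,50" and "USD 1,000.50", `separator_conventions` finds two different pairs, which is exactly the clash the rule forbids.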

Business impact: No crash, no data loss — just user confusion in exactly the international markets the feature was built to serve. The kind of bug that generates no error reports but quietly drives churn. Confusing billing and pricing displays are a well-documented reason customers contact support and lose trust.

Bug 3: The Broken Mobile Navigation

The change: A major navigation redesign with a new sidebar, new hierarchy, and new responsive breakpoints. The setup: Multiple rounds of review, a dozen screenshots across screen sizes attached, visual regression tests passing (they found no unexpected changes, because every change was expected; it was a redesign).

The pattern: A QA agent will catch what screenshots can’t — that in a narrow viewport band (say, 768px to 820px, covering iPad Mini and some Android tablets) the hamburger menu opens but the navigation items aren’t tappable. They’re visually present but sit under an invisible overlay div that intercepts touch events.

Why humans miss it: Engineers test the widths they always test — desktop (1440px), mobile (375px), standard tablet (1024px). A narrow band between those breakpoints goes unchecked. Visual regression confirms the nav looks correct, because it does look correct. The items are visible. They just aren’t interactive.

What the QA agent does: The agent’s business rule for navigation: “Every navigation link must be clickable and navigate to the correct page at all viewport widths from 320px to 1920px, tested in small increments.” Sweeping through the range, it finds that at 768px and 800px, clicking the “Dashboard” link doesn’t navigate to /dashboard, and flags the specific viewport range and the element blocking interaction.
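The sweep itself is simple to sketch. In the version below, `simulated_link_works` stands in for a real browser interaction (resize, tap, check the resulting URL), and the 16px step is an assumption chosen so the sweep lands on 768px and 800px; neither comes from Apollo Space's implementation.

```python
from typing import Callable


def failing_ranges(
    link_works: Callable[[int], bool],
    start: int = 320,
    stop: int = 1920,
    step: int = 16,
) -> list[tuple[int, int]]:
    """Sweep viewport widths and merge consecutive failures into ranges."""
    ranges: list[tuple[int, int]] = []
    previous_failed = False
    for width in range(start, stop + 1, step):
        failed = not link_works(width)
        if failed and previous_failed:
            ranges[-1] = (ranges[-1][0], width)  # extend the open range
        elif failed:
            ranges.append((width, width))  # start a new range
        previous_failed = failed
    return ranges


def simulated_link_works(width: int) -> bool:
    # Stand-in for a real browser check. It models the bug described above:
    # an invisible overlay swallows taps between 768px and 820px.
    return not (768 <= width <= 820)
```

Merging adjacent failures into a range is what lets the report say "broken from 768px to 816px" rather than listing individual widths. The reported upper bound is the last sampled width that failed, so a smaller step gives a tighter estimate of the real band.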

Business impact: Tablet-sized viewports account for a meaningful share of real traffic — often close to a tenth of it — and they’re exactly the widths that fall between the breakpoints teams check by hand. A broken nav for that slice of users generates immediate support tickets and a hotfix deployment. The QA agent catches it before it leaves the PR.

Bug 4: The Race Condition in Checkout

The change: Parallel payment processing for subscription upgrades, charging payment and updating the plan concurrently instead of sequentially, which cuts the time the upgrade flow takes roughly in half. The setup: A clean optimization, approved by two reviewers including the tech lead, all tests passing including load tests.

The pattern: A QA agent flags a race condition: when payment processing is slow (say, 2+ seconds), the plan update completes first. The user’s dashboard briefly shows the new plan’s features before the payment confirmation arrives. If the payment then fails, the user has access to premium features for a few seconds before the rollback kicks in — and, more critically, if the user triggers a premium-only action in that window (like exporting data in a premium format), the action succeeds and isn’t rolled back even after the payment fails.

Why humans miss it: Code review correctly notes that both operations can fail independently and checks that failure handling exists for both paths. The tests verify that a failed payment rolls back the plan upgrade. But the tests execute in milliseconds — the race condition only manifests when payment processing takes meaningfully longer than the plan update, which requires real-world latency.

What the QA agent does: The agent’s business rule for payments: “No premium feature should be accessible until payment is confirmed. No partial states should be user-visible.” It runs the upgrade flow with a range of simulated payment latencies (500ms, 1000ms, 2000ms, 5000ms). Past roughly 2000ms, it detects the race by checking feature accessibility during the window between plan update and payment confirmation.
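The timing window can be reproduced in a few lines with simulated latency. This is a toy model, not the real checkout: the `Account` class, the two coroutines, and the polling interval are hypothetical, and the latencies are scaled down to fractions of a second so the sketch runs quickly.

```python
import asyncio


class Account:
    def __init__(self) -> None:
        self.plan = "free"
        self.payment_confirmed = False


async def charge(account: Account, latency: float) -> None:
    await asyncio.sleep(latency)  # simulated payment-provider latency
    account.payment_confirmed = True


async def update_plan(account: Account, latency: float) -> None:
    await asyncio.sleep(latency)  # simulated plan-update latency
    account.plan = "premium"


async def exposes_premium_early(payment_latency: float, plan_latency: float) -> bool:
    """Run both steps concurrently and poll for the forbidden state."""
    account = Account()
    upgrade = asyncio.gather(
        charge(account, payment_latency), update_plan(account, plan_latency)
    )
    violated = False
    while not upgrade.done():
        # The rule: premium must never be visible before payment is confirmed.
        if account.plan == "premium" and not account.payment_confirmed:
            violated = True
        await asyncio.sleep(0.005)
    await upgrade
    return violated
```

Calling `asyncio.run(exposes_premium_early(0.2, 0.05))` models a slow payment and trips the rule, while a payment that confirms before the plan update does not. That asymmetry is why millisecond-fast test doubles never see the bug.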

Business impact: Revenue leakage from users exploiting — intentionally or not — the race window. Worse, a payment failure after a premium action creates an awkward customer-service situation: do you revoke the exported data? Charge them retroactively? Real payment processing regularly takes multiple seconds for a non-trivial fraction of transactions, so this isn’t a rare edge case — it’s a timing window that opens routinely.

Bug 5: Stale Cache Serving Old Prices

The change: Edge caching (CloudFront) on the pricing page to improve load times, cutting time-to-first-byte dramatically. The setup: All tests passing, approved.

The pattern: A QA agent flags that after a price change is made in the admin panel, the pricing page keeps showing the old prices for the full duration of the cache TTL (say, 24 hours). The cache-invalidation logic exists — but only fires on full deployments, not on admin-panel price updates.

Why humans miss it: The engineer who implements caching tests it by changing the page content in code and redeploying, which triggers invalidation correctly. The scenario of changing prices through the admin panel without a deploy isn’t tested, because it isn’t part of the PR’s scope. The reviewer focuses on the caching implementation, not on every way cached content could go stale.

What the QA agent does: The agent has a business rule: “The pricing page must reflect the current prices within a minute of any price change.” It updates a price via the admin API, waits, and reloads the pricing page. The old price is still displayed. It flags the staleness, with the cache-header information showing the CDN is serving a cached response.
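The staleness check can be sketched with a toy TTL cache and an explicit clock, so no real waiting is involved. The `TtlCache` class, the example prices, and the timestamps are assumptions for illustration; they model the behavior described above rather than CloudFront's actual API.

```python
from typing import Optional


class TtlCache:
    """A tiny stand-in for an edge cache with a fixed TTL, in seconds."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.value: Optional[str] = None
        self.stored_at: Optional[float] = None

    def get(self, origin_value: str, now: float) -> str:
        """Serve the cached value, refetching from origin once it expires."""
        expired = self.stored_at is None or now - self.stored_at >= self.ttl
        if expired:
            self.value, self.stored_at = origin_value, now
        return self.value

    def invalidate(self) -> None:
        self.stored_at = None


def price_is_fresh(cache: TtlCache, invalidate_on_update: bool) -> bool:
    """Change the price at t=0 and check what the page shows at t=60s."""
    cache.get("$49", now=-10.0)  # a visitor warms the cache before the change
    if invalidate_on_update:
        cache.invalidate()  # what the admin-panel update would need to trigger
    return cache.get("$59", now=60.0) == "$59"
```

With a 24-hour TTL and no invalidation on update, the check fails because the page still serves the old price a minute after the change. Wiring invalidation to the price update makes it pass.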

Business impact: Imagine updating your prices on Monday and discovering on Tuesday that every prospect has been seeing yesterday’s numbers. Or worse: running a promotional discount while the cached page keeps showing the old, higher price. Stale-cache bugs with real revenue impact are common enough that they’re a recognized failure mode in web performance work.

The Pattern Across All Five Bugs

Look at the five bugs together and a pattern emerges:

Login redirect loop: Code-level correct, user-level broken
Currency formatting: Technically accurate, experientially confusing
Mobile nav: Visually correct, functionally broken
Race condition: Logic correct, timing wrong
Stale cache: Feature correct, integration broken

None of these would be caught by unit tests, because unit tests verify code behavior in isolation. None would be caught by standard integration tests, because integration tests verify that components communicate correctly. All five are caught because the QA agent validates the actual user experience: what a real user would see, touch, and encounter.

This is the distinction between testing code and testing product. Traditional CI pipelines are excellent at the former. They verify that functions return expected values, APIs respond with correct status codes, and components render without errors. But they don’t verify that the user can actually accomplish their goal.

How the QA Agent Works

Apollo Space’s QA agent is not a test runner. It’s an agent that understands business rules and validates them against the running application.

The core loop:

Trigger: A PR is opened or updated that touches frontend code, business logic, or API contracts (detected via file path patterns and change analysis)
Environment: The agent spins up a preview deployment of the PR branch
Rule evaluation: The agent loads its business rule set — a curated collection of rules covering authentication, navigation, data display, payment flows, permissions, and cross-browser compatibility
Execution: The agent runs through each applicable rule using browser automation, checking real rendered output against expected behavior
Reporting: Failures are posted as PR comments with the specific rule violated, the expected behavior, the actual behavior, and a screenshot or recording
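The trigger step in the loop above might look something like the following sketch. The glob patterns and directory names are hypothetical, and this covers only the file-path half of the detection; the change analysis mentioned in the trigger step is not modeled here.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns; a real project would tune these to its own layout.
TRIGGER_PATTERNS = ["src/frontend/*", "src/billing/*", "api/schema/*", "*.css"]


def should_run_qa(changed_files: list[str]) -> bool:
    """True when any changed file matches a trigger pattern."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in TRIGGER_PATTERNS
    )
```

A PR that touches only `docs/README.md` would skip the agent, while one that touches `src/frontend/components/Nav.tsx` or any stylesheet would start it.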

The business rules are the key differentiator. These aren’t assertions generated from code. They’re human-defined rules about how the product should behave. “After login, the user sees the dashboard.” “All prices are displayed in the user’s selected currency.” “Navigation works at all viewport widths.” These rules don’t change when the code changes, which is exactly why they catch bugs that code-level tests miss.
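One way to see why such rules survive code changes is to write them as data rather than as assertions tied to functions. The sketch below is an assumption about shape, not Apollo Space's rule format: the `Rule` dataclass, the `app` dictionary of observations, and the report wording are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    expected: str
    observe: Callable[[dict], str]  # reads the actual behavior from the app


def evaluate(rules: list[Rule], app: dict) -> list[str]:
    """Return one report line per violated rule."""
    report = []
    for rule in rules:
        actual = rule.observe(app)
        if actual != rule.expected:
            report.append(f"{rule.name}: expected {rule.expected!r}, got {actual!r}")
    return report


# `app` is a stand-in for observations gathered from a browser session.
RULES = [
    Rule("After login, the user sees the dashboard", "/dashboard",
         lambda app: app["page_after_login"]),
    Rule("All prices are displayed in the user's selected currency", "EUR",
         lambda app: app["displayed_currency"]),
]
```

Each rule names an outcome and how to observe it, and nothing in it refers to middleware, formatters, or any other implementation detail, so a refactor can change everything underneath without touching the rule set.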

The Human QA Question

Does the QA agent replace human QA engineers? Partially.

It replaces the repetitive parts: regression testing, edge case checking across browsers and viewports, business rule validation on every PR. A human QA engineer running these same checks manually would take hours per PR. The agent does it in minutes.

But human QA engineers bring something the agent doesn’t: intuition about usability, judgment about user experience quality, and the ability to test novel features where the expected behavior hasn’t been defined yet. The agent can tell you “the button works.” It can’t tell you “the button is in the wrong place.”

The model to use: the QA agent handles regression and rule-based testing on every PR. Human QA engineers focus on exploratory testing of new features, usability reviews, and maintaining the business rule set that the agent enforces.

This division pushes QA coverage toward every PR instead of the subset humans have time for, and lifts the repetitive grind off human QA engineers so they can spend their time on higher-value work. The humans do the judgment; the agent does the grind.

The Cost of Bugs Caught Late

Bugs like the five above, caught during code review before merge, take minutes to hours to fix.

If they reach production, the calculus changes entirely. A long-standing rule of thumb in software engineering, commonly attributed to the IBM Systems Sciences Institute and echoed by cost-of-quality studies since, holds that bugs found in production cost many times more to fix than bugs found during development, once you factor in incident response, hotfix deployment, customer communication, and reputation damage.

Take the authentication bug: a production incident locking out most of the user base would mean an all-hands P0 — emergency rollback, customer communication to affected users, a post-mortem, and hours of senior engineering time. Easily a five-figure cost in labor and opportunity alone. The QA agent catches it in minutes, and the fix takes minutes more. The math is not subtle.

What This Means for Engineering Teams

The QA agent isn’t magic. It’s the systematic application of business rules against every change, without shortcuts, without fatigue, without the human tendency to focus review effort on the parts of the code that changed rather than the parts of the product those changes affect.

Humans review code. The QA agent reviews product. Both are necessary. Neither is sufficient alone. Bugs like the five in this piece slip past humans not because humans are bad at QA, but because humans are bad at repetitive, exhaustive, cross-dimensional testing on every single PR. That’s what agents are for.

See how Apollo Space's QA agent protects your product: book a demo.
