Screen Readers Solved Browser Agents Before Browser Agents Existed

Status: Early Preview. The latency/cost A/B numbers below are from provisional runs and are marked (provisional) inline. I’m re-running the comparison across more pages and viewports before publishing and will backfill the table; the architecture and the qualitative argument are already settled. Build-in-public, same as the refusal benchmark.

The problem I actually had

I wanted to point an LLM agent at an arbitrary web target and have it work — log in, read the page, fill the form, submit — without writing a bespoke scraper for every site, without triggering bot-fight modes, and without being detected by rate-limiters and scraping blockers (especially since I wasn’t scraping).

The standard assumption when people build this is that the agent needs to see the page. Screenshot in, “click at (640, 312)” out. That’s the whole premise of vision-based computer-use.

For the structured web (forms, buttons, links, text, the 95% of pages that aren’t a <canvas> game), that assumption is wrong, and it’s expensive. The page is already describing itself in a format that was designed decades ago to explain a UI to something that can’t see it.

That something used to be a blind user with a screen reader. Now it’s your agent. Same data. Same need. The web already solved this.

From the Margin: I don’t know where I fit into this world. The Hollywood depiction of hackers and AI was a myth that I somehow romanticised growing up. There is so much love and hate for AI usage in code, cyber security, research, whatever. All I know is that I want to contribute, and I would love to connect with anyone doing new and exciting work in the space.

Here is my disclaimer: I use AI. I use it to the absolute limit of what I feel confident and comfortable with. It helps me write, it corrects me when I make a false claim, and it teaches me why and where I messed up. AI has advanced my abilities and knowledge in this space immeasurably, and without AI I would still be solving easy labs on HTB instead of writing this article.

I don’t mind admitting I am using AI for this work, and I hope that by me clearly stating it up front, you will appreciate the work more.

I believe that it takes actual skill and practice to achieve the results I have. I watch firsthand as user after user is denied, banned, blocked, and otherwise refused by AI models when they attempt any of this.

I am not using abliterated models. I am using common, everyday, “frontier” commercial AI. Most of the time that ends up being Claude because it appears to steer the best, but sometimes it’s GPT, Cursor, Gemini, whatever the project calls for.

I have watched AI bleed context way out of bounds, make crazy claims, and do silly things like print(expected_results). I have also used it to solve complex problems I could never have solved alone, and to find bash vulnerabilities within simple OOO that I would have missed. It has shown me scripts and commands I hadn’t learned yet, even how to rapidly compare and scale attacks without crashing a site.

Anyway, that is just a fraction of it, and this note has gone long enough. If you find this type of work and research interesting and valuable, feel free to reach out, or follow along.

Back to the article:

The accessibility tree is the interface

Every modern browser maintains an accessibility tree: a structured, semantic description of the page exposed for assistive technology. Roles (button, textbox, link), accessible names (“Submit order”), states (disabled, checked, expanded). It’s what a screen reader narrates.

Playwright exposes it in one call:

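import hashlib

def _hash(text: str) -> str:
    # Assumed helper (its definition is not shown here): short, stable fingerprint.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]

async def page_state(page):
    """Compact, LLM-friendly snapshot of the current page."""
    # aria_snapshot() serializes the accessibility tree under <body> as YAML.
    a11y = await page.locator("body").aria_snapshot()
    return {
        "url": page.url,
        "title": await page.title(),
        "accessibility": a11y,   # YAML: role + name + state
        "dom_hash": _hash(await page.content()),  # fingerprint of the full HTML
    }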

aria_snapshot() returns YAML. Here’s roughly what comes back for a login form:

- heading "Sign in" [level=1]
- textbox "Email"
- textbox "Password"
- button "Sign in"
- link "Forgot password?"

Compare that to the raw HTML it replaces: the nested <div> soup, the utility classes, the inline SVGs, the analytics attributes, the three wrapper elements around every input. None of that helps a model decide what to click. The accessibility tree is the page with the noise already removed, because removing the noise is the entire point of accessibility. A screen reader doesn’t narrate <div class="flex gap-2 items-center">. Neither should your agent’s context window.

No custom DOM flattener. No heuristics for “is this element interactive.” The browser already computed it, for reasons that have nothing to do with AI, and it’s been shipping in every browser for twenty years.
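To make the shape of that snapshot concrete, here is a minimal sketch that pulls role/name pairs out of snapshot text like the login example above. The `parse_snapshot` and `interactive` helpers, the regex, and the `INTERACTIVE_ROLES` set are illustrative assumptions, not part of Playwright or of my harness. Real snapshots can be richer (nesting, text nodes, more attributes), so a proper YAML parser is the safer choice beyond simple cases.

```python
import re

# Matches simple lines such as:
#   - button "Sign in"
#   - heading "Sign in" [level=1]
# Hypothetical helper pattern: only handles the `- role "name" [attrs]` shape.
_LINE = re.compile(
    r'^\s*-\s+(?P<role>[a-z]+)(?:\s+"(?P<name>[^"]*)")?(?:\s+\[(?P<attrs>[^\]]*)\])?'
)

# Assumed set of roles an agent can act on; extend as needed.
INTERACTIVE_ROLES = {"button", "link", "textbox", "checkbox", "combobox", "radio"}


def parse_snapshot(snapshot: str) -> list[dict]:
    """Turn snapshot text into a flat list of {role, name, attrs} dicts."""
    nodes = []
    for line in snapshot.splitlines():
        m = _LINE.match(line)
        if not m:
            continue  # skip lines that do not look like a node
        nodes.append(
            {"role": m["role"], "name": m["name"] or "", "attrs": m["attrs"] or ""}
        )
    return nodes


def interactive(nodes: list[dict]) -> list[dict]:
    """Keep only the nodes an agent could click or fill."""
    return [n for n in nodes if n["role"] in INTERACTIVE_ROLES]


SAMPLE = '''\
- heading "Sign in" [level=1]
- textbox "Email"
- textbox "Password"
- button "Sign in"
- link "Forgot password?"
'''

if __name__ == "__main__":
    for node in interactive(parse_snapshot(SAMPLE)):
        print(node["role"], "->", node["name"])
```

Five lines of snapshot become five small records, four of them actionable. That is the whole perception payload for a login page.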

Vision vs. structure: the tradeoffs

I didn’t reject vision on principle. I measured it and reserved it for where it earns its cost.

	Accessibility tree (`page_state`)	Screenshot (vision)
Input tokens / step	~400–1,500 (provisional)	~1,100–1,600 image block (provisional)
Latency / perception call	~150–400 ms (provisional)	~300–650 ms (provisional)
Relative cost / step	1× (baseline)	~5–10×
Targeting	semantic (role + name)	pixel coordinates
Survives viewport / theme change	yes	no (coords shift)
Works on `<canvas>` / image-only UI	no (tree is empty)	yes

(Token and latency figures provisional pending the formal A/B; cost multiple is the one number I’m already confident in — it falls straight out of how the Anthropic API prices image blocks vs. a short text snapshot.)

The structural read wins on cost, latency, and stability for anything with real DOM semantics. Pixel coordinates are brittle in a way that bites you the moment the layout reflows: a button at (640, 312) on your screen is somewhere else on a narrower viewport, and the model has no way to know it missed. Semantic targeting, “click the button named Sign in,” doesn’t care where the button moved to.

Vision wins in exactly one situation: when the accessibility tree comes back empty or useless. Canvas-rendered apps, custom-drawn widgets that never set ARIA roles, image-only content, a chart you need to actually read. That’s real, and it’s why the screenshot tool exists in the harness, but it’s the escape hatch, not the front door. The tool description Claude sees says so out loud:

Use ONLY when the accessibility tree is empty/unhelpful
(canvas, custom widget, image-only content).
Costs ~5-10x a page_state call.

You reserve computer-use. You don’t reach for it first.
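The "escape hatch, not front door" rule can also be enforced in code instead of left to the tool description. The sketch below is one possible gate, under stated assumptions: `needs_vision`, the `USEFUL_ROLES` tuple, and the `min_useful_nodes` threshold are hypothetical names and values, and a real harness would likely want a smarter notion of "unhelpful" than a role count.

```python
import re

# Roles considered useful for acting on or reading a page.
# This list is an assumption; tune it per target.
USEFUL_ROLES = ("button", "link", "textbox", "checkbox", "combobox", "radio", "heading")

_ROLE_LINE = re.compile(
    r"^\s*-\s+(" + "|".join(USEFUL_ROLES) + r")\b",
    re.MULTILINE,
)


def needs_vision(snapshot: str, min_useful_nodes: int = 1) -> bool:
    """Return True when the accessibility snapshot is too thin to act on.

    An empty snapshot, or one with fewer than `min_useful_nodes` lines that
    start with a useful role, is treated as a case for the screenshot tool.
    """
    if not snapshot.strip():
        return True
    return len(_ROLE_LINE.findall(snapshot)) < min_useful_nodes


if __name__ == "__main__":
    print(needs_vision(""))                      # empty tree, e.g. a canvas app
    print(needs_vision('- button "Sign in"'))    # ordinary semantic page
```

The screenshot verb then stays reachable, but only on the branch where the structural read has already come back thin.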

Verbs, not coordinates

Once perception is semantic, action wants to be semantic too. The agent’s whole toolset is verbs that map to intent, not to screen geometry:

page_state          # what's on screen (default perception)
find_text           # locate before acting
click_text / click_role
fill_label / fill_placeholder
select_option / press / type_text
wait_for_text / wait_for_url
submit_finding      # terminal: "done, here's why"

click_role("button", name="Sign in") is how a human describes the action. It’s also how the accessibility tree describes the element. The perception layer and the action layer speak the same language, which means the model rarely has to translate “the thing I see” into “the place I click.” That translation step is where vision agents burn turns and misfire.

There’s a portability dividend too. These verbs are thin wrappers over the engine. Swapping Playwright for a different backend later means re-implementing one client file, not rewriting every caller. The agent’s contract is the verbs, not the engine.
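A rough sketch of what "the contract is the verbs" can look like in code. Everything here is illustrative: `BrowserBackend`, `FakeBackend`, `ALLOWED_VERBS`, and `dispatch` are hypothetical names, only two verbs are shown, and their signatures are my assumption. A Playwright-backed implementation of the same protocol would plausibly call `page.get_by_role(role, name=name).click()` and `page.get_by_label(label).fill(value)`.

```python
from typing import Any, Protocol


class BrowserBackend(Protocol):
    """The verbs callers depend on. Signatures are assumed for illustration."""

    def click_role(self, role: str, name: str) -> str: ...
    def fill_label(self, label: str, value: str) -> str: ...


class FakeBackend:
    """In-memory stand-in so the sketch runs without a browser."""

    def __init__(self) -> None:
        self.log: list[tuple[str, ...]] = []

    def click_role(self, role: str, name: str) -> str:
        self.log.append(("click_role", role, name))
        return f"clicked {role} {name!r}"

    def fill_label(self, label: str, value: str) -> str:
        self.log.append(("fill_label", label, value))
        return f"filled {label!r}"


# Only verbs on this list can be reached from a model's tool call.
ALLOWED_VERBS = {"click_role", "fill_label"}


def dispatch(backend: BrowserBackend, verb: str, args: dict[str, Any]) -> str:
    """Route a tool call (verb name + JSON args) to whichever backend is plugged in."""
    if verb not in ALLOWED_VERBS:
        raise ValueError(f"unknown verb: {verb}")
    return getattr(backend, verb)(**args)


if __name__ == "__main__":
    backend = FakeBackend()
    print(dispatch(backend, "fill_label", {"label": "Email", "value": "a@example.com"}))
    print(dispatch(backend, "click_role", {"role": "button", "name": "Sign in"}))
```

Callers only ever see `dispatch` and the verb names, so swapping engines means writing one new class that satisfies the protocol.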

Make the browser a service

One architectural decision that paid off more than I expected: the browser runs as a long-lived process, separate from the stateless agent that drives it. A local HTTP API exposes the verbs; the agent calls them.

This isn’t about scale. It’s about the realities of driving real web apps:

Log in once. Cookies and local storage persist across agent restarts, so OAuth/SSO happens one time, not every run.
Watch it work. Headed Chromium opens and you see every click — invaluable for debugging and for not trusting a black box.
Intervene mid-flow. Captcha, MFA, a manual recovery step: switch to the browser window, do the thing by hand, the agent picks back up where it left off.
Flip one flag for CI. Headless is the same code path.

The agent process can crash, restart, or be re-prompted without tearing down the session. That separation turned out to be the difference between a demo and something I’d actually leave running.
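Here is a stripped-down sketch of the two-process shape using only the standard library. It is not the harness itself: `StubBrowser` stands in for a real long-lived browser session, the `/verb` URL layout and JSON bodies are assumptions, and `start_service` and `call` are hypothetical helpers. The point it illustrates is that session state lives in the service, so any number of short-lived clients can come and go.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StubBrowser:
    """Stand-in for the long-lived browser; its state outlives any one client."""

    def __init__(self) -> None:
        self.fields: dict[str, str] = {}

    def fill_label(self, label: str, value: str) -> dict:
        self.fields[label] = value
        return {"ok": True, "filled": label}

    def page_state(self) -> dict:
        return {"fields": dict(self.fields)}


VERBS = {"fill_label", "page_state"}  # the only routes the service exposes


def make_handler(browser: StubBrowser):
    class Handler(BaseHTTPRequestHandler):
        def do_POST(self) -> None:
            verb = self.path.strip("/")
            length = int(self.headers.get("Content-Length", 0))
            args = json.loads(self.rfile.read(length) or b"{}")
            if verb not in VERBS:
                status, body = 404, {"error": f"unknown verb: {verb}"}
            else:
                status, body = 200, getattr(browser, verb)(**args)
            payload = json.dumps(body).encode()
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, *args) -> None:  # keep the sketch quiet
            pass

    return Handler


def start_service(browser: StubBrowser) -> ThreadingHTTPServer:
    # Port 0 asks the OS for any free port; bind to loopback only.
    server = ThreadingHTTPServer(("127.0.0.1", 0), make_handler(browser))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call(server: ThreadingHTTPServer, verb: str, **args) -> dict:
    """What a stateless agent process would do for each tool call."""
    host, port = server.server_address[:2]
    conn = HTTPConnection(host, port, timeout=5)
    try:
        conn.request("POST", f"/{verb}", body=json.dumps(args),
                     headers={"Content-Type": "application/json"})
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()


if __name__ == "__main__":
    service = start_service(StubBrowser())
    print(call(service, "fill_label", label="Email", value="a@example.com"))
    print(call(service, "page_state"))  # a later client sees the same session
    service.shutdown()
```

Each `call` opens a fresh connection and holds no state of its own, which is the property that lets the agent side crash or restart freely.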

The economics nobody structures for

A perception-heavy agent loop is the ideal prompt-caching workload, and most implementations leave the savings on the floor.

The system prompt and the tool definitions don’t change between iterations of a loop. Mark them cached, and every step after the first pays cache-read rates instead of re-sending the whole preamble. In this harness the tool block is cached by setting cache_control on the last tool definition:

def cached_tools():
    out = [dict(t) for t in TOOLS]
    out[-1] = dict(out[-1], cache_control={"type": "ephemeral"})
    return out
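The system prompt can be marked the same way, by sending it as a text block that carries its own cache_control. The sketch below only builds the request arguments, so it runs without network access. `SYSTEM_PROMPT`, the two entries in `TOOLS`, the `max_tokens` value, and `build_request` are hypothetical stand-ins; the model id is left to the caller.

```python
# Hypothetical stand-ins for a harness's real prompt and tool definitions.
SYSTEM_PROMPT = "You drive a browser through semantic verbs."
TOOLS = [
    {
        "name": "page_state",
        "description": "Snapshot the current page.",
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        "name": "click_role",
        "description": "Click an element by role and accessible name.",
        "input_schema": {
            "type": "object",
            "properties": {"role": {"type": "string"}, "name": {"type": "string"}},
            "required": ["role", "name"],
        },
    },
]


def cached_tools() -> list[dict]:
    """Copy the tool list and put a cache breakpoint on the last definition."""
    out = [dict(t) for t in TOOLS]
    out[-1] = dict(out[-1], cache_control={"type": "ephemeral"})
    return out


def build_request(model: str, messages: list[dict]) -> dict:
    """Keyword arguments for a Messages API call with a cacheable prefix."""
    return {
        "model": model,
        "max_tokens": 1024,  # arbitrary value for the sketch
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "tools": cached_tools(),
        "messages": messages,  # only this part changes between iterations
    }
```

With the Anthropic Python SDK, these arguments would typically be passed as `client.messages.create(**build_request(model_id, messages))`. Note that `cached_tools()` copies each definition, so the shared `TOOLS` list is never mutated.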

The payoff: a typical 20-iteration loop on Sonnet 4.6 lands around $0.05–0.15 on a small target. The cache has a ~5-minute TTL, which is fine because an agent loop fires steps far faster than that. Run on a heavier model and you multiply by roughly 5×, but the shape of the cost stays the same: it’s cheap, because the expensive part is cached and the per-step perception is a short text snapshot. If your browser agent’s per-step cost is dominated by re-sending context or by full-page screenshots, you’re paying for an architecture choice, not a hard requirement.
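A back-of-envelope sketch of why the cached prefix matters. It counts cost in units of "one uncached input token" so no dollar prices are needed. The multipliers are assumptions based on how prompt caching is commonly priced (a cache write somewhat above the base input rate, a cache read at a small fraction of it), so check current pricing before relying on them. The 3,000-token prefix is a made-up figure, not a measurement from my harness.

```python
def prefix_cost(
    prefix_tokens: int,
    steps: int,
    cached: bool,
    write_mult: float = 1.25,  # assumed cache-write multiplier
    read_mult: float = 0.10,   # assumed cache-read multiplier
) -> float:
    """Cost of sending a fixed prefix on every step, in base-input-token units.

    Covers only the unchanging prefix (system prompt + tool definitions);
    per-step perception and output tokens are deliberately left out.
    """
    if steps < 1:
        raise ValueError("steps must be >= 1")
    if not cached:
        return float(prefix_tokens * steps)
    # The first call writes the cache; each later call reads it.
    return prefix_tokens * write_mult + prefix_tokens * read_mult * (steps - 1)


if __name__ == "__main__":
    uncached = prefix_cost(3000, 20, cached=False)
    cached = prefix_cost(3000, 20, cached=True)
    print(f"uncached: {uncached:.0f}  cached: {cached:.0f}  ratio: {uncached / cached:.1f}x")
```

Under those assumptions a 20-step loop pays for roughly 9,450 token-equivalents of prefix instead of 60,000. The exact ratio moves with the multipliers and the step count, but the shape holds as long as steps arrive inside the cache TTL.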

Where it breaks — and the part that actually matters

The honest limits.

The accessibility tree is only as good as the page’s semantics. Apps that render to canvas, never set ARIA roles, or bury everything in unlabeled <div>s give you a thin or empty tree. That’s the screenshot escape hatch’s whole reason to exist. A well-built, accessible app is, pleasantly, also the easiest one for an agent to drive: accessibility and agent-readiness turn out to be the same property.

But here’s the part worth sitting with. On a controlled test target with a ladder of difficulty, clean accessibility-tree perception lets the agent solve the easy rungs reliably and then stalls on the hard ones; it stalls regardless of how perfectly it can see the page. Better perception didn’t make it better at the actual problem.

That’s the real lesson, and it generalizes past browsers. Perception was never the bottleneck. Giving an agent a clean, cheap, stable read of its environment makes it competent at acting: it clicks the right thing, fills the right field, doesn’t fumble the mechanics. It does not make it clever at the task. Those are different problems, solved in different places, and conflating them is how you end up spending vision-model money to make a confused agent see a form in higher resolution.

Use the accessibility tree. It’s cheaper, it’s more stable, and it’s been waiting in the browser the whole time. Save vision for when the page genuinely refuses to describe itself, and don’t expect either one to fix an agent that doesn’t know what it’s doing.

Harness details (two-process design, verb surface, dedup/state, prompt caching) are from my own target-agnostic red-team browser harness; the tree-search and escalation drivers (TAP, Tree’d Escalation) come from its sibling harnesses in the same red-team toolkit. Specifics of any particular authorized target are deliberately scrubbed; this is a methodology note, not a writeup of any one engagement.

I self-fund all of this: over $1,000/month in subscriptions, tokens, credits, API, hosting, labs, domains, everything. If you enjoy it, feel free to buy me a coffee.

Stay Loud, Stay Curious, Question Everything