Architecture · April 14, 2026 · 11 min read

DOM-Based vs Screenshot-Based AI Testing: Why Tokens Matter

The architectural divide between vision-based and DOM-based AI testing determines cost, speed, and reliability. For enterprise browser testing, DOM-based wins decisively.

There are two fundamentally different architectures for AI browser automation in 2026. One sends screenshots to the LLM and asks it to click by pixel coordinates. The other sends structured DOM extracts and asks the LLM to reason about semantic elements. The choice between them drives almost every other downstream metric — cost, speed, reliability, LLM flexibility, and enterprise viability.

This post is about why the architectural choice matters, and why DOM-based wins for enterprise testing.

The two approaches, explained

Screenshot-based (vision)

The LLM receives a pixel image of the rendered page. It reasons about what it sees and returns an action: “click at coordinate (423, 182).” The agent clicks there. Repeat.

Examples: Anthropic Claude computer use, OpenAI Operator, various computer-use agents.

DOM-based (structured)

The agent extracts a structured representation of the page from the browser DOM — interactive elements, their roles, labels, state, position in the tree. The LLM reasons about this structure and returns an action: “click the button with text ‘Submit’.” The agent resolves it to the correct DOM element and clicks.
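As an illustration, a structured extract might look like the sketch below. The element model here is hypothetical; a real agent would derive it from the live DOM or the accessibility tree.

```python
# Sketch of a structured DOM extract. The element dicts below are a
# hypothetical model, not the actual format used by any agent.

def serialize_elements(elements):
    """Render interactive elements as compact, LLM-friendly lines."""
    lines = []
    for i, el in enumerate(elements):
        state = "disabled" if el.get("disabled") else "enabled"
        lines.append(f'[{i}] {el["role"]} "{el["label"]}" ({state})')
    return "\n".join(lines)

page = [
    {"role": "textbox", "label": "Email"},
    {"role": "textbox", "label": "Password"},
    {"role": "button", "label": "Submit", "disabled": False},
]

print(serialize_elements(page))
# [0] textbox "Email" (enabled)
# [1] textbox "Password" (enabled)
# [2] button "Submit" (enabled)
```

A page's worth of these lines is a few hundred tokens, versus thousands for the equivalent screenshot.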

Examples: Karate Agent.

Why token cost matters more than anything else

A single 1920×1080 screenshot consumes roughly 1,500-3,000 LLM input tokens when sent via the Anthropic or OpenAI APIs. A structured DOM extract of the same page — every interactive element with labels and state — typically runs 300-800 tokens.

That’s roughly a 4× difference in the typical case, and up to 10× at the extremes, per page view. Over a test suite with hundreds of page views, it’s the difference between a sustainable enterprise tool and a budget-exploding pilot.

The math at scale

A realistic enterprise workload: 100 regression tests, 20 page views per test, 10 runs per day. That’s 20,000 page views per day.

The self-hosted open-source option is essentially free at steady state. Screenshot-based with a frontier cloud model is $40-50K/year for a moderate test suite. The choice compounds over time.
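The arithmetic behind those figures can be sanity-checked. This sketch assumes illustrative numbers: 2,250 average tokens per screenshot, 550 per DOM extract, and $3 per million input tokens as a plausible frontier-model rate.

```python
# Back-of-envelope annual token cost for the workload above.
# All rates here are assumptions for illustration.

PAGE_VIEWS_PER_DAY = 100 * 20 * 10   # 100 tests x 20 views x 10 runs
PRICE_PER_M_TOKENS = 3.00            # USD per million input tokens (assumed)

def annual_cost(tokens_per_view):
    daily_tokens = PAGE_VIEWS_PER_DAY * tokens_per_view
    return daily_tokens / 1e6 * PRICE_PER_M_TOKENS * 365

print(f"screenshot: ${annual_cost(2250):,.0f}/yr")  # screenshot: $49,275/yr
print(f"DOM:        ${annual_cost(550):,.0f}/yr")   # DOM:        $12,045/yr
```

Under these assumptions the screenshot path lands squarely in the $40-50K/year range, while the DOM path is a quarter of that even before self-hosting takes the marginal cost to zero.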

Why DOM-based is faster

Beyond tokens, latency matters. Every second per step compounds across a test suite.

Over a 100-step test, screenshot-based spends 2-5 minutes in LLM calls; DOM-based spends 30-80 seconds. Multiplied across hundreds of daily test runs, that gap compounds into hours of wall-clock time per day.

Why DOM-based is more reliable

Screenshot-based agents click by pixel coordinates. The LLM looks at the image and says “click at (423, 182).” Three failure modes follow:

- Stale coordinates: the page can shift between the screenshot and the click (a banner loads, a layout reflows), so the click lands on the wrong element.
- Resolution sensitivity: coordinates are tied to one viewport size and zoom level; the same action breaks at a different resolution.
- Imprecise localization: the model estimates pixel positions from the image and is sometimes simply off, clicking near the target instead of on it.

DOM-based agents don’t have these failure modes. The LLM names the element (“the Submit button in the checkout form”) and the agent resolves it deterministically via CSS or ARIA.
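A minimal sketch of that resolution step, using a hypothetical element list rather than any real agent's API:

```python
# Deterministic element resolution: the LLM names an element by role
# and label; the agent matches it against the extracted DOM instead of
# trusting pixel coordinates. Element shapes here are hypothetical.

def resolve(elements, role, label):
    """Return the unique element matching role + accessible label."""
    matches = [el for el in elements
               if el["role"] == role and el["label"] == label]
    if len(matches) != 1:
        raise LookupError(f"expected 1 match, got {len(matches)}")
    return matches[0]

page = [
    {"role": "button", "label": "Cancel", "selector": "#cancel"},
    {"role": "button", "label": "Submit", "selector": "#checkout-submit"},
]

target = resolve(page, "button", "Submit")
print(target["selector"])  # #checkout-submit
```

Note the failure behavior: an ambiguous or missing match raises an error instead of silently clicking the wrong spot, which is exactly the property pixel clicking lacks.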

Why DOM-based works with smaller LLMs

This is the under-appreciated advantage: DOM-based reasoning is a text task. Screenshot-based reasoning is a vision task. Vision is harder.

Frontier models (Claude Opus, GPT-4, Gemini Ultra) handle vision well. Smaller and open-source models often struggle with it. Llama 3.3 70B handles DOM reasoning excellently, but its vision capability is much weaker; Qwen 2.5 72B and DeepSeek V3 show the same pattern.

The practical effect: DOM-based architectures let enterprise teams use smaller, cheaper, self-hosted models for routine workloads. Screenshot-based architectures typically require frontier cloud models to achieve acceptable accuracy — which means higher costs and cloud dependency.

When screenshot-based wins

There are cases where DOM-based falls short and screenshot-based wins:

- Canvas and WebGL interfaces, where the UI is painted pixels with little or no DOM to extract.
- Visual correctness checks: layout, styling, and rendering bugs are invisible in a DOM extract.
- Anything outside the browser: native desktop apps, OS dialogs, and remote-desktop sessions.

For general-purpose agents (desktop automation, non-browser workflows), screenshot-based is the right architecture. For enterprise browser testing, DOM-based wins decisively.

The look() diffing optimization

Karate Agent takes the DOM-based approach one step further with diffing. Instead of sending the full DOM on every step, it sends only what changed since the last step.

In a typical test, most steps cause small UI updates — a field gets filled, a checkbox toggles, a dropdown opens. The DOM diff is 10-50 tokens. Occasionally a page navigation produces a larger diff. The average case is dramatically cheaper than sending the full page every time.
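A toy version of the diffing idea, assuming snapshots keyed by a stable element id (an illustrative simplification, not Karate Agent's actual mechanism):

```python
# DOM diffing between steps: only changed, new, or removed entries are
# sent to the LLM; unchanged elements cost zero tokens.

def diff_snapshot(prev, curr):
    changes = {}
    for key, el in curr.items():
        if prev.get(key) != el:
            changes[key] = el        # new or updated element
    for key in prev.keys() - curr.keys():
        changes[key] = None          # element removed since last step
    return changes

before = {"email":  "textbox Email (empty)",
          "cancel": "button Cancel (enabled)",
          "submit": "button Submit (disabled)"}
after  = {"email":  "textbox Email (filled)",
          "cancel": "button Cancel (enabled)",
          "submit": "button Submit (enabled)"}

print(diff_snapshot(before, after))
# {'email': 'textbox Email (filled)', 'submit': 'button Submit (enabled)'}
```

Filling one field touches two entries out of three here; on a real page with dozens of elements, a typical step's diff is a handful of lines, which is where the 10-50 token figure comes from.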

Net effect: roughly 72× fewer tokens spent on page scans compared to sending the full HTML on every step. This compounds with the DOM-vs-screenshot savings for a total token reduction of roughly 100× vs. vision-based agents. At scale, this is the difference between an enterprise tool and a budget disaster.

Summary

For enterprise browser testing in 2026, DOM-based AI automation wins on:

- Cost: structured extracts consume a fraction of the tokens a screenshot does.
- Speed: smaller prompts mean lower per-step LLM latency.
- Reliability: semantic element resolution instead of brittle pixel coordinates.
- LLM flexibility: DOM reasoning runs well on smaller, cheaper, self-hostable models.

Screenshot-based wins on coverage of non-DOM UIs and general-purpose desktop automation. Use the right tool for the right job.
