LLM Browser Automation | DOM-First AI Browser Testing, Open-Weight Verified

Q: What's the difference between DOM-first and vision-based LLM browser automation?

Vision-based agents (like Anthropic Claude computer use) send screenshots to the LLM and get back pixel coordinates. Tokens are expensive (images are heavy), latency is higher, and pixel accuracy is imperfect. DOM-first agents (like Karate Agent) send structured DOM representations — elements, roles, labels, states — which are 10–50× more token-efficient, faster, and deterministic. For enterprise testing, DOM-first wins on cost, speed, and reliability.

Q: How does LLM browser automation integrate with MCP (Model Context Protocol)?

Karate Agent exposes a karate_eval tool via MCP, so developers can drive browser tests from Claude Code, VS Code Copilot, Cursor, and other MCP-compatible clients. This turns the AI coding assistants teams already use into first-class test authoring partners — write a feature with Cursor, run a verification with Karate Agent, all from the same editor.

The definition

What is LLM browser automation?

LLM browser automation is the use of large language models to drive web browsers — clicking, typing, navigating, validating — based on natural language or structured intent rather than hardcoded selectors. It’s the engine behind the new generation of AI testing tools, RPA agents, and browsing assistants.

The LLM never touches the browser directly. A runtime sits between them: it reads the page (DOM or screenshot), passes a representation to the LLM, receives back an action, and executes it. That runtime is where the entire architecture decision lives — and where the cost, speed, and reliability trade-offs play out.

Two architectures dominate today. They produce very different economics.

The architecture choice

Vision-based vs DOM-first

Same end goal: LLM drives browser. Wildly different paths to get there.

Approach A

Vision-based

Send a screenshot. Get back pixel coordinates.

Browser renders the page
Runtime captures a screenshot (typically PNG, 1280×720+)
Screenshot encoded as base64 image tokens
LLM reasons over pixels, outputs click(x: 412, y: 278)
Runtime sends the click to the OS

Costs: 5,000–15,000 tokens per step at enterprise resolution

Requires: frontier vision models (GPT-4o, Claude Sonnet, Gemini Pro)

Self-host: theoretical (no good open vision models yet)

Approach B · Karate Agent

DOM-first

Send a structured page. Get back intent.

Browser renders the page
Runtime extracts interactive elements (buttons, links, inputs) with roles, labels, and ARIA
Compact structured representation sent as plain text
LLM picks the element by semantic intent: click("Sign in")
Runtime resolves the label to a CDP-level action — deterministic, no coordinate hallucinations

Costs: 200–1,000 tokens per step (10–50× less)

Works with: small/open models — Llama 70B, Qwen 72B, DeepSeek V3

Self-host: standard deployment via Ollama / vLLM

The DOM-first approach trades one capability for three. It can’t handle pure canvas/WebGL/PDF content the way vision can — a real limit for some apps. In exchange: an order of magnitude cheaper, an order of magnitude faster, and able to run on a $50/month GPU instead of a $2,000/month API bill. For enterprise testing of real web apps, that’s an obvious trade.

A specific optimization

The `look()` pattern

Even DOM-first agents would burn tokens if they re-extracted the full page on every step. Karate Agent doesn’t.

Naive approach

Full DOM each step

Every action re-scans the entire DOM. On a complex SPA, that’s 5,000+ tokens per step just for context.

A 20-step test: ~100K tokens of redundant context

Karate Agent

Diff since last action

Only changes from the last DOM extract are sent. A button click might surface 50 tokens of new menu, not 5,000 of unchanged sidebar.

72× reduction in page-scan tokens at scale

# Scripted flow — zero LLM calls, native speed
agent.do("go to /login")
agent.do("sign in as admin@test.io")

# LLM fires here — first dynamic decision after the page changes
agent.look()  # diff vs last extract; only new elements
agent.verify("dashboard shows monthly revenue card")

The combination of DOM-first + diff-based look() + scripted-when-possible is what makes enterprise-scale runs economical. A 1,000-test regression at frontier-vision pricing is unaffordable. The same suite on this architecture costs less than the CI minutes it consumes.

Token economics

A real 20-step test, compared

Logging in, navigating to a settings page, updating a field, verifying it persists.

Architecture	Tokens / step	Total (20 steps)	Cost (Claude Sonnet)	Cost (Llama 70B local)
Vision-based	~10,000	~200,000	$0.60	N/A — no open vision
DOM-first naive	~3,000	~60,000	$0.18	~$0.0006
DOM-first + look()	~400	~8,000	$0.024	~$0.00008
DOM-first + scripted happy path	~0–100	~500	$0.0015	essentially free

Numbers are illustrative averages for a typical SPA flow; your suite will vary. The point is the four orders of magnitude between the top and bottom rows. That spread is the difference between AI testing as an experiment and AI testing as a CI budget line item your CFO doesn’t question.

Model compatibility

Any model. Tested.

Because the input is structured text, not images, the model just needs to be good at structured reasoning. That’s a much shorter list of requirements than vision.

Cloud — frontier

Highest quality

•Anthropic Claude Opus / Sonnet / Haiku 4
•OpenAI GPT-5, GPT-4o
•Google Gemini 2.5 Pro

Best for: complex decision logic, ambiguous flows, exploratory testing.

Open-weight

Cost-optimized

•Gemma 4 26B verified
•Qwen 3.6 35B-A3B verified
•Meta Llama 3.3 70B
•DeepSeek V3 / R1
•Mistral, GLM, Kimi

Gemma 4 and Qwen 3.6 pass the full UI-automation benchmark suite, each on one 24 GB GPU. Best for: regression suites, deterministic flows, high-volume CI.

Self-hosted

Air-gapped

•Ollama — any GGUF model
•vLLM — production inference
•TGI — HuggingFace text-gen
•LM Studio — local desktop
•Any OpenAI-compatible endpoint

Best for: regulated industries, data sovereignty, zero outbound calls.

Switch providers via configuration. Same agent code, different model. Bring your own LLM covers the deployment story in detail.

MCP integration

Call from any AI assistant

Karate Agent exposes a karate_eval tool via Model Context Protocol. That means Claude Code, Cursor, GitHub Copilot, and any other MCP-compatible assistant can drive browser tests during normal feature work.

# In your AI coding assistant’s chat:
> Add a password-reset flow to the auth page.
> ... agent writes code ...
> Now verify it works.

# Assistant calls karate_eval via MCP:
{
  "tool": "karate_eval",
  "scenario": "reset password for test@example.com via /auth/reset",
  "url": "http://localhost:3000"
}

# Karate Agent runs the test, returns structured pass/fail with screenshots.
# Assistant reports back, iterates if needed, all without leaving the IDE.

This collapses the develop-test loop into a single conversation. Same person writing the code can verify it without context-switching to a test runner. Same AI assistant that wrote the feature can run the regression that proves the feature works. See testing AI-generated code for the broader pattern.

Inside your perimeter

No outbound calls.·No telemetry.·No hosted control plane.

Pair Karate Agent with a local LLM via Ollama or vLLM. The entire stack — Chrome, agent server, model — runs in your Kubernetes cluster, on your hardware, with zero internet dependency. Standard deployment for financial services, insurance, and healthcare. Self-hosted AI testing covers the configuration in detail.

FAQ

Frequently asked questions

What is LLM browser automation?

LLM browser automation is the use of large language models to control web browsers — clicking, typing, navigating, validating — based on natural language or structured intent. The LLM receives either a screenshot (vision-based) or a DOM extract (DOM-first) of the page and decides what action to take next. It powers the new wave of AI-driven testing, RPA, and browsing agents.

What’s the difference between DOM-first and vision-based LLM browser automation?

Vision-based agents (like Claude computer use) send screenshots to the LLM and get back pixel coordinates. Tokens are expensive (images are heavy), latency is higher, and pixel accuracy is imperfect. DOM-first agents (like Karate Agent) send structured DOM representations — elements, roles, labels, states — which are 10–50× more token-efficient, faster, and deterministic. For enterprise testing, DOM-first wins on cost, speed, and reliability.

Which LLMs work with Karate Agent?

Any of them. Karate Agent is provider-agnostic: Anthropic Claude (Opus, Sonnet, Haiku), OpenAI (GPT-4, GPT-4o, GPT-5), Google Gemini (Pro, Ultra), Gemma 4 26B and Qwen 3.6 35B-A3B (both benchmark-verified across the full UI-automation suite, each on a single 24 GB GPU), Meta Llama, DeepSeek, Mistral, GLM, Kimi, and any model served via Ollama or an OpenAI-compatible endpoint (vLLM, TGI, LM Studio).

Can I use open-source LLMs for browser automation?

Yes — and this is one of the strongest reasons to choose a DOM-first platform. Vision-based agents typically require frontier models (GPT-4, Claude Opus) to handle screenshot reasoning. DOM-first agents work with smaller, cheaper models because the input is structured. Gemma 4 26B and Qwen 3.6 35B-A3B are benchmark-verified across the full UI-automation suite, each on a single 24 GB GPU, and Llama 3.3 70B, DeepSeek V3 and Mistral are also viable for enterprise testing. See Karate Agent for specifics.

What does LLM browser automation cost in tokens?

Wide range. Vision-based agents can burn 10,000+ tokens per step at scale. DOM-first agents like Karate Agent use structured responses and look() diffing for a 72× reduction in page scans. Combined with scripted flows that consume zero tokens (the LLM is only invoked on recovery), typical enterprise test runs cost cents, not dollars.

Can LLM browser automation run fully on-premises?

Yes. Pair Karate Agent (Docker-deployed) with a local model served via Ollama or vLLM. The entire pipeline — browser, agent server, LLM — runs inside your firewall. No data leaves, no telemetry, no internet required. This is the standard deployment for regulated financial services and insurance customers. See self-hosted AI testing and enterprise AI testing.

How does LLM browser automation compare to Claude computer use?

Claude computer use is a vision-based, cloud-hosted agent from Anthropic. It sends screenshots to Anthropic’s servers and returns actions. Strengths: works on anything visible. Weaknesses: cloud-only, vendor-locked, token-heavy, data leaves your network. Karate Agent is self-hosted, DOM-first, LLM-agnostic, and designed for enterprise testing specifically. Different tools for different jobs — see the side-by-side.

How does LLM browser automation integrate with MCP (Model Context Protocol)?

Karate Agent exposes a karate_eval tool via MCP, so developers can drive browser tests from Claude Code, VS Code Copilot, Cursor, and other MCP-compatible clients. This turns the AI coding assistants teams already use into first-class test authoring partners — write a feature with Cursor, run a verification with Karate Agent, all from the same editor. See testing AI-generated code.

LLM browser automation,
without the token tax.

What is LLM browser automation?

Vision-based vs DOM-first

Vision-based

DOM-first

The `look()` pattern

Full DOM each step

Diff since last action

A real 20-step test, compared

Any model. Tested.

Highest quality

Cost-optimized

Air-gapped

Call from any AI assistant

Frequently asked questions

Cheap models, real browsers,
structured intent.

LLM browser automation, without the token tax.

What is LLM browser automation?

Vision-based vs DOM-first

Vision-based

DOM-first

The look() pattern

Full DOM each step

Diff since last action

A real 20-step test, compared

Any model. Tested.

Highest quality

Cost-optimized

Air-gapped

Call from any AI assistant

Frequently asked questions

Cheap models, real browsers,structured intent.

LLM browser automation,
without the token tax.

The `look()` pattern

Cheap models, real browsers,
structured intent.