Vision-based agents send the LLM a screenshot. That’s slow, expensive, and locks you into frontier models. The architecture below is different — structured DOM extraction, 10–50× fewer tokens, works with cheap open-source models, runs anywhere.
The definition
LLM browser automation is the use of large language models to drive web browsers — clicking, typing, navigating, validating — based on natural language or structured intent rather than hardcoded selectors. It’s the engine behind the new generation of AI testing tools, RPA agents, and browsing assistants.
The LLM never touches the browser directly. A runtime sits between them: it reads the page (DOM or screenshot), passes a representation to the LLM, receives back an action, and executes it. That runtime is where the entire architecture decision lives — and where the cost, speed, and reliability trade-offs play out.
Two architectures dominate today. They produce very different economics.
The architecture choice
Same end goal: LLM drives browser. Wildly different paths to get there.
Approach A
Send a screenshot. Get back pixel coordinates.
click(x: 412, y: 278)Costs: 5,000–15,000 tokens per step at enterprise resolution
Requires: frontier vision models (GPT-4o, Claude Sonnet, Gemini Pro)
Self-host: theoretical (no good open vision models yet)
Approach B · Karate Agent
Send a structured page. Get back intent.
click("Sign in")Costs: 200–1,000 tokens per step (10–50× less)
Works with: small/open models — Llama 70B, Qwen 72B, DeepSeek V3
Self-host: standard deployment via Ollama / vLLM
The DOM-first approach trades one capability for three. It can’t handle pure canvas/WebGL/PDF content the way vision can — a real limit for some apps. In exchange: an order of magnitude cheaper, an order of magnitude faster, and able to run on a $50/month GPU instead of a $2,000/month API bill. For enterprise testing of real web apps, that’s an obvious trade.
A specific optimization
look() patternEven DOM-first agents would burn tokens if they re-extracted the full page on every step. Karate Agent doesn’t.
Naive approach
Every action re-scans the entire DOM. On a complex SPA, that’s 5,000+ tokens per step just for context.
A 20-step test: ~100K tokens of redundant context
Karate Agent
Only changes from the last DOM extract are sent. A button click might surface 50 tokens of new menu, not 5,000 of unchanged sidebar.
72× reduction in page-scan tokens at scale
# Scripted flow — zero LLM calls, native speed
agent.do("go to /login")
agent.do("sign in as admin@test.io")
# LLM fires here — first dynamic decision after the page changes
agent.look() # diff vs last extract; only new elements
agent.verify("dashboard shows monthly revenue card")
The combination of DOM-first + diff-based look() + scripted-when-possible is what makes enterprise-scale runs economical. A 1,000-test regression at frontier-vision pricing is unaffordable. The same suite on this architecture costs less than the CI minutes it consumes.
Token economics
Logging in, navigating to a settings page, updating a field, verifying it persists.
| Architecture | Tokens / step | Total (20 steps) | Cost (Claude Sonnet) | Cost (Llama 70B local) |
|---|---|---|---|---|
| Vision-based | ~10,000 | ~200,000 | $0.60 | N/A — no open vision |
| DOM-first naive | ~3,000 | ~60,000 | $0.18 | ~$0.0006 |
| DOM-first + look() | ~400 | ~8,000 | $0.024 | ~$0.00008 |
| DOM-first + scripted happy path | ~0–100 | ~500 | $0.0015 | essentially free |
Numbers are illustrative averages for a typical SPA flow; your suite will vary. The point is the four orders of magnitude between the top and bottom rows. That spread is the difference between AI testing as an experiment and AI testing as a CI budget line item your CFO doesn’t question.
Model compatibility
Because the input is structured text, not images, the model just needs to be good at structured reasoning. That’s a much shorter list of requirements than vision.
Cloud — frontier
Best for: complex decision logic, ambiguous flows, exploratory testing.
Open source
Best for: regression suites, deterministic flows, high-volume CI.
Self-hosted
Best for: regulated industries, data sovereignty, zero outbound calls.
Switch providers via configuration. Same agent code, different model. Bring your own LLM covers the deployment story in detail.
MCP integration
Karate Agent exposes a karate_eval tool via Model Context Protocol. That means Claude Code, Cursor, GitHub Copilot, and any other MCP-compatible assistant can drive browser tests during normal feature work.
# In your AI coding assistant’s chat:
> Add a password-reset flow to the auth page.
> ... agent writes code ...
> Now verify it works.
# Assistant calls karate_eval via MCP:
{
"tool": "karate_eval",
"scenario": "reset password for test@example.com via /auth/reset",
"url": "http://localhost:3000"
}
# Karate Agent runs the test, returns structured pass/fail with screenshots.
# Assistant reports back, iterates if needed, all without leaving the IDE.
This collapses the develop-test loop into a single conversation. Same person writing the code can verify it without context-switching to a test runner. Same AI assistant that wrote the feature can run the regression that proves the feature works. See testing AI-generated code for the broader pattern.
Inside your perimeter
No outbound calls.·No telemetry.·No hosted control plane.
Pair Karate Agent with a local LLM via Ollama or vLLM. The entire stack — Chrome, agent server, model — runs in your Kubernetes cluster, on your hardware, with zero internet dependency. Standard deployment for financial services, insurance, and healthcare. Self-hosted AI testing covers the configuration in detail.
FAQ
LLM browser automation is the use of large language models to control web browsers — clicking, typing, navigating, validating — based on natural language or structured intent. The LLM receives either a screenshot (vision-based) or a DOM extract (DOM-first) of the page and decides what action to take next. It powers the new wave of AI-driven testing, RPA, and browsing agents.
Vision-based agents (like Claude computer use) send screenshots to the LLM and get back pixel coordinates. Tokens are expensive (images are heavy), latency is higher, and pixel accuracy is imperfect. DOM-first agents (like Karate Agent) send structured DOM representations — elements, roles, labels, states — which are 10–50× more token-efficient, faster, and deterministic. For enterprise testing, DOM-first wins on cost, speed, and reliability.
Any of them. Karate Agent is provider-agnostic: Anthropic Claude (Opus, Sonnet, Haiku), OpenAI (GPT-4, GPT-4o, GPT-5), Google Gemini (Pro, Ultra), Meta Llama (3, 3.1, 3.2, 3.3), Qwen (2, 2.5, 3), DeepSeek (R1, V3), Mistral, Gemma, GLM, Kimi, and any model served via Ollama or an OpenAI-compatible endpoint (vLLM, TGI, LM Studio).
Yes — and this is one of the strongest reasons to choose a DOM-first platform. Vision-based agents typically require frontier models (GPT-4, Claude Opus) to handle screenshot reasoning. DOM-first agents work with smaller, cheaper models because the input is structured. Llama 3.3 70B, Qwen 2.5 72B, DeepSeek V3, Mistral Large — all viable for enterprise testing. See Karate Agent for specifics.
Wide range. Vision-based agents can burn 10,000+ tokens per step at scale. DOM-first agents like Karate Agent use structured responses and look() diffing for a 72× reduction in page scans. Combined with scripted flows that consume zero tokens (the LLM is only invoked on recovery), typical enterprise test runs cost cents, not dollars.
Yes. Pair Karate Agent (Docker-deployed) with a local model served via Ollama or vLLM. The entire pipeline — browser, agent server, LLM — runs inside your firewall. No data leaves, no telemetry, no internet required. This is the standard deployment for regulated financial services and insurance customers. See self-hosted AI testing and enterprise AI testing.
Claude computer use is a vision-based, cloud-hosted agent from Anthropic. It sends screenshots to Anthropic’s servers and returns actions. Strengths: works on anything visible. Weaknesses: cloud-only, vendor-locked, token-heavy, data leaves your network. Karate Agent is self-hosted, DOM-first, LLM-agnostic, and designed for enterprise testing specifically. Different tools for different jobs — see the side-by-side.
Karate Agent exposes a karate_eval tool via MCP, so developers can drive browser tests from Claude Code, VS Code Copilot, Cursor, and other MCP-compatible clients. This turns the AI coding assistants teams already use into first-class test authoring partners — write a feature with Cursor, run a verification with Karate Agent, all from the same editor. See testing AI-generated code.
Karate Agent runs in your infrastructure, works with any LLM you choose, and stays cheap enough to run on every PR. Free to try, free to keep using.