AI testing tools that bundle their own LLM lock you into their model choice, their pricing, and their data handling. Karate Agent doesn’t. Plug in Claude, GPT, Llama, Qwen, or anything you can serve over HTTP — switch any time, switch per test, never call us about token costs again.
The principle
No lock-in. A bundled LLM is a hidden lock-in. When the vendor changes models, raises prices, or changes terms, you have no recourse — the only escape is a full platform migration. BYO LLM means the model is yours; the platform is just the runtime calling it.
Cost control. Your enterprise probably has negotiated Anthropic, OpenAI, or Azure agreements. BYO LLM means tests run under those agreements, not a per-seat reseller markup. For self-hosted open-source models, marginal cost is essentially zero once the hardware is in place.
Data sovereignty. Bundled LLMs route your test data through the vendor’s pipeline by default. BYO LLM gives you the choice: cloud LLM under your contract, or self-hosted model where nothing ever leaves your network. Regulated industries can finally adopt AI testing without an exception request.
Supported providers
Karate Agent talks to anything that speaks OpenAI-compatible HTTP or Anthropic Messages format. That covers virtually every cloud provider, every aggregator, and every self-hosted inference server worth running.
Cloud — first-party
Aggregators
Self-hosted
Open-source models proven in production: Llama 3.3 70B, Qwen 2.5 72B, DeepSeek V3, Mistral Large, Gemma 2 27B, GLM 4, Kimi K1. Smaller models work for routine flows — see the cost section below.
In practice
Same test code, same scenarios, different LLM. Compare cost, latency, and reasoning quality without rewriting anything.
# karate-agent.yaml
# Anthropic Claude (cloud, first-party)
llm:
provider: anthropic
model: claude-sonnet-4-5
api_key: ${ANTHROPIC_API_KEY}
# Or: OpenRouter aggregator (200+ models, one bill)
llm:
provider: openai # OpenAI-compatible
base_url: https://openrouter.ai/api/v1
model: qwen/qwen-2.5-72b-instruct
api_key: ${OPENROUTER_API_KEY}
# Or: local Llama via Ollama (zero outbound calls)
llm:
provider: openai
base_url: http://ollama:11434/v1
model: llama3.3:70b
That’s the entire switch. Test files don’t change. Scenarios don’t change. Re-run, observe the difference, choose what fits your CI budget.
Mix-and-match
Routine regression doesn’t need a frontier model. Complex acceptance scenarios shouldn’t run on the cheap one. Route per-test for optimal cost-quality.
Fast & cheap
Routine regression, smoke tests, happy-path verification. <200ms per step, $0 marginal.
Use for: 80% of CI runs
Balanced
Moderate complexity, recovery from unexpected states. Sweet spot for most tests.
Use for: standard acceptance, integration
Maximum reasoning
Long multi-step flows, ambiguous UIs, exploratory testing. Highest quality, highest cost.
Use for: critical-path E2E, exploratory runs
Cost at scale
Karate Agent’s DOM-first architecture keeps token usage low. These are real-world ballparks for an average mid-complexity suite.
| Provider / Model | Monthly cost | Notes |
|---|---|---|
| Claude Sonnet 4.5 | $200–500 | Strong default; cheap, fast, smart |
| GPT-4 Turbo | $300–600 | Comparable quality, slightly higher cost |
| Gemini 2.5 Flash | $50–150 | Very cheap, good for routine |
| OpenRouter / Qwen 2.5 72B | $80–200 | Open-source via aggregator |
| Self-hosted Llama 3.3 70B | ~$0 marginal | After hardware amortization |
For comparison: vision-based agents on the same workload typically run $2,000–8,000/month at frontier model pricing. The architectural difference is the entire reason BYO LLM is even practical for testing.
Data handling
Karate Agent doesn’t send screenshots to the LLM. It sends a compact structured DOM extract: interactive elements, their roles, labels, ARIA states. The LLM gets enough to make decisions, never sees raw page content or visual data.
For cloud LLMs, this structured data flows through the provider under your enterprise agreement. Sensitive data scrubbed at the agent layer (configurable redaction patterns) never reaches the model.
For self-hosted LLMs, none of it leaves your network. The agent server, the browser, the LLM, the reports — all behind your firewall. See self-hosted AI testing for the deployment story.
FAQ
BYO LLM means the testing platform doesn’t ship with a bundled model or force you to use a vendor’s cloud API. You choose the LLM provider — Claude, GPT, Gemini, Llama, Qwen, DeepSeek — and the platform calls it using your credentials. You control cost, compliance, and provider choice.
Three reasons: (1) No lock-in — switch providers without changing tests; (2) Cost control — your enterprise LLM agreement, your negotiated pricing, your budget; (3) Data sovereignty — when paired with self-hosting, your test data never touches a vendor’s infrastructure.
All of them. Cloud: Anthropic Claude, OpenAI GPT-4/4o, Google Gemini, Azure OpenAI. Aggregators: OpenRouter (200+ models), Together AI, Groq. Self-hosted: Ollama, vLLM, LM Studio, TGI, any OpenAI-compatible endpoint. Open-source models: Llama, Qwen, DeepSeek, Mistral, Gemma, GLM, Kimi.
Three factors: (1) Cost per test run — varies 100× across providers; (2) Latency — affects total test run duration; (3) Reasoning quality — matters for complex test scenarios and recovery from unexpected states. We recommend benchmarking 2–3 models against your specific workload. See LLM browser automation for technical depth.
Yes. Configure per-test or per-suite LLM routing. Pattern: use a cheap fast model (Llama 3.1 8B, Gemini Flash) for routine regression, and a high-quality model (Claude Opus, GPT-5) for complex acceptance scenarios. Optimize cost and quality simultaneously.
Depends heavily on model and test volume. Ballpark for 10,000 test runs/month with Karate Agent’s DOM-first architecture: Claude Sonnet ~$200–500, GPT-4 Turbo ~$300–600, Llama 3.3 70B self-hosted ~$0 marginal (hardware amortized). Self-hosted open-source models are typically 10–100× cheaper at scale.
Only what Karate Agent sends — which is structured DOM (interactive elements, labels, roles, state), not screenshots or raw page content. For cloud LLMs, this data flows through the provider under their data handling agreement. For self-hosted LLMs, data never leaves your infrastructure.
Yes — and increasingly this is the dominant pattern for enterprise deployments. Llama 3.3 70B, Qwen 2.5 72B, DeepSeek V3, and Mistral Large handle enterprise test workloads well. Smaller models (Qwen 2.5 32B, Llama 3.1 8B) work for routine flows with good prompt engineering.
Karate Agent gives you the runtime; you bring the LLM. Switch any time, mix per test, or self-host the entire stack.