Bring Your Own LLM

Any model. Any provider.
Your bill, your data.

AI testing tools that bundle their own LLM lock you into their model choice, their pricing, and their data handling. Karate Agent doesn’t. Plug in Claude, GPT, Llama, Qwen, or anything you can serve over HTTP — switch any time, switch per test, never call us about token costs again.

The principle

Three reasons BYO LLM matters

No lock-in. A bundled LLM is a hidden lock-in. When the vendor changes models, raises prices, or changes terms, you have no recourse — the only escape is a full platform migration. BYO LLM means the model is yours; the platform is just the runtime calling it.

Cost control. Your enterprise probably has negotiated Anthropic, OpenAI, or Azure agreements. BYO LLM means tests run under those agreements, not a per-seat reseller markup. For self-hosted open-source models, marginal cost is essentially zero once the hardware is in place.

Data sovereignty. Bundled LLMs route your test data through the vendor’s pipeline by default. BYO LLM gives you the choice: cloud LLM under your contract, or self-hosted model where nothing ever leaves your network. Regulated industries can finally adopt AI testing without an exception request.

Supported providers

Pick from anywhere

Karate Agent talks to anything that speaks OpenAI-compatible HTTP or Anthropic Messages format. That covers virtually every cloud provider, every aggregator, and every self-hosted inference server worth running.

Cloud — first-party

  • Anthropic Claude Opus / Sonnet / Haiku 4
  • OpenAI GPT-5, GPT-4o, GPT-4 Turbo
  • Google Gemini 2.5 Pro / Flash / Ultra
  • Azure OpenAI deployment-pinned
  • AWS Bedrock via Anthropic-compatible endpoint

Aggregators

  • OpenRouter 200+ models, one API
  • Together AI open-source at scale
  • Groq high-throughput inference
  • Fireworks production OSS hosting
  • DeepInfra, Replicate & others

Self-hosted

  • Ollama easy local serving
  • vLLM production inference
  • TGI HuggingFace text-gen
  • LM Studio desktop dev
  • Any OpenAI-compatible endpoint

Open-source models proven in production: Llama 3.3 70B, Qwen 2.5 72B, DeepSeek V3, Mistral Large, Gemma 2 27B, GLM 4, Kimi K1. Smaller models work for routine flows — see the cost section below.

In practice

Switch providers in one config line

Same test code, same scenarios, different LLM. Compare cost, latency, and reasoning quality without rewriting anything.

# karate-agent.yaml

# Anthropic Claude (cloud, first-party)
llm:
  provider: anthropic
  model: claude-sonnet-4-5
  api_key: ${ANTHROPIC_API_KEY}

# Or: OpenRouter aggregator (200+ models, one bill)
llm:
  provider: openai           # OpenAI-compatible
  base_url: https://openrouter.ai/api/v1
  model: qwen/qwen-2.5-72b-instruct
  api_key: ${OPENROUTER_API_KEY}

# Or: local Llama via Ollama (zero outbound calls)
llm:
  provider: openai
  base_url: http://ollama:11434/v1
  model: llama3.3:70b

That’s the entire switch. Test files don’t change. Scenarios don’t change. Re-run, observe the difference, choose what fits your CI budget.

Mix-and-match

Different models for different jobs

Routine regression doesn’t need a frontier model. Complex acceptance scenarios shouldn’t run on the cheap one. Route per-test for optimal cost-quality.

Fast & cheap

Llama 3.1 8B (local)

Routine regression, smoke tests, happy-path verification. <200ms per step, $0 marginal.

Use for: 80% of CI runs

Balanced

Claude Sonnet / Qwen 2.5 72B

Moderate complexity, recovery from unexpected states. Sweet spot for most tests.

Use for: standard acceptance, integration

Maximum reasoning

Claude Opus / GPT-5

Long multi-step flows, ambiguous UIs, exploratory testing. Highest quality, highest cost.

Use for: critical-path E2E, exploratory runs

Cost at scale

10,000 test runs / month

Karate Agent’s DOM-first architecture keeps token usage low. These are real-world ballparks for an average mid-complexity suite.

Provider / Model Monthly cost Notes
Claude Sonnet 4.5$200–500Strong default; cheap, fast, smart
GPT-4 Turbo$300–600Comparable quality, slightly higher cost
Gemini 2.5 Flash$50–150Very cheap, good for routine
OpenRouter / Qwen 2.5 72B$80–200Open-source via aggregator
Self-hosted Llama 3.3 70B~$0 marginalAfter hardware amortization

For comparison: vision-based agents on the same workload typically run $2,000–8,000/month at frontier model pricing. The architectural difference is the entire reason BYO LLM is even practical for testing.

Data handling

What the LLM actually sees

Karate Agent doesn’t send screenshots to the LLM. It sends a compact structured DOM extract: interactive elements, their roles, labels, ARIA states. The LLM gets enough to make decisions, never sees raw page content or visual data.

For cloud LLMs, this structured data flows through the provider under your enterprise agreement. Sensitive data scrubbed at the agent layer (configurable redaction patterns) never reaches the model.

For self-hosted LLMs, none of it leaves your network. The agent server, the browser, the LLM, the reports — all behind your firewall. See self-hosted AI testing for the deployment story.

FAQ

Frequently asked questions

What does “bring your own LLM” mean for AI testing?

BYO LLM means the testing platform doesn’t ship with a bundled model or force you to use a vendor’s cloud API. You choose the LLM provider — Claude, GPT, Gemini, Llama, Qwen, DeepSeek — and the platform calls it using your credentials. You control cost, compliance, and provider choice.

Why is BYO LLM important?

Three reasons: (1) No lock-in — switch providers without changing tests; (2) Cost control — your enterprise LLM agreement, your negotiated pricing, your budget; (3) Data sovereignty — when paired with self-hosting, your test data never touches a vendor’s infrastructure.

Which LLM providers does Karate Agent support?

All of them. Cloud: Anthropic Claude, OpenAI GPT-4/4o, Google Gemini, Azure OpenAI. Aggregators: OpenRouter (200+ models), Together AI, Groq. Self-hosted: Ollama, vLLM, LM Studio, TGI, any OpenAI-compatible endpoint. Open-source models: Llama, Qwen, DeepSeek, Mistral, Gemma, GLM, Kimi.

How do I choose the right LLM for my testing workload?

Three factors: (1) Cost per test run — varies 100× across providers; (2) Latency — affects total test run duration; (3) Reasoning quality — matters for complex test scenarios and recovery from unexpected states. We recommend benchmarking 2–3 models against your specific workload. See LLM browser automation for technical depth.

Can I use different LLMs for different tests?

Yes. Configure per-test or per-suite LLM routing. Pattern: use a cheap fast model (Llama 3.1 8B, Gemini Flash) for routine regression, and a high-quality model (Claude Opus, GPT-5) for complex acceptance scenarios. Optimize cost and quality simultaneously.

What about API costs at enterprise scale?

Depends heavily on model and test volume. Ballpark for 10,000 test runs/month with Karate Agent’s DOM-first architecture: Claude Sonnet ~$200–500, GPT-4 Turbo ~$300–600, Llama 3.3 70B self-hosted ~$0 marginal (hardware amortized). Self-hosted open-source models are typically 10–100× cheaper at scale.

Does the LLM see my application data?

Only what Karate Agent sends — which is structured DOM (interactive elements, labels, roles, state), not screenshots or raw page content. For cloud LLMs, this data flows through the provider under their data handling agreement. For self-hosted LLMs, data never leaves your infrastructure.

Can I use open-source LLMs in production?

Yes — and increasingly this is the dominant pattern for enterprise deployments. Llama 3.3 70B, Qwen 2.5 72B, DeepSeek V3, and Mistral Large handle enterprise test workloads well. Smaller models (Qwen 2.5 32B, Llama 3.1 8B) work for routine flows with good prompt engineering.

Your model. Your bill.
Your data, in your perimeter.

Karate Agent gives you the runtime; you bring the LLM. Switch any time, mix per test, or self-host the entire stack.