Bring Your Own LLM | Open-Weight AI Testing With Gemma 4 & Qwen 3.6

The principle

Three reasons BYO LLM matters

No lock-in. A bundled LLM is a hidden lock-in. When the vendor changes models, raises prices, or changes terms, you have no recourse — the only escape is a full platform migration. BYO LLM means the model is yours; the platform is just the runtime calling it.

Cost control. Your enterprise probably has negotiated Anthropic, OpenAI, or Azure agreements. BYO LLM means tests run under those agreements, not a per-seat reseller markup. For self-hosted open-source models, marginal cost is essentially zero once the hardware is in place.

Data sovereignty. Bundled LLMs route your test data through the vendor’s pipeline by default. BYO LLM gives you the choice: cloud LLM under your contract, or self-hosted model where nothing ever leaves your network. Regulated industries can finally adopt AI testing without an exception request.

Supported providers

Pick from anywhere

Karate Agent talks to anything that speaks OpenAI-compatible HTTP or Anthropic Messages format. That covers virtually every cloud provider, every aggregator, and every self-hosted inference server worth running.

Cloud — first-party

•Anthropic Claude Opus / Sonnet / Haiku 4
•OpenAI GPT-5, GPT-4o, GPT-4 Turbo
•Google Gemini 2.5 Pro / Flash / Ultra
•Azure OpenAI deployment-pinned
•AWS Bedrock via Anthropic-compatible endpoint

Aggregators

•OpenRouter 200+ models, one API
•Together AI open-source at scale
•Groq high-throughput inference
•Fireworks production OSS hosting
•DeepInfra, Replicate & others

Self-hosted

•Ollama easy local serving
•vLLM production inference
•TGI HuggingFace text-gen
•LM Studio desktop dev
•Any OpenAI-compatible endpoint

Benchmark-verified open-weight

Verified on open-weight models, not just frontier APIs.

Gemma 4 26B and Qwen 3.6 35B-A3B (Apache 2.0) both pass the full karate-agent UI-automation benchmark suite: form fills, vision reads, data extraction, deep navigation, and an end-to-end purchase. At parity, each on a single 24 GB GPU via Ollama, or via OpenRouter. Your model, your key, your network.

Benchmark-verified

Gemma 4 26B

Google · 4B active, MoE · vision + tool calling

google/gemma-4-26b-a4b-it

Benchmark-verified

Qwen 3.6 35B-A3B

Alibaba · 3B active, MoE · Apache 2.0 · vision + tool calling

qwen/qwen3.6-35b-a3b

Two verified models from two vendors, so your open-weight path never rests on a single one. Both ship in the console’s zero-config picker, with Gemma 4 as the default.

Everything else still runs through any OpenAI-compatible endpoint: Llama, DeepSeek, Mistral, GLM, Kimi, and 200+ more via OpenRouter. Smaller models work for routine flows, see the cost section below.

In practice

Switch providers in one config line

Same test code, same scenarios, different LLM. Compare cost, latency, and reasoning quality without rewriting anything.

# karate-agent.yaml

# Anthropic Claude (cloud, first-party)
llm:
  provider: anthropic
  model: claude-sonnet-4-5
  api_key: ${ANTHROPIC_API_KEY}

# Or: OpenRouter aggregator (200+ models, one bill)
llm:
  provider: openai           # OpenAI-compatible
  base_url: https://openrouter.ai/api/v1
  model: qwen/qwen-2.5-72b-instruct
  api_key: ${OPENROUTER_API_KEY}

# Or: local Llama via Ollama (zero outbound calls)
llm:
  provider: openai
  base_url: http://ollama:11434/v1
  model: llama3.3:70b

That’s the entire switch. Test files don’t change. Scenarios don’t change. Re-run, observe the difference, choose what fits your CI budget.

Mix-and-match

Different models for different jobs

Routine regression doesn’t need a frontier model. Complex acceptance scenarios shouldn’t run on the cheap one. Route per-test for optimal cost-quality.

Fast & cheap

Gemma 4 26B (local)

Benchmark-verified. Routine regression, smoke tests, happy-path verification. $0 marginal on your own 24 GB GPU.

Use for: 80% of CI runs

Balanced

Claude Sonnet / Qwen 3.6 35B-A3B

Moderate complexity, recovery from unexpected states. Sweet spot for most tests.

Use for: standard acceptance, integration

Maximum reasoning

Claude Opus / GPT-5

Long multi-step flows, ambiguous UIs, exploratory testing. Highest quality, highest cost.

Use for: critical-path E2E, exploratory runs

Cost at scale

10,000 test runs / month

Karate Agent’s DOM-first architecture keeps token usage low. These are real-world ballparks for an average mid-complexity suite.

Provider / Model	Monthly cost	Notes
Claude Sonnet 4.5	$200–500	Strong default; cheap, fast, smart
GPT-4 Turbo	$300–600	Comparable quality, slightly higher cost
Gemini 2.5 Flash	$50–150	Very cheap, good for routine
OpenRouter / Qwen 3.6 35B-A3B	$80–200	Benchmark-verified, open-weight via aggregator
Self-hosted Gemma 4 or Qwen 3.6	~$0 marginal	Benchmark-verified, after hardware amortization

For comparison: vision-based agents on the same workload typically run $2,000–8,000/month at frontier model pricing. The architectural difference is the entire reason BYO LLM is even practical for testing.

Data handling

What the LLM actually sees

Karate Agent doesn’t send screenshots to the LLM. It sends a compact structured DOM extract: interactive elements, their roles, labels, ARIA states. The LLM gets enough to make decisions, never sees raw page content or visual data.

For cloud LLMs, this structured data flows through the provider under your enterprise agreement. Sensitive data scrubbed at the agent layer (configurable redaction patterns) never reaches the model.

For self-hosted LLMs, none of it leaves your network. The agent server, the browser, the LLM, the reports — all behind your firewall. See self-hosted AI testing for the deployment story.

FAQ

Frequently asked questions

What does “bring your own LLM” mean for AI testing?

BYO LLM means the testing platform doesn’t ship with a bundled model or force you to use a vendor’s cloud API. You choose the LLM provider — Claude, GPT, Gemini, Llama, Qwen, DeepSeek — and the platform calls it using your credentials. You control cost, compliance, and provider choice.

Why is BYO LLM important?

Three reasons: (1) No lock-in — switch providers without changing tests; (2) Cost control — your enterprise LLM agreement, your negotiated pricing, your budget; (3) Data sovereignty — when paired with self-hosting, your test data never touches a vendor’s infrastructure.

Which LLM providers does Karate Agent support?

All of them. Cloud: Anthropic Claude, OpenAI GPT-4/4o, Google Gemini, Azure OpenAI. Aggregators: OpenRouter (200+ models), Together AI, Groq. Self-hosted: Ollama, vLLM, LM Studio, TGI, any OpenAI-compatible endpoint. Open-weight: Gemma 4 26B and Qwen 3.6 35B-A3B (Apache 2.0) are benchmark-verified across the full UI-automation suite, each on a single 24 GB GPU; Llama, DeepSeek, Mistral, GLM and Kimi also run.

How do I choose the right LLM for my testing workload?

Three factors: (1) Cost per test run — varies 100× across providers; (2) Latency — affects total test run duration; (3) Reasoning quality — matters for complex test scenarios and recovery from unexpected states. We recommend benchmarking 2–3 models against your specific workload. See LLM browser automation for technical depth.

Can I use different LLMs for different tests?

Yes. Configure per-test or per-suite LLM routing. Pattern: use a cheap fast model (Llama 3.1 8B, Gemini Flash) for routine regression, and a high-quality model (Claude Opus, GPT-5) for complex acceptance scenarios. Optimize cost and quality simultaneously.

What about API costs at enterprise scale?

Depends heavily on model and test volume. Ballpark for 10,000 test runs/month with Karate Agent’s DOM-first architecture: Claude Sonnet ~$200–500, GPT-4 Turbo ~$300–600, Llama 3.3 70B self-hosted ~$0 marginal (hardware amortized). Self-hosted open-source models are typically 10–100× cheaper at scale.

Does the LLM see my application data?

Only what Karate Agent sends — which is structured DOM (interactive elements, labels, roles, state), not screenshots or raw page content. For cloud LLMs, this data flows through the provider under their data handling agreement. For self-hosted LLMs, data never leaves your infrastructure.

Can I use open-source LLMs in production?

Yes — and increasingly this is the dominant pattern for enterprise deployments. Gemma 4 26B and Qwen 3.6 35B-A3B (Apache 2.0) are benchmark-verified across the full UI-automation suite and each run on a single 24 GB GPU, which makes them the safest open-weight starting point. Llama 3.3 70B, DeepSeek V3 and Mistral also handle enterprise test workloads, though only Gemma 4 and Qwen 3.6 carry the benchmark stamp.

Any model. Any provider.
Your bill, your data.

Three reasons BYO LLM matters

Pick from anywhere

Verified on open-weight models, not just frontier APIs.

Switch providers in one config line

Different models for different jobs

Gemma 4 26B (local)

Claude Sonnet / Qwen 3.6 35B-A3B

Claude Opus / GPT-5

10,000 test runs / month

What the LLM actually sees

Frequently asked questions

Your model. Your bill.
Your data, in your perimeter.

Any model. Any provider. Your bill, your data.

Three reasons BYO LLM matters

Pick from anywhere

Verified on open-weight models, not just frontier APIs.

Switch providers in one config line

Different models for different jobs

Gemma 4 26B (local)

Claude Sonnet / Qwen 3.6 35B-A3B

Claude Opus / GPT-5

10,000 test runs / month

What the LLM actually sees

Frequently asked questions

Your model. Your bill.Your data, in your perimeter.

Any model. Any provider.
Your bill, your data.

Your model. Your bill.
Your data, in your perimeter.