Browser. Agent. LLM. Reports. Every component runs on infrastructure you control. No outbound calls, no telemetry, no hosted control plane — the only AI testing architecture regulated industries can actually deploy.
Why self-host
Regulatory compliance. Financial services, insurance, and healthcare have data-residency, processing-jurisdiction, and audit-trail requirements that no SaaS testing vendor can satisfy by writing them on a marketing page. The compliance officer wants the application data to stay inside the perimeter. That’s self-hosted.
IP protection. Test data is often production-like data — or actual production data anonymized. Sending it through a third-party LLM, however well-meaning the vendor, introduces an attack surface most security teams will never sign off on. Self-hosted removes the question.
Cost predictability at scale. Cloud testing tools price per-seat or per-test. Self-hosted is amortized infrastructure cost. At 50,000+ test runs/month, the break-even moves decisively toward self-hosted, even before factoring in negotiated hardware pricing.
The architecture
Agent server. Lightweight orchestrator. Receives test requests, manages browser sessions, coordinates LLM calls. Footprint: ~500MB RAM, minimal CPU.
Browser. One headless Chrome per test session. Pinned versions, ephemeral state, no cross-contamination. Footprint: ~1GB each, run in parallel.
LLM. Ollama, vLLM, or TGI serving Llama / Qwen / DeepSeek. The reasoning brain of the system. Footprint: 64–140GB VRAM for a 70B model.
Reports. HTML reports, JUnit XML, Cucumber JSON, session video. Stored on your S3-compatible bucket or NFS. No vendor blob store.
Zero outbound calls. Once deployed, the system makes no network requests outside your cluster. No phone-home telemetry, no license-server pings, no auto-update calls. The only network traffic runs from the agent to the LLM endpoint, from the agent and its browsers to the application under test, and from the agent to your report storage, all inside your perimeter.
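On Kubernetes, the zero-outbound property can also be enforced at the network layer rather than taken on trust. The following is a minimal sketch, not a shipped manifest: the `ai-testing` and `app-under-test` namespaces, the `app: karate-agent` and `app: ollama` labels, and port 443 for the application are all assumptions for illustration. It also assumes your CNI plugin enforces NetworkPolicy, and that browser pods (if they run separately from the agent) get an equivalent policy.

```yaml
# Hypothetical names and labels; adjust to your cluster.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: karate-agent-egress
  namespace: ai-testing
spec:
  podSelector:
    matchLabels:
      app: karate-agent
  policyTypes:
    - Egress
  egress:
    # Agent -> local LLM endpoint (Ollama's default port)
    - to:
        - podSelector:
            matchLabels:
              app: ollama
      ports:
        - protocol: TCP
          port: 11434
    # Agent -> application under test (assumed namespace and port)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: app-under-test
      ports:
        - protocol: TCP
          port: 443
    # Cluster DNS, so in-cluster service names resolve
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Because the policy lists `Egress` under `policyTypes`, anything not matched by a rule is dropped, so an unexpected outbound call fails instead of succeeding silently. If reports are written to an in-cluster S3-compatible store, that destination needs its own rule.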
In practice
The shape of a real production deployment. Configure once, scale by adding nodes.
# docker-compose.yml (or equivalent Kubernetes manifest)
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest   # pin a specific tag in production
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1              # 1x H100 or A100
              capabilities: [gpu]   # required by Compose for GPU reservations
    volumes:
      - ./models:/root/.ollama      # model weights; preload for air-gapped hosts
    command: "serve"
  karate-agent:
    image: karatelabs/karate-agent:latest
    depends_on: [ollama]
    environment:
      # Point at the local Ollama, not a cloud LLM
      LLM_BASE_URL: http://ollama:11434/v1
      LLM_MODEL: llama3.3:70b
    ports: ["8080:8080"]
Same pattern in Kubernetes: a Deployment for Ollama with GPU resource requests, a Deployment for the agent, a Service routing between them. Standard ops — no special operators or proprietary CRDs.
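A rough Kubernetes equivalent of the Compose file might look like the sketch below. The images, ports, and `LLM_*` variables come from the Compose example; everything else is an assumption, including the `ollama-models` PersistentVolumeClaim, the replica counts, and the presence of the NVIDIA device plugin that exposes the `nvidia.com/gpu` resource.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            limits:
              nvidia.com/gpu: 1        # needs the NVIDIA device plugin
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models   # hypothetical PVC holding model weights
---
apiVersion: v1
kind: Service
metadata:
  name: ollama                         # resolves as http://ollama:11434
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: karate-agent
spec:
  replicas: 1
  selector:
    matchLabels:
      app: karate-agent
  template:
    metadata:
      labels:
        app: karate-agent
    spec:
      containers:
        - name: karate-agent
          image: karatelabs/karate-agent:latest
          ports:
            - containerPort: 8080
          env:
            - name: LLM_BASE_URL
              value: http://ollama:11434/v1
            - name: LLM_MODEL
              value: llama3.3:70b
```

Naming the Service `ollama` keeps the agent's `LLM_BASE_URL` identical to the Compose setup, so the same configuration works in both environments.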
Hardware
| Component | Lean dev | Team prod | Enterprise scale |
|---|---|---|---|
| Agent + browsers | 2 vCPU / 4GB | 8 vCPU / 16GB | K8s, autoscaling |
| LLM (small) | RTX 4090 + Llama 3.1 8B | RTX 4090 + Qwen 32B | — |
| LLM (production) | — | A100 80GB + Llama 70B | 2–4× H100 + 70B model |
| Parallel sessions | ~10 | ~50 | 500+ |
The agent and browsers scale linearly with CPU/RAM — standard ops. The LLM is the budget item. A single H100 can serve dozens of concurrent test sessions; most enterprise deployments need 2–4 GPUs for sustained CI throughput.
Security properties
Application data. Never leaves the cluster. The agent talks to your app, extracts the DOM, and sends only structured data to the local LLM. No screenshots are transmitted off-host.
Secrets. Test credentials, API tokens, and other secrets all live in your secrets manager. Karate Agent reads them via standard Kubernetes Secrets, HashiCorp Vault, or AWS Secrets Manager.
Reports and logs. HTML, video, JUnit XML, and Cucumber JSON are all written to your storage. Datadog, Splunk, Elastic, S3-compatible buckets: pick yours.
Telemetry. None. Karate Agent does not phone home, does not call out to license servers, and does not report anonymous usage stats. Zero outbound connections by design.
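For the Kubernetes Secrets route mentioned in this section, one common wiring is to expose a secret to the agent container as an environment variable. The fragment below is only a sketch of that standard Kubernetes pattern: the `APP_TEST_PASSWORD` variable, the `app-test-credentials` Secret, and its `password` key are hypothetical names, and the source does not specify how Karate Agent expects credentials to be named.

```yaml
# Fragment of the karate-agent container spec (all names are hypothetical)
env:
  - name: APP_TEST_PASSWORD
    valueFrom:
      secretKeyRef:
        name: app-test-credentials   # Secret managed by your cluster, not the vendor
        key: password
```

The credential value never appears in the manifest or the image; it is resolved by the kubelet when the pod starts.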
Observability
Karate Agent emits structured JSON logs (parseable by anything), Prometheus metrics (success rate, latency, LLM token usage), and JUnit XML + HTML reports per run.
Tested integrations: Datadog, Grafana / Loki, Splunk, Elastic Stack, New Relic. Any tool that consumes JSON logs or Prometheus metrics works without agent-side changes.
For air-gapped deployments specifically, the metrics endpoint stays on-cluster — you scrape into your local Prometheus, no cloud telemetry endpoint required.
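A local Prometheus could scrape the agent with a job along these lines. This is a sketch under assumptions: the source does not state the metrics path or port, so `/metrics` and `karate-agent:8080` are guesses based on Prometheus convention and the port published in the Compose example.

```yaml
# prometheus.yml fragment for an on-cluster Prometheus
scrape_configs:
  - job_name: karate-agent
    metrics_path: /metrics                 # assumed; Prometheus default path
    scrape_interval: 30s
    static_configs:
      - targets: ["karate-agent:8080"]     # assumed host:port inside the cluster
```

Both the scraper and the target resolve inside the cluster, so no metrics traffic crosses the perimeter.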
FAQ
What is self-hosted AI testing?
Self-hosted AI testing runs the entire test automation pipeline (agent, browser, LLM) on infrastructure you control. No calls to a vendor’s cloud for orchestration. No application data or screenshots sent to third parties for model inference. For regulated industries, self-hosted is the only viable deployment model.
How does air-gapped differ from self-hosted?
Air-gapped is a specific variant of self-hosted where the system has no internet connectivity at all. Self-hosted can still reach external LLMs (Claude, GPT) over the internet if you choose. Air-gapped means the LLM runs locally too, typically via Ollama or vLLM with open-source models. See bring your own LLM.
What hardware does it need?
The agent server is lightweight (~500MB RAM, minimal CPU). The browser containers are standard headless Chrome (~1GB each, scaling with parallelism). Local LLMs are the heaviest component: a 70B model needs 64–140GB VRAM depending on quantization. A single A100 or H100 can serve dozens of concurrent test sessions; smaller deployments use an RTX 4090 or similar.
Can I self-host the agent and still use a cloud LLM?
Yes, in a self-hosted-but-cloud-LLM pattern: the agent runs on your infrastructure but calls out to Anthropic or OpenAI for inference. Data that goes to the LLM (structured DOM, not screenshots) flows through their APIs under your existing enterprise agreement. Good fit for teams without data residency constraints.
Which open-source models work well?
Strong performers in 2026 for DOM-first testing: Llama 3.3 70B, Qwen 2.5 72B, DeepSeek V3, Mistral Large 2. Smaller viable models for cost-sensitive deployments: Qwen 2.5 32B, Llama 3.1 8B (with careful prompt engineering). See LLM browser automation for the full matrix.
How is it deployed?
Two containers: ollama/ollama for the model server, karatelabs/karate-agent for the test agent. Karate Agent’s config points to the Ollama endpoint instead of a cloud API. Kubernetes manifests and Docker Compose are both well supported.
Does it integrate with existing observability tooling?
Yes. Karate Agent emits structured logs in JSON, exports Prometheus metrics, and produces JUnit XML + HTML reports per run. It integrates naturally with Datadog, Grafana, Splunk, Elastic, or any standard observability stack.
How does self-hosted compare with cloud-based tools?
Self-hosted trades operational complexity for data sovereignty, cost control, and regulatory compliance. Cloud-based tools are easier to start with (no infra), but expose you to data residency risk, vendor lock-in, and per-token pricing surprises at scale. For enterprise and regulated industries, self-hosted is the dominant pattern. See enterprise AI testing.
Standard Kubernetes manifests, standard Docker Compose, standard observability integrations. The most enterprise-friendly AI testing deployment available.