Cloud-based AI browser agents are impressive. Claude computer use, OpenAI Operator, Google’s experimental agents — all can drive browsers to accomplish meaningful work. Demos are compelling. Proofs of concept are fast to set up.
Enterprise production deployment is a different story. Three hard constraints stop cloud agents at the procurement gate: data residency, cost at scale, and vendor lock-in. This post is about how enterprises are running AI browser agents on their own infrastructure in 2026, and what that deployment actually looks like.
Why cloud AI agents don’t work for regulated workloads
When a cloud-based browser agent operates against your web application, it sends screenshots or DOM extracts of that application to the vendor’s inference servers. Those screenshots and DOM extracts often contain:
- Personally identifiable information (PII) — names, emails, addresses in form fields
- Financial data — account numbers, transactions, portfolio positions
- Healthcare data — patient records, appointment information
- Business-sensitive data — pricing, deal flow, strategic documents
- Session credentials — authentication tokens, cookies, session state
For regulated industries — banking, insurance, healthcare, government, defense — this is a non-starter. Compliance frameworks (GDPR, HIPAA, PCI DSS, SOX, industry-specific rules) require that sensitive data stay inside the organization’s control plane.
Even for unregulated enterprises, the risk calculus is unfavorable. Why expose your application’s UI — which is your business logic, your user experience, your competitive position — to a third party when you don’t have to?
The on-premises AI agent stack
Running AI browser agents on your own infrastructure requires three components: the agent server, the browser runtime, and the LLM. Modern tools let you deploy all three inside your network.
Component 1: the agent server
Karate Agent is the reference implementation for enterprise on-premises deployment. It ships as a Docker image. One container runs the agent server; additional containers run browser sessions on demand. Kubernetes orchestrates horizontal scale.
Component 2: the browser runtime
Headless Chrome inside Docker. Each test session runs in its own container — complete isolation, no state leakage. The agent server communicates with browsers via Chrome DevTools Protocol.
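The agent-to-browser wire format is simple: each Chrome DevTools Protocol command is a small JSON envelope sent over the browser's WebSocket debugger endpoint, and the `id` field lets the agent match asynchronous responses to commands. A minimal sketch of that framing (the helper name and the internal URL are illustrative, not part of any published agent API):

```python
import json
from itertools import count

# CDP commands are JSON objects: an incrementing id, a method, optional params.
_ids = count(1)

def cdp_message(method, **params):
    """Frame a CDP command; the id lets the caller match the async response."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})

# Two commands the agent server might send over the browser's WebSocket:
navigate = cdp_message("Page.navigate", url="https://app.internal.example")
evaluate = cdp_message("Runtime.evaluate", expression="document.title")
```

`Page.navigate` and `Runtime.evaluate` are standard CDP methods; everything stays on the wire between two containers inside your cluster.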
Component 3: the LLM
This is the part cloud-based agents dictate for you. Self-hosted AI agents let you choose:
- Cloud LLM (hybrid): keep the agent on-prem, but call out to Anthropic/OpenAI/Google for inference. Acceptable for non-sensitive workloads under existing enterprise LLM agreements.
- Self-hosted LLM (fully on-prem): run Llama, Qwen, DeepSeek, or Mistral on your own GPU infrastructure via Ollama or vLLM. No data leaves your network.
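Either way, the agent's LLM dependency reduces to an endpoint and a model name, which is why swapping cloud inference for self-hosted inference is a configuration change rather than a rewrite. A sketch of the request body for Ollama's `/api/chat` endpoint, reading the same environment variables the reference architecture below injects (the helper function is illustrative):

```python
import json
import os

# Endpoint and model come from the environment, matching the deployment
# manifest's LLM_ENDPOINT / LLM_MODEL variables.
LLM_ENDPOINT = os.environ.get("LLM_ENDPOINT", "http://ollama:11434")
LLM_MODEL = os.environ.get("LLM_MODEL", "llama3.3:70b")

def chat_request(prompt):
    """Build the body for Ollama's /api/chat endpoint (non-streaming)."""
    return {
        "model": LLM_MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

body = json.dumps(chat_request("Click the Login button, then read the error banner."))
```

Pointing `LLM_ENDPOINT` at a cloud provider's OpenAI-compatible gateway instead of an in-cluster Ollama service is the entire difference between the hybrid and fully on-prem variants.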
Reference architecture: fully on-premises
```yaml
# Kubernetes deployment (excerpt)
# Agent server
apiVersion: apps/v1
kind: Deployment
metadata:
  name: karate-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: karate-agent
  template:
    metadata:
      labels:
        app: karate-agent
    spec:
      containers:
        - name: agent
          image: karatelabs/agent:latest
          env:
            - name: LLM_PROVIDER
              value: ollama
            - name: LLM_ENDPOINT
              value: http://ollama:11434
            - name: LLM_MODEL
              value: llama3.3:70b
---
# LLM runtime (GPU node)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
spec:
  serviceName: ollama
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      nodeSelector:
        node.kubernetes.io/gpu: "true"
      containers:
        - name: ollama
          image: ollama/ollama:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
```
This stack runs enterprise AI browser testing with zero external dependencies. No internet required after deployment. No telemetry. No model calls phoning home.
Hardware planning
Sizing depends on which LLM you run and how many concurrent sessions you need.
LLM hardware (the expensive part)
- Llama 3.1 8B (Q4): runs on a single RTX 4070 Ti (~12GB VRAM). Fine for routine workloads.
- Llama 3.3 70B (Q4): runs on an A100 40GB with spill, or 2×RTX 4090. Good for most enterprise workloads.
- Qwen 2.5 72B (FP16): needs 144GB VRAM, multi-GPU (2×A100 80GB or 4×A100 40GB).
- DeepSeek V3: even larger, typically needs H100 nodes.
A single H100 node serves dozens of concurrent test sessions. A smaller RTX 4090 workstation handles departmental-scale QA.
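The sizing numbers above follow from simple arithmetic: weights take roughly 0.5 bytes per parameter at Q4 quantization and 2 bytes at FP16, plus runtime overhead (KV cache, activations) of around 20%. A back-of-envelope calculator under those assumptions (the overhead factor is a rough rule of thumb, not a vendor spec):

```python
# Approximate bytes per parameter at common quantization levels.
BYTES_PER_PARAM = {"q4": 0.5, "q8": 1.0, "fp16": 2.0}

def vram_estimate_gb(params_billion, quant, overhead=1.2):
    """Weights footprint times a fudge factor for KV cache and activations."""
    return params_billion * BYTES_PER_PARAM[quant] * overhead

# 70B at Q4: ~35 GB of weights, ~42 GB with runtime overhead —
# hence "A100 40GB with spill, or 2x RTX 4090".
print(round(vram_estimate_gb(70, "q4")))                 # 42
# 72B at FP16, weights only: 144 GB.
print(round(vram_estimate_gb(72, "fp16", overhead=1.0)))  # 144
```

The same arithmetic shows why an 8B Q4 model (under 5 GB with overhead) fits comfortably on a 12 GB consumer card.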
Agent + browser containers (the cheap part)
- Agent server: 2 vCPU, 4GB RAM for hundreds of concurrent sessions
- Browser containers: ~1GB RAM each, roughly 0.5 vCPU under load
- Typical enterprise worker (16GB RAM, 8 vCPU): 10-20 concurrent sessions
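Per-worker concurrency is bounded by whichever resource runs out first. A sketch of the sizing math, using the per-session figures above plus an assumed ~20% headroom for the OS and the agent process itself:

```python
def sessions_per_worker(ram_gb, vcpus,
                        ram_per_session_gb=1.0,
                        vcpu_per_session=0.5,
                        headroom=0.8):
    """Concurrent browser sessions a worker supports: the binding constraint
    is whichever of RAM or CPU is exhausted first, less headroom."""
    by_ram = (ram_gb * headroom) / ram_per_session_gb
    by_cpu = (vcpus * headroom) / vcpu_per_session
    return int(min(by_ram, by_cpu))

# The typical enterprise worker above: 16GB RAM, 8 vCPU.
print(sessions_per_worker(16, 8))  # 12
```

That lands inside the quoted 10-20 range; tune the headroom factor against observed memory pressure rather than trusting the defaults.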
Operations and observability
Self-hosted AI agents integrate with standard enterprise ops tooling:
- Logs: structured JSON to stdout → Fluentd → Elastic / Splunk / Loki
- Metrics: Prometheus scrape endpoints → Grafana dashboards
- Traces: OpenTelemetry-compatible for distributed tracing
- Reports: archived to S3 / Azure Blob / GCS with retention policies
- Alerts: on failure-rate spikes, session pool exhaustion, GPU utilization, LLM latency
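The alerting items above translate directly into Prometheus recording conventions. A sketch of two such rules — the metric names here are illustrative, not a published schema; substitute whatever your agent deployment actually exports:

```yaml
# prometheus-rules.yaml (sketch; metric names are assumptions)
groups:
  - name: agent-alerts
    rules:
      - alert: AgentFailureRateHigh
        expr: rate(agent_sessions_failed_total[5m]) / rate(agent_sessions_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 10% of agent sessions failing"
      - alert: LLMLatencyHigh
        expr: histogram_quantile(0.95, rate(llm_request_duration_seconds_bucket[5m])) > 30
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 LLM inference latency above 30s"
```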
Security posture
Standard enterprise security hygiene applies:
- Image provenance verified via signatures (Cosign, Notary)
- Containers run as non-root with minimum capabilities
- Network policies restrict egress (critical for air-gapped deployments)
- SSO via SAML/OIDC against enterprise IdP
- Role-based access control governs who can trigger tests, view reports, modify configuration
- Secrets (API keys, TLS certificates) managed via enterprise vault integration
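The egress restriction in particular is worth spelling out, because it is what makes the "no model calls phoning home" claim enforceable rather than aspirational. A default-deny NetworkPolicy sketch for browser session pods — namespace and label names are illustrative; add an egress rule for your application under test:

```yaml
# Default-deny egress for browser session pods: allow only the agent
# server and in-cluster DNS. Add a rule for the app under test.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: browser-egress-lockdown
  namespace: agent-testing        # illustrative namespace
spec:
  podSelector:
    matchLabels:
      app: browser-session        # illustrative label
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: karate-agent
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
```

With this in place, a misconfigured agent that tries to reach an external inference endpoint fails loudly instead of leaking silently.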
Who’s running this in production
Financial services customers run the pattern we’ve described, typically with Llama 3.3 70B self-hosted, for trading platform and core banking regression. Insurance customers do the same for Guidewire deployments. Healthcare customers run against EHR front-ends. In each case, the combination of on-premises deployment + self-hosted LLM + Docker-native architecture removes all the regulatory blockers that stop cloud agents.
Where to start
- Read the self-hosted AI testing landing page for deployment architectures
- Review the enterprise evaluation page for procurement answers
- Pilot with Karate Agent using a cloud LLM first, then swap to self-hosted in week two
- Scale to production over a quarter
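The week-one-to-week-two swap in the pilot step can be sketched as a minimal Compose file. The `LLM_*` variable names follow the reference architecture above; the exposed port, the API-key variable, and the provider values are assumptions to be checked against the agent's documentation:

```yaml
# docker-compose.yml - week-one pilot with a cloud LLM; change the three
# LLM_* values to point at self-hosted inference in week two.
services:
  agent:
    image: karatelabs/agent:latest
    ports:
      - "8080:8080"                       # assumed agent UI/API port
    environment:
      LLM_PROVIDER: anthropic             # week two: ollama
      LLM_ENDPOINT: https://api.anthropic.com   # week two: http://ollama:11434
      LLM_MODEL: <your-cloud-model>       # week two: llama3.3:70b
      LLM_API_KEY: ${LLM_API_KEY}         # assumed; not needed once self-hosted
```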
Enterprise AI browser testing doesn’t require the cloud. It requires the right architecture. And by late 2026, the on-premises pattern is the dominant one for teams that care about data, cost, and control.