Documents
performance
performance
Type
External
Status
Published
Created
Mar 25, 2026
Updated
Apr 4, 2026
Updated by
Dosu Bot

Performance#

Pipelock adds microseconds of overhead per request. The proxy is I/O bound (waiting for upstream responses), not CPU bound. For the request-side URL scanning hot path, CPU is never the bottleneck. Response scanning and MCP scanning on large payloads can use measurable CPU at high throughput (see tables below).

All numbers from Go benchmarks on AMD Ryzen 7 7800X3D (8 cores / 16 threads) / Go 1.25 / Linux. Run make bench to reproduce on your hardware. See benchmarks.md for raw ns/op data.

Scanning Latency (single request)#

URL Scanning (fetch/forward proxy hot path)#

11-layer pipeline: scheme, CRLF injection, path traversal, blocklist, DLP, path entropy, subdomain entropy, SSRF, rate limit, URL length, data budget.

OperationLatencyThroughput (1 core)
Full pipeline (allowed URL)~32 μs~31,000/sec
Blocklist block (early exit)~2 μs~500,000/sec
DLP pattern match (47 patterns, pre-filtered)~8 μs~130,000/sec
DLP pre-filter only (clean text, zero alloc)~400 ns~2,500,000/sec
Entropy detection~58 μs~17,000/sec
Complex URL (ports, query params)~60 μs~17,000/sec

MCP Scanning (tool call/response inspection)#

JSON-RPC parsing + text extraction + prompt injection pattern matching.

OperationLatencyThroughput (1 core)
Clean tool response~78 μs~13,000/sec
Injection detected (early exit)~36 μs~28,000/sec
Text extraction~2.5 μs~400,000/sec

Response Scanning (fetched content injection detection)#

Pattern matching against 25 prompt injection patterns (including 6 state/control patterns and 4 CJK-language patterns) on fetched page content.

OperationLatencyThroughput (1 core)
Short clean text (~90B)~76 μs~13,000/sec
10KB clean text~8.4 ms~120/sec
Injection detected (early exit)~42 μs~24,000/sec
State/control clean~134 μs~7,500/sec

The keyword pre-filter (added in v1.3.0) short-circuits regex evaluation when no injection keywords are present in the normalized text. This cut clean-text latency by 29%, large-content latency by 27%, and injection-detected latency by 3.1x (early keyword match skips later normalization passes). The 10KB response scan remains the current ceiling due to 6 sequential normalization passes. Content size tiering (skipping passes 3-6 for large content) is planned.

Supporting Operations#

OperationLatency
Unicode normalization (DLP mode)~950 ns
Unicode normalization (matching mode)~1.3 μs
Unicode normalization (tool text mode)~2.0 μs
Shannon entropy calculation~2.2 μs
Domain matching (exact)~50 ns
Domain matching (wildcard)~53 ns

Concurrent Scaling#

The scanner's core detection pipeline (scheme, blocklist, DLP, entropy, SSRF) is stateless per request with no shared mutable state. Config reads use atomic pointer swap. Rate limiting and data budget tracking use per-scanner mutexes, but these are low-contention (one lock acquisition per request). Benchmarks below are run with rate limiting and data budget disabled to isolate scanning throughput.

Parallel throughput (b.RunParallel)#

These benchmarks run across all available goroutines simultaneously, measuring total operations per second as parallelism increases.

URL Scanning:

GOMAXPROCSns/opThroughputScaling vs 1
144,13522,700/sec1.0x
223,05243,400/sec1.9x
412,35680,900/sec3.6x
87,177139,300/sec6.1x
166,500153,800/sec6.8x

DLP Block (early exit):

GOMAXPROCSns/opThroughputScaling vs 1
17,625131,100/sec1.0x
24,017248,900/sec1.9x
42,204453,700/sec3.5x
81,414707,200/sec5.4x
161,184844,600/sec6.4x

Response Scanning (short content):

GOMAXPROCSns/opThroughputScaling vs 1
187,81811,400/sec1.0x
245,76721,800/sec1.9x
423,97841,700/sec3.7x
814,62868,400/sec6.0x
1612,90077,500/sec6.8x

Response Scanning (10KB content):

GOMAXPROCSns/opThroughputScaling vs 1
111,780,29585/sec1.0x
26,657,276150/sec1.8x
43,093,228323/sec3.8x
81,898,905527/sec6.2x
161,928,156519/sec6.1x

MCP Scanning (clean response):

GOMAXPROCSns/opThroughputScaling vs 1
187,76411,400/sec1.0x
423,54042,500/sec3.7x
813,44274,400/sec6.5x
1611,51086,900/sec7.6x

Blocklist (early exit):

GOMAXPROCSns/opThroughputScaling vs 1
12,139467,500/sec1.0x
21,132883,400/sec1.9x
46331,580,300/sec3.4x
84232,364,100/sec5.1x
163642,747,300/sec5.9x

Concurrent throughput scaling (goroutine ramp)#

Sustained 2-second runs at increasing goroutine counts. Measures total operations completed, not per-goroutine latency.

URL Scan:

GoroutinesOps/secScaling
119,4661.0x
237,1221.9x
467,7223.5x
8106,3215.5x
16121,3376.2x
32115,8756.0x
64123,9596.4x

Response Scan:

GoroutinesOps/secScaling
18,2841.0x
216,1351.9x
431,4173.8x
852,4056.3x
1662,7767.6x
3266,5758.0x
6465,4707.9x

The pattern: near-linear scaling up to physical core count (8), small gains from hyperthreading (16), then plateau. No degradation past core count. Adding more concurrent agents doesn't slow anything down, you just stop getting additional throughput once all cores are saturated.

HTTP Proxy Overhead#

Raw HTTP handler throughput measured with hey against the running proxy.

ConcurrencyRequestsReq/secP50P99
502,00043,4740.5 ms18.5 ms
20010,000102,6000.7 ms23.2 ms
50020,00097,2682.0 ms51.9 ms

This measures HTTP accept/parse/route/respond overhead. Actual scanning latency adds the per-operation costs from the tables above.

CPU Cost at Scale#

How much CPU does scanning consume at various request rates? These numbers cover scanning overhead only, not network I/O.

Request-side scanning (URL + MCP)#

Request rateCPU (URL scan)CPU (MCP scan)
100/sec0.4% of 1 core0.9% of 1 core
1,000/sec3.7% of 1 core8.9% of 1 core
10,000/sec37% of 1 core0.9 cores
100,000/sec3.7 cores8.9 cores

Response-side scanning#

Request rateCPU (short ~90B)CPU (10KB content)
100/sec0.8% of 1 core1.2 cores
1,000/sec8.1% of 1 core12.1 cores

Response scanning is the most CPU-intensive path. At high throughput with large payloads, it dominates. For request-side scanning only, 1,000 requests per second uses less than 15% of a single CPU core. Network latency (waiting for upstream HTTP responses) dominates total request time by orders of magnitude.

Deployment Sizing#

DeploymentExpected loadCPU recommendation
Single developer (local proxy)1-10 req/secAny (negligible overhead)
Team sidecar (per-agent)10-100 req/sec0.1 CPU, 64MB RAM
Shared proxy (small org)100-1,000 req/sec0.5 CPU, 128MB RAM
Platform deployment10,000+ req/sec2+ CPU, 256MB RAM

The binary is ~18MB static (release build with symbol stripping). Memory usage is dominated by the DLP regex compilation (~40MB RSS at idle with default patterns) and scales linearly with concurrent connections, not pattern count.

Design Decisions That Affect Performance#

Early exit on block. Blocked URLs short-circuit at the first failing layer. Blocklist hits resolve in ~2μs. DLP matches exit before DNS resolution.

Pre-DNS checks. CRLF injection, path traversal, allowlist, blocklist, and DLP checks all execute before any network call. This prevents secret exfiltration via DNS queries and keeps the fast path fast.

Stateless detection pipeline. Each scan allocates its own working state. The core detection layers (scheme through SSRF) have no shared mutable state, enabling linear scaling with cores. Rate limiting and data budget use per-scanner mutexes but are low-contention.

Fire-and-forget event emission. Webhook events use an async buffered channel. Syslog is UDP. Neither blocks the scanning pipeline.

Atomic config reload. Hot-reload swaps the entire scanner via atomic.Pointer, so scanning never blocks on config changes.

Reproducing These Numbers#

# Full benchmark suite (sequential)
make bench

# Parallel scaling (URL scanner)
go test -bench=BenchmarkParallel -benchtime=3s -cpu=1,2,4,8,16 ./internal/scanner/

# Parallel scaling (MCP scanner)
go test -bench=BenchmarkParallel -benchtime=3s -cpu=1,4,8,16 ./internal/mcp/

# Concurrent throughput scaling test (~28s)
PIPELOCK_BENCH_SCALING=1 go test -v -run=TestConcurrentThroughputScaling ./internal/scanner/

# HTTP proxy overhead (requires running pipelock instance)
hey -n 10000 -c 200 http://localhost:8888/health