# Production Load Balancing

## Introduction
When many cracking agents are active simultaneously, a single CipherSwarm web instance can become a bottleneck. Each agent sends periodic status updates, crack submissions, and task requests — all of which compete for the same Puma thread pool. This guide describes how to horizontally scale the web tier using nginx as a reverse proxy load balancer in front of multiple Puma replicas.
## Architecture Overview
Component roles:
- Nginx — Accepts all incoming HTTP traffic on port 80 and distributes requests across web replicas using the `least_conn` algorithm. Uses Docker's embedded DNS (127.0.0.11) to discover replicas automatically. Includes a dedicated `/cable` location for Action Cable WebSocket connections (Turbo Streams).
- Web replicas (Puma) — Each replica is an independent container running the full Rails stack. Puma serves Rails requests in clustered mode with multiple worker processes (configurable via `WEB_CONCURRENCY`, default 2) and a thread pool per worker. Nginx handles HTTP/2, compression, and asset caching at the load balancer level.
- Backend services — PostgreSQL, Redis, and Sidekiq are shared by all replicas. Rails cookie-based sessions are stateless, so no sticky sessions are required.
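The topology above can be sketched as a Compose fragment. This is illustrative only: the image names and the Sidekiq command are assumptions, and `docker-compose-production.yml` remains the canonical definition.

```yaml
# Illustrative topology sketch, not the canonical compose file
services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./docker/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - web

  web:
    image: cipherswarm-web   # hypothetical image name
    environment:
      WEB_CONCURRENCY: 2     # Puma workers per replica
    deploy:
      replicas: 9            # n+1 where n = active cracking nodes

  postgres:
    image: postgres
  redis:
    image: redis
  sidekiq:
    image: cipherswarm-web   # same image, Sidekiq entrypoint (assumed)
    command: bundle exec sidekiq
```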
## Scaling Guidelines

### The n+1 Formula
Set the number of web replicas to n + 1, where n is the number of fully active cracking nodes. This is a conservative upper bound assuming worst-case scenarios where all agents submit status updates, crack results, and task requests simultaneously:
| Active Nodes | Recommended Replicas | Rationale |
|---|---|---|
| 4 | 5 | Small deployment |
| 8 | 9 | Default configuration |
| 16 | 17 | Medium deployment |
| 32 | 33 | Large deployment |
The +1 buffer ensures that even if one replica is temporarily unhealthy or handling a slow request, the remaining replicas can absorb the load without queuing.
For typical deployments with 30-second heartbeat intervals, fewer replicas may suffice. Monitor nginx access logs (check upstream_response_time) and Puma queue depth to right-size your deployment.
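As a rough sanity check on steady-state load, the aggregate request rate stays modest even under the conservative sizing above. A back-of-the-envelope sketch (the three-requests-per-cycle figure is an assumption derived from the status/result/task traffic described in the introduction):

```shell
#!/usr/bin/env bash
# Rough steady-state request rate: assumes each agent issues ~3 requests
# (status update, crack submission, task request) per heartbeat cycle.
AGENTS=8
HEARTBEAT_SECS=30
REQS_PER_CYCLE=3
RPS=$(awk "BEGIN { printf \"%.1f\", $AGENTS * $REQS_PER_CYCLE / $HEARTBEAT_SECS }")
echo "aggregate requests/sec ~= $RPS"
```

Even at 32 agents this averages only a few requests per second; the n+1 bound exists for bursts where all agents report simultaneously.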
**Important for horizontal scaling:** Before scaling to multiple web replicas, run database migrations once using `RUN_DB_PREPARE=true` (see Step-by-Step Deployment). This prevents migration races where multiple containers might try to run migrations simultaneously.
### Resource Considerations
Each web replica is constrained to:
- CPU: 1 core (limit), 0.5 core (reservation)
- Memory: 2 GB (limit), 1 GB (reservation)
Both web and Sidekiq services need memory headroom for tmpfs mounts (up to 768 MB combined for `/tmp` and `/rails/tmp`) alongside the Ruby process. PostgreSQL also has higher limits (2 GB) to handle connection pooling from multiple web replicas.
Plan your host resources accordingly. For example, 9 web replicas require at minimum 4.5 CPU cores and 9 GB RAM reserved, with burst capacity up to 9 cores and 18 GB. See `docker-compose-production.yml` for the canonical resource definitions.
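The sizing math above is easy to recompute for any replica count. A small helper (pure arithmetic, using the per-replica limits and reservations listed in this section):

```shell
#!/usr/bin/env bash
# Host sizing for N web replicas: 0.5 CPU / 1 GB reserved, 1 CPU / 2 GB limit each
REPLICAS=9
RESERVED_CPUS=$(awk "BEGIN { print $REPLICAS * 0.5 }")
RESERVED_RAM_GB=$((REPLICAS * 1))
LIMIT_CPUS=$((REPLICAS * 1))
LIMIT_RAM_GB=$((REPLICAS * 2))
echo "reserved:    ${RESERVED_CPUS} cores, ${RESERVED_RAM_GB} GB"
echo "burst limit: ${LIMIT_CPUS} cores, ${LIMIT_RAM_GB} GB"
```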
## Configuration

### Nginx (`docker/nginx/nginx.conf`)
Key settings and their purpose:
| Setting | Value | Purpose |
|---|---|---|
| `resolver 127.0.0.11` | Docker DNS | Discovers all web replica IPs dynamically |
| `zone web_backend_zone 64k` | Shared memory | Required for the `resolve` parameter on upstream servers |
| `least_conn` | Algorithm | Sends new requests to the replica with the fewest active connections |
| `resolve` | DNS re-query | Re-resolves DNS so new/removed replicas are picked up (nginx 1.27.3+) |
| `max_fails=3` | Passive health | Marks a replica as down after 3 consecutive failures |
| `fail_timeout=30s` | Recovery window | Waits 30 s before retrying a failed replica |
| `keepalive 32` | Connection pool | Reuses TCP connections to backends for efficiency |
| `proxy_read_timeout 300s` | Long reads | Allows slow API responses in the general `/` location |
| `proxy_next_upstream` | Retry policy | Retries GET/HEAD on error, timeout, 502, 503, 504 |
| `proxy_next_upstream_timeout 30s` | Retry budget | Bounds total retry duration to prevent cascading delays |
| `client_max_body_size 100M` | Default limit | Server-level default for non-storage endpoints |
| `client_max_body_size 0` | Unlimited | Active Storage location only — allows arbitrarily large uploads |
| `proxy_buffering off` | Streaming | Active Storage location — streams downloads directly to clients |
| `/cable` location | WebSocket | Upgrades connections for Action Cable (Turbo Streams) |
| `/rails/active_storage/` location | File transfers | Unbuffered uploads/downloads with 1-hour timeouts |
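Taken together, the table's directives correspond roughly to the following upstream and server configuration. This is a sketch only: the backend port and the `valid=10s` re-resolution interval are assumptions, and `docker/nginx/nginx.conf` is canonical.

```nginx
# Sketch of the upstream and proxy settings summarized in the table above
upstream web_backend {
    zone web_backend_zone 64k;                       # shared memory, required for `resolve`
    resolver 127.0.0.11 valid=10s;                   # Docker embedded DNS, re-queried periodically
    least_conn;
    server web:3000 resolve max_fails=3 fail_timeout=30s;  # port 3000 assumed (Puma default)
    keepalive 32;
}

server {
    listen 80;
    client_max_body_size 100M;                       # server-level default

    location / {
        proxy_pass http://web_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";              # required for upstream keepalive
        proxy_read_timeout 300s;
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_timeout 30s;
    }
}
```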
**Nginx version requirement:** The `resolve` parameter in upstream blocks requires nginx >= 1.27.3 (open-sourced from nginx Plus). The default `nginx:alpine` image (currently 1.29.x) supports this. Do not pin to an older version.
### Adjusting Timeouts

The general `/` location uses `proxy_read_timeout 300s` and `proxy_send_timeout 60s`, which are sufficient for API requests and normal web traffic. The `/rails/active_storage/` location uses 1-hour timeouts (3600s) for large file uploads and downloads. If agents experience timeouts during file transfers, check that they are using the Active Storage endpoints (which have the longer timeouts). See `docker/nginx/nginx.conf` for the canonical timeout values.
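The two special-cased locations described above might look like the following sketch (directive values match this guide's tables; the Action Cable read timeout is an assumed value, and the real blocks live in `docker/nginx/nginx.conf`):

```nginx
# Active Storage: unbuffered, unlimited-size transfers with 1-hour timeouts
location /rails/active_storage/ {
    proxy_pass http://web_backend;
    client_max_body_size 0;          # allow arbitrarily large uploads
    proxy_buffering off;             # stream downloads directly to clients
    proxy_request_buffering off;
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;
}

# Action Cable: WebSocket upgrade for Turbo Streams
location /cable {
    proxy_pass http://web_backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600s;        # keep long-lived connections open (assumed value)
}
```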
## Deployment Instructions

### Prerequisites
- Docker Engine 20.10+
- Docker Compose V2
- Required environment variables:
  - `RAILS_MASTER_KEY` — Rails credentials encryption key
  - `POSTGRES_PASSWORD` — PostgreSQL root password (required; fail-fast if not set)
  - `TUSD_HOOK_SECRET` — Shared secret for authenticating tusd webhook requests (required; prevents cache poisoning attacks). Generate with `openssl rand -hex 32`
  - `APPLICATION_HOST` — Application hostname for mailers, redirects, and DNS rebinding protection
  - `MINIO_PUBLIC_IP` — Public IP for MinIO (if using MinIO storage)

Note: See Environment Variables Reference for comprehensive documentation of all configuration options, including `DISABLE_SSL` for reverse proxy setups, production validation requirements, common configuration scenarios, and troubleshooting guidance.
### Step-by-Step Deployment
1. Set environment variables:

    ```shell
    export RAILS_MASTER_KEY=your_master_key
    export POSTGRES_PASSWORD=your_secure_password
    export TUSD_HOOK_SECRET=$(openssl rand -hex 32)
    ```

2. Run database migrations (first-time setup or after updates):

    ```shell
    # Run migrations once before scaling to multiple replicas to avoid migration races
    docker compose -f docker-compose-production.yml run --rm -e RUN_DB_PREPARE=true web
    ```

    **Important:** Database migrations must be run once before starting multiple web replicas. The `RUN_DB_PREPARE=true` flag prevents migration races where multiple containers might try to run migrations simultaneously. Regular web service containers should not have this flag set.

3. Adjust the replica count (optional — edit `docker-compose-production.yml`):

    ```yaml
    web:
      deploy:
        replicas: 9 # n+1 where n = active cracking nodes
    ```

    Or pass it at the command line (see step 4).

4. Deploy the stack:

    ```shell
    docker compose -f docker-compose-production.yml up -d
    # Or with a specific replica count:
    docker compose -f docker-compose-production.yml up -d --scale web=9
    ```

5. Verify all replicas are healthy:

    ```shell
    docker compose -f docker-compose-production.yml ps
    ```

    All web replicas and the nginx service should show `healthy` status. Note that web replicas take 10-45 seconds to boot Rails, and nginx waits for at least one healthy web replica before starting.

6. Test load distribution (optional):

    ```shell
    # Send several requests and observe different upstream addresses in nginx logs
    for i in $(seq 1 10); do
      curl -s -o /dev/null -w "%{http_code}\n" http://localhost/up
    done
    ```
### Scaling Without Downtime
Scale web replicas up or down at any time without restarting other services:
```shell
# Scale to 17 replicas (for 16 active nodes)
docker compose -f docker-compose-production.yml up -d --scale web=17

# Scale down to 5 replicas
docker compose -f docker-compose-production.yml up -d --scale web=5
```
Or use the justfile shortcut:
```shell
just docker-prod-scale 17
```
Nginx's DNS re-resolution (every 10 s) automatically picks up added or removed replicas.
## Monitoring and Troubleshooting

### Checking Service Health
```shell
# Overview of all services
just docker-prod-status

# Or directly
docker compose -f docker-compose-production.yml ps
```
### Viewing Logs
```shell
# Nginx access/error logs
just docker-prod-logs-nginx

# Web replica logs (all replicas interleaved)
just docker-prod-logs-web

# Specific replica logs
docker compose -f docker-compose-production.yml logs web --index 1
```
The nginx access log includes upstream context (`upstream=`, `upstream_status=`, `upstream_response_time=`, `request_time=`) to help diagnose load distribution and slow backends.
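For a quick pass over the access log, an awk filter along these lines can surface slow backends. The sample lines below are fabricated to illustrate the field format; only the `upstream_response_time=` key is taken from the fields listed above.

```shell
#!/usr/bin/env bash
# Flag access-log lines whose upstream_response_time exceeds 1 second
printf '%s\n' \
  'GET /up upstream=172.18.0.5:3000 upstream_status=200 upstream_response_time=0.012 request_time=0.013' \
  'GET /api upstream=172.18.0.6:3000 upstream_status=200 upstream_response_time=1.734 request_time=1.736' |
awk -F'upstream_response_time=' 'NF > 1 { split($2, a, " "); if (a[1] + 0 > 1) print }'
```

On a live deployment, pipe the output of `just docker-prod-logs-nginx` through the same awk filter instead of the sample `printf`.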
### Common Issues
| Symptom | Likely Cause | Solution |
|---|---|---|
| Replicas fail to start | Insufficient host resources | Reduce replica count or increase host capacity |
| Uneven load distribution | DNS cache stale | Restart nginx: `docker compose -f docker-compose-production.yml restart nginx` |
| Connection timeouts on uploads | `proxy_read_timeout` too low | Increase timeout in `docker/nginx/nginx.conf` |
| 502 Bad Gateway | All replicas down or starting | Wait for health checks to pass; check web replica logs |
| Database connection errors | Too many connections | Tune pool size in `config/database.yml` or add PgBouncer |
| OOM kills (exit code 137) | Memory limit too low | Check `docker inspect --format='{{.State.OOMKilled}}'`; increase service limit |
| WebSocket disconnects | Missing `/cable` location | Verify `docker/nginx/nginx.conf` has the `/cable` WebSocket block |
### Performance Monitoring

- Watch nginx access logs for slow responses (`upstream_response_time` > 1 s)
- Monitor PostgreSQL connection count: `SELECT count(*) FROM pg_stat_activity;`
- Monitor Redis memory usage: `redis-cli INFO memory`
- Track Sidekiq queue depth via the Sidekiq Web UI at `/sidekiq`
## Security Considerations

### SSL/TLS Termination
The current configuration serves plain HTTP on port 80. The `DISABLE_SSL` environment variable controls Rails SSL/HTTPS enforcement (see Environment Variables Reference for detailed `DISABLE_SSL` documentation).
For deployments exposed to untrusted networks:
- **Recommended:** Place an external TLS-terminating reverse proxy (e.g., Caddy, Traefik, or a cloud load balancer) in front of the nginx service. Set `DISABLE_SSL=true` so Rails delegates SSL enforcement to the upstream proxy.
- **Self-signed certificates (typical for lab environments):** Add TLS certificates to the nginx configuration directly by mounting certs and updating the server block to listen on 443 with `ssl_certificate` and `ssl_certificate_key`. Leave `DISABLE_SSL` unset or set to empty.
- **Isolated lab environments:** The default configuration (plain HTTP) is appropriate when the deployment is not exposed to untrusted networks. Set `DISABLE_SSL=true` to prevent Rails from forcing HTTPS redirects.
### Header Forwarding
Nginx forwards `X-Real-IP`, `X-Forwarded-For`, and `X-Forwarded-Proto` headers so Rails can correctly identify client IPs and protocol. Ensure any upstream proxy also preserves these headers.
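The forwarding typically looks like the following directives inside the proxied locations (a sketch; the actual directives are in `docker/nginx/nginx.conf`):

```nginx
proxy_set_header Host              $host;
proxy_set_header X-Real-IP         $remote_addr;
proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
```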
### Rate Limiting
For deployments exposed to the public internet, consider adding nginx rate limiting:
```nginx
# Example: limit API endpoints to 10 requests/second per IP
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location /api/ {
    limit_req zone=api burst=20 nodelay;
    proxy_pass http://web_backend;
    # ... other proxy settings
}
```
## Maintenance

### Rolling Updates
To update the web application image without downtime:
```shell
# Pull the latest image
docker compose -f docker-compose-production.yml pull web

# Run migrations once before restarting replicas (if schema changed)
docker compose -f docker-compose-production.yml run --rm -e RUN_DB_PREPARE=true web

# Recreate web replicas (nginx continues serving via remaining replicas)
docker compose -f docker-compose-production.yml up -d --no-deps web
```
Nginx's passive health checks automatically route traffic away from replicas that are restarting.
### Backup Considerations
- PostgreSQL data lives in the `postgres` volume — back up with `pg_dump` or volume snapshots
- Redis data lives in the `redis` volume — back up with `redis-cli BGSAVE`
- Application storage lives in the `storage` volume — back up with your preferred method
## Scaling Automation
For dynamic scaling based on agent count, you could script the replica adjustment. Note that this requires a custom API endpoint that does not yet exist:
```shell
#!/usr/bin/env bash
# Hypothetical example — requires implementing an /api/v1/agents/active_count endpoint.
ACTIVE_AGENTS=$(curl -s http://localhost/api/v1/agents/active_count)
REPLICAS=$((ACTIVE_AGENTS + 1))
docker compose -f docker-compose-production.yml up -d --scale web=$REPLICAS
```