# Production Load Balancing

## Introduction
When many cracking agents are active simultaneously, a single CipherSwarm web instance can become a bottleneck. Each agent sends periodic status updates, crack submissions, and task requests — all of which compete for the same Puma thread pool. This guide describes how to horizontally scale the web tier using nginx as a reverse proxy load balancer in front of multiple Puma replicas.
## Architecture Overview
Component roles:
- Nginx — Accepts all incoming HTTP traffic on port 80 and distributes requests across web replicas using the `least_conn` algorithm. Uses Docker's embedded DNS (127.0.0.11) to discover replicas automatically. Includes a dedicated `/cable` location for Action Cable WebSocket connections (Turbo Streams).
- Web replicas (Puma) — Each replica is an independent container running the full Rails stack. Puma serves Rails requests in clustered mode with multiple worker processes (configurable via `WEB_CONCURRENCY`, default 2) and a thread pool per worker. Nginx handles HTTP/2, compression, and asset caching at the load balancer level.
- Backend services — PostgreSQL, Redis, and Sidekiq are shared by all replicas. Rails cookie-based sessions are stateless, so no sticky sessions are required.
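The topology above can be sketched as a Compose fragment. This is illustrative only: the image names and the Sidekiq command are assumptions, and `docker-compose-production.yml` remains the canonical definition.

```yaml
# Illustrative topology sketch, not the canonical compose file
services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./docker/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - web

  web:
    image: cipherswarm-web   # hypothetical image name
    environment:
      WEB_CONCURRENCY: 2     # Puma workers per replica
    deploy:
      replicas: 9            # n+1 where n = active cracking nodes

  postgres:
    image: postgres
  redis:
    image: redis
  sidekiq:
    image: cipherswarm-web   # same image, Sidekiq entrypoint (assumed)
    command: bundle exec sidekiq
```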
## Scaling Guidelines

### The n+1 Formula
Set the number of web replicas to n + 1, where n is the number of fully active cracking nodes. This is a conservative upper bound assuming worst-case scenarios where all agents submit status updates, crack results, and task requests simultaneously:
| Active Nodes | Recommended Replicas | Rationale |
|---|---|---|
| 4 | 5 | Small deployment |
| 8 | 9 | Default configuration |
| 16 | 17 | Medium deployment |
| 32 | 33 | Large deployment |
The +1 buffer ensures that even if one replica is temporarily unhealthy or handling a slow request, the remaining replicas can absorb the load without queuing.
For typical deployments with 30-second heartbeat intervals, fewer replicas may suffice. Monitor nginx access logs (check upstream_response_time) and Puma queue depth to right-size your deployment.
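As a rough sanity check on steady-state load, the aggregate request rate stays modest even under the conservative sizing above. A back-of-the-envelope sketch (the three-requests-per-cycle figure is an assumption derived from the status/result/task traffic described in the introduction):

```shell
#!/usr/bin/env bash
# Rough steady-state request rate: assumes each agent issues ~3 requests
# (status update, crack submission, task request) per heartbeat cycle.
AGENTS=8
HEARTBEAT_SECS=30
REQS_PER_CYCLE=3
RPS=$(awk "BEGIN { printf \"%.1f\", $AGENTS * $REQS_PER_CYCLE / $HEARTBEAT_SECS }")
echo "aggregate requests/sec ~= $RPS"
```

Even at 32 agents this averages only a few requests per second; the n+1 bound exists for bursts where all agents report simultaneously.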
**Important for horizontal scaling:** Before scaling to multiple web replicas, run database migrations once using `RUN_DB_PREPARE=true` (see Step-by-Step Deployment). This prevents migration races where multiple containers might try to run migrations simultaneously.
### Resource Considerations
Each web replica is constrained to:
- CPU: 1 core (limit), 0.5 core (reservation)
- Memory: 2 GB (limit), 1 GB (reservation)
Both web and Sidekiq services need memory headroom for tmpfs mounts (up to 768 MB combined for `/tmp` and `/rails/tmp`) alongside the Ruby process. PostgreSQL also has higher limits (2 GB) to handle connection pooling from multiple web replicas.
Plan your host resources accordingly. For example, 9 web replicas require at minimum 4.5 CPU cores and 9 GB RAM reserved, with burst capacity up to 9 cores and 18 GB. See `docker-compose-production.yml` for the canonical resource definitions.
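The sizing math above is easy to recompute for any replica count. A small helper (pure arithmetic, using the per-replica limits and reservations listed in this section):

```shell
#!/usr/bin/env bash
# Host sizing for N web replicas: 0.5 CPU / 1 GB reserved, 1 CPU / 2 GB limit each
REPLICAS=9
RESERVED_CPUS=$(awk "BEGIN { print $REPLICAS * 0.5 }")
RESERVED_RAM_GB=$((REPLICAS * 1))
LIMIT_CPUS=$((REPLICAS * 1))
LIMIT_RAM_GB=$((REPLICAS * 2))
echo "reserved:    ${RESERVED_CPUS} cores, ${RESERVED_RAM_GB} GB"
echo "burst limit: ${LIMIT_CPUS} cores, ${LIMIT_RAM_GB} GB"
```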
## Configuration

### Nginx (`docker/nginx/nginx.conf`)
Key settings and their purpose:
| Setting | Value | Purpose |
|---|---|---|
| `resolver 127.0.0.11` | Docker DNS | Discovers all web replica IPs dynamically |
| `zone web_backend_zone 64k` | Shared memory | Required for the `resolve` parameter on upstream servers |
| `least_conn` | Algorithm | Sends new requests to the replica with the fewest active connections |
| `resolve` | DNS re-query | Re-resolves DNS so new/removed replicas are picked up (nginx 1.27.3+) |
| `max_fails=3` | Passive health | Marks a replica as down after 3 consecutive failures |
| `fail_timeout=30s` | Recovery window | Waits 30 s before retrying a failed replica |
| `keepalive 32` | Connection pool | Reuses TCP connections to backends for efficiency |
| `proxy_read_timeout 300s` | Long reads | Allows slow API responses in the general `/` location |
| `proxy_next_upstream` | Retry policy | Retries GET/HEAD on error, timeout, 502, 503, 504 |
| `proxy_next_upstream_timeout 30s` | Retry budget | Bounds total retry duration to prevent cascading delays |
| `client_max_body_size 100M` | Default limit | Server-level default for non-storage endpoints |
| `client_max_body_size 0` | Unlimited | Active Storage location only — allows arbitrarily large uploads |
| `proxy_buffering off` | Streaming | Active Storage location — streams downloads directly to clients |
| `/cable` location | WebSocket | Upgrades connections for Action Cable (Turbo Streams) |
| `/rails/active_storage/` location | File transfers | Unbuffered uploads/downloads with 1-hour timeouts |
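Taken together, the table's directives correspond roughly to the following upstream and server configuration. This is a sketch only: the backend port and the `valid=10s` re-resolution interval are assumptions, and `docker/nginx/nginx.conf` is canonical.

```nginx
# Sketch of the upstream and proxy settings summarized in the table above
upstream web_backend {
    zone web_backend_zone 64k;                       # shared memory, required for `resolve`
    resolver 127.0.0.11 valid=10s;                   # Docker embedded DNS, re-queried periodically
    least_conn;
    server web:3000 resolve max_fails=3 fail_timeout=30s;  # port 3000 assumed (Puma default)
    keepalive 32;
}

server {
    listen 80;
    client_max_body_size 100M;                       # server-level default

    location / {
        proxy_pass http://web_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";              # required for upstream keepalive
        proxy_read_timeout 300s;
        proxy_next_upstream error timeout http_502 http_503 http_504;
        proxy_next_upstream_timeout 30s;
    }
}
```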
**Nginx version requirement:** The `resolve` parameter in upstream blocks requires nginx >= 1.27.3 (open-sourced from nginx Plus). The default `nginx:alpine` image (currently 1.29.x) supports this. Do not pin to an older version.
### Adjusting Timeouts

The general `/` location uses `proxy_read_timeout 300s` and `proxy_send_timeout 60s`, which are sufficient for API requests and normal web traffic. The `/rails/active_storage/` location uses 1-hour timeouts (3600s) for large file uploads and downloads. If agents experience timeouts during file transfers, check that they are using the Active Storage endpoints (which have the longer timeouts). See `docker/nginx/nginx.conf` for the canonical timeout values.
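The two special-cased locations described above might look like the following sketch (directive values match this guide's tables; the Action Cable read timeout is an assumed value, and the real blocks live in `docker/nginx/nginx.conf`):

```nginx
# Active Storage: unbuffered, unlimited-size transfers with 1-hour timeouts
location /rails/active_storage/ {
    proxy_pass http://web_backend;
    client_max_body_size 0;          # allow arbitrarily large uploads
    proxy_buffering off;             # stream downloads directly to clients
    proxy_request_buffering off;
    proxy_read_timeout 3600s;
    proxy_send_timeout 3600s;
}

# Action Cable: WebSocket upgrade for Turbo Streams
location /cable {
    proxy_pass http://web_backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600s;        # keep long-lived connections open (assumed value)
}
```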
## Deployment Instructions

### Prerequisites
- Docker Engine 20.10+
- Docker Compose V2
- Required environment variables:
  - `RAILS_MASTER_KEY` — Rails credentials encryption key
  - `POSTGRES_PASSWORD` — PostgreSQL root password (required; fail-fast if not set)
  - `TUSD_HOOK_SECRET` — Shared secret for authenticating tusd webhook requests (required; prevents cache poisoning attacks). Generate with `openssl rand -hex 32`
  - `APPLICATION_HOST` — Application hostname for mailers, redirects, and DNS rebinding protection
  - `MINIO_PUBLIC_IP` — Public IP for MinIO (if using MinIO storage)

Note: See Environment Variables Reference for comprehensive documentation of all configuration options, including `DISABLE_SSL` for reverse proxy setups, production validation requirements, common configuration scenarios, and troubleshooting guidance.
### Step-by-Step Deployment
1. Set environment variables:

    ```shell
    export RAILS_MASTER_KEY=your_master_key
    export POSTGRES_PASSWORD=your_secure_password
    export TUSD_HOOK_SECRET=$(openssl rand -hex 32)
    ```

2. Run database migrations (first-time setup or after updates):

    ```shell
    # Run migrations once before scaling to multiple replicas to avoid migration races
    docker compose -f docker-compose-production.yml run --rm -e RUN_DB_PREPARE=true web
    ```

    **Important:** Database migrations must be run once before starting multiple web replicas. The `RUN_DB_PREPARE=true` flag prevents migration races where multiple containers might try to run migrations simultaneously. Regular web service containers should not have this flag set.

3. Adjust the replica count (optional — edit `docker-compose-production.yml`):

    ```yaml
    web:
      deploy:
        replicas: 9 # n+1 where n = active cracking nodes
    ```

    Or pass it at the command line (see step 4).

4. Deploy the stack:

    ```shell
    docker compose -f docker-compose-production.yml up -d
    # Or with a specific replica count:
    docker compose -f docker-compose-production.yml up -d --scale web=9
    ```

5. Verify all replicas are healthy:

    ```shell
    docker compose -f docker-compose-production.yml ps
    ```

    All web replicas and the nginx service should show `healthy` status. Note that web replicas take 10-45 seconds to boot Rails, and nginx waits for at least one healthy web replica before starting.

6. Test load distribution (optional):

    ```shell
    # Send several requests and observe different upstream addresses in nginx logs
    for i in $(seq 1 10); do
      curl -s -o /dev/null -w "%{http_code}\n" http://localhost/up
    done
    ```
### Scaling Without Downtime
Scale web replicas up or down at any time without restarting other services:
```shell
# Scale to 17 replicas (for 16 active nodes)
docker compose -f docker-compose-production.yml up -d --scale web=17

# Scale down to 5 replicas
docker compose -f docker-compose-production.yml up -d --scale web=5
```
Or use the justfile shortcut:
```shell
just docker-prod-scale 17
```
Nginx's DNS re-resolution (every 10 s) automatically picks up added or removed replicas.
## Monitoring and Troubleshooting

### Checking Service Health
```shell
# Overview of all services
just docker-prod-status

# Or directly
docker compose -f docker-compose-production.yml ps
```
### Viewing Logs
```shell
# Nginx access/error logs
just docker-prod-logs-nginx

# Web replica logs (all replicas interleaved)
just docker-prod-logs-web

# Specific replica logs
docker compose -f docker-compose-production.yml logs web --index 1
```
The nginx access log includes upstream context (`upstream=`, `upstream_status=`, `upstream_response_time=`, `request_time=`) to help diagnose load distribution and slow backends.
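For a quick pass over the access log, an awk filter along these lines can surface slow backends. The sample lines below are fabricated to illustrate the field format; only the `upstream_response_time=` key is taken from the fields listed above.

```shell
#!/usr/bin/env bash
# Flag access-log lines whose upstream_response_time exceeds 1 second
printf '%s\n' \
  'GET /up upstream=172.18.0.5:3000 upstream_status=200 upstream_response_time=0.012 request_time=0.013' \
  'GET /api upstream=172.18.0.6:3000 upstream_status=200 upstream_response_time=1.734 request_time=1.736' |
awk -F'upstream_response_time=' 'NF > 1 { split($2, a, " "); if (a[1] + 0 > 1) print }'
```

On a live deployment, pipe the output of `just docker-prod-logs-nginx` through the same awk filter instead of the sample `printf`.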
### Common Issues
| Symptom | Likely Cause | Solution |
|---|---|---|
| Replicas fail to start | Insufficient host resources | Reduce replica count or increase host capacity |
| Uneven load distribution | DNS cache stale | Restart nginx: `docker compose -f docker-compose-production.yml restart nginx` |
| Connection timeouts on uploads | `proxy_read_timeout` too low | Increase timeout in `docker/nginx/nginx.conf` |
| 502 Bad Gateway | All replicas down or starting | Wait for health checks to pass; check web replica logs |
| Database connection errors | Too many connections | Tune pool size in `config/database.yml` or add PgBouncer |
| OOM kills (exit code 137) | Memory limit too low | Check `docker inspect --format='{{.State.OOMKilled}}'`; increase service limit |
| WebSocket disconnects | Missing `/cable` location | Verify `docker/nginx/nginx.conf` has the `/cable` WebSocket block |
### Performance Monitoring

- Watch nginx access logs for slow responses (`upstream_response_time` > 1 s)
- Monitor PostgreSQL connection count: `SELECT count(*) FROM pg_stat_activity;`
- Monitor Redis memory usage: `redis-cli INFO memory`
- Track Sidekiq queue depth via the Sidekiq Web UI at `/sidekiq`
## Security Considerations

### SSL/TLS Termination
The current configuration serves plain HTTP on port 80. The `DISABLE_SSL` environment variable controls Rails SSL/HTTPS enforcement (see Environment Variables Reference for detailed `DISABLE_SSL` documentation).
For deployments exposed to untrusted networks:
- **Recommended:** Place an external TLS-terminating reverse proxy (e.g., Caddy, Traefik, or a cloud load balancer) in front of the nginx service. Set `DISABLE_SSL=true` so Rails delegates SSL enforcement to the upstream proxy.
- **Self-signed certificates (typical for lab environments):** Add TLS certificates to the nginx configuration directly by mounting certs and updating the server block to listen on 443 with `ssl_certificate` and `ssl_certificate_key`. Leave `DISABLE_SSL` unset or set to empty.
- **Isolated lab environments:** The default configuration (plain HTTP) is appropriate when the deployment is not exposed to untrusted networks. Set `DISABLE_SSL=true` to prevent Rails from forcing HTTPS redirects.
### Header Forwarding
Nginx forwards `X-Real-IP`, `X-Forwarded-For`, and `X-Forwarded-Proto` headers so Rails can correctly identify client IPs and protocol. Ensure any upstream proxy also preserves these headers.
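The forwarding typically looks like the following directives inside the proxied locations (a sketch; the actual directives are in `docker/nginx/nginx.conf`):

```nginx
proxy_set_header Host              $host;
proxy_set_header X-Real-IP         $remote_addr;
proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
```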
### Rate Limiting
For deployments exposed to the public internet, consider adding nginx rate limiting:
```nginx
# Example: limit API endpoints to 10 requests/second per IP
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location /api/ {
    limit_req zone=api burst=20 nodelay;
    proxy_pass http://web_backend;
    # ... other proxy settings
}
```
## Maintenance

### Rolling Updates
To update the web application image without downtime:
```shell
# Pull the latest image
docker compose -f docker-compose-production.yml pull web

# Run migrations once before restarting replicas (if schema changed)
docker compose -f docker-compose-production.yml run --rm -e RUN_DB_PREPARE=true web

# Recreate web replicas (nginx continues serving via remaining replicas)
docker compose -f docker-compose-production.yml up -d --no-deps web
```
Nginx's passive health checks automatically route traffic away from replicas that are restarting.
### Backup Considerations
- PostgreSQL data lives in the `postgres` volume — back up with `pg_dump` or volume snapshots
- Redis data lives in the `redis` volume — back up with `redis-cli BGSAVE`
- Application storage lives in the `storage` volume — back up with your preferred method
## Scaling Automation
For dynamic scaling based on agent count, you could script the replica adjustment. Note that this requires a custom API endpoint that does not yet exist:
```shell
#!/usr/bin/env bash
# Hypothetical example — requires implementing an /api/v1/agents/active_count endpoint.
ACTIVE_AGENTS=$(curl -s http://localhost/api/v1/agents/active_count)
REPLICAS=$((ACTIVE_AGENTS + 1))
docker compose -f docker-compose-production.yml up -d --scale web=$REPLICAS
```