production-load-balancing
Type: External
Status: Published
Created: Feb 27, 2026
Updated: Mar 26, 2026
Updated by: Dosu Bot

Production Load Balancing#

Introduction#

When many cracking agents are active simultaneously, a single CipherSwarm web instance can become a bottleneck. Each agent sends periodic status updates, crack submissions, and task requests — all of which compete for the same Puma thread pool. This guide describes how to horizontally scale the web tier using nginx as a reverse proxy load balancer in front of multiple Puma replicas.

Architecture Overview#

Component roles:

  • Nginx — Accepts all incoming HTTP traffic on port 80 and distributes requests across web replicas using the least_conn algorithm. Uses Docker's embedded DNS (127.0.0.11) to discover replicas automatically. Handles HTTP/2, compression, and asset caching at the load-balancer level, and includes a dedicated /cable location for Action Cable WebSocket connections (Turbo Streams).
  • Web replicas (Puma) — Each replica is an independent container running the full Rails stack. Puma serves requests in clustered mode with multiple worker processes (configurable via WEB_CONCURRENCY, default 2) and a thread pool per worker.
  • Backend services — PostgreSQL, Redis, and Sidekiq are shared by all replicas. Rails cookie-based sessions are stateless, so no sticky sessions are required.
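A minimal sketch of how these components map onto Compose services (illustrative only; the image names and versions here are placeholders, and docker-compose-production.yml remains the canonical definition):

```yaml
services:
  nginx:
    image: nginx:alpine
    ports:
      - "80:80"               # single public entry point
    volumes:
      - ./docker/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
  web:
    image: cipherswarm-web    # placeholder image name
    environment:
      WEB_CONCURRENCY: "2"    # Puma worker processes per replica
    deploy:
      replicas: 9             # n+1 (see Scaling Guidelines)
  postgres:
    image: postgres           # shared by all web replicas
  redis:
    image: redis              # shared by all web replicas
```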

Scaling Guidelines#

The n+1 Formula#

Set the number of web replicas to n + 1, where n is the number of fully active cracking nodes. This is a conservative upper bound assuming worst-case scenarios where all agents submit status updates, crack results, and task requests simultaneously:

| Active Nodes | Recommended Replicas | Rationale |
|---|---|---|
| 4 | 5 | Small deployment |
| 8 | 9 | Default configuration |
| 16 | 17 | Medium deployment |
| 32 | 33 | Large deployment |

The +1 buffer ensures that even if one replica is temporarily unhealthy or handling a slow request, the remaining replicas can absorb the load without queuing.

For typical deployments with 30-second heartbeat intervals, fewer replicas may suffice. Monitor nginx access logs (check upstream_response_time) and Puma queue depth to right-size your deployment.

Important for horizontal scaling: Before scaling to multiple web replicas, run database migrations once using RUN_DB_PREPARE=true (see Step-by-Step Deployment). This prevents migration races where multiple containers might try to run migrations simultaneously.

Resource Considerations#

Each web replica is constrained to:

  • CPU: 1 core (limit), 0.5 core (reservation)
  • Memory: 2 GB (limit), 1 GB (reservation)

Both web and Sidekiq services need memory headroom for tmpfs mounts (up to 768 MB combined for /tmp and /rails/tmp) alongside the Ruby process. PostgreSQL also has higher limits (2 GB) to handle connection pooling from multiple web replicas.

Plan your host resources accordingly. For example, 9 web replicas require at minimum 4.5 CPU cores and 9 GB RAM reserved, with burst capacity up to 9 cores and 18 GB. See docker-compose-production.yml for the canonical resource definitions.
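The arithmetic above generalizes to any replica count. A quick sketch (per-replica figures are taken from the limits and reservations listed in this section; it excludes nginx, PostgreSQL, Redis, and Sidekiq, and docker-compose-production.yml remains the source of truth):

```python
def host_requirements(replicas: int) -> dict:
    """Reserved and burst host resources for a given web replica count.

    Per-replica figures from this guide: 0.5 CPU / 1 GB reserved,
    1 CPU / 2 GB limit.
    """
    return {
        "cpu_reserved": replicas * 0.5,
        "mem_reserved_gb": replicas * 1,
        "cpu_limit": replicas * 1,
        "mem_limit_gb": replicas * 2,
    }

# 9 replicas: 4.5 cores / 9 GB reserved, bursting to 9 cores / 18 GB
print(host_requirements(9))
```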

Configuration#

Nginx (docker/nginx/nginx.conf)#

Key settings and their purpose:

| Setting | Value | Purpose |
|---|---|---|
| resolver 127.0.0.11 | Docker DNS | Discovers all web replica IPs dynamically |
| zone web_backend_zone 64k | Shared memory | Required for the resolve parameter on upstream servers |
| least_conn | Algorithm | Sends new requests to the replica with the fewest active connections |
| resolve | DNS re-query | Re-resolves DNS so new/removed replicas are picked up (nginx 1.27.3+) |
| max_fails=3 | Passive health | Marks a replica as down after 3 consecutive failures |
| fail_timeout=30s | Recovery window | Waits 30 s before retrying a failed replica |
| keepalive 32 | Connection pool | Reuses TCP connections to backends for efficiency |
| proxy_read_timeout 300s | Long reads | Allows slow API responses in the general / location |
| proxy_next_upstream | Retry policy | Retries GET/HEAD on error, timeout, 502, 503, 504 |
| proxy_next_upstream_timeout 30s | Retry budget | Bounds total retry duration to prevent cascading delays |
| client_max_body_size 100M | Default limit | Server-level default for non-storage endpoints |
| client_max_body_size 0 | Unlimited | Active Storage location only — allows arbitrarily large uploads |
| proxy_buffering off | Streaming | Active Storage location — streams downloads directly to clients |
| /cable location | WebSocket | Upgrades connections for Action Cable (Turbo Streams) |
| /rails/active_storage/ location | File transfers | Unbuffered uploads/downloads with 1-hour timeouts |

Nginx version requirement: The resolve parameter in upstream blocks requires nginx >= 1.27.3 (open-sourced from nginx Plus). The default nginx:alpine image (currently 1.29.x) supports this. Do not pin to an older version.
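Putting the upstream-related settings together, the relevant block looks roughly like this (a sketch only; docker/nginx/nginx.conf is canonical, and the service name web and Puma port 3000 are assumptions):

```nginx
resolver 127.0.0.11 valid=10s;   # Docker's embedded DNS, re-queried every 10 s

upstream web_backend {
    zone web_backend_zone 64k;   # shared memory required by the resolve parameter
    least_conn;                  # route to the replica with fewest active connections
    server web:3000 resolve max_fails=3 fail_timeout=30s;
    keepalive 32;                # pooled TCP connections to backends
}
```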

Adjusting Timeouts#

The general / location uses proxy_read_timeout 300s and proxy_send_timeout 60s, which are sufficient for API requests and normal web traffic. The /rails/active_storage/ location uses 1-hour timeouts (3600s) for large file uploads and downloads. If agents experience timeouts during file transfers, check that they are using the Active Storage endpoints (which have the longer timeouts). See docker/nginx/nginx.conf for the canonical timeout values.
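In nginx terms, the two locations differ roughly as follows (a sketch of the timeout-related directives only; proxy_pass targets and other settings are omitted, and docker/nginx/nginx.conf is canonical):

```nginx
location / {
    proxy_read_timeout 300s;    # slow API responses
    proxy_send_timeout 60s;
}

location /rails/active_storage/ {
    proxy_read_timeout 3600s;   # 1-hour window for large transfers
    proxy_send_timeout 3600s;
    client_max_body_size 0;     # no upload size limit on this location
    proxy_buffering off;        # stream downloads directly to clients
}
```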

Deployment Instructions#

Prerequisites#

  • Docker Engine 20.10+
  • Docker Compose V2
  • Required environment variables:
    • RAILS_MASTER_KEY — Rails credentials encryption key
    • POSTGRES_PASSWORD — PostgreSQL root password (required; fail-fast if not set)
    • TUSD_HOOK_SECRET — Shared secret for authenticating tusd webhook requests (required; prevents cache poisoning attacks). Generate with openssl rand -hex 32
    • APPLICATION_HOST — Application hostname for mailers, redirects, and DNS rebinding protection
    • MINIO_PUBLIC_IP — Public IP for MinIO (if using MinIO storage)

Note: See Environment Variables Reference for comprehensive documentation of all configuration options, including DISABLE_SSL for reverse proxy setups, production validation requirements, common configuration scenarios, and troubleshooting guidance.

Step-by-Step Deployment#

  1. Set environment variables:

    export RAILS_MASTER_KEY=your_master_key
    export POSTGRES_PASSWORD=your_secure_password
    export TUSD_HOOK_SECRET=$(openssl rand -hex 32)
    
  2. Run database migrations (first-time setup or after updates):

    # Run migrations once before scaling to multiple replicas to avoid migration races
    docker compose -f docker-compose-production.yml run --rm -e RUN_DB_PREPARE=true web
    

    Important: Database migrations must be run once before starting multiple web replicas. The RUN_DB_PREPARE=true flag prevents migration races where multiple containers might try to run migrations simultaneously. Regular web service containers should not have this flag set.

  3. Adjust the replica count (optional — edit docker-compose-production.yml):

    web:
      deploy:
        replicas: 9 # n+1 where n = active cracking nodes
    

    Or pass it at the command line (see step 4).

  4. Deploy the stack:

    docker compose -f docker-compose-production.yml up -d
    
    # Or with a specific replica count:
    docker compose -f docker-compose-production.yml up -d --scale web=9
    
  5. Verify all replicas are healthy:

    docker compose -f docker-compose-production.yml ps
    

    All web replicas and the nginx service should show healthy status. Note that web replicas take 10-45 seconds to boot Rails, and nginx waits for at least one healthy web replica before starting.

  6. Test load distribution (optional):

    # Send several requests and observe different upstream addresses in nginx logs
    for i in $(seq 1 10); do
      curl -s -o /dev/null -w "%{http_code}\n" http://localhost/up
    done
    

Scaling Without Downtime#

Scale web replicas up or down at any time without restarting other services:

# Scale to 17 replicas (for 16 active nodes)
docker compose -f docker-compose-production.yml up -d --scale web=17

# Scale down to 5 replicas
docker compose -f docker-compose-production.yml up -d --scale web=5

Or use the justfile shortcut:

just docker-prod-scale 17

Nginx's DNS re-resolution (every 10 s) automatically picks up added or removed replicas.

Monitoring and Troubleshooting#

Checking Service Health#

# Overview of all services
just docker-prod-status

# Or directly
docker compose -f docker-compose-production.yml ps

Viewing Logs#

# Nginx access/error logs
just docker-prod-logs-nginx

# Web replica logs (all replicas interleaved)
just docker-prod-logs-web

# Specific replica logs
docker compose -f docker-compose-production.yml logs web --index 1

The nginx access log includes upstream context (upstream=, upstream_status=, upstream_response_time=, request_time=) to help diagnose load distribution and slow backends.
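A quick way to surface slow backends from these fields is to pull upstream_response_time out of each access-log line (a sketch; the sample line below stands in for real log output, so adjust the field names to match your log_format):

```shell
# Extract upstream_response_time from an nginx access-log line and flag slow responses.
line='GET /up 200 upstream=172.18.0.5:3000 upstream_status=200 upstream_response_time=1.234 request_time=1.240'
urt=$(printf '%s\n' "$line" | grep -o 'upstream_response_time=[0-9.]*' | cut -d= -f2)
echo "$urt"
# Flag responses slower than one second
awk -v t="$urt" 'BEGIN { print ((t > 1) ? "SLOW" : "ok") }'
```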

Common Issues#

| Symptom | Likely Cause | Solution |
|---|---|---|
| Replicas fail to start | Insufficient host resources | Reduce replica count or increase host capacity |
| Uneven load distribution | DNS cache stale | Restart nginx: docker compose -f docker-compose-production.yml restart nginx |
| Connection timeouts on uploads | proxy_read_timeout too low | Increase timeout in docker/nginx/nginx.conf |
| 502 Bad Gateway | All replicas down or starting | Wait for health checks to pass; check web replica logs |
| Database connection errors | Too many connections | Tune pool size in config/database.yml or add PgBouncer |
| OOM kills (exit code 137) | Memory limit too low | Check docker inspect --format='{{.State.OOMKilled}}'; increase service limit |
| WebSocket disconnects | Missing /cable location | Verify docker/nginx/nginx.conf has the /cable WebSocket block |
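For reference, a working /cable block needs the HTTP Upgrade headers (a sketch; the web_backend upstream name is assumed from this guide's configuration, and the 1-hour read timeout is a common choice for long-lived connections):

```nginx
location /cable {
    proxy_pass http://web_backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600s;   # keep long-lived WebSocket connections open
}
```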

Performance Monitoring#

  • Watch nginx access logs for slow responses (upstream_response_time > 1 s)
  • Monitor PostgreSQL connection count: SELECT count(*) FROM pg_stat_activity;
  • Monitor Redis memory usage: redis-cli INFO memory
  • Track Sidekiq queue depth via the Sidekiq Web UI at /sidekiq

Security Considerations#

SSL/TLS Termination#

The current configuration serves plain HTTP on port 80. The DISABLE_SSL environment variable controls Rails SSL/HTTPS enforcement (see Environment Variables Reference for detailed DISABLE_SSL documentation).

For deployments exposed to untrusted networks:

  1. Recommended: Place an external TLS-terminating reverse proxy (e.g., Caddy, Traefik, or a cloud load balancer) in front of the nginx service. Set DISABLE_SSL=true so Rails delegates SSL enforcement to the upstream proxy.
  2. Self-signed certificates (typical for lab environments): Add TLS certificates to the nginx configuration directly by mounting certs and updating the server block to listen on 443 with ssl_certificate and ssl_certificate_key. Leave DISABLE_SSL unset or set to empty.
  3. Isolated lab environments: The default configuration (plain HTTP) is appropriate when the deployment is not exposed to untrusted networks. Set DISABLE_SSL=true to prevent Rails from forcing HTTPS redirects.

Header Forwarding#

Nginx forwards X-Real-IP, X-Forwarded-For, and X-Forwarded-Proto headers so Rails can correctly identify client IPs and protocol. Ensure any upstream proxy also preserves these headers.
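The forwarding directives in question look like this (a sketch; docker/nginx/nginx.conf is canonical):

```nginx
proxy_set_header X-Real-IP         $remote_addr;
proxy_set_header X-Forwarded-For   $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
```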

Rate Limiting#

For deployments exposed to the public internet, consider adding nginx rate limiting:

# Example: limit API endpoints to 10 requests/second per IP
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

location /api/ {
    limit_req zone=api burst=20 nodelay;
    proxy_pass http://web_backend;
    # ... other proxy settings
}

Maintenance#

Rolling Updates#

To update the web application image without downtime:

# Pull the latest image
docker compose -f docker-compose-production.yml pull web

# Run migrations once before restarting replicas (if schema changed)
docker compose -f docker-compose-production.yml run --rm -e RUN_DB_PREPARE=true web

# Recreate web replicas (nginx continues serving via remaining replicas)
docker compose -f docker-compose-production.yml up -d --no-deps web

Nginx's passive health checks automatically route traffic away from replicas that are restarting.

Backup Considerations#

  • PostgreSQL data lives in the postgres volume — back up with pg_dump or volume snapshots
  • Redis data lives in the redis volume — back up with redis-cli BGSAVE
  • Application storage lives in the storage volume — back up with your preferred method

Scaling Automation#

For dynamic scaling based on agent count, you could script the replica adjustment. Note that this requires a custom API endpoint that does not yet exist:

#!/usr/bin/env bash
# Hypothetical example — requires implementing an /api/v1/agents/active_count endpoint.
set -euo pipefail
ACTIVE_AGENTS=$(curl -sf http://localhost/api/v1/agents/active_count)
# Guard against a non-numeric response before scaling.
[[ "$ACTIVE_AGENTS" =~ ^[0-9]+$ ]] || { echo "Unexpected response: $ACTIVE_AGENTS" >&2; exit 1; }
REPLICAS=$((ACTIVE_AGENTS + 1))
docker compose -f docker-compose-production.yml up -d --scale "web=$REPLICAS"

References#