Type: Topic
Status: Published
Created: Feb 27, 2026 by Dosu Bot
Updated: Apr 20, 2026 by Dosu Bot

ActiveStorage Backend Configuration#

This document describes the storage architecture of CipherSwarm, a distributed password-cracking coordination platform whose file storage is built on Rails ActiveStorage. The system supports both local disk storage and S3-compatible object storage backends, so operators can choose the storage solution that fits their deployment environment.

Version Requirements#

This configuration requires Rails/ActiveStorage 8.1.2.1 or later, which includes critical security fixes for:

  • DirectUpload metadata filtering (CVE-2026-33173): Enhanced filtering of user-provided metadata to prevent security issues
  • Streaming chunk size limits (CVE-2026-33174): Configurable maximum streaming chunk sizes to prevent denial of service attacks
  • Range request restrictions (CVE-2026-33658): Limited range requests to a single range to prevent resource exhaustion
  • Path traversal prevention (CVE-2026-33195): Protection against directory traversal attacks in DiskService
  • Glob injection prevention (CVE-2026-33202): Escaped glob metacharacters in DiskService#delete_prefixed to prevent injection attacks

These security enhancements are transparent to normal operations but provide essential protections for production deployments.

The architecture is designed specifically for deployment in air-gapped (non-Internet-connected) laboratory and secure environments common in penetration testing and security research operations. By default, CipherSwarm uses local disk storage shared via Docker volumes, requiring no additional infrastructure beyond the core application stack. Organizations requiring distributed storage can optionally deploy S3-compatible services such as MinIO, SeaweedFS, or Garage within their isolated environment.

A key design principle is backend agnosticism: the application layer uses Rails ActiveStorage's unified API without any storage-backend-specific code, enabling seamless migration between storage systems through environment variable configuration alone. This flexibility allows the same container image to run on different storage backends based purely on runtime configuration, making it ideal for diverse deployment scenarios from single-server lab setups to distributed enterprise environments.

Storage Architecture#

CipherSwarm's storage architecture is built on a layered design that separates storage backend implementation from application logic.

[Figure: architecture diagram of the key storage components]

Tus Upload Protocol and Storage Paths#

While CipherSwarm is built on Rails ActiveStorage, hash lists uploaded via the tus resumable upload protocol bypass Active Storage during the upload phase. Attack resources (wordlists, rule lists, mask lists) use tus as their upload protocol but still store files through Active Storage after upload completes.

Hash list tus uploads:

  • Use tus protocol for browser-based resumable uploads
  • Files are stored directly to the filesystem at temp_file_path (typically in storage/attack_resources/hash_lists_staging/)
  • Do not use Active Storage attachments (has_one_attached :file is present but unused for tus uploads)
  • TusUploadHandler#process_tus_hash_list_upload moves the uploaded file to staging storage and enqueues ProcessHashListJob
  • This approach avoids downloading large hash lists to /tmp tmpfs during background job processing

Attack resource tus uploads:

  • Use tus protocol for browser-based resumable uploads
  • After upload completes, files are moved to file_path and attached via Active Storage
  • TusUploadHandler#process_tus_upload handles the post-upload file movement
  • Attack resources continue to use Active Storage's storage backend abstraction

Validation relaxation:

The HashList model's file presence validation is conditional: validates :file, presence: { on: :create }, unless: -> { tus_upload_pending || temp_file_path.present? }. This allows hash list creation to succeed when using tus uploads, where the file is set after the record is saved.

Background job behavior:

ProcessHashListJob and CountFileLinesJob check for temp_file_path or file_path first before falling back to Active Storage's blob.open. If they fall back to blob.open for a tus-uploaded file, they log a warning indicating the file will be downloaded to /tmp tmpfs, which may fail for large files.
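
The lookup order described above can be sketched in plain Ruby. This is a minimal illustration, not the project's job code: the helper name `resolve_file_source`, its return values, and the `File.file?` existence check are assumptions made for the example.

```ruby
require "logger"

# Hypothetical helper: decides where a background job should read from.
# Returns [:local, path] when an on-disk path is usable, otherwise
# [:blob, nil] to signal a fallback to Active Storage's blob.open.
def resolve_file_source(temp_file_path:, file_path:, logger: Logger.new($stderr))
  [temp_file_path, file_path].each do |candidate|
    next if candidate.nil? || candidate.empty?
    # Assumption: a path only counts if the file is actually on disk.
    return [:local, candidate] if File.file?(candidate)
  end

  logger.warn("No local path available; falling back to blob.open, " \
              "which downloads the file to /tmp and may fail for large files")
  [:blob, nil]
end
```

Preferring the on-disk path keeps large uploads out of the tmpfs-backed /tmp; the warning makes the fallback visible in logs.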

Default Local Storage Configuration#

CipherSwarm defaults to local disk storage at /rails/storage inside the Docker container, which is backed by a host bind mount at ./storage/ shared between the web and Sidekiq worker services. This configuration is ideal for air-gapped deployments because it requires no additional services beyond the core application stack (PostgreSQL database, Redis cache, and the Rails application itself).

The bind mount architecture ensures that files uploaded through the web interface are immediately accessible to background worker processes that handle tasks like hash list processing, wordlist distribution, and attack coordination. This eliminates the need for network file systems or distributed storage in single-server deployments.

Security note: Rails 8.1.2.1 introduces path traversal prevention (CVE-2026-33195) in DiskService, which validates that all file keys are safe and prevents directory traversal attacks using dot segments (".", ".."). The service also includes glob injection prevention (CVE-2026-33202) that escapes glob metacharacters when deleting files by prefix. These protections ensure that malicious file keys cannot be used to access or delete files outside the storage root directory.
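
To illustrate the two checks named above, here is a simplified sketch of dot-segment rejection and glob escaping. It is not Rails' DiskService code; the function names `safe_storage_path` and `escape_glob` are hypothetical and the exact rules Rails applies may differ.

```ruby
# Illustration only: NOT the DiskService implementation.
# Rejects keys containing "." or ".." segments and confirms the resolved
# path stays under the storage root.
def safe_storage_path(root, key)
  segments = key.to_s.split("/")
  if segments.empty? || segments.any? { |s| s == "." || s == ".." }
    raise ArgumentError, "unsafe storage key: #{key.inspect}"
  end

  root = File.expand_path(root)
  path = File.expand_path(File.join(root, *segments))
  unless path.start_with?(root + File::SEPARATOR)
    raise ArgumentError, "key escapes storage root: #{key.inspect}"
  end
  path
end

# Backslash-escapes glob metacharacters so a prefix is matched literally.
def escape_glob(prefix)
  prefix.gsub(/[\\*?\[\]{}]/) { |char| "\\#{char}" }
end
```

With these definitions, a key such as `../etc/passwd` raises `ArgumentError` instead of resolving to a path outside the root.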

To inspect the contents of local storage, operators can access the storage directory directly within the web container:

docker compose -f docker-compose.yml -f docker-compose.prod.yml exec web ls -la /rails/storage

S3-Compatible Storage Options#

For deployments requiring distributed storage, scalability beyond a single server, or redundancy, CipherSwarm supports S3-compatible object storage backends. Three primary S3-compatible services are supported:

| Service | License | Characteristics | Use Case |
| --- | --- | --- | --- |
| MinIO | AGPL (paid license required for production) | Single-binary deployment, simple setup, widely adopted | Quick prototyping and development environments |
| SeaweedFS | Apache 2.0 | Lightweight, high performance, S3 gateway mode | Production deployments requiring open-source licensing |
| Garage | AGPL | Designed for geo-distributed deployments and self-hosting | Multi-site deployments with geographic distribution |

All S3-compatible services must be deployed within the air-gapped environment alongside CipherSwarm. The application communicates with the S3 service over the internal Docker network or local network infrastructure, never requiring Internet connectivity.

Large File Upload Optimization#

CipherSwarm implements custom behavior for large file uploads to prevent browser performance issues during client-side MD5 checksum calculation. This optimization addresses GitHub issue #747, where browsers silently stall when computing checksums for files larger than 10-20 GB.

Client-Side Checksum Skipping#

The default Rails ActiveStorage direct upload workflow computes an MD5 checksum of the entire file client-side (using SparkMD5, reading 2 MB chunks via FileReader) before the upload begins. For very large wordlists, rule files, or mask files, this computation can freeze the browser with no error message or progress feedback.

CipherSwarm's solution skips client-side checksum calculation for files exceeding a configurable threshold (default: 1 GB). The threshold is configurable per-form via a data attribute and is managed per-file using a WeakMap to avoid interference between multiple upload forms on the same page.

How it works:

  1. The direct_upload_override.js utility patches FileChecksum.create from @rails/activestorage/src/file_checksum
  2. Each file is registered with a size threshold via setFileChecksumThreshold(file, thresholdBytes)
  3. If the file size exceeds the threshold, the checksum callback returns null instead of computing the MD5 digest
  4. The upload proceeds immediately without the potentially browser-stalling computation

Implementation note: The patch imports from @rails/activestorage/src/file_checksum, not the public package entrypoint, as FileChecksum is not exported. This coupling to internal package structure is documented and requires verification after ActiveStorage major version upgrades.

Custom DirectUploadsController#

The custom ActiveStorage::DirectUploadsController (located at app/controllers/active_storage/direct_uploads_controller.rb) overrides Rails' default controller to accept nil checksums from the client:

  • When checksum is blank in the upload parameters, it's set to nil
  • The checksum_skipped: true metadata flag is attached to the blob
  • The controller returns signed upload URLs and headers as usual, but the blob record is created with a NULL checksum

This controller override is preferred over monkey-patching the base controller in an initializer because it provides clear, testable, and upgrade-safe customization.
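
The parameter handling described above can be sketched as a pure function. This is a plain-Ruby illustration under assumptions: the name `normalize_direct_upload_args` and the hash shape are hypothetical, and the real controller works with Rails parameters rather than a bare hash.

```ruby
# Hypothetical helper mirroring the controller behaviour described above:
# a blank checksum becomes nil and the blob metadata gains the
# "checksum_skipped" flag; a present checksum passes through untouched.
def normalize_direct_upload_args(args)
  checksum = args[:checksum].to_s.strip
  return args unless checksum.empty?

  metadata = (args[:metadata] || {}).merge("checksum_skipped" => true)
  args.merge(checksum: nil, metadata: metadata)
end

blob_args = normalize_direct_upload_args(
  filename: "rockyou.txt", byte_size: 5_000_000_000, checksum: ""
)
```

Keeping the normalization in one small function makes the nil-checksum path easy to cover with unit tests.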

Security note: Rails 8.1.2.1 includes enhanced metadata filtering (CVE-2026-33173) that validates and sanitizes all user-provided metadata, including the checksum_skipped flag. This prevents malicious metadata injection attacks while allowing legitimate custom metadata fields like those used by CipherSwarm's large file upload optimization.

Active Storage Initializer Patches#

The config/initializers/active_storage_large_upload.rb initializer applies two patches to support nil checksums:

Blob Validation Relaxation:

The standard ActiveStorage::Blob model validates that checksum is present (unless the blob is composed). The initializer:

  1. Removes the existing checksum presence validator using targeted removal (_validators.delete(:checksum) and callback filtering)
  2. Adds a new conditional validator: validates :checksum, presence: true, unless: -> { composed || checksum_skipped? }
  3. Defines the private checksum_skipped? method that checks for metadata["checksum_skipped"] == true

Important: The initializer uses targeted validator removal, not clear_validators!, which would remove all validators including service_name presence.

S3 Service Patch:

When using S3-compatible storage, the S3 service would include a "Content-MD5" => nil header in the direct upload request, which S3 rejects as invalid. The initializer patches ActiveStorage::Service::S3Service#headers_for_direct_upload to compact the headers hash, removing any nil values.

The patch uses alias_method with a guard (unless method_defined?(:original_headers_for_direct_upload)) to prevent double-patching on code reload in development.
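
The guarded alias pattern can be demonstrated without Rails. In this sketch `StandInS3Service` is a hypothetical stand-in with a simplified signature, not the real ActiveStorage::Service::S3Service, and `apply_nil_header_patch` is a name chosen for the example.

```ruby
# Stand-in for the S3 service; the real method accepts more options.
class StandInS3Service
  def headers_for_direct_upload(key, content_type:, checksum:, **)
    { "Content-Type" => content_type, "Content-MD5" => checksum }
  end
end

def apply_nil_header_patch(service_class)
  service_class.class_eval do
    # Guard: without it, a second run would alias the already-patched
    # method to itself and recurse forever.
    unless method_defined?(:original_headers_for_direct_upload)
      alias_method :original_headers_for_direct_upload, :headers_for_direct_upload

      define_method(:headers_for_direct_upload) do |key, **options|
        # compact drops the "Content-MD5" => nil entry.
        original_headers_for_direct_upload(key, **options).compact
      end
    end
  end
end

apply_nil_header_patch(StandInS3Service)
apply_nil_header_patch(StandInS3Service) # simulated code reload: a no-op
```

Calling the patch twice mimics a development-mode reload and shows why the `method_defined?` guard matters.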

Deferred Checksum Verification#

To maintain file integrity despite skipping client-side checksums, CipherSwarm implements server-side verification after upload completes.

VerifyChecksumJob:

The app/jobs/verify_checksum_job.rb background job:

  1. Is automatically enqueued by the AttackResource concern's after_commit callback when a file with checksum_skipped metadata is attached
  2. Computes the MD5 checksum by downloading and hashing the entire file server-side
  3. Calls blob.service.open(blob.key, checksum: blob.checksum, verify: false) to stream the file without triggering ActiveStorage's built-in integrity check (which would fail on nil checksums)
  4. Includes TempStorageValidation to ensure sufficient /tmp space before downloading the blob
  5. Operates on WordList, RuleList, and MaskList models (defined in ALLOWED_TYPES)
  6. Retries I/O errors (Errno::EIO, Errno::ENOENT, Errno::EACCES) up to 5 times with polynomial backoff
  7. On permanent failure after retries, logs [ChecksumVerify] FILE_IO_FAILURE at error level with resource details and remediation guidance
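
As a sketch of the hashing in step 2, the following computes a Base64-encoded MD5 digest (the format Active Storage stores in `checksum`) by reading in fixed-size chunks. The helper name and the 5 MB chunk size are assumptions for illustration, not values taken from the job.

```ruby
require "digest"

# Streams an IO through MD5 without loading the whole file into memory.
# Returns the Base64-encoded digest.
def streaming_md5_base64(io, chunk_size: 5 * 1024 * 1024)
  digest = Digest::MD5.new
  while (chunk = io.read(chunk_size)) # read returns nil at end of input
    digest << chunk
  end
  digest.base64digest
end
```

For a file on disk, this could be called as `File.open(path, "rb") { |f| streaming_md5_base64(f) }`.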

Verification outcomes:

  • Nil checksum in database: The computed checksum is backfilled into active_storage_blobs.checksum, the checksum_skipped metadata is removed, and checksum_verified is set to true on the resource model
  • Checksum match: If the blob already has a checksum that matches the computed value, only checksum_verified is set to true
  • Checksum mismatch: checksum_verified is set to false, and an error is logged recommending re-upload
  • File not found: When the file path is absent or the file is missing on disk, the job logs [ChecksumVerify] FILE_NOT_FOUND at error level with actionable remediation messages

The job is transactional when backfilling checksums, ensuring the blob and resource model are updated atomically.
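
The first three outcomes can be expressed as a small decision function. This is a hedged sketch: `verification_outcome` and the keys of the hash it returns are hypothetical names, and the real job applies these changes to the blob and resource records inside a transaction.

```ruby
# Maps stored and computed checksums to the actions described above.
def verification_outcome(stored_checksum:, computed_checksum:)
  if stored_checksum.nil?
    # Nil checksum: backfill the digest and clear the skipped flag.
    { checksum_verified: true, backfill_checksum: computed_checksum, clear_skipped_flag: true }
  elsif stored_checksum == computed_checksum
    # Match: only the verification flag changes.
    { checksum_verified: true, backfill_checksum: nil, clear_skipped_flag: false }
  else
    # Mismatch: mark as failed; the operator should re-upload.
    { checksum_verified: false, backfill_checksum: nil, clear_skipped_flag: false }
  end
end
```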

Automatic Recovery:

CipherSwarm includes RequeueUnverifiedResourcesJob, which runs every 6 hours (configured in config/schedule.yml) to automatically re-enqueue verification jobs for resources that remain stuck with checksum_verified: false. The job queries each resource type (WordList, RuleList, MaskList) for records with checksum_verified: false and updated_at older than the configured threshold (ApplicationConfig.checksum_verification_retry_threshold, default 6 hours).

This provides automatic recovery from transient verification failures such as temporary storage unavailability, I/O errors, or insufficient /tmp space. Partial indexes on updated_at WHERE checksum_verified = false ensure efficient sweep queries without full-table scans.

Admin Visibility:

The checksum_verified field is visible in admin dashboards (WordList, RuleList, MaskList) with an unverified: collection filter, allowing administrators to easily identify resources with pending or failed verification.

Database Schema Changes#

The migration db/migrate/20260317034648_add_checksum_verified_to_attack_resources.rb adds a checksum_verified boolean column to:

  • word_lists
  • rule_lists
  • mask_lists

Default value: true for existing records (assumes legacy records with client-side checksums are valid)

Set to false: When a new file with checksum_skipped metadata is uploaded, the AttackResource concern sets this to false before enqueueing VerifyChecksumJob

Set to true: When VerifyChecksumJob successfully computes and verifies the checksum

This column allows operators to query the verification status of attack resources and identify files that failed integrity checks.

Performance optimization: The migration db/migrate/20260329211021_add_checksum_sweep_indexes.rb adds partial indexes on updated_at WHERE checksum_verified = false for each resource type. These indexes support efficient sweep queries by RequeueUnverifiedResourcesJob without impacting write performance on verified resources.

Integrity and Security Considerations#

Small files (at or below the threshold):

  • Client-side MD5 checksum is computed before upload
  • ActiveStorage verifies the checksum during the upload process (Disk service) or via S3's Content-MD5 header verification
  • Full integrity protection is maintained

Large files (above the threshold):

  • Client-side checksum is skipped (browser performance optimization)
  • Disk service: The ensure_integrity_of check is skipped when checksum is nil — no digest verification occurs during upload
  • S3-compatible services: The Content-MD5 header is omitted from the direct upload PUT request — S3 does not verify integrity on receipt
  • Server-side verification via VerifyChecksumJob occurs asynchronously after upload completes
  • Between upload completion and job execution, there is a window where silent data corruption could go undetected

Note on streaming security: Rails 8.1.2.1 introduces configurable maximum streaming chunk sizes (CVE-2026-33174) and restricts range requests to a single range (CVE-2026-33658). These protections prevent denial of service attacks through excessive byte range requests during file downloads. The default 100 MB chunk size limit is appropriate for CipherSwarm's typical wordlist and hash list sizes, but can be adjusted if larger files are regularly served to clients.

Risk mitigation:

  • VerifyChecksumJob runs automatically after each large file upload
  • Corruption is detected post-upload and logged with recommendations to re-upload
  • The checksum_verified attribute provides visibility into verification status
  • Operators can monitor for checksum_verified: false records in their database

This trade-off (deferred verification vs. immediate browser usability) is necessary to support very large wordlists and hash lists without requiring operators to use CLI upload tools.

Configuration Options#

Environment variable: LARGE_FILE_THRESHOLD_MB

  • Default: 1024 (1 GB)
  • Purpose: Controls the file size threshold above which client-side checksum calculation is skipped
  • Units: Megabytes
  • Set in: .env file or environment configuration
  • Example:
# Skip checksum for files larger than 500 MB
LARGE_FILE_THRESHOLD_MB=500

The threshold is converted to bytes in the frontend code and registered per-file via setFileChecksumThreshold(). Different forms on the same page can have different thresholds without conflict.

Configuration#

Environment Variables#

Storage backend selection and configuration in CipherSwarm is controlled entirely through environment variables, allowing the same application container to adapt to different storage infrastructures without code changes.

In production, critical environment variables are validated at startup. The application will fail to start if required variables (such as APPLICATION_HOST for mailers or S3 credentials when using S3 storage) are not properly configured. This fail-fast approach prevents runtime failures and provides clear error messages at deployment time.

Primary Configuration#

The ACTIVE_STORAGE_SERVICE environment variable controls which storage backend the application uses:

  • ACTIVE_STORAGE_SERVICE=local (default): Uses local disk storage at /rails/storage
  • ACTIVE_STORAGE_SERVICE=s3: Uses S3-compatible object storage

If this variable is not set, the application defaults to local storage for backward compatibility and ease of initial deployment.

S3-Compatible Storage Configuration#

When ACTIVE_STORAGE_SERVICE=s3, the following environment variables configure the S3 connection:

| Variable | Purpose | Default | Required |
| --- | --- | --- | --- |
| AWS_ACCESS_KEY_ID | S3 access key | (none) | Yes |
| AWS_SECRET_ACCESS_KEY | S3 secret key | (none) | Yes |
| AWS_BUCKET | S3 bucket name | application | No |
| AWS_REGION | AWS region identifier | us-east-1 | No |
| AWS_ENDPOINT | Custom S3 endpoint URL | (none) | Yes (for non-AWS services) |
| AWS_FORCE_PATH_STYLE | Use path-style URLs | false | No (set to true for MinIO) |

The credential variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) are required when using S3 storage; the application fails at startup if they are missing.

For complete details on all S3-related environment variables including defaults, validation, and examples, see the S3 Storage Configuration section in the Environment Variables Reference.
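
A fail-fast check of this kind might look like the sketch below. It is an assumption-laden illustration, not the application's validator: the function name `missing_s3_variables` is hypothetical, and the required list is limited to the two credentials marked "Yes" in the table.

```ruby
# Hypothetical startup check: returns the names of required S3 variables
# that are unset or blank. Returns an empty list for non-S3 backends.
REQUIRED_S3_VARIABLES = %w[AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY].freeze

def missing_s3_variables(env)
  return [] unless env.fetch("ACTIVE_STORAGE_SERVICE", "local") == "s3"

  REQUIRED_S3_VARIABLES.select { |name| env[name].to_s.strip.empty? }
end

missing = missing_s3_variables(
  "ACTIVE_STORAGE_SERVICE" => "s3", "AWS_ACCESS_KEY_ID" => "minioadmin"
)
```

An initializer could raise with the returned names when the list is non-empty, so a misconfigured deployment stops at boot rather than on the first upload.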

Tmpfs Configuration#

| Variable | Purpose | Default | Required |
| --- | --- | --- | --- |
| TMPFS_TMP_SIZE | Controls tmpfs mount size at /tmp for Active Storage blob downloads during ingest jobs | 512m (production), 1g (development) | No |
| TMPFS_RAILS_TMP_SIZE | Controls tmpfs mount size at /rails/tmp for Rails framework temp files (Bootsnap cache, etc.) | 256m | No |

Sizing formula for TMPFS_TMP_SIZE: >= 1.5 × largest_attack_resource_file

Attack resources (wordlists, rule lists, mask lists) that use Active Storage blob.open will download to /tmp during background job processing. For deployments processing 100 GB+ files, increase TMPFS_TMP_SIZE proportionally and ensure the container memory limit accommodates the tmpfs allocation, or use the TMPDIR volume approach for disk-backed temporary storage.
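
As a worked example of the sizing formula, the sketch below converts the size of the largest attack resource into a value for TMPFS_TMP_SIZE. The helper name and the choice to round up to whole mebibytes are assumptions made for illustration.

```ruby
MIB = 1024 * 1024

# Applies the 1.5x rule and formats the result with Docker's "m" suffix.
def recommended_tmpfs_size(largest_file_bytes, factor: 1.5)
  required_bytes = (largest_file_bytes * factor).ceil
  mebibytes = (required_bytes + MIB - 1) / MIB # integer division, rounded up
  "#{mebibytes}m"
end

ten_gib = 10 * 1024 * MIB
size = recommended_tmpfs_size(ten_gib) # 10 GiB file -> 15 GiB -> "15360m"
```

For a 10 GiB wordlist this yields TMPFS_TMP_SIZE=15360m, and the container memory limit needs to allow for that allocation.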

See Docker Storage and /tmp Configuration for comprehensive tmpfs sizing guidance, monitoring, and troubleshooting.

Tus Upload Endpoint Configuration#

| Variable | Purpose | Default | Required |
| --- | --- | --- | --- |
| TUS_ENDPOINT_URL | Custom tus endpoint URL for upload forms | /uploads/ | No |

The tus_endpoint helper in ApplicationHelper returns ENV["TUS_ENDPOINT_URL"] or defaults to /uploads/. This allows override for development/test environments (e.g., with testcontainers-based tusd running on a different port).

Example .env configuration for MinIO:

ACTIVE_STORAGE_SERVICE=s3
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin
AWS_BUCKET=cipherswarm
AWS_REGION=us-east-1
AWS_ENDPOINT=http://minio:9000
AWS_FORCE_PATH_STYLE=true

Example .env configuration for SeaweedFS:

ACTIVE_STORAGE_SERVICE=s3
AWS_ACCESS_KEY_ID=any
AWS_SECRET_ACCESS_KEY=any
AWS_BUCKET=cipherswarm
AWS_REGION=us-east-1
AWS_ENDPOINT=http://seaweedfs:8333
AWS_FORCE_PATH_STYLE=true

You can also refer to .env.example in the repository root for a complete template of all available configuration options.

Configuration Files#

config/storage.yml#

The config/storage.yml file defines three active storage services that CipherSwarm can use:

:test service: Temporary disk storage at tmp/storage used during automated testing:

test:
  service: Disk
  root: <%= Rails.root.join("tmp/storage") %>

:local service: Production disk storage at the storage/ directory, which maps to /rails/storage in Docker:

local:
  service: Disk
  root: <%= Rails.root.join("storage") %>

:s3 service: S3-compatible storage configuration that reads credentials from environment variables:

s3:
  service: S3
  bucket: <%= ENV.fetch("AWS_BUCKET", "application") %>
  access_key_id: <%= ENV["AWS_ACCESS_KEY_ID"] %>
  secret_access_key: <%= ENV["AWS_SECRET_ACCESS_KEY"] %>
  region: <%= ENV.fetch("AWS_REGION", "us-east-1") %>
  endpoint: <%= ENV["AWS_ENDPOINT"] %>
  force_path_style: <%= ENV.fetch("AWS_FORCE_PATH_STYLE", "false") == "true" %>

The file also includes commented-out examples for Google Cloud Storage (:google), Azure Storage (:microsoft), and mirrored storage (:mirror) services that can be enabled if needed.

config/environments/production.rb#

The production environment configuration file contains a single line that selects the active storage service based on the ACTIVE_STORAGE_SERVICE environment variable:

config.active_storage.service = ENV.fetch("ACTIVE_STORAGE_SERVICE", "local").to_sym

This uses Ruby's ENV.fetch method to read the environment variable, defaulting to "local" if not set, then converts the string to a symbol (:local or :s3) that corresponds to a service name in config/storage.yml.

Deploying S3-Compatible Storage in Air-Gapped Environments#

Organizations deploying CipherSwarm in air-gapped environments can add S3-compatible storage by following a workflow that transfers Docker images from an Internet-connected system to the isolated network.

Step-by-Step Deployment Process#

Step 1: Export the storage service image (on an Internet-connected system)

# For MinIO
docker pull minio/minio:latest
docker save minio/minio:latest -o minio.tar

# For SeaweedFS
docker pull chrislusf/seaweedfs:latest
docker save chrislusf/seaweedfs:latest -o seaweedfs.tar

# For Garage
docker pull dxflrs/garage:latest
docker save dxflrs/garage:latest -o garage.tar

Step 2: Transfer the image to the air-gapped system using approved transfer methods (USB drive, CD/DVD, secure file transfer).

Step 3: Load the image on the air-gapped system

docker load -i minio.tar
# or
docker load -i seaweedfs.tar
# or
docker load -i garage.tar

Step 4: Add the storage service to docker-compose.prod.yml

Step 5: Set environment variables in the .env file

Step 6: Create the storage bucket before first use

When deploying S3-compatible storage in air-gapped environments, ensure sufficient RAM is available for tmpfs mounts in addition to persistent storage volumes. The /tmp tmpfs mount must accommodate the largest attack files operators intend to process — see the Tmpfs Mounts for Temporary Storage section for sizing guidance.

MinIO Deployment Example#

Add the following service definition to docker-compose.prod.yml:

services:
  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: ${AWS_ACCESS_KEY_ID}
      MINIO_ROOT_PASSWORD: ${AWS_SECRET_ACCESS_KEY}
    volumes:
      - minio-data:/data
    ports:
      - "9000:9000" # S3 API
      - "9001:9001" # Web Console
    healthcheck:
      test: ["CMD", "mc", "ready", "local"]
      interval: 10s
      timeout: 5s
      retries: 5
    networks:
      - default

volumes:
  minio-data:

Configure the .env file:

ACTIVE_STORAGE_SERVICE=s3
AWS_ACCESS_KEY_ID=minioadmin
AWS_SECRET_ACCESS_KEY=minioadmin_secret_key
AWS_BUCKET=cipherswarm
AWS_ENDPOINT=http://minio:9000
AWS_FORCE_PATH_STYLE=true
AWS_REGION=us-east-1

Create the bucket using the MinIO client:

# Start the services
docker compose -f docker-compose.yml -f docker-compose.prod.yml up -d

# Configure MinIO client alias
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec minio \
  mc alias set local http://localhost:9000 minioadmin minioadmin_secret_key

# Create the bucket
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec minio \
  mc mb local/cipherswarm

# Verify bucket creation
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec minio \
  mc ls local/

SeaweedFS Deployment Example#

Add the following service definition to docker-compose.prod.yml:

services:
  seaweedfs:
    image: chrislusf/seaweedfs:latest
    command: server -s3 -dir=/data -s3.port=8333
    volumes:
      - seaweedfs-data:/data
    ports:
      - "8333:8333" # S3 API
      - "9333:9333" # Master server
      - "8080:8080" # Volume server
    networks:
      - default

volumes:
  seaweedfs-data:

Configure the .env file:

ACTIVE_STORAGE_SERVICE=s3
AWS_ACCESS_KEY_ID=any
AWS_SECRET_ACCESS_KEY=any
AWS_BUCKET=cipherswarm
AWS_ENDPOINT=http://seaweedfs:8333
AWS_FORCE_PATH_STYLE=true
AWS_REGION=us-east-1

SeaweedFS automatically creates buckets on first use, so no manual bucket creation is required.

Storage Migration Tooling#

CipherSwarm includes a built-in rake task for migrating files between storage backends, implemented in lib/tasks/storage_migrate.rake. This tool supports migrating from S3-compatible storage (MinIO, SeaweedFS, AWS S3) to local disk storage with comprehensive safety features.

Migration Workflow#

Key Features#

Dry Run Mode#

The DRY_RUN environment variable enables preview mode, allowing operators to see what would be migrated without making any changes. This logs each blob that would be processed and is the recommended first step before performing an actual migration.

# Preview the migration
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec web \
  bin/rails storage:migrate_to_local DRY_RUN=true

Source Service Override#

The SOURCE_SERVICE environment variable overrides which service to download from. This is useful when blob records reference a service name that doesn't match the current config/storage.yml configuration. The task validates that the source service exists during initialization, providing clear error messages if the service is not configured.

# Override source service for blobs with old service names
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec web \
  bin/rails storage:migrate_to_local SOURCE_SERVICE=s3

Idempotency#

The migration task is fully idempotent — it checks if files already exist on disk before downloading. If a file is already present, the task only updates the blob's service_name field to "local" and skips the download. This makes the task safe to re-run after interruption or failure.

Checksum Verification#

The migration implements a single-pass download, verify, and write approach that computes an MD5 digest while streaming data from S3 to a temporary file. The computed checksum is compared against the stored blob.checksum value before writing the file to local storage. If checksums don't match, the blob is skipped and an error is logged.
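
The single-pass approach can be sketched as follows. This is a minimal illustration, not the rake task's code: the function name, the 1 MB chunk size, and the use of `FileUtils.cp` for the final write are assumptions.

```ruby
require "digest"
require "fileutils"
require "tempfile"

# Streams source_io into a temp file while hashing it, then copies the
# temp file to destination only if the digest matches. Returns true on
# success and false on a checksum mismatch.
def copy_with_verification(source_io, destination, expected_checksum, chunk_size: 1024 * 1024)
  digest = Digest::MD5.new
  Tempfile.create("blob-migrate") do |tmp|
    tmp.binmode
    while (chunk = source_io.read(chunk_size))
      digest << chunk   # hash and write in the same pass
      tmp.write(chunk)
    end
    tmp.flush

    return false unless digest.base64digest == expected_checksum

    FileUtils.mkdir_p(File.dirname(destination))
    FileUtils.cp(tmp.path, destination)
  end
  true
end
```

Because the destination is written only after the digest matches, a corrupted download never appears in local storage.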

Interrupt Handling#

The task installs a SIGINT signal handler for graceful interruption. When Ctrl+C is pressed, the migration loop checks an interrupted flag between blobs, allowing the current blob transfer to complete before stopping. This prevents partially written files, and because the task is idempotent, re-running it picks up the remaining blobs.
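
A flag-based handler of this kind can be sketched in plain Ruby. The class and method names here (`InterruptFlag`, `process_until_interrupted`) are hypothetical and only illustrate the pattern of checking a flag between items.

```ruby
# Holds an interrupted flag that a signal handler can set.
class InterruptFlag
  def initialize
    @interrupted = false
  end

  # Registers a trap that only sets the flag; real work stays out of the
  # handler so the item in progress can finish.
  def install(signal = "INT")
    Signal.trap(signal) { @interrupted = true }
    self
  end

  def interrupt!
    @interrupted = true
  end

  def interrupted?
    @interrupted
  end
end

# Processes items one at a time, checking the flag before each item.
def process_until_interrupted(blobs, flag)
  processed = []
  blobs.each do |blob|
    break if flag.interrupted?
    yield blob
    processed << blob
  end
  processed
end
```

A task would call `InterruptFlag.new.install` once before the loop; the flag is then checked only at blob boundaries.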

Error Handling#

The migration task includes specific error handling strategies:

  • Disk space and permission errors: Abort the entire task immediately with a clear error message, as these indicate systemic problems that must be resolved before continuing
  • Other errors: Log the error and continue with the next blob, allowing the migration to proceed even if individual files fail
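
The two strategies can be sketched as a loop that re-raises systemic errors and counts the rest. This is an illustration under assumptions: the function name is hypothetical, and treating `Errno::ENOSPC` and `Errno::EACCES` as the fatal set is a guess at how "disk space and permission errors" map to Ruby exceptions.

```ruby
# Assumed mapping of "disk space and permission errors" to errno classes.
FATAL_ERRORS = [Errno::ENOSPC, Errno::EACCES].freeze

# Yields each blob; aborts on systemic errors, logs and continues on others.
def migrate_blobs(blobs)
  summary = { migrated: 0, failed: 0 }
  blobs.each do |blob|
    begin
      yield blob
      summary[:migrated] += 1
    rescue *FATAL_ERRORS
      raise # systemic problem: stop the whole run
    rescue StandardError => e
      warn "blob #{blob.inspect} failed: #{e.message}"
      summary[:failed] += 1
    end
  end
  summary
end
```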

Processing and Reporting#

The task processes blobs in batches of 100 to manage memory usage efficiently. Progress is displayed during execution, showing the current blob being processed. A detailed summary is printed at the end with counts of migrated, skipped, and failed blobs.

Running a Migration#

1. Preview the migration (recommended first step):

docker compose -f docker-compose.yml -f docker-compose.prod.yml exec web \
  bin/rails storage:migrate_to_local DRY_RUN=true

2. Execute the migration:

docker compose -f docker-compose.yml -f docker-compose.prod.yml exec web \
  bin/rails storage:migrate_to_local

3. Override source service if needed:

docker compose -f docker-compose.yml -f docker-compose.prod.yml exec web \
  bin/rails storage:migrate_to_local SOURCE_SERVICE=s3

Safety Guarantees and Prerequisites#

The migration is non-destructive — source files in S3 are never deleted automatically. Operators must manually remove files from S3 after verifying the migration succeeded.

Before starting a migration:

Post-Migration Steps#

After successful migration:

  1. Update .env to set ACTIVE_STORAGE_SERVICE=local (or remove the variable — local is the default)
  2. Remove AWS_* environment variables if S3 is no longer needed
  3. Remove the S3-compatible storage service from docker-compose.prod.yml
  4. Verify that file downloads work correctly in the web UI
  5. Restart the application to apply the configuration changes

Health Check Implementation#

CipherSwarm includes storage health monitoring through the SystemHealthCheckService class, which verifies storage backend connectivity and collects metrics. This health check works with any storage backend through ActiveStorage's unified API.

Connectivity Verification#

The health check calls ActiveStorage::Blob.service.exist?("health_check") to verify that the storage service can respond to queries. This is a read-only operation that doesn't create or delete test files. The check operates with a 5-second timeout to prevent hanging on unresponsive storage backends.

Latency is measured using Process.clock_gettime with monotonic clock timing, providing accurate response time measurements independent of system clock adjustments.

Extended Metrics#

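Beyond basic connectivity, the health check attempts to gather extended metrics, such as the total bytes used across stored blobs and the number of buckets where the backend supports it.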

Failures in extended metrics collection don't affect the overall health status, ensuring basic health checks succeed even if advanced metrics are unavailable.

Health Check Response#

The health check method returns a hash with the following fields:

  • :status: Either :healthy or :unhealthy
  • :latency: Response time in milliseconds (or nil if the check failed)
  • :error: Error message describing the failure (or nil if successful)
  • :storage_used: Total bytes used across all stored blobs
  • :bucket_count: Number of buckets (if the storage service supports this feature)

If any exception occurs during the check, it's caught, logged, and the method returns an unhealthy status with the error message.
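
The timing, timeout, and error-capture behaviour can be sketched without Rails. In this illustration the storage probe is passed in as a block; the function name `check_storage` is hypothetical, and the extended metrics fields are left out.

```ruby
require "timeout"

# Runs the given probe with a timeout and measures latency with the
# monotonic clock. Returns a hash shaped like the core response fields.
def check_storage(timeout_seconds: 5)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  Timeout.timeout(timeout_seconds) { yield }
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { status: :healthy, latency: (elapsed * 1000).round(2), error: nil }
rescue StandardError => e
  # Timeout::Error is a StandardError, so timeouts land here as well.
  { status: :unhealthy, latency: nil, error: e.message }
end
```

In the application the block would wrap the `ActiveStorage::Blob.service.exist?("health_check")` call described above.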

Health Dashboard#

The system health dashboard is accessible at http://<host-ip>/system_health and displays the status of all services including storage. This provides operators with a centralized view of infrastructure health.

To verify storage health after deployment:

# Check basic health endpoint
curl http://localhost/up

# Check comprehensive system health dashboard
curl http://localhost/system_health

Backend-Agnostic Implementation Patterns#

A key architectural principle in CipherSwarm is maintaining storage backend independence throughout the application layer. The application uses Rails ActiveStorage's uniform API without any backend-specific code, allowing seamless migration between storage systems.

Application Layer Abstraction#

Models#

CipherSwarm models use ActiveStorage's standard attachment declarations, such as has_one_attached :file on the HashList model. These declarations provide a consistent interface for file attachment regardless of the underlying storage backend.

Hash List Model: Dual Upload Paths#

The HashList model supports two upload paths:

  1. Traditional Active Storage upload: Files are uploaded via has_one_attached :file and stored through the configured Active Storage service (local disk or S3)
  2. Tus upload path: Files are uploaded via the tus resumable upload protocol and stored directly to temp_file_path (bypassing Active Storage)

When using tus uploads:

  • TusUploadHandler#process_tus_hash_list_upload (called from HashListsController#create) moves the uploaded file from tusd's staging area to storage/attack_resources/hash_lists_staging/
  • The temp_file_path attribute is set to the final file location
  • Job enqueuing (ProcessHashListJob) happens in TusUploadHandler, not via model callbacks
  • Validation is relaxed: validates :file, presence: { on: :create }, unless: -> { tus_upload_pending || temp_file_path.present? }

When background jobs (ProcessHashListJob, CountFileLinesJob) open hash list files, they check for temp_file_path first before falling back to blob.open. If they fall back to blob.open for a tus-uploaded file, a warning is logged indicating the file will be downloaded to /tmp tmpfs, which may fail for very large hash lists.
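The lookup order described above can be sketched as follows. This is an illustration under stated assumptions rather than the jobs' actual code: with_hash_list_io and blob_opener are hypothetical names, blob_opener stands in for blob.open, and the condition used to decide when to log the warning (a temp_file_path is set but the file is not on disk) is an assumption.

```ruby
require "logger"
require "stringio"

# Yields an IO for a hash list. Prefers the tus-staged file on disk and falls
# back to `blob_opener`, a hypothetical callable standing in for blob.open.
def with_hash_list_io(temp_file_path:, blob_opener:, logger: Logger.new($stderr))
  if temp_file_path && File.file?(temp_file_path)
    return File.open(temp_file_path, "rb") { |io| yield io }
  end

  # Assumption: a set temp_file_path with no file on disk marks a tus upload
  # that now has to be served from the blob instead.
  if temp_file_path
    logger.warn("#{temp_file_path} missing; falling back to blob.open, which downloads to /tmp")
  end
  blob_opener.call { |io| yield io }
end

# Demonstration with an in-memory stand-in for the blob download.
fake_blob = ->(&block) { block.call(StringIO.new("hash1\nhash2\n")) }
count = with_hash_list_io(temp_file_path: nil, blob_opener: fake_blob) { |io| io.each_line.count }
puts "lines: #{count}"
```

Reading the staged file directly avoids copying a large hash list into the /tmp tmpfs, which is why the staged path is checked first.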

Controllers#

Controllers use ActiveStorage's URL generation and file access APIs. These methods work identically whether files are stored locally or on S3-compatible services.

Background Jobs#

Background workers process files through ActiveStorage's file access API, typically blob.open, which downloads the blob to a local temporary file before processing. File operations use the same code path regardless of storage location, simplifying maintenance and testing.

Pre-Download Temp Storage Validation#

To prevent filesystem exhaustion during blob downloads, ProcessHashListJob, CountFileLinesJob, and CalculateMaskComplexityJob include the TempStorageValidation concern, which checks available space in /tmp before calling blob.open. This check occurs before ActiveStorage downloads the file from the storage backend, preventing mid-transfer ENOSPC (no space left on device) failures.
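A pre-download check of this kind might look like the sketch below. It is not the TempStorageValidation concern itself: the method names ensure_temp_space! and available_bytes are hypothetical, the error class is redefined locally so the example is self-contained, and reading free space by parsing POSIX df output is an assumption about how the available space could be obtained.

```ruby
require "tmpdir"

# Local definition so the sketch runs on its own; the application defines
# its own error class of the same name.
class InsufficientTempStorageError < StandardError; end

# Free bytes on the filesystem holding `path`, parsed from `df -Pk`
# (POSIX output format: the fourth column is available 1024-byte blocks).
def available_bytes(path)
  output = IO.popen(["df", "-Pk", path], &:read)
  output.lines.last.split[3].to_i * 1024
end

# Raises before any download starts when the blob would not fit.
# `available` can be injected, which keeps the check easy to test.
def ensure_temp_space!(byte_size, dir: Dir.tmpdir, available: available_bytes(dir))
  return if byte_size <= available

  raise InsufficientTempStorageError,
        "need #{byte_size} bytes in #{dir}, only #{available} available"
end

# Example: a 2 GiB blob against a 512 MiB tmpfs.
begin
  ensure_temp_space!(2 * 1024**3, available: 512 * 1024**2)
rescue InsufficientTempStorageError => e
  puts e.message
end
```

A job would call such a check immediately before blob.open, passing the blob's byte_size.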

The validation compares the blob's byte_size against available space in Dir.tmpdir (which resolves to /tmp in the container). If insufficient space is detected, the job raises InsufficientTempStorageError, which triggers automatic retry with polynomial backoff (5 attempts total) in case concurrent jobs free up space. After exhausting retries, the job is discarded with a structured error message pointing operators to Docker Storage and /tmp Configuration for remediation guidance.

This pattern ensures that space issues are detected early and handled gracefully, rather than failing mid-processing with cryptic Errno::ENOSPC errors.
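To give a feel for the retry timing, the sketch below computes a polynomial backoff schedule. It assumes the jobs rely on Active Job's retry_on with wait: :polynomially_longer, whose documented base delay is executions**4 + 2 seconds before jitter; the source does not state which retry mechanism is used, so treat the numbers as illustrative. The method name polynomial_backoff_delays is hypothetical.

```ruby
# Base delays (in seconds, jitter omitted) between attempts when the wait
# grows as executions**4 + 2. With 5 attempts there are 4 waits; the fifth
# failure ends the retry cycle.
def polynomial_backoff_delays(attempts:)
  (1...attempts).map { |executions| executions**4 + 2 }
end

delays = polynomial_backoff_delays(attempts: 5)
p delays      # => [3, 18, 83, 258]
p delays.sum  # => 362
```

Under these assumptions a job waits roughly six minutes in total for concurrent jobs to release /tmp space before it is discarded.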

Configuration-Driven Backend Selection#

The storage backend is determined entirely by the ACTIVE_STORAGE_SERVICE environment variable, with service definitions in config/storage.yml. This separation allows:

  • Same container image for different deployment environments
  • Runtime backend selection without code changes or rebuilds
  • Easy testing of different storage backends in development
  • Simplified deployment in air-gapped environments
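The selection logic amounts to reading one environment variable. A minimal sketch, assuming a default of local (consistent with local disk being the default backend) and the three service names listed in config/storage.yml; the method name active_storage_service and the validation step are hypothetical additions for illustration:

```ruby
# Service names defined in config/storage.yml according to this document.
KNOWN_SERVICES = %i[local test s3].freeze

# Resolves the storage service from an environment-like hash.
# The "local" fallback is an assumption, not confirmed application behavior.
def active_storage_service(env = ENV)
  name = env.fetch("ACTIVE_STORAGE_SERVICE", "local").to_sym
  raise ArgumentError, "unknown storage service: #{name}" unless KNOWN_SERVICES.include?(name)

  name
end

p active_storage_service({})                                 # => :local
p active_storage_service("ACTIVE_STORAGE_SERVICE" => "s3")   # => :s3
```

In a Rails application the resolved symbol would typically be assigned to config.active_storage.service in the environment file, so switching backends needs only a changed variable and a container restart.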

Docker Volume Architecture#

For local storage deployments, CipherSwarm uses host bind mounts to persist files and share them between containers. Understanding this architecture is essential for backup, maintenance, and capacity planning.

Volume Configuration#

Local persistent stores are host bind mounts at ./storage/, ./postgres-data/, and ./redis-data/. The ./storage/ bind mount is attached to both the web container and the Sidekiq worker container, so both services can read and write the same files and background workers can immediately process files uploaded through the web interface.

The bind mount approach stores data directly on the host filesystem in the deployment directory, making backups straightforward with standard host-level tools. The tus_uploads and attack_resources volumes remain as named Docker volumes because they are typically mounted to dedicated filesystems in production environments.

All CipherSwarm Volumes#

The complete set of persistent storage used by CipherSwarm includes:

| Mount | Type | Mount Point | Purpose |
| --- | --- | --- | --- |
| ./storage/ | bind mount | /rails/storage | Application file storage (hash lists, wordlists, rules, masks) |
| ./postgres-data/ | bind mount | /var/lib/postgresql/data | PostgreSQL database files |
| ./redis-data/ | bind mount | /data | Redis cache and job queue persistence |
| tus_uploads | named volume | /srv/tusd-data | tus resumable upload staging |
| attack_resources | named volume | /data/attack_resources | Attack resource storage |

Tmpfs Mounts for Temporary Storage#

Both docker-compose.yml and docker-compose.prod.yml mount tmpfs (memory-backed temporary filesystems) on the web and sidekiq services to prevent disk exhaustion during Active Storage blob downloads and Rails caching operations:

| Mount Point | Size (Development) | Size (Production) | Purpose |
| --- | --- | --- | --- |
| /tmp | 1 GB | 512 MB | Active Storage blob downloads during background job processing |
| /rails/tmp | 256 MB | 256 MB | Bootsnap bytecode cache (~27 MB typical) |

Why tmpfs is required:

When background jobs call blob.open (in ProcessHashListJob, CountFileLinesJob, and CalculateMaskComplexityJob), ActiveStorage downloads the entire file from the storage backend to Dir.tmpdir (which resolves to /tmp in the container) before making it available for processing. Without dedicated tmpfs mounts, these downloads write to the Docker overlay filesystem, which can quickly exhaust available space during large file operations.

The /tmp tmpfs mount isolates this temporary download activity to a dedicated, memory-backed filesystem that automatically releases space when files are deleted. The /rails/tmp mount similarly isolates Bootsnap's bytecode cache, preventing it from competing with application storage.

Sizing guidance:

The /tmp tmpfs mount must be at least as large as the largest single attack file (hash list, wordlist, rule file, or mask file) that operators intend to process, with additional headroom for concurrent job execution. For example:

  • Single-file environments: If the largest file is 10 GB, allocate at least 12 GB for /tmp
  • Concurrent processing: If Sidekiq runs 10 concurrent jobs and the average file size is 2 GB, allocate at least 25 GB (10 × 2 GB + 25% headroom)
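The sizing arithmetic above can be written as a small helper. The method name tmpfs_size_gb and its parameters are hypothetical; the headroom is passed as a whole-number percentage so the calculation stays exact for the documented examples.

```ruby
# Recommended /tmp tmpfs size in GB: the space needed by all concurrent
# downloads plus a percentage of headroom, rounded up to a whole GB.
def tmpfs_size_gb(concurrent_jobs:, file_gb:, headroom_percent: 25)
  (concurrent_jobs * file_gb * (100 + headroom_percent) / 100.0).ceil
end

# Concurrent processing example: 10 jobs x 2 GB + 25% headroom.
p tmpfs_size_gb(concurrent_jobs: 10, file_gb: 2) # => 25
```

The single-file example (10 GB file, 12 GB allocation) corresponds to one job with 20% headroom: tmpfs_size_gb(concurrent_jobs: 1, file_gb: 10, headroom_percent: 20).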

Production deployments default to 512 MB for /tmp, suitable for typical wordlists and hash lists up to ~400 MB. Organizations processing larger files must increase this value in their docker-compose.prod.yml configuration:

tmpfs:
  - /tmp:size=10g,mode=1777 # Increase to 10 GB for large files
  - /rails/tmp:size=256m,mode=1777

For comprehensive guidance on tmpfs sizing, monitoring, and troubleshooting, see Docker Storage and /tmp Configuration.

Volume Management Operations#

Inspect volume contents:

docker compose -f docker-compose.yml -f docker-compose.prod.yml exec web ls -la /rails/storage

Check volume disk usage:

docker compose -f docker-compose.yml -f docker-compose.prod.yml exec web du -sh /rails/storage

Backup the storage volume:

# Create a backup archive from bind mount
tar czf storage-backup-$(date +%Y%m%d).tar.gz -C ./storage .

Restore from backup:

# Extract backup to bind mount
tar xzf storage-backup-20240101.tar.gz -C ./storage

Check available disk space:

docker compose -f docker-compose.yml -f docker-compose.prod.yml exec web df -h /rails/storage

Capacity Planning#

When planning storage capacity for CipherSwarm deployments:

  • Hash lists: Varies widely based on size; NTLM hash lists can be several GB
  • Wordlists: Large wordlists (rockyou, crackstation) range from 100 MB to 15+ GB
  • Rule files: Typically small (< 10 MB each)
  • Mask files: Typically very small (< 1 MB each)
  • Growth rate: Depends on usage patterns; consider 2-3x the initial dataset size for operational buffer

For production deployments, monitor disk usage trends and plan for expansion before reaching 80% capacity.

Troubleshooting#

Common Issues#

| Symptom | Cause | Solution |
| --- | --- | --- |
| Storage connection refused | Wrong AWS_ENDPOINT | Verify AWS_ENDPOINT points to the correct S3-compatible host and port |
| Application fails at startup with S3 | Missing required credentials | Ensure all AWS_* variables are set in .env when using S3 backend; see Environment Variables Reference for details |
| Migration fails midway | Network interruption or insufficient disk space | Run with DRY_RUN=true to preview; check available disk space with df -h |
| Files not accessible after migration | Service still points to old backend | Verify ACTIVE_STORAGE_SERVICE=local in .env and restart containers |
| Slow storage performance | Volume on slow disk | Consider moving Docker volumes to SSD storage or using S3-compatible backend |
| Background jobs fail with InsufficientTempStorageError | /tmp tmpfs mount is too small for the files being processed | Increase tmpfs size in docker-compose.prod.yml; see Docker Storage and /tmp Configuration |
| Background jobs fail with Errno::ENOSPC | /tmp tmpfs exhausted during blob download | Same as InsufficientTempStorageError; increase tmpfs size or reduce Sidekiq concurrency |
| Large file uploads hang with no progress | Client-side checksum computation stalling browser | Verify LARGE_FILE_THRESHOLD_MB is set appropriately; check browser console for JavaScript errors |
| checksum_verified: false persists on attack resources | VerifyChecksumJob failed or checksum mismatch | Review Sidekiq logs for errors; check available /tmp space for job processing; re-upload the file if integrity failure is logged |
| S3 upload fails with invalid Content-MD5 header | Initializer patches not loaded | Verify config/initializers/active_storage_large_upload.rb is present; restart application to reload initializers |

Diagnostic Commands#

Verify active storage configuration:

docker compose -f docker-compose.yml -f docker-compose.prod.yml exec web \
  bin/rails runner "puts Rails.application.config.active_storage.service"

List all stored blobs:

docker compose -f docker-compose.yml -f docker-compose.prod.yml exec web \
  bin/rails runner "ActiveStorage::Blob.find_each { |b| puts \"#{b.filename}: #{b.byte_size} bytes\" }"

Check storage service connectivity:

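# exist? returns false for a missing key, so a clean return (not the boolean) indicates connectivity
docker compose -f docker-compose.yml -f docker-compose.prod.yml exec web \
  bin/rails runner "begin; ActiveStorage::Blob.service.exist?('test'); puts 'Connected'; rescue => e; puts \"Failed: #{e.message}\"; end"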

Code Files Reference#

| File Path | Purpose |
| --- | --- |
| config/storage.yml | Defines available storage services (:local, :test, :s3) |
| config/environments/production.rb | Sets active storage service via ACTIVE_STORAGE_SERVICE environment variable |
| config/initializers/active_storage_large_upload.rb | Patches ActiveStorage::Blob validation and S3Service to support nil checksums for large files |
| app/controllers/active_storage/direct_uploads_controller.rb | Custom DirectUploadsController that accepts nil checksums and sets checksum_skipped metadata |
| app/javascript/utils/direct_upload_override.js | Patches FileChecksum.create to skip client-side MD5 for large files |
| app/jobs/verify_checksum_job.rb | Background job that computes and verifies checksums server-side for large files; retries I/O errors with automatic discard logging |
| app/jobs/requeue_unverified_resources_job.rb | Periodic job (runs every 6 hours) that re-enqueues VerifyChecksumJob for stale unverified resources |
| lib/tasks/storage_migrate.rake | Implements storage migration rake task with checksum verification |
| app/services/system_health_check_service.rb | Verifies storage backend connectivity and collects metrics |
| app/models/hash_list.rb | HashList model with ActiveStorage file attachment and tus upload support (temp_file_path, tus_upload_pending) |
| app/models/concerns/attack_resource.rb | Shared concern for WordList, RuleList, and MaskList file attachments with checksum_verified support |
| app/controllers/concerns/downloadable.rb | Controller concern for generating download URLs |
| app/controllers/concerns/tus_upload_handler.rb | Handles tus upload post-processing for attack resources and hash lists |
| app/controllers/hash_lists_controller.rb | Includes TusUploadHandler and calls process_tus_hash_list_upload after record creation |
| app/helpers/application_helper.rb | Contains tus_endpoint helper that returns ENV["TUS_ENDPOINT_URL"] or /uploads/ default |
| app/jobs/process_hash_list_job.rb | Background job for processing uploaded hash lists; checks temp_file_path before falling back to blob.open |
| app/jobs/count_file_lines_job.rb | Background job for counting lines in uploaded files; checks file_path before falling back to blob.open |
| app/jobs/concerns/temp_storage_validation.rb | Concern for pre-download temp storage space checks in ingest jobs |
| app/errors/insufficient_temp_storage_error.rb | Custom error raised when /tmp space is insufficient for blob downloads |
| db/migrate/20260317034648_add_checksum_verified_to_attack_resources.rb | Migration adding checksum_verified boolean column to attack resource tables |
| db/migrate/20260329211021_add_checksum_sweep_indexes.rb | Migration adding partial indexes on updated_at WHERE checksum_verified = false for efficient sweep queries |
| config/schedule.yml | Cron schedule configuration including RequeueUnverifiedResourcesJob (every 6 hours) |
| config/configs/application_config.rb | Application configuration including checksum_verification_retry_threshold (default 6 hours) |
| docker-compose.prod.yml | Production Docker Compose configuration with volume definitions and tmpfs mounts |
| docker-compose.yml | Development Docker Compose configuration |
| .env | Environment variables for storage configuration (not in repository) |

  • CipherSwarm Installation and Deployment: Complete setup procedures for production environments
  • Air-Gapped Deployment Strategies: Methods for deploying applications without Internet connectivity
  • Docker Volume Management: Best practices for persistent data in containerized applications
  • Docker Storage and /tmp Configuration: Comprehensive guide to tmpfs sizing, monitoring, and troubleshooting for Active Storage blob processing
  • Rails ActiveStorage: Ruby on Rails framework for file upload and storage abstraction
  • Distributed Hash Cracking Architecture: CipherSwarm's approach to coordinating distributed password cracking operations