Agent API Lifecycle Workflow#

The Agent API Lifecycle Workflow describes the complete sequence of API operations that CipherSwarm agents execute when interacting with the server, from initial startup through task completion. This workflow enables distributed hash cracking operations in secure, air-gapped environments by orchestrating communication between remote agents (client software) and the central CipherSwarm server (Rails application).

All agent API endpoints are organized under the /api/v1/client/* namespace and require bearer token authentication via the Authorization: Bearer <token> header. The lifecycle consists of five major phases: startup sequence (authentication, configuration, and registration), task acquisition (requesting and accepting work), work execution (status updates and crack submission), task completion (exhaustion or abandonment), and maintenance operations (heartbeat, error reporting, and shutdown). Understanding this workflow is essential for deploying CipherSwarm agents in production environments, particularly in network-isolated security labs where reliable distributed processing is critical.

The workflow implements sophisticated error handling and state management to ensure reliable operation across potentially unreliable network connections. Token validation occurs through direct database lookup, and the system tracks agent activity by updating timestamps and IP addresses on each authenticated request. This enables administrators to monitor agent health and troubleshoot connectivity issues in isolated network environments.

Startup Sequence#

The startup sequence establishes the agent's identity and retrieves configuration settings necessary for operation. This process must complete successfully before the agent can request work, though some steps (such as benchmark submission) are conditional based on server directives.

Authentication#

Agents begin by authenticating with the CipherSwarm server using GET /api/v1/client/authenticate. Token validation occurs via Agent.find_by(token: token) in the database, and successful authentication updates the last_seen_at timestamp and IP address for monitoring purposes.

Successful Response (200 OK):

{
  "authenticated": true,
  "agent_id": 42
}

Authentication Failure (401 Unauthorized):

{
  "error": "Bad credentials"
}

If authentication fails, the agent should cease all operations and alert administrators, as the token may have been revoked or is incorrectly configured.
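
As an illustration of the authentication call, the following Go sketch sends the bearer token and maps the documented 200 and 401 responses to a result. The names authenticate, fakeClient, and ErrBadCredentials are hypothetical, and the stubbed transport stands in for a real server so the example runs without a network; it is not the agent's actual implementation.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// ErrBadCredentials signals a 401; the agent should stop rather than retry.
var ErrBadCredentials = errors.New("bad credentials")

// authResponse mirrors the documented 200 OK body.
type authResponse struct {
	Authenticated bool  `json:"authenticated"`
	AgentID       int64 `json:"agent_id"`
}

// authenticate calls GET {baseURL}/api/v1/client/authenticate with a bearer token
// and returns the agent ID on success.
func authenticate(client *http.Client, baseURL, token string) (int64, error) {
	req, err := http.NewRequest(http.MethodGet, baseURL+"/api/v1/client/authenticate", nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK:
		var body authResponse
		if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
			return 0, err
		}
		if !body.Authenticated {
			return 0, ErrBadCredentials
		}
		return body.AgentID, nil
	case http.StatusUnauthorized:
		return 0, ErrBadCredentials
	default:
		return 0, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
}

// roundTripFunc lets a plain function act as an http.RoundTripper.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// fakeClient is a demo-only stand-in for the server: it accepts one token.
func fakeClient(validToken string) *http.Client {
	return &http.Client{Transport: roundTripFunc(func(r *http.Request) (*http.Response, error) {
		status, body := http.StatusOK, `{"authenticated": true, "agent_id": 42}`
		if r.Header.Get("Authorization") != "Bearer "+validToken {
			status, body = http.StatusUnauthorized, `{"error": "Bad credentials"}`
		}
		return &http.Response{
			StatusCode: status,
			Header:     make(http.Header),
			Body:       io.NopCloser(strings.NewReader(body)),
			Request:    r,
		}, nil
	})}
}

func main() {
	client := fakeClient("secret")

	id, err := authenticate(client, "http://cipherswarm.invalid", "secret")
	fmt.Println("agent id:", id, "error:", err)

	_, err = authenticate(client, "http://cipherswarm.invalid", "wrong")
	if errors.Is(err, ErrBadCredentials) {
		fmt.Println("token rejected: stop and alert an administrator")
	}
}
```

Returning a dedicated error for 401 lets callers separate "stop and alert" from transient failures that are worth retrying.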

Configuration Retrieval#

After authentication, agents retrieve global configuration settings using GET /api/v1/client/configuration. The response contains operational parameters such as polling intervals, resource limits, server capabilities, and the benchmarks_needed flag indicating whether the agent should run benchmarks. Agents should cache this response and only refresh on restart or when explicitly notified of configuration changes.

Response (200 OK): Returns an AgentConfigurationResponse object containing:

config: Advanced hashcat and agent configuration options
api_version: The minimum accepted version of the API
benchmarks_needed: Boolean flag indicating whether the server needs benchmark data from this agent
recommended_timeouts: Server-recommended timeout values (connect, read, write, request) in seconds
recommended_retry: Retry policy parameters (max_attempts, initial_delay, max_delay)
recommended_circuit_breaker: Circuit breaker thresholds (failure_threshold, timeout)

The resilience parameters enable agents to configure their HTTP clients with server-provided values, allowing operators to adjust timeout and retry behavior without redeploying agents. Agents should apply these values when initializing HTTP clients and refresh them on restart or when configuration changes are detected.

When benchmarks_needed is false, the server already has valid cached benchmark results for this agent, and the agent can skip the benchmark execution step entirely. This improves startup time and reduces resource consumption when benchmark results haven't changed since the last submission.
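
A minimal Go sketch of handling the configuration response follows. It decodes only a few fields and branches on benchmarks_needed. The nested key names (connect_timeout and so on) and the helper parseConfiguration are assumptions made for illustration; the fallback values are the defaults this document lists for the timeout settings.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Timeouts holds recommended_timeouts, in seconds. The *_timeout key names are
// an assumption; check them against the real AgentConfigurationResponse schema.
type Timeouts struct {
	Connect int `json:"connect_timeout"`
	Read    int `json:"read_timeout"`
	Write   int `json:"write_timeout"`
	Request int `json:"request_timeout"`
}

// AgentConfiguration covers only the fields this sketch uses.
type AgentConfiguration struct {
	APIVersion       int      `json:"api_version"`
	BenchmarksNeeded bool     `json:"benchmarks_needed"`
	Timeouts         Timeouts `json:"recommended_timeouts"`
}

// parseConfiguration decodes the response body and fills in the documented
// default timeouts (10/30/30/60 seconds) for any value the server omitted.
func parseConfiguration(body []byte) (AgentConfiguration, error) {
	var cfg AgentConfiguration
	if err := json.Unmarshal(body, &cfg); err != nil {
		return cfg, err
	}
	fill := func(v *int, def int) {
		if *v <= 0 {
			*v = def
		}
	}
	fill(&cfg.Timeouts.Connect, 10)
	fill(&cfg.Timeouts.Read, 30)
	fill(&cfg.Timeouts.Write, 30)
	fill(&cfg.Timeouts.Request, 60)
	return cfg, nil
}

func main() {
	body := []byte(`{
		"api_version": 1,
		"benchmarks_needed": false,
		"recommended_timeouts": {"connect_timeout": 5, "read_timeout": 20}
	}`)

	cfg, err := parseConfiguration(body)
	if err != nil {
		fmt.Println("configuration error:", err)
		return
	}
	if cfg.BenchmarksNeeded {
		fmt.Println("running benchmarks before requesting work")
	} else {
		fmt.Println("skipping benchmarks: server has cached results")
	}
	fmt.Printf("timeouts: %+v\n", cfg.Timeouts)
}
```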

Possible Errors:

401 Unauthorized: Token invalid
404 Not Found: Agent not found in database
422 Unprocessable Content: Configuration validation error

Agent Registration#

Agents update their metadata on the server using PUT /api/v1/client/agents/{id}. This endpoint accepts an AgentUpdateV1 schema containing agent information such as hostname, operating system, device details, and installed hashcat version. This information helps administrators identify agents and monitor the cluster's composition.

Request Body: AgentUpdateV1 schema (hostname, device information, hashcat version)

Response (200 OK): Returns updated AgentResponseV1 object

Benchmark Submission#

Benchmark submission is conditional based on the benchmarks_needed flag received during configuration retrieval. When benchmarks_needed is true, agents execute performance benchmarks and submit them using POST /api/v1/client/agents/{id}/submit_benchmark. Benchmarks report the agent's hash cracking capabilities across different hash types, enabling the server to make intelligent task assignment decisions based on agent performance characteristics.

When benchmarks_needed is false, the agent skips benchmark execution entirely because the server has valid cached benchmark results from a previous submission. This optimization significantly improves agent startup time by avoiding the resource-intensive benchmark process when hardware configuration hasn't changed.

Request Body: AgentBenchmark schema containing device capabilities and performance metrics for various hash algorithms

Response: 200 OK with a BenchmarkReceipt JSON body containing received_count, processed_count, failed_count, and an optional message field. Legacy servers may instead return 204 No Content, in which case there is no receipt to validate.

The agent validates the receipt counts (received, processed, failed). Count mismatches and partial failures are logged as warnings (advisory-only); HTTP 204 responses are still accepted for backward compatibility.

Force Benchmark Override:

Operators can force a fresh benchmark run using the --force-benchmark CLI flag, which overrides the server's benchmarks_needed signal. This is useful when:

Hardware configuration has changed (GPU drivers updated, new device installed)
Cached benchmark results are suspected to be stale or inaccurate
Troubleshooting performance issues requires fresh baseline metrics

The force benchmark flag bypasses the local cache and server directive, ensuring a complete re-run of all benchmark tests.
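
The benchmark decision and the advisory receipt check can be sketched in Go as shown here. The function names and the exact set of checks that count as a "mismatch" are assumptions for illustration; the source states only that count mismatches and partial failures are logged as warnings and that a 204 response carries no receipt.

```go
package main

import "fmt"

// BenchmarkReceipt mirrors the documented receipt fields.
type BenchmarkReceipt struct {
	ReceivedCount  int    `json:"received_count"`
	ProcessedCount int    `json:"processed_count"`
	FailedCount    int    `json:"failed_count"`
	Message        string `json:"message,omitempty"`
}

// shouldRunBenchmarks: the --force-benchmark flag overrides the server signal.
func shouldRunBenchmarks(forceFlag, benchmarksNeeded bool) bool {
	return forceFlag || benchmarksNeeded
}

// receiptWarnings returns advisory warnings for a submission of `sent` entries.
// A nil receipt represents a 204 No Content response from a legacy server.
func receiptWarnings(sent int, receipt *BenchmarkReceipt) []string {
	if receipt == nil {
		return nil // nothing to validate
	}
	var warnings []string
	if receipt.ReceivedCount != sent {
		warnings = append(warnings, fmt.Sprintf("sent %d benchmarks, server received %d", sent, receipt.ReceivedCount))
	}
	if receipt.ProcessedCount+receipt.FailedCount != receipt.ReceivedCount {
		warnings = append(warnings, "processed + failed does not match received")
	}
	if receipt.FailedCount > 0 {
		warnings = append(warnings, fmt.Sprintf("%d benchmarks failed server-side", receipt.FailedCount))
	}
	return warnings
}

func main() {
	fmt.Println("run benchmarks:", shouldRunBenchmarks(false, true))

	receipt := &BenchmarkReceipt{ReceivedCount: 3, ProcessedCount: 2, FailedCount: 1}
	for _, w := range receiptWarnings(3, receipt) {
		fmt.Println("warning:", w) // advisory only; the agent keeps going
	}
}
```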

Task Acquisition Loop#

After startup, agents enter the task acquisition loop, continuously polling for available work. This loop implements the distributed work assignment mechanism that allocates hash cracking tasks across the agent pool.

Task Request#

Agents request new work using GET /api/v1/client/tasks/new. The server filters available tasks by the agent's project membership using where(campaigns: {project_id: agent.project_ids}), ensuring agents only receive work from projects they're authorized to access.

Quarantine Filtering:

The task assignment service excludes tasks from quarantined campaigns at all stages of the assignment algorithm. Campaigns are automatically quarantined when agents report unrecoverable hashcat errors (such as token length exceptions or "no hashes loaded" failures) via the error reporting endpoint. Quarantined campaigns remain excluded from task assignment until administrators manually clear the quarantine or the underlying issue is resolved by updating the hash list or attack parameters. This prevents agents from wasting computational resources repeatedly attempting tasks with fundamental configuration errors.

The task assignment algorithm implements a four-step priority system designed to maximize efficiency and minimize redundant work:

Incomplete assigned tasks (highest priority): Returns any incomplete task already assigned to the agent that doesn't have fatal errors
Agent's own paused tasks: Reclaims the agent's own paused tasks (e.g., after restart) to leverage existing restore files and progress
Orphaned paused tasks (with grace period): Claims paused tasks from other agents after a 30-minute grace period (default), or immediately if the original agent is offline/stopped
Available attack tasks (standard allocation): Finds failed retryable tasks, pending tasks from existing attacks, or creates new tasks for attacks without pending work

All four stages filter out quarantined campaigns using where(campaigns: { quarantined: false }) to prevent task assignment from campaigns with unrecoverable errors.

This priority system ensures agents preferentially resume their own interrupted work before claiming tasks from other agents, reducing wasted computation. The grace period prevents premature task reassignment when an agent briefly disconnects, allowing it time to reconnect and resume with its existing restore files intact.

Response (200 OK): Task data including task ID and attack ID

When no work is available, the agent should wait before polling again (typical interval: 5-30 seconds depending on workload).

Task Acceptance#

Upon receiving a task, the agent accepts it using POST /api/v1/client/tasks/{id}/accept_task. Task authorization is enforced through @task = @agent.tasks.find(params[:id]), which ensures agents can only accept tasks assigned to them.

Response: 204 No Content on successful acceptance

Possible Errors:

422 Unprocessable Content: Task already completed
404 Not Found: Task not found or not assigned to this agent
403 Forbidden: Agent lacks permission

Accepting a task transitions it from pending to running state in the task state machine.

Acceptance Error Handling#

The agent implements specialized error handling for task acceptance failures, distinguishing between transient race conditions and genuine errors:

404 Not Found (Task Vanished):

When the server responds with 404 during task acceptance, this indicates the task disappeared before the agent could claim it—a normal race condition in distributed systems. The agent treats this as a terminal no-op:

No AbandonTask call: Skips the abandon endpoint entirely, because the task no longer exists on the server
Local cleanup only: Removes downloaded files (hash lists, wordlists) associated with the vanished task
Immediate retry: Returns to the task request loop without delay to check for new work
Info-level logging: Logs "Task no longer exists on server" at Info severity, not Critical

This behavior prevents unnecessary API traffic and false critical-severity alerts when tasks are legitimately removed by concurrent operations (campaign cancellation, attack completion, or another agent claiming the task first).

Other Acceptance Errors (403, 422, 5xx):

For non-404 failures, the agent treats acceptance failure as a genuine error requiring cleanup:

Calls AbandonTask: Explicitly abandons the task via POST /api/v1/client/tasks/{id}/abandon, triggering the cascade that destroys all tasks for that attack (see Task Abandonment section)
Applies failure delay: Sleeps for the configured sleepOnFailure duration before requesting new work, preventing tight retry loops
Critical-level logging: Logs at Critical severity with full error context for administrator investigation

Sentinel Errors:

The agent uses two sentinel errors (defined in lib/task/errors.go) to distinguish acceptance failure modes:

ErrTaskAcceptNotFound: Returned when the server responds 404
ErrTaskAcceptFailed: Returned for all other acceptance errors

Callers use errors.Is() to check the specific failure mode and implement appropriate recovery logic.
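
The sentinel-error pattern described above might look like the following Go sketch. Only the two sentinel names come from the source; the error messages, classifyAcceptStatus, planRecovery, and the recovery struct are hypothetical and do not reproduce the contents of lib/task/errors.go.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Sentinel names are from the text; the messages are illustrative.
var (
	ErrTaskAcceptNotFound = errors.New("task accept: task no longer exists on server")
	ErrTaskAcceptFailed   = errors.New("task accept: failed")
)

// classifyAcceptStatus maps the accept_task HTTP status to a wrapped sentinel.
func classifyAcceptStatus(status int) error {
	switch status {
	case http.StatusNoContent:
		return nil
	case http.StatusNotFound:
		return fmt.Errorf("%w (status %d)", ErrTaskAcceptNotFound, status)
	default:
		return fmt.Errorf("%w (status %d)", ErrTaskAcceptFailed, status)
	}
}

// recovery describes what the caller should do after an acceptance attempt.
type recovery struct {
	CallAbandon    bool   // invoke the abandon endpoint
	SleepOnFailure bool   // wait sleepOnFailure before polling again
	LogLevel       string // "info", "critical", or "" when nothing went wrong
}

// planRecovery uses errors.Is so wrapped sentinels are still recognized.
func planRecovery(err error) recovery {
	switch {
	case errors.Is(err, ErrTaskAcceptNotFound):
		return recovery{LogLevel: "info"} // local cleanup only, retry immediately
	case errors.Is(err, ErrTaskAcceptFailed):
		return recovery{CallAbandon: true, SleepOnFailure: true, LogLevel: "critical"}
	default:
		return recovery{} // accepted; proceed to attack details
	}
}

func main() {
	for _, status := range []int{204, 404, 422} {
		plan := planRecovery(classifyAcceptStatus(status))
		fmt.Printf("status %d -> %+v\n", status, plan)
	}
}
```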

Attack Details Retrieval#

Before execution, agents retrieve comprehensive attack configuration using GET /api/v1/client/attacks/{id}. This endpoint should be called once per task after acceptance, before starting hashcat execution. Authorization occurs via task assignment: Attack.joins(:tasks).where(tasks: { agent: @agent }), preventing agents from accessing attacks they're not assigned to.

The attack configuration response includes all parameters needed to construct the hashcat command:

Attack mode: Dictionary, mask, hybrid dictionary, or hybrid mask, plus the corresponding hashcat numeric mode (0, 3, 6, or 7)
Hash type: Hashcat mode ID (e.g., 1000 for NTLM) extracted from the campaign's hash list
Performance parameters: Workload profile (1-4), optimization flags, slow candidate generator settings
Attack strategy: Mask patterns, increment mode settings (minimum/maximum lengths), rule files, custom character sets (1-4)
Markov chain settings: Enable/disable flags, classic Markov mode, threshold values
Resource objects: Download URLs, MD5 checksums, and filenames for wordlists, rule files, and mask files
Hash list metadata: Download URL and MD5 checksum for integrity verification

Agents cache this configuration for the task duration, as attack parameters don't change during execution.

Error Response (404 Not Found):

{
  "error": "Attack not found."
}

Hash List Download#

Agents download the target hash list using GET /api/v1/client/attacks/{id}/hash_list. The endpoint returns plain text format with one hash per line, directly compatible with hashcat's input requirements.

For efficiency, the server streams only uncracked hashes in batches of 10,000 to avoid memory overhead when processing large hash lists. This ensures the server can handle massive hash lists (millions of entries) without consuming excessive memory.

The response includes Content-Type: text/plain and a Content-Disposition header with filename {hash_list_id}.txt. Agents save the hash list to a local file named {attack_id}.hsh and validate that the file is not empty before proceeding.

Example Hash List Format:

5f4dcc3b5aa765d61d8327deb882cf99
e99a18c428cb38d5f260853678922e03
098f6bcd4621d373cade4e832627b4f6

Possible Errors:

404 Not Found: Attack not found or hash list missing
401 Unauthorized: Invalid token
403 Forbidden: Agent lacks permission

Work Execution Loop#

During task execution, agents maintain communication with the server through periodic status updates and crack submissions while hashcat processes the attack.

Heartbeat#

Agents send periodic heartbeat messages using POST /api/v1/client/agents/{id}/heartbeat to signal that they're online and operational. The recommended interval is 30-60 seconds. Heartbeats update the agent's last_seen_at timestamp, enabling the server to detect offline agents and trigger automated recovery procedures.

Request Body: AgentHeartbeatRequest schema containing current agent state

Response: 204 No Content

Heartbeats continue throughout the agent's lifetime, independent of task execution.

Status Updates#

Agents report task progress using POST /api/v1/client/tasks/{id}/submit_status at recommended intervals of 10-30 seconds during active execution. Status updates include progress percentage, hash rate, estimated time remaining, and current position in the keyspace.

Request Body: TaskStatusUpdate schema containing:

Progress percentage (0-100)
Current hash rate (hashes per second)
Estimated time remaining
Checkpoint position in keyspace

Response Codes:

204 No Content: Continue execution normally
202 Accepted: Status accepted but data appears stale; agent should abandon current work
410 Gone: Task has been paused server-side; agent must stop execution immediately
404 Not Found: Task no longer exists; implement recovery strategy

The varied response codes enable the server to control agent behavior dynamically, supporting administrative actions like pausing campaigns or reassigning work.
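
One way an agent could translate these status codes into actions is shown in the Go sketch below. The StatusAction type and its value names are hypothetical; only the status-code meanings are taken from the list above.

```go
package main

import (
	"fmt"
	"net/http"
)

// StatusAction is what the agent does after a submit_status response.
type StatusAction int

const (
	ActionContinue        StatusAction = iota // 204: keep working
	ActionAbandonStale                        // 202: data stale, abandon current work
	ActionStopPaused                          // 410: task paused, stop immediately
	ActionRecoverNotFound                     // 404: task gone, run recovery strategy
	ActionUnexpected                          // anything else: treat as an error
)

func (a StatusAction) String() string {
	return [...]string{"continue", "abandon-stale", "stop-paused", "recover-not-found", "unexpected"}[a]
}

// actionForStatus maps an HTTP status code to the documented agent behavior.
func actionForStatus(code int) StatusAction {
	switch code {
	case http.StatusNoContent:
		return ActionContinue
	case http.StatusAccepted:
		return ActionAbandonStale
	case http.StatusGone:
		return ActionStopPaused
	case http.StatusNotFound:
		return ActionRecoverNotFound
	default:
		return ActionUnexpected
	}
}

func main() {
	for _, code := range []int{204, 202, 410, 404, 500} {
		fmt.Printf("%d -> %s\n", code, actionForStatus(code))
	}
}
```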

Crack Submission#

As hashcat discovers plaintext passwords, agents submit them using POST /api/v1/client/tasks/{id}/submit_crack. Submissions should occur as soon as cracks are found to ensure other agents can skip already-cracked hashes.

Request Body: HashcatResult schema containing:

Hash value
Plaintext password
Timestamp

Response (200 OK): Confirmation that the crack was recorded

Possible Errors:

422 Unprocessable Content: Validation error (duplicate submission, malformed data)
409 Conflict: Crack already submitted by another agent

Get Previously Cracked Hashes#

To avoid wasting computational resources on already-cracked hashes, agents periodically retrieve completed hashes using GET /api/v1/client/tasks/{id}/get_zaps. The endpoint returns a text/plain file containing cracked hash values, which agents pass to hashcat via the --remove flag to exclude them from processing.

Response (200 OK): Plain text file with one hash per line

Possible Errors:

404 Not Found: Task not found
401 Unauthorized: Invalid token
403 Forbidden: Agent lacks permission

Task Completion Paths#

Tasks conclude through one of three paths depending on execution outcomes.

Successful Completion#

When all hashes are cracked, the task automatically transitions from running to completed via the accept_crack event when no uncracked hashes remain. This transition occurs automatically upon submitting the final crack; no explicit completion endpoint is required.

Keyspace Exhaustion#

When hashcat fully explores the configured keyspace but some hashes remain uncracked, agents signal exhaustion using POST /api/v1/client/tasks/{id}/exhausted. This transitions the task from running to exhausted state.

Response: 204 No Content

Possible Errors:

404 Not Found: Task not found
401 Unauthorized: Invalid token
403 Forbidden: Agent lacks permission
422 Unprocessable Content: Task already completed or exhausted

Exhausted tasks indicate the attack configuration didn't successfully crack all hashes, requiring administrators to configure additional attacks with different parameters.

Task Abandonment#

When agents cannot complete a task (due to errors, resource constraints, or administrative intervention), they abandon it using POST /api/v1/client/tasks/{id}/abandon.

Critical Warning: Abandoning a task triggers attack.abandon, which destroys ALL tasks associated with that attack, not just the abandoned task. This cascade behavior prevents partial attack execution when fundamental issues exist with the attack configuration. Agents should use abandonment only for irrecoverable errors, not transient issues.

Important Exception: 404 errors during task acceptance do NOT trigger the abandon cascade. When a task vanishes before acceptance (a normal race condition), the agent skips the AbandonTask call entirely and proceeds directly to cleanup and a new task request. This avoids attempting to abandon tasks that no longer exist on the server, which would only produce unnecessary API errors and false critical-severity alerts. The cascade warning applies only when AbandonTask is actually invoked: during genuine acceptance failures (403, 422, 5xx) or when the agent encounters errors after successfully accepting a task.

Response: 204 No Content

Possible Errors:

422 Unprocessable Content: Task already completed
404 Not Found: Task not found
401 Unauthorized: Invalid token
403 Forbidden: Agent lacks permission

Maintenance Operations#

Agents perform several maintenance operations throughout their lifecycle to report errors, check for updates, verify connectivity, and signal shutdown.

Health Check#

Agents can verify server availability using GET /api/v1/client/health before attempting authenticated requests. This unauthenticated endpoint enables connectivity probes during circuit breaker recovery, initial setup, and diagnostics.

Response (200 OK):

{
  "status": "ok",
  "api_version": 1,
  "timestamp": "2026-03-12T10:30:45Z",
  "database": "healthy"
}

Response (503 Service Unavailable):

{
  "status": "degraded",
  "api_version": 1,
  "timestamp": "2026-03-12T10:30:45Z",
  "database": "unhealthy"
}

Use Cases:

Circuit breaker probes: When transitioning from open to half-open state, probe /health before resuming authenticated requests
Initial connectivity check: Verify server reachability during agent startup before authentication
Diagnostics: Distinguish between network connectivity issues and authentication failures

The health endpoint bypasses agent token authentication to enable connectivity verification when credentials are unavailable or expired. It performs a lightweight database check to detect common infrastructure failures.

Graceful Shutdown#

When stopping, agents notify the server using POST /api/v1/client/agents/{id}/shutdown. This marks the agent as intentionally offline, distinguishing graceful shutdown from connectivity loss.

Response: 204 No Content

Administrators can use this signal to differentiate between normal operations and potential failures requiring investigation.

Enhanced Shutdown Cascade:

The shutdown process implements a sophisticated cascade that cleanly pauses ongoing work:

Task Pause: All running tasks assigned to the agent are paused and their claim fields (claimed_by_agent_id, claimed_at, expires_at) are cleared. The paused_at timestamp is set to Time.zone.now, initiating the grace period for task recovery.
Attack Pause: Attacks with no remaining active tasks (pending or running) are automatically paused. This prevents attacks from appearing active in the system when all agents have stopped work on them.
Automatic Resume: When an agent claims a paused task from a paused attack (either the original agent reclaiming it during the grace period or a different agent claiming it after the grace period expires), the attack is automatically resumed. This prevents attacks from remaining stuck in the paused state when work resumes.
Error Handling: The shutdown cascade uses try-rescue blocks around state transitions. If a task or attack fails to pause (due to concurrent state changes), the shutdown process continues with other tasks. Claim fields are only cleared for successfully paused tasks to avoid inconsistent states.

The agent_id field always remains populated (NOT NULL constraint), enabling the system to track which agent originally owned each paused task for grace period calculations and metrics.

Error Reporting#

Agents submit detailed error information using POST /api/v1/client/agents/{id}/submit_error when encountering operational issues. Error reports include stack traces, error messages, and structured context metadata to facilitate troubleshooting and enable programmatic server-side decisions.

Request Body: AgentErrorV1 schema containing:

Error message
Stack trace
Severity level
Timestamp
Context (task ID, attack ID, etc.)
Structured error metadata (via other field)

Response: 204 No Content

Structured Error Context:

Agents perform client-side error classification using the ClassifyStderr function, which analyzes hashcat output and extracts structured context fields before submission. The agent sends these fields via the other metadata map, enabling the server to make programmatic decisions about agent health and task assignment without parsing raw error text.

Required Metadata Fields for Quarantine Logic:

The server inspects metadata.other for the following fields when evaluating whether to quarantine the campaign:

category (string, required): Error domain—hash_format, hardware, runtime, or config
retryable (boolean, required): Whether the error is transient (true) or permanent (false)
terminal (boolean, optional): Definitive failures where no retry can succeed (e.g., no_hashes_loaded)
error_type (string, optional): Machine-readable identifier (e.g., token_length_exception, hashfile_empty_or_corrupt)

The server triggers automatic quarantine when both conditions are met:

retryable == false, AND
(category == "hash_format" OR terminal == true)

This ensures campaigns are quarantined only for unrecoverable configuration errors (invalid hash format, empty hash files, token length mismatches) while allowing transient errors (network timeouts, temporary GPU failures) to be retried.
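
The quarantine rule can be written as a single predicate. The Go sketch below mirrors it on the agent side, for example to predict in tests whether a report would quarantine a campaign. The server-side logic lives in the Rails application and is not shown in the source; errorMetadata and wouldQuarantine are hypothetical names.

```go
package main

import "fmt"

// errorMetadata holds the fields the server inspects in metadata.other.
type errorMetadata struct {
	Category  string // hash_format, hardware, runtime, or config
	Retryable bool   // true for transient errors
	Terminal  bool   // true when no retry can succeed
	ErrorType string // optional machine-readable identifier
}

// wouldQuarantine applies the documented rule:
// not retryable AND (hash_format category OR terminal).
func wouldQuarantine(m errorMetadata) bool {
	return !m.Retryable && (m.Category == "hash_format" || m.Terminal)
}

func main() {
	reports := []errorMetadata{
		{Category: "hash_format", ErrorType: "token_length_exception"},
		{Category: "runtime", Terminal: true, ErrorType: "no_hashes_loaded"},
		{Category: "hardware", Retryable: true},
		{Category: "config"},
	}
	for _, r := range reports {
		fmt.Printf("%-12s retryable=%-5t terminal=%-5t -> quarantine=%t\n",
			r.Category, r.Retryable, r.Terminal, wouldQuarantine(r))
	}
}
```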

Additional Context Fields:

hashfile: Path to affected hashfile (for hash parsing errors)
line_number: Line number where parsing error occurred
hash_preview: Truncated hash preview (max 64 characters) for parsing errors
device_id: GPU device ID for device-specific errors
backend_api: Backend API name ("OpenCL", "CUDA", "HIP", "Metal")
api_error: Specific backend API error code (e.g., "CL_OUT_OF_HOST_MEMORY")
exit_code_name: Named exit code for programmatic classification (e.g., "exhausted", "memory_hit")

Enhanced Error Classification:

The agent classifies 15+ error patterns including:

Hash parsing errors: Stdout summary lines and per-hash errors with file context
Kernel failures: Build and creation failures with device and kernel path
Backend API errors: OpenCL, CUDA, HIP, and Metal errors with device context
Memory exhaustion: Device memory errors with device ID
Temperature limits: GPU thermal abort events with device ID
Self-test/autotune failures: Fatal kernel test failures
Configuration errors: Invalid hash mode, keyspace overflow, stdin timeout

StderrMessages Channel:

The StderrMessages channel has been changed from chan string to chan ErrorInfo, providing pre-classified errors with structured context. Error classification occurs once in the session handlers, eliminating redundant classification by consumers. The ErrorInfo struct contains:

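type ErrorInfo struct {
    Category  ErrorCategory  // error domain (e.g., hash_format, hardware, runtime, config)
    Severity  api.Severity   // severity level reported to the server
    Retryable bool           // true if the error is transient
    Message   string         // human-readable error message
    Context   map[string]any // structured context fields (hashfile, device_id, ...)
}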

WithContext ErrorOption:

Error submissions use the cserrors.WithContext() function to merge structured fields into the error metadata:

cserrors.SendAgentError(ctx, client, severity, message,
    cserrors.WithClassification(category, retryable),
    cserrors.WithContext(errInfo.Context))

Server-Side Campaign Quarantine:

When agents submit errors via POST /api/v1/client/agents/{id}/submit_error, the server automatically quarantines the associated campaign if the error metadata indicates an unrecoverable failure. The quarantine logic evaluates the structured metadata fields (category, retryable, terminal) and invokes campaign.quarantine!(error_message) when the conditions are met.

Quarantined campaigns are immediately excluded from all task assignment queries, preventing agents from receiving new tasks from campaigns with fundamental configuration errors. The quarantine state persists until administrators manually clear it or the underlying issue is resolved through hash list or attack parameter updates.

Cracker Updates#

Agents check for updated hashcat binaries using GET /api/v1/client/crackers/check_for_cracker_update?version=<semver>&operating_system=<os>. The endpoint accepts the current hashcat version (semantic versioning format) and operating system (windows, linux, darwin).

Response (200 OK): CrackerUpdateResponse object if an update is available, containing download URL and version information

Possible Errors:

400 Bad Request: Invalid parameters
401 Unauthorized: Invalid token

This mechanism enables centralized hashcat version management across the agent fleet, essential in air-gapped environments where direct internet access is unavailable.

Error Handling and State Transitions#

Robust error handling ensures reliable operation despite network issues, configuration problems, and server-side events. Agents should implement timeout protection, exponential backoff retries, and circuit breaker patterns to maintain resilience in unreliable network environments.

Timeout Configuration#

Agents must configure their HTTP clients with the timeout values received from GET /api/v1/client/configuration:

connect_timeout: Maximum time to establish TCP connection (default: 10 seconds)
read_timeout: Maximum time to wait for response data after connection (default: 30 seconds)
write_timeout: Maximum time to send request data, including file uploads (default: 30 seconds)
request_timeout: Overall deadline wrapping the entire request lifecycle (default: 60 seconds)

Timeouts prevent agents from hanging indefinitely when the server becomes unresponsive. Without timeouts, an agent blocked on a single request cannot detect connectivity loss or perform recovery operations.
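
A rough mapping of these four values onto Go's standard net/http client is sketched below. The Timeouts struct and function names are hypothetical. Note that net/http has no dedicated client-side write timeout, so in this sketch the write phase is bounded only by the overall request deadline; that choice is an assumption, not documented agent behavior.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// Timeouts carries the server-recommended values as durations.
type Timeouts struct {
	Connect, Read, Write, Request time.Duration
}

// defaultTimeouts returns the defaults listed in this section.
func defaultTimeouts() Timeouts {
	return Timeouts{
		Connect: 10 * time.Second,
		Read:    30 * time.Second,
		Write:   30 * time.Second,
		Request: 60 * time.Second,
	}
}

// newHTTPClient builds a client whose phases are bounded by the given values.
func newHTTPClient(t Timeouts) *http.Client {
	transport := &http.Transport{
		// connect_timeout: limit on establishing the TCP connection.
		DialContext: (&net.Dialer{Timeout: t.Connect}).DialContext,
		// read_timeout (approximation): limit on waiting for response headers.
		ResponseHeaderTimeout: t.Read,
		// write_timeout has no direct equivalent here; Client.Timeout covers it.
	}
	return &http.Client{
		Transport: transport,
		Timeout:   t.Request, // request_timeout: deadline for the whole exchange
	}
}

func main() {
	client := newHTTPClient(defaultTimeouts())
	fmt.Println("overall request deadline:", client.Timeout)
}
```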

Retry Strategy with Circuit Breaker#

Agents should implement exponential backoff with jitter using the recommended_retry parameters from configuration:

base = min(initial_delay * 2^attempt, max_delay)
delay = base + random(0, base * 0.5)
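
A minimal Go sketch of this calculation follows, assuming attempts are counted from zero and that the jitter is drawn from up to half of the capped delay. The function name backoffDelay and the injectable random source are illustrative choices that keep the example deterministic under test.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// backoffDelay returns min(initial * 2^attempt, maxDelay) plus jitter of up to
// half that value. rnd must return a number in [0, 1), e.g. rand.Float64.
func backoffDelay(attempt int, initial, maxDelay time.Duration, rnd func() float64) time.Duration {
	base := initial
	// Double step by step and stop at the cap, which also avoids overflow.
	for i := 0; i < attempt && base < maxDelay; i++ {
		base *= 2
	}
	if base > maxDelay {
		base = maxDelay
	}
	jitter := time.Duration(rnd() * 0.5 * float64(base))
	return base + jitter
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		d := backoffDelay(attempt, time.Second, 60*time.Second, rand.Float64)
		fmt.Printf("attempt %d: wait %v\n", attempt, d.Round(time.Millisecond))
	}
}
```
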

Adding random jitter prevents synchronized retries across multiple agents (thundering herd problem). The circuit breaker pattern protects the server during extended outages:

Circuit Breaker States:

Closed (normal): Requests pass through. Consecutive failures are counted against failure_threshold.
Open (failing): After reaching failure_threshold, all requests fail immediately for timeout seconds without attempting connection.
Half-Open (probing): After timeout expires, allow one probe request. Success transitions to Closed; failure returns to Open.

Health Check Integration:

When transitioning from Open to Half-Open, agents should probe GET /api/v1/client/health rather than authenticated endpoints. This verifies server availability without consuming authentication resources or triggering token validation failures.
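
The three breaker states can be modeled with a small state machine. The Go sketch below is one possible shape, with hypothetical names (Breaker, Allow, Record) and an explicit clock parameter so the behavior is easy to test; in a real agent, the request allowed in the half-open state would be the /health probe.

```go
package main

import (
	"fmt"
	"time"
)

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// Breaker tracks consecutive failures against failure_threshold and timeout.
type Breaker struct {
	failureThreshold int
	timeout          time.Duration
	state            State
	failures         int
	openedAt         time.Time
}

func NewBreaker(failureThreshold int, timeout time.Duration) *Breaker {
	return &Breaker{failureThreshold: failureThreshold, timeout: timeout}
}

// Allow reports whether a request may be sent at time now. When the open
// timeout has elapsed it moves to HalfOpen and permits exactly one probe.
func (b *Breaker) Allow(now time.Time) bool {
	if b.state == Open && now.Sub(b.openedAt) >= b.timeout {
		b.state = HalfOpen
		return true
	}
	return b.state == Closed
}

// Record feeds the outcome of a request (or probe) back into the breaker.
func (b *Breaker) Record(success bool, now time.Time) {
	if success {
		b.state, b.failures = Closed, 0
		return
	}
	b.failures++
	if b.state == HalfOpen || b.failures >= b.failureThreshold {
		b.state, b.openedAt = Open, now
	}
}

func main() {
	b := NewBreaker(3, 30*time.Second)
	start := time.Date(2026, 3, 12, 10, 0, 0, 0, time.UTC)

	for i := 0; i < 3; i++ {
		b.Record(false, start) // three consecutive failures open the breaker
	}
	fmt.Println(b.Allow(start.Add(10 * time.Second))) // false: still open
	fmt.Println(b.Allow(start.Add(30 * time.Second))) // true: half-open probe
	b.Record(true, start.Add(31*time.Second))
	fmt.Println(b.state == Closed) // true: probe succeeded
}
```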

Authentication Errors (401 Unauthorized)#

Invalid or expired tokens return {"error": "Bad credentials"}. Agents receiving 401 responses should cease all operations and alert administrators, as the token may have been revoked, expired, or misconfigured.

401 errors require manual intervention; automated retry is inappropriate.

Authorization Errors (403 Forbidden)#

403 responses indicate the agent lacks permission to access the requested resource, such as tasks outside their assigned projects. This typically indicates configuration issues or unexpected state changes.

Agents should log the error and request different work rather than retrying the same operation.

Enhanced 404 Responses#

The API provides enhanced 404 error responses with metadata to help agents understand the cause:

Task Deleted (attack abandoned or completed):

{
  "error": "Record not found",
  "reason": "task_deleted",
  "details": "Task was removed when attack was abandoned or completed"
}

Task Not Assigned (belongs to another agent):

{
  "error": "Record not found",
  "reason": "task_not_assigned",
  "details": "Task belongs to another agent"
}

Invalid Task ID:

{
  "error": "Record not found",
  "reason": "task_invalid",
  "details": "Task ID does not exist"
}

Special Case: 404 During Task Acceptance

404 responses during task acceptance have distinct semantics compared to 404s during other operations (heartbeat, status updates, crack submission). When AcceptTask receives a 404:

The agent logs the error at Info level (not Critical), using the message "Task no longer exists on server"
The agent treats this as a terminal no-op—no retry logic, no exponential backoff, no AbandonTask call
The agent immediately cleans up local files and requests new work via GET /api/v1/client/tasks/new

This reflects the race condition nature of task acceptance: the task vanished between being assigned via /tasks/new and being claimed via /accept_task. This is expected behavior in a distributed system with concurrent task operations, not an error requiring investigation or retry.

404 Recovery Strategy#

When receiving 404 errors, agents should implement the following recovery strategy:

Stop retrying: Don't retry the same task ID indefinitely
Exponential backoff: Implement exponential backoff starting at 1 second, maximum 60 seconds:
- First retry: 1 second
- Second retry: 2 seconds
- Third retry: 4 seconds
After 3 consecutive 404s: Abandon the task reference and request new work via GET /api/v1/client/tasks/new
Log the error: Include task ID, agent ID, operation type, and timestamp for debugging
Interpret reason codes:
- task_deleted: Stop processing immediately, request new work
- task_not_assigned: Task exists but assigned to another agent (race condition), request new work after backoff
- task_invalid: Task ID doesn't exist (likely client bug), request new work immediately
- No reason field: Legacy error response, treat as task_deleted
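The reason-code rules above could be implemented along these lines. This is a minimal sketch: the JSON field names match the 404 bodies shown earlier, while the Recovery type and classifyNotFound are hypothetical. Treating unparseable bodies and unknown reason codes like task_deleted is an assumption that extends the legacy-response rule.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// notFoundBody matches the JSON shape of the enhanced 404 responses.
type notFoundBody struct {
	Error   string `json:"error"`
	Reason  string `json:"reason"`
	Details string `json:"details"`
}

// Recovery is a hypothetical summary of what the agent does next.
type Recovery struct {
	BackoffFirst bool // wait before requesting new work
	LikelyBug    bool // the task ID never existed, so suspect a client bug
}

// classifyNotFound returns the effective reason code and the recovery plan.
func classifyNotFound(body []byte) (string, Recovery) {
	var b notFoundBody
	// A missing reason (legacy response) or an unparseable body (assumption)
	// is handled like task_deleted.
	if err := json.Unmarshal(body, &b); err != nil || b.Reason == "" {
		return "task_deleted", Recovery{}
	}
	switch b.Reason {
	case "task_not_assigned":
		return b.Reason, Recovery{BackoffFirst: true}
	case "task_invalid":
		return b.Reason, Recovery{LikelyBug: true}
	default: // task_deleted, plus unknown codes (assumption)
		return "task_deleted", Recovery{}
	}
}

func main() {
	body := []byte(`{"error":"Record not found","reason":"task_not_assigned","details":"Task belongs to another agent"}`)
	reason, plan := classifyNotFound(body)
	fmt.Println(reason, plan.BackoffFirst)
}
```

In every case the agent ends up requesting new work; the reason code only decides whether it waits first and whether the event deserves extra attention in the logs.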

Task State Machine#

CipherSwarm's task state machine manages six core states:

pending: Initial state, awaiting agent assignment
running: Task actively being processed by an agent
paused: Execution temporarily suspended by administrator or agent shutdown
completed: All hashes successfully cracked
exhausted: Keyspace fully explored, some hashes remain uncracked
failed: Task encountered unrecoverable error

Key State Transitions:

pending → running: Via accept_task API endpoint
running → completed: Via accept_crack event when no uncracked hashes remain (guard condition)
running → exhausted: Via exhausted API endpoint
running → paused: Server-initiated when administrator pauses campaign or during agent shutdown
paused → pending: Via resume! when agent reclaims the task
running → pending: Via the abandon API endpoint, followed by cascade destruction of the attack's tasks

Terminal states (completed, exhausted, failed) are blocked from further transitions via accept_status to prevent task resurrection after completion.
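The listed transitions and the terminal-state guard can be modeled as a lookup table. The sketch below is written in Go for illustration only; the server implements this in app/models/concerns/task_state_machine.rb, and the table covers only the transitions named above (it omits whatever event leads to failed, which this section does not describe). All identifiers are hypothetical.

```go
package main

import "fmt"

// State is one of the six task states.
type State string

const (
	Pending   State = "pending"
	Running   State = "running"
	Paused    State = "paused"
	Completed State = "completed"
	Exhausted State = "exhausted"
	Failed    State = "failed"
)

// transitions lists only the transitions named in this section.
var transitions = map[State][]State{
	Pending: {Running},                               // accept_task
	Running: {Completed, Exhausted, Paused, Pending}, // accept_crack, exhausted, pause, abandon
	Paused:  {Pending},                               // resume!
}

// IsTerminal reports whether a state blocks all further transitions.
func IsTerminal(s State) bool {
	return s == Completed || s == Exhausted || s == Failed
}

// CanTransition reports whether from -> to appears in the table above.
func CanTransition(from, to State) bool {
	if IsTerminal(from) {
		return false // guard: no task resurrection after completion
	}
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Pending, Running))
	fmt.Println(CanTransition(Completed, Running))
}
```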

Grace Period and Task Recovery:

The task state machine includes a grace period mechanism that manages task recovery after agent disconnections:

When a task transitions to paused state, the paused_at timestamp is set to Time.zone.now using update_column to avoid triggering additional callbacks
When a task transitions from paused to pending via resume!, the stale flag is set to true and paused_at is cleared to nil using update_columns
The grace period (default 30 minutes, configured via agent_considered_offline_time) determines when paused tasks become available to other agents:
- Within grace period: Only the original agent (matching agent_id) can reclaim the task via find_own_paused_task
- After grace period expires: Any agent can claim the task via find_unassigned_paused_task
- Immediate availability: Tasks from offline or stopped agents bypass the grace period and are immediately available

The SQL logic for orphaned task detection includes:

tasks.paused_at IS NULL OR tasks.paused_at < :grace_cutoff OR agents.state IN (:orphan_states)

This clause ensures:

Legacy paused tasks (created before the paused_at column was added) are treated as immediately available (paused_at IS NULL)
Recent paused tasks become available after the grace period expires (paused_at < :grace_cutoff)
Tasks from truly offline/stopped agents are available immediately regardless of paused_at timestamp (agents.state IN (:orphan_states))

The grace period mechanism balances two competing concerns: allowing agents to resume their own work efficiently after brief disconnections, while ensuring abandoned tasks don't remain locked indefinitely when agents fail to reconnect.
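The availability rule, including the three branches of the SQL clause, can be restated as a single predicate. This Go sketch is illustrative only: the PausedTask type, the ClaimableBy function, and the agent state strings are assumptions, and a zero PausedAt stands in for SQL NULL.

```go
package main

import (
	"fmt"
	"time"
)

// PausedTask holds the fields the rule needs. Names are hypothetical.
type PausedTask struct {
	AgentID    int
	PausedAt   time.Time // zero value represents a NULL paused_at
	AgentState string
}

// orphanStates is an assumed set of agent states that bypass the grace period.
var orphanStates = map[string]bool{"offline": true, "stopped": true}

// ClaimableBy reports whether agentID may claim the paused task at time now.
func ClaimableBy(t PausedTask, agentID int, now time.Time, grace time.Duration) bool {
	if t.AgentID == agentID {
		return true // the original agent can always reclaim its own task
	}
	graceCutoff := now.Add(-grace)
	return t.PausedAt.IsZero() || // legacy task without paused_at
		t.PausedAt.Before(graceCutoff) || // grace period has expired
		orphanStates[t.AgentState] // owning agent is offline or stopped
}

func main() {
	now := time.Now()
	grace := 30 * time.Minute
	task := PausedTask{AgentID: 1, PausedAt: now.Add(-10 * time.Minute), AgentState: "active"}
	fmt.Println("original agent:", ClaimableBy(task, 1, now, grace))
	fmt.Println("other agent:", ClaimableBy(task, 2, now, grace))
}
```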

Stale Task Handling#

When agents receive a 202 (Accepted) status code from submit_status, they must treat the task as stale:

Abandon current work on that task immediately
Discard local state and partially computed results
Request new work via GET /api/v1/client/tasks/new

Stale detection enables task reassignment without destroying work, supporting operational flexibility when redistributing workload across the agent pool.
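On the agent side, stale detection reduces to a branch on the status code returned by submit_status. The following Go sketch assumes that 204 means the update was recorded and work should continue (the controller description lists 204/202/410 responses but this section only defines 202); StatusOutcome and interpretStatusResponse are hypothetical names, and other codes are left to the general error handling rules.

```go
package main

import (
	"fmt"
	"net/http"
)

// StatusOutcome is a hypothetical result of interpreting a submit_status response.
type StatusOutcome int

const (
	ContinueWork StatusOutcome = iota // assumed meaning of 204
	StaleAbandon                      // 202: drop local state, request new work
	Unhandled                         // defer to general error handling
)

// interpretStatusResponse maps an HTTP status code to the agent's next action.
func interpretStatusResponse(code int) StatusOutcome {
	switch code {
	case http.StatusNoContent: // 204
		return ContinueWork
	case http.StatusAccepted: // 202
		return StaleAbandon
	default:
		return Unhandled
	}
}

func main() {
	if interpretStatusResponse(http.StatusAccepted) == StaleAbandon {
		fmt.Println("task is stale: abandon work, discard local state, request a new task")
	}
}
```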

Campaign Quarantine State#

Campaign quarantine is a server-side mechanism that prevents agents from repeatedly attempting tasks with unrecoverable configuration errors. When agents report fatal errors (such as token length exceptions or "no hashes loaded" failures) with retryable == false and either category == "hash_format" or terminal == true, the server automatically flags the campaign as quarantined.
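The quarantine condition is a simple boolean rule. It is evaluated on the server (a Rails application), so the Go version below only restates the logic for illustration; the ErrorReport type and TriggersQuarantine function are hypothetical, and only the field names and values come from the rule above.

```go
package main

import "fmt"

// ErrorReport carries the three fields the quarantine rule inspects.
type ErrorReport struct {
	Retryable bool
	Category  string
	Terminal  bool
}

// TriggersQuarantine mirrors the rule:
// retryable == false AND (category == "hash_format" OR terminal == true).
func TriggersQuarantine(r ErrorReport) bool {
	return !r.Retryable && (r.Category == "hash_format" || r.Terminal)
}

func main() {
	report := ErrorReport{Retryable: false, Category: "hash_format"}
	fmt.Println(TriggersQuarantine(report))
}
```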

Quarantine Lifecycle:

Quarantined campaigns are excluded from task assignment at all stages of the algorithm (incomplete tasks, own paused tasks, orphaned paused tasks, and available attacks). This prevents wasting computational resources on campaigns that cannot succeed due to invalid hash formats, empty hash lists, or hash type mismatches.

Clearing Quarantine:

Quarantine is automatically cleared when operators correct the underlying issue:

Hash list updates: Changing the hash type (hash_type_id) or uploading a new hash file triggers HashList#clear_campaigns_quarantine_if_needed, which clears quarantine on all associated campaigns
Attack parameter updates: Modifying attack resource references (word lists, rule lists, mask lists) or attack configuration parameters (mask, attack mode, increment settings, Markov settings, custom charsets, workload profile) triggers Attack#clear_campaign_quarantine_if_needed, which clears quarantine on the parent campaign
Manual clearing: Administrators can manually clear quarantine via the campaign management interface when troubleshooting or overriding automated quarantine decisions

After quarantine is cleared, tasks from the campaign become available for assignment again. Agents will automatically receive work from previously quarantined campaigns during the normal task request cycle.

Complete Workflow Diagrams#

Overall Agent Lifecycle#

The following sequence diagram illustrates the complete operational workflow from startup through shutdown:

[Sequence diagram: agent lifecycle from startup through shutdown]

Task State Transitions#

The state diagram shows how tasks transition between states based on API calls and server events:

[State diagram: task state transitions]

Error Recovery Flow#

The flowchart illustrates the decision tree for handling different HTTP response codes:

[Flowchart: error recovery by HTTP response code]

Best Practices#

Status Update Handling#

Always check HTTP response codes before continuing execution
Implement a maximum of 3 retry attempts with exponential backoff for 404 errors
Use structured JSON logging format for easy parsing and analysis
Monitor error rates across the agent fleet to detect systemic issues

Note on Acceptance Failures: 404 errors during task acceptance do not follow the retry and exponential backoff pattern described in the 404 Recovery Strategy section. Acceptance failures are treated as terminal no-ops (see the Task Acceptance section) so that agents do not retry when tasks legitimately vanish due to concurrent operations.

Task Validation#

Before starting expensive operations like downloading large hash lists, validate task existence and state. Implement local task expiration tracking to detect stale tasks early. For long-running operations, periodically request task updates to ensure the task hasn't been cancelled or reassigned.

Recommended Configuration Settings#

The following configuration values are recommended for production deployments:

Heartbeat interval: 30-60 seconds
Status update interval: 10-30 seconds during active work
Maximum retry attempts: 3 for 404 errors
Backoff multiplier: 2x (exponential)
Maximum backoff delay: 60 seconds

These values balance server load against responsiveness and monitoring granularity. In high-latency environments, consider increasing intervals; in low-latency environments with strict monitoring requirements, decrease them.
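The retry-related values above (3 attempts, 2x multiplier, 60 second cap, with the 1 second initial delay from the 404 Recovery Strategy) translate into a short delay calculation. This is a minimal Go sketch; the constant and function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

const (
	initialDelay = 1 * time.Second
	maxDelay     = 60 * time.Second
	multiplier   = 2
	maxRetries   = 3
)

// backoffDelay returns the wait before the given retry attempt (1-based):
// 1s, 2s, 4s, and so on, doubling each time and capped at maxDelay.
func backoffDelay(attempt int) time.Duration {
	if attempt < 1 {
		attempt = 1
	}
	d := initialDelay
	for i := 1; i < attempt; i++ {
		d *= multiplier
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

// shouldGiveUp reports whether the agent should drop the task reference
// and request new work after this many consecutive 404 responses.
func shouldGiveUp(consecutive404s int) bool {
	return consecutive404s >= maxRetries
}

func main() {
	for attempt := 1; attempt <= maxRetries; attempt++ {
		fmt.Printf("retry %d after %s\n", attempt, backoffDelay(attempt))
	}
}
```

With a limit of 3 attempts the 60 second cap is never reached; it only matters if the retry limit is raised through server-provided configuration.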

Resilience Parameters:

Agents should fetch configuration from GET /api/v1/client/configuration on startup and apply the server-provided resilience parameters:

Timeout configuration: Set HTTP client timeouts (connect, read, write, request) from recommended_timeouts
Retry policy: Configure exponential backoff using recommended_retry (max_attempts, initial_delay, max_delay)
Circuit breaker: Implement three-state circuit breaker with recommended_circuit_breaker (failure_threshold, timeout)

Periodically refresh configuration (recommended interval: every 4-8 hours or on agent restart) to pick up updated resilience parameters without redeployment. The server returns default values if custom configuration is not set, ensuring agents always receive valid parameters.
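A three-state circuit breaker driven by failure_threshold and timeout might look like the following Go sketch. It is an illustration, not the agent's implementation: the type and method names are hypothetical, the unit of timeout is an assumption (a time.Duration here), the half-open state is simplified to allow requests until the first failure, and the code is not safe for concurrent use without a mutex.

```go
package main

import (
	"fmt"
	"time"
)

type breakerState int

const (
	closed   breakerState = iota // requests flow normally
	open                         // requests are rejected until timeout elapses
	halfOpen                     // probe requests are allowed
)

// CircuitBreaker is a minimal, non-thread-safe sketch.
type CircuitBreaker struct {
	failureThreshold int
	timeout          time.Duration
	state            breakerState
	failures         int
	openedAt         time.Time
}

func NewCircuitBreaker(failureThreshold int, timeout time.Duration) *CircuitBreaker {
	return &CircuitBreaker{failureThreshold: failureThreshold, timeout: timeout}
}

// Allow reports whether a request may be sent at time now.
func (cb *CircuitBreaker) Allow(now time.Time) bool {
	if cb.state == open {
		if now.Sub(cb.openedAt) < cb.timeout {
			return false
		}
		cb.state = halfOpen // timeout elapsed: let a probe through
	}
	return true
}

// RecordSuccess closes the breaker and resets the failure count.
func (cb *CircuitBreaker) RecordSuccess() {
	cb.failures = 0
	cb.state = closed
}

// RecordFailure opens the breaker when the threshold is reached,
// or immediately if a half-open probe fails.
func (cb *CircuitBreaker) RecordFailure(now time.Time) {
	cb.failures++
	if cb.state == halfOpen || cb.failures >= cb.failureThreshold {
		cb.state = open
		cb.openedAt = now
	}
}

func main() {
	cb := NewCircuitBreaker(2, 30*time.Second)
	now := time.Now()
	cb.RecordFailure(now)
	cb.RecordFailure(now)
	fmt.Println("allowed right after opening:", cb.Allow(now.Add(time.Second)))
	fmt.Println("allowed after timeout:", cb.Allow(now.Add(31*time.Second)))
}
```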

Task Reassignment Strategy#

For safe task reassignment without destruction, the system uses a pause/resume workflow with grace period support:

Manual Campaign Pause/Resume (Administrator-Initiated):

Administrator pauses the campaign (all tasks transition: running → paused)
Administrator resumes the campaign (tasks transition: paused → pending with stale flag set)
Agents detect stale status via 202 response, abandon current work
Agents request new tasks via GET /tasks/new
Previously unprocessed keyspace portions become available for reassignment

Automatic Agent Shutdown Recovery (System-Initiated):

Agent shutdown pauses running tasks and clears claim fields (claimed_by_agent_id, claimed_at, expires_at)
The paused_at timestamp is set, initiating the grace period (default 30 minutes)
Within grace period: Only the original agent can reclaim via find_own_paused_task
After grace period: Any agent can claim via find_unassigned_paused_task
On task reclamation, if the attack is also paused, it's automatically resumed
The claiming agent resumes the task (transitions: paused → pending, clears paused_at)
Agent accepts and executes the task with access to existing restore files and progress

This dual-path workflow enables both administrative workload rebalancing (via manual pause/resume) and automated recovery from agent failures (via grace period reclamation), maximizing efficiency while maintaining system reliability.

Relevant Code Files#

The following table lists key source code files implementing the Agent API Lifecycle Workflow:

File Path	Description	Key Functionality
app/controllers/api/v1/base_controller.rb	Base controller for all agent API endpoints	Bearer token authentication, error handling (401/403/404/422/500), request/response logging, agent activity tracking
app/controllers/api/v1/client/agents_controller.rb	Agent lifecycle management endpoints	Authentication, configuration retrieval, agent updates, benchmark submission, heartbeat, error reporting, shutdown
app/controllers/api/v1/client/tasks_controller.rb	Task assignment and status management	Task request, task acceptance, status updates (204/202/410 responses), crack submission, task completion (exhausted/abandon), get_zaps
app/controllers/api/v1/client/attacks_controller.rb	Attack configuration retrieval	Attack details (show action), hash list download (streaming with batched queries)
app/views/api/v1/client/attacks/show.json.jbuilder	Attack response template	JSON structure for attack configuration including hashcat parameters, attack modes, resources, hash list metadata
app/models/concerns/task_state_machine.rb	Task state machine logic	6 states (pending, running, paused, completed, exhausted, failed), 11 transition events, guards preventing terminal state transitions
app/models/concerns/attack_state_machine.rb	Attack state machine logic	Attack lifecycle states, abandon event cascade logic that destroys all associated tasks
app/models/hash_list.rb	Hash list model	`uncracked_list_enum` method for streaming uncracked hashes in batches, checksum calculation
app/services/task_assignment_service.rb	Task assignment logic	Project-scoped task filtering, agent capability matching, workload distribution
app/services/status_submission_service.rb	Status update processing	Transaction-wrapped status handling, stale detection (202 response), task state validation
config/routes/client_api.rb	Agent API route definitions	Complete route mappings for `/api/v1/client/*` namespace
lib/agent/agent.go	Agent main workflow (Go implementation)	ProcessTask function implementing the complete task lifecycle from acceptance through completion
lib/downloader/downloader.go	Resource downloader (Go implementation)	DownloadHashList function, file validation, cleanup logic
lib/task/manager.go	Task manager (Go implementation)	AcceptTask, RunTask, status submission, crack submission, task completion coordination
lib/task/errors.go	Error handling for task operations	Sentinel errors (ErrTaskAcceptNotFound, ErrTaskAcceptFailed), severity mapping, error reporting

Related Topics#

CipherSwarm Authentication and Authorization - Token-based authentication mechanism, project-scoped access control, agent permission model
Task State Machine - Detailed state definitions, transition events, guards and validation rules
Attack Configuration - Hashcat parameter specification, attack mode selection, resource management
Distributed Hash Cracking Architecture - Campaign structure, task distribution algorithms, agent coordination patterns
Error Handling Strategies - Retry logic patterns, exponential backoff algorithms, stale task detection mechanisms
