Agent API Lifecycle Workflow#
The Agent API Lifecycle Workflow describes the complete sequence of API operations that CipherSwarm agents execute when interacting with the server, from initial startup through task completion. This workflow enables distributed hash cracking operations in secure, air-gapped environments by orchestrating communication between remote agents (client software) and the central CipherSwarm server (Rails application).
All agent API endpoints are organized under the /api/v1/client/* namespace and require bearer token authentication via the Authorization: Bearer <token> header. The lifecycle consists of five major phases: startup sequence (authentication, configuration, and registration), task acquisition (requesting and accepting work), work execution (status updates and crack submission), task completion (exhaustion or abandonment), and maintenance operations (heartbeat, error reporting, and shutdown). Understanding this workflow is essential for deploying CipherSwarm agents in production environments, particularly in network-isolated security labs where reliable distributed processing is critical.
The workflow implements sophisticated error handling and state management to ensure reliable operation across potentially unreliable network connections. Token validation occurs through direct database lookup, and the system tracks agent activity by updating timestamps and IP addresses on each authenticated request. This enables administrators to monitor agent health and troubleshoot connectivity issues in isolated network environments.
Startup Sequence#
The startup sequence establishes the agent's identity and retrieves configuration settings necessary for operation. This process must complete successfully before the agent can request work, though some steps (such as benchmark submission) are conditional based on server directives.
Authentication#
Agents begin by authenticating with the CipherSwarm server using GET /api/v1/client/authenticate. Token validation occurs via Agent.find_by(token: token) in the database, and successful authentication updates the last_seen_at timestamp and IP address for monitoring purposes.
Successful Response (200 OK):
{
"authenticated": true,
"agent_id": 42
}
Authentication Failure (401 Unauthorized):
{
"error": "Bad credentials"
}
If authentication fails, the agent should cease all operations and alert administrators, as the token may have been revoked or misconfigured.
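The request and the two documented responses can be sketched in Go as follows. This is a minimal illustration, not the agent's actual client: the function names, the `ErrBadCredentials` sentinel, and the placeholder base URL are assumptions; only the endpoint path, the header format, and the JSON fields come from the description above.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
)

// AuthResponse mirrors the documented 200 OK body.
type AuthResponse struct {
	Authenticated bool `json:"authenticated"`
	AgentID       int  `json:"agent_id"`
}

// ErrBadCredentials is a hypothetical sentinel for a 401 response.
var ErrBadCredentials = errors.New("bad credentials")

// newAuthRequest builds the authenticate request with the bearer token header.
func newAuthRequest(ctx context.Context, baseURL, token string) (*http.Request, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/api/v1/client/authenticate", nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	return req, nil
}

// parseAuthResponse interprets the status code and body of the server's reply.
func parseAuthResponse(status int, body []byte) (AuthResponse, error) {
	var out AuthResponse
	switch status {
	case http.StatusOK:
		if err := json.Unmarshal(body, &out); err != nil {
			return out, fmt.Errorf("decode auth response: %w", err)
		}
		return out, nil
	case http.StatusUnauthorized:
		return out, ErrBadCredentials
	default:
		return out, fmt.Errorf("unexpected status %d", status)
	}
}

func main() {
	// The base URL and token are placeholders; no request is actually sent here.
	req, err := newAuthRequest(context.Background(), "https://cipherswarm.example", "example-token")
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.Path)

	res, err := parseAuthResponse(http.StatusOK, []byte(`{"authenticated": true, "agent_id": 42}`))
	fmt.Println(res.AgentID, err)

	_, err = parseAuthResponse(http.StatusUnauthorized, []byte(`{"error": "Bad credentials"}`))
	if errors.Is(err, ErrBadCredentials) {
		fmt.Println("token rejected: stop and alert an administrator")
	}
}
```

Keeping the 401 case as a distinct sentinel lets the caller halt instead of entering a retry path.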
Configuration Retrieval#
After authentication, agents retrieve global configuration settings using GET /api/v1/client/configuration. The response contains operational parameters such as polling intervals, resource limits, server capabilities, and the benchmarks_needed flag indicating whether the agent should run benchmarks. Agents should cache this response and only refresh on restart or when explicitly notified of configuration changes.
Response (200 OK): Returns an AgentConfigurationResponse object containing:
- config: Advanced hashcat and agent configuration options
- api_version: The minimum accepted version of the API
- benchmarks_needed: Boolean flag indicating whether the server needs benchmark data from this agent
- recommended_timeouts: Server-recommended timeout values (connect, read, write, request) in seconds
- recommended_retry: Retry policy parameters (max_attempts, initial_delay, max_delay)
- recommended_circuit_breaker: Circuit breaker thresholds (failure_threshold, timeout)
The resilience parameters enable agents to configure their HTTP clients with server-provided values, allowing operators to adjust timeout and retry behavior without redeploying agents. Agents should apply these values when initializing HTTP clients and refresh them on restart or when configuration changes are detected.
When benchmarks_needed is false, the server already has valid cached benchmark results for this agent, and the agent can skip the benchmark execution step entirely. This improves startup time and reduces resource consumption when benchmark results haven't changed since the last submission.
Possible Errors:
- 401 Unauthorized: Token invalid
- 404 Not Found: Agent not found in database
- 422 Unprocessable Content: Configuration validation error
Agent Registration#
Agents update their metadata on the server using PUT /api/v1/client/agents/{id}. This endpoint accepts an AgentUpdateV1 schema containing agent information such as hostname, operating system, device details, and installed hashcat version. This information helps administrators identify agents and monitor the cluster's composition.
Request Body: AgentUpdateV1 schema (hostname, device information, hashcat version)
Response (200 OK): Returns updated AgentResponseV1 object
Benchmark Submission#
Benchmark submission is conditional based on the benchmarks_needed flag received during configuration retrieval. When benchmarks_needed is true, agents execute performance benchmarks and submit them using POST /api/v1/client/agents/{id}/submit_benchmark. Benchmarks report the agent's hash cracking capabilities across different hash types, enabling the server to make intelligent task assignment decisions based on agent performance characteristics.
When benchmarks_needed is false, the agent skips benchmark execution entirely because the server has valid cached benchmark results from a previous submission. This optimization significantly improves agent startup time by avoiding the resource-intensive benchmark process when hardware configuration hasn't changed.
Request Body: AgentBenchmark schema containing device capabilities and performance metrics for various hash algorithms
Response: 200 OK with a BenchmarkReceipt JSON body containing received_count, processed_count, failed_count, and an optional message field. Legacy servers may instead return 204 No Content, which agents accept for backward compatibility (no receipt validation is possible in that case).
The agent validates the receipt counts (received, processed, failed). Count mismatches and partial failures are logged as warnings (advisory-only); HTTP 204 responses are still accepted for backward compatibility.
Force Benchmark Override:
Operators can force a fresh benchmark run using the --force-benchmark CLI flag, which overrides the server's benchmarks_needed signal. This is useful when:
- Hardware configuration has changed (GPU drivers updated, new device installed)
- Cached benchmark results are suspected to be stale or inaccurate
- Troubleshooting performance issues requires fresh baseline metrics
The force benchmark flag bypasses the local cache and server directive, ensuring a complete re-run of all benchmark tests.
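The benchmark decision and the advisory receipt check described in this section could look roughly like the Go sketch below. The receipt field names follow the documented BenchmarkReceipt body; the function names and the exact consistency checks are assumptions chosen for illustration, not the agent's actual validation rules.

```go
package main

import "fmt"

// BenchmarkReceipt mirrors the documented receipt fields.
type BenchmarkReceipt struct {
	ReceivedCount  int    `json:"received_count"`
	ProcessedCount int    `json:"processed_count"`
	FailedCount    int    `json:"failed_count"`
	Message        string `json:"message,omitempty"`
}

// shouldRunBenchmarks combines the server directive with the --force-benchmark override.
func shouldRunBenchmarks(benchmarksNeeded, forceBenchmark bool) bool {
	return forceBenchmark || benchmarksNeeded
}

// receiptWarnings returns advisory warnings only; it never fails the submission.
// The specific checks are assumed, not taken from the agent source.
func receiptWarnings(sent int, r BenchmarkReceipt) []string {
	var warnings []string
	if r.ReceivedCount != sent {
		warnings = append(warnings, fmt.Sprintf("sent %d benchmarks but server received %d", sent, r.ReceivedCount))
	}
	if r.ProcessedCount+r.FailedCount != r.ReceivedCount {
		warnings = append(warnings, "processed and failed counts do not add up to received count")
	}
	if r.FailedCount > 0 {
		warnings = append(warnings, fmt.Sprintf("%d benchmarks failed server-side", r.FailedCount))
	}
	return warnings
}

func main() {
	fmt.Println("run benchmarks:", shouldRunBenchmarks(false, true))

	receipt := BenchmarkReceipt{ReceivedCount: 3, ProcessedCount: 2, FailedCount: 1}
	for _, w := range receiptWarnings(3, receipt) {
		fmt.Println("warning:", w)
	}
}
```

A 204 response carries no receipt, so under this sketch the caller would simply skip `receiptWarnings` in that case.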
Task Acquisition Loop#
After startup, agents enter the task acquisition loop, continuously polling for available work. This loop implements the distributed work assignment mechanism that allocates hash cracking tasks across the agent pool.
Task Request#
Agents request new work using GET /api/v1/client/tasks/new. The server filters available tasks by the agent's project membership using where(campaigns: {project_id: agent.project_ids}), ensuring agents only receive work from projects they're authorized to access.
Quarantine Filtering:
The task assignment service excludes tasks from quarantined campaigns at all stages of the assignment algorithm. Campaigns are automatically quarantined when agents report unrecoverable hashcat errors (such as token length exceptions or "no hashes loaded" failures) via the error reporting endpoint. Quarantined campaigns remain excluded from task assignment until administrators manually clear the quarantine or the underlying issue is resolved by updating the hash list or attack parameters. This prevents agents from wasting computational resources repeatedly attempting tasks with fundamental configuration errors.
The task assignment algorithm implements a four-step priority system designed to maximize efficiency and minimize redundant work:
- Incomplete assigned tasks (highest priority): Returns any incomplete task already assigned to the agent that doesn't have fatal errors
- Agent's own paused tasks: Reclaims the agent's own paused tasks (e.g., after restart) to leverage existing restore files and progress
- Orphaned paused tasks (with grace period): Claims paused tasks from other agents after a 30-minute grace period (default), or immediately if the original agent is offline/stopped
- Available attack tasks (standard allocation): Finds failed retryable tasks, pending tasks from existing attacks, or creates new tasks for attacks without pending work
All four stages filter out quarantined campaigns using where(campaigns: { quarantined: false }) to prevent task assignment from campaigns with unrecoverable errors.
This priority system ensures agents preferentially resume their own interrupted work before claiming tasks from other agents, reducing wasted computation. The grace period prevents premature task reassignment when an agent briefly disconnects, allowing it time to reconnect and resume with its existing restore files intact.
Response (200 OK): Task data including task ID and attack ID
When no work is available, the agent should wait before polling again (typical interval: 5-30 seconds depending on workload).
Task Acceptance#
Upon receiving a task, the agent accepts it using POST /api/v1/client/tasks/{id}/accept_task. Task authorization is enforced through @task = @agent.tasks.find(params[:id]), which ensures agents can only accept tasks assigned to them.
Response: 204 No Content on successful acceptance
Possible Errors:
- 422 Unprocessable Content: Task already completed
- 404 Not Found: Task not found or not assigned to this agent
- 403 Forbidden: Agent lacks permission
Accepting a task transitions it from pending to running state in the task state machine.
Acceptance Error Handling#
The agent implements specialized error handling for task acceptance failures, distinguishing between transient race conditions and genuine errors:
404 Not Found (Task Vanished):
When the server responds with 404 during task acceptance, this indicates the task disappeared before the agent could claim it—a normal race condition in distributed systems. The agent treats this as a terminal no-op:
- No AbandonTask call: Skips the abandon endpoint entirely, since the task no longer exists on the server
- Local cleanup only: Removes downloaded files (hash lists, wordlists) associated with the vanished task
- Immediate retry: Returns to the task request loop without delay to check for new work
- Info-level logging: Logs "Task no longer exists on server" at Info severity, not Critical
This behavior prevents unnecessary API traffic and false critical-severity alerts when tasks are legitimately removed by concurrent operations (campaign cancellation, attack completion, or another agent claiming the task first).
Other Acceptance Errors (403, 422, 5xx):
For non-404 failures, the agent treats acceptance failure as a genuine error requiring cleanup:
- Calls AbandonTask: Explicitly abandons the task via POST /tasks/{id}/abandon, triggering the cascade that destroys all tasks for that attack (see Task Abandonment section)
- Applies failure delay: Sleeps for the configured sleepOnFailure duration before requesting new work, preventing tight retry loops
- Critical-level logging: Logs at Critical severity with full error context for administrator investigation
Sentinel Errors:
The agent uses two sentinel errors (defined in lib/task/errors.go) to distinguish acceptance failure modes:
- ErrTaskAcceptNotFound: Returned when the server responds 404
- ErrTaskAcceptFailed: Returned for all other acceptance errors
Callers use errors.Is() to check the specific failure mode and implement appropriate recovery logic.
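A sketch of how a caller might branch on the two sentinels is shown here. The sentinel names come from the text, but their definitions in lib/task/errors.go are not reproduced in this document, so the declarations below are stand-ins; `classifyAccept`, `recoveryPlan`, and `planFor` are hypothetical helpers.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Stand-in declarations; the real ones live in lib/task/errors.go.
var (
	ErrTaskAcceptNotFound = errors.New("task no longer exists on server")
	ErrTaskAcceptFailed   = errors.New("task acceptance failed")
)

// classifyAccept maps an accept_task HTTP status to one of the sentinels.
func classifyAccept(status int) error {
	switch status {
	case http.StatusNoContent:
		return nil
	case http.StatusNotFound:
		return ErrTaskAcceptNotFound
	default:
		// Wrap the sentinel so errors.Is still matches while keeping the status.
		return fmt.Errorf("%w: HTTP %d", ErrTaskAcceptFailed, status)
	}
}

// recoveryPlan is a hypothetical summary of what the caller does next.
type recoveryPlan struct {
	CallAbandon    bool   // invoke POST /tasks/{id}/abandon
	SleepOnFailure bool   // apply the configured failure delay
	LogLevel       string // severity used for the log entry
}

// planFor selects the recovery behaviour described for each failure mode.
func planFor(err error) recoveryPlan {
	switch {
	case err == nil:
		return recoveryPlan{}
	case errors.Is(err, ErrTaskAcceptNotFound):
		return recoveryPlan{LogLevel: "info"}
	default:
		return recoveryPlan{CallAbandon: true, SleepOnFailure: true, LogLevel: "critical"}
	}
}

func main() {
	for _, status := range []int{204, 404, 422} {
		err := classifyAccept(status)
		fmt.Printf("%d -> %+v (err: %v)\n", status, planFor(err), err)
	}
}
```

Wrapping with `%w` keeps the HTTP status in the message without breaking `errors.Is` matching.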
Attack Details Retrieval#
Before execution, agents retrieve comprehensive attack configuration using GET /api/v1/client/attacks/{id}. This endpoint should be called once per task after acceptance, before starting hashcat execution. Authorization occurs via task assignment: Attack.joins(:tasks).where(tasks: { agent: @agent }), preventing agents from accessing attacks they're not assigned to.
The attack configuration response includes all parameters needed to construct the hashcat command:
- Attack mode: Dictionary, mask, hybrid dictionary, or hybrid mask, plus the corresponding hashcat numeric mode (0, 3, 6, or 7)
- Hash type: Hashcat mode ID (e.g., 1000 for NTLM) extracted from the campaign's hash list
- Performance parameters: Workload profile (1-4), optimization flags, slow candidate generator settings
- Attack strategy: Mask patterns, increment mode settings (minimum/maximum lengths), rule files, custom character sets (1-4)
- Markov chain settings: Enable/disable flags, classic Markov mode, threshold values
- Resource objects: Download URLs, MD5 checksums, and filenames for wordlists, rule files, and mask files
- Hash list metadata: Download URL and MD5 checksum for integrity verification
Agents cache this configuration for the task duration, as attack parameters don't change during execution.
Error Response (404 Not Found):
{
"error": "Attack not found."
}
Hash List Download#
Agents download the target hash list using GET /api/v1/client/attacks/{id}/hash_list. The endpoint returns plain text format with one hash per line, directly compatible with hashcat's input requirements.
For efficiency, the server streams only uncracked hashes in batches of 10,000 to avoid memory overhead when processing large hash lists. This ensures the server can handle massive hash lists (millions of entries) without consuming excessive memory.
The response includes Content-Type: text/plain and a Content-Disposition header with filename {hash_list_id}.txt. Agents save the hash list to a local file named {attack_id}.hsh and validate that the file is not empty before proceeding.
Example Hash List Format:
5f4dcc3b5aa765d61d8327deb882cf99
e99a18c428cb38d5f260853678922e03
098f6bcd4621d373cade4e832627b4f6
Possible Errors:
- 404 Not Found: Attack not found or hash list missing
- 401 Unauthorized: Invalid token
- 403 Forbidden: Agent lacks permission
Work Execution Loop#
During task execution, agents maintain communication with the server through periodic status updates and crack submissions while hashcat processes the attack.
Heartbeat#
Agents send periodic heartbeat messages using POST /api/v1/client/agents/{id}/heartbeat to signal that they're online and operational. The recommended interval is 30-60 seconds. Heartbeats update the agent's last_seen_at timestamp, enabling the server to detect offline agents and trigger automated recovery procedures.
Request Body: AgentHeartbeatRequest schema containing current agent state
Response: 204 No Content
Heartbeats continue throughout the agent's lifetime, independent of task execution.
Status Updates#
Agents report task progress using POST /api/v1/client/tasks/{id}/submit_status at recommended intervals of 10-30 seconds during active execution. Status updates include progress percentage, hash rate, estimated time remaining, and current position in the keyspace.
Request Body: TaskStatusUpdate schema containing:
- Progress percentage (0-100)
- Current hash rate (hashes per second)
- Estimated time remaining
- Checkpoint position in keyspace
Response Codes:
- 204 No Content: Continue execution normally
- 202 Accepted: Status accepted but data appears stale; agent should abandon current work
- 410 Gone: Task has been paused server-side; agent must stop execution immediately
- 404 Not Found: Task no longer exists; implement recovery strategy
The varied response codes enable the server to control agent behavior dynamically, supporting administrative actions like pausing campaigns or reassigning work.
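One way an agent could translate these response codes into behaviour is a small dispatch function, sketched below in Go. The action names are hypothetical; only the status-to-meaning mapping is taken from the list above.

```go
package main

import (
	"fmt"
	"net/http"
)

// StatusAction is a hypothetical enumeration of agent reactions to submit_status.
type StatusAction int

const (
	ContinueTask       StatusAction = iota // 204: keep running
	AbandonStaleWork                       // 202: data looks stale, abandon current work
	StopPausedTask                         // 410: task paused server-side, stop immediately
	RecoverMissingTask                     // 404: task gone, apply the 404 recovery strategy
	UnexpectedStatus                       // anything else
)

// actionForStatusUpdate maps a submit_status response code to an action.
func actionForStatusUpdate(code int) StatusAction {
	switch code {
	case http.StatusNoContent:
		return ContinueTask
	case http.StatusAccepted:
		return AbandonStaleWork
	case http.StatusGone:
		return StopPausedTask
	case http.StatusNotFound:
		return RecoverMissingTask
	default:
		return UnexpectedStatus
	}
}

func main() {
	for _, code := range []int{204, 202, 410, 404, 500} {
		fmt.Println(code, "->", actionForStatusUpdate(code))
	}
}
```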
Crack Submission#
As hashcat discovers plaintext passwords, agents submit them using POST /api/v1/client/tasks/{id}/submit_crack. Submissions should occur as soon as cracks are found to ensure other agents can skip already-cracked hashes.
Request Body: HashcatResult schema containing:
- Hash value
- Plaintext password
- Timestamp
Response (200 OK): Confirmation that the crack was recorded
Possible Errors:
- 422 Unprocessable Content: Validation error (duplicate submission, malformed data)
- 409 Conflict: Crack already submitted by another agent
Get Previously Cracked Hashes#
To avoid wasting computational resources on already-cracked hashes, agents periodically retrieve completed hashes using GET /api/v1/client/tasks/{id}/get_zaps. The endpoint returns a text/plain file containing cracked hash values, which agents pass to hashcat via the --remove flag to exclude them from processing.
Response (200 OK): Plain text file with one hash per line
Possible Errors:
- 404 Not Found: Task not found
- 401 Unauthorized: Invalid token
- 403 Forbidden: Agent lacks permission
Task Completion Paths#
Tasks conclude through one of three paths depending on execution outcomes.
Successful Completion#
When all hashes are cracked, the task transitions from running to completed via the accept_crack event, whose guard condition requires that no uncracked hashes remain. The transition happens automatically when the final crack is submitted; no explicit completion endpoint is required.
Keyspace Exhaustion#
When hashcat fully explores the configured keyspace but some hashes remain uncracked, agents signal exhaustion using POST /api/v1/client/tasks/{id}/exhausted. This transitions the task from running to exhausted state.
Response: 204 No Content
Possible Errors:
- 404 Not Found: Task not found
- 401 Unauthorized: Invalid token
- 403 Forbidden: Agent lacks permission
- 422 Unprocessable Content: Task already completed or exhausted
Exhausted tasks indicate the attack configuration didn't successfully crack all hashes, requiring administrators to configure additional attacks with different parameters.
Task Abandonment#
When agents cannot complete a task (due to errors, resource constraints, or administrative intervention), they abandon it using POST /api/v1/client/tasks/{id}/abandon.
Critical Warning: Abandoning a task triggers attack.abandon, which destroys ALL tasks associated with that attack, not just the abandoned task. This cascade behavior prevents partial attack execution when fundamental issues exist with the attack configuration. Agents should use abandonment only for irrecoverable errors, not transient issues.
Important Exception: 404 errors during task acceptance do NOT trigger the abandon cascade. When a task vanishes before acceptance (a normal race condition), the agent skips the AbandonTask call entirely and proceeds directly to cleanup and a new task request. This avoids attempting to abandon tasks that no longer exist on the server, along with the resulting API errors and false critical-severity alerts. The cascade warning applies only when AbandonTask is actually invoked: during genuine acceptance failures (403, 422, 5xx) or when the agent encounters errors after successfully accepting a task.
Response: 204 No Content
Possible Errors:
- 422 Unprocessable Content: Task already completed
- 404 Not Found: Task not found
- 401 Unauthorized: Invalid token
- 403 Forbidden: Agent lacks permission
Maintenance Operations#
Agents perform several maintenance operations throughout their lifecycle to report errors, check for updates, verify connectivity, and signal shutdown.
Health Check#
Agents can verify server availability using GET /api/v1/client/health before attempting authenticated requests. This unauthenticated endpoint enables connectivity probes during circuit breaker recovery, initial setup, and diagnostics.
Response (200 OK):
{
"status": "ok",
"api_version": 1,
"timestamp": "2026-03-12T10:30:45Z",
"database": "healthy"
}
Response (503 Service Unavailable):
{
"status": "degraded",
"api_version": 1,
"timestamp": "2026-03-12T10:30:45Z",
"database": "unhealthy"
}
Use Cases:
- Circuit breaker probes: When transitioning from open to half-open state, probe /health before resuming authenticated requests
- Initial connectivity check: Verify server reachability during agent startup before authentication
- Diagnostics: Distinguish between network connectivity issues and authentication failures
The health endpoint bypasses agent token authentication to enable connectivity verification when credentials are unavailable or expired. It performs a lightweight database check to detect common infrastructure failures.
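Interpreting the health response on the agent side might look like the following Go sketch. The struct fields match the JSON bodies shown above; the `serverHealthy` helper and its rule (200 status, "ok", and a healthy database) are assumptions about how a client could combine them.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// HealthResponse mirrors the documented health body.
type HealthResponse struct {
	Status     string    `json:"status"`
	APIVersion int       `json:"api_version"`
	Timestamp  time.Time `json:"timestamp"` // RFC 3339, as in the examples above
	Database   string    `json:"database"`
}

// serverHealthy reports whether a probe result indicates a usable server.
func serverHealthy(status int, body []byte) (bool, error) {
	var h HealthResponse
	if err := json.Unmarshal(body, &h); err != nil {
		return false, fmt.Errorf("decode health response: %w", err)
	}
	return status == http.StatusOK && h.Status == "ok" && h.Database == "healthy", nil
}

func main() {
	ok, err := serverHealthy(http.StatusOK,
		[]byte(`{"status":"ok","api_version":1,"timestamp":"2026-03-12T10:30:45Z","database":"healthy"}`))
	fmt.Println(ok, err)

	ok, err = serverHealthy(http.StatusServiceUnavailable,
		[]byte(`{"status":"degraded","api_version":1,"timestamp":"2026-03-12T10:30:45Z","database":"unhealthy"}`))
	fmt.Println(ok, err)
}
```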
Graceful Shutdown#
When stopping, agents notify the server using POST /api/v1/client/agents/{id}/shutdown. This marks the agent as intentionally offline, distinguishing graceful shutdown from connectivity loss.
Response: 204 No Content
Administrators can use this signal to differentiate between normal operations and potential failures requiring investigation.
Enhanced Shutdown Cascade:
The shutdown process implements a sophisticated cascade that cleanly pauses ongoing work:
- Task Pause: All running tasks assigned to the agent are paused and their claim fields (claimed_by_agent_id, claimed_at, expires_at) are cleared. The paused_at timestamp is set to Time.zone.now, initiating the grace period for task recovery.
- Attack Pause: Attacks with no remaining active tasks (pending or running) are automatically paused. This prevents attacks from appearing active in the system when all agents have stopped work on them.
- Automatic Resume: When another agent claims a paused task from a paused attack (either during grace period reclamation by the original agent or after grace period expiration by a different agent), the attack is automatically resumed. This prevents attacks from remaining stuck in paused state when work resumes.
- Error Handling: The shutdown cascade uses try-rescue blocks around state transitions. If a task or attack fails to pause (due to concurrent state changes), the shutdown process continues with other tasks. Claim fields are only cleared for successfully paused tasks to avoid inconsistent states.
The agent_id field always remains populated (NOT NULL constraint), enabling the system to track which agent originally owned each paused task for grace period calculations and metrics.
Error Reporting#
Agents submit detailed error information using POST /api/v1/client/agents/{id}/submit_error when encountering operational issues. Error reports include stack traces, error messages, and structured context metadata to facilitate troubleshooting and enable programmatic server-side decisions.
Request Body: AgentErrorV1 schema containing:
- Error message
- Stack trace
- Severity level
- Timestamp
- Context (task ID, attack ID, etc.)
- Structured error metadata (via other field)
Response: 204 No Content
Structured Error Context:
Agents perform client-side error classification using the ClassifyStderr function, which analyzes hashcat output and extracts structured context fields before submission. The agent sends these fields via the other metadata map, enabling the server to make programmatic decisions about agent health and task assignment without parsing raw error text.
Required Metadata Fields for Quarantine Logic:
The server inspects metadata.other for the following fields when evaluating whether to quarantine the campaign:
- category (string, required): Error domain, one of hash_format, hardware, runtime, or config
- retryable (boolean, required): Whether the error is transient (true) or permanent (false)
- terminal (boolean, optional): Definitive failures where no retry can succeed (e.g., no_hashes_loaded)
- error_type (string, optional): Machine-readable identifier (e.g., token_length_exception, hashfile_empty_or_corrupt)
The server triggers automatic quarantine when both conditions are met:
- retryable == false, AND
- category == "hash_format" OR terminal == true
This ensures campaigns are quarantined only for unrecoverable configuration errors (invalid hash format, empty hash files, token length mismatches) while allowing transient errors (network timeouts, temporary GPU failures) to be retried.
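The quarantine condition reduces to a single boolean predicate. The Go sketch below restates it for illustration only: the actual check runs server-side in the Rails application, the struct and function names are hypothetical, and the grouping (non-retryable, and either a hash_format category or a terminal flag) is this sketch's reading of the two conditions above.

```go
package main

import "fmt"

// ErrorMeta holds the documented metadata fields inspected for quarantine.
type ErrorMeta struct {
	Category  string `json:"category"`
	Retryable bool   `json:"retryable"`
	Terminal  bool   `json:"terminal,omitempty"`
	ErrorType string `json:"error_type,omitempty"`
}

// shouldQuarantine is true when the error is permanent AND it is either a
// hash format problem or explicitly marked terminal.
func shouldQuarantine(m ErrorMeta) bool {
	return !m.Retryable && (m.Category == "hash_format" || m.Terminal)
}

func main() {
	cases := []ErrorMeta{
		{Category: "hash_format", Retryable: false, ErrorType: "token_length_exception"},
		{Category: "runtime", Retryable: false, Terminal: true},
		{Category: "hardware", Retryable: true},
		{Category: "hardware", Retryable: false},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> quarantine=%v\n", c, shouldQuarantine(c))
	}
}
```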
Additional Context Fields:
- hashfile: Path to affected hashfile (for hash parsing errors)
- line_number: Line number where parsing error occurred
- hash_preview: Truncated hash preview (max 64 characters) for parsing errors
- device_id: GPU device ID for device-specific errors
- backend_api: Backend API name ("OpenCL", "CUDA", "HIP", "Metal")
- api_error: Specific backend API error code (e.g., "CL_OUT_OF_HOST_MEMORY")
- exit_code_name: Named exit code for programmatic classification (e.g., "exhausted", "memory_hit")
Enhanced Error Classification:
The agent classifies 15+ error patterns including:
- Hash parsing errors: Stdout summary lines and per-hash errors with file context
- Kernel failures: Build and creation failures with device and kernel path
- Backend API errors: OpenCL, CUDA, HIP, and Metal errors with device context
- Memory exhaustion: Device memory errors with device ID
- Temperature limits: GPU thermal abort events with device ID
- Self-test/autotune failures: Fatal kernel test failures
- Configuration errors: Invalid hash mode, keyspace overflow, stdin timeout
StderrMessages Channel:
The StderrMessages channel has been changed from chan string to chan ErrorInfo, providing pre-classified errors with structured context. Error classification occurs once in the session handlers, eliminating redundant classification by consumers. The ErrorInfo struct contains:
type ErrorInfo struct {
	Category  ErrorCategory
	Severity  api.Severity
	Retryable bool
	Message   string
	Context   map[string]any
}
WithContext ErrorOption:
Error submissions use the cserrors.WithContext() function to merge structured fields into the error metadata:
cserrors.SendAgentError(ctx, client, severity, message,
cserrors.WithClassification(category, retryable),
cserrors.WithContext(errInfo.Context))
Server-Side Campaign Quarantine:
When agents submit errors via POST /api/v1/client/agents/{id}/submit_error, the server automatically quarantines the associated campaign if the error metadata indicates an unrecoverable failure. The quarantine logic evaluates the structured metadata fields (category, retryable, terminal) and invokes campaign.quarantine!(error_message) when the conditions are met.
Quarantined campaigns are immediately excluded from all task assignment queries, preventing agents from receiving new tasks from campaigns with fundamental configuration errors. The quarantine state persists until administrators manually clear it or the underlying issue is resolved through hash list or attack parameter updates.
Cracker Updates#
Agents check for updated hashcat binaries using GET /api/v1/client/crackers/check_for_cracker_update?version=<semver>&operating_system=<os>. The endpoint accepts the current hashcat version (semantic versioning format) and operating system (windows, linux, darwin).
Response (200 OK): CrackerUpdateResponse object if an update is available, containing download URL and version information
Possible Errors:
- 400 Bad Request: Invalid parameters
- 401 Unauthorized: Invalid token
This mechanism enables centralized hashcat version management across the agent fleet, essential in air-gapped environments where direct internet access is unavailable.
Error Handling and State Transitions#
Robust error handling ensures reliable operation despite network issues, configuration problems, and server-side events. Agents should implement timeout protection, exponential backoff retries, and circuit breaker patterns to maintain resilience in unreliable network environments.
Timeout Configuration#
Agents must configure their HTTP clients with the timeout values received from GET /api/v1/client/configuration:
- connect_timeout: Maximum time to establish TCP connection (default: 10 seconds)
- read_timeout: Maximum time to wait for response data after connection (default: 30 seconds)
- write_timeout: Maximum time to send request data, including file uploads (default: 30 seconds)
- request_timeout: Overall deadline wrapping the entire request lifecycle (default: 60 seconds)
Timeouts prevent agents from hanging indefinitely when the server becomes unresponsive. Without timeouts, an agent blocked on a single request cannot detect connectivity loss or perform recovery operations.
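As a rough illustration, the four values could be applied to a Go `net/http` client as sketched below. The `Timeouts` struct, its JSON field names, and both function names are assumptions, since the exact wire format of recommended_timeouts is not shown here. Go's standard client also has no dedicated write timeout, so this sketch leaves the write phase bounded only by the overall request deadline.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// Timeouts holds the server-recommended values in seconds.
// The JSON field names are assumed for illustration.
type Timeouts struct {
	Connect int `json:"connect_timeout"`
	Read    int `json:"read_timeout"`
	Write   int `json:"write_timeout"`
	Request int `json:"request_timeout"`
}

// defaultTimeouts returns the defaults listed above.
func defaultTimeouts() Timeouts {
	return Timeouts{Connect: 10, Read: 30, Write: 30, Request: 60}
}

// newHTTPClient maps the recommended timeouts onto the standard library client.
func newHTTPClient(t Timeouts) *http.Client {
	dialer := &net.Dialer{Timeout: time.Duration(t.Connect) * time.Second}
	transport := &http.Transport{
		DialContext: dialer.DialContext,
		// Closest standard equivalent to a read timeout: time allowed for
		// response headers to arrive after the request has been written.
		ResponseHeaderTimeout: time.Duration(t.Read) * time.Second,
	}
	return &http.Client{
		Transport: transport,
		// Overall deadline covering connect, write, and read of the body.
		Timeout: time.Duration(t.Request) * time.Second,
	}
}

func main() {
	client := newHTTPClient(defaultTimeouts())
	fmt.Println("overall request deadline:", client.Timeout)
}
```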
Retry Strategy with Circuit Breaker#
Agents should implement exponential backoff with jitter using the recommended_retry parameters from configuration:
base = min(initial_delay * 2^attempt, max_delay)
delay = base + random(0, base * 0.5)
Adding random jitter prevents synchronized retries across multiple agents (thundering herd problem). The circuit breaker pattern protects the server during extended outages:
Circuit Breaker States:
- Closed (normal): Requests pass through. Consecutive failures are counted against failure_threshold.
- Open (failing): After reaching failure_threshold, all requests fail immediately for timeout seconds without attempting connection.
- Half-Open (probing): After timeout expires, allow one probe request. Success transitions to Closed; failure returns to Open.
Health Check Integration:
When transitioning from Open to Half-Open, agents should probe GET /api/v1/client/health rather than authenticated endpoints. This verifies server availability without consuming authentication resources or triggering token validation failures.
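The retry delay and the three breaker states can be sketched together in Go. This is an illustrative, single-goroutine implementation under stated assumptions: the type and method names are hypothetical, jitter is taken as up to half of the capped exponential delay, and no locking is included, so concurrent use would need a mutex.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"time"
)

// backoffDelay returns min(initial*2^attempt, maxDelay) plus jitter of up to
// half that value. jitter must return a number in [0, 1), e.g. rand.Float64.
func backoffDelay(attempt int, initial, maxDelay time.Duration, jitter func() float64) time.Duration {
	base := maxDelay
	// Compare in floating point first so large attempts cannot overflow.
	if f := float64(initial) * math.Pow(2, float64(attempt)); f < float64(maxDelay) {
		base = time.Duration(f)
	}
	return base + time.Duration(jitter()*0.5*float64(base))
}

type BreakerState int

const (
	Closed BreakerState = iota
	Open
	HalfOpen
)

// Breaker is a minimal circuit breaker; it is not safe for concurrent use.
type Breaker struct {
	FailureThreshold int
	Timeout          time.Duration

	state    BreakerState
	failures int
	openedAt time.Time
}

// Allow reports whether a request may be attempted at time now.
func (b *Breaker) Allow(now time.Time) bool {
	switch b.state {
	case Closed:
		return true
	case Open:
		if now.Sub(b.openedAt) >= b.Timeout {
			b.state = HalfOpen // timeout expired: permit exactly one probe
			return true
		}
		return false
	default: // HalfOpen: the single probe is already in flight
		return false
	}
}

// Record feeds the outcome of a request (or health probe) back into the breaker.
func (b *Breaker) Record(success bool, now time.Time) {
	if success {
		b.state, b.failures = Closed, 0
		return
	}
	b.failures++
	if b.state == HalfOpen || b.failures >= b.FailureThreshold {
		b.state, b.openedAt = Open, now
	}
}

func main() {
	b := &Breaker{FailureThreshold: 3, Timeout: 30 * time.Second}
	now := time.Now()
	for attempt := 0; attempt < 3; attempt++ {
		b.Record(false, now)
		fmt.Println("retry in", backoffDelay(attempt, time.Second, time.Minute, rand.Float64))
	}
	fmt.Println("allowed while open:", b.Allow(now))
	fmt.Println("probe after timeout:", b.Allow(now.Add(31*time.Second)))
}
```

In this sketch the probe permitted by `Allow` would be the unauthenticated health request, with its result passed to `Record`.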
Authentication Errors (401 Unauthorized)#
Invalid or expired tokens return {error: "Bad credentials"}. Agents receiving 401 responses should cease all operations and alert administrators, as the token may have been revoked, expired, or misconfigured.
401 errors require manual intervention; automated retry is inappropriate.
Authorization Errors (403 Forbidden)#
403 responses indicate the agent lacks permission to access the requested resource, such as tasks outside their assigned projects. This typically indicates configuration issues or unexpected state changes.
Agents should log the error and request different work rather than retrying the same operation.
Enhanced 404 Responses#
The API provides enhanced 404 error responses with metadata to help agents understand the cause:
Task Deleted (attack abandoned or completed):
{
"error": "Record not found",
"reason": "task_deleted",
"details": "Task was removed when attack was abandoned or completed"
}
Task Not Assigned (belongs to another agent):
{
"error": "Record not found",
"reason": "task_not_assigned",
"details": "Task belongs to another agent"
}
Invalid Task ID:
{
"error": "Record not found",
"reason": "task_invalid",
"details": "Task ID does not exist"
}
Special Case: 404 During Task Acceptance
404 responses during task acceptance have distinct semantics compared to 404s during other operations (heartbeat, status updates, crack submission). When AcceptTask receives a 404:
- The agent logs the error at Info level (not Critical), using the message "Task no longer exists on server"
- The agent treats this as a terminal no-op—no retry logic, no exponential backoff, no AbandonTask call
- The agent immediately cleans up local files and requests new work via GET /api/v1/client/tasks/new
This reflects the race condition nature of task acceptance: the task vanished between being assigned via /tasks/new and being claimed via /accept_task. This is expected behavior in a distributed system with concurrent task operations, not an error requiring investigation or retry.
404 Recovery Strategy#
When receiving 404 errors, agents should implement the following recovery strategy:
- Stop retrying: Don't retry the same task ID indefinitely
- Exponential backoff: Implement exponential backoff starting at 1 second, maximum 60 seconds:
  - First retry: 1 second
  - Second retry: 2 seconds
  - Third retry: 4 seconds
- After 3 consecutive 404s: Abandon the task reference and request new work via GET /api/v1/client/tasks/new
- Log the error: Include task ID, agent ID, operation type, and timestamp for debugging
- Interpret reason codes:
  - task_deleted: Stop processing immediately, request new work
  - task_not_assigned: Task exists but assigned to another agent (race condition), request new work after backoff
  - task_invalid: Task ID doesn't exist (likely client bug), request new work immediately
  - No reason field: Legacy error response, treat as task_deleted
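As a rough illustration of the recovery strategy, the sketch below computes the backoff schedule and maps a 404 response body to a follow-up action. The function names and the action labels are hypothetical, and falling back to the task_deleted behavior for an unrecognized reason code is an assumption, since the text only covers the missing-field case.

```python
from typing import Optional

MAX_CONSECUTIVE_404S = 3

# Follow-up action per reason code; the labels are illustrative only.
REASON_ACTIONS = {
    "task_deleted": "request_new_work",
    "task_not_assigned": "request_new_work_after_backoff",
    "task_invalid": "request_new_work",
}


def backoff_delay(retry_number: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay in seconds before the given retry (1-based), doubling each time."""
    if retry_number < 1:
        raise ValueError("retry_number is 1-based")
    return min(cap, base * 2 ** (retry_number - 1))


def should_give_up(consecutive_404s: int) -> bool:
    """True once the task reference should be dropped and new work requested."""
    return consecutive_404s >= MAX_CONSECUTIVE_404S


def action_for_404(body: Optional[dict]) -> str:
    """Map a 404 body to an action; a missing reason is treated as task_deleted."""
    reason = (body or {}).get("reason") or "task_deleted"
    # Assumption: unknown reason codes are handled like task_deleted.
    return REASON_ACTIONS.get(reason, REASON_ACTIONS["task_deleted"])
```

With the defaults, `backoff_delay(1)`, `backoff_delay(2)`, and `backoff_delay(3)` give the 1, 2, and 4 second delays listed above, and the 60 second cap only takes effect from the seventh retry onward, which the 3-attempt limit never reaches.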
Task State Machine#
CipherSwarm implements a sophisticated state machine managing six core states:
- pending: Initial state, awaiting agent assignment
- running: Task actively being processed by an agent
- paused: Execution temporarily suspended by administrator or agent shutdown
- completed: All hashes successfully cracked
- exhausted: Keyspace fully explored, some hashes remain uncracked
- failed: Task encountered unrecoverable error
Key State Transitions:
- pending → running: Via accept_task API endpoint
- running → completed: Via accept_crack event when no uncracked hashes remain (guard condition)
- running → exhausted: Via exhausted API endpoint
- running → paused: Server-initiated when administrator pauses campaign or during agent shutdown
- paused → pending: Via resume! when agent reclaims the task
- running → pending: Via abandon API, followed by cascade destruction
Terminal states (completed, exhausted, failed) are blocked from further transitions via accept_status to prevent task resurrection after completion.
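The key transitions and the terminal-state guard can be modeled as a lookup table. This is a simplified sketch covering only the six transitions listed above, not the full set of events in the server's Ruby state machine; `TRANSITIONS` and `can_transition` are hypothetical names.

```python
TERMINAL_STATES = frozenset({"completed", "exhausted", "failed"})

# (from_state, to_state) -> trigger, as described in the list above.
TRANSITIONS = {
    ("pending", "running"): "accept_task API endpoint",
    ("running", "completed"): "accept_crack event (no uncracked hashes remain)",
    ("running", "exhausted"): "exhausted API endpoint",
    ("running", "paused"): "campaign pause or agent shutdown",
    ("paused", "pending"): "resume!",
    ("running", "pending"): "abandon API",
}


def can_transition(current: str, target: str) -> bool:
    """Return True if the move is one of the listed key transitions."""
    if current in TERMINAL_STATES:
        # Terminal states never transition again (no task resurrection).
        return False
    return (current, target) in TRANSITIONS
```

An agent-side model like this is useful mainly for sanity checks and logging; the server remains the authority on task state.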
Grace Period and Task Recovery:
The task state machine includes a grace period mechanism that manages task recovery after agent disconnections:
- When a task transitions to paused state, the paused_at timestamp is set to Time.zone.now using update_column to avoid triggering additional callbacks
- When a task transitions from paused to pending via resume!, the stale flag is set to true and paused_at is cleared to nil using update_columns
- The grace period (default 30 minutes, configured via agent_considered_offline_time) determines when paused tasks become available to other agents:
  - Within grace period: Only the original agent (matching agent_id) can reclaim the task via find_own_paused_task
  - After grace period expires: Any agent can claim the task via find_unassigned_paused_task
  - Immediate availability: Tasks from offline or stopped agents bypass the grace period and are immediately available
The SQL logic for orphaned task detection includes:
tasks.paused_at IS NULL OR tasks.paused_at < :grace_cutoff OR agents.state IN (:orphan_states)
This clause ensures:
- Legacy paused tasks (created before the paused_at column was added) are treated as immediately available (paused_at IS NULL)
- Recent paused tasks become available after the grace period expires (paused_at < :grace_cutoff)
- Tasks from truly offline/stopped agents are available immediately regardless of paused_at timestamp (agents.state IN (:orphan_states))
The grace period mechanism balances two competing concerns: allowing agents to resume their own work efficiently after brief disconnections, while ensuring abandoned tasks don't remain locked indefinitely when agents fail to reconnect.
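To make the three branches of the SQL clause concrete, the following sketch expresses the same predicate in Python. It is illustrative only: the function name is hypothetical, and the literal state names in `ORPHAN_STATES` are an assumption based on the "offline or stopped" wording, since the actual values bound to :orphan_states are not shown here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

GRACE_PERIOD = timedelta(minutes=30)  # documented default
# Assumption: the exact state names behind :orphan_states are not given.
ORPHAN_STATES = frozenset({"offline", "stopped"})


def available_to_other_agents(
    paused_at: Optional[datetime],
    agent_state: str,
    now: datetime,
    grace: timedelta = GRACE_PERIOD,
) -> bool:
    """Mirror of: paused_at IS NULL OR paused_at < cutoff OR state IN orphan_states."""
    if paused_at is None:
        return True  # legacy paused task without a timestamp
    if paused_at < now - grace:
        return True  # grace period has expired
    return agent_state in ORPHAN_STATES  # owner is gone, skip the grace period
```

For example, a task paused 10 minutes ago by an agent that is still active is reserved for its original agent, while the same task becomes claimable by anyone once 30 minutes have passed or its agent is marked offline.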
Stale Task Handling#
When a status update returns a 202 response, the task has been marked stale and the agent should:
- Abandon current work on that task immediately
- Discard local state and partially computed results
- Request new work via GET /api/v1/client/tasks/new
Stale detection enables task reassignment without destroying work, supporting operational flexibility when redistributing workload across the agent pool.
Campaign Quarantine State#
Campaign quarantine is a server-side mechanism that prevents agents from repeatedly attempting tasks with unrecoverable configuration errors. When agents report fatal errors (such as token length exceptions or "no hashes loaded" failures) with retryable == false and either category == "hash_format" or terminal == true, the server automatically flags the campaign as quarantined.
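The quarantine trigger is a compound condition, which the sketch below spells out. The field names retryable, category, and terminal come from the description above; representing the error report as a plain dictionary and the function name `should_quarantine` are assumptions for illustration, and the server's Ruby logic may differ in detail.

```python
def should_quarantine(error: dict) -> bool:
    """Return True when an agent error report should quarantine its campaign."""
    # Only errors explicitly marked as non-retryable qualify.
    if error.get("retryable") is not False:
        return False
    # Either a hash format problem or an explicitly terminal error.
    return error.get("category") == "hash_format" or error.get("terminal") is True
```

Note that a retryable error never triggers quarantine, even if it is also flagged as terminal or categorized as hash_format.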
Quarantine Lifecycle:
Quarantined campaigns are excluded from task assignment at all stages of the algorithm (incomplete tasks, own paused tasks, orphaned paused tasks, and available attacks). This prevents wasting computational resources on campaigns that cannot succeed due to invalid hash formats, empty hash lists, or hash type mismatches.
Clearing Quarantine:
Quarantine is automatically cleared when operators correct the underlying issue:
- Hash list updates: Changing the hash type (hash_type_id) or uploading a new hash file triggers HashList#clear_campaigns_quarantine_if_needed, which clears quarantine on all associated campaigns
- Attack parameter updates: Modifying attack resource references (word lists, rule lists, mask lists) or attack configuration parameters (mask, attack mode, increment settings, Markov settings, custom charsets, workload profile) triggers Attack#clear_campaign_quarantine_if_needed, which clears quarantine on the parent campaign
- Manual clearing: Administrators can manually clear quarantine via the campaign management interface when troubleshooting or overriding automated quarantine decisions
After quarantine is cleared, tasks from the campaign become available for assignment again. Agents will automatically receive work from previously quarantined campaigns during the normal task request cycle.
Complete Workflow Diagrams#
Overall Agent Lifecycle#
The following sequence diagram illustrates the complete operational workflow from startup through shutdown:
Task State Transitions#
The state diagram shows how tasks transition between states based on API calls and server events:
Error Recovery Flow#
The flowchart illustrates the decision tree for handling different HTTP response codes:
Best Practices#
Status Update Handling#
- Always check HTTP response codes before continuing execution
- Implement a maximum of 3 retry attempts with exponential backoff for 404 errors
- Use structured JSON logging format for easy parsing and analysis
- Monitor error rates across the agent fleet to detect systemic issues
Note on Acceptance Failures: 404 errors during task acceptance do not follow the standard retry/exponential backoff pattern documented below for other operations. Acceptance failures are treated as terminal no-ops (see Task Acceptance section) to prevent unnecessary retries when tasks legitimately vanish due to concurrent operations.
Task Validation#
Before starting expensive operations like downloading large hash lists, validate task existence and state. Implement local task expiration tracking to detect stale tasks early. For long-running operations, periodically request task updates to ensure the task hasn't been cancelled or reassigned.
Recommended Configuration Settings#
The following configuration values are recommended for production deployments:
- Heartbeat interval: 30-60 seconds
- Status update interval: 10-30 seconds during active work
- Maximum retry attempts: 3 for 404 errors
- Backoff multiplier: 2x (exponential)
- Maximum backoff delay: 60 seconds
These values balance server load against responsiveness and monitoring granularity. In high-latency environments, consider increasing intervals; in low-latency environments with strict monitoring requirements, decrease them.
Resilience Parameters:
Agents should fetch configuration from GET /api/v1/client/configuration on startup and apply the server-provided resilience parameters:
- Timeout configuration: Set HTTP client timeouts (connect, read, write, request) from recommended_timeouts
- Retry policy: Configure exponential backoff using recommended_retry (max_attempts, initial_delay, max_delay)
- Circuit breaker: Implement three-state circuit breaker with recommended_circuit_breaker (failure_threshold, timeout)
Periodically refresh configuration (recommended interval: every 4-8 hours or on agent restart) to pick up updated resilience parameters without redeployment. The server returns default values if custom configuration is not set, ensuring agents always receive valid parameters.
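A three-state circuit breaker driven by failure_threshold and timeout might look like the sketch below. It is a generic closed/open/half-open breaker, not the agent's actual implementation; the class and method names are hypothetical, the timeout is assumed to be in seconds, and how the values are parsed out of the configuration response is left out because the response shape is not shown here.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Closed -> open after N failures; half-open once the timeout elapses."""

    def __init__(self, failure_threshold: int, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.timeout = timeout  # assumed to be seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.timeout:
            return "half_open"  # allow a probe request through
        return "open"

    def allow_request(self) -> bool:
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # close the circuit

    def record_failure(self) -> None:
        if self.state == "half_open":
            self._opened_at = self._clock()  # probe failed, reopen
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

The injectable `clock` keeps the breaker testable without real sleeps; in production the default monotonic clock avoids problems with wall-clock adjustments.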
Task Reassignment Strategy#
Manual Campaign Pause/Resume (Administrator-Initiated):
- Administrator pauses the campaign (all tasks transition: running → paused)
- Administrator resumes the campaign (tasks transition: paused → pending with stale flag set)
- Agents detect stale status via 202 response, abandon current work
- Agents request new tasks via GET /tasks/new
- Previously unprocessed keyspace portions become available for reassignment
Automatic Agent Shutdown Recovery (System-Initiated):
- Agent shutdown pauses running tasks and clears claim fields (claimed_by_agent_id, claimed_at, expires_at)
- The paused_at timestamp is set, initiating the grace period (default 30 minutes)
- Within grace period: Only the original agent can reclaim via find_own_paused_task
- After grace period: Any agent can claim via find_unassigned_paused_task
- On task reclamation, if the attack is also paused, it's automatically resumed
- The claiming agent resumes the task (transitions: paused → pending, clears paused_at)
- Agent accepts and executes the task with access to existing restore files and progress
This dual-path workflow enables both administrative workload rebalancing (via manual pause/resume) and automated recovery from agent failures (via grace period reclamation), maximizing efficiency while maintaining system reliability.
Relevant Code Files#
The following table lists key source code files implementing the Agent API Lifecycle Workflow:
| File Path | Description | Key Functionality |
|---|---|---|
| app/controllers/api/v1/base_controller.rb | Base controller for all agent API endpoints | Bearer token authentication, error handling (401/403/404/422/500), request/response logging, agent activity tracking |
| app/controllers/api/v1/client/agents_controller.rb | Agent lifecycle management endpoints | Authentication, configuration retrieval, agent updates, benchmark submission, heartbeat, error reporting, shutdown |
| app/controllers/api/v1/client/tasks_controller.rb | Task assignment and status management | Task request, task acceptance, status updates (204/202/410 responses), crack submission, task completion (exhausted/abandon), get_zaps |
| app/controllers/api/v1/client/attacks_controller.rb | Attack configuration retrieval | Attack details (show action), hash list download (streaming with batched queries) |
| app/views/api/v1/client/attacks/show.json.jbuilder | Attack response template | JSON structure for attack configuration including hashcat parameters, attack modes, resources, hash list metadata |
| app/models/concerns/task_state_machine.rb | Task state machine logic | 6 states (pending, running, paused, completed, exhausted, failed), 11 transition events, guards preventing terminal state transitions |
| app/models/concerns/attack_state_machine.rb | Attack state machine logic | Attack lifecycle states, abandon event cascade logic that destroys all associated tasks |
| app/models/hash_list.rb | Hash list model | uncracked_list_enum method for streaming uncracked hashes in batches, checksum calculation |
| app/services/task_assignment_service.rb | Task assignment logic | Project-scoped task filtering, agent capability matching, workload distribution |
| app/services/status_submission_service.rb | Status update processing | Transaction-wrapped status handling, stale detection (202 response), task state validation |
| config/routes/client_api.rb | Agent API route definitions | Complete route mappings for /api/v1/client/* namespace |
| lib/agent/agent.go | Agent main workflow (Go implementation) | ProcessTask function implementing the complete task lifecycle from acceptance through completion |
| lib/downloader/downloader.go | Resource downloader (Go implementation) | DownloadHashList function, file validation, cleanup logic |
| lib/task/manager.go | Task manager (Go implementation) | AcceptTask, RunTask, status submission, crack submission, task completion coordination |
| lib/task/errors.go | Error handling for task operations | Sentinel errors (ErrTaskAcceptNotFound, ErrTaskAcceptFailed), severity mapping, error reporting |
Related Topics#
- CipherSwarm Authentication and Authorization - Token-based authentication mechanism, project-scoped access control, agent permission model
- Task State Machine - Detailed state definitions, transition events, guards and validation rules
- Attack Configuration - Hashcat parameter specification, attack mode selection, resource management
- Distributed Hash Cracking Architecture - Campaign structure, task distribution algorithms, agent coordination patterns
- Error Handling Strategies - Retry logic patterns, exponential backoff algorithms, stale task detection mechanisms