Agent Shutdown Cascade#
Lead Section#
The Agent Shutdown Cascade is a critical lifecycle management mechanism in CipherSwarm that orchestrates the orderly cleanup of distributed hash cracking tasks when an agent disconnects from the system. When an agent shuts down—whether through graceful termination, administrative action, or unexpected disconnection—the system automatically pauses all running tasks assigned to that agent and clears their claim fields, enabling other active agents to detect and recover these orphaned tasks. This cascade behavior ensures fault tolerance and high availability in distributed hash cracking operations, particularly in airgapped lab environments where network connectivity may be unreliable.
The shutdown cascade operates through a state machine event in the Agent model that transitions the agent to the offline state and triggers an after_transition callback. This callback pauses all running tasks, clears three task claim fields—claimed_by_agent_id, claimed_at, and expires_at—while preserving the permanent ownership field agent_id, and pauses attacks that have no remaining active tasks. This dual ownership model allows the system to track both the original task assignment and the active claim status, enabling sophisticated task reassignment logic that prevents duplicate work while maximizing resource utilization across the agent pool.
Unlike other pause mechanisms in CipherSwarm (such as attack-level pauses or campaign priority preemption), the agent shutdown cascade is unique in both clearing claim fields and automatically pausing attacks. This design choice is intentional: when an agent shuts down, its tasks are truly orphaned and should become available for reassignment to healthy agents after a configurable grace period. In contrast, administrative pause actions preserve claim fields because the same agent is expected to resume the work when the attack/campaign is unpaused.
Shutdown Event and Lifecycle#
State Machine Implementation#
The agent shutdown is implemented using the state_machines-activerecord gem, which provides declarative state machine management for ActiveRecord models. The shutdown event is defined as:
event :shutdown do
  transition any => :offline
end
This event can be invoked from any agent state (pending, active, stopped, error) and transitions the agent to the offline state. The invocation syntax is agent.shutdown (not agent.shutdown!), following the state_machines gem convention where the bang version raises exceptions on invalid transitions.
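The difference between the two invocation styles can be illustrated with a plain-Ruby stand-in. This is a minimal sketch, not the state_machines gem: the class SketchAgent, the methods fire and fire!, and the activate rule are hypothetical names introduced only for illustration. It shows a non-bang call returning true or false and a bang call raising on an invalid transition.

```ruby
# Hypothetical stand-in for a state machine event; NOT the state_machines gem.
class InvalidTransition < StandardError; end

class SketchAgent
  attr_reader :state

  # Allowed source states per event. :shutdown accepts every state ("any").
  # The :activate rule is an assumption added only to show a failing case.
  TRANSITIONS = {
    shutdown: { from: :any, to: :offline },
    activate: { from: [:pending], to: :active }
  }.freeze

  def initialize(state)
    @state = state
  end

  # Non-bang form: returns true on success, false when the transition is not allowed.
  def fire(event)
    rule = TRANSITIONS.fetch(event)
    return false unless rule[:from] == :any || rule[:from].include?(@state)

    @state = rule[:to]
    true
  end

  # Bang form: raises instead of returning false.
  def fire!(event)
    fire(event) || raise(InvalidTransition, "cannot #{event} from #{@state}")
  end
end

agent = SketchAgent.new(:error)
agent.fire(:shutdown) # succeeds from any state; the agent is now :offline
agent.fire(:activate) # returns false; fire!(:activate) would raise here
```

Because the shutdown event accepts any source state, the non-bang form is sufficient for it: there is no invalid source state to report.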
Shutdown Callback Execution#
The primary shutdown callback executes after the state transition completes:
after_transition on: :shutdown do |agent|
  running_tasks = agent.tasks.with_states(:running)
  paused_count = running_tasks.count
  Rails.logger.info(
    "[AgentLifecycle] shutdown: agent_id=#{agent.id} state_change=#{agent.state_was}->offline " \
    "running_tasks_paused=#{paused_count} timestamp=#{Time.zone.now}"
  )
  affected_attacks = Set.new
  running_tasks.find_each do |task|
    paused = false
    begin
      if task.can_pause?
        task.pause!
        paused = true
      end
    rescue StateMachines::InvalidTransition, ActiveRecord::StaleObjectError => e
      Rails.logger.error(
        "[AgentLifecycle] shutdown: Failed to pause task #{task.id} " \
        "for agent #{agent.id}: #{e.class} - #{e.message}"
      )
    end
    # Only clear claim fields on successfully paused tasks.
    # Running tasks with cleared claims would be an inconsistent state
    # not handled by any recovery path. If pause failed, the heartbeat
    # timeout will eventually detect the agent as offline and handle the task.
    if paused
      task.update_columns(claimed_by_agent_id: nil, claimed_at: nil, expires_at: nil)
    end
    affected_attacks << task.attack
  end
  # Pause attacks that have no remaining in-progress tasks (pending or running).
  # This updates the Activity page to reflect that work has stopped.
  affected_attacks.each do |attack|
    next unless attack.can_pause?
    next if attack.tasks.without_states(:paused, :completed, :exhausted, :failed).exists?
    begin
      attack.pause!
    rescue StateMachines::InvalidTransition, ActiveRecord::StaleObjectError => e
      Rails.logger.error(
        "[AgentLifecycle] shutdown: Failed to pause attack #{attack.id} " \
        "for agent #{agent.id}: #{e.class} - #{e.message}"
      )
    end
  end
end
Key implementation details:
- Only tasks in the running state are affected (not pending or failed tasks)
- Structured logging captures agent ID, state transition, and count of affected tasks
- Tasks are processed in batches using find_each for memory efficiency
- Each task is individually paused using the task state machine pause! event
- Error handling for task pause: The callback wraps task.pause! in a begin/rescue block to catch state machine transition errors (StateMachines::InvalidTransition) and database concurrency errors (ActiveRecord::StaleObjectError). If a pause fails, the error is logged with Rails.logger.error but the shutdown process continues. This keeps the cascade robust when concurrent state transitions occur.
- Claim field clearing logic: Claim fields (claimed_by_agent_id, claimed_at, expires_at) are only cleared when the pause operation succeeds (tracked by the paused boolean flag). Running tasks with cleared claims would create an inconsistent state not handled by any recovery path. If a task pause fails, the heartbeat timeout mechanism will eventually detect the agent as offline and handle the orphaned task.
- Claim fields are cleared using update_columns to bypass validations and callbacks for performance
- Automatic attack pause: After pausing tasks, the callback checks each affected attack to see if it has any remaining active tasks (excluding paused, completed, exhausted, and failed states). The condition attack.tasks.without_states(:paused, :completed, :exhausted, :failed).exists? returns false when no active tasks remain, triggering the attack pause.
- Error handling for attack pause: Attack pause operations are wrapped in begin/rescue blocks to catch state machine and concurrency errors. If an attack pause fails, the error is logged but the shutdown process continues, so the agent still transitions to the offline state even if attack cleanup encounters issues.
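The ordering described above (pause first, clear the claim only on success, then pause attacks that have no pending or running tasks left) can be sketched without Rails. This is a simplified illustration under stated assumptions, not the callback itself: SketchAttack, SketchTask, try_pause, shutdown_cascade, and the lock_conflict flag are hypothetical, and the three claim fields are collapsed into claimed_by_agent_id for brevity.

```ruby
require "set"

# Plain-Ruby stand-ins for the ActiveRecord models (hypothetical names).
class SketchAttack
  attr_accessor :state
  attr_reader :tasks

  def initialize(state)
    @state = state
    @tasks = []
  end
end

class SketchTask
  attr_accessor :state, :claimed_by_agent_id
  attr_reader :attack, :lock_conflict

  def initialize(state:, attack:, claimed_by_agent_id: nil, lock_conflict: false)
    @state = state
    @attack = attack
    @claimed_by_agent_id = claimed_by_agent_id
    @lock_conflict = lock_conflict # assumption: simulates a rescued pause error
    attack.tasks << self
  end
end

# Returns true when the task was paused; false stands in for a rescued error.
def try_pause(task)
  return false if task.lock_conflict || task.state != :running

  task.state = :paused
  true
end

def shutdown_cascade(running_tasks)
  affected_attacks = Set.new

  running_tasks.each do |task|
    # The claim is released only after a successful pause.
    task.claimed_by_agent_id = nil if try_pause(task)
    affected_attacks << task.attack
  end

  # An attack is paused only when none of its tasks is pending or running.
  affected_attacks.each do |attack|
    next unless attack.state == :running
    next if attack.tasks.any? { |t| %i[pending running].include?(t.state) }

    attack.state = :paused
  end
end
```

In this sketch a task whose pause fails keeps both its running state and its claim, which mirrors the rule that a running task with a cleared claim must never exist.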
Task Pausing Behavior#
Task State Machine#
The task pause event defines the valid transitions:
event :pause do
  transition %i[pending running] => :paused
  transition any => same
end
This idempotent design ensures that:
- Tasks in pending or running states transition to paused
- Tasks already in terminal states (completed, exhausted, failed) or already paused remain unchanged
- The pause operation never fails due to invalid state
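The two transition rules amount to a small lookup, sketched here in plain Ruby. The helper name next_state_on_pause is hypothetical, and the sketch models only the resulting state, not the callbacks that run around the transition.

```ruby
# Mirrors the two rules: pending/running => paused, anything else => same.
def next_state_on_pause(state)
  %i[pending running].include?(state) ? :paused : state
end

%i[pending running paused completed exhausted failed].each do |state|
  puts "#{state} -> #{next_state_on_pause(state)}"
end
```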
Task Pause Callbacks#
When a task is paused, three callbacks execute:
- State-specific logging and paused_at timestamp:
after_transition to: :paused do |task|
  task.send(:log_state_transition, "paused", "Task execution paused")
  task.send(:mark_paused_safely)
end
The mark_paused_safely method sets the paused_at timestamp using update_column:
def mark_paused_safely
  update_column(:paused_at, Time.zone.now)
rescue ActiveRecord::ActiveRecordError => e
  Rails.logger.error(
    "[Task #{id}] Error setting paused_at in pause callback - " \
    "Error: #{e.class} - #{e.message} - #{Time.current}"
  )
  # Non-critical: task is paused but paused_at not set.
  # Grace period will treat it as immediately available (paused_at IS NULL).
end
- Attack progress broadcast (lines 188-191):
after_transition do |task, transition|
  next if transition.event == :abandon
  task.send(:safe_broadcast_attack_progress_update)
end
- Activity timestamp update (line 193):
after_transition any - [:pending] => any, do: :update_activity_timestamp
The log_state_transition method generates structured logs:
def log_state_transition(new_state, message)
  Rails.logger.info(
    "[Task #{id}] Agent #{agent_id} - Attack #{attack_id} - " \
    "State change: #{state_was} -> #{new_state} - #{message}"
  )
end
paused_at Timestamp: When a task transitions to the paused state, the paused_at column is set to the current time via the mark_paused_safely method. This timestamp is critical for the grace period mechanism in orphaned task recovery, allowing the system to determine when a paused task becomes eligible for reclamation by other agents. The method uses error handling to ensure task pausing succeeds even if the timestamp update fails; in such cases, the task is treated as immediately available (matching the paused_at IS NULL grace period logic).
Task State Classifications#
Task scopes define semantic groupings:
scope :incomplete, -> { with_states(%i[pending failed running]) }
scope :successful, -> { with_states(:completed, :exhausted) }
scope :finished, -> { with_states(:completed, :exhausted, :failed) }
scope :running, -> { with_state(:running) }
Important: The paused state is not included in the incomplete scope. This is intentional: paused tasks cannot be actively worked on without first being resumed, distinguishing them from pending, running, and failed tasks that are available for immediate execution.
Claim Field Clearing Mechanism#
Dual Ownership Model#
CipherSwarm implements a dual ownership model for tasks:
- Permanent Ownership (agent_id): Tracks which agent the task was originally assigned to. This is a NOT NULL column with ON DELETE => cascade.
- Active Claim (claimed_by_agent_id, claimed_at, expires_at): Tracks which agent is currently processing the task and when the claim expires. The claimed_by_agent_id has ON DELETE => nullify.
Claim Clearing Implementation#
During shutdown, claim fields are cleared:
task.update_columns(claimed_by_agent_id: nil, claimed_at: nil, expires_at: nil)
Implementation considerations:
- Uses update_columns to bypass ActiveRecord validations and callbacks for performance
- Only clears claim fields, not the permanent agent_id field
- Intentionally skips model validations (acknowledged with # rubocop:disable Rails/SkipsModelValidations)
According to AGENTS.md documentation:
On agent shutdown, tasks are paused and claim fields (claimed_by_agent_id, claimed_at, expires_at) are cleared. TaskAssignmentService#find_unassigned_paused_task detects orphans using a paused_at grace period, then reassigns agent_id and calls resume! on pickup. TaskAssignmentService#find_own_paused_task runs before find_unassigned_paused_task: returning agents reclaim their own paused tasks first (to use restore files). Grace period (agent_considered_offline_time, default 30 min) via paused_at column: within the period, only the original agent can reclaim; after, any agent can. Tasks from offline/stopped agents are available immediately. When reclaiming a paused task whose attack was also paused (shutdown cascade), the attack is resumed automatically.
Why Claim Clearing is Shutdown-Specific#
Other pause mechanisms in CipherSwarm do not clear claim fields:
- Attack-level pause: When attacks transition to paused state, they cascade the pause to tasks:
def pause_tasks
  tasks.without_state(:paused).each(&:pause)
end
This pauses tasks but does not clear claim fields, as the same agent is expected to resume work when the attack is unpaused.
- Campaign priority preemption: When higher-priority campaigns preempt lower-priority ones, tasks are preempted (not paused) and claim fields may remain set depending on the preemption implementation.
The shutdown-specific claim clearing ensures that orphaned tasks are immediately available for reassignment, rather than waiting for claim expiration timeouts.
Orphaned Task Detection and Reassignment#
Grace Period Mechanism#
The orphaned task recovery system uses a time-based grace period rather than checking agent state directly. When a task is paused during agent shutdown, the paused_at timestamp is set. This timestamp controls when the task becomes eligible for reassignment:
- Within grace period (default 30 minutes via agent_considered_offline_time configuration): Only the original agent can reclaim its own paused tasks via find_own_paused_task
- After grace period expires: Any agent can claim the orphaned task via find_unassigned_paused_task
- Exception: Tasks from agents in offline or stopped states are immediately available for reassignment, bypassing the grace period
This grace period design ensures that agents can quickly resume their work after brief disconnections (e.g., network blips, restarts) by leveraging restore files, while preventing tasks from being permanently stuck if an agent never returns.
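The eligibility rule for agents other than the owner can be written as a single predicate. The following is a minimal sketch in plain Ruby: the method name claimable_by_other_agent? and the constants are hypothetical, and the 30-minute value is the documented default rather than a value read from configuration.

```ruby
GRACE_PERIOD_SECONDS = 30 * 60 # assumed: the documented default of 30 minutes
ORPHAN_STATES = %w[offline stopped].freeze

# Returns true when an agent other than the owner may claim the paused task.
def claimable_by_other_agent?(paused_at:, owner_state:, now: Time.now)
  return true if paused_at.nil?                          # no timestamp recorded
  return true if paused_at < now - GRACE_PERIOD_SECONDS  # grace period expired

  ORPHAN_STATES.include?(owner_state)                    # owner is offline or stopped
end

now = Time.now
claimable_by_other_agent?(paused_at: now - 5 * 60, owner_state: "active", now: now)  # false
claimable_by_other_agent?(paused_at: now - 5 * 60, owner_state: "offline", now: now) # true
```

The owning agent is not subject to this predicate at all; its own paused tasks are handled by a separate lookup that ignores paused_at.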
find_own_paused_task Method#
The TaskAssignmentService#find_own_paused_task method allows agents to reclaim their own paused tasks:
def find_own_paused_task
  task = agent.tasks.with_state(:paused)
    .where(claimed_by_agent_id: [nil, agent.id])
    .joins(attack: { campaign: :hash_list })
    .where("EXISTS (SELECT 1 FROM hash_items WHERE hash_items.hash_list_id = hash_lists.id AND hash_items.cracked = false)")
    .order(:id)
    .first
  return nil unless task
  if task.attack.paused? && task.attack.can_resume?
    task.attack.resume!
    task.reload # attack.resume_tasks may have already resumed this task
  end
  task.resume! if task.paused? && task.can_resume?
  task
rescue StateMachines::InvalidTransition, ActiveRecord::StaleObjectError => e
  Rails.logger.error(
    "[TaskAssignmentService] Failed to resume own paused task #{task.id} " \
    "for agent #{agent.id}: #{e.class} - #{e.message}"
  )
  nil
end
Key characteristics:
- Purpose: Reclaims the agent's own paused tasks after a restart
- Scope: Queries only tasks where agent_id matches the current agent (via agent.tasks)
- Claim handling: Accepts tasks with claimed_by_agent_id as either nil or the current agent's ID
- No grace period check: This method does not filter by paused_at timestamp; agents can always reclaim their own paused tasks immediately
- No locking: FOR UPDATE SKIP LOCKED is not needed because the query is already scoped to this agent's agent_id, preventing cross-agent races
- Attack auto-resume: If the task's attack was paused during shutdown cascade, this method automatically resumes the attack before resuming the task
- Restore file optimization: Allows agents to resume work using existing restore files, avoiding redundant computation
- Error handling: Wrapped in a rescue block to catch state transition failures; returns nil if resume fails
find_unassigned_paused_task Method#
The TaskAssignmentService#find_unassigned_paused_task method handles orphaned task detection from other agents:
def find_unassigned_paused_task
  task = nil
  Task.transaction do
    task = Task.with_state(:paused)
      .where(claimed_by_agent_id: nil)
      .where.not(agent_id: agent.id)
      .joins(:agent)
      .where(
        "tasks.paused_at IS NULL OR tasks.paused_at < :grace_cutoff OR agents.state IN (:orphan_states)",
        grace_cutoff: ApplicationConfig.agent_considered_offline_time.ago,
        orphan_states: %w[offline stopped]
      )
      .joins(attack: { campaign: :hash_list })
      .where(campaigns: { project_id: agent.project_ids })
      .where(hash_lists: { hash_type_id: allowed_hash_type_ids })
      .where("EXISTS (SELECT 1 FROM hash_items WHERE hash_items.hash_list_id = hash_lists.id AND hash_items.cracked = false)")
      .order(:id)
      .lock("FOR UPDATE OF tasks SKIP LOCKED")
      .first
    return nil unless task
    # Reassign ownership to the claiming agent
    task.update_columns(agent_id: agent.id)
    # Resume the attack if it was paused due to agent shutdown (not campaign pause).
    begin
      if task.attack.paused? && task.attack.can_resume?
        task.attack.resume!
        task.reload # attack.resume_tasks may have already resumed this task
      end
      # Transition to pending so the new agent can accept the task.
      # resume! moves paused -> pending, marks stale (so the agent re-downloads cracks),
      # and clears paused_at (removing the task from grace period queries).
      task.resume! if task.paused? && task.can_resume?
    rescue StateMachines::InvalidTransition, ActiveRecord::StaleObjectError => e
      Rails.logger.error(
        "[TaskAssignmentService] Failed to resume orphaned task #{task.id} " \
        "for agent #{agent.id}: #{e.class} - #{e.message}"
      )
      # Ownership was reassigned; task stays paused but belongs to the new agent.
      # Next cycle's find_own_paused_task will pick it up.
    end
  end
  task
end
Key implementation details:
- Grace period logic: The WHERE clause combines three conditions with OR:
  - tasks.paused_at IS NULL: Legacy tasks paused before the paused_at column was added (treated as immediately available)
  - tasks.paused_at < :grace_cutoff: Tasks paused longer ago than the grace period (30 minutes by default, configured via agent_considered_offline_time)
  - agents.state IN (:orphan_states): Tasks from offline/stopped agents (immediate availability, bypassing grace period)
- Purpose: Finds tasks from other agents that have been orphaned, not the current agent's own tasks
- Exclusion: The query includes .where.not(agent_id: agent.id) to explicitly exclude tasks owned by the current agent
- Attack auto-resume: When reclaiming a paused task whose attack was also paused during shutdown cascade, the method automatically resumes the attack
- Reload handling: After attack.resume!, the task is reloaded because the attack's resume_tasks callback may have already resumed all paused tasks, potentially causing a StaleObjectError if the method attempts to resume the task again
- Error handling: The resume operations are wrapped in a rescue block. If the attack or task resume fails, ownership reassignment is preserved and the error is logged. The task remains paused but assigned to the new agent, allowing find_own_paused_task to retry on the next cycle.
Why find_own_paused_task Checks [nil, agent.id]#
Unlike find_unassigned_paused_task, the find_own_paused_task method checks for claimed_by_agent_id in [nil, agent.id] because:
- Agent's own tasks: The query is already scoped to agent.tasks, which filters by agent_id
- Claim status: Tasks may have been paused with the agent still holding the claim, or with the claim cleared during shutdown
- Semantic correctness: Including agent.id in the claim check allows agents to resume tasks they were actively working on before shutdown
Race Condition Prevention#
The find_unassigned_paused_task method uses PostgreSQL row-level locking:
.lock("FOR UPDATE OF tasks SKIP LOCKED")
According to inline documentation:
Uses FOR UPDATE SKIP LOCKED to prevent two agents from racing to claim the same task.
This ensures that when multiple agents simultaneously query for orphaned tasks, each agent locks a different task, preventing duplicate work. The find_own_paused_task method does not require this locking mechanism because its query is scoped to the current agent's agent_id, eliminating the possibility of cross-agent races.
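The skip-locked behaviour can be illustrated with an in-memory analogy. This is only a sketch of the idea, not how PostgreSQL is invoked: Mutex#try_lock stands in for a row lock that is either acquired immediately or skipped, and SketchRow and claim_first_unlocked are hypothetical names.

```ruby
SketchRow = Struct.new(:id, :lock)

# Returns the first row (by id) whose lock can be taken without waiting.
# Rows that are already locked are skipped rather than waited on.
def claim_first_unlocked(rows)
  rows.sort_by(&:id).find { |row| row.lock.try_lock }
end

rows = [1, 2, 3].map { |id| SketchRow.new(id, Mutex.new) }
first = claim_first_unlocked(rows)  # takes row 1 and keeps it locked
second = claim_first_unlocked(rows) # row 1 is held, so this moves on to row 2
puts "first=#{first.id} second=#{second.id}"
```

The important property is that the second claimer neither blocks nor receives the same row; it simply moves on to the next candidate in id order.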
Task Assignment Priority#
The find_next_task method defines the priority order for task assignment:
def find_next_task
  find_existing_incomplete_task ||
    find_own_paused_task ||
    find_unassigned_paused_task ||
    find_task_from_available_attacks
end
Priority order:
- Incomplete tasks already assigned to this agent (highest priority) - Ensures agents complete their existing work before taking on new tasks
- Agent's own paused tasks - Allows returning agents to reclaim their paused tasks and leverage restore files within the grace period
- Orphaned paused tasks from other agents - Recovery mode for tasks from offline agents or tasks past the grace period
- New tasks from available attacks (lowest priority) - Normal operation when no incomplete or orphaned tasks exist
This two-stage reclamation process (own tasks → orphaned tasks) ensures that agents can quickly resume their work after brief disconnections, while still providing fault tolerance for permanently failed agents. The grace period mechanism prevents tasks from being immediately stolen during routine agent restarts.
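The short-circuit nature of this priority chain matters: a lower-priority lookup runs only when every higher-priority one returned nil. A minimal plain-Ruby sketch follows, with lambdas standing in for the four finder methods; first_available_task and the task labels are hypothetical.

```ruby
# Runs each finder in order and returns the first non-nil result, like a `||` chain.
def first_available_task(finders)
  finders.each do |finder|
    task = finder.call
    return task if task
  end
  nil
end

calls = []
finders = [
  -> { calls << :incomplete; nil },       # nothing left in progress
  -> { calls << :own_paused; "task-42" }, # the agent's own paused task
  -> { calls << :orphaned; "task-99" },   # never reached
  -> { calls << :new_work; "task-100" }   # never reached
]

result = first_available_task(finders)
puts "picked #{result} after checking #{calls.inspect}"
```

In this sketch the orphan and new-work lookups are never invoked, which is why an agent with its own paused work does not compete for other agents' orphaned tasks.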
Attack and Campaign State Propagation#
Attack Pause During Shutdown#
The agent shutdown cascade automatically pauses attacks that have no remaining active tasks. After pausing all running tasks, the shutdown callback checks each affected attack:
# Pause attacks that have no more active (non-paused) tasks.
# This updates the Activity page to reflect that work has stopped.
affected_attacks.each do |attack|
  next unless attack.can_pause?
  next if attack.tasks.without_states(:paused, :completed, :exhausted, :failed).exists?
  attack.pause!
end
Implementation details:
- The callback maintains a Set of affected attacks (collected during task pausing)
- For each attack, it checks if any tasks remain in active states (excluding paused, completed, exhausted, and failed)
- If no active tasks remain and the attack can be paused (not already paused or in a terminal state), the attack transitions to the paused state
- This ensures the Activity page accurately reflects that work has stopped on these attacks
This bidirectional cascade keeps task and attack states consistent: shutdown pauses tasks → attacks left with no active tasks are paused → resuming an attack auto-resumes its paused tasks.
Attack State Machine Behavior#
The attack state machine includes bidirectional pause and resume callbacks:
after_transition any => :paused, :do => :pause_tasks
def pause_tasks
  tasks.without_state(:paused).each(&:pause)
end
after_transition any => :running, :do => :resume_tasks
def resume_tasks
  tasks.with_state(:paused).each(&:resume!)
end
This bidirectional relationship means:
- Attack → Task: When an attack is paused (administratively or via shutdown cascade), all non-paused tasks are paused
- Task → Attack: When orphaned tasks are reclaimed and their paused attacks are resumed, all paused tasks in that attack are automatically resumed
Important: The resume_tasks callback can cause StaleObjectError if task code attempts to resume a task after attack.resume! has already resumed it. This is why TaskAssignmentService includes task.reload before attempting task.resume! (see the "Reload handling" detail in the orphaned task section).
Campaign State Computation#
Campaigns do not have a persistent state column. Instead, the paused? method computes campaign state dynamically:
def paused?
  attacks.without_states(%i[paused completed]).empty? && attacks.with_state(:paused).any?
end
A campaign is considered paused when:
- All non-completed attacks are in the paused state
- At least one attack is paused
The completed? method similarly computes completion status:
def completed?
  return true unless hash_list.uncracked_items.exists?
  !attacks.without_state(:completed).exists?
end
Why campaign state reflects shutdown automatically: Since campaign state is computed on demand from attack states, no explicit state transition occurs during agent shutdown. When campaign.paused? is called, it queries the current attack states at that moment, so it reflects any changes caused by the shutdown cascade (including attacks paused due to having no remaining active tasks).
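The paused? rule can be checked against a few attack-state combinations with a plain-Ruby sketch. The helper campaign_paused? is hypothetical and takes an array of state symbols in place of the attacks association.

```ruby
# A campaign counts as paused when every attack is paused or completed
# and at least one attack is paused.
def campaign_paused?(attack_states)
  others = attack_states - %i[paused completed]
  others.empty? && attack_states.include?(:paused)
end

campaign_paused?(%i[paused completed]) # true: nothing is still active
campaign_paused?(%i[paused running])   # false: one attack is still running
campaign_paused?(%i[completed])        # false: no attack is paused
```

Since nothing is stored, a single attack resuming is enough for the next call to return false, with no campaign-level transition required.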
Bottom-Up State Aggregation Pattern#
CipherSwarm implements a bottom-up state aggregation architecture with bidirectional cascades:
Task states (stored) ⟷ Attack states (stored) → Campaign states (computed)
The bidirectional arrows between Task and Attack states represent:
- Downward cascade: Attack pause → pauses tasks; Attack resume → resumes tasks
- Upward cascade: Agent shutdown → pauses tasks → pauses attacks with no active tasks; Task reclaim → resumes attacks
This design ensures consistency between task and attack states during both shutdown and recovery, while eliminating the need for explicit campaign state transitions. Campaign state automatically reflects the aggregate state of its attacks, which in turn reflect the aggregate state of their tasks.
Agent Detection of Paused Tasks#
When agents submit status updates during task execution, the server responds with HTTP status codes that signal task state changes. The StatusSubmissionService determines the response:
def determine_response_status
  if task.stale
    Result.new(status: :stale)
  elsif task.paused?
    Result.new(status: :paused)
  else
    Result.new(status: :ok)
  end
end
The TasksController handles paused status by returning HTTP 410 Gone:
when :paused
  Rails.logger.info("[Agent #{@agent.id}] Task #{@task.id} - Status accepted, task is paused")
  head :gone
Agents detect paused tasks through this HTTP 410 response during status submission. This allows agents to gracefully stop processing tasks that have been paused due to shutdown of other agents or administrative actions.
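One detail of the status decision is the order of the checks: a task that is both stale and paused reports stale. The following plain-Ruby sketch makes that precedence explicit. SketchStatusTask and response_status are hypothetical names, and the sketch returns bare symbols rather than the service's Result objects.

```ruby
SketchStatusTask = Struct.new(:stale, :state)

# Checks run in the same order as the service: stale first, then paused, then ok.
def response_status(task)
  return :stale if task.stale
  return :paused if task.state == :paused

  :ok
end

response_status(SketchStatusTask.new(false, :paused)) # :paused
response_status(SketchStatusTask.new(true, :paused))  # :stale takes precedence
```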
Task Resume Behavior#
When a paused task is resumed (either by the original agent reclaiming it or by a new agent claiming it), it transitions from paused to pending:
event :resume do
  transition paused: :pending
  transition any => same
end
after_transition on: :resume do |task|
  task.send(:log_state_transition, "resumed", "Marking as stale and clearing paused_at")
  task.send(:mark_resumed_safely)
end
The mark_resumed_safely method clears the paused_at timestamp and marks the task as stale:
def mark_resumed_safely
  update_columns(stale: true, paused_at: nil)
rescue ActiveRecord::ActiveRecordError => e
  Rails.logger.error(
    "[Task #{id}] Error updating stale/paused_at in resume callback - " \
    "Error: #{e.class} - #{e.message} - #{Time.current}"
  )
  # Don't re-raise - task state transition already succeeded
end
Key behaviors:
- Resumed tasks transition to pending, not directly to running
- Tasks are marked stale: true, requiring agents to re-download crack information
- The paused_at timestamp is cleared via update_columns, removing the task from grace period tracking
- Error handling ensures the resume operation succeeds even if the timestamp/stale updates fail
- This ensures agents receive updated hash crack status before resuming work, preventing redundant cracking attempts
The new agent must then accept the task (transitioning from pending to running) before beginning execution.
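The combined effect of a resume on a paused task can be sketched in plain Ruby. SketchPausedTask and resume_sketch are hypothetical names; the sketch models only the paused to pending path and deliberately leaves out whatever callbacks run on the same-state loopback.

```ruby
SketchPausedTask = Struct.new(:state, :stale, :paused_at)

# Models only paused -> pending. Returns true when the task was resumed.
# Assumption: other states are left untouched; loopback callbacks are not modeled.
def resume_sketch(task)
  return false unless task.state == :paused

  task.state = :pending # the agent must still accept it to reach running
  task.stale = true     # forces a re-download of crack information
  task.paused_at = nil  # drops the task out of grace period queries
  true
end

task = SketchPausedTask.new(:paused, false, Time.now)
resume_sketch(task)
puts "state=#{task.state} stale=#{task.stale} paused_at=#{task.paused_at.inspect}"
```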
Deployment Considerations for Airgapped Environments#
In airgapped laboratory and secure environments where CipherSwarm is commonly deployed, the agent shutdown cascade provides critical resilience features:
Network Reliability#
Airgapped networks often experience connectivity issues between isolated network segments. The shutdown cascade's grace period mechanism ensures that tasks are not immediately stolen during brief network interruptions, allowing the original agent to reconnect and reclaim its work using restore files. After the grace period expires (or if the agent is confirmed offline/stopped), other agents can immediately detect and claim orphaned tasks, ensuring work continues despite extended outages.
Resource Management#
In secure environments with limited computational resources, the priority-based task assignment system (incomplete → own paused → orphaned → new) ensures efficient resource utilization. When an agent restarts after maintenance or system updates, it prioritizes reclaiming its own paused tasks to leverage existing restore files, minimizing redundant work. Only after exhausting its own paused tasks does it help recover orphaned tasks from other agents.
Auditability#
The structured logging throughout the shutdown cascade provides comprehensive audit trails for security-conscious environments:
- Agent state transitions are logged with timestamps
- Task state changes include agent ID, attack ID, and state transitions
- Claim field modifications and paused_at timestamps are tracked for forensic analysis
- Attack pause events during shutdown are logged to explain why attacks transition to paused state
Database Integrity#
The use of PostgreSQL row-level locking (FOR UPDATE SKIP LOCKED) prevents race conditions even in high-concurrency scenarios common in large-scale password cracking operations. This ensures that multiple agents competing for orphaned tasks never duplicate work, critical for efficient resource usage in compute-constrained environments. The partial index on paused_at for paused tasks (WHERE state = 'paused') optimizes grace period queries by reducing index size and improving query performance for orphaned task detection.
Relevant Code Files#
| File Path | Purpose | Key Components |
|---|---|---|
| app/models/agent.rb | Agent lifecycle management | shutdown event, after_transition callback, claim field clearing, attack pause logic |
| app/models/concerns/task_state_machine.rb | Task state transitions | pause event, resume event, paused_at timestamp management, state transition callbacks |
| app/services/task_assignment_service.rb | Task assignment and orphan recovery | find_own_paused_task, find_unassigned_paused_task, find_next_task, grace period logic, row-level locking |
| app/models/concerns/attack_state_machine.rb | Attack state transitions | pause_tasks callback, resume_tasks callback, bidirectional task-attack cascades |
| app/models/campaign.rb | Campaign state computation | paused? method, completed? method |
| app/models/task.rb | Task model and scopes | incomplete, finished, running scopes, paused_at column, partial index |
| app/services/status_submission_service.rb | Agent status update handling | determine_response_status for pause detection |
| app/controllers/api/v1/client/tasks_controller.rb | Agent API endpoints | submit_status action, HTTP 410 response handling |
Related Topics#
Agent State Machine#
The agent shutdown cascade is part of the broader Agent State Machine that manages agent lifecycle events including activation, deactivation, and error handling.
Task Assignment System#
The cascade interacts with the Task Assignment System which implements priority-based task distribution and orphan recovery.
Attack State Machine#
The Attack State Machine defines bidirectional state propagation between attacks and tasks. During shutdown, attacks are automatically paused when they have no remaining active tasks. During task reclaim, attacks are automatically resumed when paused tasks are recovered.
Campaign Priority Management#
Campaign priority preemption represents an alternative pause mechanism that may interact with shutdown cascade behavior when agents are processing multiple priority levels.
Architectural Diagram#
This sequence diagram illustrates the complete lifecycle of an agent shutdown and subsequent task recovery, showing:
- The automatic attack pause when no active tasks remain
- The grace period mechanism that allows agents to reclaim their own paused tasks
- The two-stage reclamation process (own tasks → orphaned tasks)
- The bidirectional cascade between attacks and tasks during resume
- The task.reload pattern to avoid StaleObjectError when attack.resume_tasks has already resumed tasks
This article documents the Agent Shutdown Cascade behavior as implemented in CipherSwarm. The system uses a time-based grace period mechanism (via the paused_at timestamp) to enable efficient task recovery, with automatic attack pause/resume coordination during the shutdown and reclaim processes. The two-stage reclamation priority (own tasks → orphaned tasks) ensures agents can leverage restore files after brief disconnections while providing fault tolerance for permanently failed agents.