Agent Shutdown Cascade
Type: Topic
Status: Published
Created: Feb 27, 2026
Updated: Feb 27, 2026
Created by: Dosu Bot
Updated by: Dosu Bot

Agent Shutdown Cascade#

Lead Section#

The Agent Shutdown Cascade is a critical lifecycle management mechanism in CipherSwarm that orchestrates the orderly cleanup of distributed hash cracking tasks when an agent disconnects from the system. When an agent shuts down—whether through graceful termination, administrative action, or unexpected disconnection—the system automatically pauses all running tasks assigned to that agent and clears their claim fields, enabling other active agents to detect and recover these orphaned tasks. This cascade behavior ensures fault tolerance and high availability in distributed hash cracking operations, particularly in airgapped lab environments where network connectivity may be unreliable.

The shutdown cascade operates through a state machine event in the Agent model that transitions the agent to the offline state and triggers an after_transition callback. This callback pauses all running tasks, clears three task claim fields—claimed_by_agent_id, claimed_at, and expires_at—while preserving the permanent ownership field agent_id, and pauses attacks that have no remaining active tasks. This dual ownership model allows the system to track both the original task assignment and the active claim status, enabling sophisticated task reassignment logic that prevents duplicate work while maximizing resource utilization across the agent pool.

Unlike other pause mechanisms in CipherSwarm (such as attack-level pauses or campaign priority preemption), the agent shutdown cascade both clears claim fields and automatically pauses attacks. This design choice is intentional: when an agent shuts down, its tasks are truly orphaned and should become available for reassignment to healthy agents, either immediately while the original agent remains offline or stopped, or after a configurable grace period otherwise. In contrast, administrative pause actions preserve claim fields because the same agent is expected to resume the work when the attack or campaign is unpaused.

Shutdown Event and Lifecycle#

State Machine Implementation#

The agent shutdown is implemented using the state_machines-activerecord gem, which provides declarative state machine management for ActiveRecord models. The shutdown event is defined as:

event :shutdown do
  transition any => :offline
end

This event can be invoked from any agent state (pending, active, stopped, error) and transitions the agent to the offline state. The invocation syntax is agent.shutdown (not agent.shutdown!), following the state_machines gem convention where the bang version raises exceptions on invalid transitions.

Shutdown Callback Execution#

The primary shutdown callback executes after the state transition completes:

after_transition on: :shutdown do |agent|
  running_tasks = agent.tasks.with_states(:running)
  paused_count = running_tasks.count

  Rails.logger.info(
    "[AgentLifecycle] shutdown: agent_id=#{agent.id} state_change=#{agent.state_was}->offline " \
    "running_tasks_paused=#{paused_count} timestamp=#{Time.zone.now}"
  )

  affected_attacks = Set.new
  running_tasks.find_each do |task|
    paused = false
    begin
      if task.can_pause?
        task.pause!
        paused = true
      end
    rescue StateMachines::InvalidTransition, ActiveRecord::StaleObjectError => e
      Rails.logger.error(
        "[AgentLifecycle] shutdown: Failed to pause task #{task.id} " \
        "for agent #{agent.id}: #{e.class} - #{e.message}"
      )
    end
    # Only clear claim fields on successfully paused tasks.
    # Running tasks with cleared claims would be an inconsistent state
    # not handled by any recovery path. If pause failed, the heartbeat
    # timeout will eventually detect the agent as offline and handle the task.
    if paused
      task.update_columns(claimed_by_agent_id: nil, claimed_at: nil, expires_at: nil)
    end
    affected_attacks << task.attack
  end

  # Pause attacks that have no remaining in-progress tasks (pending or running).
  # This updates the Activity page to reflect that work has stopped.
  affected_attacks.each do |attack|
    next unless attack.can_pause?
    next if attack.tasks.without_states(:paused, :completed, :exhausted, :failed).exists?

    begin
      attack.pause!
    rescue StateMachines::InvalidTransition, ActiveRecord::StaleObjectError => e
      Rails.logger.error(
        "[AgentLifecycle] shutdown: Failed to pause attack #{attack.id} " \
        "for agent #{agent.id}: #{e.class} - #{e.message}"
      )
    end
  end
end

Key implementation details:

  • Only tasks in the running state are affected (not pending or failed tasks)
  • Structured logging captures agent ID, state transition, and count of affected tasks
  • Tasks are processed in batches using find_each for memory efficiency
  • Each task is individually paused using the task state machine pause! event
  • Error handling for task pause: The callback wraps task.pause! in a begin/rescue block to catch state machine transition errors (StateMachines::InvalidTransition) and optimistic-locking conflicts (ActiveRecord::StaleObjectError). If a pause fails, the error is logged with Rails.logger.error and the shutdown process continues, so one failed task does not abort the cascade when concurrent state transitions occur.
  • Claim field clearing logic: Claim fields (claimed_by_agent_id, claimed_at, expires_at) are only cleared when the pause operation succeeds (tracked by the paused boolean flag). Running tasks with cleared claims would create an inconsistent state not handled by any recovery path. If a task pause fails, the heartbeat timeout mechanism will eventually detect the agent as offline and handle the orphaned task.
  • Claim fields are cleared using update_columns to bypass validations and callbacks for performance
  • Automatic attack pause: After pausing tasks, the callback checks each affected attack to see if it has any remaining active tasks (excluding paused, completed, exhausted, and failed states). The condition attack.tasks.without_states(:paused, :completed, :exhausted, :failed).exists? returns false when no active tasks remain, triggering the attack pause.
  • Error handling for attack pause: Attack pause operations are wrapped in begin/rescue blocks that catch the same state machine and concurrency errors. If an attack pause fails, the error is logged and the shutdown process continues, so the agent still transitions to the offline state even if attack cleanup encounters issues.
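
The control flow described in these points can be approximated without Rails. The following is a minimal plain-Ruby sketch under stated assumptions, not the CipherSwarm implementation: SketchAttack, SketchTask, INACTIVE_STATES, and shutdown_cascade are hypothetical names, a failed pause is simulated with a flag instead of a real StateMachines::InvalidTransition, and "attack can be paused" is reduced to "attack is running".

```ruby
require "set"

# Hypothetical in-memory stand-ins for the ActiveRecord models (assumptions).
class SketchAttack
  attr_reader :id, :tasks
  attr_accessor :state

  def initialize(id)
    @id = id
    @state = :running
    @tasks = []
  end
end

class SketchTask
  attr_reader :id, :attack, :agent_id
  attr_accessor :state, :claimed_by_agent_id, :claimed_at, :expires_at

  def initialize(id, attack, agent_id, fail_pause: false)
    @id = id
    @attack = attack
    @agent_id = agent_id
    @state = :running
    @claimed_by_agent_id = agent_id
    @claimed_at = Time.now
    @expires_at = @claimed_at + 3600
    @fail_pause = fail_pause
    attack.tasks << self
  end

  # Stand-in for task.pause!; raises to simulate a failed transition.
  def pause!
    raise "simulated pause failure" if @fail_pause

    @state = :paused
  end
end

INACTIVE_STATES = %i[paused completed exhausted failed].freeze

# Follows the order of operations listed above for one agent's tasks.
def shutdown_cascade(tasks)
  affected_attacks = Set.new
  tasks.select { |t| t.state == :running }.each do |task|
    paused = false
    begin
      task.pause!
      paused = true
    rescue RuntimeError => e
      warn "failed to pause task #{task.id}: #{e.message}"
    end
    # Claims are cleared only after a successful pause; agent_id is untouched.
    if paused
      task.claimed_by_agent_id = nil
      task.claimed_at = nil
      task.expires_at = nil
    end
    affected_attacks << task.attack
  end

  # Pause each affected attack that has no pending or running tasks left.
  affected_attacks.each do |attack|
    next unless attack.state == :running # assumed stand-in for can_pause?
    next if attack.tasks.any? { |t| !INACTIVE_STATES.include?(t.state) }

    attack.state = :paused
  end
end

attack_a = SketchAttack.new(1)
attack_b = SketchAttack.new(2)
ok_task = SketchTask.new(1, attack_a, 7)
stuck_task = SketchTask.new(2, attack_b, 7, fail_pause: true)
shutdown_cascade([ok_task, stuck_task])

puts "task 1: #{ok_task.state}, claim=#{ok_task.claimed_by_agent_id.inspect}, attack=#{attack_a.state}"
puts "task 2: #{stuck_task.state}, claim=#{stuck_task.claimed_by_agent_id.inspect}, attack=#{attack_b.state}"
```

In this sketch the task whose pause fails keeps both its running state and its claim, so its attack also stays running. This is the consistency property the claim-clearing rule is meant to protect.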

Task Pausing Behavior#

Task State Machine#

The task pause event defines the valid transitions:

event :pause do
  transition %i[pending running] => :paused
  transition any => same
end

This idempotent design ensures that:

  • Tasks in pending or running states transition to paused
  • Tasks already in terminal states (completed, exhausted, failed) or already paused remain unchanged
  • The pause operation never fails due to invalid state
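
The idempotence of this event can be illustrated with a plain lookup table. This is a sketch only: PAUSE_TRANSITIONS and state_after_pause are hypothetical names, and the state list is taken from the states mentioned in this article rather than from the model itself.

```ruby
# Hypothetical transition table mirroring the pause event shown above.
PAUSE_TRANSITIONS = { pending: :paused, running: :paused }.freeze

# Returns the state after a pause event; states not listed loop back to themselves.
def state_after_pause(state)
  PAUSE_TRANSITIONS.fetch(state, state)
end

%i[pending running paused completed exhausted failed].each do |state|
  puts "#{state} -> #{state_after_pause(state)}"
end
```

Applying state_after_pause twice gives the same result as applying it once, which is what makes repeated pause calls safe.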

Task Pause Callbacks#

When a task is paused, three callbacks execute:

  1. State-specific logging and paused_at timestamp:
after_transition to: :paused do |task|
  task.send(:log_state_transition, "paused", "Task execution paused")
  task.send(:mark_paused_safely)
end

The mark_paused_safely method sets the paused_at timestamp using update_column:

def mark_paused_safely
  update_column(:paused_at, Time.zone.now)
rescue ActiveRecord::ActiveRecordError => e
  Rails.logger.error(
    "[Task #{id}] Error setting paused_at in pause callback - " \
    "Error: #{e.class} - #{e.message} - #{Time.current}"
  )
  # Non-critical: task is paused but paused_at not set.
  # Grace period will treat it as immediately available (paused_at IS NULL).
end
  2. Attack progress broadcast (lines 188-191):
after_transition do |task, transition|
  next if transition.event == :abandon
  task.send(:safe_broadcast_attack_progress_update)
end
  3. Activity timestamp update (line 193):
after_transition any - [:pending] => any, do: :update_activity_timestamp

The log_state_transition method generates structured logs:

def log_state_transition(new_state, message)
  Rails.logger.info(
    "[Task #{id}] Agent #{agent_id} - Attack #{attack_id} - " \
    "State change: #{state_was} -> #{new_state} - #{message}"
  )
end

paused_at Timestamp: When a task transitions to the paused state, the paused_at column is set to the current time via the mark_paused_safely method. This timestamp is critical for the grace period mechanism in orphaned task recovery, allowing the system to determine when a paused task becomes eligible for reclamation by other agents. The method uses error handling to ensure task pausing succeeds even if the timestamp update fails; in such cases, the task is treated as immediately available (matching the paused_at IS NULL grace period logic).

Task State Classifications#

Task scopes define semantic groupings:

scope :incomplete, -> { with_states(%i[pending failed running]) }
scope :successful, -> { with_states(:completed, :exhausted) }
scope :finished, -> { with_states(:completed, :exhausted, :failed) }
scope :running, -> { with_state(:running) }

Important: The paused state is not included in the incomplete scope. This is intentional: paused tasks cannot be actively worked on without first being resumed, distinguishing them from pending, running, and failed tasks that are available for immediate execution.
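
To make the groupings concrete, the sketch below reproduces the state lists from the scopes as plain arrays. The constant names and the groupings_for helper are hypothetical and exist only for illustration.

```ruby
# State lists copied from the scopes above; constant and method names are hypothetical.
INCOMPLETE = %i[pending failed running].freeze
SUCCESSFUL = %i[completed exhausted].freeze
FINISHED = %i[completed exhausted failed].freeze

# Returns the names of the groupings that contain the given state.
def groupings_for(state)
  { incomplete: INCOMPLETE, successful: SUCCESSFUL, finished: FINISHED }
    .select { |_name, states| states.include?(state) }
    .keys
end

%i[pending running paused failed completed exhausted].each do |state|
  puts "#{state}: #{groupings_for(state).inspect}"
end
```

With these lists, paused belongs to none of the three groupings, while failed appears in both incomplete and finished.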

Claim Field Clearing Mechanism#

Dual Ownership Model#

CipherSwarm implements a dual ownership model for tasks:

  1. Permanent Ownership (agent_id): Tracks which agent the task was originally assigned to. This is a NOT NULL column whose foreign key uses ON DELETE CASCADE (on_delete: :cascade).

  2. Active Claim (claimed_by_agent_id, claimed_at, expires_at): Tracks which agent is currently processing the task and when the claim expires. The claimed_by_agent_id foreign key uses ON DELETE SET NULL (on_delete: :nullify).

Claim Clearing Implementation#

During shutdown, claim fields are cleared:

task.update_columns(claimed_by_agent_id: nil, claimed_at: nil, expires_at: nil)

Implementation considerations:

  • Uses update_columns to bypass ActiveRecord validations and callbacks for performance
  • Only clears claim fields, not the permanent agent_id field
  • Intentionally skips model validations (acknowledged with # rubocop:disable Rails/SkipsModelValidations)

According to AGENTS.md documentation:

On agent shutdown, tasks are paused and claim fields (claimed_by_agent_id, claimed_at, expires_at) are cleared. TaskAssignmentService#find_unassigned_paused_task detects orphans using a paused_at grace period, then reassigns agent_id and calls resume! on pickup. TaskAssignmentService#find_own_paused_task runs before find_unassigned_paused_task — returning agents reclaim their own paused tasks first (to use restore files). Grace period (agent_considered_offline_time, default 30 min) via paused_at column: within the period, only the original agent can reclaim; after, any agent can. Tasks from offline/stopped agents are available immediately. When reclaiming a paused task whose attack was also paused (shutdown cascade), the attack is resumed automatically.

Why Claim Clearing is Shutdown-Specific#

Other pause mechanisms in CipherSwarm do not clear claim fields:

  1. Attack-level pause: When attacks transition to paused state, they cascade the pause to tasks:
def pause_tasks
  tasks.without_state(:paused).each(&:pause)
end

This pauses tasks but does not clear claim fields, as the same agent is expected to resume work when the attack is unpaused.

  2. Campaign priority preemption: When higher-priority campaigns preempt lower-priority ones, tasks are preempted (not paused) and claim fields may remain set depending on the preemption implementation.

The shutdown-specific claim clearing ensures that orphaned tasks are immediately available for reassignment, rather than waiting for claim expiration timeouts.

Orphaned Task Detection and Reassignment#

Grace Period Mechanism#

The orphaned task recovery system uses a time-based grace period rather than checking agent state directly. When a task is paused during agent shutdown, the paused_at timestamp is set. This timestamp controls when the task becomes eligible for reassignment:

  • Within grace period (default 30 minutes via agent_considered_offline_time configuration): Only the original agent can reclaim its own paused tasks via find_own_paused_task
  • After grace period expires: Any agent can claim the orphaned task via find_unassigned_paused_task
  • Exception: Tasks from agents in offline or stopped states are immediately available for reassignment, bypassing the grace period

This grace period design ensures that agents can quickly resume their work after brief disconnections (e.g., network blips, restarts) by leveraging restore files, while preventing tasks from being permanently stuck if an agent never returns.
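
The eligibility rules above can be written as a single predicate. The sketch below is plain Ruby under stated assumptions: claimable_by_other_agent?, GRACE_PERIOD_SECONDS, and ORPHAN_STATES are hypothetical names, the 30-minute value is the default quoted in this article, and the handling of a missing paused_at follows the find_unassigned_paused_task query shown later in this article.

```ruby
GRACE_PERIOD_SECONDS = 30 * 60 # assumed default for agent_considered_offline_time
ORPHAN_STATES = %w[offline stopped].freeze

# True when an agent other than the owner may claim a paused, unclaimed task.
def claimable_by_other_agent?(paused_at:, owner_state:, now: Time.now, grace: GRACE_PERIOD_SECONDS)
  return true if paused_at.nil?                      # no timestamp: immediately available
  return true if ORPHAN_STATES.include?(owner_state) # owner offline/stopped: bypass grace period

  paused_at < now - grace                            # otherwise wait out the grace period
end

now = Time.at(1_000_000)
puts claimable_by_other_agent?(paused_at: now - 60, owner_state: "active", now: now)
puts claimable_by_other_agent?(paused_at: now - 1801, owner_state: "active", now: now)
puts claimable_by_other_agent?(paused_at: now - 60, owner_state: "offline", now: now)
```

The owner's own reclaim path is deliberately absent from this predicate, because the original agent is not subject to the grace period.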

find_own_paused_task Method#

The TaskAssignmentService#find_own_paused_task method allows agents to reclaim their own paused tasks:

def find_own_paused_task
  task = agent.tasks.with_state(:paused)
                .where(claimed_by_agent_id: [nil, agent.id])
                .joins(attack: { campaign: :hash_list })
                .where("EXISTS (SELECT 1 FROM hash_items WHERE hash_items.hash_list_id = hash_lists.id AND hash_items.cracked = false)")
                .order(:id)
                .first

  return nil unless task

  if task.attack.paused? && task.attack.can_resume?
    task.attack.resume!
    task.reload # attack.resume_tasks may have already resumed this task
  end
  task.resume! if task.paused? && task.can_resume?
  task
rescue StateMachines::InvalidTransition, ActiveRecord::StaleObjectError => e
  Rails.logger.error(
    "[TaskAssignmentService] Failed to resume own paused task #{task.id} " \
    "for agent #{agent.id}: #{e.class} - #{e.message}"
  )
  nil
end

Key characteristics:

  • Purpose: Reclaims the agent's own paused tasks after a restart
  • Scope: Queries only tasks where agent_id matches the current agent (via agent.tasks)
  • Claim handling: Accepts tasks with claimed_by_agent_id as either nil or the current agent's ID
  • No grace period check: This method does not filter by paused_at timestamp—agents can always reclaim their own paused tasks immediately
  • No locking: FOR UPDATE SKIP LOCKED is not needed because the query is already scoped to this agent's agent_id, preventing cross-agent races
  • Attack auto-resume: If the task's attack was paused during shutdown cascade, this method automatically resumes the attack before resuming the task
  • Restore file optimization: Allows agents to resume work using existing restore files, avoiding redundant computation
  • Error handling: Wrapped in a rescue block to catch state transition failures; returns nil if resume fails

find_unassigned_paused_task Method#

The TaskAssignmentService#find_unassigned_paused_task method handles orphaned task detection from other agents:

def find_unassigned_paused_task
  task = nil

  Task.transaction do
    task = Task.with_state(:paused)
               .where(claimed_by_agent_id: nil)
               .where.not(agent_id: agent.id)
               .joins(:agent)
               .where(
                 "tasks.paused_at IS NULL OR tasks.paused_at < :grace_cutoff OR agents.state IN (:orphan_states)",
                 grace_cutoff: ApplicationConfig.agent_considered_offline_time.ago,
                 orphan_states: %w[offline stopped]
               )
               .joins(attack: { campaign: :hash_list })
               .where(campaigns: { project_id: agent.project_ids })
               .where(hash_lists: { hash_type_id: allowed_hash_type_ids })
               .where("EXISTS (SELECT 1 FROM hash_items WHERE hash_items.hash_list_id = hash_lists.id AND hash_items.cracked = false)")
               .order(:id)
               .lock("FOR UPDATE OF tasks SKIP LOCKED")
               .first

    return nil unless task

    # Reassign ownership to the claiming agent
    task.update_columns(agent_id: agent.id)

    # Resume the attack if it was paused due to agent shutdown (not campaign pause).
    begin
      if task.attack.paused? && task.attack.can_resume?
        task.attack.resume!
        task.reload # attack.resume_tasks may have already resumed this task
      end

      # Transition to pending so the new agent can accept the task.
      # resume! moves paused -> pending, marks stale (so the agent re-downloads cracks),
      # and clears paused_at (removing the task from grace period queries).
      task.resume! if task.paused? && task.can_resume?
    rescue StateMachines::InvalidTransition, ActiveRecord::StaleObjectError => e
      Rails.logger.error(
        "[TaskAssignmentService] Failed to resume orphaned task #{task.id} " \
        "for agent #{agent.id}: #{e.class} - #{e.message}"
      )
      # Ownership was reassigned; task stays paused but belongs to the new agent.
      # Next cycle's find_own_paused_task will pick it up.
    end
  end

  task
end

Key implementation details:

  1. Grace period logic: The WHERE clause implements three conditions (OR):

    • tasks.paused_at IS NULL: Legacy tasks paused before the paused_at column was added (treated as immediately available)
    • tasks.paused_at < :grace_cutoff: Tasks paused longer ago than agent_considered_offline_time (30 minutes by default), meaning the grace period has expired
    • agents.state IN (:orphan_states): Tasks from offline/stopped agents (immediate availability, bypassing grace period)
  2. Purpose: Finds tasks from other agents that have been orphaned, not the current agent's own tasks

  3. Exclusion: The query includes .where.not(agent_id: agent.id) to explicitly exclude tasks owned by the current agent

  4. Attack auto-resume: When reclaiming a paused task whose attack was also paused during shutdown cascade, the method automatically resumes the attack

  5. Reload handling: After attack.resume!, the task is reloaded because the attack's resume_tasks callback may have already resumed all paused tasks, potentially causing a StaleObjectError if we attempt to resume again

  6. Error handling: The resume operations are wrapped in a rescue block. If the attack or task resume fails, ownership reassignment is preserved and the error is logged. The task remains paused but assigned to the new agent, allowing find_own_paused_task to retry on the next cycle.

Why find_own_paused_task Checks [nil, agent.id]#

Unlike find_unassigned_paused_task, the find_own_paused_task method checks for claimed_by_agent_id in [nil, agent.id] because:

  1. Agent's own tasks: The query is already scoped to agent.tasks, which filters by agent_id
  2. Claim status: Tasks may have been paused with the agent still holding the claim, or with the claim cleared during shutdown
  3. Semantic correctness: Including agent.id in the claim check allows agents to resume tasks they were actively working on before shutdown

Race Condition Prevention#

The find_unassigned_paused_task method uses PostgreSQL row-level locking:

.lock("FOR UPDATE OF tasks SKIP LOCKED")

According to inline documentation:

Uses FOR UPDATE SKIP LOCKED to prevent two agents from racing to claim the same task.

This ensures that when multiple agents simultaneously query for orphaned tasks, each agent locks a different task, preventing duplicate work. The find_own_paused_task method does not require this locking mechanism because its query is scoped to the current agent's agent_id, eliminating the possibility of cross-agent races.
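
The skip-locked idea can be demonstrated in memory as an analogy. The sketch below is not database code: it uses Ruby's Mutex#try_lock as a stand-in for a row lock, and LockableTask and claim_first_available are hypothetical names. A worker that cannot lock a task immediately moves on to the next one instead of waiting.

```ruby
# Hypothetical in-memory task with its own lock (stand-in for a row lock).
class LockableTask
  attr_reader :id, :lock
  attr_accessor :claimed_by

  def initialize(id)
    @id = id
    @lock = Mutex.new
    @claimed_by = nil
  end
end

# Claims the first task that is neither locked by another worker nor already claimed.
def claim_first_available(tasks, worker_id)
  tasks.each do |task|
    next unless task.lock.try_lock # lock held elsewhere: skip instead of blocking

    begin
      next if task.claimed_by # already taken by an earlier worker

      task.claimed_by = worker_id
      return task
    ensure
      task.lock.unlock
    end
  end
  nil
end

tasks = (1..3).map { |id| LockableTask.new(id) }
claimed = (1..3).map { |w| Thread.new { claim_first_available(tasks, w) } }.map(&:value)
puts claimed.map { |t| "task #{t.id} -> worker #{t.claimed_by}" }
```

Because a skipped task is always one that another worker has claimed or is about to claim, three workers competing for three tasks each end up with a different task.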

Task Assignment Priority#

The find_next_task method defines the priority order for task assignment:

def find_next_task
  find_existing_incomplete_task ||
    find_own_paused_task ||
    find_unassigned_paused_task ||
    find_task_from_available_attacks
end

Priority order:

  1. Incomplete tasks already assigned to this agent (highest priority) - Ensures agents complete their existing work before taking on new tasks
  2. Agent's own paused tasks - Allows returning agents to reclaim their paused tasks and leverage restore files within the grace period
  3. Orphaned paused tasks from other agents - Recovery mode for tasks from offline agents or tasks past the grace period
  4. New tasks from available attacks (lowest priority) - Normal operation when no incomplete or orphaned tasks exist

This two-stage reclamation process (own tasks → orphaned tasks) ensures that agents can quickly resume their work after brief disconnections, while still providing fault tolerance for permanently failed agents. The grace period mechanism prevents tasks from being immediately stolen during routine agent restarts.
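
The short-circuit behavior of this priority chain can be sketched with stub finders. Everything here is hypothetical: the lambdas stand in for the four finder methods, the task labels are made up, and first_match is an illustrative helper rather than part of TaskAssignmentService.

```ruby
# Returns [finder_name, task] for the first finder that yields a task, or nil.
def first_match(finders)
  finders.each do |name, finder|
    task = finder.call
    return [name, task] if task
  end
  nil
end

calls = []
finders = {
  existing_incomplete: -> { calls << :existing_incomplete; nil },
  own_paused:          -> { calls << :own_paused; nil },
  unassigned_paused:   -> { calls << :unassigned_paused; "orphaned-task" },
  new_from_attacks:    -> { calls << :new_from_attacks; "new-task" }
}

result = first_match(finders)
puts result.inspect
puts calls.inspect
```

In this run the orphaned task wins and the lowest-priority finder is never invoked, which matches how a chain of || expressions stops at the first non-nil result.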

Attack and Campaign State Propagation#

Attack Pause During Shutdown#

The agent shutdown cascade automatically pauses attacks that have no remaining active tasks. After pausing all running tasks, the shutdown callback checks each affected attack:

# Pause attacks that have no more active (non-paused) tasks.
# This updates the Activity page to reflect that work has stopped.
affected_attacks.each do |attack|
  next unless attack.can_pause?
  next if attack.tasks.without_states(:paused, :completed, :exhausted, :failed).exists?

  attack.pause!
end

Implementation details:

  • The callback maintains a Set of affected attacks (collected during task pausing)
  • For each attack, it checks if any tasks remain in active states (excluding paused, completed, exhausted, and failed)
  • If no active tasks remain and the attack can be paused (not already paused or in a terminal state), the attack transitions to the paused state
  • This ensures the Activity page accurately reflects that work has stopped on these attacks

This bidirectional cascade keeps task and attack states consistent: shutdown pauses tasks → attacks left with no active tasks are paused → resuming an attack resumes its paused tasks.

Attack State Machine Behavior#

The attack state machine includes bidirectional pause and resume callbacks:

after_transition any => :paused, :do => :pause_tasks

def pause_tasks
  tasks.without_state(:paused).each(&:pause)
end

after_transition any => :running, :do => :resume_tasks

def resume_tasks
  tasks.with_state(:paused).each(&:resume!)
end

This bidirectional relationship means:

  • Attack → Task: When an attack is paused (administratively or via shutdown cascade), all non-paused tasks are paused
  • Task → Attack: When orphaned tasks are reclaimed and their paused attacks are resumed, all paused tasks in that attack are automatically resumed

Important: The resume_tasks callback can cause StaleObjectError if task code attempts to resume a task after attack.resume! has already resumed it. This is why TaskAssignmentService includes task.reload before attempting task.resume! (see the "Reload handling" detail in the orphaned task section).

Campaign State Computation#

Campaigns do not have a persistent state column. Instead, the paused? method computes campaign state dynamically:

def paused?
  attacks.without_states(%i[paused completed]).empty? && attacks.with_state(:paused).any?
end

A campaign is considered paused when:

  1. All non-completed attacks are in the paused state
  2. At least one attack is paused

The completed? method similarly computes completion status:

def completed?
  return true unless hash_list.uncracked_items.exists?
  !attacks.without_state(:completed).exists?
end

Why campaign state reflects shutdown automatically: Since campaign state is computed on demand from attack states, no explicit campaign transition occurs during agent shutdown. When campaign.paused? is called, it queries the attack states at that moment, so it automatically reflects any changes caused by the shutdown cascade (including attacks paused because they have no remaining active tasks).
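
The computed paused? predicate can be tried on plain arrays of attack states. The sketch below is an approximation: campaign_paused? is a hypothetical helper that takes state symbols instead of querying ActiveRecord relations.

```ruby
# Mirrors the paused? predicate above: every attack is paused or completed,
# and at least one attack is paused.
def campaign_paused?(attack_states)
  attack_states.all? { |s| %i[paused completed].include?(s) } &&
    attack_states.include?(:paused)
end

puts campaign_paused?(%i[paused completed]) # all remaining work is paused
puts campaign_paused?(%i[paused running])   # one attack is still running
puts campaign_paused?(%i[completed])        # nothing is paused
```

No state is stored for the campaign in this model, so pausing the last running attack changes the result on the next call without any further bookkeeping.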

Bottom-Up State Aggregation Pattern#

CipherSwarm implements a bottom-up state aggregation architecture with bidirectional cascades:

Task states (stored) ⟷ Attack states (stored) → Campaign states (computed)

The bidirectional arrows between Task and Attack states represent:

  • Downward cascade: Attack pause → pauses tasks; Attack resume → resumes tasks
  • Upward cascade: Agent shutdown → pauses tasks → pauses attacks with no active tasks; Task reclaim → resumes attacks

This design ensures consistency between task and attack states during both shutdown and recovery, while eliminating the need for explicit campaign state transitions. Campaign state automatically reflects the aggregate state of its attacks, which in turn reflect the aggregate state of their tasks.

Agent Detection of Paused Tasks#

When agents submit status updates during task execution, the server responds with HTTP status codes that signal task state changes. The StatusSubmissionService determines the response:

def determine_response_status
  if task.stale
    Result.new(status: :stale)
  elsif task.paused?
    Result.new(status: :paused)
  else
    Result.new(status: :ok)
  end
end

The TasksController handles paused status by returning HTTP 410 Gone:

when :paused
  Rails.logger.info("[Agent #{@agent.id}] Task #{@task.id} - Status accepted, task is paused")
  head :gone

Agents detect paused tasks through this HTTP 410 response during status submission. This allows agents to gracefully stop processing tasks that have been paused due to shutdown of other agents or administrative actions.
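
On the agent side, the reaction to this response could look like the sketch below. Only the 410 mapping comes from this article. The function name, the returned symbols, and the treatment of other status codes are assumptions for illustration, and the real agent may be implemented differently.

```ruby
# Maps the HTTP status of a status-submission response to an agent action.
# Only the 410 branch is documented above; the other branches are assumptions.
def action_for_status_response(http_status)
  case http_status
  when 410 then :stop_task      # server reports the task as paused
  when 200..299 then :continue  # assumed: any 2xx means keep working
  else :unhandled               # other codes are outside the scope of this sketch
  end
end

puts action_for_status_response(410)
puts action_for_status_response(200)
```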

Task Resume Behavior#

When a paused task is resumed (either by the original agent reclaiming it or by a new agent claiming it), it transitions from paused to pending:

event :resume do
  transition paused: :pending
  transition any => same
end

after_transition on: :resume do |task|
  task.send(:log_state_transition, "resumed", "Marking as stale and clearing paused_at")
  task.send(:mark_resumed_safely)
end

The mark_resumed_safely method clears the paused_at timestamp and marks the task as stale:

def mark_resumed_safely
  update_columns(stale: true, paused_at: nil)
rescue ActiveRecord::ActiveRecordError => e
  Rails.logger.error(
    "[Task #{id}] Error updating stale/paused_at in resume callback - " \
    "Error: #{e.class} - #{e.message} - #{Time.current}"
  )
  # Don't re-raise - task state transition already succeeded
end

Key behaviors:

  • Resumed tasks transition to pending, not directly to running
  • Tasks are marked stale: true, requiring agents to re-download crack information
  • The paused_at timestamp is cleared via update_columns, removing the task from grace period tracking
  • Error handling ensures the resume operation succeeds even if the timestamp/stale updates fail
  • This ensures agents receive updated hash crack status before resuming work, preventing redundant cracking attempts

The new agent must then accept the task (transitioning from pending to running) before beginning execution.
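
The resume path can be walked through with an in-memory stand-in. This is a sketch under stated assumptions: SketchResumableTask and its accept method are hypothetical, only the paused to pending path is modeled, and any side effects the real callbacks may have on loopback transitions are not represented.

```ruby
# Hypothetical in-memory task; only the paused -> pending path is modeled.
class SketchResumableTask
  attr_reader :state, :stale, :paused_at

  def initialize(state:, paused_at: nil)
    @state = state
    @stale = false
    @paused_at = paused_at
  end

  # Stand-in for resume!: a paused task moves to pending, is flagged stale,
  # and loses its paused_at timestamp. Other states are left untouched here.
  def resume
    return self unless @state == :paused

    @state = :pending
    @stale = true
    @paused_at = nil
    self
  end

  # Stand-in for the agent accepting the task (pending -> running).
  def accept
    @state = :running if @state == :pending
    self
  end
end

task = SketchResumableTask.new(state: :paused, paused_at: Time.now - 120)
task.resume
puts "after resume: state=#{task.state} stale=#{task.stale} paused_at=#{task.paused_at.inspect}"
task.accept
puts "after accept: state=#{task.state}"
```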

Deployment Considerations for Airgapped Environments#

In airgapped laboratory and secure environments where CipherSwarm is commonly deployed, the agent shutdown cascade provides critical resilience features:

Network Reliability#

Airgapped networks often experience connectivity issues between isolated network segments. The shutdown cascade's grace period mechanism ensures that tasks are not immediately stolen during brief network interruptions, allowing the original agent to reconnect and reclaim its work using restore files. After the grace period expires (or if the agent is confirmed offline/stopped), other agents can immediately detect and claim orphaned tasks, ensuring work continues despite extended outages.

Resource Management#

In secure environments with limited computational resources, the priority-based task assignment system (incomplete → own paused → orphaned → new) ensures efficient resource utilization. When an agent restarts after maintenance or system updates, it prioritizes reclaiming its own paused tasks to leverage existing restore files, minimizing redundant work. Only after exhausting its own paused tasks does it help recover orphaned tasks from other agents.

Auditability#

The structured logging throughout the shutdown cascade provides comprehensive audit trails for security-conscious environments:

  • Agent state transitions are logged with timestamps
  • Task state changes include agent ID, attack ID, and state transitions
  • Claim field modifications and paused_at timestamps are tracked for forensic analysis
  • Attack pause events during shutdown are logged to explain why attacks transition to paused state

Database Integrity#

The use of PostgreSQL row-level locking (FOR UPDATE SKIP LOCKED) prevents race conditions even in high-concurrency scenarios common in large-scale password cracking operations. This ensures that multiple agents competing for orphaned tasks never duplicate work, critical for efficient resource usage in compute-constrained environments. The partial index on paused_at for paused tasks (WHERE state = 'paused') optimizes grace period queries by reducing index size and improving query performance for orphaned task detection.

Relevant Code Files#

| File Path | Purpose | Key Components |
| --- | --- | --- |
| app/models/agent.rb | Agent lifecycle management | shutdown event, after_transition callback, claim field clearing, attack pause logic |
| app/models/concerns/task_state_machine.rb | Task state transitions | pause event, resume event, paused_at timestamp management, state transition callbacks |
| app/services/task_assignment_service.rb | Task assignment and orphan recovery | find_own_paused_task, find_unassigned_paused_task, find_next_task, grace period logic, row-level locking |
| app/models/concerns/attack_state_machine.rb | Attack state transitions | pause_tasks callback, resume_tasks callback, bidirectional task-attack cascades |
| app/models/campaign.rb | Campaign state computation | paused? method, completed? method |
| app/models/task.rb | Task model and scopes | incomplete, finished, running scopes, paused_at column, partial index |
| app/services/status_submission_service.rb | Agent status update handling | determine_response_status for pause detection |
| app/controllers/api/v1/client/tasks_controller.rb | Agent API endpoints | submit_status action, HTTP 410 response handling |

Agent State Machine#

The agent shutdown cascade is part of the broader Agent State Machine that manages agent lifecycle events including activation, deactivation, and error handling.

Task Assignment System#

The cascade interacts with the Task Assignment System which implements priority-based task distribution and orphan recovery.

Attack State Machine#

The Attack State Machine defines bidirectional state propagation between attacks and tasks. During shutdown, attacks are automatically paused when they have no remaining active tasks. During task reclaim, attacks are automatically resumed when paused tasks are recovered.

Campaign Priority Management#

Campaign priority preemption represents an alternative pause mechanism that may interact with shutdown cascade behavior when agents are processing multiple priority levels.

Architectural Diagram#

This sequence diagram illustrates the complete lifecycle of an agent shutdown and subsequent task recovery, showing:

  • The automatic attack pause when no active tasks remain
  • The grace period mechanism that allows agents to reclaim their own paused tasks
  • The two-stage reclamation process (own tasks → orphaned tasks)
  • The bidirectional cascade between attacks and tasks during resume
  • The task.reload pattern to avoid StaleObjectError when attack.resume_tasks has already resumed tasks

This article documents the Agent Shutdown Cascade behavior as implemented in CipherSwarm. The system uses a time-based grace period mechanism (via the paused_at timestamp) to enable efficient task recovery, with automatic attack pause/resume coordination during the shutdown and reclaim processes. The two-stage reclamation priority (own tasks → orphaned tasks) ensures agents can leverage restore files after brief disconnections while providing fault tolerance for permanently failed agents.
