Tech Plan: Operational Excellence Implementation
This technical plan defines the implementation approach for the CipherSwarm V2 Operational Excellence Epic. The plan extends the existing Rails 8 + Hotwire + ViewComponent architecture without introducing new frameworks, ensuring maintainability for a solo part-time developer.
Architectural Approach
1. Real-Time Updates Strategy
Decision: Targeted Turbo Stream Broadcasts
Replace the current broadcasts_refreshes pattern with targeted Turbo Stream broadcasts to avoid disrupting user interactions (e.g., form inputs, scrolling).
Current Pattern:
# app/models/agent.rb
broadcasts_refreshes unless Rails.env.test?
New Pattern:
# app/models/agent.rb
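# Dirty-tracking *_changed? predicates are always false after commit; use saved_change_to_*? (the attribute is state)
after_update_commit :broadcast_status_update, if: :saved_change_to_state?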
def broadcast_status_update
broadcast_replace_to self,
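target: ActionView::RecordIdentifier.dom_id(self, :status), # "status_agent_<id>", the ID the view builds with dom_id(agent, :status)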
partial: "agents/status_card",
locals: { agent: self }
rescue => e
Rails.logger.error("Failed to broadcast agent status update: #{e.message}")
Rails.logger.error(e.backtrace.join("\n"))
end
Scoping Strategy:
- Agent List: Broadcast individual agent status cards (not entire table)
- Agent Detail: Broadcast specific tabs (Overview, Errors) independently
- Campaign Detail: Broadcast attack progress bars individually
- Task Detail: Broadcast task status section only
Trade-offs:
- ✅ Avoids disrupting user interactions
- ✅ Reduces bandwidth (smaller updates)
- ✅ More granular control over updates
- ❌ More manual broadcast management
- ❌ Requires careful DOM element targeting
Implementation Notes:
- Use the dom_id helper for consistent target IDs
- Wrap broadcastable sections in divs with unique IDs
- Test broadcasts don't interfere with Stimulus controllers
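As a rough illustration of the ID convention, Rails' dom_id(record, prefix) yields a string of the form prefix_model_id, so a broadcast target has to use exactly that string to reach the wrapper div. The plain-Ruby helper below only mimics that shape for single-word model names; target_id_for is a hypothetical name, and application code would call dom_id itself.

```ruby
# Hypothetical stand-in that mimics the "prefix_model_id" shape of Rails' dom_id.
# It handles single-word model names only; real code would call dom_id instead.
def target_id_for(model_name, id, prefix = nil)
  base = "#{model_name.downcase}_#{id}"
  prefix ? "#{prefix}_#{base}" : base
end

puts target_id_for("Agent", 42, :status) # status_agent_42
puts target_id_for("Agent", 42)          # agent_42
```

Note that the prefix comes first, so a hand-written target such as "agent_status_42" would not match an element whose ID was generated with dom_id(agent, :status).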
2. Structured Logging with Lograge
Decision: Add Lograge for JSON-formatted request logging
Implement structured logging to improve debuggability and enable log parsing/analysis.
Configuration:
# config/initializers/lograge.rb
Rails.application.configure do
config.lograge.enabled = true
config.lograge.formatter = Lograge::Formatters::Json.new
config.lograge.custom_options = lambda do |event|
{
user_id: event.payload[:user_id],
agent_id: event.payload[:agent_id],
ip: event.payload[:ip],
request_id: event.payload[:request_id]
}
end
end
Logging Levels:
- INFO: Agent lifecycle events (connect, disconnect, heartbeat)
- INFO: Task state transitions (accepted, completed, failed)
- INFO: API requests/responses with timing
- WARN: Performance issues (slow queries, high memory)
- ERROR: Application errors with context
- FATAL: Critical failures (database connection lost)
Structured Log Format:
{
"method": "GET",
"path": "/agents/1",
"status": 200,
"duration": 45.2,
"user_id": 123,
"timestamp": "2024-01-07T10:30:00Z"
}
Trade-offs:
- ✅ Easy to parse and analyze
- ✅ Consistent format across application
- ✅ Integrates with log aggregation tools
- ❌ Slightly more verbose logs
- ❌ One additional gem dependency
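The custom_options lambda can only report keys that are already present in event.payload, so fields such as user_id have to be added to the payload by the controller first (in Rails this is typically done by overriding append_info_to_payload). The sketch below is plain Ruby with a stand-in Event struct rather than a real notification event; it shows the same lambda shape with nil fields dropped, which is an optional refinement and not part of the plan's configuration.

```ruby
require "json"

# Stand-in for the notification event Lograge passes to custom_options (assumption).
Event = Struct.new(:payload)

# Same shape as the custom_options lambda in the plan, with nil fields dropped
# so requests without a user or agent do not log null values.
custom_options = lambda do |event|
  {
    user_id: event.payload[:user_id],
    agent_id: event.payload[:agent_id],
    ip: event.payload[:ip],
    request_id: event.payload[:request_id]
  }.compact
end

event = Event.new({ user_id: 123, ip: "10.0.0.5", request_id: "abc-123" })
base = { method: "GET", path: "/agents/1", status: 200, duration: 45.2 }
puts JSON.generate(base.merge(custom_options.call(event)))
```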
3. Caching Strategy
Decision: Rails.cache with TTL-based expiration
Use Rails.cache for expensive queries and health checks to balance performance and freshness.
Caching Targets:
- System Health Checks: 1-minute TTL
- Agent Metrics: 30-second TTL (updated on status broadcast)
- Campaign ETAs: 1-minute TTL
- Recent Cracks: 1-minute TTL
Cache Keys:
# System health
Rails.cache.fetch("system_health", expires_in: 1.minute) { check_services }
# Agent metrics
Rails.cache.fetch("agent_metrics_#{agent.id}", expires_in: 30.seconds) { calculate_metrics }
# Campaign ETA
Rails.cache.fetch("#{campaign.cache_key_with_version}/eta", expires_in: 1.minute) { calculate_eta }
Cache Invalidation:
- Automatic expiration via TTL
- Manual invalidation on critical updates (e.g., Rails.cache.delete("agent_metrics_#{id}"))
- Use cache_key_with_version for model-based caching
Trade-offs:
- ✅ Reduces database load
- ✅ Improves page load times
- ✅ Simple TTL-based expiration
- ❌ Slight staleness (acceptable for monitoring)
- ❌ Cache warming on first request
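To make the fetch-with-TTL behaviour concrete, here is a minimal in-memory sketch. TtlCache is a hypothetical class written only for illustration (the plan itself relies on Rails.cache), and the injected clock exists so that expiry can be demonstrated without sleeping.

```ruby
# Minimal in-memory cache that mimics fetch-with-TTL semantics.
# TtlCache is hypothetical; the plan itself uses Rails.cache.
class TtlCache
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @store = {}
  end

  # Returns the cached value while it is fresh; otherwise runs the block and stores the result.
  def fetch(key, expires_in:)
    entry = @store[key]
    return entry[:value] if entry && @clock.call < entry[:expires_at]

    value = yield
    @store[key] = { value: value, expires_at: @clock.call + expires_in }
    value
  end

  # Manual invalidation, the counterpart of Rails.cache.delete.
  def delete(key)
    @store.delete(key)
  end
end

now = 0.0
cache = TtlCache.new(clock: -> { now })
calls = 0
compute = -> { calls += 1 }

cache.fetch("agent_metrics_1", expires_in: 30) { compute.call } # miss: runs the block
cache.fetch("agent_metrics_1", expires_in: 30) { compute.call } # hit: block skipped
now = 31.0
cache.fetch("agent_metrics_1", expires_in: 30) { compute.call } # expired: runs again
puts calls # 2
```

The second call is the staleness trade-off in miniature: for up to 30 seconds the caller sees the earlier value even if the underlying data has changed.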
4. UI Loading States
Decision: Hybrid approach (skeleton screens + spinners)
Use skeleton screens for major components (agent list, campaign list) and spinners for smaller interactions (modals, forms).
Skeleton Screens:
- Agent list loading state
- Campaign list loading state
- System health dashboard loading state
Spinners:
- Modal dialogs (error details, task actions)
- Form submissions
- Button actions (cancel, retry, reassign)
Implementation:
- Create SkeletonLoaderComponent for reusable skeleton patterns
- Use Bootstrap spinner component for simple loading states
- Turbo Frame loading states with turbo:before-fetch-request event
Trade-offs:
- ✅ Better perceived performance (skeletons)
- ✅ Simple implementation (spinners)
- ✅ Balanced complexity
- ❌ Need to maintain skeleton templates
5. Tabbed Interface Implementation
Decision: Stimulus controller with hidden divs
Implement the tabbed agent detail page using a Stimulus controller for fast tab switching without network requests.
Implementation:
// app/javascript/controllers/tabs_controller.js
import {
Controller
} from "@hotwired/stimulus";
export default class extends Controller {
static targets = ["tab", "panel"];
switch (event) {
const tabName = event.currentTarget.dataset.tabName;
// Hide all panels
this.panelTargets.forEach((panel) => panel.classList.add("d-none"));
// Show selected panel
const selectedPanel = this.panelTargets.find(
(panel) => panel.dataset.tabPanel === tabName,
);
// Guard against a tab whose panel is missing from the DOM
selectedPanel?.classList.remove("d-none");
// Update active tab styling
this.tabTargets.forEach((tab) => tab.classList.remove("active"));
event.currentTarget.classList.add("active");
}
}
Trade-offs:
- ✅ Fast tab switching (no network request)
- ✅ All content loaded upfront (good for small datasets)
- ✅ Simple Stimulus controller
- ❌ Larger initial page load
- ❌ No URL updates (not bookmarkable)
6. Toast Notifications
Decision: Stimulus controller for Bootstrap toast component
Implement toast notifications using a Stimulus controller that triggers the Bootstrap 5 toast component.
Implementation:
// app/javascript/controllers/toast_controller.js
import {
Controller
} from "@hotwired/stimulus";
import {
Toast
} from "bootstrap";
export default class extends Controller {
connect() {
const toast = new Toast(this.element, {
autohide: true,
delay: 5000,
});
toast.show();
}
}
Usage:
<div class="toast" data-controller="toast" role="alert">
<div class="toast-body">
<%= message %>
</div>
</div>
Trade-offs:
- ✅ Native Bootstrap component
- ✅ Auto-dismiss after 5 seconds
- ✅ Can be triggered from Turbo Stream responses
- ❌ Requires Stimulus controller
7. System Health Monitoring
Decision: Stateless health checks with Rails.cache
Implement system health monitoring without database persistence, caching check results in Rails.cache with a 1-minute TTL.
Health Check Services:
- PostgreSQL: ActiveRecord::Base.connection.active?
- Redis: Redis.current.ping
- MinIO: S3 bucket access check
- Sidekiq: Queue stats and worker count
Implementation:
# app/controllers/system_health_controller.rb
def index
@health_status = fetch_health_status_with_lock
end
def fetch_health_status_with_lock
# Use Redis lock to prevent cache stampede
lock_key = "system_health_check_lock"
# Try to get cached value first
cached = Rails.cache.read("system_health")
return cached if cached
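# Acquire the lock; SET with nx: true returns true only for the first caller.
# Redis#set does not take a block, so the checks run inside an explicit conditional.
# Note: Redis.current was removed in redis-rb 5.0; substitute the app's own client there.
if Redis.current.set(lock_key, "locked", nx: true, ex: 10)
# Run health checks
status = {
postgresql: check_postgresql,
redis: check_redis,
minio: check_minio,
sidekiq: check_sidekiq
}
# Cache results for 1 minute
Rails.cache.write("system_health", status, expires_in: 1.minute)
return status
end
# If we couldn't get the lock, return the cached value or a "checking" placeholder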
Rails.cache.read("system_health") || {
postgresql: { status: :checking },
redis: { status: :checking },
minio: { status: :checking },
sidekiq: { status: :checking }
}
end
private
def check_postgresql
start_time = Time.current
ActiveRecord::Base.connection.execute("SELECT 1")
latency = ((Time.current - start_time) * 1000).round(2)
{ status: :healthy, latency: latency }
rescue => e
Rails.logger.error("PostgreSQL health check failed: #{e.message}")
{ status: :unhealthy, error: e.message }
end
Trade-offs:
- ✅ No database table needed
- ✅ Fast reads (cached)
- ✅ Simple implementation
- ❌ No historical data
- ❌ First request runs checks (cache warming)
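Each check in this plan follows the same pattern: time the call, then return either a healthy result with latency or an unhealthy result with the error message. A small sketch of that pattern as a reusable wrapper is given here. timed_check is a hypothetical helper name, and the monotonic clock is a substitution for the Time.current arithmetic in the plan's code, chosen because it is not affected by wall-clock adjustments.

```ruby
# Hypothetical helper: wraps any check in timing and error handling so every
# service check returns the same { status:, latency: } or { status:, error: } shape.
def timed_check
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { status: :healthy, latency: (elapsed * 1000).round(2) } # latency in milliseconds
rescue StandardError => e
  { status: :unhealthy, error: e.message }
end

ok = timed_check { 1 + 1 }
failed = timed_check { raise IOError, "connection refused" }
puts ok[:status]
puts failed[:error]
```

With a wrapper like this, a check such as check_redis would reduce to a single call that passes the ping in a block.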
Data Model
1. Database Schema Changes
Migration: Add Performance Indexes
Add indexes to optimize queries for monitoring features.
# db/migrate/YYYYMMDDHHMMSS_add_performance_indexes.rb
class AddPerformanceIndexes < ActiveRecord::Migration[8.0]
def change
# Index for recent cracks query
add_index :hash_items, :cracked_time
# Index for recent errors query
add_index :agent_errors, :created_at
# Index for latest status query
add_index :hashcat_statuses, :time
# Composite index for agent task lookup
add_index :tasks, [:agent_id, :state]
end
end
Migration: Add Cached Agent Metrics
Add columns to Agent model for cached performance metrics.
# db/migrate/YYYYMMDDHHMMSS_add_cached_metrics_to_agents.rb
class AddCachedMetricsToAgents < ActiveRecord::Migration[8.0]
def change
add_column :agents, :current_hash_rate, :decimal, precision: 20, scale: 2
add_column :agents, :current_temperature, :integer
add_column :agents, :current_utilization, :integer
add_column :agents, :metrics_updated_at, :datetime
add_index :agents, :metrics_updated_at
end
end
2. Model Extensions
Agent Model: Cached Metrics
Add methods to update and retrieve cached performance metrics.
# app/models/agent.rb
class Agent < ApplicationRecord
# ... existing code ...
def hash_rate_display
return "—" unless current_hash_rate
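return "0 H/s" if current_hash_rate.zero?
# number_to_human is a view helper with no SI prefix option, so call
# ActiveSupport::NumberHelper directly with explicit hash-rate units.
ActiveSupport::NumberHelper.number_to_human(
current_hash_rate,
units: { unit: "H/s", thousand: "kH/s", million: "MH/s", billion: "GH/s", trillion: "TH/s" }
)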
end
end
HashcatStatus Model: Update Agent Metrics
Add callback to update agent metrics when status is created.
# app/models/hashcat_status.rb
class HashcatStatus < ApplicationRecord
belongs_to :task
after_create_commit :update_agent_metrics
private
def update_agent_metrics
agent = task.agent
return unless agent
# Compare as a string: a string-backed enum value never equals the symbol :running
return unless status.to_s == "running"
# Only update if metrics are stale (30 seconds)
return if agent.metrics_updated_at && agent.metrics_updated_at > 30.seconds.ago
agent.update_columns(
current_hash_rate: hash_rate,
current_temperature: device_temperature,
current_utilization: device_utilization,
metrics_updated_at: Time.current
)
rescue => e
Rails.logger.error("Failed to update agent metrics: #{e.message}")
end
end
Rationale: HashcatStatus is the source of truth for agent metrics. By updating Agent from HashcatStatus callback, we ensure metrics are updated whenever new status data arrives, avoiding race conditions and stale data.
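The 30-second throttle in update_agent_metrics can be read as a small predicate: write the metrics only when they have never been set or are at least 30 seconds old. The sketch below restates that guard as a pure function so the boundary cases are easy to see; metrics_stale? and METRICS_WINDOW_SECONDS are hypothetical names, not part of the plan's code.

```ruby
# Hypothetical pure-function version of the staleness guard in update_agent_metrics.
METRICS_WINDOW_SECONDS = 30

def metrics_stale?(metrics_updated_at, now: Time.now, window: METRICS_WINDOW_SECONDS)
  metrics_updated_at.nil? || (now - metrics_updated_at) >= window
end

now = Time.utc(2024, 1, 7, 10, 30, 0)
puts metrics_stale?(nil, now: now)      # true: never updated
puts metrics_stale?(now - 10, now: now) # false: updated 10 seconds ago
puts metrics_stale?(now - 45, now: now) # true: older than the window
```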
Campaign Model: ETA Calculation
Add methods to calculate campaign estimated finish time (both current and total).
# app/models/campaign.rb
class Campaign < ApplicationRecord
# ... existing code ...
# Current attack ETA (only running attacks)
def current_eta
Rails.cache.fetch("#{cache_key_with_version}/current_eta", expires_in: 1.minute) do
calculate_current_eta
end
end
# Total campaign ETA (all incomplete attacks)
def total_eta
Rails.cache.fetch("#{cache_key_with_version}/total_eta", expires_in: 1.minute) do
calculate_total_eta
end
end
private
def calculate_current_eta
running_attacks = attacks.with_state(:running)
return nil if running_attacks.empty?
# Return the maximum ETA of all running attacks
etas = running_attacks.map(&:estimated_finish_time).compact
etas.max
end
def calculate_total_eta
incomplete_attacks = attacks.without_states(:completed, :exhausted)
return nil if incomplete_attacks.empty?
# Get estimated finish times for running attacks
running_etas = incomplete_attacks.with_state(:running).map(&:estimated_finish_time).compact
pending_count = incomplete_attacks.with_state(:pending).count
# If we have running attacks, convert ETAs to durations (seconds remaining)
if running_etas.any?
# Convert Time objects to durations (seconds from now)
# Clamp overdue ETAs to zero so a past estimate cannot shorten the total
running_durations = running_etas.map { |t| [(t - Time.current).to_f, 0.0].max }
avg_duration = running_durations.sum / running_durations.size
running_max_seconds = running_durations.max.to_f
# Total = max running duration + estimated time for all pending attacks
total_seconds = running_max_seconds + (pending_count * avg_duration)
Time.current + total_seconds.seconds
else
nil # Can't estimate without running attacks
end
end
end
Rationale: Showing both current and total ETA gives users a complete picture: "Current attack finishes in 2h, entire campaign finishes in 6h". This manages expectations better than showing only current attack ETA.
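The total-ETA heuristic in calculate_total_eta is simple arithmetic: take the longest remaining time among running attacks, then add the average running duration once per pending attack. The plain-Ruby restatement below works on seconds only and assumes, as the model code does, that pending attacks run one after another and take about as long as the current ones; total_eta_seconds is a hypothetical name.

```ruby
# Plain-Ruby restatement of the total-ETA estimate: the longest running attack,
# plus the average running duration for each pending attack.
def total_eta_seconds(running_durations, pending_count)
  return nil if running_durations.empty? # no running attacks, nothing to extrapolate from

  avg = running_durations.sum / running_durations.size.to_f
  running_durations.max + pending_count * avg
end

# One running attack with 2 hours left and two pending attacks:
seconds = total_eta_seconds([2 * 3600], 2)
puts seconds / 3600.0 # 6.0
```

This reproduces the "current attack finishes in 2h, entire campaign finishes in 6h" example: 2 h for the running attack plus 2 × 2 h for the pending ones.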
HashList Model: Recent Cracks
Add method to retrieve recently cracked hashes (last 24 hours).
# app/models/hash_list.rb
class HashList < ApplicationRecord
# ... existing code ...
def recent_cracks(limit: 100)
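# Include the limit in the key so calls with different limits do not share an entry
Rails.cache.fetch("#{cache_key_with_version}/recent_cracks/#{limit}", expires_in: 1.minute) do
hash_items
.where("cracked_time > ?", 24.hours.ago)
.order(cracked_time: :desc)
.limit(limit)
.to_a # Load the records so the results, not a lazy relation, are cached
end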
end
def recent_cracks_count
Rails.cache.fetch("#{cache_key_with_version}/recent_cracks_count", expires_in: 1.minute) do
hash_items.where("cracked_time > ?", 24.hours.ago).count
end
end
end
3. State Machine Extensions
Task Model: Add Retry Event
Add retry event to Task state machine for proper failed → pending transition.
# app/models/task.rb
class Task < ApplicationRecord
# ... existing code ...
state_machine :state, initial: :pending do
# ... existing events ...
# New retry event for manual task retry
event :retry do
transition failed: :pending
end
after_transition on: :retry do |task|
Rails.logger.info("[Task #{task.id}] Agent #{task.agent_id} - Attack #{task.attack_id} - State change: failed -> pending - Task manually retried")
task.increment!(:retry_count)
task.update(last_error: nil) # Clear previous error
end
end
end
Rationale: Adding a proper state machine event ensures retry follows the same lifecycle management as other state transitions, triggers appropriate callbacks, and maintains system invariants.
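The contract of the retry event can be sketched without the state machine library: the transition succeeds only from failed, and the side effects (incrementing the retry count, clearing the last error) happen only when it succeeds. TaskSketch and attempt_retry are hypothetical names standing in for the Task model and its retry event; this is an illustration of the invariant, not the gem's implementation.

```ruby
# Hypothetical, dependency-free model of the retry event: only failed -> pending is allowed.
class TaskSketch
  attr_reader :state, :retry_count, :last_error

  def initialize(state:, last_error: nil)
    @state = state
    @retry_count = 0
    @last_error = last_error
  end

  # Stands in for the retry event: returns false and changes nothing from any other state.
  def attempt_retry
    return false unless state == :failed

    @state = :pending
    @retry_count += 1
    @last_error = nil
    true
  end
end

failed_task = TaskSketch.new(state: :failed, last_error: "GPU timeout")
running_task = TaskSketch.new(state: :running)
puts failed_task.attempt_retry  # true
puts running_task.attempt_retry # false
```

A direct update(state: :pending) would skip exactly this guard, which is why the controller calls the event instead.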
4. Authorization Rules
CanCanCan Abilities: Task Management
Add project-based authorization for task management actions.
# app/models/ability.rb
class Ability
include CanCan::Ability
def initialize(user)
# ... existing code ...
# Task management (project-based)
can :read, Task, attack: { campaign: { project_id: user.project_ids } }
can :cancel, Task, attack: { campaign: { project_id: user.project_ids } }
can :retry, Task, attack: { campaign: { project_id: user.project_ids } }
can :reassign, Task, attack: { campaign: { project_id: user.project_ids } }
can :download_results, Task, attack: { campaign: { project_id: user.project_ids } }
# Admins can manage all tasks
can :manage, Task if user.admin?
end
end
Component Architecture
1. New ViewComponents
AgentStatusCardComponent
Displays agent status in list view with real-time updates.
# app/components/agent_status_card_component.rb
class AgentStatusCardComponent < ApplicationViewComponent
option :agent, required: true
def status_badge_variant
case agent.state
when "active" then "success"
when "offline" then "danger"
when "pending" then "warning"
else "secondary"
end
end
def error_count
agent.agent_errors.where("created_at > ?", 24.hours.ago).count
end
end
AgentDetailTabsComponent
Tabbed interface for agent detail page.
# app/components/agent_detail_tabs_component.rb
class AgentDetailTabsComponent < ApplicationViewComponent
option :agent, required: true
renders_one :overview_tab
renders_one :errors_tab
renders_one :configuration_tab
renders_one :capabilities_tab
end
CampaignProgressComponent
Progress bar with ETA for campaign attacks.
# app/components/campaign_progress_component.rb
class CampaignProgressComponent < ApplicationViewComponent
option :attack, required: true
def progress_percentage
attack.percentage_complete
end
def eta_text
return "Calculating..." unless attack.estimated_finish_time
"ETA: #{distance_of_time_in_words_to_now(attack.estimated_finish_time)}"
end
end
ErrorModalComponent
Modal dialog for displaying error details.
# app/components/error_modal_component.rb
class ErrorModalComponent < ApplicationViewComponent
option :error, required: true
option :modal_id, required: true
def severity_badge_variant
case error.severity
when "fatal" then "danger"
when "error" then "danger"
when "warning" then "warning"
when "info" then "info"
else "secondary"
end
end
end
SystemHealthCardComponent
Service status card for system health dashboard.
# app/components/system_health_card_component.rb
class SystemHealthCardComponent < ApplicationViewComponent
option :service_name, required: true
option :status, required: true
option :latency, default: proc { nil }
option :error, default: proc { nil }
def status_variant
status == :healthy ? "success" : "danger"
end
def status_icon
status == :healthy ? "check-circle" : "x-circle"
end
end
TaskActionsComponent
Action buttons for task management.
# app/components/task_actions_component.rb
class TaskActionsComponent < ApplicationViewComponent
option :task, required: true
def can_cancel?
task.pending? || task.running?
end
def can_retry?
task.failed?
end
def can_reassign?
task.pending? || task.failed?
end
end
SkeletonLoaderComponent
Loading state placeholder for major components.
# app/components/skeleton_loader_component.rb
class SkeletonLoaderComponent < ApplicationViewComponent
option :type, required: true # :agent_list, :campaign_list, :health_dashboard
option :count, default: proc { 5 }
end
ToastNotificationComponent
Toast notification for success/error feedback.
# app/components/toast_notification_component.rb
class ToastNotificationComponent < ApplicationViewComponent
option :message, required: true
option :variant, default: proc { "success" } # success, danger, warning, info
def toast_class
"toast-#{variant}"
end
end
2. New Stimulus Controllers
tabs_controller.js
Manages tabbed interface for agent detail page.
// app/javascript/controllers/tabs_controller.js
import {
Controller
} from "@hotwired/stimulus";
export default class extends Controller {
static targets = ["tab", "panel"];
connect() {
// Show first tab by default
this.showTab(0);
}
switch (event) {
event.preventDefault();
const index = this.tabTargets.indexOf(event.currentTarget);
this.showTab(index);
}
showTab(index) {
// Hide all panels
this.panelTargets.forEach((panel) => {
panel.classList.add("d-none");
});
// Show selected panel
this.panelTargets[index].classList.remove("d-none");
// Update active tab
this.tabTargets.forEach((tab) => {
tab.classList.remove("active");
});
this.tabTargets[index].classList.add("active");
}
}
toast_controller.js
Triggers Bootstrap toast notifications.
// app/javascript/controllers/toast_controller.js
import {
Controller
} from "@hotwired/stimulus";
import {
Toast
} from "bootstrap";
export default class extends Controller {
static values = {
autohide: {
type: Boolean,
default: true,
},
delay: {
type: Number,
default: 5000,
},
};
connect() {
const toast = new Toast(this.element, {
autohide: this.autohideValue,
delay: this.delayValue,
});
toast.show();
// Remove from DOM after hidden
this.element.addEventListener("hidden.bs.toast", () => {
this.element.remove();
});
}
}
3. New Controller Actions
SystemHealthController
Displays system health dashboard.
# app/controllers/system_health_controller.rb
class SystemHealthController < ApplicationController
before_action :authenticate_user!
def index
authorize! :read, :system_health
@health_status = fetch_health_status_with_lock
end
def fetch_health_status_with_lock
# Try to get cached value first
cached = Rails.cache.read("system_health")
return cached if cached
# Use Redis lock to prevent cache stampede
lock_key = "system_health_check_lock"
lock_acquired = Redis.current.set(lock_key, "locked", nx: true, ex: 10)
if lock_acquired
# Run health checks
status = {
postgresql: check_postgresql,
redis: check_redis,
minio: check_minio,
sidekiq: check_sidekiq
}
# Cache results for 1 minute
Rails.cache.write("system_health", status, expires_in: 1.minute)
# Release lock
Redis.current.del(lock_key)
status
else
# Another request is running checks, wait briefly and return cached value
sleep 0.1
Rails.cache.read("system_health") || {
postgresql: { status: :checking },
redis: { status: :checking },
minio: { status: :checking },
sidekiq: { status: :checking }
}
end
end
private
def check_postgresql
start_time = Time.current
ActiveRecord::Base.connection.execute("SELECT 1")
latency = ((Time.current - start_time) * 1000).round(2)
{ status: :healthy, latency: latency }
rescue => e
Rails.logger.error("PostgreSQL health check failed: #{e.message}")
{ status: :unhealthy, error: e.message }
end
def check_redis
start_time = Time.current
Redis.current.ping
latency = ((Time.current - start_time) * 1000).round(2)
{ status: :healthy, latency: latency }
rescue => e
Rails.logger.error("Redis health check failed: #{e.message}")
{ status: :unhealthy, error: e.message }
end
def check_minio
# Check S3 bucket access
start_time = Time.current
ActiveStorage::Blob.service.exist?("health_check")
latency = ((Time.current - start_time) * 1000).round(2)
{ status: :healthy, latency: latency }
rescue => e
Rails.logger.error("MinIO health check failed: #{e.message}")
{ status: :unhealthy, error: e.message }
end
def check_sidekiq
stats = Sidekiq::Stats.new
{
status: :healthy,
workers: stats.workers_size,
queues: stats.queues.size,
enqueued: stats.enqueued
}
rescue => e
Rails.logger.error("Sidekiq health check failed: #{e.message}")
{ status: :unhealthy, error: e.message }
end
end
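One gap in the lock handling above is that the lock is deleted only on the success path; if anything raised between acquiring and releasing it, the key would linger until its 10-second expiry. A minimal sketch of an ensure-based release follows, using an in-memory stand-in rather than Redis. FakeLockStore, set_nx, and with_lock are hypothetical names introduced for this illustration.

```ruby
# Hypothetical in-memory stand-in for the Redis calls used above (SET NX and DEL).
class FakeLockStore
  def initialize
    @keys = {}
  end

  # Returns true only when the key was not already set, like SET with nx: true.
  def set_nx(key, value)
    return false if @keys.key?(key)

    @keys[key] = value
    true
  end

  def del(key)
    @keys.delete(key)
  end

  def locked?(key)
    @keys.key?(key)
  end
end

# Runs the block only for the lock holder and always releases the lock afterwards.
def with_lock(store, key)
  return :busy unless store.set_nx(key, "locked")

  begin
    yield
  ensure
    store.del(key)
  end
end

store = FakeLockStore.new
result = with_lock(store, "system_health_check_lock") { { postgresql: { status: :healthy } } }
contended = with_lock(store, "system_health_check_lock") do
  with_lock(store, "system_health_check_lock") { :never_runs }
end
puts result[:postgresql][:status]
puts contended
```

The nested call shows the contended path: the inner caller sees the lock held and gets :busy instead of running the checks a second time.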
TasksController
Manages task lifecycle actions.
# app/controllers/tasks_controller.rb
class TasksController < ApplicationController
before_action :authenticate_user!
load_and_authorize_resource
def show
@task = Task.includes(:agent, :attack, :hashcat_statuses).find(params[:id])
end
def cancel
@task = Task.find(params[:id])
authorize! :cancel, @task
if @task.cancel
respond_to do |format|
format.turbo_stream do
render turbo_stream: [
turbo_stream.replace("task_#{@task.id}", partial: "tasks/task", locals: { task: @task }),
turbo_stream.append("toast_container", partial: "shared/toast", locals: { message: "Task cancelled", variant: "success" })
]
end
format.html { redirect_to @task, notice: "Task cancelled" }
end
else
respond_to do |format|
format.turbo_stream do
render turbo_stream: turbo_stream.append("toast_container", partial: "shared/toast", locals: { message: "Failed to cancel task", variant: "danger" })
end
format.html { redirect_to @task, alert: "Failed to cancel task" }
end
end
end
def retry
@task = Task.find(params[:id])
authorize! :retry, @task
# Use state machine event for proper transition
if @task.retry
respond_to do |format|
format.turbo_stream do
render turbo_stream: [
turbo_stream.replace("task_#{@task.id}", partial: "tasks/task", locals: { task: @task }),
turbo_stream.append("toast_container", partial: "shared/toast", locals: { message: "Task queued for retry", variant: "success" })
]
end
format.html { redirect_to @task, notice: "Task queued for retry" }
end
else
respond_to do |format|
format.turbo_stream do
render turbo_stream: turbo_stream.append("toast_container", partial: "shared/toast", locals: { message: "Failed to retry task", variant: "danger" })
end
format.html { redirect_to @task, alert: "Failed to retry task" }
end
end
end
def reassign
@task = Task.find(params[:id])
authorize! :reassign, @task
new_agent = Agent.find(params[:agent_id])
# Validate agent can handle this task
unless agent_compatible_with_task?(new_agent, @task)
respond_to do |format|
format.turbo_stream do
render turbo_stream: turbo_stream.append("toast_container", partial: "shared/toast", locals: { message: "Agent #{new_agent.name} cannot handle this task (incompatible hash type or insufficient performance)", variant: "danger" })
end
format.html { redirect_to @task, alert: "Agent incompatible with task" }
end
return
end
if @task.update(agent: new_agent, state: :pending)
respond_to do |format|
format.turbo_stream do
render turbo_stream: [
turbo_stream.replace("task_#{@task.id}", partial: "tasks/task", locals: { task: @task }),
turbo_stream.append("toast_container", partial: "shared/toast", locals: { message: "Task reassigned to #{new_agent.name}", variant: "success" })
]
end
format.html { redirect_to @task, notice: "Task reassigned" }
end
else
respond_to do |format|
format.turbo_stream do
render turbo_stream: turbo_stream.append("toast_container", partial: "shared/toast", locals: { message: "Failed to reassign task", variant: "danger" })
end
format.html { redirect_to @task, alert: "Failed to reassign task" }
end
end
end
def logs
@task = Task.find(params[:id])
authorize! :read, @task
@logs = @task.hashcat_statuses.order(time: :desc).limit(100)
end
def download_results
@task = Task.find(params[:id])
authorize! :download_results, @task
# Generate CSV of cracked hashes from the task's hash list (not limited to hashes cracked by this task)
csv_data = generate_results_csv(@task)
send_data csv_data,
filename: "task_#{@task.id}_results_#{Time.current.to_i}.csv",
type: "text/csv"
end
private
def generate_results_csv(task)
require "csv"
CSV.generate do |csv|
csv << ["Hash", "Plaintext", "Cracked At"]
task.attack.hash_list.hash_items.where.not(cracked_time: nil).find_each do |item|
csv << [item.hash_value, item.plain_text, item.cracked_time]
end
end
end
def agent_compatible_with_task?(agent, task)
hash_type = task.attack.hash_type
# Check if agent supports this hash type
return false unless agent.allowed_hash_types.include?(hash_type)
# Check if agent meets performance threshold
return false unless agent.meets_performance_threshold?(hash_type)
# Check if agent has access to task's project
return false unless agent.project_ids.include?(task.attack.campaign.project_id)
true
end
end
4. Integration Points
Turbo Stream Broadcasts
Define broadcast targets and partials for real-time updates.
# app/models/agent.rb
after_update_commit :broadcast_status_update, if: :should_broadcast_status?
def broadcast_status_update
broadcast_replace_to self,
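target: ActionView::RecordIdentifier.dom_id(self, :status), # "status_agent_<id>", the ID the view builds with dom_id(agent, :status)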
partial: "agents/status_card",
locals: { agent: self }
end
def should_broadcast_status?
saved_change_to_state? ||
saved_change_to_last_seen_at? ||
saved_change_to_current_hash_rate?
end
# app/models/attack.rb
after_update_commit :broadcast_progress_update, if: :should_broadcast_progress?
def broadcast_progress_update
broadcast_replace_to campaign,
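target: ActionView::RecordIdentifier.dom_id(self, :progress), # "progress_attack_<id>", the ID the view builds with dom_id(attack, :progress)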
partial: "campaigns/attack_progress",
locals: { attack: self }
rescue => e
Rails.logger.error("Failed to broadcast attack progress update: #{e.message}")
Rails.logger.error(e.backtrace.join("\n"))
end
def should_broadcast_progress?
saved_change_to_state? || tasks.any?(&:saved_change_to_state?)
end
Routes
Add routes for new controllers and actions.
# config/routes.rb
Rails.application.routes.draw do
# ... existing routes ...
# System Health
# Plural resources is required here: a singular resource defines no index route
resources :system_health, only: [:index]
# Tasks
resources :tasks, only: [:show] do
member do
post :cancel
post :retry
post :reassign
get :logs
get :download_results
end
end
end
5. View Structure
Agent List View
<!-- app/views/agents/index.html.erb -->
<div id="agents_list">
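<%# Subscribe to each agent's own stream, since the model broadcasts to self rather than to a shared "agents" stream %>
<% @agents.each do |agent| %>
<%= turbo_stream_from agent %>
<% end %>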
<div class="row">
<% @agents.each do |agent| %>
<div id="<%= dom_id(agent, :status) %>" class="col-md-4 mb-3">
<%= render AgentStatusCardComponent.new(agent: agent) %>
</div>
<% end %>
</div>
</div>
Agent Detail View with Tabs
<!-- app/views/agents/show.html.erb -->
<%= turbo_stream_from @agent %>
<div data-controller="tabs">
<ul class="nav nav-tabs" role="tablist">
<li class="nav-item">
<a class="nav-link" data-tabs-target="tab" data-action="click->tabs#switch" href="#">Overview</a>
</li>
<li class="nav-item">
<a class="nav-link" data-tabs-target="tab" data-action="click->tabs#switch" href="#">Errors</a>
</li>
<li class="nav-item">
<a class="nav-link" data-tabs-target="tab" data-action="click->tabs#switch" href="#">Configuration</a>
</li>
<li class="nav-item">
<a class="nav-link" data-tabs-target="tab" data-action="click->tabs#switch" href="#">Capabilities</a>
</li>
</ul>
<div class="tab-content">
<div data-tabs-target="panel" class="tab-pane">
<div id="<%= dom_id(@agent, :overview_content) %>">
<%= render "agents/overview_tab", agent: @agent %>
</div>
</div>
<div data-tabs-target="panel" class="tab-pane d-none">
<div id="<%= dom_id(@agent, :errors_content) %>">
<%= render "agents/errors_tab", agent: @agent %>
</div>
</div>
<div data-tabs-target="panel" class="tab-pane d-none">
<%= render "agents/configuration_tab", agent: @agent %>
</div>
<div data-tabs-target="panel" class="tab-pane d-none">
<%= render "agents/capabilities_tab", agent: @agent %>
</div>
</div>
</div>
Turbo Stream Broadcast Targets:
Broadcasts update only the content inside each tab panel, preserving the tab structure and Stimulus controller state. The update action swaps a target's inner HTML and leaves the wrapper div (and its ID) in place, whereas replace would swap out the wrapper itself and later broadcasts could no longer find the target:
# Broadcast overview tab content
broadcast_update_to agent,
target: dom_id(agent, :overview_content),
partial: "agents/overview_tab",
locals: { agent: agent }
# Broadcast errors tab content
broadcast_update_to agent,
target: dom_id(agent, :errors_content),
partial: "agents/errors_tab",
locals: { agent: agent }
Rationale: By updating only the content inside each tab panel (not the tab structure itself), we preserve the Stimulus controller state and avoid resetting the active tab when updates arrive.
Campaign Detail with Progress
<!-- app/views/campaigns/show.html.erb -->
<%= turbo_stream_from @campaign %>
<!-- Campaign ETA Summary -->
<div class="alert alert-info">
<% if @campaign.current_eta %>
<strong>Current Attack ETA:</strong> <%= distance_of_time_in_words_to_now(@campaign.current_eta) %>
<% end %>
<% if @campaign.total_eta %>
<br><strong>Total Campaign ETA:</strong> <%= distance_of_time_in_words_to_now(@campaign.total_eta) %>
<% end %>
</div>
<% @campaign.attacks.each do |attack| %>
<div id="<%= dom_id(attack, :progress) %>">
<%= render CampaignProgressComponent.new(attack: attack) %>
</div>
<% end %>
<!-- Recent Cracks Section -->
<div class="mt-4">
<button class="btn btn-outline-primary" data-bs-toggle="collapse" data-bs-target="#recent_cracks">
View Recent Cracks (<%= @campaign.hash_list.recent_cracks_count %>)
</button>
<div id="recent_cracks" class="collapse mt-3">
<table class="table table-sm">
<thead>
<tr>
<th>Hash</th>
<th>Plaintext</th>
<th>Cracked At</th>
</tr>
</thead>
<tbody>
<% @campaign.hash_list.recent_cracks.each do |item| %>
<tr>
<td><%= truncate(item.hash_value, length: 20) %></td>
<td><%= item.plain_text %></td>
<td><%= time_ago_in_words(item.cracked_time) %> ago</td>
</tr>
<% end %>
</tbody>
</table>
</div>
</div>
System Health Dashboard
<!-- app/views/system_health/index.html.erb -->
<div class="row">
<% @health_status.each do |service, status| %>
<div class="col-md-3 mb-3">
<%= render SystemHealthCardComponent.new(
service_name: service.to_s.titleize,
status: status[:status],
latency: status[:latency],
error: status[:error]
) %>
</div>
<% end %>
</div>
<div class="mt-4">
<h4>Diagnostic Links</h4>
<ul>
<li><%= link_to "Sidekiq Dashboard", sidekiq_web_path %></li>
<li><%= link_to "Rails Logs", "#" %> (view in terminal)</li>
</ul>
</div>
Summary
This technical plan extends the existing CipherSwarm architecture with targeted improvements for operational excellence:
Key Principles:
- ✅ Build on existing patterns (Rails 8 + Hotwire + ViewComponent)
- ✅ No new frameworks or major dependencies
- ✅ Maintainable by solo part-time developer
- ✅ Air-gapped deployment compatible
- ✅ Pragmatic over perfect
Implementation Phases:
- Database migrations (indexes, cached columns)
- Model extensions (methods, callbacks, broadcasts)
- ViewComponents (reusable UI patterns)
- Stimulus controllers (tabs, toasts)
- Controller actions (system health, task management)
- Views and partials (integrate components)
- Logging configuration (Lograge setup)
- Authorization rules (CanCanCan abilities)
Testing Strategy:
- System tests for new flows (agent monitoring, campaign progress, task actions)
- Request specs for new controller actions
- Component tests for ViewComponents
- Model tests for new methods and callbacks
Deployment Considerations:
- Run migrations before deployment
- Warm caches on first request (acceptable)
- Monitor Turbo Stream broadcast performance
- Verify air-gapped asset compilation
Architecture Validation Results
This section documents critical architectural decisions validated during the architecture review process.
Validated Decisions
1. Cached Agent Metrics: HashcatStatus Callback Pattern
Issue: The original design put the metrics callback on Agent, but the Agent record is not updated when a HashcatStatus arrives, so the callback did not fire on new status data, leading to race conditions and stale metrics.
Resolution: Move callback to HashcatStatus model. When status is created, update Agent metrics directly.
Rationale: HashcatStatus is the source of truth for agent metrics. By updating Agent from HashcatStatus callback, we ensure metrics are updated whenever new status data arrives, avoiding race conditions.
Trade-offs:
- ✅ Eliminates race condition
- ✅ Metrics always fresh (within 30-second throttle)
- ✅ Simpler data flow (status → agent)
- ❌ Adds callback to HashcatStatus model
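The status → agent flow can be sketched without Rails. This is a minimal illustration, not the project's code: plain Structs stand in for the ActiveRecord models, and `update_agent_metrics`, `current_hash_rate`, `metrics_updated_at`, `received_at`, and `METRICS_THROTTLE` are assumed names. In the application, the method body would presumably run from an `after_create_commit` callback on HashcatStatus.

```ruby
# Plain-Ruby stand-ins for the ActiveRecord models (illustrative only).
Agent = Struct.new(:id, :current_hash_rate, :metrics_updated_at, keyword_init: true)
HashcatStatus = Struct.new(:agent, :hash_rate, :received_at, keyword_init: true)

METRICS_THROTTLE = 30 # seconds; the throttle window named in the trade-offs

# Copies metrics from a new status onto its agent.
# Returns true when the cached metrics were refreshed, false when throttled.
def update_agent_metrics(status)
  agent = status.agent
  last = agent.metrics_updated_at
  return false if last && (status.received_at - last) < METRICS_THROTTLE

  agent.current_hash_rate = status.hash_rate
  agent.metrics_updated_at = status.received_at
  true
end

agent = Agent.new(id: 1)
t0 = Time.at(0)
update_agent_metrics(HashcatStatus.new(agent: agent, hash_rate: 1_000, received_at: t0))      # refreshed
update_agent_metrics(HashcatStatus.new(agent: agent, hash_rate: 2_000, received_at: t0 + 10)) # throttled
update_agent_metrics(HashcatStatus.new(agent: agent, hash_rate: 3_000, received_at: t0 + 45)) # refreshed
```

Because the write originates from the status record, the agent's cached metrics change only when new status data exists, which is the property the resolution relies on.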
2. Task Retry: State Machine Event
Issue: Original design bypassed state machine with direct update(state: :pending), violating architecture pattern and skipping callbacks.
Resolution: Add retry event to Task state machine with proper failed → pending transition.
Rationale: State machines are the established pattern for lifecycle management in CipherSwarm. Adding a proper event ensures retry follows the same lifecycle management as other state transitions, triggers appropriate callbacks (logging, attack updates), and maintains system invariants.
Trade-offs:
- ✅ Maintains architectural consistency
- ✅ Triggers proper callbacks and logging
- ✅ Respects state machine guards
- ❌ Requires adding a new event to the Task state machine definition (a model change, not a schema migration)
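The difference between a guarded event and a direct state write can be shown with a small hand-rolled state machine. This is a sketch under assumptions: the real Task model uses the project's state machine library, whose DSL is not reproduced here, and only the `retry` event (failed → pending) comes from the plan. The `accept` and `fail` events, the `fire` method, and the `log` array are hypothetical.

```ruby
class Task
  class InvalidTransition < StandardError; end

  # event => { from_state => to_state }
  TRANSITIONS = {
    accept: { pending: :running }, # assumed, for context
    fail:   { running: :failed },  # assumed, for context
    retry:  { failed: :pending }   # the event this decision adds
  }.freeze

  attr_reader :state, :log

  def initialize(state = :pending)
    @state = state
    @log = []
  end

  # Fires an event, refusing transitions the table does not allow.
  def fire(event)
    to = TRANSITIONS.fetch(event).fetch(state) {
      raise InvalidTransition, "cannot #{event} from #{state}"
    }
    @log << "#{state} -> #{to} via #{event}" # stands in for transition callbacks
    @state = to
  end
end

task = Task.new(:failed)
task.fire(:retry)
task.state # => :pending
```

Firing `retry` on a task that is not failed raises instead of silently overwriting the state, which is the guard a direct `update(state: :pending)` would skip.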
3. Turbo Stream + Stimulus: Scoped Broadcasts
Issue: Turbo Stream broadcasts could replace Stimulus controller elements, resetting tab state and disrupting user interaction.
Resolution: Broadcast only tab content (inside panels), not tab structure. Wrap content in divs with unique IDs (dom_id(agent, :overview_content)).
Rationale: By broadcasting only the content inside each tab panel, we preserve the Stimulus controller state and avoid resetting the active tab when updates arrive.
Trade-offs:
- ✅ Preserves Stimulus controller state
- ✅ No tab reset on updates
- ✅ User interaction not disrupted
- ❌ Requires careful DOM structure
- ❌ More granular broadcast targets
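To make the target naming concrete, the sketch below approximates how per-tab targets could be derived. `dom_id_for` is a hypothetical plain-Ruby stand-in for ActionView's `dom_id(record, prefix)`, and `tab_content_targets` is an assumed helper; the tab names follow the Overview and Errors tabs mentioned in the scoping strategy.

```ruby
Agent = Struct.new(:id)

# Hypothetical approximation of ActionView's dom_id(record, prefix),
# which joins the prefix, the model name, and the record id.
def dom_id_for(record, prefix = nil)
  base = "#{record.class.name.downcase}_#{record.id}"
  prefix ? "#{prefix}_#{base}" : base
end

# One broadcast target per tab panel body. The tab bar and the element
# carrying the Stimulus controller are deliberately never targets.
def tab_content_targets(agent, tabs = %i[overview errors])
  tabs.to_h { |tab| [tab, dom_id_for(agent, "#{tab}_content")] }
end

targets = tab_content_targets(Agent.new(42))
targets[:overview] # => "overview_content_agent_42"
targets[:errors]   # => "errors_content_agent_42"
```

Each broadcast then replaces only the element with the matching ID inside its panel, so the element that owns the active-tab state is never re-rendered.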
4. System Health: Redis Lock for Cache Stampede Prevention
Issue: Multiple concurrent requests could overwhelm services with health checks when cache expires.
Resolution: Use Redis lock (SET NX EX) to ensure only one request runs health checks. Other requests wait briefly and return cached value or "checking" status.
Rationale: Redis lock is simple, robust, and prevents cache stampede without requiring background jobs or complex coordination.
Trade-offs:
- ✅ Prevents service overload
- ✅ Simple implementation (Redis SET NX)
- ✅ Graceful degradation (returns "checking" if locked)
- ❌ Slight delay for concurrent requests
- ❌ Requires Redis (already in stack)
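A minimal sketch of the lock-then-check flow follows. To keep it runnable without a Redis server, `FakeRedis` is an in-memory stand-in that mimics only the `set(key, value, nx:, ex:)`, `get`, and `del` calls of the redis-rb client. The key names, TTL values, and the `system_health` method are assumptions. The sketch returns "checking" immediately rather than waiting briefly, and it caches a plain string; a structured status would need serializing before storage.

```ruby
# In-memory stand-in for the small part of the Redis client used below.
class FakeRedis
  def initialize(clock: -> { Time.now.to_f })
    @clock = clock
    @data = {}
  end

  # Returns true when the key was written; with nx: true, false if it exists.
  def set(key, value, nx: false, ex: nil)
    purge(key)
    return false if nx && @data.key?(key)

    @data[key] = [value, ex ? @clock.call + ex : nil]
    true
  end

  def get(key)
    purge(key)
    @data[key]&.first
  end

  def del(key)
    @data.delete(key) ? 1 : 0
  end

  private

  # Drops the key if its TTL has elapsed.
  def purge(key)
    _, expires_at = @data[key]
    @data.delete(key) if expires_at && expires_at <= @clock.call
  end
end

LOCK_KEY = "system_health:lock"    # hypothetical key names
CACHE_KEY = "system_health:status"

# Returns the cached status, runs the checks if this caller wins the lock,
# or returns "checking" when another caller already holds it.
def system_health(redis, lock_ttl: 10, cache_ttl: 60)
  cached = redis.get(CACHE_KEY)
  return cached if cached

  # SET NX EX: one caller wins; the TTL releases the lock if that caller dies.
  return "checking" unless redis.set(LOCK_KEY, "1", nx: true, ex: lock_ttl)

  begin
    status = yield
    redis.set(CACHE_KEY, status, ex: cache_ttl)
    status
  ensure
    redis.del(LOCK_KEY)
  end
end

redis = FakeRedis.new
checks_run = 0
system_health(redis) { checks_run += 1; "healthy" } # runs the checks
system_health(redis) { checks_run += 1; "healthy" } # served from cache
```

The lock TTL should comfortably exceed the slowest expected health check; otherwise a second caller can acquire the lock while the first is still running.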
5. Campaign ETA: Current + Total Display
Issue: Original design only showed running attack ETA, not accounting for pending attacks, giving incomplete picture of campaign completion time.
Resolution: Add two methods: current_eta (running attacks only) and total_eta (all incomplete attacks). Display both in UI.
Rationale: Showing both current and total ETA gives users a complete picture: "Current attack finishes in 2h, entire campaign finishes in 6h". This manages expectations better than showing only current attack ETA.
Trade-offs:
- ✅ Complete information for users
- ✅ Better expectation management
- ✅ Uses existing ETA calculation logic
- ❌ Total ETA is estimate (based on average)
- ❌ Two cache keys instead of one
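The two methods could look roughly like the sketch below. The plan does not specify how the average is computed, so this version makes two loud assumptions: attacks run one after another, and each pending attack is estimated at the mean full duration (elapsed plus remaining) of attacks that already have timing data. The `Attack` struct and its fields are hypothetical stand-ins for the real models, and caching is omitted.

```ruby
Attack = Struct.new(:state, :elapsed, :remaining, keyword_init: true)

class Campaign
  def initialize(attacks)
    @attacks = attacks
  end

  # Seconds until the running attack(s) finish; nil when nothing is running.
  def current_eta
    running.map(&:remaining).max
  end

  # Seconds until all incomplete attacks finish (ASSUMPTION: sequential
  # execution, pending attacks estimated at the average observed duration).
  def total_eta
    return nil if running.empty? && pending.empty?

    current_eta.to_i + pending.size * average_duration
  end

  private

  def running
    @attacks.select { |a| a.state == :running }
  end

  def pending
    @attacks.select { |a| a.state == :pending }
  end

  # Mean full duration of attacks with timing data; 0 when there are none.
  def average_duration
    timed = @attacks.reject { |a| a.state == :pending }
    return 0 if timed.empty?

    timed.sum { |a| a.elapsed + a.remaining } / timed.size
  end
end

campaign = Campaign.new([
  Attack.new(state: :completed, elapsed: 14_400, remaining: 0),
  Attack.new(state: :running, elapsed: 7_200, remaining: 7_200),
  Attack.new(state: :pending)
])
campaign.current_eta # => 7200  (2h)
campaign.total_eta   # => 21600 (6h)
```

With these inputs the output matches the "2h current, 6h total" example above; a different averaging rule would change only `average_duration`.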
6. Task Reassign: Agent Compatibility Validation
Issue: Original design allowed reassigning tasks to incompatible agents (wrong hash type, insufficient performance, no project access), causing task failures.
Resolution: Add agent_compatible_with_task? validation in controller. Only show compatible agents in UI (filter by hash type, performance, project access).
Rationale: Preventing invalid reassignments at the UI level (only showing compatible agents) is better UX than allowing selection and then rejecting. Backend validation provides defense-in-depth.
Trade-offs:
- ✅ Prevents invalid reassignments
- ✅ Better UX (only valid options shown)
- ✅ Defense-in-depth (UI + backend validation)
- ❌ More complex agent selection logic
- ❌ Requires querying agent capabilities
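One way the compatibility check might be structured is sketched below. The three criteria (hash type, performance, project access) come from the plan; everything else is assumed for illustration, including the `hash_type_benchmarks` hash, the `project_ids` list, the `MIN_HASH_RATE` floor, and the `compatible_agents` helper. Hash types are written as hashcat mode numbers.

```ruby
Agent = Struct.new(:name, :hash_type_benchmarks, :project_ids, keyword_init: true)
Task = Struct.new(:hash_type, :project_id, keyword_init: true)

MIN_HASH_RATE = 1_000 # hypothetical performance floor, hashes per second

# True only when the agent supports the hash type, is fast enough,
# and belongs to the task's project.
def agent_compatible_with_task?(agent, task)
  rate = agent.hash_type_benchmarks[task.hash_type] # nil when unsupported
  return false if rate.nil?
  return false if rate < MIN_HASH_RATE

  agent.project_ids.include?(task.project_id)
end

# Used by the UI to list only valid reassignment targets.
def compatible_agents(agents, task)
  agents.select { |agent| agent_compatible_with_task?(agent, task) }
end

task = Task.new(hash_type: 1000, project_id: 1)
agents = [
  Agent.new(name: "rig-a", hash_type_benchmarks: { 1000 => 5_000 }, project_ids: [1]),
  Agent.new(name: "rig-b", hash_type_benchmarks: { 0 => 9_000 }, project_ids: [1]),    # wrong hash type
  Agent.new(name: "rig-c", hash_type_benchmarks: { 1000 => 500 }, project_ids: [1]),   # too slow
  Agent.new(name: "rig-d", hash_type_benchmarks: { 1000 => 5_000 }, project_ids: [2])  # no project access
]
compatible_agents(agents, task).map(&:name) # => ["rig-a"]
```

Sharing one predicate between the UI filter and the controller check keeps the two layers of validation from drifting apart.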
7. Turbo Stream Broadcast Errors: Rescue and Log
Issue: Broadcast failures (partial rendering errors, WebSocket issues) could cause silent failures or roll back transactions.
Resolution: Wrap all broadcasts in rescue blocks, log errors with full backtrace, continue execution.
Rationale: Turbo Stream broadcasts are best-effort updates. If a broadcast fails, the database update should still succeed. Logging errors ensures visibility for debugging without disrupting core functionality.
Trade-offs:
- ✅ Resilient to broadcast failures
- ✅ Database updates succeed even if UI update fails
- ✅ Errors logged for debugging
- ❌ UI might be stale until next update
- ❌ Silent failure from user perspective
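Since every broadcast needs the same rescue-and-log treatment, the pattern can be factored into one wrapper. The sketch below is plain Ruby with the standard `Logger`; the `safe_broadcast` name and its boolean return value are assumptions, and the block stands in for a call such as `broadcast_replace_to`.

```ruby
require "logger"
require "stringio"

# Runs a broadcast block and never lets its failure reach the caller.
# Returns true when the broadcast succeeded, false when it was rescued.
def safe_broadcast(logger, description)
  yield
  true
rescue StandardError => e
  logger.error("Failed to broadcast #{description}: #{e.message}")
  logger.error(Array(e.backtrace).join("\n"))
  false
end

log_output = StringIO.new
logger = Logger.new(log_output)

delivered = safe_broadcast(logger, "agent status update") do
  raise IOError, "WebSocket unavailable" # simulated broadcast failure
end
delivered # => false; the error and backtrace are in log_output
```

Rescuing `StandardError` rather than `Exception` leaves signals and interpreter-level errors untouched while still covering rendering and connection failures.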
Architecture Readiness
The architecture has been validated against six dimensions:
- ✅ Simplicity: Extends existing patterns, no new frameworks, minimal new dependencies (Lograge only)
- ✅ Flexibility: Modular components, targeted broadcasts allow independent updates
- ✅ Robustness: Error handling for broadcasts, health checks, state machines; Redis lock prevents stampede
- ✅ Scaling: Caching strategy reduces load, indexes optimize queries, appropriate for small customer base
- ✅ Codebase Fit: Follows Rails 8 + Hotwire + ViewComponent patterns, respects state machines, consistent with existing code
- ✅ Requirements Coverage: All Core Flows addressed, Epic Brief acceptance criteria met
Critical Gaps Resolved:
- ✅ Cached metrics race condition fixed (HashcatStatus callback)
- ✅ Task retry state machine violation fixed (proper event added)
- ✅ Turbo Stream + Stimulus conflict resolved (scoped broadcasts)
- ✅ System health stampede prevented (Redis lock)
- ✅ Campaign ETA completeness improved (current + total)
- ✅ Task reassign validation added (agent compatibility)
- ✅ Broadcast error handling added (rescue and log)
Architecture Status: ✅ READY FOR IMPLEMENTATION