health_status_screen
Type
External
Status
Published
Created
Feb 27, 2026
Updated
Feb 27, 2026

CipherSwarm System Health UX Design#

Last updated: 2025-10-12



Purpose#

The System Health page serves as an operational dashboard, offering immediate insights into the status and performance of critical backend services. It aims to facilitate proactive monitoring and swift issue identification, ensuring the reliability and efficiency of CipherSwarm's infrastructure.

Layout Overview#

The page uses a responsive grid layout, with each service (Object Storage, Cache, Database) represented as a distinct status card. Each card provides real-time metrics and status indicators at a glance. The visual design follows a clean server status layout pattern, aligning with existing CipherSwarm styling.

Service Status Cards#

In addition to backend services, this page also displays the current operational state of all registered agents. Agents are considered system-level participants and should be included in the health view.

Object Storage#

  • Status Indicator: 🟢 Healthy, 🟡 Degraded, 🔴 Unreachable (color-coded badge)

  • Metrics:

    • Latency (API response time)
    • Errors (I/O and timeout count)
    • Storage Utilization (used vs total capacity)
  • Health Source: Storage service health endpoint
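The three badge states above could be derived from raw probe results along these lines. This is a minimal sketch: the design does not specify thresholds, so `DEGRADED_LATENCY_MS`, `DEGRADED_ERROR_COUNT`, and the `classify` helper are hypothetical names and values chosen for illustration.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    HEALTHY = "healthy"          # green badge
    DEGRADED = "degraded"        # yellow badge
    UNREACHABLE = "unreachable"  # red badge


# Hypothetical thresholds; the design document does not define them.
DEGRADED_LATENCY_MS = 500.0
DEGRADED_ERROR_COUNT = 1


def classify(latency_ms: Optional[float], error_count: int = 0) -> Status:
    """Map raw probe results to a badge state.

    latency_ms is None when the probe could not reach the service at all.
    """
    if latency_ms is None:
        return Status.UNREACHABLE
    if latency_ms >= DEGRADED_LATENCY_MS or error_count >= DEGRADED_ERROR_COUNT:
        return Status.DEGRADED
    return Status.HEALTHY
```

The same mapping could be reused for the Cache and Database cards so that badge semantics stay consistent across services.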

Cache Service#

Cache service health is monitored through direct service queries. Do not use shell commands for metrics collection.

| Metric | Source | Notes |
|---|---|---|
| Up/Down State | Service ping | Fast and safe health check |
| Latency | Simple command timing | e.g., GET/SET roundtrip time |
| Memory Usage | Service info | Returns bytes in use |
| Active Connections | Service info | Useful for system health card |
| Keyspace Stats | Service info (optional) | Admin-only detailed breakdown |

These values are readily available without external tooling and can be updated live via real-time polling or background jobs.
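A cache probe covering the first four rows of the table might look like the sketch below. It assumes a Redis-style client object (for example one from redis-py) that exposes `ping()` and `info()`, and it assumes the Redis INFO field names `used_memory` and `connected_clients`; the design does not name the cache product, so treat both as assumptions. The ping round trip is timed here as a stand-in for a GET/SET timing.

```python
import time
from typing import Any, Dict, Protocol


class CacheClient(Protocol):
    """Minimal interface this sketch needs from a cache client (assumed)."""

    def ping(self) -> Any: ...
    def info(self) -> Dict[str, Any]: ...


def probe_cache(client: CacheClient) -> Dict[str, Any]:
    """Collect up/down state, latency, memory use, and connection count."""
    start = time.perf_counter()
    try:
        client.ping()
    except Exception as exc:  # any failure means the card shows "unreachable"
        return {"up": False, "error": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000.0

    try:
        info = client.info()
    except Exception:
        info = {}  # service answered ping but stats are unavailable

    return {
        "up": True,
        "latency_ms": latency_ms,
        # Field names follow Redis INFO output; other caches will differ.
        "memory_bytes": info.get("used_memory"),
        "connections": info.get("connected_clients"),
    }
```

Because the function only depends on the two methods, it can be exercised in tests with a small fake client and no running cache.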

  • Status Indicator: Color-coded badge

  • Metrics:

    • Command Latency
    • Memory Usage
    • Active Connections
  • Health Source: Cache service info commands

Database#

Database health is monitored through pooled connections and system queries. No shell commands should be used.

| Metric | Query or Method | Notes |
|---|---|---|
| Up/Down State | Simple SELECT 1 with timeout | Indicates connection availability |
| Query Latency | Time an example query execution | Consider averaging or sampling over interval |
| Connection Pool | Pool inspection tools | Or count active connections |
| Long-running Queries | System activity view | Admin-only display |
| Replication Lag | Replication status view | Shown only if replication is configured |
| Background Workers | Background worker stats | Admin-only, useful for diagnosing slowdowns |

Data should be cached briefly (5–15s) if queried frequently. Limit admin-only data to a collapsed view by default.
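The `SELECT 1` check and the brief caching described above can be sketched as follows. The probe works on any Python DB-API connection; the statement timeout itself is assumed to be configured on the driver or pool, since the mechanism is driver-specific. `probe_database`, `TTLCache`, and `demo` are hypothetical names, and the demo uses an in-memory SQLite database purely so the example runs without a server.

```python
import sqlite3
import time
from typing import Any, Callable, Dict, Optional, Tuple


def probe_database(conn: Any) -> Dict[str, Any]:
    """Run SELECT 1 on a DB-API connection and time the round trip."""
    start = time.perf_counter()
    try:
        cur = conn.cursor()
        cur.execute("SELECT 1")
        cur.fetchone()
        cur.close()
    except Exception as exc:
        return {"up": False, "error": str(exc)}
    return {"up": True, "latency_ms": (time.perf_counter() - start) * 1000.0}


class TTLCache:
    """Hold one computed value for ttl_seconds before recomputing it."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entry: Optional[Tuple[float, Any]] = None

    def get(self, compute: Callable[[], Any]) -> Any:
        now = self._clock()
        if self._entry is not None and now - self._entry[0] < self._ttl:
            return self._entry[1]  # still fresh, skip the query
        value = compute()
        self._entry = (now, value)
        return value


def demo() -> Dict[str, Any]:
    """Exercise the probe against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    try:
        cache = TTLCache(ttl_seconds=10.0)  # within the 5–15s range above
        return cache.get(lambda: probe_database(conn))
    finally:
        conn.close()
```

Injecting the clock keeps the cache testable without sleeping; in production the default monotonic clock would be used.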

  • Status Indicator: Color-coded badge

  • Metrics:

    • Query Latency
    • Connection Pool Usage
    • Replication Lag (if applicable)
  • Health Source: Database system views

Agents#

  • Status Indicator: Color-coded badge (🟢 Online, 🔴 Offline)

  • Metrics:

    • Last seen timestamp
    • Current assigned task (if any)
    • Guess rate (if available)
  • Grouping: Display as a collapsible section or in its own row below Cache/Database services
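The Online/Offline badge could be derived from the last seen timestamp, roughly as below. The design does not state how long an agent may be silent before it counts as offline, so `OFFLINE_AFTER` is a hypothetical cutoff and `agent_status` is an illustrative helper.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical cutoff; the design does not define one.
OFFLINE_AFTER = timedelta(minutes=2)


def agent_status(last_seen: Optional[datetime],
                 now: Optional[datetime] = None) -> str:
    """Return "online" or "offline" based on how recently the agent checked in."""
    if now is None:
        now = datetime.now(timezone.utc)
    if last_seen is None or now - last_seen > OFFLINE_AFTER:
        return "offline"
    return "online"
```

Both timestamps are expected to be timezone-aware; mixing naive and aware datetimes raises a `TypeError` in Python.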

Design Considerations#

Empty/Error State UX#

  • If a service cannot be reached, show a red badge and a concise message (e.g., "Cache service unreachable — last seen 2m ago").

  • Use skeleton loaders during initial load.

  • If no data is available for a metric, show a subtle placeholder (e.g., “N/A” or gray indicator) with tooltip explaining why.

  • Align with style-guide.md (dark mode, purple accent, etc.)

  • Use standard component patterns

  • Emphasize clarity and minimalism — show only relevant metrics

  • Enable hover tooltips for metric definitions if space is tight
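The unreachable message and the "N/A" placeholder from the list above could be produced by small formatting helpers such as these. This is only a sketch: the function names and the exact message wording are assumptions, not part of the design.

```python
from typing import Optional


def format_age(seconds: float) -> str:
    """Render an elapsed time compactly, e.g. 120 seconds as "2m"."""
    whole = int(seconds)
    if whole < 60:
        return f"{whole}s"
    if whole < 3600:
        return f"{whole // 60}m"
    return f"{whole // 3600}h"


def unreachable_message(service: str,
                        seconds_since_seen: Optional[float]) -> str:
    """Build the concise text shown next to a red badge."""
    if seconds_since_seen is None:
        return f"{service} unreachable"
    return f"{service} unreachable, last seen {format_age(seconds_since_seen)} ago"


def display_value(value: Optional[float], unit: str = "") -> str:
    """Show a metric, or the "N/A" placeholder when no data is available."""
    return "N/A" if value is None else f"{value:g}{unit}"
```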

Real-Time Behavior#

Update Strategy#

  • Live metrics should update every 5–10 seconds via real-time streaming from the JSON API.

  • Expensive queries (like object counts or database stats) should be cached for 30–60 seconds server-side.

  • Consider staggered refresh intervals or jitter to avoid burst load after page load.

  • Real-time streaming updates for all metrics

  • Stale data indication if no update received in 30s

  • Optional retry/backoff on failure, with error banners if a system is unreachable
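The timing rules in this list (5–10 second refresh with jitter, a 30 second staleness cutoff, and backoff on failure) can be expressed as three small functions. This is a sketch under assumptions: the backoff base and cap are not specified in the design, and all function names are hypothetical.

```python
import random
from typing import Callable


def next_poll_delay(base_seconds: float = 5.0,
                    jitter_seconds: float = 5.0,
                    rng: Callable[[], float] = random.random) -> float:
    """Pick a delay between base and base + jitter (5–10 s by default).

    Randomizing the delay spreads refreshes out so that many open pages
    do not hit the API at the same instant.
    """
    return base_seconds + rng() * jitter_seconds


def is_stale(last_update_ts: float, now_ts: float,
             threshold_seconds: float = 30.0) -> bool:
    """True when no update has arrived within the staleness threshold."""
    return now_ts - last_update_ts > threshold_seconds


def backoff_delay(attempt: int, base_seconds: float = 1.0,
                  cap_seconds: float = 60.0) -> float:
    """Exponential backoff: base * 2^attempt, capped (values are assumed)."""
    return min(cap_seconds, base_seconds * (2 ** attempt))
```

Passing `rng` explicitly makes the jitter deterministic in tests while defaulting to `random.random` in normal use.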

Access Control#

Admin Access Enhancements#

Users with administrative privileges may see additional diagnostic data on this page, including:

  • Object Storage:

    • Bucket/container count and object totals
    • Disk I/O metrics
  • Cache Service:

    • Keyspace breakdown (e.g., # keys by TTL)
    • Eviction stats
  • Database:

    • Long-running queries
    • Background worker stats
    • Transaction log volume

This data is hidden for standard users to reduce clutter and limit sensitive system-level insight.

  • The page is visible to all authenticated users
  • Admins see more detailed metrics, logs, or advanced diagnostics
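One way to enforce the split between standard and admin views is to filter metrics server-side before they are sent to the page, so that hidden values never reach a standard user's browser. The sketch below assumes hypothetical field names for the admin-only items listed above; the real keys depend on how metrics are modeled.

```python
from typing import Any, Dict, Set

# Hypothetical field names for the admin-only diagnostics listed above.
ADMIN_ONLY_FIELDS: Dict[str, Set[str]] = {
    "object_storage": {"object_total", "disk_io"},
    "cache": {"keyspace", "evictions"},
    "database": {
        "long_running_queries",
        "background_workers",
        "transaction_log_volume",
    },
}


def visible_metrics(service: str, metrics: Dict[str, Any],
                    is_admin: bool) -> Dict[str, Any]:
    """Return only the metrics the current user is allowed to see."""
    if is_admin:
        return dict(metrics)
    hidden = ADMIN_ONLY_FIELDS.get(service, set())
    return {key: value for key, value in metrics.items() if key not in hidden}
```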

Data Collection Strategy#

⚠️ Implementation Note: All system metrics should be gathered from in-process code using libraries or internal APIs.
Do not shell out to external binaries or system commands to collect data in production.

| Service | Access Notes |
|---|---|
| Object Storage | Use HTTP client to query health endpoints or admin API with access keys |
| Cache Service | Use service client library to query info/stats |
| Database | Use ORM or database driver to run system queries |
| Agents | Query internal ORM models and use performance timeseries methods for metrics |

  • Object Storage: Built-in health endpoints or basic probes via client library
  • Cache Service: INFO/STATS commands or metrics endpoint
  • Database: System activity views, replication status views, etc.

Observability Notes#

CipherSwarm prioritizes lightweight, embedded observability over heavy external integration. This health dashboard reflects that intent:

  • Metrics should be pulled directly from local service APIs or shallow internal probes.

  • Do not require Prometheus, OpenTelemetry, or external collectors to render this page.

  • However, hooks should be designed with extensibility in mind:

    • A shared metrics module or state management layer can abstract the source
    • If Prometheus or OpenTelemetry is adopted later, it should be easy to drop in as a provider

This keeps the UX fast, testable, and offline-compatible — aligning with CipherSwarm's goals.
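The shared metrics module mentioned above could take the shape of a small provider interface, so the page reads from one registry regardless of where the numbers come from. This is an illustrative sketch; `MetricsProvider`, `CallableProvider`, and `HealthRegistry` are hypothetical names, and a later Prometheus or OpenTelemetry provider would simply be another class with a `collect()` method.

```python
from typing import Any, Callable, Dict, Protocol


class MetricsProvider(Protocol):
    """Anything that can return a dictionary of metrics for one service."""

    def collect(self) -> Dict[str, Any]: ...


class CallableProvider:
    """Adapter that turns a plain probe function into a provider."""

    def __init__(self, fn: Callable[[], Dict[str, Any]]) -> None:
        self._fn = fn

    def collect(self) -> Dict[str, Any]:
        return self._fn()


class HealthRegistry:
    """Collects a snapshot from every registered provider."""

    def __init__(self) -> None:
        self._providers: Dict[str, MetricsProvider] = {}

    def register(self, name: str, provider: MetricsProvider) -> None:
        self._providers[name] = provider

    def snapshot(self) -> Dict[str, Dict[str, Any]]:
        result: Dict[str, Dict[str, Any]] = {}
        for name, provider in self._providers.items():
            try:
                result[name] = provider.collect()
            except Exception as exc:
                # One failing provider must not break the whole page.
                result[name] = {"up": False, "error": str(exc)}
        return result
```

Isolating failures per provider matches the card-based layout: an unreachable service turns its own card red while the others keep rendering.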

Object Storage via Client Library#

The storage client library can be used for basic health checks and metadata without shelling out:

| Metric | Method | Notes |
|---|---|---|
| Up/Down State | List buckets/containers | Fast + safe connectivity check |
| Bucket Count | Count buckets/containers | Low cost |
| Object Count (optional) | List and count objects | Expensive; use caching if needed |
| Storage Usage (optional) | Sum object sizes across buckets | Also expensive; cache if shown |

Recommended approach: show only bucket count and service reachability by default. Larger metrics should be backgrounded or admin-only.
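The recommended default (reachability plus bucket count) needs only a single list call. The sketch below assumes an S3-compatible client such as one created with boto3, whose `list_buckets()` returns a dictionary containing a `"Buckets"` list; the design does not name the storage backend, so that response shape is an assumption, and `probe_object_storage` is a hypothetical helper.

```python
import time
from typing import Any, Dict


def probe_object_storage(client: Any) -> Dict[str, Any]:
    """Check reachability and count buckets with one list call."""
    start = time.perf_counter()
    try:
        response = client.list_buckets()
    except Exception as exc:
        return {"up": False, "error": str(exc)}
    latency_ms = (time.perf_counter() - start) * 1000.0

    # Assumes the S3-style response shape: {"Buckets": [{...}, ...]}.
    buckets = response.get("Buckets", [])
    return {"up": True, "latency_ms": latency_ms, "bucket_count": len(buckets)}
```

Object counts and storage usage are deliberately left out here, since the table marks them as expensive; they would belong in a cached background job.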

Implementation Notes#

  • Status cards should be uniform height and width
  • Use icons and badge color to reinforce state
  • Consider sparkline or mini-chart if historical data is available