CipherSwarm System Health UX Design#

Last updated: 2025-10-12

Table of Contents#

CipherSwarm System Health UX Design

Purpose#

The System Health page serves as an operational dashboard, offering immediate insights into the status and performance of critical backend services. It aims to facilitate proactive monitoring and swift issue identification, ensuring the reliability and efficiency of CipherSwarm's infrastructure.

Layout Overview#

The page uses a responsive grid layout, with each service (Object Storage, Cache, Database) represented as a distinct status card. Each card provides real-time metrics and status indicators at a glance. The visual design follows a clean server status layout pattern, aligning with existing CipherSwarm styling.

Service Status Cards#

In addition to backend services, this page also displays the current operational state of all registered agents. Agents are considered system-level participants and should be included in the health view.

Object Storage#

Status Indicator: 🟢 Healthy, 🟡 Degraded, 🔴 Unreachable (color-coded badge)
Metrics:
- Latency (API response time)
- Errors (I/O and timeout count)
- Storage Utilization (used vs total capacity)
Health Source: Storage service health endpoint

Cache Service#

Cache service health is monitored through direct service queries. Do not use shell commands for metrics collection.

Metric	Source	Notes
Up/Down State	Service ping	Fast and safe health check
Latency	Simple command timing	e.g., GET/SET roundtrip time
Memory Usage	Service info	Returns bytes in use
Active Connections	Service info	Useful for system health card
Keyspace Stats	Service info (optional)	Admin-only detailed breakdown

These values are readily available without external tooling and can be updated live via real-time polling or background jobs.

Status Indicator: Color-coded badge
Metrics:
- Command Latency
- Memory Usage
- Active Connections
Health Source: Cache service info commands

Database#

Database health is monitored through pooled connections and system queries. No shell commands should be used.

Metric	Query or Method	Notes
Up/Down State	Simple `SELECT 1` with timeout	Indicates connection availability
Query Latency	Time an example query execution	Consider averaging or sampling over interval
Connection Pool	Pool inspection tools	Or count active connections
Long-running Queries	System activity view	Admin-only display
Replication Lag	Replication status view	Shown only if replication is configured
Background Workers	Background worker stats	Admin-only, useful for diagnosing slowdowns

Data should be cached briefly (5–15s) if queried frequently. Limit admin-only data to a collapsed view by default.

Status Indicator: Color-coded badge
Metrics:
- Query Latency
- Connection Pool Usage
- Replication Lag (if applicable)
Health Source: Database system views

Agents#

Status Indicator: Color-coded badge (🟢 Online, 🔴 Offline)
Metrics:
- Last seen timestamp
- Current assigned task (if any)
- Guess rate (if available)
Grouping: Display as a collapsible section or in its own row below Cache/Database services

Design Considerations#

Empty/Error State UX#

If a service cannot be reached, show a red badge and a concise message (e.g., "Cache service unreachable — last seen 2m ago").
Use skeleton loaders during initial load.
If no data is available for a metric, show a subtle placeholder (e.g., “N/A” or gray indicator) with tooltip explaining why.
Align with style-guide.md (dark mode, purple accent, etc.)
Use standard component patterns
Emphasize clarity and minimalism — show only relevant metrics
Enable hover tooltips for metric definitions if space is tight

Real-Time Behavior#

Update Strategy#

Live metrics should update every 5–10 seconds via real-time streaming from the JSON API.
Expensive queries (like object counts or database stats) should be cached for 30–60 seconds server-side.
Consider staggered refresh intervals or jitter to avoid burst load after page load.
Real-time streaming updates for all metrics
Stale data indication if no update received in 30s
Optional retry/backoff on failure, with error banners if a system is unreachable

Access Control#

Admin Access Enhancements#

Users with administrative privileges may see additional diagnostic data on this page, including:

Object Storage:
- Bucket/container count and object totals
- Disk I/O metrics
Cache Service:
- Keyspace breakdown (e.g., # keys by TTL)
- Eviction stats
Database:
- Long-running queries
- Background worker stats
- Transaction log volume

This data is hidden for standard users to reduce clutter and limit sensitive system-level insight.

Visible to all authenticated users
Admins see more detailed metrics, logs, or advanced diagnostics

Data Collection Strategy#

⚠️ Implementation Note: All system metrics should be gathered from in-process code using libraries or internal APIs.
Do not shell out to external binaries or system commands to collect data in production.

Service	Access Notes
Object Storage	Use HTTP client to query health endpoints or admin API with access keys
Cache Service	Use service client library to query info/stats
Database	Use ORM or database driver to run system queries
Agents	Query internal ORM models and use performance timeseries methods for metrics

Object Storage: Built-in health endpoints or basic probes via client library
Cache Service: INFO/STATS commands or metrics endpoint
Database: System activity views, replication status views, etc.

Observability Notes#

CipherSwarm prioritizes lightweight, embedded observability over heavy external integration. This health dashboard reflects that intent:

Metrics should be pulled directly from local service APIs or shallow internal probes.
Do not require Prometheus, OpenTelemetry, or external collectors to render this page.
However, hooks should be designed with extensibility in mind:
- A shared metrics module or state management layer can abstract the source
- If Prometheus or OpenTelemetry is adopted later, it should be easy to drop in as a provider

This keeps the UX fast, testable, and offline-compatible — aligning with CipherSwarm's goals.

Object Storage via Client Library#

The storage client library can be used for basic health checks and metadata without shelling out:

Metric	Method	Notes
Up/Down State	List buckets/containers	Fast + safe connectivity check
Bucket Count	Count buckets/containers	Low cost
Object Count (optional)	List and count objects	Expensive; use caching if needed
Storage Usage (optional)	Sum object sizes across buckets	Also expensive; cache if shown

Recommended approach: show only bucket count and service reachability by default. Larger metrics should be backgrounded or admin-only.

Implementation Notes#

Status cards should be uniform height and width
Use icons and badge color to reinforce state
Consider sparkline or mini-chart if historical data is available