Tech Plan: Complete Procmond Implementation#
Architectural Approach#
1. Core Architectural Decisions#
Child Process Model
- procmond runs as a child process spawned by daemoneye-agent
- daemoneye-agent's CollectorProcessManager handles lifecycle (start/stop/restart)
- Configuration and broker socket path passed via environment variables
- Single service deployment model (operators manage daemoneye-agent only)
Startup Coordination (Agent Loading State)
- Broker starts before agent spawns collectors (eliminates race condition)
- Agent spawns all configured collectors with broker socket path via environment variable
- Collectors connect to broker and register via RPC
- Collectors report "ready" status after successful registration
- Agent waits for all collectors to report "ready" before dropping privileges
- Agent remains in "loading state" until all configured collectors are ready
- Agent reads collector configuration from config file (defines which collectors to spawn)
- Agent transitions to "steady state" and broadcasts "begin monitoring" to control.collector.lifecycle
- All collectors subscribe to control.collector.lifecycle and start collection loops on receiving the command
- This ensures: (1) no race conditions, (2) agent drops privileges only when safe, (3) coordinated startup
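The readiness gate above can be sketched as a small pure function. This is an illustration only: `CollectorState`, `safe_to_drop_privileges`, and the collector names are hypothetical, not existing APIs.

```rust
use std::collections::HashMap;

// Hypothetical readiness states the agent could track per collector.
#[derive(PartialEq)]
enum CollectorState {
    Spawned,
    Registered,
    Ready,
}

/// The agent may drop privileges only once every collector named in the
/// configuration has reported "ready" after registering with the broker.
fn safe_to_drop_privileges(
    states: &HashMap<String, CollectorState>,
    configured: &[&str],
) -> bool {
    configured
        .iter()
        .all(|id| states.get(*id) == Some(&CollectorState::Ready))
}
```

A collector that has only registered (but not reported ready) keeps the agent in the loading state.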
Event-Driven Architecture
- Replace LocalEventBus with DaemoneyeEventBus for broker communication
- Use embedded broker pattern: daemoneye-agent runs DaemoneyeBroker, procmond connects as client
- Topic-based pub/sub for events: events.process.* hierarchy
- RPC patterns for lifecycle management: control.collector.procmond
Privilege Separation Model
- daemoneye-agent: Starts privileged, drops privileges after spawning collectors
- procmond: Maintains full privileges throughout runtime (restricted attack surface, no network)
- Rationale: procmond needs persistent elevated access for process enumeration; agent has network connectivity (larger attack surface) so drops privileges after initialization
2. Integration Strategy#
Phase 1: Event Bus Integration (Foundation)
- Direct refactoring: Replace LocalEventBus with DaemoneyeEventBus in ProcmondMonitorCollector
- Implement connection management with retry logic (3 attempts at startup, then exit)
- Add event buffering (10MB limit) with replay on reconnection
- Validate connectivity before starting collection (strict validation)
Phase 2: RPC Service Implementation (Lifecycle Management)
- Create RPC service handler in procmond to receive lifecycle commands
- Implement operations: Start, Stop, Restart, HealthCheck, UpdateConfig, GracefulShutdown
- Add registration/deregistration with daemoneye-agent on startup/shutdown
- Implement heartbeat publishing to control.health.heartbeat.procmond
Phase 3: Testing (TDD Approach)
- Unit tests for event bus integration (>80% coverage target)
- Integration tests for RPC communication
- Cross-platform tests (Linux, macOS, Windows)
- Chaos testing for resilience (connection failures, backpressure)
Phase 4: Security Hardening
- Implement privilege detection at startup (capabilities on Linux, tokens on Windows)
- Add data sanitization for command-line arguments and environment variables
- Validate security boundaries between procmond and agent
- Security test suite (privilege escalation, injection, DoS)
Phase 5: FreeBSD Support
- Validate FallbackProcessCollector on FreeBSD 13+
- Document limitations (basic metadata only, no enhanced features)
- Add platform detection and capability reporting
- Best-effort support (doesn't block Epic completion)
Phase 6: Performance Validation
- Benchmark process enumeration (target: 1,000 processes in <100ms)
- Load testing with 10,000+ processes
- Memory profiling (target: <100MB sustained)
- CPU monitoring (target: <5% sustained)
- Regression testing to prevent performance degradation
3. Key Trade-offs and Rationale#
Trade-off 1: Direct Refactoring vs. Parallel Implementation
- Decision: Direct refactoring (replace LocalEventBus in place)
- Rationale: Faster development velocity, simpler codebase; LocalEventBus is internal-only (no external dependencies)
- Risk Mitigation: Comprehensive testing before merging, feature branch development
Trade-off 2: Event Buffering with Write-Ahead Log
- Decision: Write-ahead log (WAL) with 10MB buffer and replay on reconnection
- Rationale: Prevents data loss during crashes or non-graceful termination, ensures event durability
- Implementation: Events persisted to disk before buffering, replayed on restart if procmond crashes
- Risk Mitigation: Bounded buffer size, WAL rotation to prevent disk exhaustion, backpressure when buffer full
Trade-off 3: Privilege Model
- Decision: procmond maintains full privileges, agent drops after spawning
- Rationale: procmond needs persistent elevated access; agent has larger attack surface (network connectivity)
- Risk Mitigation: procmond has no network access, minimal attack surface, runs as child process (isolated)
Trade-off 4: FreeBSD Support Level
- Decision: Best-effort basic enumeration, documented limitations
- Rationale: FreeBSD is a secondary platform; full feature parity would delay primary platform completion
- Risk Mitigation: Clear documentation of limitations, graceful degradation
4. Technical Constraints#
Platform Constraints
- Must support Linux, macOS, Windows (primary), FreeBSD (secondary)
- Must respect platform security boundaries (SELinux, AppArmor, SIP, UAC)
- Must use platform-native APIs for process enumeration
Performance Constraints
- CPU usage <5% sustained during continuous monitoring
- Memory usage <100MB during normal operation
- Process enumeration <100ms for 1,000 processes (average)
- Event publishing must handle backpressure gracefully
Security Constraints
- No unsafe code (workspace-level unsafe_code = "forbid")
- All external inputs must be validated and sanitized
- Privilege boundaries must be enforced and tested
- Audit trail for all security-relevant operations
Compatibility Constraints
- Must maintain backward compatibility with ProcessRecord data model
- Must integrate with existing collector-core framework
- Must use Rust 2024 edition with MSRV 1.91+
- Must follow workspace-level lints (warnings = "deny")
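The lint constraints above correspond to a workspace manifest fragment along these lines (a sketch; exact table placement depends on how the workspace is laid out):

```toml
# Workspace-level Cargo.toml (sketch)
[workspace.lints.rust]
unsafe_code = "forbid"
warnings = "deny"
```

Member crates then opt in with `[lints] workspace = true` in their own manifests.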
5. Deployment Architecture#
```mermaid
sequenceDiagram
    participant Operator
    participant Agent as daemoneye-agent
    participant Broker as DaemoneyeBroker (embedded)
    participant Procmond as procmond (child process)
    participant OS as Operating System
    Note over Operator,OS: System Startup
    Operator->>Agent: Start daemoneye-agent (privileged)
    Agent->>Broker: Initialize embedded broker
    Broker-->>Agent: Broker ready (socket path)
    Agent->>Procmond: Spawn procmond (privileged)<br/>ENV: DAEMONEYE_BROKER_SOCKET
    Procmond->>Broker: Connect to broker
    Broker-->>Procmond: Connection established
    Procmond->>Broker: Register (RPC)<br/>Topic: control.collector.procmond
    Broker->>Agent: Route registration
    Agent->>Agent: Wait for all collectors ready
    Agent-->>Broker: Registration accepted
    Broker-->>Procmond: Registration response
    Agent->>Agent: Drop privileges (after collectors ready)
    Agent->>Broker: Send "begin monitoring" command
    Broker->>Procmond: Route start command
    Note over Procmond,OS: Continuous Monitoring
    loop Every collection interval
        Procmond->>OS: Enumerate processes (privileged)
        OS-->>Procmond: Process list with metadata
        Procmond->>Procmond: Lifecycle analysis
        Procmond->>Broker: Publish events<br/>Topic: events.process.*
        Broker->>Agent: Deliver events
        Procmond->>Broker: Publish heartbeat<br/>Topic: control.health.heartbeat.procmond
    end
    Note over Operator,OS: Lifecycle Management
    Operator->>Agent: Request health check (via CLI)
    Agent->>Broker: Health check RPC<br/>Topic: control.collector.procmond
    Broker->>Procmond: Route health check
    Procmond-->>Broker: Health status
    Broker-->>Agent: Health response
    Agent-->>Operator: Display health status
    Note over Operator,OS: Graceful Shutdown
    Operator->>Agent: Stop daemoneye-agent
    Agent->>Broker: Graceful shutdown RPC<br/>Topic: control.collector.procmond
    Broker->>Procmond: Route shutdown
    Procmond->>Procmond: Complete current cycle
    Procmond->>Broker: Flush buffered events
    Procmond->>Broker: Deregister
    Procmond-->>Agent: Exit (success)
    Agent->>Broker: Shutdown broker
    Agent-->>Operator: Shutdown complete
```
Data Model#
1. Existing Data Models (No Changes Required)#
ProcessEvent (collector-core)
```rust
// Used for event bus communication
pub struct ProcessEvent {
    pub pid: u32,
    pub ppid: Option<u32>,
    pub name: String,
    pub executable_path: Option<String>,
    pub command_line: Vec<String>,
    pub start_time: Option<SystemTime>,
    pub cpu_usage: Option<f64>,
    pub memory_usage: Option<u64>,
    pub executable_hash: Option<String>,
    pub user_id: Option<String>,
    pub accessible: bool,
    pub file_exists: bool,
    pub timestamp: SystemTime,
    pub platform_metadata: Option<serde_json::Value>,
}
```
ProcessRecord (daemoneye-lib)
```rust
// Used for database storage
pub struct ProcessRecord {
    pub id: ProcessId,
    pub name: String,
    pub executable_path: Option<String>,
    pub command_line: Option<String>,
    pub parent_id: Option<ProcessId>,
    pub start_time: Option<DateTime<Utc>>,
    pub cpu_usage: Option<f64>,
    pub memory_usage: Option<u64>,
    pub status: ProcessStatus,
    pub user_id: Option<String>,
    pub executable_hash: Option<String>,
    // ... additional fields
}
```
ProcessSnapshot (procmond)
```rust
// Used for lifecycle tracking
pub struct ProcessSnapshot {
    pub pid: u32,
    pub ppid: Option<u32>,
    pub name: String,
    pub executable_path: Option<String>,
    pub command_line: Vec<String>,
    pub start_time: Option<SystemTime>,
    pub cpu_usage: Option<f64>,
    pub memory_usage: Option<u64>,
    pub executable_hash: Option<String>,
    pub user_id: Option<String>,
    pub accessible: bool,
    pub file_exists: bool,
    pub snapshot_time: SystemTime,
    pub platform_metadata: Option<serde_json::Value>,
}
```
Conversion Functions (Already Exist)
- ProcessEvent ↔ ProcessSnapshot: Bidirectional conversion via From trait
- ProcessRecord ← ProcessEvent: One-way conversion for database storage
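The `From`-based conversion mentioned above might look like this trimmed sketch (field lists abbreviated to three fields; the real structs carry the full sets shown earlier):

```rust
use std::time::SystemTime;

// Abbreviated stand-ins for the real structs.
struct ProcessEvent {
    pid: u32,
    name: String,
    timestamp: SystemTime,
}

struct ProcessSnapshot {
    pid: u32,
    name: String,
    snapshot_time: SystemTime,
}

impl From<ProcessEvent> for ProcessSnapshot {
    fn from(event: ProcessEvent) -> Self {
        ProcessSnapshot {
            pid: event.pid,
            name: event.name,
            // The event's observation time becomes the snapshot time.
            snapshot_time: event.timestamp,
        }
    }
}
```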
2. New Configuration Models#
EventBusConfig (New)
```rust
// Configuration for event bus connection
pub struct EventBusConfig {
    pub broker_socket_path: String,            // From DAEMONEYE_BROKER_SOCKET env var
    pub connection_timeout: Duration,          // Default: 10 seconds
    pub event_buffer_size_bytes: usize,        // Default: 10MB
    pub heartbeat_interval: Duration,          // Default: 30 seconds
    pub enable_event_buffering: bool,          // Default: true
    pub wal_directory: PathBuf,                // Write-ahead log directory
    pub wal_max_size_bytes: usize,             // Default: 100MB (10x buffer)
    pub wal_rotation_threshold: f64,           // Default: 0.8 (80% full)
    pub backpressure_buffer_threshold: f64,    // Default: 0.7 (70% full triggers backpressure)
    pub backpressure_interval_multiplier: f64, // Default: 1.5 (increase interval by 50%)
}
```
RpcServiceConfig (New)
```rust
// Configuration for RPC service
pub struct RpcServiceConfig {
    pub collector_id: String,                // Default: "procmond"
    pub collector_type: String,              // Default: "process-monitor"
    pub registration_timeout: Duration,      // Default: 10 seconds
    pub health_check_timeout: Duration,      // Default: 5 seconds
    pub graceful_shutdown_timeout: Duration, // Default: 60 seconds
}
```
ActorMessage (New)
```rust
// Messages sent to ProcmondMonitorCollector actor
pub enum ActorMessage {
    HealthCheck {
        respond_to: oneshot::Sender<HealthCheckData>,
    },
    UpdateConfig {
        config: ProcmondMonitorConfig,
        respond_to: oneshot::Sender<Result<()>>,
    },
    GracefulShutdown {
        respond_to: oneshot::Sender<Result<()>>,
    },
    BeginMonitoring, // From control.collector.lifecycle broadcast
    AdjustInterval {
        new_interval: Duration,     // From EventBusConnector backpressure
        reason: BackpressureReason, // BufferFull, Reconnecting, etc.
    },
}

pub enum BackpressureReason {
    BufferFull { level_percent: f64 },
    Reconnecting,
    WalRotation,
}
```
WriteAheadLogEntry (New)
```rust
// Entry in the write-ahead log (bincode serialization)
pub struct WriteAheadLogEntry {
    pub sequence: u64,         // Monotonic sequence number
    pub timestamp: SystemTime, // When event was written
    pub event: ProcessEvent,   // The actual event
    pub checksum: u32,         // CRC32 for corruption detection
}
```
WAL File Format:
- Binary format using bincode serialization for efficiency
- Sequence-numbered files: procmond-{sequence:05}.wal (e.g., procmond-00001.wal)
- Each file contains multiple WriteAheadLogEntry records
- Rotation at 80% of max size (80MB of 100MB default)
- Delete WAL file after all events successfully published to broker
- Corruption handling: Skip corrupted entries (CRC32 validation), log warning, continue with next entry
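The naming and rotation rules above reduce to two small helpers (a std-only sketch of the arithmetic, not the real `wal.rs`):

```rust
/// WAL files are sequence-numbered: procmond-{sequence:05}.wal
fn wal_file_name(sequence: u64) -> String {
    format!("procmond-{sequence:05}.wal")
}

/// Rotation triggers once the current file reaches the configured
/// fraction of the maximum size (default threshold: 0.8).
fn should_rotate(current_size_bytes: u64, max_size_bytes: u64, threshold: f64) -> bool {
    current_size_bytes as f64 >= max_size_bytes as f64 * threshold
}
```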
3. Event Bus Message Schemas#
Registration Message
```rust
// Published to: control.collector.procmond (RPC)
pub struct RegistrationRequest {
    pub collector_id: String,                // "procmond"
    pub collector_type: String,              // "process-monitor"
    pub hostname: String,                    // System hostname
    pub version: Option<String>,             // procmond version
    pub pid: Option<u32>,                    // procmond PID
    pub capabilities: Vec<String>,           // ["process"]
    pub attributes: HashMap<String, String>, // Platform-specific attributes
    pub heartbeat_interval_ms: Option<u64>,  // Requested heartbeat interval
}
```
Heartbeat Message
```rust
// Published to: control.health.heartbeat.procmond
pub struct HeartbeatData {
    pub collector_id: String,  // "procmond"
    pub timestamp: SystemTime, // Current time
    pub sequence: u64,         // Monotonic sequence number
    pub status: HealthStatus,  // Healthy/Degraded/Unhealthy
}
```
Process Event Message
```rust
// Published to: events.process.batch or events.process.lifecycle
// Uses existing ProcessEvent struct (no changes needed)
```
4. Data Flow#
```mermaid
flowchart TD
    A[OS Process APIs] -->|Raw Process Data| B[ProcessCollector]
    B -->|ProcessEvent| C[LifecycleTracker]
    C -->|ProcessSnapshot| C
    C -->|ProcessLifecycleEvent| D[ProcmondMonitorCollector Actor]
    D -->|ProcessEvent| E[EventBusConnector]
    E -->|Persist| WAL[Write-Ahead Log<br/>Disk]
    E -->|Buffer| F[Event Buffer<br/>10MB Memory]
    F -->|Publish| G[DaemoneyeEventBus]
    G -->|Topic: events.process.*| H[DaemoneyeBroker]
    H -->|Deliver| I[daemoneye-agent]
    I -->|ProcessRecord| J[Database]
    K[RPC Commands] -->|control.collector.procmond| H
    H -->|Route| L[RpcServiceHandler]
    L -->|Actor Messages| D
    D -->|Oneshot Responses| L
    D -->|Heartbeat| M[RegistrationManager]
    M -->|control.health.heartbeat.procmond| H
    WAL -.->|Replay on Restart| E
    F -.->|Backpressure 70%| D
    style WAL fill:#ffa,stroke:#333,stroke-width:2px
    style F fill:#f9f,stroke:#333,stroke-width:2px
    style H fill:#bbf,stroke:#333,stroke-width:2px
    style D fill:#afa,stroke:#333,stroke-width:2px
```
Component Architecture#
1. New Components#
WriteAheadLog (New)
- Responsibility: Durable event persistence for crash recovery
- Location: procmond/src/wal.rs
- Key Functions:
  - Persist events to disk using bincode serialization (append-only log)
  - Use sequence-numbered files: procmond-{sequence:05}.wal
  - Rotate log files when size reaches 80% of max (80MB of 100MB default)
  - Replay events from WAL on startup (crash recovery)
  - Delete WAL files after all events successfully published to broker
  - Handle WAL corruption (skip corrupted entries with CRC32 validation, log warning, continue)
  - Track which events have been published (mark for deletion)
EventBusConnector (New)
- Responsibility: Manage connection to daemoneye-agent's embedded broker with durable event buffering
- Location: procmond/src/event_bus_connector.rs
- Key Functions:
  - Connect to broker via socket path from DAEMONEYE_BROKER_SOCKET env var
  - Integrate with WriteAheadLog for event persistence (write before buffering)
  - Buffer events (10MB limit) when connection lost
  - Replay buffered events (from WAL) on reconnection or restart
  - Publish events to topic hierarchy (events.process.*)
  - Dynamic backpressure: Monitor buffer level (70% threshold triggers backpressure)
  - Send ActorMessage::AdjustInterval to MonitorCollector via shared channel reference
  - Calculate new interval: current_interval * 1.5 (50% increase)
  - Release backpressure when buffer drops below 50% (send AdjustInterval with original interval)
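The backpressure arithmetic described above (engage above 70%, grow the interval by the 1.5x multiplier, restore the original below 50%) can be sketched as a pure helper. `adjusted_interval` is illustrative, not an existing function:

```rust
use std::time::Duration;

/// Pick the next collection interval from the current buffer fill level.
/// Thresholds and multiplier mirror the EventBusConfig defaults
/// (0.70 engage, 0.50 release, 1.5x growth).
fn adjusted_interval(current: Duration, original: Duration, buffer_level: f64) -> Duration {
    if buffer_level > 0.70 {
        current.mul_f64(1.5) // back off: collect less often
    } else if buffer_level < 0.50 {
        original // pressure released: restore configured interval
    } else {
        current // hysteresis band between 50% and 70%: hold steady
    }
}
```

The gap between the 70% engage and 50% release thresholds gives hysteresis, so the interval does not oscillate when the buffer hovers near one threshold.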
RpcServiceHandler (New)
- Responsibility: Handle incoming RPC requests and coordinate with MonitorCollector via actor pattern
- Location: procmond/src/rpc_service.rs
- Key Functions:
  - Subscribe to control.collector.procmond topic (for RPC requests)
  - Subscribe to control.collector.lifecycle topic (for "begin monitoring" broadcast)
  - Handle lifecycle operations: Start, Stop, Restart, HealthCheck, UpdateConfig, GracefulShutdown
  - Send messages to MonitorCollector actor via bounded mpsc channel (capacity: 100)
  - Wait for MonitorCollector responses via oneshot channels
  - Return RPC responses with appropriate status codes
  - Handle channel full errors (return RPC error if actor channel full)
  - Serialize concurrent RPC requests (process one at a time)
RegistrationManager (New)
- Responsibility: Handle collector registration and heartbeat publishing
- Location: procmond/src/registration.rs
- Key Functions:
  - Register with daemoneye-agent on startup via RPC
  - Report "ready" status after successful registration
  - Publish periodic heartbeats to control.health.heartbeat.procmond (every 30 seconds)
  - Include health status in heartbeat (Healthy/Degraded/Unhealthy)
  - Deregister on graceful shutdown
  - Track registration state and heartbeat sequence number
ConfigurationManager (Enhanced)
- Responsibility: Manage configuration with hot-reload support at cycle boundaries
- Location: procmond/src/config.rs (enhance existing)
- Key Functions:
  - Load configuration from environment variables and config files
  - Validate configuration changes via RPC
  - Apply configuration updates at next collection cycle boundary (atomic)
  - Send configuration change message to MonitorCollector actor
  - Document which configurations are hot-reloadable vs. require restart
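Cycle-boundary application can be modeled as a staged slot that the actor drains at the top of each cycle. `ConfigSlot` is a generic sketch, not an existing type:

```rust
/// A staged configuration: RPC updates land in `pending`, and the actor
/// swaps them in atomically at the start of the next collection cycle.
struct ConfigSlot<T> {
    active: T,
    pending: Option<T>,
}

impl<T> ConfigSlot<T> {
    /// Called from the UpdateConfig RPC handler.
    fn stage(&mut self, new_config: T) {
        self.pending = Some(new_config);
    }

    /// Called at the start of each collection cycle; returns true if a
    /// new configuration was applied.
    fn apply_at_cycle_start(&mut self) -> bool {
        match self.pending.take() {
            Some(config) => {
                self.active = config;
                true
            }
            None => false,
        }
    }
}
```

Because the swap happens only at the cycle boundary, a running collection pass never sees a half-applied configuration.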
2. Modified Components#
ProcmondMonitorCollector (Modified)
- Changes:
  - Replace LocalEventBus with DaemoneyeEventBus (via EventBusConnector)
  - Implement actor pattern: Process messages from bounded mpsc channel (capacity: 100)
  - Add configuration hot-reload at cycle boundaries (atomic application)
  - Enhance health check to include event bus connectivity status
  - Wait for "begin monitoring" broadcast on control.collector.lifecycle before starting collection loop
  - Respond to dynamic interval adjustments from EventBusConnector backpressure
  - Provide shared channel reference to EventBusConnector for backpressure signaling
- Location: file/src/monitor_collector.rs
main.rs (Modified)
- Changes:
  - Read DAEMONEYE_BROKER_SOCKET environment variable
  - Initialize WriteAheadLog with configured directory
  - Initialize EventBusConnector with WAL integration
  - Create bounded mpsc channel (capacity: 100) for actor messages
  - Initialize RpcServiceHandler with channel sender and topic subscriptions
  - Initialize RegistrationManager for registration and heartbeat
  - Pass channel sender to EventBusConnector for backpressure signaling
  - Initialize ProcmondMonitorCollector as actor with channel receiver
  - Add graceful shutdown coordination with RPC
- Location: file/src/main.rs
3. Component Interactions#
```mermaid
sequenceDiagram
    participant Main as main.rs
    participant Config as ConfigurationManager
    participant EventBus as EventBusConnector
    participant Reg as RegistrationManager
    participant RPC as RpcServiceHandler
    participant Monitor as ProcmondMonitorCollector
    participant Collector as ProcessCollector
    participant Lifecycle as LifecycleTracker
    Note over Main,Lifecycle: Startup Sequence
    Main->>Config: Load configuration
    Config-->>Main: EventBusConfig + RpcServiceConfig
    Main->>EventBus: Connect to broker
    EventBus-->>Main: Connection established
    Main->>Reg: Register with agent
    Reg->>EventBus: Publish registration (RPC)
    EventBus-->>Reg: Registration accepted
    Main->>Main: Wait for "begin monitoring" command
    EventBus->>Main: Receive start command from agent
    Main->>RPC: Start RPC service
    RPC->>EventBus: Subscribe to control.collector.procmond
    Main->>Monitor: Create collector
    Monitor->>Collector: Initialize platform collector
    Monitor->>Lifecycle: Initialize lifecycle tracker
    Main->>Monitor: Start monitoring
    Note over Main,Lifecycle: Runtime Operation
    loop Every collection interval
        Monitor->>Collector: Collect processes
        Collector-->>Monitor: ProcessEvent list
        Monitor->>Lifecycle: Update and detect changes
        Lifecycle-->>Monitor: ProcessLifecycleEvent list
        Monitor->>EventBus: Publish events
        EventBus->>EventBus: Write to WAL, then buffer
        EventBus->>EventBus: Check buffer level for backpressure
        alt Buffer > 70% full
            EventBus->>Monitor: Increase collection interval (backpressure)
        end
    end
    loop Every heartbeat interval
        Reg->>EventBus: Publish heartbeat
    end
    Note over Main,Lifecycle: RPC Request Handling
    EventBus->>RPC: Incoming RPC request
    RPC->>RPC: Parse request
    alt HealthCheck
        RPC->>Monitor: Send health check message (actor)
        Monitor-->>RPC: Health data via oneshot
        RPC->>EventBus: Publish response
    else UpdateConfig
        RPC->>Config: Validate config changes
        Config->>Monitor: Send config update message (actor)
        Note over Monitor: Config applied at next cycle boundary
        Monitor-->>RPC: Update result via oneshot
        RPC->>EventBus: Publish response
    else GracefulShutdown
        RPC->>Monitor: Send shutdown message (actor)
        Monitor->>Monitor: Complete current cycle
        Monitor->>EventBus: Flush buffered events + WAL
        Monitor-->>RPC: Shutdown ready via oneshot
        RPC->>EventBus: Publish response
        RPC->>Reg: Deregister
        RPC->>Main: Signal shutdown
    end
    Note over Main,Lifecycle: Graceful Shutdown
    Main->>Monitor: Stop monitoring
    Monitor->>Collector: Cleanup
    Monitor->>Lifecycle: Cleanup
    Main->>EventBus: Disconnect
    Main->>Main: Exit
```
4. Actor Pattern Coordination#
ProcmondMonitorCollector as Actor:
- Runs in its own task with message processing loop
- Receives messages via mpsc channel from RpcServiceHandler
- Processes messages sequentially (no concurrent state mutations)
- Responds via oneshot channels for request/response patterns
Message Types:
```rust
enum ActorMessage {
    HealthCheck { respond_to: oneshot::Sender<HealthCheckData> },
    UpdateConfig { config: Config, respond_to: oneshot::Sender<Result<()>> },
    GracefulShutdown { respond_to: oneshot::Sender<Result<()>> },
    BeginMonitoring,                           // From agent after loading state
    AdjustInterval { new_interval: Duration }, // From EventBusConnector backpressure
}
```
Coordination Benefits:
- Eliminates race conditions (single-threaded message processing)
- Simplifies state management (no complex locking)
- Clear request/response semantics via oneshot channels
- Serializes concurrent RPC requests automatically
Configuration Hot-Reload at Cycle Boundary:
- Config update message queued in actor's message channel
- Actor processes message at start of next collection cycle
- Ensures atomic config application (no mid-cycle changes)
- Some configs may require restart (documented in ConfigurationManager)
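A minimal illustration of this message-processing loop, using std channels in place of the tokio bounded mpsc/oneshot pair the plan implies (message and response types are simplified stand-ins):

```rust
use std::sync::mpsc;
use std::thread;

// Simplified stand-in for ActorMessage.
enum Msg {
    HealthCheck { respond_to: mpsc::Sender<&'static str> },
    GracefulShutdown,
}

/// The actor drains its mailbox sequentially, so state is never mutated
/// concurrently and RPC requests are serialized for free.
fn run_actor(mailbox: mpsc::Receiver<Msg>) {
    for msg in mailbox {
        match msg {
            Msg::HealthCheck { respond_to } => {
                let _ = respond_to.send("healthy");
            }
            Msg::GracefulShutdown => break,
        }
    }
}
```

In the real collector the mailbox would be a bounded channel (capacity 100) and responses would travel over oneshot channels, but the sequential-processing property is the same.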
5. Integration Points#
With daemoneye-agent:
- BrokerManager: Spawns procmond as child process, manages lifecycle
- CollectorProcessManager: Monitors procmond process health, handles restarts
- CollectorRegistry: Tracks procmond registration and heartbeat status
- RPC Clients: Sends lifecycle commands to procmond
- Loading State Management:
  - Agent initializes broker first (before spawning collectors)
  - Agent spawns all configured collectors with DAEMONEYE_BROKER_SOCKET env var
  - Agent waits for all collectors to register and report "ready" status
  - Agent drops privileges only after all collectors are ready
  - Agent sends "begin monitoring" command to transition collectors to steady state
- Heartbeat Monitoring: Agent detects missed heartbeats (3+ consecutive) and takes escalating actions:
  - Send health check RPC (timeout: 5 seconds) - verify responsiveness
  - Send graceful shutdown RPC (timeout: 60 seconds) - attempt clean shutdown
  - Kill procmond process (force termination) - last resort
  - Restart procmond via CollectorProcessManager - restore service
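The escalation ladder above can be captured as a pure decision function (a sketch of the policy only; `RecoveryAction` and `next_recovery_action` are illustrative names):

```rust
#[derive(Debug, PartialEq)]
enum RecoveryAction {
    None,
    HealthCheck,      // RPC, 5s timeout
    GracefulShutdown, // RPC, 60s timeout
    ForceKill,        // last resort
    Restart,          // via CollectorProcessManager
}

/// Escalate only after 3 consecutive missed heartbeats, stepping one
/// rung further each time the previous action failed to recover.
fn next_recovery_action(consecutive_misses: u32, failed_attempts: u32) -> RecoveryAction {
    if consecutive_misses < 3 {
        return RecoveryAction::None;
    }
    match failed_attempts {
        0 => RecoveryAction::HealthCheck,
        1 => RecoveryAction::GracefulShutdown,
        2 => RecoveryAction::ForceKill,
        _ => RecoveryAction::Restart,
    }
}
```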
With daemoneye-eventbus:
- DaemoneyeBroker: Embedded broker that procmond connects to
- Topic Hierarchy: events.process.* for events, control.collector.procmond for RPC
- RPC Patterns: Request/response for lifecycle management
With collector-core:
- EventSource trait: ProcmondMonitorCollector implements this interface
- MonitorCollector trait: Provides statistics and health check interface
- ProcessEvent: Standard event format for process data
AgentCollectorConfig (New)
```yaml
# Agent configuration file: /etc/daemoneye/agent.yaml
collectors:
  - id: procmond
    type: process-monitor
    binary_path: /usr/bin/procmond
    enabled: true
    auto_restart: true
    startup_timeout_secs: 60
    config:
      collection_interval_secs: 30
      enhanced_metadata: true
      compute_hashes: false
```
6. daemoneye-agent Enhancements Required#
Collector Configuration Loading (New)
- Load collector configuration from /etc/daemoneye/agent.yaml on startup
- Parse collector list with binary paths, enabled status, and auto-restart settings
- Validate collector binary paths exist and are executable
- Spawn collectors in order defined in configuration file
- Pass collector-specific configuration via environment variables or config files
Loading State Management (New)
- Add state machine: Loading → Ready → Steady State
- Track collector readiness: Wait for all collectors to report "ready"
- Privilege dropping: Drop privileges only after all collectors ready
- Transition command: Broadcast "begin monitoring" to control.collector.lifecycle when entering steady state
- Timeout: If collectors don't report ready within timeout (60s default), fail startup with error
Heartbeat Failure Detection (Enhanced)
- Monitor heartbeat messages from all collectors
- Track missed heartbeat count per collector (threshold: 3 consecutive)
- Implement escalating recovery actions:
  - Health check RPC with 5-second timeout
  - Graceful shutdown RPC with 60-second timeout
  - Force kill via CollectorProcessManager
  - Automatic restart via CollectorProcessManager (if auto_restart enabled in config)
- Log all recovery actions for operator visibility
- Emit alerts for repeated collector failures (e.g., 3+ restarts in 10 minutes)
Configuration Push (Enhanced)
- Validate configuration changes before pushing to collectors
- Send configuration updates via RPC to control.collector.{collector_id}
- Track which configurations require restart vs. hot-reload
- Handle configuration update failures (rollback or retry)
- Support configuration validation without applying (validate_only mode)
7. Error Handling Strategy#
Connection Failures:
- Startup: Broker ready before spawn (no retry needed at startup)
- Runtime: Buffer events (10MB limit) with write-ahead log, attempt reconnection, replay on success
- If buffer full: Dynamic interval adjustment - connector increases collection interval by 50%
- WAL persistence: Events written to disk before buffering, replayed on restart after crash
- Reconnection: Exponential backoff (1s, 2s, 4s, 8s, max 30s) with indefinite retries
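The reconnection schedule above (1s doubling up to a 30s cap, retrying forever) fits in one helper; `backoff_delay` is an illustrative name:

```rust
use std::time::Duration;

/// Delay before reconnection attempt `attempt` (0-based): 1s, 2s, 4s,
/// 8s, 16s, then capped at 30s for every later attempt.
fn backoff_delay(attempt: u32) -> Duration {
    let secs = 1u64.checked_shl(attempt).unwrap_or(u64::MAX).min(30);
    Duration::from_secs(secs)
}
```

The `checked_shl` guards against overflow for very large attempt counts, which matters because retries are indefinite.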
Heartbeat Failures:
- Agent detects missed heartbeats (threshold: 3 consecutive misses)
- Escalating recovery actions:
  - Health check RPC (timeout: 5s) - verify procmond is responsive
  - Graceful shutdown RPC (timeout: 60s) - attempt clean shutdown
  - Force kill - terminate procmond process
  - Restart - spawn new procmond instance
- Heartbeat independence: Heartbeat publishing runs in separate task (not blocked by collection)
RPC Failures:
- Invalid requests: Return error response with details
- Timeout: Return timeout error after configured duration
- State conflicts: Return error with current state information
- Concurrent requests: Serialize via actor pattern (process one at a time)
- Actor message failures: Return error if actor channel closed or full
Collection Failures:
- Permission denied: Log error, skip process, continue with others
- Platform API failure: Fall back to basic sysinfo collector
- Timeout: Cancel collection, report degraded health status
- Cycle boundary: Configuration changes applied only at cycle start (atomic)
Resource Exhaustion:
- Memory approaching limit: Reduce buffer size, disable enhanced metadata, rotate WAL
- CPU usage high: Increase collection interval, reduce metadata collection
- Event buffer full: Dynamic interval adjustment (increase by 50%), WAL rotation
- WAL disk space low: Rotate and compress old WAL files, alert operator
8. Testing Strategy#
Unit Tests (>80% coverage target):
- WriteAheadLog: Persistence, rotation, replay, corruption recovery, compression
- EventBusConnector: Connection, WAL integration, buffering, replay, dynamic backpressure
- RpcServiceHandler: Request parsing, actor message sending, response handling, concurrent request serialization
- RegistrationManager: Registration, "ready" reporting, heartbeat, deregistration
- ConfigurationManager: Loading, validation, cycle-boundary hot-reload, restart detection
- Actor Pattern: Message processing, oneshot responses, channel handling
Integration Tests:
- Event bus communication: Publish/subscribe, reconnection, buffering
- RPC communication: Lifecycle operations, health checks, config updates
- Cross-platform: Linux, macOS, Windows process enumeration
- Lifecycle tracking: Start/stop/modification detection
Chaos Tests:
- Connection failures: Broker restart, network interruption
- Backpressure: Slow consumer, high event volume
- Resource limits: Memory constraints, CPU throttling
- Concurrent operations: Multiple RPC requests, collection during shutdown
Security Tests:
- Privilege escalation: Attempt to gain unauthorized access
- Injection attacks: Malicious process names, command lines
- DoS attacks: Excessive RPC requests, event flooding
- Data sanitization: Verify secrets are not logged or published
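The data sanitization check can be illustrated with a small redaction helper. The key list and the `<redacted>` marker are hypothetical; the real sanitizer would be configurable:

```rust
/// Redact the value of KEY=VALUE arguments whose key looks sensitive
/// before the argument is logged or published to the event bus.
fn sanitize_arg(arg: &str) -> String {
    const SENSITIVE_MARKERS: [&str; 3] = ["PASSWORD", "TOKEN", "SECRET"];
    if let Some((key, _value)) = arg.split_once('=') {
        let upper = key.to_uppercase();
        if SENSITIVE_MARKERS.iter().any(|m| upper.contains(m)) {
            return format!("{key}=<redacted>");
        }
    }
    arg.to_string()
}
```

A security test can then assert that no redacted marker's original value ever appears in published events or logs.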
Implementation Phases#
Phase 1: Event Bus Integration (Week 1-2)#
Goal: Replace LocalEventBus with DaemoneyeEventBus with durable buffering
Tasks:
- Create WriteAheadLog component for event persistence
- Create EventBusConnector with WAL integration and dynamic backpressure
- Implement event buffering (10MB limit) with WAL persistence
- Implement WAL replay on startup (crash recovery)
- Update ProcmondMonitorCollector to use EventBusConnector and actor pattern
- Add environment variable reading for broker socket path
- Implement startup coordination (wait for "begin monitoring" command)
- Unit tests for WriteAheadLog and EventBusConnector
- Integration tests for event publishing, WAL replay, and backpressure
Success Criteria:
- procmond connects to daemoneye-agent's broker on startup
- Events published to events.process.* topics
- WAL persists events before buffering
- WAL replay works after crash (events not lost)
- Dynamic backpressure adjusts collection interval when buffer fills
- procmond waits for agent's "begin monitoring" command before starting collection
Phase 2: RPC Service Implementation (Week 3-4)#
Goal: Enable lifecycle management via RPC with actor pattern coordination
Tasks:
procmond Changes:
- Implement actor pattern in ProcmondMonitorCollector (message processing loop)
- Create ActorMessage enum for actor communication
- Create RpcServiceHandler with actor message sending via mpsc channel
- Implement lifecycle operations: Start, Stop, Restart, HealthCheck, UpdateConfig, GracefulShutdown
- Implement configuration hot-reload at cycle boundaries
- Create RegistrationManager for registration, "ready" reporting, and heartbeat
- Implement "begin monitoring" command handling (wait before starting collection)
- Unit tests for RpcServiceHandler, actor coordination, and RegistrationManager
daemoneye-agent Changes:
- Add collector configuration file format (/etc/daemoneye/agent.yaml)
- Implement configuration loading and validation on agent startup
- Implement loading state management (Loading → Ready → Steady State)
- Add collector readiness tracking (wait for all collectors to report "ready")
- Implement privilege dropping after all collectors ready
- Add "begin monitoring" broadcast to control.collector.lifecycle topic
- Implement heartbeat failure detection with escalating actions:
  - Track missed heartbeats per collector (threshold: 3 consecutive)
  - Action 1: Health check RPC (timeout: 5s)
  - Action 2: Graceful shutdown RPC (timeout: 60s)
  - Action 3: Force kill via CollectorProcessManager
  - Action 4: Automatic restart (if auto_restart enabled)
- Integration tests for RPC communication and loading state coordination
Success Criteria:
- procmond registers with daemoneye-agent on startup
- procmond reports "ready" status after registration
- Agent waits for procmond "ready" before dropping privileges
- Agent sends "begin monitoring" command after all collectors ready
- procmond waits for "begin monitoring" before starting collection loop
- Heartbeats published every 30 seconds
- Agent detects missed heartbeats and takes escalating actions (health check → graceful shutdown → kill → restart)
- Health check RPC returns accurate status via actor pattern
- Graceful shutdown RPC completes within timeout
- Configuration update RPC applies changes at next cycle boundary (atomic)
Phase 3: Testing (TDD Approach) (Week 5-6)#
Goal: Achieve >80% unit coverage, >90% critical path coverage
Tasks:
- Expand unit test coverage for all new components
- Create integration test suite for event bus and RPC
- Add cross-platform tests (Linux, macOS, Windows)
- Implement chaos tests for resilience
- Add security tests for privilege and injection
- Performance baseline tests
Success Criteria:
- Unit test coverage >80%
- Critical path coverage >90% (enumeration, event bus, RPC, security)
- All tests pass on Linux, macOS, Windows
- Chaos tests validate resilience to failures
Phase 4: Security Hardening (Week 7)#
Goal: Implement privilege management and data sanitization
Tasks:
- Add privilege detection at startup (capabilities, tokens)
- Implement data sanitization for command-line args and env vars
- Validate security boundaries between procmond and agent
- Add security test suite (privilege escalation, injection, DoS)
- Document security model and threat analysis
Success Criteria:
- Privilege detection works on all platforms
- Sensitive data sanitized before logging/publishing
- Security tests pass with no critical vulnerabilities
- Security documentation complete
Phase 5: FreeBSD Support (Week 8)#
Goal: Validate basic process enumeration on FreeBSD
Tasks:
- Test FallbackProcessCollector on FreeBSD 13+
- Document limitations (basic metadata only)
- Add platform detection and capability reporting
- Create FreeBSD-specific tests
- Update documentation with FreeBSD support status
Success Criteria:
- Basic process enumeration works on FreeBSD
- Limitations documented clearly
- Platform detection reports FreeBSD correctly
- Tests pass on FreeBSD 13+
Phase 6: Performance Validation (Week 9)#
Goal: Validate performance against targets
Tasks:
- Benchmark process enumeration (1,000 processes target: <100ms)
- Load testing with 10,000+ processes
- Memory profiling (target: <100MB sustained)
- CPU monitoring (target: <5% sustained)
- Regression testing to prevent degradation
- Performance optimization if targets not met
Success Criteria:
- Enumerate 1,000 processes in <100ms (average)
- Support 10,000+ processes without degradation
- Memory usage <100MB during normal operation
- CPU usage <5% during continuous monitoring
- No performance regressions
References#
- Epic Brief: spec:54226c8a-719a-479a-863b-9c91f43717a9/0fc3298b-37df-4722-a761-66a5a0da16b3
- Core Flows: spec:54226c8a-719a-479a-863b-9c91f43717a9/f086f464-1e81-42e8-89f5-74a8638360d1
- Event Bus Architecture: file/embedded-broker-architecture.md
- Topic Hierarchy: file/docs/topic-hierarchy.md
- RPC Patterns: file/docs/rpc-patterns.md
- Process Collector: file/src/process_collector.rs
- Monitor Collector: file/src/monitor_collector.rs
- Lifecycle Tracker: file/src/lifecycle.rs
- Broker Manager: file/src/broker_manager.rs
- Collector Registry: file/src/collector_registry.rs
Source note: Migrated from the public repo (spec/procmond/specs/Tech_Plan__Complete_Procmond_Implementation.md) on 2026-04-17. The repo copy has been removed.