SQL-to-IPC Detection Architecture
Type: External
Status: Published
Created: Mar 8, 2026
Updated: Apr 3, 2026
Updated by: Dosu Bot

AI summary
This document, "DaemonEye Spec -- SQL-to-IPC Detection Architecture," defines the model for translating SQL detection rules into instructions for collectors to produce security alerts.
Key Concepts:

  • Purpose: To standardize the SQL to IPC to storage to Alert path, allowing collectors (e.g., procmond) and the daemoneye-agent to evolve independently.
  • Mental Model (Virtual Database): Collectors stream event data as "records" via IPC into virtual schemas (logical, read-only tables). The agent exposes a global catalog of these tables.
  • Detection Flow: The Agent loads SQL rules, translates them into DetectionTasks for Collectors with pushdown optimization, stores events, and generates alerts.
  • Specialty Collectors: YARA integration, supplemental rule data for eBPF/ETW/EndpointSecurity.
  • Dynamic Reactive Pipeline: Non-DAG reactive system where SQL JOINs trigger cascading analysis.
  • Performance Bounds for Regex: Strict bounds including 10ms per-pattern latency, 100ms compilation timeout, 1MB memory limit per pattern.

1) Purpose & Non-Goals#

Purpose. Define a precise model for how SQL rules are translated into collector instructions and evaluated to produce alerts. This spec standardizes the SQL to IPC to storage to Alert path so collectors and the agent evolve independently without breaking rules.

Non-Goals#

  • Building a general RDBMS. The agent executes a constrained, read-only SQL dialect for detections.
  • Requiring kernel hooks. Baseline collectors may be user-mode (e.g., sysinfo), with advanced sources added by tier (eBPF/ETW/EndpointSecurity).

2) Mental Model (Virtual Database)#

Each collector contributes a virtual schema---logical, read-only tables backed by event streams. The agent exposes a global catalog (namespaced) used by rules. Examples:

  • processes.* (e.g., processes.snapshots, processes.events.exec)
  • network.* (e.g., network.connections)
  • kernel.* (e.g., kernel.image_loads)

Collectors "insert" records into these tables by streaming IPC messages to the agent. The agent persists events in a lightweight embedded store and runs validated SQL rules over the persisted store only when the entire rule cannot be pushed down to the collector.

Audit Trail: The collector is responsible for logging matches in the append-only audit log using the collector-core framework, independent of the agent's further analysis.

Key Architectural Principles#

  • Logical Views: SQL rules are treated as logical views, not materialized tables
  • Pushdown Optimization: Simple predicates and projections are pushed to collectors
  • Operator Pipeline: Complex operations (joins, aggregations) execute in the agent's operator pipeline
  • Bounded Memory: All operations use cardinality caps and time windows to prevent unbounded growth

3) End-to-End Flow#

The end-to-end flow operates on a host where a Collector (procmond or others) communicates via IPC (protobuf + framing) to the daemoneye-agent. The agent persists data and evaluates rules using the Operator Pipeline over redb, then sends alerts to Alert Sinks (stdout/syslog/webhook). Rule Packs / SQL Rules feed into the agent.

Two Layers#

  1. Stream Layer (IPC) --- Low-latency filtering; collectors push only the needed records.
  2. Operator Pipeline Layer --- SQL parsed into logical plans, mapped onto scan/query operators over redb. This enables joins, grouping, time windows, and dedupe.

4) Query Lifecycle & Pushdown#

The query lifecycle proceeds as follows:

  1. Operator/Rule Pack loads/enables SQL rules
  2. Agent parses SQL to AST, validates dialect and safety
  3. Agent plans query against virtual catalog
  4. Agent derives DetectionTasks with pushdown for Collector
  5. Collector streams matching records via IPC
  6. Agent appends events to tables and executes operator pipeline
  7. Agent generates alerts with dedupe and metadata

Pushdown Rules#

Agent derives from the AST:

  • Tables/aliases - which collectors to engage
  • Columns - projection list for each collector
  • Predicates - pre-filter on collector side when feasible
  • Windows/Aggregations/Joins - kept for the agent's operator pipeline

If a collector cannot push down a predicate, it sends its minimal projection and the agent filters on ingest.
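As a sketch, the derivation above amounts to partitioning a rule's WHERE conjuncts against the operator set the collector advertised in its descriptor. The types and function names here are illustrative, not the real planner API:

```rust
use std::collections::HashSet;

/// One conjunct of a rule's WHERE clause, already extracted from the AST.
/// (Illustrative shape; the real planner works on sqlparser AST nodes.)
#[derive(Debug, Clone, PartialEq)]
struct Predicate {
    column: String,
    op: String,
    value: String,
}

/// Split predicates into those the collector advertised support for
/// (shipped in the DetectionTask) and the residue the agent must
/// evaluate itself on ingest.
fn split_pushdown(
    preds: &[Predicate],
    supported_ops: &HashSet<&str>,
) -> (Vec<Predicate>, Vec<Predicate>) {
    preds
        .iter()
        .cloned()
        .partition(|p| supported_ops.contains(p.op.as_str()))
}
```

The advertised capability set acts as the upper bound: anything outside it falls through to agent-side filtering rather than failing the rule.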

4.1) Collector Schema Contract (Virtual Catalog)#

Purpose#

Standardize how collectors announce their tables, columns, and pushdown capabilities so the agent can plan rules deterministically.

Descriptor (sent at startup & on change)#

{
  "collector_id": "<name>@<platform-arch>",
  "version": "semver",
  "tables": [
    {
      "name": "<domain>.<table>",
      "columns": { "col": "type[?]", "collection_time": "datetime" },
      "keys": {
        "primary": ["collection_time", "seq"],
        "secondary": ["pid", "ppid", "name"]
      },
      "join_hints": [
        { "left": "ppid", "right_table": "processes.snapshots", "right": "pid", "relation": "parent_of" }
      ],
      "pushdown": {
        "predicates": ["=", "!=", ">", "<", ">=", "<=", "IN", "LIKE", "REGEXP"],
        "project": true,
        "order_by": ["collection_time"],
        "limits": { "max_rate": 200000 }
      }
    }
  ]
}

Agent Behavior#

  • Validates schema, registers/updates global catalog
  • Unknown optional columns are ignored; missing required columns reject the table
  • Capabilities become the upper bound for pushdown during planning

Evolution#

  • Additive columns allowed anytime
  • Breaking changes require version bump and a deprecation grace period

Health & Metrics#

Collectors MUST expose: send rate, drops, queue depth, last ACK seq_no, and schema version.
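A minimal sketch of that heartbeat payload, with a trivial health gate the agent might apply before trusting a stream. Field names follow the bullet above; the wire format is protobuf (not shown), and the queue-depth threshold is an illustrative assumption:

```rust
/// Health fields every collector must expose in its heartbeat.
struct CollectorHealth {
    send_rate_per_sec: u64,
    drops: u64,
    queue_depth: u64,
    last_ack_seq_no: u64,
    schema_version: String,
}

impl CollectorHealth {
    /// Illustrative degradation check: any drop, or a queue deeper than
    /// the configured bound, marks the stream as degraded.
    fn is_degraded(&self, max_queue_depth: u64) -> bool {
        self.drops > 0 || self.queue_depth > max_queue_depth
    }
}
```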

4.2) Minimum Contracts by Collector Type#

Processes Collector (procmond)#

Table: processes.snapshots
Required columns: pid, ppid?, name, executable_path?, command_line?, start_time?, executable_hash?, collection_time
Secondary keys (min): pid, ppid, name, executable_hash
Join hints: ppid -> processes.snapshots.pid (parent_of)
Pushdown (min): =, !=, IN, LIKE (including % wildcards), REGEXP, projection; optional > >= < <= on numeric/time

Network Collector#

Table: network.connections
Required columns: pid?, proto(tcp,udp), src_ip, src_port, dst_ip, dst_port, state?, bytes_sent?, bytes_recv?, collection_time

Image/Module Load Collector#

Table: kernel.image_loads
Required columns: pid?, image_path, image_hash?, collection_time

File System Collector (optional baseline)#

Table: fs.events
Required columns: pid?, op(create,write,read,delete,rename), path, size?, collection_time

Auth/Security Events Collector#

Table: security.auth
Required columns: user, host?, event(logon,logoff,lock,unlock,auth_fail), status(success,fail)?, pid?, collection_time

General Requirements (All Collectors)#

  • Clock: timestamps MUST be monotonic per host; include clock_skew_ms in heartbeat
  • Sequence: seq_no strictly increasing per task_id
  • Types: adhere to documented scalar types; ? denotes optional
  • Backpressure: respect agent credits; cap local buffers; report drops with counters
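The backpressure requirement can be sketched as a credit window on the collector's send path: the agent grants credits, each StreamRecord consumes one, and sends are refused (and counted as drops) once credits run out rather than buffering without bound. Names are illustrative, not the collector-core API:

```rust
/// Credit-based flow-control window for a collector's send path.
struct SendWindow {
    credits: u64,
    drops: u64,
}

impl SendWindow {
    fn new() -> Self {
        Self { credits: 0, drops: 0 }
    }

    /// Agent grants `n` more sends.
    fn grant(&mut self, n: u64) {
        self.credits += n;
    }

    /// Consume one credit per record; refuse and count a drop when exhausted.
    fn try_send(&mut self) -> bool {
        if self.credits > 0 {
            self.credits -= 1;
            true
        } else {
            self.drops += 1;
            false
        }
    }
}
```

The drop counter feeds directly into the health metrics collectors must expose.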

Regex Implementation Requirements#

All collectors MUST implement regex pattern matching with:

  • Standard regex engine (e.g., Rust regex crate, RE2, PCRE)
  • Case-sensitive and case-insensitive matching via (?i) flag
  • Configurable per-pattern latency bounds
  • Bounded compilation and execution to prevent DoS
  • Compilation caching with LRU eviction
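The compilation cache in the last bullet can be sketched as a capacity-bounded map with least-recently-used eviction. The capacity mirrors the `cache_size: 1000` default below; the stubbed "compiled" string stands in for a real compiled regex, and none of these names are the real API:

```rust
use std::collections::{HashMap, VecDeque};

/// Pattern-compilation cache with LRU eviction (sketch).
struct PatternCache {
    cap: usize,
    map: HashMap<String, String>, // pattern -> compiled form (stubbed)
    order: VecDeque<String>,      // least-recently-used at the front
}

impl PatternCache {
    fn new(cap: usize) -> Self {
        Self { cap, map: HashMap::new(), order: VecDeque::new() }
    }

    fn get_or_compile(&mut self, pattern: &str) -> &str {
        if self.map.contains_key(pattern) {
            // Cache hit: refresh recency.
            self.order.retain(|p| p.as_str() != pattern);
        } else {
            // Cache miss: evict the least-recently-used entry at capacity.
            if self.map.len() >= self.cap {
                if let Some(oldest) = self.order.pop_front() {
                    self.map.remove(&oldest);
                }
            }
            self.map
                .insert(pattern.to_string(), format!("compiled({})", pattern));
        }
        self.order.push_back(pattern.to_string());
        self.map.get(pattern).unwrap().as_str()
    }
}
```

In a real collector the compile step would also enforce the compilation timeout and per-pattern memory limit from the configuration below.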

Configurable Performance Bounds#

Default Configuration:

  • Per-Pattern Latency: 10ms
  • Compilation Timeout: 100ms
  • Memory Limit: 1MB per pattern

detection:
  regex:
    max_pattern_latency_ms: 10
    compilation_timeout_ms: 100
    max_memory_per_pattern: 1048576
    enable_precompilation: true
    cache_size: 1000
    monitoring:
      enable_metrics: true
      alert_threshold_ms: 5

Pattern Complexity Guidelines#

Acceptable:

  • Simple patterns: basic literals, character classes, quantifiers
  • Moderate patterns: alternation with limited branches
  • Complex patterns: backreferences, lookahead/lookbehind (with execution limits)

Avoid: exponential-complexity patterns, catastrophic backtracking, excessive alternation, multiple lookaheads.

4.7) Complex Pattern Matching (YARA Integration)#

YARA Integration Architecture#

YARA rules represent a fundamentally different detection paradigm from SQL-based key-value predicates. The architecture handles YARA through the "agent-only operations" principle with specialized collectors.

YARA Engine Collector#

{
  "collector_id": "yara-engine@linux-x86_64",
  "tables": [{
    "name": "yara.scan_results",
    "columns": {
      "file_path": "string", "rule_name": "string",
      "matches": "json", "scan_time": "datetime", "file_hash": "hex256"
    },
    "pushdown": { "predicates": ["=", "!=", "IN"], "project": true }
  }]
}

Hybrid Detection Patterns#

-- Complex rule combining simple filters + YARA results
SELECT p.pid, p.name, f.path, y.rule_name, y.matches
FROM processes.snapshots p
JOIN fs.events f ON f.pid = p.pid
JOIN yara.scan_results y ON y.file_path = f.path
WHERE p.name = 'rundll32.exe'
  AND f.path LIKE '/tmp/%'
  AND y.rule_name = 'suspicious_behavior'
  AND f.collection_time > datetime('now', '-30 minutes');

Performance Considerations#

  • YARA scanning happens per-file in the collector
  • Only scan results (not file content) stream to agent
  • Agent stores lightweight YARA results in redb
  • Collector can implement sampling for high-volume files

4.8) Supplemental Rule Data for Specialty Collectors#

Specialty collectors receive supplemental rule data alongside standard SQL pushdown. This includes YARA rules, eBPF programs, network analysis filters, and platform-specific rule data.

pub enum SupplementalRuleData {
    YaraRules { version: String, rules: Vec<YaraRule>, scan_options: YaraScanOptions },
    NetworkAnalysis { version: String, analysis_type: NetworkAnalysisType, filters: Vec<NetworkFilter>, patterns: Vec<NetworkPattern>, config: NetworkAnalysisConfig },
    PlatformSpecific { platform: String, rule_type: String, data: serde_json::Value },
}

pub trait SpecialtyRuleEngine {
    type RuleData;
    type ExecutionResult;
    async fn load_rules(&mut self, rules: Self::RuleData) -> Result<(), EngineError>;
    async fn execute_rules(&self, data: &[u8]) -> Result<Vec<Self::ExecutionResult>, EngineError>;
    async fn cleanup(&mut self) -> Result<(), EngineError>;
}

4.9) Dynamic Reactive Pipeline (Non-DAG Architecture)#

The SQL-to-IPC detection engine operates as a reactive pipeline where:

  • Initial Triggers - Base collectors detect events
  • JOIN-Driven Collection - SQL JOINs trigger on-demand collection of the other side of the join
  • Cascading Analysis - Results from one collector trigger additional analysis
  • Dynamic JOINs - Agent orchestrates cross-collector correlation
  • Feedback Loops - Results can trigger deeper analysis of the same data

Agent Orchestration Logic#

pub struct ReactiveOrchestrator {
    collectors: HashMap<String, Box<dyn Collector>>,
    trigger_rules: Vec<TriggerRule>,
    analysis_queue: VecDeque<AnalysisTask>,
}

pub enum AnalysisType {
    PeAnalysis { file_path: String },
    YaraScanning { file_path: String, rules: Vec<String> },
    MemoryAnalysis { pid: u32, regions: Vec<MemoryRegion> },
    NetworkDeepPacket { connection_id: String },
}

4.10) Automatic JOIN Triggers (Implicit Correlation)#

SQL rules can specify automatic JOIN triggers using special syntax:

-- Automatic network socket collection for processes
SELECT p.pid, p.name, n.dst_ip, n.dst_port, n.protocol
FROM processes.snapshots p
AUTO JOIN network.connections n ON n.pid = p.pid
WHERE p.name = 'suspicious.exe'
  AND n.dst_port > 1024;

-- Conditional Auto-JOIN with WHEN
SELECT p.pid, p.name, n.dst_ip, n.dst_port
FROM processes.snapshots p
AUTO JOIN network.connections n ON n.pid = p.pid
  WHEN p.name LIKE '%suspicious%' OR p.cpu_usage > 80.0
WHERE p.name = 'suspicious.exe';

The sqlparser crate is extended with a custom DaemonEye dialect that maintains SQLite compatibility while adding AUTO JOIN and WHEN clause support.

9) Storage & Execution Model#

Why Not a Full RDBMS?#

Embedding a full SQL engine such as SQLite would bring far more machinery than the constrained, read-only detection dialect needs, and does not align with the zero-network design. redb, the chosen store, is a key-value store, not a relational engine, so relational semantics come from the agent itself.

Chosen Approach: Operator Pipeline#

  • Parse SQL using sqlparser (SQLite dialect)
  • Translate AST to internal logical plan to chain of operators (scan, filter, project, join, aggregate)
  • Execute operators directly against redb
  • Support streaming ingestion with append-only tables + time-indexed keys

Smart Joins (No Mandatory Time Window)#

Joins are supported without requiring time windows, using bounded state.

Physical Strategies:

  1. Index Nested-Loop (INLJ) - Default, selective. Build from the smaller/filtered input, probe via secondary index. O(n log m).
  2. Bounded Symmetric Hash Join (SHJ) - For concurrent streams. Hash both sides in bounded LRU maps. Cardinality budget (e.g., max 100k keys).
  3. Materialized Relation Cache (MRC) - Parent/child fast path. Compact map (pid -> {ppid, parent_name, start_time}). Single keyed lookup.

Bounding Without Time Windows:

  • Cardinality caps per join with configurable memory budgets
  • Automatic fallback from SHJ to INLJ when caps are reached
  • Optional spill to a KV-backed scratch area
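The MRC fast path described above can be sketched as a compact pid-keyed map maintained on ingest, so a parent-name predicate resolves with a single keyed lookup instead of a second scan. Struct and field names are illustrative:

```rust
use std::collections::HashMap;

/// Facts cached per child pid for the parent/child fast path.
struct ParentFacts {
    ppid: u32,
    parent_name: String,
}

/// Materialized Relation Cache sketch: pid -> parent facts.
struct Mrc {
    map: HashMap<u32, ParentFacts>,
}

impl Mrc {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }

    /// Updated on every ingested process record.
    fn on_ingest(&mut self, pid: u32, ppid: u32, parent_name: &str) {
        self.map
            .insert(pid, ParentFacts { ppid, parent_name: parent_name.to_string() });
    }

    /// One keyed lookup replaces a join probe for parent-name predicates.
    fn parent_name(&self, pid: u32) -> Option<&str> {
        self.map.get(&pid).map(|f| f.parent_name.as_str())
    }
}
```

A production version would also bound the map (the 30m `mrc_window` tunable below) and rebuild it from recent events after a crash.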

Write-Through & Persistence Semantics#

  • Ingest Transaction: On every IPC StreamRecord: validate/normalize, append to the redb base table, update secondary indexes, ACK to the collector after commit.
  • No-Join Path: Evaluate the predicate on the ingested row, compute the dedupe key, emit the alert immediately.
  • Join Path: Execute the join strategy over redb plus in-memory state, dedupe, and emit the alert.
  • Aggregation Path: Maintain in-memory hash aggregates, flush snapshots for recovery, emit an alert when HAVING is satisfied.
  • Crash Safety: The ingest commit is atomic; aggregation state is restored from snapshots; the MRC is rebuilt from recent events.
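The rule-scoped dedupe key mentioned above can be sketched as a hash over the rule id plus the fields that identify "the same finding", bucketed by time so repeats inside a window collapse into one alert. The field choice and window semantics here are assumptions for illustration:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Rule-scoped dedupe key: identical findings inside the same time bucket
/// hash to the same key, so only one alert is emitted per bucket.
fn dedupe_key(rule_id: &str, pid: u32, name: &str, ts_ms: u64, window_ms: u64) -> u64 {
    let mut h = DefaultHasher::new();
    // Integer division buckets timestamps into fixed windows.
    (rule_id, pid, name, ts_ms / window_ms).hash(&mut h);
    h.finish()
}
```

Scoping the key by rule id is what prevents one noisy rule's dedupe state from suppressing a different rule's alerts on the same process.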

Materialization Policy#

Core Principle: Treat every rule as a logical view. Store the parts (base tables + minimal derived state), not the fully realized join results.
What We Persist:

  • Base event tables per collector plus secondary indexes
  • Alert log with rule ID, dedupe key, pointers to source rows
  • Operator state (aggregation snapshots, MRC)
    What We Do Not Persist: Materialized join outputs (computed on-demand, used for alerts, then discarded).

redb Performance Playbook#

Physical Layout:

  • Partitions: One base table per source per time bucket (daily/hourly)
  • Primary Key: (ts_ms: u64, seq: u32) padded to 16 bytes, strictly increasing
  • Secondary Indexes: one per advertised secondary key (e.g., pid, ppid, name, and a hash index over executable_hash)

Query Planning: Resolve time first (select partitions), choose the driving index by selectivity, then intersect posting lists set-wise.

Writer Architecture: Single writer thread with group commit (N=2,000 records or T=5-10ms). Pipeline: IPC -> lock-free MPSC -> ring buffer -> writer. MVCC snapshots serve readers.

Key Encoding:
    Key Encoding:
#[repr(C, packed)]
struct RowKey { ts_ms: u64, seq: u32, pad: u32 }

#[repr(C, packed)]
struct NameIdxKey { name_hash128_hi: u64, name_hash128_lo: u64, ts_ms: u64, seq: u32 }
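For keys like RowKey above to scan in time order inside a byte-ordered store, the fields must be serialized big-endian, most-significant field first. A sketch of that encoding (the exact layout, including the trailing pad, is an assumption mirroring the 16-byte RowKey):

```rust
/// Encode (ts_ms, seq) as a 16-byte key whose raw-byte ordering matches
/// the logical (ts_ms, seq) ordering: big-endian, ts_ms first.
fn row_key_bytes(ts_ms: u64, seq: u32) -> [u8; 16] {
    let mut k = [0u8; 16];
    k[..8].copy_from_slice(&ts_ms.to_be_bytes());
    k[8..12].copy_from_slice(&seq.to_be_bytes());
    // Last 4 bytes stay zero, matching RowKey's padded 16-byte layout.
    k
}
```

Little-endian encoding would break range scans, since byte order would no longer agree with numeric order.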

Tunables (Sane Defaults):

  • partition_kind = hourly if > ~2M events/day; else daily
  • writer_batch_records = 2,000, writer_batch_ms = 7
  • posting_cache_bytes = 128MiB
  • mrc_window = 30m
  • join_cardinality_cap = 100k keys, rows_per_key_cap = 64

Rule Lifecycle & Management#

  • Rule Packs: Collections of related SQL rules with metadata
  • Hot Reloading: Rules can be added/removed/updated without agent restart
  • Rule States: Draft -> Active -> Disabled -> Archived
  • Versioning: Rule changes create new versions with deprecation periods

Error Handling & Recovery#

  • Connection Loss: Agent marks tables as unavailable
  • Schema Mismatch: Version incompatibility handling
  • Rate Limiting: Backpressure triggers
  • Automatic Reconnection: Exponential backoff with jitter

Monitoring & Observability#

  • Rule execution latency, match rates, error counts
  • IPC message rates, queue depths
  • redb write latency, index maintenance, compaction stats
  • Health, metrics, debug, and catalog endpoints

11) IPC Message Contracts#

DetectionTask (Agent -> Collector)#

{
  "task_id": "t_9482",
  "rule_id": "r_process_name_exact",
  "table": "processes.snapshots",
  "project": ["pid", "name", "executable_path", "cpu_usage", "collection_time"],
  "where": { "and": [{ "eq": ["name", "minidump.exe"] }, { "gt": ["cpu_usage", 0.0] }] },
  "rate_limit": { "per_sec": 25000 },
  "ttl_ms": 600000
}

StreamRecord (Collector -> Agent)#

Envelope: { seq_no, task_id, table, checksum, record }. Record is a typed row matching the advertised schema.

Persistence & Query (Agent-Side)#

  • Write path: append-only into redb tables with periodic checkpoints
  • Read path: prepared, read-only statements with timeouts and memory quotas
  • Dedupe: rule-scoped dedupe key to avoid alert storms

12) SQL Dialect (Constrained)#

We do not embed SQLite. We only use the SQLite dialect for rule authoring and parsing. The agent parses rules with sqlparser, pushes down projections/predicates to collectors, and executes remaining logic against the operator pipeline over redb.

  • Allowed Statements: SELECT only
  • Banned Functions: load_extension, readfile, system, random, printf
  • Allowed Functions: Basic aggregations, string ops, date/time helpers
  • Security: AST validation enforces constraints before execution
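As a toy illustration of that validation gate: SELECT-only plus a deny list of the banned functions above. The real agent validates the parsed AST via sqlparser, not raw text; this sketch only shows the policy, not the mechanism:

```rust
/// Toy rule gate: accept only SELECT statements that call none of the
/// banned functions. (String matching stands in for AST validation.)
fn rule_allowed(sql: &str) -> bool {
    let lower = sql.trim_start().to_lowercase();
    let banned = ["load_extension", "readfile", "system", "random", "printf"];
    lower.starts_with("select")
        && !banned.iter().any(|b| lower.contains(&format!("{}(", b)))
}
```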

13) Examples#

-- Simple Name Match (pushdown heavy)
SELECT pid, name, executable_path, collection_time
FROM processes.snapshots WHERE name = 'minidump.exe';

-- Regex Pattern Matching
SELECT pid, name, executable_path, collection_time
FROM processes.snapshots WHERE name REGEXP '(?i)malware|trojan|virus|backdoor';

-- Parent/Child Relationship (join in operator pipeline)
SELECT c.pid, c.name, p.pid AS ppid, p.name AS parent
FROM processes.snapshots c
LEFT JOIN processes.snapshots p ON p.pid = c.ppid
WHERE c.name = 'rundll32.exe' AND p.name = 'winword.exe';

-- Aggregation / Rare Events
SELECT name, COUNT(*) AS launches
FROM processes.snapshots
WHERE collection_time BETWEEN :t_start AND :t_end
GROUP BY name HAVING launches > 50;

14) Performance & Backpressure#

  • Targets: <100ms/rule eval, >1k records/sec sustained write per agent
  • Channels: bounded MPSC with credit-based flow control
  • Rate Caps: agent can issue rate_limit hints per task
  • Overload Policy: drop the oldest buffered records per task before applying global backoff

15) Reliability & Security#

  • Framing: length-delimited protobuf, CRC32, monotonically increasing seq_no
  • AuthZ: per-collector identity; table-level ACLs in the catalog
  • Sandbox: read-only query engine, allow-listed functions, strict limits
  • Tamper-evidence: audit ledger chained with cryptographic integrity

16) Testing & Validation#

  • AST fuzzing of SQL parser/validator and pushdown planner
  • Golden tests: SQL to DetectionTask JSON snapshots
  • Soak tests: synthetic high-rate streams with backpressure assertions
  • E2E: collector to agent to redb to alert sinks

17) Versioning & Compatibility#

  • Contracts: catalog@vN, detection_task@vN, stream_record@vN
  • Forward compat: agent tolerates unknown optional fields
  • Deprecation: announce in catalog.changes feed

18) Open Items#

  • Implement operator pipeline with redb as backend
  • Optimize join and aggregation strategies
  • Sliding/windowed aggregations helper primitives
  • Query plan caching and statistics collection

19) TL;DR (Operator-Facing)#

Collectors act like table providers. The agent turns SQL rules into simple tasks so collectors only ship what matters. The agent then compiles SQL into an operator pipeline over redb for joins/aggregations, and emits deduped alerts.
Key Benefits:

  • Performance: Pushdown reduces network traffic; operator pipeline handles complex logic efficiently
  • Scalability: Bounded memory usage prevents resource exhaustion
  • Flexibility: SQL-based rules are familiar to security operators
  • Reliability: ACK-after-commit semantics ensure data consistency
  • Debugging: Alert records contain pointers to reconstruct join results