ADR-0006 - Detection Query Execution (redb + DataFusion)
Type: External
Status: Published
Created: Apr 18, 2026
Updated: Apr 27, 2026
Updated by: Dosu Bot

Status#

Accepted (2026-04-27) — adopted as the Phase 2 executor strategy for v0.3.0 (Detection Engine, aspirational target 2026-10-31). Implementation lives in END-269 (SQL-Based Detection Engine epic).

Status history#

  • 2026-04-18 — Proposed (during open-core hygiene pass; backfilled to Nygard-style template)
  • 2026-04-27 — Accepted. v0.3.0 needs an executor; DataFusion is the obvious choice; the spec/dependency analysis is already done. The current execute_sql_query placeholder in daemoneye-lib/src/detection.rs is replaced as part of v0.3.0 delivery. Supersedes spec/daemon_eye_spec_sql_to_ipc_detection_architecture.md §9.1–§9.4.

Context#

DaemonEye detection rules are authored in a custom SQL-like dialect that is parsed and split at rule-compile time into (a) protobuf collection tasks for procmond/collectors and (b) a derived standard SQL query that runs against the collected results in redb. The Phase 2 executor (execute_sql_query in daemoneye-lib/src/detection.rs) is currently a placeholder using pattern matching.
redb is a pure-Rust embedded key-value store. It is the right substrate for the audit ledger and for per-collector event storage, but it does not execute SQL. The detection pipeline (spec §3) has two phases:

  1. Phase 1 (compile-time): sqlparser analyzes the rule AST. The compiler emits (a) protobuf IPC collection tasks describing what procmond and other collectors must gather, and (b) a derived standard-SQL query to run against the collected results.
  2. Phase 2 (runtime): Collectors publish data to redb tables. The derived SQL query executes against those tables and emits alerts.
The dialect reference (docs/src/technical/sql-dialect-reference.md) and query pipeline doc (docs/src/technical/query-pipeline.md) already commit to a real execution surface: JOIN, GROUP BY/HAVING, COUNT/SUM/AVG/MIN/MAX, LENGTH/SUBSTR/INSTR/LIKE, HEX/UNHEX. Hand-rolling an executor for this surface is a multi-thousand-line undertaking with its own bug surface.
Spec §4.10 introduces two dialect extensions: AUTO JOIN … WHEN … and implicit correlation across collector domains (processes.snapshots, network.connections, fs.events, memory.analysis_results, pe.analysis_results). These are compile-time directives:
  • AUTO JOIN x ON … emits an additional collection task for the joined domain (with trigger conditions from WHEN), and lowers to a standard JOIN in the Phase 2 SQL.
  • Implicit correlation references (n.dst_port = 4444 where n is not in FROM) are resolved by the compiler into explicit JOINs plus the collection tasks that guarantee the data will be present.
After lowering, Phase 2 is standard SQL. The dialect extensions do not need executor support; they need compiler support.
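The lowering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DaemonEye compiler: the `Rule`, `AutoJoin`, `CollectionTask`, and `Lowered` types, the `lower` function, and the column names used in the usage note are all hypothetical, and `CollectionTask` merely stands in for the protobuf message. The sketch shows each AUTO JOIN producing one extra collection task (carrying the WHEN trigger) and one plain JOIN in the derived SQL.

```rust
/// Hypothetical model of an AUTO JOIN clause after parsing.
#[derive(Debug, Clone, PartialEq)]
pub struct AutoJoin {
    pub domain: String,       // joined collector domain, e.g. "network.connections"
    pub alias: String,        // alias used in the rule, e.g. "n"
    pub on: String,           // join condition, kept as raw SQL text
    pub when: Option<String>, // trigger condition for the extra collection task
}

/// Hypothetical, heavily simplified rule representation.
#[derive(Debug, Clone, PartialEq)]
pub struct Rule {
    pub select: String,
    pub from_domain: String,
    pub from_alias: String,
    pub auto_joins: Vec<AutoJoin>,
    pub filter: Option<String>,
}

/// Stand-in for the protobuf collection task sent to a collector.
#[derive(Debug, Clone, PartialEq)]
pub struct CollectionTask {
    pub domain: String,
    pub trigger: Option<String>,
}

/// Output of lowering: what to collect, plus standard SQL for Phase 2.
#[derive(Debug, Clone, PartialEq)]
pub struct Lowered {
    pub tasks: Vec<CollectionTask>,
    pub sql: String,
}

/// Lower a rule: every AUTO JOIN becomes a collection task and a plain JOIN.
pub fn lower(rule: &Rule) -> Lowered {
    // The FROM domain is always collected, with no trigger condition.
    let mut tasks = vec![CollectionTask {
        domain: rule.from_domain.clone(),
        trigger: None,
    }];
    let mut sql = format!(
        "SELECT {} FROM {} AS {}",
        rule.select, rule.from_domain, rule.from_alias
    );
    for join in &rule.auto_joins {
        // The WHEN condition travels with the task, not with the SQL.
        tasks.push(CollectionTask {
            domain: join.domain.clone(),
            trigger: join.when.clone(),
        });
        sql.push_str(&format!(
            " JOIN {} AS {} ON {}",
            join.domain, join.alias, join.on
        ));
    }
    if let Some(filter) = &rule.filter {
        sql.push_str(&format!(" WHERE {}", filter));
    }
    Lowered { tasks, sql }
}
```

For a rule over processes.snapshots with one AUTO JOIN on network.connections, `lower` returns two collection tasks and a derived query of the form `SELECT … FROM processes.snapshots AS p JOIN network.connections AS n ON …`. Nothing dialect-specific remains in the SQL, which is the property the executor relies on.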
Spec §4.1 describes a collector schema contract: each collector publishes its own schema (processes.*, network.*, fs.*, memory.*, pe.*) at startup and on change. This is a federated query model (multiple heterogeneous sources under a single namespace), not a single-table event store.
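To make the schema contract concrete, here is a small sketch of a catalog that collectors publish into at startup and on change. Everything in it is an assumption made for illustration: the `Catalog` and `TableSchema` types, the version counter, and the string-typed columns are hypothetical and do not reflect the spec's actual wire format.

```rust
use std::collections::BTreeMap;

/// Hypothetical schema descriptor published by a collector.
#[derive(Debug, Clone, PartialEq)]
pub struct TableSchema {
    pub version: u32,                   // assumed monotonically increasing
    pub columns: Vec<(String, String)>, // (column name, type name)
}

/// Single namespace over all collector domains ("processes.*", "network.*", ...).
#[derive(Debug, Default)]
pub struct Catalog {
    tables: BTreeMap<String, TableSchema>,
}

impl Catalog {
    /// Publish a schema at startup or on change.
    /// Returns true if the catalog was updated, false if the update was stale.
    pub fn publish(&mut self, table: &str, schema: TableSchema) -> bool {
        let stale = self
            .tables
            .get(table)
            .map_or(false, |existing| existing.version >= schema.version);
        if stale {
            return false;
        }
        self.tables.insert(table.to_string(), schema);
        true
    }

    /// Look up a fully qualified table such as "network.connections".
    pub fn resolve(&self, table: &str) -> Option<&TableSchema> {
        self.tables.get(table)
    }

    /// List the tables a single domain currently exposes.
    pub fn tables_in_domain(&self, domain: &str) -> Vec<&str> {
        let prefix = format!("{}.", domain);
        self.tables
            .keys()
            .filter(|name| name.starts_with(&prefix))
            .map(|name| name.as_str())
            .collect()
    }
}
```

The point of the sketch is the shape of the model: each domain owns its tables, schemas can change while the daemon runs, and rule compilation resolves names against one shared namespace.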
Hard constraint: unsafe_code = "forbid" at the workspace level. The spirit of the rule, and EvilBit Labs' posture, is that C/C++ library wrappers (rusqlite, libsql, duckdb-rs, rocksdb) are undesirable even where transitive unsafe is technically allowed.

Decision#

Storage. redb is retained as the persistence substrate for all collector-produced data and the audit ledger. Collected events are stored in per-domain tables under collector-owned namespaces. The audit ledger stays on redb with BLAKE3 hash chains and Merkle inclusion proofs (unchanged from ADR-0001). redb 4.0 (April 2026) is actively maintained, has a declared-stable file format, and is the most mature pure-Rust ACID store.
Phase 2 execution. Apache DataFusion is adopted as the SQL execution engine for the derived Phase 2 query. Each collector domain registers a custom TableProvider with the DataFusion SessionContext. Providers wrap redb tables (and may also wrap in-memory caches or streaming windows for reactive data per spec §4.9). The provider is responsible for pushing filters, projections, and limits into redb secondary-index scans.
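The pushdown responsibility can be illustrated with a std-only model. This sketch deliberately does not use the DataFusion or redb APIs: a `BTreeMap` keyed by `(dst_port, sequence)` stands in for a redb secondary-index table, and the `ConnectionsProvider`, `ConnRow`, and `Filter` names are hypothetical. In DataFusion itself the corresponding hooks are `TableProvider::scan` and `supports_filters_pushdown`. Projection pushdown is omitted for brevity.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
pub struct ConnRow {
    pub pid: u32,
    pub dst_port: u16,
    pub dst_addr: String,
}

/// Filters this sketch understands.
#[derive(Debug, Clone, PartialEq)]
pub enum Filter {
    PortBetween(u16, u16), // served by the index as a range scan
    AddrEquals(String),    // not indexed; handed back to the engine
}

/// Stand-in for a provider over a port-ordered secondary index.
pub struct ConnectionsProvider {
    by_port: BTreeMap<(u16, u64), ConnRow>,
    next_seq: u64,
}

impl ConnectionsProvider {
    pub fn new() -> Self {
        ConnectionsProvider { by_port: BTreeMap::new(), next_seq: 0 }
    }

    pub fn insert(&mut self, row: ConnRow) {
        // The sequence number keeps rows with the same port distinct.
        self.by_port.insert((row.dst_port, self.next_seq), row);
        self.next_seq += 1;
    }

    /// Returns matching rows plus the filters the caller must still apply.
    pub fn scan(&self, filters: &[Filter], limit: Option<usize>) -> (Vec<ConnRow>, Vec<Filter>) {
        let (mut lo, mut hi) = (u16::MIN, u16::MAX);
        let mut residual = Vec::new();
        for filter in filters {
            match filter {
                // Intersect all port ranges into a single index range.
                Filter::PortBetween(a, b) => {
                    lo = lo.max(*a);
                    hi = hi.min(*b);
                }
                other => residual.push(other.clone()),
            }
        }
        if lo > hi {
            // Contradictory ranges: nothing can match, so skip the scan.
            return (Vec::new(), residual);
        }
        let range = self
            .by_port
            .range((lo, u64::MIN)..=(hi, u64::MAX))
            .map(|(_, row)| row.clone());
        // The limit is only safe to apply here when no residual filter remains.
        let rows = match limit {
            Some(n) if residual.is_empty() => range.take(n).collect(),
            _ => range.collect(),
        };
        (rows, residual)
    }
}
```

One design point the sketch makes explicit: a limit can only be pushed into the scan when every filter was handled exactly by the index. Otherwise the engine must apply the remaining filters first and the limit afterwards.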
The SessionContext is constructed read-only, giving natural SELECT-only enforcement at the executor layer (in addition to the existing compile-time validator). The compiler constrains its output to a known DataFusion-compatible SQL subset (filters, projections, JOIN, GROUP BY/HAVING, the function whitelist in the dialect reference). This keeps plans predictable and cacheable.
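The "known subset" constraint lends itself to a mechanical check on the compiler's output. The sketch below is one hypothetical way to express it using only the standard library: it verifies that the derived query starts with SELECT and that every function call is on the whitelist taken from the list in the Context section. The function names `called_functions` and `check_derived_sql` and the keyword list are assumptions for illustration. A production check would more plausibly walk the sqlparser AST than scan text.

```rust
/// Functions the dialect reference commits to (LIKE is an operator, not a call).
const ALLOWED_FUNCTIONS: &[&str] = &[
    "COUNT", "SUM", "AVG", "MIN", "MAX", "LENGTH", "SUBSTR", "INSTR", "HEX", "UNHEX",
];

/// Keywords that may legitimately precede '(' without being function calls.
const KEYWORDS: &[&str] = &[
    "IN", "AND", "OR", "NOT", "ON", "WHERE", "HAVING", "EXISTS", "FROM", "JOIN",
    "SELECT", "BY", "AS", "LIKE",
];

/// Upper-cased names of identifiers followed by '(' outside string literals.
pub fn called_functions(sql: &str) -> Vec<String> {
    let chars: Vec<char> = sql.chars().collect();
    let mut found = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c == '\'' {
            // Skip a single-quoted literal; '' inside it is an escaped quote.
            i += 1;
            while i < chars.len() {
                if chars[i] == '\'' {
                    if i + 1 < chars.len() && chars[i + 1] == '\'' {
                        i += 2;
                        continue;
                    }
                    break;
                }
                i += 1;
            }
            i += 1; // step past the closing quote
        } else if c.is_ascii_alphabetic() || c == '_' {
            let start = i;
            while i < chars.len() && (chars[i].is_ascii_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            let word = chars[start..i].iter().collect::<String>().to_ascii_uppercase();
            // Look past whitespace for an opening parenthesis.
            let mut j = i;
            while j < chars.len() && chars[j].is_whitespace() {
                j += 1;
            }
            if j < chars.len() && chars[j] == '(' && !KEYWORDS.contains(&word.as_str()) {
                found.push(word);
            }
        } else {
            i += 1;
        }
    }
    found
}

/// Reject derived SQL that is not a SELECT or calls a non-whitelisted function.
pub fn check_derived_sql(sql: &str) -> Result<(), String> {
    let first = sql.split_whitespace().next().unwrap_or("");
    if !first.eq_ignore_ascii_case("SELECT") {
        return Err(format!("only SELECT is allowed, found `{}`", first));
    }
    for name in called_functions(sql) {
        if !ALLOWED_FUNCTIONS.contains(&name.as_str()) {
            return Err(format!("function `{}` is not in the whitelist", name));
        }
    }
    Ok(())
}
```

A check like this would sit in the compiler's test suite, so that a new lowering rule that emits something outside the documented subset fails at build time rather than surprising the planner at runtime.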
Compilation pipeline. Unchanged from the spec. sqlparser continues to produce the AST. The lowering stage (spec §4.10) continues to emit (a) protobuf collection tasks and (b) the derived SQL. The only new requirement is that the derived SQL must be valid DataFusion SQL; this is a constraint on the lowering rules, not an architectural change.
This decision supersedes spec/daemon_eye_spec_sql_to_ipc_detection_architecture.md §9.1–§9.4 ("Why Not a Full RDBMS", "Chosen Approach: Operator Pipeline", "Store Abstraction", "Operator Examples"). §11.5 (Smart Joins), §11.6 (Write-Through & Persistence Semantics), and §11.7 (redb Performance Playbook) remain authoritative — DataFusion sits on top of the redb storage layout those sections describe.

Consequences#

Positive#

  • Phase 2 solved with a mature engine. DataFusion provides JOIN, GROUP BY, aggregates, predicate pushdown, and the dialect's string/encoding functions out of the box.
  • Federated multi-collector model maps cleanly. One TableProvider per collector domain; SessionContext::register_table() is dynamic, matching the virtual catalog contract (spec §4.1).
  • Dialect extensions stay first-class. They live in the compiler, not the executor. The dialect can evolve (future items in spec §4.7 YARA, §4.9 reactive pipeline) without destabilizing the executor.
  • No SQL engine to build or maintain. The executor problem is solved by a widely-adopted Apache project used by InfluxDB 3, Comet, and others.
  • Read-only SessionContext is a natural defense-in-depth layer on top of the existing compile-time validator.
  • Audit ledger untouched. ADR-0001's BLAKE3 chain and Merkle inclusion proof design are unaffected.

Negative#

  • Dependency weight. DataFusion pulls the Arrow stack (~60 crates, ~15 MB added to release binary size). Runtime memory must also be measured against the <100 MB resident memory budget; DataFusion is heap-lean at rest but allocates during query execution.
  • Internal unsafe in Arrow SIMD kernels. Workspace-level forbid(unsafe_code) binds only first-party crates; this is consistent with the existing redb dependency (which also uses internal unsafe). No C/C++ FFI is introduced.
  • Compile-time contract. The lowering stage must emit DataFusion-compatible SQL. Adding dialect extensions requires verifying the lowering produces valid output. This is a testable contract, not a runtime risk.
  • Predictable-plan discipline. To avoid surprise performance cliffs, the compiler should constrain its output to a documented subset of DataFusion SQL rather than emitting arbitrary features.

Neutral#

  • Phase 1 is unchanged. sqlparser, AST analysis, protobuf collection task generation, and the auto-join orchestrator (spec §4.10) all continue unmodified.
  • Storage layout is unchanged. redb tables per collector domain, audit ledger on redb.

Alternatives Considered#

Keep redb-only, write a custom query engine#

The original spec direction was a hand-rolled operator pipeline. Pros: no new dependency; full control over operator semantics; dialect extensions could have first-class operators. Cons: the compiler already produces standard SQL as its lowering target, so writing a custom executor means throwing away a free, battle-tested query engine to re-implement JOIN, GROUP BY, aggregates, predicate pushdown, and string functions by hand. Estimated multi-thousand lines, ongoing maintenance burden, and a bug surface in exactly the security-sensitive path. Rejected — the custom-IR argument only holds if dialect extensions survive to runtime; spec §4.10 explicitly lowers them at compile time.

rusqlite / libsql / duckdb-rs#

Mature SQL engines with excellent optimizers. All wrap C/C++ libraries via FFI, which violates the spirit of the no-unsafe posture. duckdb-rs in particular pulls a large native dependency and complicates cross-compilation for the supported OS matrix (Linux/macOS/Windows × x86_64/ARM64, plus FreeBSD). Rejected on pure-Rust grounds.

Turso (ex-Limbo)#

Pure-Rust SQLite-compatible rewrite. Would collapse storage and execution into one store. As of April 2026 still pre-release (v0.6.0-pre.18), uses deterministic simulation testing instead of a forbid-unsafe posture, and SQLite source-compatibility is irrelevant for a greenfield schema. Rejected for v1.0; revisit at Turso 1.0 (likely late 2026 / 2027) — could collapse the hybrid into a single store if it ships with proven durability guarantees.

sled#

Pure-Rust KV store. Last published release v0.34.7 in September 2021, no 1.0 in 4.5 years, effectively dormant at the release-artifact level with long-standing "1.0 blocker" correctness items in-tree. Rejected. Also not a SQL engine — would leave the Phase 2 problem unsolved.

GlueSQL#

Pure-Rust SQL engine with pluggable storage (including a redb backend). Would let Phase 2 run on top of redb directly. It has a smaller community, a less mature optimizer and planner than DataFusion, and fewer deployed users at scale, and its TableProvider-equivalent plumbing is less documented. Rejected: strictly dominated by DataFusion on maturity, ecosystem, and planner quality, with no compensating advantage.

Polars / polars-sql#

Excellent columnar analytics library, pure Rust. polars-sql is explicitly "not for external use" per upstream. Polars is a dataframe library, not an event-store query engine; persistence and incremental ingest are the caller's problem. Wrong shape. Rejected.

Hybrid: DataFusion over Parquet segments instead of redb#

Columnar scans are faster for long-range forensic queries. Adds segment rotation, retention management, and a second write path. Premature without evidence that the redb hot-window path is insufficient. Deferred — DataFusion's TableProvider abstraction makes this additive; rotated Parquet segments can be added later without changing the executor or the dialect.
