Functional Requirements Specification - DBSurveyor#

Lead Section#

The Functional Requirements Specification for DBSurveyor defines 24 comprehensive requirements (F000-F023) organized into seven distinct categories that govern the design, implementation, and operation of a security-first, dual-binary database surveying toolchain. These requirements specify the complete operational envelope for surveying unknown database systems while maintaining strict security, privacy, and offline-operation guarantees.

DBSurveyor operates as a two-part toolchain with separate collector (dbsurveyor-collect) and postprocessor (dbsurveyor) binaries, enabling flexible workflows where data collection occurs in secure environments and documentation generation happens offline. The requirements encompass all aspects of functionality from system architecture and authentication through metadata collection, data profiling, core processing, reporting, and advanced Pro features.

The specification prioritizes critical security guarantees including offline-only operation, zero telemetry, no credentials in outputs, and airgap compatibility, making it suitable for red/blue team operators, DBAs, data analysts, developers, security analysts, and compliance officers working with sensitive database environments.

System Architecture Requirements (F000-F002)#

F000: Dual-Binary Architecture#

F000 mandates a dual-binary architecture with independent collector (dbsurveyor-collect) and postprocessor (dbsurveyor) executables. This separation enables:

Collection operations in secure/restricted environments
Offline documentation generation without database access
Independent versioning and deployment of collection vs. reporting capabilities
Airgap deployment scenarios where collection and documentation occur in different security zones

The architecture is implemented through three main crates: dbsurveyor-core (shared library), dbsurveyor-collect (collection binary), and dbsurveyor (documentation binary).

F001: Multi-Database Connection Support#

F001 requires multi-database connection support via Rust traits for PostgreSQL, MySQL, SQLite, SQL Server, and MongoDB. Implementation includes:

Database adapters implementing a common trait for unified access
Factory pattern providing database-agnostic instantiation
UnifiedDataType mapping for cross-database type normalization
Support for both SQL (relational) and NoSQL (MongoDB) database paradigms
DatabaseAdapter trait includes a sample_table(TableRef, &SamplingConfig) method for per-table data sampling

F002: Feature Flag-Based Driver Selection#

F002 specifies feature flag-based database driver selection to minimize binary size and dependencies. This enables:

Default build including only PostgreSQL and SQLite drivers
Full build with all databases (PostgreSQL, MySQL, SQLite, MongoDB, MSSQL)
Custom builds with only required database support
Reduced attack surface through minimal dependency inclusion

Authentication and Security Requirements (F003-F005)#

F003: Multiple Authentication Methods#

F003 mandates multiple authentication methods including username/password, certificates, and environment variables. This provides flexibility for various operational scenarios while supporting:

Username/password authentication
TLS/SSL certificate-based authentication
Environment variable-based credential passing
Connection string sanitization in all error messages and logs

F004: AES-GCM Encryption#

F004 requires AES-GCM encryption of output files with random nonces and embedded KDF parameters. Implementation details:

F005: Zero Telemetry Operation#

F005 mandates zero telemetry operation with no credential storage in outputs or logs. Security guarantees:

Metadata Collection Requirements (F006-F010)#

F006: Comprehensive Schema Discovery#

F006 requires comprehensive schema discovery and enumeration across all supported database engines. The system collects:

F007: Table Structure Analysis#

F007 mandates table structure analysis including columns, data types, constraints, indexes, and foreign keys. This provides complete structural understanding of each table entity including:

Column names, types, nullability, and default values
Primary key constraints
Unique constraints
Foreign key relationships and referential integrity rules
Index definitions for performance optimization

F008: Database-Specific Object Extraction#

F008 requires database-specific object extraction including views, stored procedures, functions, triggers, and user-defined types. The system captures:

Database procedures and functions
Triggers and custom types
Engine-specific objects that extend base SQL functionality

F009: Incremental Collection with Change Detection#

F009 specifies incremental collection with change detection and delta updates for large databases. This enables efficient re-surveying of large database systems by:

Detecting schema changes since last collection
Updating only modified objects
Reducing collection time for databases with thousands of tables
Minimizing database load during repeated surveys

F010: Connection Pooling and Graceful Degradation#

F010 requires connection pooling, timeout handling, and graceful degradation on partial failures. Implementation includes:

Connection pooling with security-focused defaults: 30s connect timeout, 30s query timeout, max 10 connections, read-only mode
Comprehensive error hierarchy with connection errors, collection errors, configuration errors, encryption errors, and I/O errors
Partial collection success when some objects fail
Detailed error reporting without credential leakage

Data Profiling and Analysis Requirements (F011-F013)#

F011: Statistical Analysis#

F011 requires statistical analysis including row counts, table sizes, and column value distributions. This provides quantitative understanding of:

Table row counts and approximate sizes
Column cardinality (distinct value counts)
Value distribution patterns
Data volume metrics for capacity planning

F012: Data Quality Metrics#

F012 mandates data quality metrics including completeness, consistency, and uniqueness with configurable thresholds. Analysis includes:

Completeness: percentage of non-null values
Consistency: adherence to expected patterns
Uniqueness: duplicate detection
Configurable quality thresholds for automated reporting

F013: Configurable Sample Data Extraction#

F013 requires configurable sample data extraction with privacy controls and pattern-based redaction. Features include:

Optional data samples via pagination and chunk processing
Configurable sample size limits
Pattern-based redaction for sensitive data
Privacy-aware sampling strategies

Core Processing Requirements (F014-F016)#

F014: Portable JSON Output Format#

F014 mandates generation of .dbsurveyor.json output that is fully portable and structured with "format_version": "1.0" specification. The output includes:

Unified data structures including DatabaseSchema with tables, views, indexes, constraints, procedures, functions, triggers, custom types, samples, and collection metadata
Format version for backward compatibility
Complete serialization of discovered schema
Platform-independent JSON encoding
TableSample includes an optional sample_status field of type SampleStatus (enum values: Complete, PartialRetry, Skipped) for tracking sampling outcomes, which is omitted from JSON when not set for backward compatibility

F015: Postprocessor for Report Generation#

F015 requires a postprocessor that loads .dbsurveyor.json and generates markdown/JSON reports. Implementation:

Schema analysis feeds template context to Askama template engine for multiple output formats
Independent operation without database connectivity
Multiple output format support
Template-driven document generation

F016: SQL Reconstruction Mode#

F016 specifies SQL reconstruction mode that outputs CREATE TABLE definitions based on discovered schema. This enables:

SQL DDL script generation
Schema replication in different environments
Documentation of database structure as executable SQL
Database migration assistance

Reporting and Visualization Requirements (F017-F019)#

F017: Report Mode#

F017 mandates report mode that renders a browsable table-of-contents style markdown document. Output includes:

Markdown output
Hierarchical table of contents
Browsable navigation structure
Human-readable schema documentation

F018: Diagram Mode (Pro)#

F018 specifies diagram mode (Pro feature) that generates Mermaid.js or D2 visual schema diagrams. Visualization includes:

Mermaid ERD diagrams
D2 diagram format support
Entity-relationship visualizations
Table relationship mapping

F019: Classify Mode (Pro)#

F019 requires classify mode (Pro feature) that tags likely PII/PCI fields based on regex or naming heuristics. Classification includes:

PII (Personally Identifiable Information) detection
PCI (Payment Card Industry) data identification
Regex-based pattern matching
Naming convention heuristics
Automated compliance reporting

Pro Features and Advanced Capabilities (F020-F023)#

F020: HTML Output (Pro)#

F020 mandates HTML output (Pro feature) with standalone styled HTML report generation with search/filter. Features:

HTML output
Standalone single-file distribution
Built-in search and filtering
Styled, interactive documentation

F021: Offline-Only Operation#

F021 requires that all features must function without network access. This guarantees:

AIRGAP COMPATIBLE: Full functionality in air-gapped environments
Complete separation of collection phase from documentation generation
No external dependencies during operation
Security-critical deployment compatibility

F022: Pluggable Database Engine Support#

F022 specifies support for pluggable DB engines via a Rust trait-based adapter system. The DatabaseAdapter trait defines the interface for all database implementations, including a sample_table(TableRef, &SamplingConfig) method for per-table data sampling. Adapters may be:

Compiled-in via feature flags
Loaded as WASM modules
Executed as stdio child-process plugins using stable JSON contract
OutputGenerator trait with implementations for each format

F023: Additional Advanced Capabilities#

F023 encompasses additional capabilities:

Data Redaction: Allow user to redact sensitive sample values in postprocessor
NoSQL Support: MVP must include at least one NoSQL database with schema-like inspection support (MongoDB)
Throttling: Support configurable throttling rate in collector to reduce detection risk and avoid slow logs
Compression and Encryption: Support optional output compression (e.g., gzip) and encryption (e.g., AES-GCM) via CLI flags

Requirements Priority Levels#

The specification defines three priority levels:

High Priority (F000-F007, F014, F015, F021-F023)#

Core architectural, security, and basic collection requirements essential for MVP:

System architecture foundation
Security guarantees
Basic metadata collection
Core processing and output
Offline operation
Pluggable engine support

Medium Priority (F013, F016-F019)#

Enhanced functionality for practical operational use:

Sample data extraction with privacy
SQL reconstruction
Report and diagram generation
PII/PCI classification

Low Priority (F020, Advanced Diagrams)#

Advanced features for premium/Pro offering:

Standalone HTML reports
Advanced visualization options

Performance Requirements#

The specification includes measurable performance targets:

CLI startup < 100ms
Completion < 10s for databases with < 1000 tables
Streaming and batch processing for memory efficiency with pagination for large tables and chunk processing
Async/await with Tokio runtime enables concurrent operations with connection pooling

Security Best Practices#

The requirements enforce comprehensive security best practices:

Use read-only database users with SELECT permissions only
All database operations are read-only (SELECT/DESCRIBE only, no INSERT, UPDATE, DELETE, or DDL)
All queries use parameterized statements to prevent SQL injection
Use strong encryption passwords (minimum 12 characters, mix of character types)
Set restrictive permissions on encrypted files (chmod 600)
Use secure deletion (shred) when done with sensitive data
TLS/SSL certificate validation is enforced for encrypted connections
Avoid credentials in command history using environment variables

Implementation Architecture#

The requirements are implemented through comprehensive architectural components:

Workspace Structure#

dbsurveyor/ - Main postprocessor binary
dbsurveyor-core/ - Shared library and core modules
dbsurveyor-collect/ - Collection binary
Cargo.toml - Workspace configuration

Core Modules#

adapters - Database-specific implementations
error - Comprehensive error hierarchy
models - Unified data structures
security - Encryption and credential management

Testing Strategy#

Comprehensive test organization:

Unit tests
Integration tests (PostgreSQL, MySQL, SQLite, MongoDB)
Security tests (credential protection, encryption validation, offline operation)
Performance tests

Configuration Management#

Configuration loaded from multiple sources with clear precedence:

Command line arguments (highest priority)
Environment variables
Project configuration (.dbsurveyor.toml)
User configuration (~/.config/dbsurveyor/config.toml)
Default values (lowest priority)

Development Milestones#

The requirements support a phased development approach:

v0.1 (Collector MVP): Core collection functionality (F000-F007, F014)
v0.2 (Postprocessor MVP): Basic reporting (F015-F017)
v0.3 (Pro Features): Advanced capabilities (F018-F023)
v1.0 (Production Release): Complete specification compliance

Relevant Code Files#

The functional requirements are implemented across the following key code files in the DBSurveyor repository:

File Path	Description	Related Requirements
project_specs/requirements.md	Complete functional requirements specification document	F000-F023 (All)
Cargo.toml	Workspace configuration with feature flags for database adapters	F000, F002
dbsurveyor-collect/src/main.rs	Collector binary with CLI, connection testing, and schema collection orchestration	F000, F003, F006-F013
dbsurveyor-collect/Cargo.toml	Feature flags for database drivers (postgresql, mysql, sqlite, mongodb, mssql)	F002
dbsurveyor/src/main.rs	Postprocessor binary with subcommands for generate, analyze, SQL DDL reconstruction, and validation	F000, F015-F020
dbsurveyor-core/src/lib.rs	Core library module organization and public API exports	F000, F001
dbsurveyor-core/Cargo.toml	Feature flags for database adapters, compression, and encryption	F002, F004, F023
dbsurveyor-core/src/models.rs	Unified data structures (DatabaseSchema, Table, Column, UnifiedDataType, Index, etc.)	F001, F006-F008, F014
dbsurveyor-core/src/error.rs	Comprehensive error hierarchy with credential sanitization	F005, F010
dbsurveyor-core/src/adapters/mod.rs	DatabaseAdapter trait definition and create_adapter factory with feature-gated instantiation	F001, F022
dbsurveyor-core/src/adapters/postgres/	PostgreSQL adapter with connection pooling, schema collection, type mapping, sampling, views, routines, triggers	F001, F003, F006-F010
dbsurveyor-core/src/adapters/mysql/	MySQL adapter with connection handling, schema queries, type conversion, sampling	F001, F003, F006-F010
dbsurveyor-core/src/adapters/sqlite/	SQLite adapter with connection management, schema introspection, type mapping, sampling	F001, F003, F006-F010
dbsurveyor-core/src/security/mod.rs	Security module with credential protection and connection string parsing	F003, F005
dbsurveyor-core/src/security/credentials.rs	Zeroizing credential container implementation	F005
dbsurveyor-core/src/security/encryption.rs	AES-GCM-256 encryption with Argon2id key derivation	F004, F023
dbsurveyor-core/src/quality/	Quality analysis module with completeness, consistency, uniqueness, and anomaly detection	F011-F013
dbsurveyor-core/src/validation.rs	JSON schema validation for output files	F014

Database Schema Discovery
Database Reverse Engineering
SQL Generation and DDL Extraction
Data Classification and PII Detection
Security and Encryption Standards
Offline-First Architecture Patterns
Database Adapter Design Patterns