Performance Requirements and Benchmarks#

Lead Section#

DBSurveyor implements a comprehensive performance management strategy centered on connection pooling, query timeouts, data sampling, and concurrent processing to efficiently handle large-scale database schema collection. While the project lacks formally documented performance targets with specific numeric goals, it establishes implicit performance expectations through configuration defaults and comprehensive testing infrastructure using the Criterion benchmarking framework. The architecture prioritizes memory efficiency with a target of <100MB memory usage for schemas containing 10,000 tables, resource-constrained operation through configurable connection limits and timeouts, and output file size management with Zstandard compression achieving 60-80% size reduction for large schemas.

The performance strategy emphasizes pragmatic optimization over theoretical benchmarks, with default configurations including 30-second connection and query timeouts, maximum 10 concurrent connections, and 100-row sampling per table. Despite planned benchmark implementations in benches/collection.rs and benches/documentation.rs described in documentation, no actual Criterion benchmark files currently exist in the codebase. Instead, the project relies on extensive integration tests in files like postgres_connection_pooling.rs (748 lines) to validate performance characteristics through real-world scenarios.

This article documents DBSurveyor's performance requirements, optimization strategies, testing infrastructure, and output file format implementations as they exist in the current codebase rather than theoretical targets.

Performance Targets and Expectations#

Memory Usage Requirements#

DBSurveyor establishes one explicit memory performance target:

Memory usage must remain below 100MB for schemas containing 10,000 tables (documented in testing.md example)

This target is validated through the test test_memory_usage_large_schema which verifies streaming processing handles large datasets within memory constraints.

Connection and Timeout Configuration#

Default connection parameters defined in ConnectionConfig:

Connection timeout: 30 seconds (default)
Query timeout: 30 seconds (default)
Maximum connections: 10 (default), capped at 100 for safety
Minimum idle connections: 2 (default)
Idle timeout: 10 minutes (default)
Maximum connection lifetime: 1 hour (default)
Read-only mode: Enforced for security

These defaults indicate implicit expectations that normal operations complete within seconds to minutes rather than hours.

Output File Size Constraints#

Documentation specifies file size expectations:

Maximum 100MB per schema file (validation rule)
Compression reduces 1-10MB schemas by 60-70%
Compression reduces >10MB schemas by 70-80%

Data Sampling Configuration#

Default sampling behavior:

Sample size: 100 rows per table (default)
Configurable via CLI: --sample N argument allows customization
Disable sampling: --sample 0 for faster collection without data samples
Throttling support: --throttle MS adds configurable delays between operations for stealth mode

Performance Optimization Strategies#

Connection Pooling#

DBSurveyor implements connection pooling across all database adapters using SQLx's native pooling capabilities. PostgreSQL implementation:

let pool = sqlx::postgres::PgPoolOptions::new()
    .max_connections(config.max_connections.min(100))
    .min_connections(config.min_idle_connections)
    .acquire_timeout(config.connect_timeout)
    .idle_timeout(config.idle_timeout)
    .max_lifetime(config.max_lifetime)
    .test_before_acquire(true)
    .connect_lazy(connection_string)

Key features:

Health checks: test_before_acquire(true) validates connections before use
Lazy initialization: Avoids creating unnecessary connections at startup
Configurable limits: Balance between parallelism and resource consumption
Similar implementations exist for MySQL and MongoDB

Query Timeout Mechanisms#

Query timeouts are configured at multiple levels to prevent resource exhaustion:

Session-Level Timeouts (PostgreSQL) - Configured via after_connect hook:

.after_connect(move |conn, _meta| {
    Box::pin(async move {
        conn.execute("SET statement_timeout = '30s'").await?;
        conn.execute("SET lock_timeout = '30s'").await?;
        conn.execute("SET idle_in_transaction_session_timeout = '60s'").await?;
    })
})

MySQL Timeouts - Set via max_execution_time:

conn.execute("SET max_execution_time = 30000").await?;

CLI Configuration - Timeouts are configurable via connection string parameters like connect_timeout and statement_timeout.

Pagination and Data Sampling with LIMIT#

DBSurveyor implements intelligent data sampling through the DatabaseAdapter trait's sample_table() method. The trait defines a standardized interface that accepts a TableRef parameter containing the table name and optional schema:

async fn sample_table(
    &self,
    table_ref: TableRef<'_>,
    config: &SamplingConfig,
) -> Result<TableSample>;

The TableRef struct provides a unified reference to database tables:

pub struct TableRef<'a> {
    pub schema_name: Option<&'a str>,
    pub table_name: &'a str,
}

PostgreSQL implementation uses SQL LIMIT clauses with intelligent ordering:

pub async fn sample_table(
    pool: &PgPool,
    schema: Option<&str>,
    table: &str,
    config: &SamplingConfig,
) -> Result<TableSample, DbSurveyorError> {
    let detection_schema = schema.unwrap_or("public");
    let strategy = detect_ordering_strategy(pool, detection_schema, table).await?;
    let order_clause = generate_order_by_clause(&strategy, true);

    let from_clause = match schema {
        Some(s) => format!(r#""{}"."{}" t"#, s, table),
        None => format!(r#""{}" t"#, table),
    };

    let sample_query = format!(
        "SELECT row_to_json(t.*) AS row_data FROM {} {} LIMIT $1",
        from_clause, order_clause
    );

    let rows: Vec<JsonValue> = sqlx::query_scalar(&sample_query)
        .bind(config.sample_size as i64)
        .fetch_all(pool)
        .await?;
}

Intelligent ordering strategy detection:

Primary key columns (most reliable)
Timestamp columns (created_at, updated_at, etc.)
Auto-increment/serial columns
Random sampling fallback (RANDOM() for PostgreSQL, RAND() for MySQL)

Sample Status Tracking: The returned TableSample includes an optional sample_status field with three states:

SampleStatus::Complete - Successful sampling (PostgreSQL)
SampleStatus::PartialRetry { original_limit } - Partial sampling that may be retried
SampleStatus::Skipped { reason } - Sampling not performed (MySQL, SQLite, MongoDB stubs)

The sample_status field is backward-compatible: it's omitted from JSON serialization when None, allowing older consumers to process the output without modification.

MySQL, SQLite, and MongoDB adapters currently return stub implementations with SampleStatus::Skipped { reason: "not yet implemented" }.

Row Count Estimation#

Instead of expensive COUNT(*) queries, DBSurveyor uses database statistics for fast approximate counts:

SELECT reltuples::bigint AS estimated_count
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = $1 AND c.relname = $2

This approach trades perfect accuracy for significant performance gains on large tables.

Concurrent Multi-Database Collection#

DBSurveyor supports collecting from multiple databases concurrently:

async fn collect_databases_concurrent(
    adapter: &PostgresAdapter,
    databases: &[EnumeratedDatabase],
    config: &MultiDatabaseConfig,
) -> (Vec<DatabaseCollectionResult>, Vec<DatabaseFailure>) {
    let collection_futures = databases.iter().map(|db| {
        async move {
            collect_single_database(adapter, &db.name).await
        }
    });

    let mut stream = stream::iter(collection_futures)
        .buffer_unordered(config.max_concurrency);
}

Configuration parameters:

Default max concurrency: 4 databases
Controlled parallelism: Uses futures::stream::buffer_unordered
Async/await with Tokio: Entire system uses Tokio runtime for asynchronous operations

Rate Limiting and Throttling#

Throttling is implemented in sampling operations:

if let Some(throttle_ms) = config.throttle_ms {
    let delay = Duration::from_millis(throttle_ms);
    tokio::time::sleep(delay).await;
}

Configurable via --throttle CLI argument to prevent database overload during stealth operations.

Output Compression and Memory Optimization#

Compression using Zstandard reduces memory footprint:

async fn save_compressed(json_data: &str, output_path: &PathBuf) -> Result<()> {
    let mut encoder = zstd::Encoder::new(Vec::new(), 3)?;
    encoder.write_all(json_data.as_bytes())?;
    let compressed_data = encoder.finish()?;
    tokio::fs::write(output_path, compressed_data).await?;
}

Compression level 3 is used as a balance between speed and compression ratio. Enabled via --compress CLI flag.

Benchmarking Infrastructure#

Planned vs. Implemented#

DBSurveyor has Criterion listed as a dev-dependency and includes a bench command in the justfile, but no actual benchmark files exist in the codebase.

Documentation describes planned benchmark structure:

benches/collection.rs - Collection benchmarks
benches/documentation.rs - Documentation benchmarks

The documentation even includes example Criterion benchmark code showing how to benchmark schema serialization, but these files don't exist in the actual codebase.

Integration Testing as Performance Validation#

Instead of formal benchmarks, DBSurveyor uses extensive integration tests to validate performance characteristics:

Connection Pooling Tests (postgres_connection_pooling.rs, 748 lines):

Performance testing categories documented in testing.md:

Criterion Benchmarks (planned)
Memory Tests
Concurrency Tests

Running Performance Tests#

# Run all tests including performance categories
just test

# Run benchmarks (requires benchmark files)
cargo bench

# Generate coverage report (>80% required)
just coverage

Output File Formats and Performance#

.dbsurveyor.json Format#

Primary output format defined by DatabaseSchema struct in dbsurveyor-core/src/models.rs. Uses FORMAT_VERSION = "1.0" per DBSurveyor Schema specification.

Serialization process:

Convert schema to JSON with serde_json::to_value()
Validate output using validate_schema_output()
Format with to_string_pretty()
Write to disk

.dbsurveyor.json.zst Compression#

Compression uses Zstandard library (zstd = "0.13.3"):

Implementation in save_compressed():

Compression level 3 (balance between speed and ratio)
Requires --features compression at compile time
Enabled via --compress CLI flag

Documented compression ratios:

Schema Size	Recommendation	Expected Compression
< 1 MB	Standard JSON	N/A
1-10 MB	Zstandard compression	60-70% reduction
> 10 MB	Compression + chunking	70-80% reduction

Decompression implementation uses zstd::Decoder with automatic format detection.

.dbsurveyor.enc Encryption#

Encryption implemented in dbsurveyor-core/src/security/encryption.rs using:

aes-gcm = "0.10.3" for AES-GCM-256
argon2 = "0.5.3" for key derivation
zeroize = "1.8.2" for secure memory management

Cryptographic parameters:

Parameter	Value	Rationale
AES key size	256 bits (32 bytes)	NSA CNSSP-15 compliant
Nonce size	96 bits (12 bytes)	NIST SP 800-38D optimal
Auth tag size	128 bits (16 bytes)	Maximum strength
Salt size	128 bits (16 bytes)	RFC 9106 minimum
Argon2id memory	64 MiB (65536 KiB)	OWASP 2024 high-security
Time cost	3 iterations	RFC 9106 recommendation
Parallelism	4 threads	Multi-core optimization

Encrypted file structure:

pub struct EncryptedData {
    pub algorithm: String, // "AES-GCM-256"
    pub nonce: Vec<u8>, // 12 bytes
    pub ciphertext: Vec<u8>, // encrypted payload
    pub auth_tag: Vec<u8>, // 16 bytes
    pub kdf_params: KdfParams, // Key derivation parameters
}

Output Validation#

All output undergoes strict validation before writing:

Embedded JSON Schema v1.0 validation
Format version checking for compatibility
Security validation - credential pattern detection
Connection string sanitization

Relevant Code Files#

File	Purpose	Lines
`dbsurveyor-core/src/adapters/config/connection.rs`	Connection pool configuration and limits	-
`dbsurveyor-core/src/adapters/postgres/connection.rs`	PostgreSQL connection pooling and timeouts	-
`dbsurveyor-core/src/adapters/mysql/connection.rs`	MySQL connection pooling and timeouts	-
`dbsurveyor-core/src/adapters/postgres/sampling.rs`	PostgreSQL data sampling with LIMIT	-
`dbsurveyor-core/src/adapters/mysql/sampling.rs`	MySQL data sampling with LIMIT	-
`dbsurveyor-core/src/adapters/config/sampling.rs`	Sampling configuration structures	-
`dbsurveyor-core/src/adapters/postgres/multi_database.rs`	Concurrent multi-database collection	-
`dbsurveyor-core/tests/postgres_connection_pooling.rs`	Performance integration tests	748
`dbsurveyor-collect/src/main.rs`	CLI, output generation, compression, encryption	-
`dbsurveyor-core/src/models.rs`	Core data structures with serialization	-
`dbsurveyor-core/src/security/encryption.rs`	AES-GCM encryption implementation	713
`dbsurveyor-core/src/validation.rs`	JSON Schema validation and security checks	628
`docs/src/testing.md`	Testing documentation (includes planned benchmarks)	-
`docs/src/output-formats.md`	Output format documentation and compression ratios	-

Database Adapters: Connection pooling and timeout implementations are adapter-specific
Security: Encryption and validation ensure safe storage of sensitive schema information
Testing: Integration tests serve as performance validation in absence of formal benchmarks
CLI Configuration: Performance parameters are user-configurable via command-line flags
Output Formats: File format choice impacts performance through compression and encryption overhead