Performance Requirements and Benchmarks#
Lead Section#
DBSurveyor implements a comprehensive performance management strategy centered on connection pooling, query timeouts, data sampling, and concurrent processing to efficiently handle large-scale database schema collection. While the project lacks formally documented performance targets with specific numeric goals, it establishes implicit performance expectations through configuration defaults and comprehensive testing infrastructure using the Criterion benchmarking framework. The architecture prioritizes memory efficiency with a target of <100MB memory usage for schemas containing 10,000 tables, resource-constrained operation through configurable connection limits and timeouts, and output file size management with Zstandard compression achieving 60-80% size reduction for large schemas.
The performance strategy emphasizes pragmatic optimization over theoretical benchmarks, with default configurations including 30-second connection and query timeouts, maximum 10 concurrent connections, and 100-row sampling per table. Despite planned benchmark implementations in benches/collection.rs and benches/documentation.rs described in documentation, no actual Criterion benchmark files currently exist in the codebase. Instead, the project relies on extensive integration tests in files like postgres_connection_pooling.rs (748 lines) to validate performance characteristics through real-world scenarios.
This article documents DBSurveyor's performance requirements, optimization strategies, testing infrastructure, and output file format implementations as they exist in the current codebase rather than theoretical targets.
Performance Targets and Expectations#
Memory Usage Requirements#
DBSurveyor establishes one explicit memory performance target:
- Memory usage must remain below 100MB for schemas containing 10,000 tables (documented in testing.md example)
This target is validated through the test test_memory_usage_large_schema which verifies streaming processing handles large datasets within memory constraints.
Connection and Timeout Configuration#
Default connection parameters defined in ConnectionConfig:
- Connection timeout: 30 seconds (default)
- Query timeout: 30 seconds (default)
- Maximum connections: 10 (default), capped at 100 for safety
- Minimum idle connections: 2 (default)
- Idle timeout: 10 minutes (default)
- Maximum connection lifetime: 1 hour (default)
- Read-only mode: Enforced for security
These defaults indicate implicit expectations that normal operations complete within seconds to minutes rather than hours.
Output File Size Constraints#
Documentation specifies file size expectations:
- Maximum 100MB per schema file (validation rule)
- Compression reduces 1-10MB schemas by 60-70%
- Compression reduces >10MB schemas by 70-80%
Data Sampling Configuration#
- Sample size: 100 rows per table (default)
- Configurable via CLI:
--sample Nargument allows customization - Disable sampling:
--sample 0for faster collection without data samples - Throttling support:
--throttle MSadds configurable delays between operations for stealth mode
Performance Optimization Strategies#
Connection Pooling#
DBSurveyor implements connection pooling across all database adapters using SQLx's native pooling capabilities. PostgreSQL implementation:
let pool = sqlx::postgres::PgPoolOptions::new()
.max_connections(config.max_connections.min(100))
.min_connections(config.min_idle_connections)
.acquire_timeout(config.connect_timeout)
.idle_timeout(config.idle_timeout)
.max_lifetime(config.max_lifetime)
.test_before_acquire(true)
.connect_lazy(connection_string)
Key features:
- Health checks:
test_before_acquire(true)validates connections before use - Lazy initialization: Avoids creating unnecessary connections at startup
- Configurable limits: Balance between parallelism and resource consumption
- Similar implementations exist for MySQL and MongoDB
Query Timeout Mechanisms#
Query timeouts are configured at multiple levels to prevent resource exhaustion:
Session-Level Timeouts (PostgreSQL) - Configured via after_connect hook:
.after_connect(move |conn, _meta| {
Box::pin(async move {
conn.execute("SET statement_timeout = '30s'").await?;
conn.execute("SET lock_timeout = '30s'").await?;
conn.execute("SET idle_in_transaction_session_timeout = '60s'").await?;
})
})
MySQL Timeouts - Set via max_execution_time:
conn.execute("SET max_execution_time = 30000").await?;
CLI Configuration - Timeouts are configurable via connection string parameters like connect_timeout and statement_timeout.
Pagination and Data Sampling with LIMIT#
DBSurveyor implements intelligent data sampling through the DatabaseAdapter trait's sample_table() method. The trait defines a standardized interface that accepts a TableRef parameter containing the table name and optional schema:
async fn sample_table(
&self,
table_ref: TableRef<'_>,
config: &SamplingConfig,
) -> Result<TableSample>;
The TableRef struct provides a unified reference to database tables:
pub struct TableRef<'a> {
pub schema_name: Option<&'a str>,
pub table_name: &'a str,
}
PostgreSQL implementation uses SQL LIMIT clauses with intelligent ordering:
pub async fn sample_table(
pool: &PgPool,
schema: Option<&str>,
table: &str,
config: &SamplingConfig,
) -> Result<TableSample, DbSurveyorError> {
let detection_schema = schema.unwrap_or("public");
let strategy = detect_ordering_strategy(pool, detection_schema, table).await?;
let order_clause = generate_order_by_clause(&strategy, true);
let from_clause = match schema {
Some(s) => format!(r#""{}"."{}" t"#, s, table),
None => format!(r#""{}" t"#, table),
};
let sample_query = format!(
"SELECT row_to_json(t.*) AS row_data FROM {} {} LIMIT $1",
from_clause, order_clause
);
let rows: Vec<JsonValue> = sqlx::query_scalar(&sample_query)
.bind(config.sample_size as i64)
.fetch_all(pool)
.await?;
}
Intelligent ordering strategy detection:
- Primary key columns (most reliable)
- Timestamp columns (created_at, updated_at, etc.)
- Auto-increment/serial columns
- Random sampling fallback (RANDOM() for PostgreSQL, RAND() for MySQL)
Sample Status Tracking: The returned TableSample includes an optional sample_status field with three states:
SampleStatus::Complete- Successful sampling (PostgreSQL)SampleStatus::PartialRetry { original_limit }- Partial sampling that may be retriedSampleStatus::Skipped { reason }- Sampling not performed (MySQL, SQLite, MongoDB stubs)
The sample_status field is backward-compatible: it's omitted from JSON serialization when None, allowing older consumers to process the output without modification.
MySQL, SQLite, and MongoDB adapters currently return stub implementations with SampleStatus::Skipped { reason: "not yet implemented" }.
Row Count Estimation#
Instead of expensive COUNT(*) queries, DBSurveyor uses database statistics for fast approximate counts:
SELECT reltuples::bigint AS estimated_count
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = $1 AND c.relname = $2
This approach trades perfect accuracy for significant performance gains on large tables.
Concurrent Multi-Database Collection#
DBSurveyor supports collecting from multiple databases concurrently:
async fn collect_databases_concurrent(
adapter: &PostgresAdapter,
databases: &[EnumeratedDatabase],
config: &MultiDatabaseConfig,
) -> (Vec<DatabaseCollectionResult>, Vec<DatabaseFailure>) {
let collection_futures = databases.iter().map(|db| {
async move {
collect_single_database(adapter, &db.name).await
}
});
let mut stream = stream::iter(collection_futures)
.buffer_unordered(config.max_concurrency);
}
- Default max concurrency: 4 databases
- Controlled parallelism: Uses
futures::stream::buffer_unordered - Async/await with Tokio: Entire system uses Tokio runtime for asynchronous operations
Rate Limiting and Throttling#
Throttling is implemented in sampling operations:
if let Some(throttle_ms) = config.throttle_ms {
let delay = Duration::from_millis(throttle_ms);
tokio::time::sleep(delay).await;
}
Configurable via --throttle CLI argument to prevent database overload during stealth operations.
Output Compression and Memory Optimization#
Compression using Zstandard reduces memory footprint:
async fn save_compressed(json_data: &str, output_path: &PathBuf) -> Result<()> {
let mut encoder = zstd::Encoder::new(Vec::new(), 3)?;
encoder.write_all(json_data.as_bytes())?;
let compressed_data = encoder.finish()?;
tokio::fs::write(output_path, compressed_data).await?;
}
Compression level 3 is used as a balance between speed and compression ratio. Enabled via --compress CLI flag.
Benchmarking Infrastructure#
Planned vs. Implemented#
DBSurveyor has Criterion listed as a dev-dependency and includes a bench command in the justfile, but no actual benchmark files exist in the codebase.
Documentation describes planned benchmark structure:
benches/collection.rs- Collection benchmarksbenches/documentation.rs- Documentation benchmarks
The documentation even includes example Criterion benchmark code showing how to benchmark schema serialization, but these files don't exist in the actual codebase.
Integration Testing as Performance Validation#
Instead of formal benchmarks, DBSurveyor uses extensive integration tests to validate performance characteristics:
Connection Pooling Tests (postgres_connection_pooling.rs, 748 lines):
- Pool limit testing - concurrent connection handling and exhaustion
- Timeout scenarios - various timeout configurations including edge cases
- Concurrent operations - multiple simultaneous schema collections
- Pool health monitoring - statistics tracking and health checks
Performance testing categories documented in testing.md:
- Criterion Benchmarks (planned)
- Memory Tests
- Concurrency Tests
Running Performance Tests#
# Run all tests including performance categories
just test
# Run benchmarks (requires benchmark files)
cargo bench
# Generate coverage report (>80% required)
just coverage
Output File Formats and Performance#
.dbsurveyor.json Format#
Primary output format defined by DatabaseSchema struct in dbsurveyor-core/src/models.rs. Uses FORMAT_VERSION = "1.0" per DBSurveyor Schema specification.
- Convert schema to JSON with
serde_json::to_value() - Validate output using
validate_schema_output() - Format with
to_string_pretty() - Write to disk
.dbsurveyor.json.zst Compression#
Compression uses Zstandard library (zstd = "0.13.3"):
Implementation in save_compressed():
- Compression level 3 (balance between speed and ratio)
- Requires
--features compressionat compile time - Enabled via
--compressCLI flag
Documented compression ratios:
| Schema Size | Recommendation | Expected Compression |
|---|---|---|
| < 1 MB | Standard JSON | N/A |
| 1-10 MB | Zstandard compression | 60-70% reduction |
| > 10 MB | Compression + chunking | 70-80% reduction |
Decompression implementation uses zstd::Decoder with automatic format detection.
.dbsurveyor.enc Encryption#
Encryption implemented in dbsurveyor-core/src/security/encryption.rs using:
aes-gcm = "0.10.3"for AES-GCM-256argon2 = "0.5.3"for key derivationzeroize = "1.8.2"for secure memory management
| Parameter | Value | Rationale |
|---|---|---|
| AES key size | 256 bits (32 bytes) | NSA CNSSP-15 compliant |
| Nonce size | 96 bits (12 bytes) | NIST SP 800-38D optimal |
| Auth tag size | 128 bits (16 bytes) | Maximum strength |
| Salt size | 128 bits (16 bytes) | RFC 9106 minimum |
| Argon2id memory | 64 MiB (65536 KiB) | OWASP 2024 high-security |
| Time cost | 3 iterations | RFC 9106 recommendation |
| Parallelism | 4 threads | Multi-core optimization |
pub struct EncryptedData {
pub algorithm: String, // "AES-GCM-256"
pub nonce: Vec<u8>, // 12 bytes
pub ciphertext: Vec<u8>, // encrypted payload
pub auth_tag: Vec<u8>, // 16 bytes
pub kdf_params: KdfParams, // Key derivation parameters
}
Output Validation#
All output undergoes strict validation before writing:
- Embedded JSON Schema v1.0 validation
- Format version checking for compatibility
- Security validation - credential pattern detection
- Connection string sanitization
Relevant Code Files#
| File | Purpose | Lines |
|---|---|---|
dbsurveyor-core/src/adapters/config/connection.rs | Connection pool configuration and limits | - |
dbsurveyor-core/src/adapters/postgres/connection.rs | PostgreSQL connection pooling and timeouts | - |
dbsurveyor-core/src/adapters/mysql/connection.rs | MySQL connection pooling and timeouts | - |
dbsurveyor-core/src/adapters/postgres/sampling.rs | PostgreSQL data sampling with LIMIT | - |
dbsurveyor-core/src/adapters/mysql/sampling.rs | MySQL data sampling with LIMIT | - |
dbsurveyor-core/src/adapters/config/sampling.rs | Sampling configuration structures | - |
dbsurveyor-core/src/adapters/postgres/multi_database.rs | Concurrent multi-database collection | - |
dbsurveyor-core/tests/postgres_connection_pooling.rs | Performance integration tests | 748 |
dbsurveyor-collect/src/main.rs | CLI, output generation, compression, encryption | - |
dbsurveyor-core/src/models.rs | Core data structures with serialization | - |
dbsurveyor-core/src/security/encryption.rs | AES-GCM encryption implementation | 713 |
dbsurveyor-core/src/validation.rs | JSON Schema validation and security checks | 628 |
docs/src/testing.md | Testing documentation (includes planned benchmarks) | - |
docs/src/output-formats.md | Output format documentation and compression ratios | - |
Related Topics#
- Database Adapters: Connection pooling and timeout implementations are adapter-specific
- Security: Encryption and validation ensure safe storage of sensitive schema information
- Testing: Integration tests serve as performance validation in absence of formal benchmarks
- CLI Configuration: Performance parameters are user-configurable via command-line flags
- Output Formats: File format choice impacts performance through compression and encryption overhead