Data Quality Metrics And Analysis Engine#
Lead Section#
The Data Quality Metrics and Analysis Engine (also known as QualityAnalyzer) is a comprehensive data quality assessment system planned for DBSurveyor milestone v0.1. It enables security assessment operators to quickly identify data anomalies, assess completeness, and validate data consistency during database reconnaissance operations.
The system processes TableSample objects through a collection pipeline, computing four distinct types of quality metrics: completeness, consistency, uniqueness, and anomaly detection. The architecture follows five core design principles: KISS (simple focused metrics), TDD (test-first development), Security-First (no PII in outputs), Database-Agnostic (unified metrics), and Performance (O(n) algorithms).
Quality metrics are embedded in JSON output and expose only statistical aggregates (counts and ratios), never actual data values, ensuring security-conscious data analysis.
Problem Statement and Purpose#
DBSurveyor originally lacked comprehensive data quality assessment capabilities, preventing rapid identification of data anomalies and validation of data consistency during security assessments. The QualityAnalyzer addresses this gap by providing automated quality metrics generation.
Architecture and Design#
Core Design Principles#
The QualityAnalyzer follows five fundamental principles:
- KISS (Keep It Simple): Simple, focused metrics without over-engineering
- TDD (Test-Driven Development): Test-first development approach
- Security-First: No PII or sensitive data in metric outputs
- Database-Agnostic: Unified metrics across all database types
- Performance: O(n) algorithm complexity with < 100ms for 1000 row samples
Data Flow#
The QualityAnalyzer processes TableSample objects through the existing collection pipeline and generates TableQualityMetrics structures containing the computed assessments.
TableSample structure contains sampled rows as JSON values along with metadata:
pub struct TableSample {
pub table_name: String,
pub schema_name: Option<String>,
pub rows: Vec<serde_json::Value>,
pub sample_size: u32,
pub total_rows: Option<u64>,
pub sampling_strategy: SamplingStrategy,
pub collected_at: chrono::DateTime<chrono::Utc>,
pub warnings: Vec<String>,
}
The collection pipeline follows this flow:
- Adapter collects schema with optional samples
- Quality analysis runs if enabled:
- Build QualityConfig from CLI arguments
- Create QualityAnalyzer instance
- Process all samples via
analyzer.analyze_all() - Attach metrics to schema using
add_quality_metrics()
- Schema serialized to JSON with all optional fields
Quality Metric Types#
The QualityAnalyzer computes four distinct types of quality metrics, each implemented as a specialized analyzer component.
Completeness Metrics#
The completeness analyzer measures data availability by tracking null and empty value distribution. It performs three operations per column:
- Count null values
- Count empty strings
- Calculate completeness ratio using:
(total - nulls - empty) / total
The analyzer produces a CompletenessMetrics structure containing:
- Overall completeness score (0.0-1.0)
- Per-column null/empty counts
- Total null values across all columns
- Total empty string values
Consistency Metrics#
The consistency analyzer validates data format consistency through three methods:
- Type consistency: Verifies all values within a column maintain the same JSON type
- Format pattern matching: Validates common patterns like dates, UUIDs, and emails
- Length consistency: Checks fixed-format fields for consistent lengths
Output includes a consistency score, type inconsistencies (e.g., mixed "123" vs 123), and format violations.
Uniqueness Metrics#
The uniqueness analyzer detects duplicate data through:
- Per-column uniqueness ratio calculation
- Full row duplicate detection
- Primary key uniqueness validation
The analyzer outputs a uniqueness score, columns containing duplicate values, and the total count of duplicate rows (exact matches).
Anomaly Detection#
The anomaly detector performs statistical outlier identification on numeric columns using:
- Z-score calculation: Primary method for detecting outliers
- IQR (Interquartile Range): Fallback method when Z-score is inappropriate
- Configurable sensitivity: Three threshold levels (Low: 3.0σ, Medium: 2.5σ, High: 2.0σ)
Output consists of outlier counts and per-column anomaly details.
Implementation Phases#
The implementation follows a four-phase approach:
Phase 1: Core Models and Configuration#
Phase 1 establishes foundational types including QualityConfig for configurable thresholds, TableQualityMetrics for comprehensive quality assessment results, and supporting metric structures for completeness, consistency, uniqueness, and anomaly data.
Phase 2: Quality Analysis Engine#
Phase 2 implements four specialized analyzers: Completeness Analyzer (null/empty distribution), Consistency Analyzer (type/format validation), Uniqueness Analyzer (duplicate detection), and Anomaly Detector (statistical outlier detection using Z-score and IQR methods).
Phase 3: Integration#
Phase 3 integrates quality analysis into existing models, configuration structures, and CLI with new flags like --enable-quality and --quality-threshold.
Phase 4: Output Integration#
Phase 4 embeds quality metrics in JSON output following existing patterns for optional data.
Configuration#
The QualityConfig structure defines all configurable thresholds for quality analysis:
pub struct QualityConfig {
pub enabled: bool, // Enable quality analysis
pub completeness_min: f64, // Minimum completeness threshold (0.0-1.0)
pub uniqueness_min: f64, // Minimum uniqueness threshold (0.0-1.0)
pub consistency_min: f64, // Minimum consistency threshold (0.0-1.0)
pub anomaly_detection: AnomalyConfig, // Anomaly detection settings
}
pub struct AnomalyConfig {
pub enabled: bool,
pub sensitivity: AnomalySensitivity,
}
pub enum AnomalySensitivity {
Low, // 3.0 standard deviations
Medium, // 2.5 standard deviations
High, // 2.0 standard deviations
}
Thresholds control the minimum acceptable values for each metric type. Anomaly detection sensitivity determines how many standard deviations from the mean constitute an outlier.
CLI Usage#
Three new CLI flags enable quality analysis control:
--enable-quality: Enable quality analysis (disabled by default)--quality-threshold <metric:value>: Override specific threshold (e.g.,completeness:0.9)--disable-anomaly-detection: Disable statistical anomaly detection
Example usage:
dbsurveyor collect --enable-quality --quality-threshold completeness:0.95
Output Format#
Quality metrics are embedded in JSON output under the quality_metrics key within each table's data structure. The output follows existing DBSurveyor patterns for optional data, appearing only when quality analysis is enabled.
Performance#
Performance targets and characteristics:
- Algorithm complexity: O(n) where n = sample size (typically 100-1000 rows)
- Duplicate detection: Hash-based for O(1) average lookups
- Statistics computation: Single-pass algorithms where possible
- Memory usage: Bounded by sample size (data already loaded for sampling)
- Target execution time: < 100ms for 1000 row samples
Success criteria specify that analysis must complete in under 100ms for 1000-row samples.
Security Considerations#
Security constraints are fundamental to the design:
- Data privacy: Quality metrics expose only counts and ratios, never actual data values
- Anomaly details: Outlier information does not include the actual outlier values
- Output sanitization: All outputs follow existing security patterns
- Offline operation: No network calls required - purely local analysis of sampled data
This security-first design principle ensures no PII or sensitive data appears in metric outputs.
Testing#
Unit Tests (detailed in specification):
- Completeness analyzer: null counts, empty counts, ratio calculation
- Consistency analyzer: type uniformity, format patterns, length validation
- Uniqueness analyzer: column duplicates, row duplicates, primary key validation
- Anomaly detector: Z-score calculation, IQR fallback, sensitivity thresholds
Integration Tests (across all databases):
- PostgreSQL, MySQL, SQLite, MongoDB
- End-to-end collection with quality analysis
- Configuration parsing and CLI flag processing
Implementation Details#
File Structure#
The implementation requires 8 new source files (~1510 lines total) in the quality module:
| File | Purpose | Est. Lines |
|---|---|---|
dbsurveyor-core/src/quality/mod.rs | Module declaration | 30 |
dbsurveyor-core/src/quality/config.rs | Configuration types | 150 |
dbsurveyor-core/src/quality/models.rs | Metric data structures | 200 |
dbsurveyor-core/src/quality/completeness.rs | Completeness analysis | 150 |
dbsurveyor-core/src/quality/consistency.rs | Consistency analysis | 180 |
dbsurveyor-core/src/quality/uniqueness.rs | Uniqueness analysis | 150 |
dbsurveyor-core/src/quality/anomaly.rs | Anomaly detection | 200 |
dbsurveyor-core/src/quality/analyzer.rs | Main analyzer facade | 150 |
dbsurveyor-core/tests/quality_*.rs | Integration tests | 300 |
Six existing files require modifications for integration:
| File | Change |
|---|---|
dbsurveyor-core/src/lib.rs | Add pub mod quality; |
dbsurveyor-core/src/models.rs | Add quality_metrics field to DatabaseSchema |
dbsurveyor-core/src/adapters/config/collection.rs | Add quality_config field |
dbsurveyor-core/src/adapters/config/mod.rs | Re-export quality config |
dbsurveyor-collect/src/main.rs | Add CLI flags and orchestration |
dbsurveyor-core/Cargo.toml | Feature flag data-quality |
Dependencies#
No new external dependencies required. Uses existing libraries:
serde- Serializationchrono- Timestampsserde_json- JSON value analysis- Standard library for statistics (mean, standard deviation)
Out of Scope#
The following features are explicitly excluded:
- Real-time quality monitoring (batch analysis only)
- Machine learning-based anomaly detection (statistical methods only)
- Data correction or repair functionality
- Cross-table referential integrity validation (constraint-based only)
- Historical quality trend analysis
- Custom quality rule DSL (future enhancement)
- SQL Server support (not yet implemented in codebase)
Success Criteria#
Nine specific success criteria must be met:
- All quality metrics (completeness, consistency, uniqueness) computed correctly
- Anomaly detection identifies statistical outliers
- Configurable thresholds work via CLI and config file
- Quality metrics integrated into JSON output format
- All database adapters produce quality metrics from samples
just ci-checkpasses with zero warnings- Test coverage meets 80% threshold
- Documentation updated for new features
- Performance target: Analysis completes in < 100ms for 1000 row samples
Relevant Code Files#
The following table lists key source files in the EvilBit-Labs/dbsurveyor repository related to the Data Quality Metrics and Analysis Engine:
| File Path | Description | Lines |
|---|---|---|
project_specs/TASK-004-data-quality-metrics.md | Complete technical specification for QualityAnalyzer feature | 347 |
dbsurveyor-core/src/quality/mod.rs | Quality module declaration and exports | ~30 |
dbsurveyor-core/src/quality/config.rs | Configuration types (QualityConfig, AnomalyConfig) | ~150 |
dbsurveyor-core/src/quality/models.rs | Quality metric data structures | ~200 |
dbsurveyor-core/src/quality/completeness.rs | Completeness analyzer implementation | ~150 |
dbsurveyor-core/src/quality/consistency.rs | Consistency analyzer implementation | ~180 |
dbsurveyor-core/src/quality/uniqueness.rs | Uniqueness analyzer implementation | ~150 |
dbsurveyor-core/src/quality/anomaly.rs | Anomaly detection implementation | ~200 |
dbsurveyor-core/src/quality/analyzer.rs | Main analyzer facade coordinating all analyzers | ~150 |
dbsurveyor-core/src/models.rs | Core data models including TableSample and DatabaseSchema | Existing |
dbsurveyor-collect/src/main.rs | Collection pipeline with quality analysis integration | Existing |
Related Topics#
- DBSurveyor: Parent project providing database surveying and schema extraction capabilities
- TableSample: Input data structure containing sampled rows processed by QualityAnalyzer
- Database Adapters: PostgreSQL, MySQL, SQLite, and MongoDB integrations that produce table samples
- Statistical Analysis: Z-score and IQR methods used for anomaly detection
- Data Privacy: Security patterns ensuring no PII exposure in quality metrics
See Also#
- TASK-004 Specification - Complete technical specification
- DBSurveyor v0.1 Milestone - Target release for QualityAnalyzer feature