JSON Schema Specification#
The .dbsurveyor.json format is the standard output of DBSurveyor schema collection. This specification documents the format's complete structure, validation rules, and usage examples.
Overview#
The .dbsurveyor.json format provides a comprehensive, validated representation of database schemas with the following characteristics:
- Security-First: No credential fields or sensitive data allowed
- Validation-Complete: Full JSON Schema validation ensures data integrity
- Database-Agnostic: Unified representation across PostgreSQL, MySQL, SQLite, and MongoDB
- Version-Aware: Format versioning for backward compatibility
- Frictionless-Compatible: Based on Frictionless Data Table Schema specification
Schema Structure#
Root Object#
Every .dbsurveyor.json file is a single JSON object with this top-level structure:
{
"format_version": "1.0",
"database_info": { /* Database metadata */ },
"tables": [ /* Table definitions */ ],
"collection_metadata": { /* Collection process info */ }
}
Required Fields#
| Field | Type | Description |
|---|---|---|
| format_version | String | Schema format version (currently "1.0") |
| database_info | Object | Database-level information and status |
| collection_metadata | Object | Collection process metadata |
Optional Fields#
| Field | Type | Description |
|---|---|---|
| tables | Array | Table definitions (default: empty array) |
| views | Array | View definitions |
| indexes | Array | Database indexes |
| constraints | Array | Database constraints |
| procedures | Array | Stored procedures |
| functions | Array | Database functions |
| triggers | Array | Database triggers |
| custom_types | Array | Custom data types |
| samples | Array | Data samples from tables |
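A consumer can sanity-check these top-level fields before doing deeper processing. The following Python sketch is illustrative only (the file name is a placeholder); it enforces the required fields and version listed above:

```python
import json

REQUIRED_FIELDS = {"format_version", "database_info", "collection_metadata"}

def load_schema(path: str) -> dict:
    """Load a .dbsurveyor.json file and verify its required top-level fields."""
    with open(path, encoding="utf-8") as f:
        schema = json.load(f)
    missing = REQUIRED_FIELDS - schema.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if schema["format_version"] != "1.0":
        raise ValueError(f"unsupported format version: {schema['format_version']}")
    return schema

schema = load_schema("schema.json")
print(f"{len(schema.get('tables', []))} tables")  # tables defaults to an empty array
```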
Database Information#
The database_info object contains essential database metadata:
{
"name": "production_db",
"version": "13.7",
"size_bytes": 1073741824,
"encoding": "UTF8",
"collation": "en_US.UTF-8",
"owner": "dbadmin",
"is_system_database": false,
"access_level": "Full",
"collection_status": "Success"
}
Access Levels#
- Full: Complete schema access with all metadata
- Limited: Partial access due to permission constraints
- None: No access to schema information
Collection Status#
"Success": Schema collected successfully{"Failed": {"error": "Permission denied"}}: Collection failed with reason{"Skipped": {"reason": "System database"}}: Database skipped with explanation
Table Structure#
Tables are defined with comprehensive metadata:
{
"name": "users",
"schema": "public",
"columns": [
{
"name": "id",
"data_type": {
"Integer": {
"bits": 32,
"signed": true
}
},
"is_nullable": false,
"is_primary_key": true,
"is_auto_increment": true,
"ordinal_position": 1,
"comment": "Unique user identifier"
},
{
"name": "email",
"data_type": {
"String": {
"max_length": 255
}
},
"is_nullable": false,
"ordinal_position": 2,
"comment": "User email address"
}
],
"primary_key": {
"name": "users_pkey",
"columns": [
"id"
]
},
"foreign_keys": [
{
"name": "users_profile_fk",
"columns": [
"profile_id"
],
"referenced_table": "profiles",
"referenced_columns": [
"id"
],
"on_delete": "Cascade",
"on_update": "Cascade"
}
],
"indexes": [
{
"name": "users_email_idx",
"columns": [
{
"name": "email",
"order": "asc"
}
],
"is_unique": true,
"index_type": "btree"
}
],
"comment": "User account information",
"row_count": 50000
}
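Given this structure, tasks such as listing columns or relationships reduce to plain traversal. A hedged sketch using only the field names shown in the example above:

```python
def summarize_table(table: dict) -> None:
    """Print one line per column, then any foreign-key relationships."""
    print(f"{table.get('schema', '?')}.{table['name']} "
          f"({table.get('row_count', '?')} rows)")
    for col in table["columns"]:
        flags = []
        if col.get("is_primary_key"):
            flags.append("PK")
        if not col["is_nullable"]:
            flags.append("NOT NULL")
        print(f"  {col['ordinal_position']}. {col['name']} {' '.join(flags)}")
    for fk in table.get("foreign_keys", []):
        print(f"  FK {fk['columns']} -> "
              f"{fk['referenced_table']}{fk['referenced_columns']}")
```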
Data Type System#
DBSurveyor uses a unified data type system that maps database-specific types to a common representation:
Basic Types#
"Boolean" // Boolean values
"Date" // Date without time
"Json" // JSON data
"Uuid" // UUID/GUID values
String Types#
{
"String": {
"max_length": 255 // Maximum character length
}
}
Numeric Types#
{
"Integer": {
"bits": 32, // Bit size: 8, 16, 32, 64, 128
"signed": true // Signed or unsigned
}
}
{
"Float": {
"precision": 53 // Floating point precision (1-53)
}
}
Date/Time Types#
{
"DateTime": {
"with_timezone": true // Includes timezone information
}
}
{
"Time": {
"with_timezone": false // Time without timezone
}
}
Complex Types#
{
"Array": {
"element_type": "String" // Array element type
}
}
{
"Binary": {
"max_length": 1024 // Maximum binary length
}
}
{
"Custom": {
"type_name": "geometry" // Database-specific custom type
}
}
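Because a data_type value is either a bare string ("Boolean", "Uuid") or a single-key object ({"String": {...}}), consumers must handle both shapes. A minimal decoder sketch (the output strings are our own convention):

```python
def type_to_string(data_type) -> str:
    """Render a data_type value (string or single-key object) as readable text."""
    if isinstance(data_type, str):
        return data_type.lower()              # "Boolean" -> "boolean"
    (variant, params), = data_type.items()    # exactly one key names the variant
    if variant == "String":
        return f"string({params.get('max_length', '?')})"
    if variant == "Integer":
        prefix = "" if params["signed"] else "u"
        return f"{prefix}int{params['bits']}"
    if variant == "Custom":
        return params["type_name"]
    return variant.lower()                    # Float, DateTime, Array, Binary, ...

print(type_to_string({"Integer": {"bits": 32, "signed": True}}))  # int32
print(type_to_string("Uuid"))                                     # uuid
```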
Constraints and Relationships#
Primary Keys#
{
"name": "users_pkey",
"columns": [
"id"
]
}
Foreign Keys#
{
"name": "orders_user_fk",
"columns": [
"user_id"
],
"referenced_table": "users",
"referenced_schema": "public",
"referenced_columns": [
"id"
],
"on_delete": "Cascade",
"on_update": "Restrict"
}
Referential Actions:
- Cascade: Delete/update cascades to related records
- SetNull: Set foreign key to NULL
- SetDefault: Set foreign key to default value
- Restrict: Prevent deletion/update if references exist
- NoAction: No automatic action
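Foreign keys also define a dependency graph, which is useful for ordering operations such as data loads. A sketch (assuming table names are unique across the collected schemas; circular references raise graphlib.CycleError):

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def load_order(tables: list[dict]) -> list[str]:
    """Order tables so every referenced table precedes the tables that reference it."""
    deps = {
        t["name"]: {fk["referenced_table"] for fk in t.get("foreign_keys", [])}
        for t in tables
    }
    return list(TopologicalSorter(deps).static_order())
```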
Check Constraints#
{
"name": "users_age_check",
"constraint_type": "Check",
"definition": "age >= 0 AND age <= 150",
"enforced": true
}
Indexes#
Index definitions include performance characteristics:
{
"name": "users_email_idx",
"table_name": "users",
"schema": "public",
"columns": [
{
"name": "email",
"order": "asc",
"nulls_order": "last"
}
],
"is_unique": true,
"is_primary": false,
"index_type": "btree",
"comment": "Unique index on email for fast lookups"
}
Index Types:
- btree: Balanced tree (default)
- hash: Hash-based index
- gin: Generalized inverted index
- gist: Generalized search tree
- spgist: Space-partitioned GiST
Data Sampling#
Optional data samples provide insight into actual data:
{
"samples": [
{
"table_name": "users",
"schema_name": "public",
"rows": [
{
"id": 1001,
"email": "user1001@example.com",
"created_at": "2024-01-15T09:00:00Z"
},
{
"id": 1002,
"email": "user1002@example.com",
"created_at": "2024-01-15T09:15:00Z"
}
],
"sample_size": 2,
"total_rows": 50000,
"sampling_strategy": {
"MostRecent": {
"limit": 10
}
},
"collected_at": "2024-01-15T10:30:00Z",
"warnings": [
"Large table - limited sample collected"
],
"sample_status": "Complete"
}
]
}
Table Sample Fields#
| Field | Type | Required | Description |
|---|---|---|---|
| table_name | String | Yes | Name of the sampled table |
| schema_name | String | No | Schema/database name (null for databases without schemas) |
| rows | Array | Yes | Array of sampled row data as JSON objects |
| sample_size | Integer | Yes | Number of rows actually sampled |
| total_rows | Integer | No | Estimated total row count in the table |
| sampling_strategy | Object/String | Yes | Strategy used for sampling (see below) |
| collected_at | String | Yes | ISO 8601 timestamp of when the sample was collected |
| warnings | Array | Yes | Array of warning messages (empty if no warnings) |
| sample_status | String/Object | No | Status of the sampling operation (see below) |
Sampling Strategies#
{"MostRecent": {"limit": 10}}: Latest N rows{"Random": {"limit": 100}}: Random sample of N rows"None": No sampling performed
Sample Status#
The optional sample_status field tracks the outcome of the sampling operation. This field is backward-compatible and will be omitted when not set.
Status Values:
"Complete": Sampling completed successfully{"PartialRetry": {"original_limit": 100}}: Sampling partially completed with a reduced limit due to constraints or errors{"Skipped": {"reason": "Not implemented for this database"}}: Sampling was skipped with an explanation
Example with Complete Status:
{
"table_name": "orders",
"schema_name": "public",
"rows": [...],
"sample_size": 100,
"sampling_strategy": {"MostRecent": {"limit": 100}},
"collected_at": "2024-01-15T10:30:00Z",
"warnings": [],
"sample_status": "Complete"
}
Example with Skipped Status:
{
"table_name": "large_archive",
"schema_name": null,
"rows": [],
"sample_size": 0,
"sampling_strategy": "None",
"collected_at": "2024-01-15T10:30:00Z",
"warnings": [],
"sample_status": {
"Skipped": {
"reason": "Table exceeds size threshold"
}
}
}
Backward Compatibility Note: The sample_status field is optional and omitted when not explicitly set. Older schema files without this field remain fully compatible.
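In practice, a consumer might scan all samples and report coverage while tolerating the absent optional fields described above. A hedged sketch:

```python
def sample_report(schema: dict) -> None:
    """Report sampling coverage per table, tolerating absent optional fields."""
    for sample in schema.get("samples", []):
        status = sample.get("sample_status", "unknown")  # optional; may be absent
        total = sample.get("total_rows")                 # may be null
        coverage = f"{sample['sample_size']}/{total}" if total else "n/a"
        print(f"{sample['table_name']}: {coverage} rows, status={status}")
        for warning in sample["warnings"]:
            print(f"  warning: {warning}")
```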
Multi-Database Collections#
For server-level collections, the format supports multiple databases:
{
"format_version": "1.0",
"server_info": {
"server_type": "PostgreSQL",
"version": "13.7",
"host": "localhost",
"port": 5432,
"total_databases": 5,
"collected_databases": 3,
"system_databases_excluded": 2,
"connection_user": "dbadmin",
"has_superuser_privileges": true,
"collection_mode": {
"MultiDatabase": {
"discovered": 5,
"collected": 3,
"failed": 0
}
}
},
"databases": [
/* Individual database schemas */
],
"collection_metadata": {
"collected_at": "2024-01-15T10:30:00Z",
"collection_duration_ms": 2500,
"collector_version": "1.0.0",
"warnings": []
}
}
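Server-level files can be told apart from single-database files by the presence of server_info and databases. A minimal dispatch sketch:

```python
def iter_databases(schema: dict):
    """Yield individual database schemas from either file shape."""
    if "databases" in schema:  # server-level collection
        info = schema["server_info"]
        print(f"{info['server_type']} server: "
              f"{info['collected_databases']}/{info['total_databases']} databases collected")
        yield from schema["databases"]
    else:                      # single-database collection
        yield schema
```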
Collection Metadata#
Every schema file includes metadata about the collection process:
{
"collection_metadata": {
"collected_at": "2024-01-15T10:30:00Z",
"collection_duration_ms": 1500,
"collector_version": "1.0.0",
"warnings": [
"Large table 'audit_logs' - collection took 45 seconds",
"Custom type 'geometry' not fully supported"
],
"collector_options": {
"include_system_tables": false,
"sample_data": true,
"max_sample_size": 1000
}
}
}
Validation Rules#
Security Validation#
The schema enforces strict security rules:
- No credential fields: Field names cannot contain password, secret, token, etc.
- No connection strings: Database URLs are automatically filtered
- No sensitive patterns: Common credential patterns are rejected
- Sanitized output: All error messages are credential-free
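A consumer-side spot check for these rules might scan field names recursively for credential-like patterns. A sketch whose pattern list only approximates, and does not reproduce, the built-in validator's rules:

```python
import re

# Illustrative pattern list only; the built-in validator's rules may differ.
CREDENTIAL_PATTERN = re.compile(r"password|secret|token|api[_-]?key", re.IGNORECASE)

def find_credential_fields(node, path="$"):
    """Recursively yield JSON paths whose field names look credential-like."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if CREDENTIAL_PATTERN.search(key):
                yield child
            yield from find_credential_fields(value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_credential_fields(item, f"{path}[{i}]")
```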
Data Validation#
- String lengths: Maximum 255 characters for names, 1000 for comments
- Array limits: Maximum 1000 items in arrays
- Nesting depth: Maximum 10 levels of object nesting
- File size: Maximum 100MB per schema file
Type Validation#
- Required fields: All mandatory fields must be present
- Data type consistency: Values must match declared types
- Constraint validation: Foreign keys must reference valid tables
- Index validation: Index columns must exist in referenced table
Usage Examples#
Basic Schema Collection#
# Collect PostgreSQL schema
dbsurveyor-collect postgres://user:pass@localhost/mydb --output schema.json
# Validate the output
dbsurveyor schema schema.json --validate
# Generate documentation
dbsurveyor schema schema.json --format markdown --output schema.md
Multi-Database Collection#
# Collect all databases on server
dbsurveyor-collect postgres://admin:pass@localhost --all-databases --output server_schema.json
# Process specific database
dbsurveyor schema server_schema.json --database app_db --format json --output app_schema.json
Schema Validation#
# Validate against JSON Schema
dbsurveyor validate schema.json
# Check for specific issues
dbsurveyor validate schema.json --check-security --check-constraints
Error Handling#
Validation Errors#
When validation fails, DBSurveyor provides detailed error information:
{
"validation_errors": [
{
"path": "/tables/0/columns/1/data_type",
"message": "Invalid data type: expected String, Integer, Boolean, Date, Json, Uuid, Array, Binary, Custom, or DateTime",
"value": "VARCHAR",
"suggestion": "Use {\"String\": {\"max_length\": 255}} instead"
}
]
}
Collection Warnings#
Warnings are included in the metadata for non-critical issues:
{
"warnings": [
"Large table 'audit_logs' (1.2M rows) - collection took 45 seconds",
"Custom type 'geometry' not fully supported - using Custom type",
"Table 'temp_users' appears to be temporary - may not persist"
]
}
Version Compatibility#
Current Version: 1.0#
- Format: Stable and fully supported
- Validation: Complete JSON Schema validation
- Features: All documented features available
- Backward Compatibility: N/A (first version)
Future Versions#
The schema is designed for evolution:
- Additive Changes: New optional fields can be added
- Version Detection: Format version enables version-specific handling
- Migration Support: Tools will support upgrading between versions
- Deprecation Path: Old fields will be marked before removal
Integration#
Documentation Tools#
The schema format integrates with all DBSurveyor tools:
- dbsurveyor-collect: Generates schema files
- dbsurveyor: Processes and validates schemas
- dbsurveyor-docs: Generates documentation from schemas
- dbsurveyor-validate: Standalone validation tool
External Tools#
The format is compatible with:
- JSON Schema validators: jsonschema, ajv, etc.
- Data analysis tools: pandas, jq, etc.
- Documentation generators: Docusaurus, MkDocs, etc.
- CI/CD pipelines: GitHub Actions, GitLab CI, etc.
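For example, column metadata flattens naturally into tabular tools. A sketch using pandas (any JSON-aware tool works equally well; the file name is a placeholder):

```python
import json

import pandas as pd

with open("schema.json", encoding="utf-8") as f:
    schema = json.load(f)

# One row per column across all tables
rows = [
    {"table": t["name"], "column": c["name"], "nullable": c["is_nullable"]}
    for t in schema.get("tables", [])
    for c in t["columns"]
]
print(pd.DataFrame(rows).head())
```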
Best Practices#
Schema Collection#
- Use descriptive names: Avoid generic names like "db1", "test"
- Include comments: Add meaningful descriptions for tables and columns
- Sample strategically: Use sampling for large tables to avoid huge files
- Validate early: Check schemas immediately after collection
Schema Storage#
- Version control: Track schema changes in Git
- Backup regularly: Keep historical schema versions
- Compress large files: Use .zst compression for schemas >1MB
- Secure access: Limit access to production schemas
Schema Processing#
- Validate inputs: Always validate before processing
- Handle errors gracefully: Check collection status before proceeding
- Monitor performance: Track collection times for optimization
- Document changes: Keep records of schema evolution
Troubleshooting#
Common Issues#
Collection Fails with Permission Error
# Check database user privileges
dbsurveyor-collect postgres://user:pass@localhost/db --test-connection
# Verify user has SELECT on information_schema
GRANT SELECT ON ALL TABLES IN SCHEMA information_schema TO username;
Large Schema File Size
# Use compression
dbsurveyor-collect postgres://localhost/db --compress
# Limit data sampling
dbsurveyor-collect postgres://localhost/db --max-sample-size 100
Validation Errors
# Check schema format
dbsurveyor validate schema.json --verbose
# Fix common issues
dbsurveyor fix schema.json --output fixed_schema.json
Performance Optimization#
- Connection pooling: Use connection pooling for large databases
- Parallel collection: Collect multiple databases simultaneously
- Selective sampling: Only sample essential tables
- Incremental updates: Collect only changed schemas
This specification provides a complete reference for the .dbsurveyor.json format, ensuring consistent, validated, and secure schema collection across all supported database types.