Architecture Overview#
The libmagic-rs library is designed around a clean separation of concerns, following a parser-evaluator architecture that promotes maintainability, testability, and performance.
High-Level Architecture#
Core Components#
1. Parser Module (src/parser/)#
The parser is responsible for converting magic files (text-based DSL) into an Abstract Syntax Tree (AST).
Key Files:
- ast.rs: Core data structures representing magic rules (✅ Complete)
- grammar.rs: nom-based parsing components for magic file syntax (✅ Complete)
- mod.rs: Parser interface, format detection, and hierarchical rule building (✅ Complete)
Responsibilities:
- Parse magic file syntax into structured data (✅ Complete)
- Handle hierarchical rule relationships (✅ Complete)
- Validate syntax and report meaningful errors (✅ Complete)
- Detect file format (text, directory, binary) (✅ Complete)
- Support incremental parsing for large magic databases (📋 Planned)
Current Implementation Status:
- ✅ Number parsing: Decimal and hexadecimal with overflow protection
- ✅ Offset parsing: Absolute offsets with comprehensive validation
- ✅ Operator parsing: Equality (=, ==), inequality (!=, <>), comparison (<, >, <=, >=), bitwise (&, ^, ~), and any-value (x) operators
- ✅ Value parsing: Strings, numbers, and hex byte sequences with escape sequences
- ✅ PString suffixes: /B, /H, /h, /L, /l (length prefix width), /J (self-inclusive length)
- ✅ Error handling: Comprehensive nom error handling with meaningful messages
- ✅ Rule parsing: Complete rule parsing via parse_magic_rule()
- ✅ File parsing: Complete magic file parsing with parse_text_magic_file()
- ✅ Hierarchy building: Parent-child relationships via build_rule_hierarchy()
- ✅ Format detection: Text, directory, and binary format detection
- 📋 Indirect offsets: Pointer dereferencing patterns
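To illustrate what "number parsing with overflow protection" can look like, here is a minimal, std-only sketch. It is not the library's nom-based parser: the function name parse_number_sketch, its signature, and the choice of i64 as the result type are assumptions made for this example.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not the library's API): parse a decimal or
/// `0x`-prefixed hexadecimal literal, with an optional leading `-`.
/// Returns `None` on malformed input or overflow instead of wrapping.
fn parse_number_sketch(input: &str) -> Option<i64> {
    let (negative, body) = match input.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, input),
    };
    let (radix, digits) = match body.strip_prefix("0x").or_else(|| body.strip_prefix("0X")) {
        Some(hex) => (16, hex),
        None => (10, body),
    };
    // Reject empty input and stray signs that `from_str_radix` would accept.
    if digits.is_empty() || !digits.chars().all(|c| c.is_digit(radix)) {
        return None;
    }
    // `from_str_radix` fails (rather than wraps) when the value exceeds u64.
    let magnitude = u64::from_str_radix(digits, radix).ok()?;
    // Widen to i128 so negation cannot overflow, then narrow with a check.
    let signed = if negative {
        -i128::from(magnitude)
    } else {
        i128::from(magnitude)
    };
    i64::try_from(signed).ok()
}
```

Widening to i128 before applying the sign keeps the i64::MIN edge case correct without special handling.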
2. AST Data Structures (src/parser/ast.rs)#
The AST provides a complete representation of magic rules in memory.
Core Types:
pub struct MagicRule {
    pub offset: OffsetSpec,       // Where to read data
    pub typ: TypeKind,            // How to interpret bytes
    pub op: Operator,             // Comparison operation
    pub value: Value,             // Expected value
    pub message: String,          // Human-readable description
    pub children: Vec<MagicRule>, // Nested rules
    pub level: u32,               // Indentation level
}

pub enum TypeKind {
    Byte { signed: bool }, // Single byte with explicit signedness
    Short { endian: Endianness, signed: bool },
    Long { endian: Endianness, signed: bool },
    Quad { endian: Endianness, signed: bool },
    String { max_length: Option<usize> },
    PString {
        max_length: Option<usize>,
        length_width: PStringLengthWidth,
        length_includes_itself: bool,
    }, // Pascal string (length-prefixed)
}

pub enum Operator {
    Equal,               // = or ==
    NotEqual,            // != or <>
    LessThan,            // <
    GreaterThan,         // >
    LessEqual,           // <=
    GreaterEqual,        // >=
    BitwiseAnd,          // &
    BitwiseAndMask(u64), // & with mask
    BitwiseXor,          // ^
    BitwiseNot,          // ~
    AnyValue,            // x (always matches)
}

pub enum PStringLengthWidth {
    OneByte,    // 1-byte prefix (default, /B)
    TwoByteBE,  // 2-byte big-endian prefix (/H)
    TwoByteLE,  // 2-byte little-endian prefix (/h)
    FourByteBE, // 4-byte big-endian prefix (/L)
    FourByteLE, // 4-byte little-endian prefix (/l)
}
Design Principles:
- Immutable by default: Rules don't change after parsing
- Serializable: Full serde support for caching
- Self-contained: No external dependencies in AST nodes
- Type-safe: Rust's type system prevents invalid rule combinations
- Explicit signedness: TypeKind::Byte and integer types (Short, Long, Quad) distinguish signed from unsigned interpretations
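The effect of the explicit signed flag can be shown with a small sketch. The helper names interpret_byte and interpret_short_be are hypothetical and exist only for this example; the library's own readers (read_byte, read_short, and so on) may have different signatures and return types.

```rust
/// Hypothetical helper: interpret one raw byte according to a `signed` flag,
/// widening the result to i64 so both interpretations share a type.
fn interpret_byte(raw: u8, signed: bool) -> i64 {
    if signed {
        // Reinterpret the bit pattern as two's complement: 0xff becomes -1.
        i64::from(raw as i8)
    } else {
        i64::from(raw)
    }
}

/// Hypothetical helper: the same idea for a 2-byte big-endian value.
fn interpret_short_be(raw: [u8; 2], signed: bool) -> i64 {
    if signed {
        i64::from(i16::from_be_bytes(raw))
    } else {
        i64::from(u16::from_be_bytes(raw))
    }
}
```

The same bytes yield different numeric values depending on the flag, which is why the AST records signedness rather than leaving it implicit.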
PString Length Prefix Support:
The PString type supports multiple length prefix formats through the length_width field:
- OneByte (/B): Default 1-byte length prefix (0-255 range)
- TwoByteBE (/H): 2-byte big-endian prefix
- TwoByteLE (/h): 2-byte little-endian prefix
- FourByteBE (/L): 4-byte big-endian prefix
- FourByteLE (/l): 4-byte little-endian prefix
The length_includes_itself field (controlled by the /J suffix) indicates JPEG-style self-inclusive length, where the stored length value includes the length field itself. This can be combined with any width variant (e.g., /HJ for 2-byte big-endian with self-inclusive length).
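As a rough illustration of how the width variants and the self-inclusive flag interact, the following sketch decodes a length-prefixed string from a byte slice. The function read_pstring_sketch is an assumption for this example and is not the library's read_pstring; it returns the raw payload bytes and ignores max_length and UTF-8 conversion. The enum is repeated here only so the block is self-contained.

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, Copy)]
enum PStringLengthWidth {
    OneByte,
    TwoByteBE,
    TwoByteLE,
    FourByteBE,
    FourByteLE,
}

/// Hypothetical sketch: read the length prefix at `offset`, then return the
/// payload bytes that follow it. Returns `None` on any out-of-bounds access.
fn read_pstring_sketch(
    buffer: &[u8],
    offset: usize,
    width: PStringLengthWidth,
    length_includes_itself: bool,
) -> Option<&[u8]> {
    use PStringLengthWidth::*;
    let prefix_len = match width {
        OneByte => 1,
        TwoByteBE | TwoByteLE => 2,
        FourByteBE | FourByteLE => 4,
    };
    let payload_start = offset.checked_add(prefix_len)?;
    let p = buffer.get(offset..payload_start)?;
    let stored = match width {
        OneByte => usize::from(p[0]),
        TwoByteBE => usize::from(u16::from_be_bytes([p[0], p[1]])),
        TwoByteLE => usize::from(u16::from_le_bytes([p[0], p[1]])),
        FourByteBE => usize::try_from(u32::from_be_bytes([p[0], p[1], p[2], p[3]])).ok()?,
        FourByteLE => usize::try_from(u32::from_le_bytes([p[0], p[1], p[2], p[3]])).ok()?,
    };
    // With /J the stored length counts the prefix itself, so subtract it.
    let payload_len = if length_includes_itself {
        stored.checked_sub(prefix_len)?
    } else {
        stored
    };
    let payload_end = payload_start.checked_add(payload_len)?;
    buffer.get(payload_start..payload_end)
}
```

For example, under /HJ a stored length of 4 describes a 2-byte payload, because 2 of the 4 bytes are the prefix itself.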
3. Evaluator Module (src/evaluator/)#
The evaluator executes magic rules against file buffers to identify file types. (✅ Complete)
Structure:
- mod.rs: Public API surface (~720 lines) with EvaluationContext, RuleMatch types, and re-exports
- engine/: Core evaluation engine submodule
  - mod.rs: evaluate_single_rule, evaluate_rules, and evaluate_rules_with_config functions
  - tests.rs: Engine unit tests
- types/: Type interpretation submodule
  - mod.rs: Public API surface with read_typed_value, coerce_value_to_type, and type re-exports
  - numeric.rs: Numeric type handling (read_byte, read_short, read_long, read_quad) with endianness and signedness support
  - string.rs: String type handling (read_string, read_pstring) with null-termination, UTF-8 conversion, and multi-byte length prefix support
  - tests.rs: Module tests
- offset/: Offset resolution submodule
  - mod.rs: Dispatcher (resolve_offset) and re-exports
  - absolute.rs: OffsetError, resolve_absolute_offset
  - indirect.rs: resolve_indirect_offset stub (issue #37)
  - relative.rs: resolve_relative_offset stub (issue #38)
- operators/: Operator application submodule
  - mod.rs: Dispatcher (apply_operator) and re-exports
  - equality.rs: apply_equal, apply_not_equal
  - comparison.rs: compare_values, apply_less_than/greater_than/less_equal/greater_equal
  - bitwise.rs: apply_bitwise_and, apply_bitwise_and_mask, apply_bitwise_xor, apply_bitwise_not
Organization Note: The evaluator module was refactored in two passes to break up monolithic files. The first pass split a 2,638-line mod.rs into engine/ submodules; the second reorganized the 1,836-line types.rs into types/ submodules for numeric and string handling. The public API surface remains in mod.rs and is preserved through re-exports, so neither pass introduced breaking changes, while the core logic now lives in focused submodules that follow the 500-600 line module guideline.
Implemented Features:
- ✅ Hierarchical Evaluation: Parent rules must match before children
- ✅ Lazy Evaluation: Only process rules when necessary
- ✅ Bounds Checking: Safe buffer access with overflow protection
- ✅ Context Preservation: Maintain state across rule evaluations
- ✅ Graceful Degradation: Skip problematic rules, continue evaluation
- ✅ Timeout Protection: Configurable time limits
- ✅ Recursion Limiting: Prevent stack overflow from deep nesting
- ✅ Signedness Coercion: Automatic value coercion for signed type comparisons (e.g., 0xff → -1 for signed byte)
- ✅ Comparison Operators: Full support for <, >, <=, >= with numeric and lexicographic ordering
- 📋 Indirect Offsets: Pointer dereferencing (planned)
4. I/O Module (src/io/)#
Provides efficient file access through memory-mapped I/O. (✅ Complete)
Implemented Features:
- FileBuffer: Memory-mapped file buffers using memmap2
- Safe buffer access: Comprehensive bounds checking with safe_read_bytes and safe_read_byte
- Error handling: Structured IoError types for all failure scenarios
- Resource management: RAII patterns with automatic cleanup
- File validation: Size limits, empty file detection, and metadata validation
- Overflow protection: Safe arithmetic in all buffer operations
Key Components:
pub struct FileBuffer {
    mmap: Mmap,
    path: PathBuf,
}

pub fn safe_read_bytes(buffer: &[u8], offset: usize, length: usize) -> Result<&[u8], IoError>
pub fn safe_read_byte(buffer: &[u8], offset: usize) -> Result<u8, IoError>
pub fn validate_buffer_access(buffer_size: usize, offset: usize, length: usize) -> Result<(), IoError>
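The signatures above suggest a validate-then-slice pattern. The sketch below shows one way such functions could be written with overflow-safe arithmetic. It is an approximation, not the library's code: the error type BufferAccessError and its variants are invented for this example because the fields of IoError are not shown here, the functions carry a _sketch suffix to make that clear, and whether the real functions accept zero-length reads is not known.

```rust
/// Hypothetical error type standing in for the library's IoError.
#[derive(Debug, PartialEq)]
enum BufferAccessError {
    /// `offset + length` does not fit in a usize.
    Overflow { offset: usize, length: usize },
    /// The requested range extends past the end of the buffer.
    OutOfBounds { offset: usize, length: usize, buffer_size: usize },
}

/// Check that `offset..offset + length` lies inside a buffer of `buffer_size` bytes.
fn validate_buffer_access_sketch(
    buffer_size: usize,
    offset: usize,
    length: usize,
) -> Result<(), BufferAccessError> {
    // checked_add reports overflow instead of wrapping around.
    let end = offset
        .checked_add(length)
        .ok_or(BufferAccessError::Overflow { offset, length })?;
    if end > buffer_size {
        return Err(BufferAccessError::OutOfBounds { offset, length, buffer_size });
    }
    Ok(())
}

/// Return a sub-slice only after the range has been validated.
fn safe_read_bytes_sketch(
    buffer: &[u8],
    offset: usize,
    length: usize,
) -> Result<&[u8], BufferAccessError> {
    validate_buffer_access_sketch(buffer.len(), offset, length)?;
    // Cannot panic or overflow: the range was validated above.
    Ok(&buffer[offset..offset + length])
}

/// Read a single byte by delegating to the slice reader.
fn safe_read_byte_sketch(buffer: &[u8], offset: usize) -> Result<u8, BufferAccessError> {
    safe_read_bytes_sketch(buffer, offset, 1).map(|bytes| bytes[0])
}
```

Separating validation from slicing lets callers check a range without borrowing the buffer, which is presumably why validate_buffer_access takes only a size.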
5. Output Module (src/output/)#
Formats evaluation results into different output formats.
Planned Formatters:
- text.rs: Human-readable output (GNU file compatible)
- json.rs: Structured JSON output with metadata
- mod.rs: Format selection and coordination
Data Flow#
1. Magic File Loading#
- Parsing: Convert text DSL to structured AST
- Validation: Check rule consistency and dependencies
- Optimization: Reorder rules for evaluation efficiency
- Caching: Serialize compiled rules for reuse
2. File Evaluation#
- File Access: Create memory-mapped buffer
- Rule Matching: Execute rules hierarchically
- Result Collection: Gather matches and metadata
- Output Generation: Format results as text or JSON
Design Patterns#
Parser-Evaluator Separation#
The clear separation between parsing and evaluation provides:
- Independent Testing: Each component can be tested in isolation
- Performance Optimization: Rules can be pre-compiled and cached
- Flexible Input: Support for different magic file formats
- Error Isolation: Parse errors vs. evaluation errors are distinct
Hierarchical Rule Processing#
Magic rules form a tree structure where:
- Parent rules define broad file type categories
- Child rules provide specific details and variants
- Evaluation stops when a definitive match is found
- Context flows from parent to child evaluations
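The parent-before-child behaviour can be sketched with a deliberately simplified rule type. Everything in this block is hypothetical: the Rule struct models only byte-equality tests at absolute offsets, evaluate_sketch is not the library's evaluate_rules, and early termination on a definitive match is left out. The depth check mirrors the recursion limiting described for the evaluator.

```rust
/// Simplified stand-in for MagicRule: match `expected` bytes at `offset`.
struct Rule {
    offset: usize,
    expected: Vec<u8>,
    message: String,
    children: Vec<Rule>,
}

/// Hypothetical sketch: collect messages of matching rules, descending into
/// children only when the parent matched. Errors if nesting exceeds `max_depth`.
fn evaluate_sketch(
    rules: &[Rule],
    buffer: &[u8],
    depth: u32,
    max_depth: u32,
    matches: &mut Vec<String>,
) -> Result<(), String> {
    if depth > max_depth {
        return Err(format!("Recursion limit exceeded (depth: {})", depth));
    }
    for rule in rules {
        // Bounds-checked read: an out-of-range rule simply does not match.
        let matched = rule
            .offset
            .checked_add(rule.expected.len())
            .and_then(|end| buffer.get(rule.offset..end))
            .map_or(false, |slice| slice == rule.expected.as_slice());
        if matched {
            matches.push(rule.message.clone());
            if !rule.children.is_empty() {
                evaluate_sketch(&rule.children, buffer, depth + 1, max_depth, matches)?;
            }
        }
    }
    Ok(())
}
```

Because children are visited only inside the matched branch, a failing parent prunes its whole subtree without reading any further bytes.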
Operator Support:
The evaluator supports all comparison, bitwise, and special matching operators:
- Equality: = or == (exact match)
- Inequality: != or <> (not equal)
- Less-than: < (numeric or lexicographic)
- Greater-than: > (numeric or lexicographic)
- Less-equal: <= (numeric or lexicographic)
- Greater-equal: >= (numeric or lexicographic)
- Bitwise AND: & (bit pattern matching)
- Bitwise XOR: ^ (exclusive OR pattern matching)
- Bitwise NOT: ~ (bitwise complement comparison)
- Any-value: x (unconditional match, always succeeds)
Comparison operators support both numeric comparisons (with automatic type coercion between signed and unsigned integers via i128) and lexicographic comparisons for strings and byte sequences.
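A short sketch shows why widening to i128 makes mixed signed/unsigned comparison safe: every u64 and every i64 fits in an i128, so no value changes meaning during the comparison. The SketchValue enum and the compare_values_sketch function are assumptions for this example; the library's Value type and compare_values signature are not shown in this document.

```rust
use std::cmp::Ordering;

/// Hypothetical value type; the library's Value enum may differ.
#[derive(Debug, Clone, PartialEq)]
enum SketchValue {
    Uint(u64),
    Int(i64),
    Bytes(Vec<u8>),
    Text(String),
}

/// Widen an integer variant to i128; non-numeric variants yield `None`.
fn to_i128(value: &SketchValue) -> Option<i128> {
    match value {
        SketchValue::Uint(u) => Some(i128::from(*u)),
        SketchValue::Int(i) => Some(i128::from(*i)),
        _ => None,
    }
}

/// Numeric operands compare through i128; strings and byte sequences compare
/// lexicographically; mismatched kinds are not comparable.
fn compare_values_sketch(left: &SketchValue, right: &SketchValue) -> Option<Ordering> {
    match (left, right) {
        (SketchValue::Bytes(a), SketchValue::Bytes(b)) => Some(a.cmp(b)),
        (SketchValue::Text(a), SketchValue::Text(b)) => Some(a.cmp(b)),
        _ => Some(to_i128(left)?.cmp(&to_i128(right)?)),
    }
}
```

A naive cast of -1 to u64 would make it compare as the largest possible value; going through i128 keeps it below every unsigned number.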
Memory-Safe Buffer Access#
All buffer operations use safe Rust patterns:
// Safe buffer access with bounds checking
fn read_bytes(buffer: &[u8], offset: usize, length: usize) -> Option<&[u8]> {
    buffer.get(offset..offset.saturating_add(length))
}
Error Handling Strategy#
The library uses Result types with nested error enums throughout:
pub type Result<T> = std::result::Result<T, LibmagicError>;

#[derive(Debug, thiserror::Error)]
pub enum LibmagicError {
    #[error("Parse error: {0}")]
    ParseError(#[from] ParseError),
    #[error("Evaluation error: {0}")]
    EvaluationError(#[from] EvaluationError),
    #[error("I/O error: {0}")]
    IoError(#[from] std::io::Error),
    #[error("Evaluation timeout exceeded after {timeout_ms}ms")]
    Timeout { timeout_ms: u64 },
}

#[derive(Debug, thiserror::Error)]
pub enum ParseError {
    #[error("Invalid syntax at line {line}: {message}")]
    InvalidSyntax { line: usize, message: String },
    #[error("Unsupported format at line {line}: {format_type}")]
    UnsupportedFormat { line: usize, format_type: String, message: String },
    // ... additional variants
}

#[derive(Debug, thiserror::Error)]
pub enum EvaluationError {
    #[error("Buffer overrun at offset {offset}")]
    BufferOverrun { offset: usize },
    #[error("Recursion limit exceeded (depth: {depth})")]
    RecursionLimitExceeded { depth: u32 },
    // ... additional variants
}
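To show how the nested enums compose in practice, the sketch below reproduces a reduced version of the hierarchy using only the standard library, writing by hand the Display and From implementations that thiserror's #[error] and #[from] attributes would generate. The helper functions parse_offset and first_offset are hypothetical, and the alias is named SketchResult to avoid shadowing the standard Result.

```rust
use std::fmt;

#[derive(Debug)]
enum ParseError {
    InvalidSyntax { line: usize, message: String },
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ParseError::InvalidSyntax { line, message } => {
                write!(f, "Invalid syntax at line {}: {}", line, message)
            }
        }
    }
}

impl std::error::Error for ParseError {}

#[derive(Debug)]
enum LibmagicError {
    ParseError(ParseError),
    Timeout { timeout_ms: u64 },
}

impl fmt::Display for LibmagicError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            LibmagicError::ParseError(inner) => write!(f, "Parse error: {}", inner),
            LibmagicError::Timeout { timeout_ms } => {
                write!(f, "Evaluation timeout exceeded after {}ms", timeout_ms)
            }
        }
    }
}

impl std::error::Error for LibmagicError {}

// Hand-written equivalent of `#[from]`: lets `?` convert automatically.
impl From<ParseError> for LibmagicError {
    fn from(err: ParseError) -> Self {
        LibmagicError::ParseError(err)
    }
}

type SketchResult<T> = std::result::Result<T, LibmagicError>;

/// Hypothetical parser step that reports a module-level error.
fn parse_offset(line: usize, text: &str) -> std::result::Result<u64, ParseError> {
    text.trim().parse::<u64>().map_err(|e| ParseError::InvalidSyntax {
        line,
        message: e.to_string(),
    })
}

/// Hypothetical top-level call: `?` lifts ParseError into LibmagicError.
fn first_offset(source: &str) -> SketchResult<u64> {
    let first_line = source.lines().next().unwrap_or("");
    let offset = parse_offset(1, first_line)?;
    Ok(offset)
}
```

Callers can match on the outer variant to distinguish parse failures from evaluation failures, while the inner enum keeps the line number and message for diagnostics.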
Performance Considerations#
Memory Efficiency#
- Zero-copy operations where possible
- Memory-mapped I/O to avoid loading entire files
- Lazy evaluation to skip unnecessary work
- Rule caching to avoid re-parsing magic files
Computational Efficiency#
- Early termination when definitive matches are found
- Optimized rule ordering based on match probability
- Efficient string matching using algorithms like Aho-Corasick
- Minimal allocations in hot paths
Scalability#
- Parallel evaluation for multiple files (future)
- Streaming support for large files (future)
- Incremental parsing for large magic databases (planned)
- Resource limits to prevent runaway evaluations
Module Dependencies#
Dependency Rules:
- No circular dependencies between modules
- Clear interfaces with well-defined responsibilities
- Minimal coupling between components
- Testable boundaries for each module
This architecture ensures the library is maintainable, performant, and extensible while providing a clean API for both CLI and library usage.