Parser-Evaluator Architecture#
The Parser-Evaluator Architecture is the core architectural pattern in libmagic-rs, a pure-Rust implementation of libmagic (the library behind the Unix file command). This pattern enforces a strict separation of concerns between parsing magic file DSL syntax (handled by the parser module using nom combinators) and interpreting the resulting Abstract Syntax Tree (handled by the evaluator module). The architecture enables file type identification by transforming magic file rules into an executable AST and evaluating them against target file buffers.
The parser module is responsible for converting text-based magic files into hierarchical AST structures. It consists of multiple specialized sub-modules: grammar/ (implementing nom-based parsers split across mod.rs, numbers.rs, and value.rs), ast.rs (defining AST node structures), preprocessing.rs (handling line continuations and directives), and hierarchy.rs (building parent-child rule relationships). The evaluator module consumes the AST and executes rules against file buffers through a three-stage process: offset resolution (determining where to read), type interpretation (reading typed values with endianness support), and operator application (comparing values). This clean separation allows the parser to evolve independently from evaluation logic, provides a serializable intermediate representation, and enables potential alternative evaluator implementations.
The architecture follows a clear data flow pipeline: Magic File Text → Parser → AST → Evaluator → Match Results. The AST serves as the contract between modules, with MagicRule as the primary data structure carrying offset specifications, type information, operators, values, messages, and hierarchical children. The evaluator's specialized sub-modules (offset/, types.rs, operators/) each handle specific aspects of AST interpretation, demonstrating single-responsibility design principles.
Parser Module Architecture#
Module Organization#
The parser module is organized into specialized files and directories, each with distinct responsibilities:
- mod.rs: Public API and module documentation, exposing parse_text_magic_file(), load_magic_file(), and type re-exports
- ast.rs: AST node definitions including MagicRule, OffsetSpec, TypeKind, Operator, Value, and Endianness
- grammar/: Parser combinator implementations using nom, organized into submodules:
  - mod.rs: Main grammar logic (796 lines) with offset, type, operator, and rule parsing
  - numbers.rs: Numeric literal parsing (decimal, hexadecimal, signed, unsigned)
  - value.rs: Value literal parsing (strings, hex bytes, floats, numeric values)
- preprocessing.rs: Line-level transformations including comment removal, line continuation handling, and strength directive parsing
- hierarchy.rs: Stack-based algorithm for building parent-child relationships from indentation levels
- format.rs: Magic file format detection (text, directory, or binary .mgc)
- loader.rs: File system operations for loading and merging magic files from paths and directories
Parsing Pipeline#
The parser follows a three-stage pipeline:
pub fn parse_text_magic_file(input: &str) -> Result<Vec<MagicRule>, ParseError> {
    let lines = preprocess_lines(input)?; // Stage 1: Preprocessing
    build_rule_hierarchy(lines) // Stage 2: Parsing + Stage 3: Hierarchy
}
- Preprocessing: Removes empty lines, processes comments, handles line continuations (backslashes), and parses !:strength directives
- Parsing: Converts preprocessed lines into flat MagicRule objects using nom combinators, including parsing float literals with parse_float_value() (requiring decimal points to distinguish from integers) and the six floating-point type keywords: float, double, befloat, bedouble, lefloat, ledouble
- Hierarchy Building: Constructs parent-child relationships based on indentation levels (indicated by > prefixes)
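The preprocessing stage can be pictured with a small standalone sketch. The function below is hypothetical (it is not the crate's preprocess_lines) and assumes that a trailing backslash marks a continuation and that comment lines start with #; !:strength directives are not handled here.

```rust
/// Hypothetical sketch of line-level preprocessing; not the crate's `preprocess_lines`.
/// Joins backslash-continued lines, then drops blank lines and `#` comments.
fn preprocess_sketch(input: &str) -> Vec<String> {
    let mut out = Vec::new();
    let mut pending = String::new();

    for raw in input.lines() {
        // A trailing backslash means the logical line continues on the next physical line.
        if let Some(head) = raw.strip_suffix('\\') {
            pending.push_str(head);
            continue;
        }
        pending.push_str(raw);
        let line = std::mem::take(&mut pending);
        let trimmed = line.trim();
        if trimmed.is_empty() || trimmed.starts_with('#') {
            continue;
        }
        out.push(trimmed.to_string());
    }

    // A dangling continuation at end of input is kept rather than silently dropped.
    let rest = pending.trim();
    if !rest.is_empty() && !rest.starts_with('#') {
        out.push(rest.to_string());
    }
    out
}
```

Each returned string is one logical rule line, ready to be handed to a grammar parser.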
Nom Parser Combinator Implementation#
The grammar module uses nom 8.0.0 parser combinators for composable, type-safe parsing. The module is split into three files to keep each under 1000 lines: grammar/mod.rs dispatches to specialized parsers, grammar/numbers.rs handles numeric literals (parse_number, parse_unsigned_number, overflow checks), and grammar/value.rs handles value literals (parse_value, parse_quoted_string, parse_hex_bytes, parse_float_value). The parser handles type specifications for quad (64-bit integers), long (32-bit integers), short (16-bit integers), byte (8-bit integers), float (32-bit IEEE 754), double (64-bit IEEE 754), and string types, with signedness and endianness variants such as quad, uquad, lequad, ulequad, bequad, ubequad, float, lefloat, befloat, double, ledouble, and bedouble. Key combinators include:
- alt: Try multiple parsers in order (e.g., parsing different value types)
- tag: Match exact string literals like "byte", "leshort", "0x"
- map: Transform parser output (e.g., converting parsed strings to enum variants)
- opt: Make parsers optional, returning Option<T>
- pair: Apply two parsers in sequence
- many0: Apply a parser zero or more times
Example of composed parsing logic:
pub fn parse_value(input: &str) -> IResult<&str, Value> {
    let (input, _) = multispace0(input)?;
    let (input, value) = alt((
        map(parse_quoted_string, Value::String), // Try string first
        map(parse_hex_bytes, Value::Bytes),      // Then hex bytes
        map(parse_float_value, Value::Float),    // Then floats (with decimal)
        parse_numeric_value,                     // Finally integers
    ))
    .parse(input)?;
    Ok((input, value))
}
This demonstrates priority-based parsing with alt, where quoted strings take precedence over numeric values to correctly handle patterns like "123" as a string rather than a number.
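The effect of branch ordering can be reproduced without nom. The sketch below is a hand-rolled ordered choice, not the crate's parser: Literal and parse_literal_sketch are hypothetical names, and escapes and hex forms are left out.

```rust
#[derive(Debug, PartialEq)]
enum Literal {
    Str(String),
    Float(f64),
    Int(i64),
}

/// Hand-rolled ordered choice: the first branch that accepts the input wins.
/// A stand-in for nom's `alt`; escapes and hex forms are left out.
fn parse_literal_sketch(input: &str) -> Option<Literal> {
    let s = input.trim();

    // Branch 1: quoted string, tried first so "123" stays a string.
    if s.len() >= 2 && s.starts_with('"') && s.ends_with('"') {
        return Some(Literal::Str(s[1..s.len() - 1].to_string()));
    }

    // Branch 2: float, only when a decimal point is present.
    if s.contains('.') {
        if let Ok(f) = s.parse::<f64>() {
            return Some(Literal::Float(f));
        }
    }

    // Branch 3: plain integer.
    s.parse::<i64>().ok().map(Literal::Int)
}
```

Swapping branches 1 and 3 would not change the result for this toy grammar, but in a real combinator chain a permissive early branch can consume input that a later, more specific branch was meant to handle, which is why the order matters.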
AST Node Structure#
The Abstract Syntax Tree consists of six main enums and one struct that represent parsed magic rules:
MagicRule Struct - The primary AST node:
pub struct MagicRule {
    pub offset: OffsetSpec,                          // Where to read data
    pub typ: TypeKind,                               // How to interpret bytes
    pub op: Operator,                                // Comparison operator
    pub value: Value,                                // Expected value
    pub message: String,                             // Human-readable message
    pub children: Vec<MagicRule>,                    // Child rules (hierarchy)
    pub level: u32,                                  // Indentation depth
    pub strength_modifier: Option<StrengthModifier>, // Rule priority adjustment
    pub value_transform: Option<ValueTransform>,     // Pre-comparison value transform (e.g., lelong+1)
}
Supporting Enums:
- OffsetSpec: Absolute, Indirect (pointer dereferencing with arithmetic adjustment via IndirectAdjustmentOp), Relative, FromEnd
- TypeKind: Byte, Short, Long, Quad, Float, Double, String, String16 (UCS-2 with explicit endianness: lestring16, bestring16), Regex, Search (each numeric type with endianness and signedness where applicable)
- Operator: Equal, NotEqual, LessThan, GreaterThan, LessEqual, GreaterEqual, BitwiseAnd, BitwiseAndMask, BitwiseXor, BitwiseNot, AnyValue
- Value: Uint, Int, Float, Bytes, String (note: Value derives only PartialEq, not Eq, due to IEEE 754 NaN semantics)
- Endianness: Little, Big, Native
- StrengthModifier: Add, Subtract, Multiply, Divide, Set
- ValueTransform: Stores a ValueTransformOp and operand, enabling pre-comparison transformations (+, -, *, /, %, &, |, ^) on the read value
Example TypeKind variants:
TypeKind::Byte { signed: bool }
TypeKind::Short { endian: Endianness, signed: bool }
TypeKind::Long { endian: Endianness, signed: bool }
TypeKind::Quad { endian: Endianness, signed: bool }
TypeKind::Float { endian: Endianness }
TypeKind::Double { endian: Endianness }
TypeKind::String { max_length: Option<usize> }
TypeKind::String16 { endian: Endianness } // UCS-2 (lestring16/bestring16)
TypeKind::Regex { flags: RegexFlags, count: Option<NonZeroU32> }
TypeKind::Search { range: NonZeroUsize }
All types are serializable via serde, enabling JSON import/export of parsed rules.
Hierarchical Rule Construction#
The hierarchy builder uses a stack-based algorithm to construct parent-child relationships:
Magic File:                        AST Representation:
-----------                        -------------------
0 string \x7fELF ELF executable    MagicRule { level: 0, children: [
>4 byte 1 32-bit                       MagicRule { level: 1, ... },
>4 byte 2 64-bit                       MagicRule { level: 1, ... }
                                   ]}
Rules with level=0 are roots, level=1 rules become children of the most recent level=0 rule, and so on. When indentation decreases, the stack unwinds and completed rules are attached to their parents.
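The unwinding behaviour described above can be sketched over plain (level, message) pairs. Node, attach_top, and build_tree_sketch are hypothetical names; the crate's hierarchy builder works on parsed MagicRule values and may differ in detail.

```rust
#[derive(Debug)]
struct Node {
    level: u32,
    message: String,
    children: Vec<Node>,
}

/// Pops the top of the stack and attaches it to its parent, or to `roots` if it has none.
fn attach_top(stack: &mut Vec<Node>, roots: &mut Vec<Node>) {
    if let Some(done) = stack.pop() {
        match stack.last_mut() {
            Some(parent) => parent.children.push(done),
            None => roots.push(done),
        }
    }
}

/// Hypothetical stack-based builder over (level, message) pairs in file order.
fn build_tree_sketch(flat: Vec<(u32, &str)>) -> Vec<Node> {
    let mut roots = Vec::new();
    let mut stack: Vec<Node> = Vec::new();

    for (level, message) in flat {
        // Unwind every open rule that is at the same depth or deeper.
        while stack.last().map_or(false, |top| top.level >= level) {
            attach_top(&mut stack, &mut roots);
        }
        stack.push(Node { level, message: message.to_string(), children: Vec::new() });
    }

    // Flush whatever is still open at end of input.
    while !stack.is_empty() {
        attach_top(&mut stack, &mut roots);
    }
    roots
}
```

The stack always holds the chain of currently open ancestors, so each rule is pushed and popped exactly once and the whole pass is linear in the number of rules.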
Evaluator Module Architecture#
Module Organization#
The evaluator module is organized into specialized submodules:
- mod.rs: Public API surface with EvaluationContext, RuleMatch, and re-exports of core functions (~720 lines)
- engine.rs: Core evaluation engine with evaluate_single_rule(), evaluate_rules(), and evaluate_rules_with_config() (~2,096 lines)
- offset/: Offset resolution submodule for converting OffsetSpec to absolute byte positions
  - mod.rs: Dispatcher (resolve_offset, resolve_offset_with_context) and public API re-exports
  - absolute.rs: Absolute offset resolution with OffsetError and resolve_absolute_offset()
  - indirect.rs: Indirect offset resolution (pointer dereferencing with arithmetic adjustment)
  - relative.rs: Relative offset resolution using previous-match anchors (GNU file/libmagic semantics)
- types.rs: Type interpretation for reading typed values with endianness support (1,505 lines)
- operators/: Operator application submodule for comparing values
  - mod.rs: Dispatcher (apply_operator) and public API re-exports
  - equality.rs: Equality operations (apply_equal, apply_not_equal)
  - comparison.rs: Ordering comparisons (compare_values, apply_less_than, apply_greater_than, apply_less_equal, apply_greater_equal)
  - bitwise.rs: Bitwise operations (apply_bitwise_and, apply_bitwise_and_mask, apply_bitwise_xor, apply_bitwise_not)
- strength.rs: Rule strength calculation for prioritization (874 lines)
Each sub-module exposes public functions and types while encapsulating implementation details, following single-responsibility principles. The public API remains unchanged through re-exports in mod.rs. The modular structure prepares the codebase for future v0.2.0 feature additions.
Evaluation Pipeline#
The evaluator implements a three-stage evaluation process. The public single-rule entry point is a thin wrapper that delegates to evaluate_rules:
pub fn evaluate_single_rule(
    rule: &MagicRule,
    buffer: &[u8],
    context: &mut EvaluationContext,
) -> Result<Vec<RuleMatch>, LibmagicError> {
    // Delegates to evaluate_rules for context-aware evaluation
    evaluate_rules(std::slice::from_ref(rule), buffer, context)
}
The internal evaluate_single_rule_with_anchor helper performs the three-stage pipeline for a single rule:
- Stage 1: Resolve offset specification to absolute position using offset::resolve_offset_with_context
- Stage 2: Read typed value from buffer with endianness using types::read_typed_value (for fixed-width types) or types::read_typed_value_with_pattern (for pattern-bearing types like Regex and Search)
- Stage 3: Apply operator to compare read value with expected value using operators::apply_operator (for value-based types), or directly derive the match state from read_pattern_match (for pattern types)
For pattern-bearing types (Regex, Search), the evaluation follows a variant path: the rule's value operand is the pattern itself rather than an expected matched value. The engine calls types::read_pattern_match, which returns Some(Value) on a successful match (possibly zero-width) and None on a miss. The engine maps that directly to Equal/NotEqual without calling apply_operator -- running pattern types through operator comparison would produce nonsense lexicographic comparisons against the pattern source text (e.g., matching "123" against the literal "[0-9]+"). Non-equality operators on pattern types are rejected as TypeReadError::UnsupportedType.
The evaluator uses zero-copy value coercion: types::coerce_value_to_type returns Cow::Borrowed on the hot path (no allocation for string matches), and Cow::Owned only when values must be transformed (e.g., widening unsigned integers to signed representation for signed types).
Relative offset resolution uses context tracking: evaluate_rules threads the last_match_end anchor through EvaluationContext, advancing it after each successful match by the bytes consumed. For variable-width types, the consumed count includes c-string NUL terminators and pstring length prefixes; for pattern types (Regex, Search), bytes_consumed_with_pattern is used instead of the regular bytes_consumed so the engine can re-compute the match position from the pattern. The resolver computes last_match_end + delta with bounds checks.
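The borrow-unless-changed pattern behind the coercion step can be sketched as follows. Val and coerce_sketch are hypothetical, and the specific rule used here (reinterpreting a 64-bit unsigned bit pattern as signed) is an assumption chosen for illustration; the crate's coerce_value_to_type may apply a different transformation.

```rust
use std::borrow::Cow;

#[derive(Debug, Clone, PartialEq)]
enum Val {
    Uint(u64),
    Int(i64),
    Str(String),
}

/// Hypothetical coercion: borrow when the value already fits, allocate only on change.
fn coerce_sketch(expected: &Val, type_is_signed: bool) -> Cow<'_, Val> {
    match expected {
        // Assumed rule: reinterpret the unsigned bit pattern as signed, as a signed read would.
        Val::Uint(u) if type_is_signed && *u > i64::MAX as u64 => {
            Cow::Owned(Val::Int(*u as i64))
        }
        // Hot path: hand back the caller's value untouched, with no clone.
        _ => Cow::Borrowed(expected),
    }
}
```

Callers can use the result through Deref without caring which variant they received, so the string case never pays for a clone.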
Offset Resolution#
The offset/ submodule converts OffsetSpec variants to absolute byte positions. The implementation is organized into separate files for each offset type, with offset/mod.rs providing the resolve_offset() dispatcher function.
Absolute Offsets (handled by absolute.rs): Positive values are offsets from file start; negative values are offsets from file end.
resolve_absolute_offset(0, buffer) // First byte
resolve_absolute_offset(-1, buffer) // Last byte
Algorithm for negative offsets:
let offset_from_end = usize::try_from(-offset)?;
if offset_from_end > buffer_len {
    return Err(OffsetError::BufferOverrun { ... });
}
let resolved_offset = buffer_len - offset_from_end;
The implementation includes comprehensive bounds checking and special handling for i64::MIN to prevent arithmetic overflow.
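A self-contained sketch of the same idea follows. resolve_absolute_sketch is a hypothetical function with string errors in place of OffsetError, and the use of unsigned_abs is one possible way to sidestep the i64::MIN negation overflow, not necessarily the crate's.

```rust
/// Hypothetical sketch of absolute offset resolution; the crate's error type is
/// replaced by a plain string for brevity.
fn resolve_absolute_sketch(offset: i64, buffer: &[u8]) -> Result<usize, String> {
    let len = buffer.len();
    let resolved = if offset >= 0 {
        usize::try_from(offset).map_err(|_| "offset does not fit in usize".to_string())?
    } else {
        // unsigned_abs avoids the overflow that `-offset` would hit for i64::MIN.
        let from_end = usize::try_from(offset.unsigned_abs())
            .map_err(|_| "offset does not fit in usize".to_string())?;
        len.checked_sub(from_end)
            .ok_or_else(|| "offset reaches before start of buffer".to_string())?
    };

    // The resolved position must name a byte that actually exists.
    if resolved < len {
        Ok(resolved)
    } else {
        Err("offset is past end of buffer".to_string())
    }
}
```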
Indirect Offsets (pointer dereferencing, handled by indirect.rs) resolve by reading a pointer value at the base offset, applying an arithmetic operation via IndirectAdjustmentOp (+, -, *, /, %, &, |, ^), and using the result as the final file offset. The implementation supports the full range of magic(5) arithmetic indirect offsets (e.g., (0x200.s*2)).
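The read-adjust-check sequence can be illustrated with a reduced sketch. AdjustOp and resolve_indirect_sketch are hypothetical, only three of the eight operators are shown, and the pointer is assumed to be a little-endian 16-bit value.

```rust
#[derive(Clone, Copy)]
enum AdjustOp {
    Add,
    Sub,
    Mul,
}

/// Hypothetical: read a little-endian u16 pointer at `base`, adjust it, bounds-check it.
fn resolve_indirect_sketch(buffer: &[u8], base: usize, op: AdjustOp, operand: u64) -> Option<usize> {
    // Step 1: read the pointer, refusing reads that run off the buffer.
    let end = base.checked_add(2)?;
    let bytes: [u8; 2] = buffer.get(base..end)?.try_into().ok()?;
    let pointer = u64::from(u16::from_le_bytes(bytes));

    // Step 2: apply the adjustment with overflow and underflow detection.
    let adjusted = match op {
        AdjustOp::Add => pointer.checked_add(operand)?,
        AdjustOp::Sub => pointer.checked_sub(operand)?,
        AdjustOp::Mul => pointer.checked_mul(operand)?,
    };

    // Step 3: the result is only usable if it lands inside the buffer.
    let target = usize::try_from(adjusted).ok()?;
    (target < buffer.len()).then_some(target)
}
```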
Relative Offsets (handled by relative.rs) are fully implemented following GNU file/libmagic semantics. They resolve against the end position of the most recent successful match (the "previous match" anchor), which starts at 0 for a fresh evaluation pass. The implementation uses isize::try_from and checked_add_signed to compute last_match_end + delta with overflow/underflow detection, mapping both conditions to InvalidOffset and out-of-buffer targets to BufferOverrun (PR #211, issue #38).
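The arithmetic described above fits in a few lines. resolve_relative_sketch is a hypothetical stand-in that returns string errors where the crate uses InvalidOffset and BufferOverrun.

```rust
/// Hypothetical sketch of `last_match_end + delta` with overflow and bounds checks.
fn resolve_relative_sketch(
    last_match_end: usize,
    delta: i64,
    buffer_len: usize,
) -> Result<usize, &'static str> {
    // A delta that does not fit in isize cannot be applied on this platform.
    let delta = isize::try_from(delta).map_err(|_| "invalid offset")?;

    // checked_add_signed reports both overflow (too large) and underflow (below zero).
    let target = last_match_end
        .checked_add_signed(delta)
        .ok_or("invalid offset")?;

    if target < buffer_len {
        Ok(target)
    } else {
        Err("buffer overrun")
    }
}
```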
Type Interpretation and Endianness#
The types.rs module provides safe, bounds-checked reading of typed values including read_byte(), read_short(), read_long(), read_quad(), read_float(), read_double(), read_string(), read_string16(), read_regex(), and read_search(). The module exposes two entry points: read_typed_value for fixed-width types and read_typed_value_with_pattern for pattern-bearing types (Regex, Search) that require threading the rule's value operand through as the pattern parameter.
String Reading uses two distinct read modes depending on whether a comparison value is present:
- read_string_exact: Used when the rule supplies a comparison value (e.g., 0 string PNCIHISK\0). Reads exactly pattern.len() bytes with NO NUL truncation, allowing patterns with embedded NULs to match correctly.
- read_string: Used when a regex/search pattern is specified. Reads a NUL-terminated or variable-length string with UTF-8 conversion via String::from_utf8_lossy.
This dual-mode dispatch is critical for correct magic rule matching. Without read_string_exact, patterns like PNCIHISK\0 (which contain embedded NUL bytes) would be truncated at the NUL and fail to match real file signatures.
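The difference between the two modes is easiest to see on a buffer with an embedded NUL. Both functions below are hypothetical stand-ins that return raw byte slices; they are not the crate's read_string_exact or read_string.

```rust
/// Reads exactly `len` bytes with no NUL truncation
/// (hypothetical stand-in for the exact mode).
fn read_exact_sketch(buffer: &[u8], offset: usize, len: usize) -> Option<&[u8]> {
    buffer.get(offset..offset.checked_add(len)?)
}

/// Reads up to the first NUL or the end of the buffer
/// (hypothetical stand-in for the NUL-terminated mode).
fn read_until_nul_sketch(buffer: &[u8], offset: usize) -> Option<&[u8]> {
    let tail = buffer.get(offset..)?;
    let end = tail.iter().position(|&b| b == 0).unwrap_or(tail.len());
    Some(&tail[..end])
}
```

On the bytes AB\0CD, the exact mode with a length of 5 returns all five bytes, while the NUL-terminated mode stops after AB, so a five-byte pattern could never compare equal to its result.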
Byte Reading:
pub fn read_byte(buffer: &[u8], offset: usize, signed: bool) -> Result<Value, TypeReadError> {
    buffer.get(offset)
        .map(|&byte| {
            if signed {
                Value::Int(i64::from(byte as i8))
            } else {
                Value::Uint(u64::from(byte))
            }
        })
        .ok_or(TypeReadError::BufferOverrun { ... })
}
Multi-byte Reading with Endianness:
pub fn read_short(
    buffer: &[u8],
    offset: usize,
    endian: Endianness,
    signed: bool
) -> Result<Value, TypeReadError> {
    let bytes = buffer.get(offset..offset + 2)
        .ok_or(TypeReadError::BufferOverrun { ... })?;
    let value = match endian {
        Endianness::Little => u16::from_le_bytes(bytes.try_into()?),
        Endianness::Big => u16::from_be_bytes(bytes.try_into()?),
        Endianness::Native => u16::from_ne_bytes(bytes.try_into()?),
    };
    if signed {
        Ok(Value::Int(i64::from(value as i16)))
    } else {
        Ok(Value::Uint(u64::from(value)))
    }
}
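The implementation uses standard library methods (u16::from_le_bytes, u16::from_be_bytes, u16::from_ne_bytes) for byte order conversion and Rust's safe slice::get method for all buffer access. slice::get performs its bounds check at run time and returns None for an out-of-range request, so a short buffer surfaces as TypeReadError::BufferOverrun rather than as a panic or an out-of-bounds read.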
Float and Double Reading (IEEE 754 standard):
pub fn read_float(
    buffer: &[u8],
    offset: usize,
    endian: Endianness
) -> Result<Value, TypeReadError> {
    let bytes = buffer.get(offset..offset + 4)
        .ok_or(TypeReadError::BufferOverrun { ... })?;
    let value = match endian {
        Endianness::Little => f32::from_le_bytes(bytes.try_into()?),
        Endianness::Big => f32::from_be_bytes(bytes.try_into()?),
        Endianness::Native => f32::from_ne_bytes(bytes.try_into()?),
    };
    // Widen f32 to f64 for uniform Value::Float representation
    Ok(Value::Float(f64::from(value)))
}
pub fn read_double(
    buffer: &[u8],
    offset: usize,
    endian: Endianness
) -> Result<Value, TypeReadError> {
    let bytes = buffer.get(offset..offset + 8)
        .ok_or(TypeReadError::BufferOverrun { ... })?;
    let value = match endian {
        Endianness::Little => f64::from_le_bytes(bytes.try_into()?),
        Endianness::Big => f64::from_be_bytes(bytes.try_into()?),
        Endianness::Native => f64::from_ne_bytes(bytes.try_into()?),
    };
    Ok(Value::Float(value))
}
Float types follow the IEEE 754 standard with three endianness variants: befloat/bedouble (big-endian), lefloat/ledouble (little-endian), and float/double (native endianness). The read_float() function reads 4 bytes and widens to f64, while read_double() reads 8 bytes directly into f64. Both return Value::Float(f64) for uniform representation. The implementations use standard library methods (f32::from_le_bytes, f64::from_be_bytes, etc.) for byte order conversion.
String16 Reading (UCS-2 support for lestring16/bestring16) reads 16-bit code units with explicit endianness until encountering a U+0000 terminator (encoded as the 2-byte sequence 0x00 0x00), the buffer end, or an 8192-unit cap. Each character occupies two bytes in the file; the decoded value is returned as a Rust String. Surrogate-pair code units (U+D800-U+DFFF) are emitted as the Unicode replacement character (U+FFFD).
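The termination and replacement rules above can be sketched with the standard library alone. read_ucs2_sketch and MAX_UNITS are hypothetical names; the cap value is taken from the description above, and the crate's read_string16 may differ in how it reports errors.

```rust
const MAX_UNITS: usize = 8192;

/// Hypothetical UCS-2 reader: stops at U+0000, end of buffer, or the unit cap.
fn read_ucs2_sketch(buffer: &[u8], offset: usize, little_endian: bool) -> Option<String> {
    let tail = buffer.get(offset..)?;
    let mut out = String::new();

    // chunks_exact ignores a trailing odd byte that cannot form a full code unit.
    for pair in tail.chunks_exact(2).take(MAX_UNITS) {
        let unit = if little_endian {
            u16::from_le_bytes([pair[0], pair[1]])
        } else {
            u16::from_be_bytes([pair[0], pair[1]])
        };
        if unit == 0 {
            break;
        }
        // char::from_u32 rejects surrogates (U+D800..=U+DFFF), which become U+FFFD.
        out.push(char::from_u32(u32::from(unit)).unwrap_or('\u{FFFD}'));
    }
    Some(out)
}
```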
Regex Reading (binary-safe pattern matching via regex::bytes::Regex) scans a bounded window (capped at 8192 bytes per GNU file's FILE_REGEX_MAX) for POSIX-extended regular expressions. Multi-line mode is always enabled so ^/$ match at line boundaries. The /c flag controls case sensitivity, /s controls anchor advance (match-start vs match-end), and /l interprets count as lines instead of bytes. Zero-width matches (e.g., ^ or a*) are preserved as Some(Value::String("")) and distinguished from genuine misses (None). Look-around assertions are not a source of zero-width matches here, because the regex crate does not support them.
Search Reading (bounded literal scan via memchr::memmem::find) looks for a literal byte pattern within a mandatory range. Unlike TypeKind::String, which only matches at the exact offset, search scans forward up to range bytes for the first occurrence. The anchor advances to match-end (GNU file's FILE_SEARCH semantics).
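A bounded forward scan can be approximated with slice windows. search_sketch is hypothetical, uses std in place of memchr::memmem::find, and assumes that range counts candidate start positions, which may not match the crate's exact interpretation.

```rust
/// Hypothetical bounded literal search using std only; returns (match_start, match_end).
fn search_sketch(buffer: &[u8], offset: usize, range: usize, needle: &[u8]) -> Option<(usize, usize)> {
    // windows(0) would panic, and an empty needle has no meaningful position.
    if needle.is_empty() {
        return None;
    }
    let tail = buffer.get(offset..)?;

    // Assumption: only the first `range` start positions are candidates.
    tail.windows(needle.len())
        .take(range)
        .position(|window| window == needle)
        .map(|pos| (offset + pos, offset + pos + needle.len()))
}
```

The second element of the returned pair is the match end, which is the position a relative-offset anchor would advance to under the semantics described above.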
Operator Application#
The operators/ submodule implements comparison and bitwise operations. The implementation is organized into separate files by operation category, with operators/mod.rs providing the apply_operator() dispatcher function.
Equality with Cross-Type Coercion (from equality.rs):
pub fn apply_equal(left: &Value, right: &Value) -> bool {
    // Float epsilon-aware equality: |a - b| <= f64::EPSILON
    if let (Value::Float(a), Value::Float(b)) = (left, right) {
        if a.is_nan() || b.is_nan() {
            return false; // NaN != anything (including NaN)
        }
        if a.is_infinite() || b.is_infinite() {
            return a == b; // Infinities exact-match
        }
        return (a - b).abs() <= f64::EPSILON;
    }
    // Cross-type String/Bytes equality: when the parser produces
    // Value::Bytes for backslash-escape patterns like \177ELF and
    // read_string_exact returns Value::String, they must compare equal
    // by underlying byte sequence so real-world rules match correctly.
    match (left, right) {
        (Value::String(s), Value::Bytes(b)) | (Value::Bytes(b), Value::String(s)) => {
            return s.as_bytes() == b.as_slice();
        }
        _ => {}
    }
    match (left, right) {
        (Value::Uint(a), Value::Uint(b)) => a == b,
        (Value::Int(a), Value::Int(b)) => a == b,
        // Cross-type integer coercion using i128 to avoid overflow
        (Value::Uint(a), Value::Int(b)) => i128::from(*a) == i128::from(*b),
        (Value::Int(a), Value::Uint(b)) => i128::from(*a) == i128::from(*b),
        _ => false,
    }
}
Float comparisons use epsilon-aware equality: two floats are considered equal if |a - b| <= f64::EPSILON. Because f64::EPSILON is an absolute tolerance, it absorbs rounding differences only for values of roughly unit magnitude; for much larger values the test is effectively exact equality. Special values are handled explicitly: NaN is never equal to anything (including itself), and infinities are compared exactly with ==, so +inf equals only +inf. The implementation uses i128 for cross-type integer comparisons to safely handle the full range of both u64 and i64 values without overflow. Cross-type Value::String and Value::Bytes equality allows string rules with backslash-escaped patterns to match file data correctly (the parser produces Value::Bytes for patterns like \177ELF, while read_string_exact returns Value::String).
Comparison Operators (from comparison.rs):
pub fn compare_values(left: &Value, right: &Value) -> Option<Ordering> {
    match (left, right) {
        (Value::Uint(a), Value::Uint(b)) => Some(a.cmp(b)),
        (Value::Int(a), Value::Int(b)) => Some(a.cmp(b)),
        // Cross-type integer comparisons via i128 coercion
        (Value::Uint(a), Value::Int(b)) => Some(i128::from(*a).cmp(&i128::from(*b))),
        (Value::Int(a), Value::Uint(b)) => Some(i128::from(*a).cmp(&i128::from(*b))),
        // Float comparisons use partial_cmp (IEEE 754 semantics)
        (Value::Float(a), Value::Float(b)) => a.partial_cmp(b),
        (Value::String(a), Value::String(b)) => Some(a.cmp(b)),
        (Value::Bytes(a), Value::Bytes(b)) => Some(a.cmp(b)),
        _ => None,
    }
}
Comparison operators (<, >, <=, >=) use compare_values to perform ordering comparisons. Same-type comparisons use native Rust ordering, while cross-type integer comparisons use i128 coercion to safely compare u64 and i64 values without overflow. Float comparisons use partial_cmp(), which returns None for NaN operands (per IEEE 754 semantics). Incomparable type combinations (e.g., integers and strings) return None and evaluate to false.
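Two details of this scheme are worth seeing in isolation: why i128 is needed for mixed-sign integers, and how an Option<Ordering> turns into a boolean verdict. Both helpers below are hypothetical and only mirror the behaviour described above.

```rust
use std::cmp::Ordering;

/// Cross-type integer ordering via i128, mirroring the coercion described above.
/// A naive `unsigned as i64` cast would turn u64::MAX into -1 and give the wrong answer.
fn compare_mixed_sketch(unsigned: u64, signed: i64) -> Ordering {
    i128::from(unsigned).cmp(&i128::from(signed))
}

/// Hypothetical mapping from an optional ordering to a "less than" verdict:
/// incomparable operands (None) are simply not a match.
fn less_than_sketch(ordering: Option<Ordering>) -> bool {
    matches!(ordering, Some(Ordering::Less))
}
```

With this mapping a NaN operand makes every ordering operator evaluate to false, since partial_cmp yields None.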
Bitwise AND Operations (from bitwise.rs):
pub fn apply_bitwise_and(left: &Value, right: &Value) -> bool {
    match (left, right) {
        (Value::Uint(a), Value::Uint(b)) => (a & b) != 0,
        (Value::Int(a), Value::Int(b)) => ((*a as u64) & (*b as u64)) != 0,
        _ => false,
    }
}
Returns true if ANY bits are set after the AND operation, enabling pattern matching for file signatures.
BitwiseAndMask Operator:
Operator::BitwiseAndMask(mask) => {
    let masked_left = match left {
        Value::Uint(val) => Value::Uint(val & mask),
        Value::Int(val) => Value::Int(val & i64_mask),
        _ => return false,
    };
    apply_equal(&masked_left, right)
}
This operator applies a mask to the file data before comparison, useful for checking specific bit patterns (e.g., byte&0xF0 to check upper nibble).
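The two AND-style operators answer different questions, which a byte-sized sketch makes concrete. Both helpers are hypothetical reductions to u8 and are not part of the crate.

```rust
/// BitwiseAnd-style test: true if any of the requested bits is set in the file byte.
fn any_bit_set_sketch(file_byte: u8, bits: u8) -> bool {
    file_byte & bits != 0
}

/// BitwiseAndMask-style test: mask the file byte first, then require an exact match.
fn masked_equals_sketch(file_byte: u8, mask: u8, expected: u8) -> bool {
    (file_byte & mask) == expected
}
```

For a file byte of 0xA7, any_bit_set_sketch with 0xF0 is true because the upper nibble is non-zero, while masked_equals_sketch with mask 0xF0 is true only when the expected value is exactly 0xA0.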
Bitwise XOR Operations (from bitwise.rs):
pub fn apply_bitwise_xor(left: &Value, right: &Value) -> bool {
    match (left, right) {
        (Value::Uint(a), Value::Uint(b)) => (a ^ b) != 0,
        (Value::Int(a), Value::Int(b)) => ((*a as u64) ^ (*b as u64)) != 0,
        (Value::Uint(a), Value::Int(b)) => (a ^ (*b as u64)) != 0,
        (Value::Int(a), Value::Uint(b)) => ((*a as u64) ^ b) != 0,
        _ => false,
    }
}
Returns true if the XOR result is non-zero, enabling pattern matching for differing bits.
Bitwise NOT Operations (from bitwise.rs):
pub fn apply_bitwise_not(left: &Value, right: &Value) -> bool {
    let complemented = match left {
        Value::Uint(val) => Value::Uint(!val),
        Value::Int(val) => Value::Int(!*val),
        _ => return false,
    };
    apply_equal(&complemented, right)
}
Computes the bitwise complement of the left (file) value, then checks equality with the right value.
AnyValue Operator (from operators/mod.rs):
pub fn apply_any_value(_left: &Value, _right: &Value) -> bool {
    true
}
The AnyValue operator (x) always returns true, providing unconditional matching. Useful for displaying information about file data without performing comparisons.
Data Flow and Module Interactions#
Complete Pipeline Flow#
The end-to-end data flow follows this architecture:
┌─────────────┐
│ Magic File │ (Text DSL)
└──────┬──────┘
│
▼
┌─────────────────────────────────────────┐
│ Parser Module │
│ ┌───────────────┐ ┌───────────────┐ │
│ │ Preprocessing │ → │ Grammar (nom) │ │
│ └───────────────┘ └───────┬───────┘ │
│ ▼ │
│ ┌───────────────┐ │
│ │ Hierarchy │ │
│ └───────┬───────┘ │
└──────────────────────────────┼──────────┘
│
▼
┌──────────────────┐
│ AST (MagicRule)│ (Intermediate Representation)
└─────────┬────────┘
│
▼
┌──────────────────────────────────────────┐
│ Evaluator Module │
│ ┌──────────┐ ┌───────┐ ┌──────────┐│
│ │ Offset │ → │ Types │ → │Operators ││
│ │Resolution│ │Reading│ │ ││
│ └──────────┘ └───────┘ └────┬─────┘│
└───────────────────────────────────┼──────┘
│
▼
┌─────────────────┐
│ Match Results │
└─────────────────┘
Integration via lib.rs#
The lib.rs module (624 lines, down from 916 after extracting config.rs) provides the orchestration layer through the MagicDatabase struct:
pub struct MagicDatabase {
    rules: Vec<MagicRule>,         // Parsed AST from parser
    config: EvaluationConfig,      // Evaluation parameters
    source_path: Option<PathBuf>,  // Origin of rules
    mime_mapper: mime::MimeMapper, // MIME type conversion
}
The extracted config.rs module (307 lines) contains EvaluationConfig with security limits (max recursion depth, max string length, timeout) and validation methods to prevent resource exhaustion attacks.
Loading Magic Rules (Parser invocation):
pub fn load_from_file<P: AsRef<Path>>(path: P) -> Result<Self> {
    let rules = parser::load_magic_file(path.as_ref())?; // Parse → AST
    Ok(Self {
        rules,
        config: EvaluationConfig::default(),
        ...
    })
}
Evaluating Files (Evaluator invocation):
pub fn evaluate_file<P: AsRef<Path>>(&self, path: P) -> Result<EvaluationResult> {
    let file_buffer = FileBuffer::new(path)?;
    let buffer = file_buffer.as_slice();
    let matches = evaluate_rules_with_config(&self.rules, buffer, &self.config)?;
    // file_size and start_time come from code not shown in this excerpt
    Ok(self.build_result(matches, file_size, start_time))
}
AST as Contract#
The AST serves as the formal contract between parser and evaluator:
- Parser side: Produces Vec<MagicRule> with no knowledge of evaluation logic
- Evaluator side: Consumes MagicRule references without knowledge of parsing implementation
- Contract: All AST types (OffsetSpec, TypeKind, Operator, Value) are defined in parser::ast and imported by evaluator
This decoupling enables:
- Independent evolution of parser grammar and evaluation strategies
- Serialization of parsed rules for caching or distribution
- Alternative evaluator implementations (e.g., JIT compilation)
- Comprehensive testing of each module in isolation
Usage and Examples#
Basic Usage Pattern#
use libmagic_rs::MagicDatabase;
// Load and parse magic file (parser module)
let db = MagicDatabase::load_from_file("/usr/share/misc/magic")?;
// Evaluate file (evaluator module)
let result = db.evaluate_file("sample.bin")?;
println!("File type: {}", result.description);
Parsing Magic Rules Directly#
use libmagic_rs::parser::parse_text_magic_file;
let magic_content = r#"
0 string \x7fELF ELF executable
>4 byte 1 32-bit
>4 byte 2 64-bit
"#;
let rules = parse_text_magic_file(magic_content)?;
// Returns: Vec<MagicRule> with hierarchical structure
Manual Evaluation Pipeline#
use libmagic_rs::evaluator::{offset, types, operators};
use libmagic_rs::parser::ast::{OffsetSpec, TypeKind, Operator, Value, Endianness};
let buffer = b"\x7fELF\x02\x01\x01\x00";
// Stage 1: Resolve offset
let offset_pos = offset::resolve_offset(&OffsetSpec::Absolute(0), buffer)?;
// Stage 2: Read typed value (32-bit long)
let type_spec = TypeKind::Long { endian: Endianness::Little, signed: false };
let read_value = types::read_typed_value(buffer, offset_pos, &type_spec)?;
// Stage 3: Apply operator
let expected = Value::Uint(0x464C457F); // "\x7fELF" as little-endian u32
let matches = operators::apply_operator(&Operator::Equal, &read_value, &expected);
// Example with BitwiseXor
let matches_xor = operators::apply_operator(&Operator::BitwiseXor, &Value::Uint(0xFF), &Value::Uint(0x0F));
// Example with AnyValue (always matches)
let always_matches = operators::apply_operator(&Operator::AnyValue, &read_value, &Value::Uint(0));
Example with Quad (64-bit) type:
// Reading a 64-bit value
let buffer = &[0xef, 0xcd, 0xab, 0x90, 0x78, 0x56, 0x34, 0x12];
let type_spec = TypeKind::Quad { endian: Endianness::Little, signed: false };
let offset_pos = offset::resolve_offset(&OffsetSpec::Absolute(0), buffer)?;
let read_value = types::read_typed_value(buffer, offset_pos, &type_spec)?;
assert_eq!(read_value, Value::Uint(0x1234_5678_90ab_cdef));
Example with Float and Double types:
// Reading a 32-bit float (IEEE 754)
let buffer = &[0x00, 0x00, 0x80, 0x3f]; // 1.0f32 in little-endian
let type_spec = TypeKind::Float { endian: Endianness::Little };
let offset_pos = offset::resolve_offset(&OffsetSpec::Absolute(0), buffer)?;
let read_value = types::read_typed_value(buffer, offset_pos, &type_spec)?;
assert_eq!(read_value, Value::Float(1.0));
// Reading a 64-bit double (IEEE 754)
let buffer = &[0x3f, 0xf0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00]; // 1.0f64 in big-endian
let type_spec = TypeKind::Double { endian: Endianness::Big };
let offset_pos = offset::resolve_offset(&OffsetSpec::Absolute(0), buffer)?;
let read_value = types::read_typed_value(buffer, offset_pos, &type_spec)?;
assert_eq!(read_value, Value::Float(1.0));
Hierarchical Rule Evaluation#
// Rules with children are evaluated hierarchically
// Child rules only execute if parent matches
let parent_rule = MagicRule {
    offset: OffsetSpec::Absolute(0),
    typ: TypeKind::Long { endian: Endianness::Little, signed: false },
    op: Operator::Equal,
    value: Value::Uint(0x464C457F), // ELF magic
    message: "ELF executable".to_string(),
    children: vec![
        MagicRule {
            offset: OffsetSpec::Absolute(4),
            typ: TypeKind::Byte { signed: false },
            op: Operator::Equal,
            value: Value::Uint(2),
            message: "64-bit".to_string(),
            level: 1,
            ...
        }
    ],
    level: 0,
    ...
};
Relevant Code Files#
| File Path | Description | Lines of Code |
|---|---|---|
| src/parser/mod.rs | Parser module public API and pipeline orchestration | ~200 |
| src/parser/ast.rs | AST node definitions (MagicRule, OffsetSpec, TypeKind, etc.) | ~700 |
| src/parser/grammar/mod.rs | Main grammar logic (offset, type, operator, rule parsing) | 796 |
| src/parser/grammar/numbers.rs | Numeric literal parsing (decimal, hexadecimal, signed, unsigned) | 149 |
| src/parser/grammar/value.rs | Value literal parsing (strings, hex bytes, floats, numeric values) | 344 |
| src/parser/preprocessing.rs | Line-level transformations (comments, continuations) | ~200 |
| src/parser/hierarchy.rs | Stack-based hierarchy building from indentation | ~180 |
| src/parser/format.rs | Magic file format detection | ~90 |
| src/parser/loader.rs | File system loading and directory merging | ~310 |
| src/evaluator/mod.rs | Public API surface (EvaluationContext, RuleMatch) and re-exports | ~720 |
| src/evaluator/engine.rs | Core evaluation engine (evaluate_single_rule, evaluate_rules, evaluate_rules_with_config) | ~2,096 |
| src/evaluator/offset/ | Offset resolution submodule (absolute, relative, indirect) | ~290 |
| src/evaluator/types.rs | Type interpretation with endianness support | 1,505 |
| src/evaluator/operators/ | Operator application submodule (equality, comparison, bitwise) | ~1,930 |
| src/evaluator/strength.rs | Rule strength calculation for prioritization | 874 |
| src/config.rs | Evaluation configuration with security limits and validation | 307 |
| src/lib.rs | Integration layer (MagicDatabase) coordinating parser and evaluator | 624 |
Related Topics#
- Nom Parser Combinators: The parser module extensively uses nom 8.0.0 for composable, type-safe parsing of the magic file DSL
- Abstract Syntax Trees: The AST serves as the intermediate representation and contract between parser and evaluator
- Memory-Mapped File I/O: libmagic-rs uses memory mapping for efficient access to target files during evaluation
- File Type Detection: The broader context of identifying file types based on content signatures
- Endianness Handling: Multi-byte type interpretation requires careful handling of byte order (little-endian, big-endian, native)
- Safe Rust Patterns: The implementation demonstrates bounds checking, safe arithmetic, and compile-time memory safety guarantees
Design Principles#
The Parser-Evaluator Architecture embodies several key design principles:
- Separation of Concerns: Parsing and evaluation logic are coupled only through the AST types defined in parser::ast
- Single Responsibility: Each sub-module handles one specific aspect (offset resolution, type reading, operators)
- Type Safety: Rust's enum and struct system ensures compile-time correctness
- Testability: Each module can be tested in isolation with comprehensive test coverage
- Extensibility: The AST can be extended with new node types without affecting existing code
- Memory Safety: All buffer access uses safe Rust methods with bounds checking
- Error Handling: Graceful degradation for non-critical errors (e.g., individual rule failures) while propagating critical errors