AST Data Structures#
The Abstract Syntax Tree (AST) is the core representation of magic rules in libmagic-rs. This chapter provides detailed documentation of the fully implemented AST data structures with comprehensive test coverage (29 unit tests) and their usage patterns.
Overview#
The AST consists of several key types that work together to represent magic rules:
MagicRule: The main rule structure containing all componentsOffsetSpec: Specifies where to read data in filesTypeKind: Defines how to interpret bytesOperator: Comparison and bitwise operationsValue: Expected values for matchingEndianness: Byte order specifications
MagicRule Structure#
The MagicRule struct is the primary AST node representing a complete magic rule:
pub struct MagicRule {
pub offset: OffsetSpec, // Where to read data
pub typ: TypeKind, // How to interpret bytes
pub op: Operator, // Comparison operation
pub value: Value, // Expected value
pub message: String, // Human-readable description
pub children: Vec<MagicRule>, // Nested rules
pub level: u32, // Indentation level
}
Example Usage#
use libmagic_rs::parser::ast::*;
// ELF magic number rule
let elf_rule = MagicRule {
offset: OffsetSpec::Absolute(0),
typ: TypeKind::Long {
endian: Endianness::Little,
signed: false
},
op: Operator::Equal,
value: Value::Bytes(vec![0x7f, 0x45, 0x4c, 0x46]), // "\x7fELF"
message: "ELF executable".to_string(),
children: vec![],
level: 0,
};
Hierarchical Rules#
Magic rules can contain child rules that are evaluated when the parent matches:
let parent_rule = MagicRule {
offset: OffsetSpec::Absolute(0),
typ: TypeKind::Byte { signed: false },
op: Operator::Equal,
value: Value::Uint(0x7f),
message: "ELF".to_string(),
children: vec![
MagicRule {
offset: OffsetSpec::Absolute(4),
typ: TypeKind::Byte { signed: false },
op: Operator::Equal,
value: Value::Uint(1),
message: "32-bit".to_string(),
children: vec![],
level: 1,
},
MagicRule {
offset: OffsetSpec::Absolute(4),
typ: TypeKind::Byte { signed: false },
op: Operator::Equal,
value: Value::Uint(2),
message: "64-bit".to_string(),
children: vec![],
level: 1,
},
],
level: 0,
};
OffsetSpec Variants#
The OffsetSpec enum defines where to read data within a file:
Absolute Offsets#
pub enum OffsetSpec {
/// Absolute offset from file start
Absolute(i64),
// ... other variants
}
Examples:
// Read at byte 0 (file start)
let start = OffsetSpec::Absolute(0);
// Read at byte 16
let offset_16 = OffsetSpec::Absolute(16);
// Read 4 bytes before current position (negative offset)
let relative_back = OffsetSpec::Absolute(-4);
Indirect Offsets#
Indirect offsets read a pointer value and use it as the actual offset:
Indirect {
base_offset: i64, // Where to read the pointer
pointer_type: TypeKind, // How to interpret the pointer
adjustment: i64, // Value to add to pointer
endian: Endianness, // Byte order for pointer
}
Example:
// Read a 32-bit little-endian pointer at offset 0x20,
// then read data at (pointer_value + 4)
let indirect = OffsetSpec::Indirect {
base_offset: 0x20,
pointer_type: TypeKind::Long {
endian: Endianness::Little,
signed: false
},
adjustment: 4,
endian: Endianness::Little,
};
Relative and FromEnd Offsets#
// Relative to previous match position
Relative(i64),
// Relative to end of file
FromEnd(i64),
Examples:
// 8 bytes after previous match
let relative = OffsetSpec::Relative(8);
// 16 bytes before end of file
let from_end = OffsetSpec::FromEnd(-16);
TypeKind Variants#
The TypeKind enum specifies how to interpret bytes at the given offset:
Breaking Change in v0.2.0: The
Bytevariant changed from a unit variant (Byte) to a struct variant (Byte { signed: bool }). Code that pattern-matches exhaustively onTypeKindrequires updates.
Numeric Types#
pub enum TypeKind {
/// Single byte (8-bit)
Byte { signed: bool },
/// 16-bit integer
Short { endian: Endianness, signed: bool },
/// 32-bit integer
Long { endian: Endianness, signed: bool },
/// 64-bit integer
Quad { endian: Endianness, signed: bool },
/// String data
String { max_length: Option<usize> },
/// Pascal string (length-prefixed)
PString {
max_length: Option<usize>,
length_width: PStringLengthWidth,
length_includes_itself: bool,
},
/// Regular expression pattern matching
Regex {
flags: RegexFlags,
count: RegexCount,
},
/// Bounded literal byte sequence search
Search {
range: NonZeroUsize,
},
}
Examples:
// Single unsigned byte
let byte_type = TypeKind::Byte { signed: false };
// Single signed byte
let signed_byte_type = TypeKind::Byte { signed: true };
// 16-bit little-endian unsigned integer
let short_le = TypeKind::Short {
endian: Endianness::Little,
signed: false
};
// 32-bit big-endian signed integer
let long_be = TypeKind::Long {
endian: Endianness::Big,
signed: true
};
// 64-bit little-endian unsigned integer
let quad_le = TypeKind::Quad {
endian: Endianness::Little,
signed: false
};
// 64-bit big-endian signed integer
let quad_be = TypeKind::Quad {
endian: Endianness::Big,
signed: true
};
// Null-terminated string, max 256 bytes
let string_type = TypeKind::String {
max_length: Some(256)
};
PString (Pascal String)#
Pascal-style length-prefixed strings where the length prefix can be 1, 2, or 4 bytes depending on the length_width field.
Structure:
- Length prefix: 1, 2, or 4 bytes indicating string length, with configurable endianness
- String data: The number of bytes specified by the length prefix
Example:
0 pstring JPEG
0 pstring/H JPEG
The first line reads a 1-byte length prefix (default), then reads that many bytes as a string. The second line reads a 2-byte big-endian length prefix.
Behavior:
- Returns
Value::Stringcontaining the string data (without the length prefix) - Performs bounds checking on both the length prefix and the string data
- Supports all string comparison operators
- Length prefix width controlled by
PStringLengthWidthenum - Optional
/Jflag indicates JPEG-style self-inclusive length (stored length includes the prefix itself)
PStringLengthWidth Enum#
The PStringLengthWidth enum specifies the width and endianness of the length prefix:
pub enum PStringLengthWidth {
/// 1-byte length prefix (default, `/B` suffix)
OneByte,
/// 2-byte big-endian length prefix (`/H` suffix)
TwoByteBE,
/// 2-byte little-endian length prefix (`/h` suffix)
TwoByteLE,
/// 4-byte big-endian length prefix (`/L` suffix)
FourByteBE,
/// 4-byte little-endian length prefix (`/l` suffix)
FourByteLE,
}
Suffix conventions:
/B- 1-byte length prefix (default if no suffix specified)/H- 2-byte big-endian length prefix/h- 2-byte little-endian length prefix/L- 4-byte big-endian length prefix/l- 4-byte little-endian length prefix/J- Length includes the prefix width itself (combinable:/HJ,/lJ, etc.)
Examples:
use libmagic_rs::parser::ast::{TypeKind, PStringLengthWidth};
// 1-byte length prefix (default)
let pstring_default = TypeKind::PString {
max_length: None,
length_width: PStringLengthWidth::OneByte,
length_includes_itself: false,
};
// 2-byte big-endian length prefix
let pstring_be = TypeKind::PString {
max_length: None,
length_width: PStringLengthWidth::TwoByteBE,
length_includes_itself: false,
};
// 4-byte little-endian length prefix
let pstring_le = TypeKind::PString {
max_length: None,
length_width: PStringLengthWidth::FourByteLE,
length_includes_itself: false,
};
// 2-byte big-endian with /J flag (JPEG-style self-inclusive length)
let pstring_jpeg = TypeKind::PString {
max_length: None,
length_width: PStringLengthWidth::TwoByteBE,
length_includes_itself: true,
};
// Maximum 64-byte limit with 1-byte prefix
let limited_pstring = TypeKind::PString {
max_length: Some(64),
length_width: PStringLengthWidth::OneByte,
length_includes_itself: false,
};
Regex (Regular Expression Pattern Matching)#
The Regex variant matches POSIX-extended regular expression patterns against file buffers. Patterns are binary-safe and always compiled with multi-line mode enabled (matching ^ and $ at line boundaries). The scan window is capped at 8192 bytes (FILE_REGEX_MAX) regardless of the count variant.
Structure:
Regex {
flags: RegexFlags,
count: RegexCount,
}
Fields:
flags: Modifier flags from the/cand/ssuffixes (case-insensitive, start-offset). The/lsuffix is NOT a flag — it selects theRegexCount::Linesvariant below.count: Scan window specifier, one of three variants (seeRegexCountbelow).
Example:
0 regex [0-9]+ Numeric content
0 regex/1l ^#!/ Shebang on first line
0 regex/cs json Case-insensitive "json", anchor at match-start
Behavior:
- Returns
Value::Stringcontaining the matched text - Scan window capped at 8192 bytes (GNU
fileFILE_REGEX_MAX) - Multi-line mode unconditional (
^/$match line boundaries,.does not match newlines) - Zero-width matches (e.g.,
^,a*) returnValue::String("")and are distinguished from no-match - Only supports
EqualandNotEqualoperators; other comparison operators returnTypeReadError::UnsupportedType
RegexFlags Struct#
The RegexFlags struct specifies regex behavior modifiers. All flags default to false via RegexFlags::default.
pub struct RegexFlags {
/// `/c` - case-insensitive matching
pub case_insensitive: bool,
/// `/s` - advance anchor to match-start instead of match-end
pub start_offset: bool,
}
The /l modifier is encoded by the RegexCount::Lines variant rather than a flag field, so "line count" and "byte count" are mutually exclusive at the type level.
RegexCount Enum#
The RegexCount enum specifies the scan window for a regex rule:
pub enum RegexCount {
/// Plain `regex` with no suffix: full 8192-byte default window.
Default,
/// `regex/N`: scan at most `N` bytes, capped at 8192.
Bytes(NonZeroU32),
/// `regex/Nl` or `regex/l`: scan N line terminators (or the full
/// capped window if `None`), capped at 8192 bytes.
Lines(Option<NonZeroU32>),
}
Variant mapping:
regex→RegexCount::Defaultregex/N→RegexCount::Bytes(N)regex/Nl→RegexCount::Lines(Some(N))regex/l→RegexCount::Lines(None)(behaviorally equivalent toDefault: both walk the full 8192-byte capped window)
Examples:
use libmagic_rs::parser::ast::{RegexCount, RegexFlags, TypeKind};
use std::num::NonZeroU32;
// Plain regex with 8192-byte default scan window
let plain_regex = TypeKind::Regex {
flags: RegexFlags::default(),
count: RegexCount::Default,
};
// First line only (1 line, capped at 8192 bytes)
let first_line = TypeKind::Regex {
flags: RegexFlags::default(),
count: RegexCount::Lines(NonZeroU32::new(1)),
};
// Case-insensitive with anchor at match-start
let case_start = TypeKind::Regex {
flags: RegexFlags {
case_insensitive: true,
start_offset: true,
},
count: RegexCount::Default,
};
Meta-types (Control Directives)#
The Meta variant represents pseudo-types that do not read bytes from the buffer. They encode control-flow directives inherited from the libmagic magic(5) format.
Structure:
/// Control-flow directives that do not read bytes from the buffer.
Meta(MetaType),
TypeKind::Meta(_) returns None from bit_width() because meta-types consume zero on-disk bytes.
MetaType Enum:
pub enum MetaType {
/// `default` — fires only when no sibling at the same level has matched.
Default,
/// `clear` — resets the per-level sibling-matched flag.
Clear,
/// `name <id>` — defines a named subroutine (hoisted out of the rule list at load time).
Name(String),
/// `use <id>` — invokes a named subroutine at the resolved offset.
Use(String),
/// `indirect` — re-applies the full rule database at the resolved offset.
Indirect,
/// `offset` — emits the resolved file position as `Value::Uint` for
/// printf-style format substitution (e.g. `%lld`).
Offset,
}
Examples:
use libmagic_rs::parser::ast::{TypeKind, MetaType};
// Default fallback rule
let default_rule = TypeKind::Meta(MetaType::Default);
// Clear sibling-matched flag
let clear_rule = TypeKind::Meta(MetaType::Clear);
// Named subroutine declaration
let name_rule = TypeKind::Meta(MetaType::Name("part2".to_string()));
// Subroutine invocation
let use_rule = TypeKind::Meta(MetaType::Use("part2".to_string()));
// Re-entry into root rules
let indirect_rule = TypeKind::Meta(MetaType::Indirect);
// Report the resolved file offset for format substitution
let offset_rule = TypeKind::Meta(MetaType::Offset);
Parse-time Name Extraction:
Top-level name <id> rules are hoisted out of ParsedMagic::rules by parser::name_table::extract_name_table and placed into the name_table: NameTable field of ParsedMagic keyed by identifier. As a result, MetaType::Name variants in the final parsed rule list are expected only as an internal intermediate representation — name rules do not survive past the load boundary in normal operation.
Search (Bounded Literal Byte Sequence Search)#
The Search variant scans for a literal byte pattern within a bounded range. Unlike String, which matches only at the exact offset, Search scans forward up to range bytes for the first occurrence.
Structure:
Search {
range: NonZeroUsize,
}
Fields:
range: Mandatory scan window width in bytes (must be non-zero per GNUfilemagic(5) specification)
Example:
0 search/256 PK\003\004 ZIP archive within first 256 bytes
Behavior:
- Returns
Value::Stringcontaining the matched bytes if found within range - Anchor advances to
match_idx + pattern.len()(the byte position just past the matched needle), matching GNUfile'ssoftmagic.cFILE_SEARCHpath wherems->search.offset += idxand thenmoffset()addsvlen = m->vallen. An earlier implementation incorrectly advanced by the full window size (range), but this caused relative-offset children to land far past the intended byte. - Only supports
EqualandNotEqualoperators - Range is mandatory;
search/0or baresearchare parse errors
Examples:
use libmagic_rs::parser::ast::TypeKind;
use std::num::NonZeroUsize;
// Scan up to 256 bytes for the pattern
let bounded_search = TypeKind::Search {
range: NonZeroUsize::new(256).unwrap(),
};
// Scan up to 1024 bytes
let wide_search = TypeKind::Search {
range: NonZeroUsize::new(1024).unwrap(),
};
Endianness Options#
pub enum Endianness {
Little, // Little-endian (x86, ARM in little mode)
Big, // Big-endian (network byte order, PowerPC)
Native, // Host system byte order
}
Operator Types#
The Operator enum defines comparison and bitwise operations:
pub enum Operator {
Equal, // == (equality comparison)
NotEqual, // != (inequality comparison)
LessThan, // < (less-than comparison)
GreaterThan, // > (greater-than comparison)
LessEqual, // <= (less-than-or-equal comparison)
GreaterEqual, // >= (greater-than-or-equal comparison)
BitwiseAnd, // & (bitwise AND for pattern matching)
BitwiseAndMask(u64), // & (bitwise AND with mask value)
}
Added in v0.2.0: The comparison operators
LessThan,GreaterThan,LessEqual, andGreaterEqualwere added. This is a breaking change for exhaustive matches onOperator.
Usage Examples:
// Exact match
let equal_op = Operator::Equal;
// Not equal
let not_equal_op = Operator::NotEqual;
// Less than comparison
let less_op = Operator::LessThan;
// Greater than comparison
let greater_op = Operator::GreaterThan;
// Less than or equal
let less_equal_op = Operator::LessEqual;
// Greater than or equal
let greater_equal_op = Operator::GreaterEqual;
// Bitwise AND (useful for flag checking)
let bitwise_op = Operator::BitwiseAnd;
// Bitwise AND with mask
let bitwise_mask_op = Operator::BitwiseAndMask(0xFF00);
Value Types#
The Value enum represents expected values for comparison:
pub enum Value {
Uint(u64), // Unsigned integer
Int(i64), // Signed integer
Bytes(Vec<u8>), // Byte sequence
String(String), // String value
}
Examples:
// Unsigned integer value
let uint_val = Value::Uint(0x464c457f);
// Signed integer value
let int_val = Value::Int(-1);
// Byte sequence (magic numbers)
let bytes_val = Value::Bytes(vec![0x50, 0x4b, 0x03, 0x04]); // ZIP signature
// String value
let string_val = Value::String("#!/bin/sh".to_string());
Serialization Support#
All AST types implement Serialize and Deserialize for caching and interchange with comprehensive test coverage:
use serde_json;
// Serialize a rule to JSON (fully tested)
let rule = MagicRule { /* ... */ };
let json = serde_json::to_string(&rule)?;
// Deserialize from JSON (fully tested)
let rule: MagicRule = serde_json::from_str(&json)?;
// All edge cases are tested including:
// - Empty collections (Vec::new(), String::new())
// - Extreme values (u64::MAX, i64::MIN, i64::MAX)
// - Complex nested structures with multiple levels
// - All enum variants and their serialization round-trips
Implementation Status:
- ✅ Complete serialization for all AST types
- ✅ Comprehensive testing with edge cases and boundary values
- ✅ JSON compatibility for rule caching and interchange
- ✅ Round-trip validation ensuring data integrity
Common Patterns#
ELF File Detection#
let elf_rules = vec![
MagicRule {
offset: OffsetSpec::Absolute(0),
typ: TypeKind::Long { endian: Endianness::Little, signed: false },
op: Operator::Equal,
value: Value::Bytes(vec![0x7f, 0x45, 0x4c, 0x46]),
message: "ELF".to_string(),
children: vec![
MagicRule {
offset: OffsetSpec::Absolute(4),
typ: TypeKind::Byte { signed: false },
op: Operator::Equal,
value: Value::Uint(1),
message: "32-bit".to_string(),
children: vec![],
level: 1,
},
MagicRule {
offset: OffsetSpec::Absolute(4),
typ: TypeKind::Byte { signed: false },
op: Operator::Equal,
value: Value::Uint(2),
message: "64-bit".to_string(),
children: vec![],
level: 1,
},
],
level: 0,
}
];
ZIP Archive Detection#
let zip_rule = MagicRule {
offset: OffsetSpec::Absolute(0),
typ: TypeKind::Long { endian: Endianness::Little, signed: false },
op: Operator::Equal,
value: Value::Bytes(vec![0x50, 0x4b, 0x03, 0x04]),
message: "ZIP archive".to_string(),
children: vec![],
level: 0,
};
Script Detection with String Matching#
let script_rule = MagicRule {
offset: OffsetSpec::Absolute(0),
typ: TypeKind::String { max_length: Some(32) },
op: Operator::Equal,
value: Value::String("#!/bin/bash".to_string()),
message: "Bash script".to_string(),
children: vec![],
level: 0,
};
Best Practices#
Rule Organization#
- Start with broad patterns and use child rules for specifics
- Order rules by probability of matching (most common first)
- Use appropriate types for the data being checked
- Minimize indirection for performance
Type Selection#
- Use
Byte { signed }for single-byte values and flags, specifying signedness - Use
Short/Long/Quadwith explicit endianness and signedness for multi-byte integers - Use
Stringwith length limits for text patterns at exact offsets - Use
PStringfor Pascal-style length-prefixed strings - Use
Regexfor pattern matching (complex patterns, line-based checks, case-insensitive matching) - Use
Searchfor simple substring matching within a bounded range (faster than regex for literal patterns) - Use
Bytesfor exact binary sequences
Performance Considerations#
- Prefer absolute offsets over indirect when possible
- Use bitwise AND for flag checking instead of multiple equality rules
- Limit string lengths to prevent excessive reading
- Structure hierarchies to fail fast on non-matches
The AST provides a flexible, type-safe foundation for representing magic rules while maintaining compatibility with existing magic file formats.