AST Data Structures#
The Abstract Syntax Tree (AST) is the core representation of magic rules in libmagic-rs. This chapter provides detailed documentation of the fully implemented AST data structures with comprehensive test coverage (29 unit tests) and their usage patterns.
Overview#
The AST consists of several key types that work together to represent magic rules:
MagicRule: The main rule structure containing all componentsOffsetSpec: Specifies where to read data in filesTypeKind: Defines how to interpret bytesOperator: Comparison and bitwise operationsValue: Expected values for matchingEndianness: Byte order specifications
MagicRule Structure#
The MagicRule struct is the primary AST node representing a complete magic rule:
pub struct MagicRule {
pub offset: OffsetSpec, // Where to read data
pub typ: TypeKind, // How to interpret bytes
pub op: Operator, // Comparison operation
pub value: Value, // Expected value
pub message: String, // Human-readable description
pub children: Vec<MagicRule>, // Nested rules
pub level: u32, // Indentation level
}
Example Usage#
use libmagic_rs::parser::ast::*;
// ELF magic number rule
let elf_rule = MagicRule {
offset: OffsetSpec::Absolute(0),
typ: TypeKind::Long {
endian: Endianness::Little,
signed: false
},
op: Operator::Equal,
value: Value::Bytes(vec![0x7f, 0x45, 0x4c, 0x46]), // "\x7fELF"
message: "ELF executable".to_string(),
children: vec![],
level: 0,
};
Hierarchical Rules#
Magic rules can contain child rules that are evaluated when the parent matches:
let parent_rule = MagicRule {
offset: OffsetSpec::Absolute(0),
typ: TypeKind::Byte { signed: false },
op: Operator::Equal,
value: Value::Uint(0x7f),
message: "ELF".to_string(),
children: vec![
MagicRule {
offset: OffsetSpec::Absolute(4),
typ: TypeKind::Byte { signed: false },
op: Operator::Equal,
value: Value::Uint(1),
message: "32-bit".to_string(),
children: vec![],
level: 1,
},
MagicRule {
offset: OffsetSpec::Absolute(4),
typ: TypeKind::Byte { signed: false },
op: Operator::Equal,
value: Value::Uint(2),
message: "64-bit".to_string(),
children: vec![],
level: 1,
},
],
level: 0,
};
OffsetSpec Variants#
The OffsetSpec enum defines where to read data within a file:
Absolute Offsets#
pub enum OffsetSpec {
/// Absolute offset from file start
Absolute(i64),
// ... other variants
}
Examples:
// Read at byte 0 (file start)
let start = OffsetSpec::Absolute(0);
// Read at byte 16
let offset_16 = OffsetSpec::Absolute(16);
// Read 4 bytes before current position (negative offset)
let relative_back = OffsetSpec::Absolute(-4);
Indirect Offsets#
Indirect offsets read a pointer value and use it as the actual offset:
Indirect {
base_offset: i64, // Where to read the pointer
pointer_type: TypeKind, // How to interpret the pointer
adjustment: i64, // Value to add to pointer
endian: Endianness, // Byte order for pointer
}
Example:
// Read a 32-bit little-endian pointer at offset 0x20,
// then read data at (pointer_value + 4)
let indirect = OffsetSpec::Indirect {
base_offset: 0x20,
pointer_type: TypeKind::Long {
endian: Endianness::Little,
signed: false
},
adjustment: 4,
endian: Endianness::Little,
};
Relative and FromEnd Offsets#
// Relative to previous match position
Relative(i64),
// Relative to end of file
FromEnd(i64),
Examples:
// 8 bytes after previous match
let relative = OffsetSpec::Relative(8);
// 16 bytes before end of file
let from_end = OffsetSpec::FromEnd(-16);
TypeKind Variants#
The TypeKind enum specifies how to interpret bytes at the given offset:
Breaking Change in v0.2.0: The
Bytevariant changed from a unit variant (Byte) to a struct variant (Byte { signed: bool }). Code that pattern-matches exhaustively onTypeKindrequires updates.
Numeric Types#
pub enum TypeKind {
/// Single byte (8-bit)
Byte { signed: bool },
/// 16-bit integer
Short { endian: Endianness, signed: bool },
/// 32-bit integer
Long { endian: Endianness, signed: bool },
/// 64-bit integer
Quad { endian: Endianness, signed: bool },
/// String data
String { max_length: Option<usize> },
/// Pascal string (length-prefixed)
PString {
max_length: Option<usize>,
length_width: PStringLengthWidth,
length_includes_itself: bool,
},
}
Examples:
// Single unsigned byte
let byte_type = TypeKind::Byte { signed: false };
// Single signed byte
let signed_byte_type = TypeKind::Byte { signed: true };
// 16-bit little-endian unsigned integer
let short_le = TypeKind::Short {
endian: Endianness::Little,
signed: false
};
// 32-bit big-endian signed integer
let long_be = TypeKind::Long {
endian: Endianness::Big,
signed: true
};
// 64-bit little-endian unsigned integer
let quad_le = TypeKind::Quad {
endian: Endianness::Little,
signed: false
};
// 64-bit big-endian signed integer
let quad_be = TypeKind::Quad {
endian: Endianness::Big,
signed: true
};
// Null-terminated string, max 256 bytes
let string_type = TypeKind::String {
max_length: Some(256)
};
PString (Pascal String)#
Pascal-style length-prefixed strings where the length prefix can be 1, 2, or 4 bytes depending on the length_width field.
Structure:
- Length prefix: 1, 2, or 4 bytes indicating string length, with configurable endianness
- String data: The number of bytes specified by the length prefix
Example:
0 pstring JPEG
0 pstring/H JPEG
The first line reads a 1-byte length prefix (default), then reads that many bytes as a string. The second line reads a 2-byte big-endian length prefix.
Behavior:
- Returns
Value::Stringcontaining the string data (without the length prefix) - Performs bounds checking on both the length prefix and the string data
- Supports all string comparison operators
- Length prefix width controlled by
PStringLengthWidthenum - Optional
/Jflag indicates JPEG-style self-inclusive length (stored length includes the prefix itself)
PStringLengthWidth Enum#
The PStringLengthWidth enum specifies the width and endianness of the length prefix:
pub enum PStringLengthWidth {
/// 1-byte length prefix (default, `/B` suffix)
OneByte,
/// 2-byte big-endian length prefix (`/H` suffix)
TwoByteBE,
/// 2-byte little-endian length prefix (`/h` suffix)
TwoByteLE,
/// 4-byte big-endian length prefix (`/L` suffix)
FourByteBE,
/// 4-byte little-endian length prefix (`/l` suffix)
FourByteLE,
}
Suffix conventions:
/B- 1-byte length prefix (default if no suffix specified)/H- 2-byte big-endian length prefix/h- 2-byte little-endian length prefix/L- 4-byte big-endian length prefix/l- 4-byte little-endian length prefix/J- Length includes the prefix width itself (combinable:/HJ,/lJ, etc.)
Examples:
use libmagic_rs::parser::ast::{TypeKind, PStringLengthWidth};
// 1-byte length prefix (default)
let pstring_default = TypeKind::PString {
max_length: None,
length_width: PStringLengthWidth::OneByte,
length_includes_itself: false,
};
// 2-byte big-endian length prefix
let pstring_be = TypeKind::PString {
max_length: None,
length_width: PStringLengthWidth::TwoByteBE,
length_includes_itself: false,
};
// 4-byte little-endian length prefix
let pstring_le = TypeKind::PString {
max_length: None,
length_width: PStringLengthWidth::FourByteLE,
length_includes_itself: false,
};
// 2-byte big-endian with /J flag (JPEG-style self-inclusive length)
let pstring_jpeg = TypeKind::PString {
max_length: None,
length_width: PStringLengthWidth::TwoByteBE,
length_includes_itself: true,
};
// Maximum 64-byte limit with 1-byte prefix
let limited_pstring = TypeKind::PString {
max_length: Some(64),
length_width: PStringLengthWidth::OneByte,
length_includes_itself: false,
};
Endianness Options#
pub enum Endianness {
Little, // Little-endian (x86, ARM in little mode)
Big, // Big-endian (network byte order, PowerPC)
Native, // Host system byte order
}
Operator Types#
The Operator enum defines comparison and bitwise operations:
pub enum Operator {
Equal, // == (equality comparison)
NotEqual, // != (inequality comparison)
LessThan, // < (less-than comparison)
GreaterThan, // > (greater-than comparison)
LessEqual, // <= (less-than-or-equal comparison)
GreaterEqual, // >= (greater-than-or-equal comparison)
BitwiseAnd, // & (bitwise AND for pattern matching)
BitwiseAndMask(u64), // & (bitwise AND with mask value)
}
Added in v0.2.0: The comparison operators
LessThan,GreaterThan,LessEqual, andGreaterEqualwere added. This is a breaking change for exhaustive matches onOperator.
Usage Examples:
// Exact match
let equal_op = Operator::Equal;
// Not equal
let not_equal_op = Operator::NotEqual;
// Less than comparison
let less_op = Operator::LessThan;
// Greater than comparison
let greater_op = Operator::GreaterThan;
// Less than or equal
let less_equal_op = Operator::LessEqual;
// Greater than or equal
let greater_equal_op = Operator::GreaterEqual;
// Bitwise AND (useful for flag checking)
let bitwise_op = Operator::BitwiseAnd;
// Bitwise AND with mask
let bitwise_mask_op = Operator::BitwiseAndMask(0xFF00);
Value Types#
The Value enum represents expected values for comparison:
pub enum Value {
Uint(u64), // Unsigned integer
Int(i64), // Signed integer
Bytes(Vec<u8>), // Byte sequence
String(String), // String value
}
Examples:
// Unsigned integer value
let uint_val = Value::Uint(0x464c457f);
// Signed integer value
let int_val = Value::Int(-1);
// Byte sequence (magic numbers)
let bytes_val = Value::Bytes(vec![0x50, 0x4b, 0x03, 0x04]); // ZIP signature
// String value
let string_val = Value::String("#!/bin/sh".to_string());
Serialization Support#
All AST types implement Serialize and Deserialize for caching and interchange with comprehensive test coverage:
use serde_json;
// Serialize a rule to JSON (fully tested)
let rule = MagicRule { /* ... */ };
let json = serde_json::to_string(&rule)?;
// Deserialize from JSON (fully tested)
let rule: MagicRule = serde_json::from_str(&json)?;
// All edge cases are tested including:
// - Empty collections (Vec::new(), String::new())
// - Extreme values (u64::MAX, i64::MIN, i64::MAX)
// - Complex nested structures with multiple levels
// - All enum variants and their serialization round-trips
Implementation Status:
- ✅ Complete serialization for all AST types
- ✅ Comprehensive testing with edge cases and boundary values
- ✅ JSON compatibility for rule caching and interchange
- ✅ Round-trip validation ensuring data integrity
Common Patterns#
ELF File Detection#
let elf_rules = vec![
MagicRule {
offset: OffsetSpec::Absolute(0),
typ: TypeKind::Long { endian: Endianness::Little, signed: false },
op: Operator::Equal,
value: Value::Bytes(vec![0x7f, 0x45, 0x4c, 0x46]),
message: "ELF".to_string(),
children: vec![
MagicRule {
offset: OffsetSpec::Absolute(4),
typ: TypeKind::Byte { signed: false },
op: Operator::Equal,
value: Value::Uint(1),
message: "32-bit".to_string(),
children: vec![],
level: 1,
},
MagicRule {
offset: OffsetSpec::Absolute(4),
typ: TypeKind::Byte { signed: false },
op: Operator::Equal,
value: Value::Uint(2),
message: "64-bit".to_string(),
children: vec![],
level: 1,
},
],
level: 0,
}
];
ZIP Archive Detection#
let zip_rule = MagicRule {
offset: OffsetSpec::Absolute(0),
typ: TypeKind::Long { endian: Endianness::Little, signed: false },
op: Operator::Equal,
value: Value::Bytes(vec![0x50, 0x4b, 0x03, 0x04]),
message: "ZIP archive".to_string(),
children: vec![],
level: 0,
};
Script Detection with String Matching#
let script_rule = MagicRule {
offset: OffsetSpec::Absolute(0),
typ: TypeKind::String { max_length: Some(32) },
op: Operator::Equal,
value: Value::String("#!/bin/bash".to_string()),
message: "Bash script".to_string(),
children: vec![],
level: 0,
};
Best Practices#
Rule Organization#
- Start with broad patterns and use child rules for specifics
- Order rules by probability of matching (most common first)
- Use appropriate types for the data being checked
- Minimize indirection for performance
Type Selection#
- Use
Byte { signed }for single-byte values and flags, specifying signedness - Use
Short/Long/Quadwith explicit endianness and signedness for multi-byte integers - Use
Stringwith length limits for text patterns - Use
PStringfor Pascal-style length-prefixed strings - Use
Bytesfor exact binary sequences
Performance Considerations#
- Prefer absolute offsets over indirect when possible
- Use bitwise AND for flag checking instead of multiple equality rules
- Limit string lengths to prevent excessive reading
- Structure hierarchies to fail fast on non-matches
The AST provides a flexible, type-safe foundation for representing magic rules while maintaining compatibility with existing magic file formats.