Parser Implementation#

The parser is built on the nom parser combinator library and produces the AST defined in src/parser/ast.rs.

Architecture Overview#

The parser follows a modular design where individual components are implemented and tested separately, then composed into higher-level parsers:

Magic File Text → Individual Parsers → Combined Parsers → Complete AST
                      ↓
              Numbers, Offsets, Operators, Values → Rules → Rule Hierarchies

Implemented Components#

Number Parsing (`parse_number`)#

Handles both decimal and hexadecimal number formats and rejects values that overflow the target integer type:

// Decimal numbers
parse_number("123") // Ok(("", 123))
parse_number("-456") // Ok(("", -456))

// Hexadecimal numbers
parse_number("0x1a") // Ok(("", 26))
parse_number("-0xFF") // Ok(("", -255))

Features:

✅ Decimal and hexadecimal format support
✅ Signed and unsigned number handling
✅ Overflow protection with proper error reporting
✅ Comprehensive test coverage (15+ test cases)
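
The behavior above can be approximated without nom. The following is a minimal sketch, not the project's implementation: `parse_number_sketch` is a hypothetical name, it returns an `Option` instead of nom's `IResult`, and it detects overflow by parsing into `i128` before narrowing to `i64`.

```rust
use std::convert::TryFrom;

/// Hypothetical std-only sketch of decimal/hex number parsing.
/// Returns `(remaining_input, value)`, or `None` on malformed input or overflow.
pub fn parse_number_sketch(input: &str) -> Option<(&str, i64)> {
    // Optional leading minus sign.
    let (negative, rest) = match input.strip_prefix('-') {
        Some(r) => (true, r),
        None => (false, input),
    };
    // A "0x" or "0X" prefix selects base 16; anything else is base 10.
    let (radix, digits) = match rest.strip_prefix("0x").or_else(|| rest.strip_prefix("0X")) {
        Some(r) => (16, r),
        None => (10, rest),
    };
    // Take the longest run of digits that are valid for the radix.
    let end = digits
        .find(|c: char| !c.is_digit(radix))
        .unwrap_or(digits.len());
    if end == 0 {
        return None; // no digits at all
    }
    // Parse into a wider type, apply the sign, then narrow with a checked
    // conversion so out-of-range values are rejected instead of wrapping.
    let magnitude = i128::from_str_radix(&digits[..end], radix).ok()?;
    let signed = if negative { -magnitude } else { magnitude };
    let value = i64::try_from(signed).ok()?;
    Some((&digits[end..], value))
}
```

Unconsumed input is handed back to the caller, which is what lets a higher-level parser continue from where the number ended.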

Offset Parsing (`parse_offset`)#

Converts numeric values into OffsetSpec::Absolute variants:

// Basic offsets
parse_offset("0") // Ok(("", OffsetSpec::Absolute(0)))
parse_offset("0x10") // Ok(("", OffsetSpec::Absolute(16)))
parse_offset("-4") // Ok(("", OffsetSpec::Absolute(-4)))

// With whitespace handling
parse_offset(" 123 ") // Ok(("", OffsetSpec::Absolute(123)))

Features:

✅ Absolute offset parsing with full number format support
✅ Whitespace handling (leading and trailing)
✅ Negative offset support (parsed as a negative OffsetSpec::Absolute value)
📋 Indirect offset parsing (planned)
📋 Relative offset parsing (planned)

Operator Parsing (`parse_operator`)#

Parses comparison and bitwise operators with multiple syntax variants:

// Equality operators
parse_operator("=") // Ok(("", Operator::Equal))
parse_operator("==") // Ok(("", Operator::Equal))

// Inequality operators
parse_operator("!=") // Ok(("", Operator::NotEqual))
parse_operator("<>") // Ok(("", Operator::NotEqual))

// Comparison operators (v0.2.0+)
parse_operator("<") // Ok(("", Operator::LessThan))
parse_operator(">") // Ok(("", Operator::GreaterThan))
parse_operator("<=") // Ok(("", Operator::LessEqual))
parse_operator(">=") // Ok(("", Operator::GreaterEqual))

// Bitwise operators
parse_operator("&") // Ok(("", Operator::BitwiseAnd))
parse_operator("^") // Ok(("", Operator::BitwiseXor))
parse_operator("~") // Ok(("", Operator::BitwiseNot))

// Any-value operator (always matches)
parse_operator("x") // Ok(("", Operator::AnyValue))

Features:

✅ Multiple syntax variants for compatibility
✅ Precedence handling (longer operators matched first)
✅ Whitespace tolerance
✅ Invalid operator rejection with clear errors
✅ Ten comparison and bitwise operators supported, plus AnyValue (x)

Note: Comparison operators (<, >, <=, >=) were implemented in v0.2.0 via #104.
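
The "longer operators matched first" rule can be illustrated with an ordered token table. This is a sketch under assumptions: `Op` and `parse_operator_sketch` are hypothetical stand-ins for the real `Operator` enum and nom-based parser, and only the variants shown in the examples above are modeled.

```rust
/// Hypothetical mirror of the operator variants listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Equal,
    NotEqual,
    LessThan,
    GreaterThan,
    LessEqual,
    GreaterEqual,
    BitwiseAnd,
    BitwiseXor,
    BitwiseNot,
    AnyValue,
}

/// Two-character tokens come first so that "<=" is never read as "<"
/// followed by a stray "=".
const TABLE: &[(&str, Op)] = &[
    ("==", Op::Equal),
    ("!=", Op::NotEqual),
    ("<>", Op::NotEqual),
    ("<=", Op::LessEqual),
    (">=", Op::GreaterEqual),
    ("=", Op::Equal),
    ("<", Op::LessThan),
    (">", Op::GreaterThan),
    ("&", Op::BitwiseAnd),
    ("^", Op::BitwiseXor),
    ("~", Op::BitwiseNot),
    ("x", Op::AnyValue),
];

/// Returns `(remaining_input, operator)` or `None` for an unknown operator.
pub fn parse_operator_sketch(input: &str) -> Option<(&str, Op)> {
    // Tolerate leading whitespace before the operator token.
    let trimmed = input.trim_start();
    // The first table entry that prefixes the input wins.
    TABLE
        .iter()
        .find_map(|&(token, op)| trimmed.strip_prefix(token).map(|rest| (rest, op)))
}
```

Ordering the table is the whole trick: swapping `"<"` ahead of `"<="` would make the second entry unreachable.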

Value Parsing (`parse_value`)#

Handles multiple value types, applying a fixed type precedence so that ambiguous input resolves to a single value type:

// String literals with escape sequences
parse_value("\"Hello\"") // Value::String("Hello".to_string())
parse_value("\"Line1\\nLine2\"") // Value::String("Line1\nLine2".to_string())

// Floating-point literals
parse_value("3.14") // Value::Float(3.14)
parse_value("-1.0") // Value::Float(-1.0)
parse_value("2.5e10") // Value::Float(2.5e10)

// Numeric values
parse_value("123") // Value::Uint(123)
parse_value("-456") // Value::Int(-456)
parse_value("0x1a") // Value::Uint(26)

// Hex byte sequences
parse_value("\\x7f\\x45") // Value::Bytes(vec![0x7f, 0x45])
parse_value("7f454c46") // Value::Bytes(vec![0x7f, 0x45, 0x4c, 0x46])

Features:

✅ Quoted string parsing with escape sequence support
✅ Floating-point literal parsing with scientific notation support
✅ Numeric literal parsing (decimal and hexadecimal)
✅ Hex byte sequence parsing (with and without \x prefix)
✅ Intelligent type precedence to avoid parsing conflicts
✅ Comprehensive escape sequence handling (\n, \t, \r, \\, \", \', \0)

Float and Double Type Parsing (`parse_float_value`)#

Parses floating-point type specifiers and literals for IEEE 754 single (32-bit) and double-precision (64-bit) values:

// Float literals
parse_float_value("3.14") // Ok(("", 3.14))
parse_float_value("-0.5") // Ok(("", -0.5))
parse_float_value("1.0e-10") // Ok(("", 1.0e-10))
parse_float_value("2.5E+3") // Ok(("", 2.5e+3))

Type Keywords:

Six floating-point type keywords are supported, each mapping to TypeKind::Float or TypeKind::Double with an Endianness field:

float - 32-bit IEEE 754, native endianness → TypeKind::Float { endian: Endianness::Native }
befloat - 32-bit IEEE 754, big-endian → TypeKind::Float { endian: Endianness::Big }
lefloat - 32-bit IEEE 754, little-endian → TypeKind::Float { endian: Endianness::Little }
double - 64-bit IEEE 754, native endianness → TypeKind::Double { endian: Endianness::Native }
bedouble - 64-bit IEEE 754, big-endian → TypeKind::Double { endian: Endianness::Big }
ledouble - 64-bit IEEE 754, little-endian → TypeKind::Double { endian: Endianness::Little }
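
As an illustration of how the keyword selects both read width and byte order, here is a small sketch. `Endian`, `FloatKind`, `float_keyword_sketch`, and `read_float_sketch` are hypothetical names modeled on the list above, and widening every result to `f64` is an assumption made for convenience.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Endian {
    Native,
    Big,
    Little,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FloatKind {
    Float { endian: Endian },
    Double { endian: Endian },
}

/// Maps one of the six keywords to a kind; anything else is rejected.
pub fn float_keyword_sketch(keyword: &str) -> Option<FloatKind> {
    Some(match keyword {
        "float" => FloatKind::Float { endian: Endian::Native },
        "befloat" => FloatKind::Float { endian: Endian::Big },
        "lefloat" => FloatKind::Float { endian: Endian::Little },
        "double" => FloatKind::Double { endian: Endian::Native },
        "bedouble" => FloatKind::Double { endian: Endian::Big },
        "ledouble" => FloatKind::Double { endian: Endian::Little },
        _ => return None,
    })
}

/// Reads 4 or 8 bytes from the start of `buf`, depending on the kind.
/// Returns `None` when the buffer is too short.
pub fn read_float_sketch(kind: FloatKind, buf: &[u8]) -> Option<f64> {
    match kind {
        FloatKind::Float { endian } => {
            let b = buf.get(..4)?;
            let raw = [b[0], b[1], b[2], b[3]];
            let value = match endian {
                Endian::Native => f32::from_ne_bytes(raw),
                Endian::Big => f32::from_be_bytes(raw),
                Endian::Little => f32::from_le_bytes(raw),
            };
            Some(f64::from(value))
        }
        FloatKind::Double { endian } => {
            let b = buf.get(..8)?;
            let raw = [b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7]];
            Some(match endian {
                Endian::Native => f64::from_ne_bytes(raw),
                Endian::Big => f64::from_be_bytes(raw),
                Endian::Little => f64::from_le_bytes(raw),
            })
        }
    }
}
```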

Float Literal Grammar:

The parse_float_value function recognizes standard floating-point notation with a mandatory decimal point to distinguish floats from integers:

[-]digits.digits[{e|E}[{+|-}]digits]

Examples: 3.14, -0.5, 1.0e-10, 2.5E+3

Parsed literals are stored as Value::Float(f64) in the AST, regardless of whether the rule uses float or double (the type determines buffer read size, not literal representation).
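
The grammar can be recognized with a hand-written scanner. The sketch below is an assumption-laden illustration, not the nom-based `parse_float_value`: `parse_float_literal_sketch` and `skip_digits` are hypothetical names, and an exponent marker that is not followed by digits is simply left unconsumed.

```rust
/// Advances past ASCII digits starting at `i` and returns the new index.
fn skip_digits(bytes: &[u8], mut i: usize) -> usize {
    while i < bytes.len() && bytes[i].is_ascii_digit() {
        i += 1;
    }
    i
}

/// Recognizes `[-]digits.digits[{e|E}[{+|-}]digits]`.
/// Returns `(remaining_input, value)` or `None` if the input is not a float.
pub fn parse_float_literal_sketch(input: &str) -> Option<(&str, f64)> {
    let bytes = input.as_bytes();
    // Optional sign.
    let start = if bytes.first() == Some(&b'-') { 1 } else { 0 };
    // Integer part: at least one digit.
    let int_end = skip_digits(bytes, start);
    if int_end == start {
        return None;
    }
    // The decimal point is mandatory; without it the input is an integer.
    if bytes.get(int_end) != Some(&b'.') {
        return None;
    }
    // Fractional part: at least one digit.
    let frac_start = int_end + 1;
    let frac_end = skip_digits(bytes, frac_start);
    if frac_end == frac_start {
        return None;
    }
    // Optional exponent, consumed only when at least one digit follows it.
    let mut end = frac_end;
    if matches!(bytes.get(end), Some(b'e') | Some(b'E')) {
        let mut exp = end + 1;
        if matches!(bytes.get(exp), Some(b'+') | Some(b'-')) {
            exp += 1;
        }
        let exp_end = skip_digits(bytes, exp);
        if exp_end > exp {
            end = exp_end;
        }
    }
    // Everything consumed so far is ASCII, so `end` is a valid char boundary.
    let value: f64 = input[..end].parse().ok()?;
    Some((&input[end..], value))
}
```

Rejecting input without a decimal point is what keeps `123` available for the integer parser.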

Usage in Magic Rules:

// Native-endian float comparison
0 float x // Match any float value
0 float =3.14 // Match if float equals 3.14

// Big-endian double comparison
0 bedouble >1.5 // Match if big-endian double > 1.5

Features:

✅ Six type keywords for float and double with endianness variants
✅ Float literal parsing with decimal point, negative values, scientific notation
✅ Value::Float(f64) AST variant for floating-point literals
✅ Type precedence ensures floats parsed before integers (decimal point disambiguates)
✅ Comprehensive test coverage for all endianness variants and literal formats

Note: Float and double types do not have signed/unsigned variants. IEEE 754 handles sign internally via the sign bit, so all float types use a single TypeKind variant with only an endian field (no signed: bool field).

Pascal String (pstring) Type#

The parser supports Pascal-style length-prefixed strings through the pstring keyword with multiple length prefix width variants:

Type Keyword:

pstring - Length-prefixed string → TypeKind::PString { max_length: None, length_width: PStringLengthWidth::OneByte, length_includes_itself: false }

Length Prefix Width Variants:

Pascal strings support multiple length prefix widths via suffix modifiers:

/B - 1-byte length prefix (default) → PStringLengthWidth::OneByte
/H - 2-byte big-endian length prefix → PStringLengthWidth::TwoByteBE
/h - 2-byte little-endian length prefix → PStringLengthWidth::TwoByteLE
/L - 4-byte big-endian length prefix → PStringLengthWidth::FourByteBE
/l - 4-byte little-endian length prefix → PStringLengthWidth::FourByteLE

Self-Inclusive Length Flag (/J):

The /J flag indicates JPEG-style self-inclusive length, where the stored length value includes the length prefix bytes themselves. The evaluator subtracts the prefix width from the stored length to determine the actual string data length.

The /J flag can be combined with any width variant:

/J - 1-byte self-inclusive (default width)
/BJ - 1-byte self-inclusive (explicit)
/HJ - 2-byte big-endian self-inclusive
/hJ - 2-byte little-endian self-inclusive
/LJ - 4-byte big-endian self-inclusive
/lJ - 4-byte little-endian self-inclusive

Format:

Pascal strings store the length as a prefix (1, 2, or 4 bytes depending on the variant), followed by that many bytes of string data. Unlike C strings, they are not null-terminated. When the /J flag is used, the length value includes the prefix size itself.
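
To make the layout concrete, the following sketch decodes a length-prefixed string from a byte buffer. It is an illustration only: `LengthWidth` and `read_pstring_sketch` are hypothetical names, the function returns raw bytes rather than a `Value::String`, and `max_length` handling is left out.

```rust
/// Hypothetical mirror of the five prefix widths described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LengthWidth {
    OneByte,
    TwoByteBE,
    TwoByteLE,
    FourByteBE,
    FourByteLE,
}

/// Returns the string payload, or `None` if the prefix or payload is
/// truncated or the self-inclusive length is smaller than the prefix.
pub fn read_pstring_sketch(
    buf: &[u8],
    width: LengthWidth,
    length_includes_itself: bool,
) -> Option<&[u8]> {
    let prefix_len = match width {
        LengthWidth::OneByte => 1,
        LengthWidth::TwoByteBE | LengthWidth::TwoByteLE => 2,
        LengthWidth::FourByteBE | LengthWidth::FourByteLE => 4,
    };
    // Bounds-check the prefix before decoding it.
    let p = buf.get(..prefix_len)?;
    let stored = match width {
        LengthWidth::OneByte => usize::from(p[0]),
        LengthWidth::TwoByteBE => usize::from(u16::from_be_bytes([p[0], p[1]])),
        LengthWidth::TwoByteLE => usize::from(u16::from_le_bytes([p[0], p[1]])),
        LengthWidth::FourByteBE => u32::from_be_bytes([p[0], p[1], p[2], p[3]]) as usize,
        LengthWidth::FourByteLE => u32::from_le_bytes([p[0], p[1], p[2], p[3]]) as usize,
    };
    // With /J the stored length counts the prefix bytes too.
    let data_len = if length_includes_itself {
        stored.checked_sub(prefix_len)?
    } else {
        stored
    };
    // Bounds-check the payload; `get` returns None if it runs off the end.
    let end = prefix_len.checked_add(data_len)?;
    buf.get(prefix_len..end)
}
```

For example, the bytes `00 06 44 61 74 61` read with a 2-byte big-endian self-inclusive prefix yield the four-byte payload `Data`.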

Parser Implementation:

Recognized by parse_type_keyword() in src/parser/types.rs
Suffix parsing handled by parse_pstring_suffix() in src/parser/grammar/mod.rs
Maps to TypeKind::PString in the AST with length_width and length_includes_itself fields
Evaluator reads length prefix using appropriate byte order (from_be_bytes or from_le_bytes)
Stored as Value::String for comparison with string operators
Supports optional max_length field to cap the length value

Usage in Magic Rules:

// Basic pstring matching (1-byte length prefix)
0 pstring =Hello // Match if pstring equals "Hello"
0 pstring x // Match any pstring value

// Multi-byte length prefix variants
0 pstring/H =Test // 2-byte big-endian length prefix
0 pstring/h =Test // 2-byte little-endian length prefix
0 pstring/L =Test // 4-byte big-endian length prefix
0 pstring/l =Test // 4-byte little-endian length prefix

// JPEG-style self-inclusive length
0 pstring/J x // 1-byte self-inclusive length
0 pstring/HJ =Data // 2-byte big-endian self-inclusive length
0 pstring/lJ =Data // 4-byte little-endian self-inclusive length

// With max_length constraint
0 pstring/H/64 x // 2-byte prefix, limit read to 64 bytes

Features:

✅ Five length prefix width variants (1-byte, 2-byte BE/LE, 4-byte BE/LE)
✅ Self-inclusive length flag (/J) for JPEG-style length encoding
✅ Combinable suffix syntax (/HJ, /lJ, etc.)
✅ Bounds checking for both length prefix and string data
✅ Proper endianness handling via from_be_bytes / from_le_bytes
✅ UTF-8 validation with replacement character for invalid sequences
✅ Optional max_length parameter to limit string reads
✅ String comparison operators work with pstring values

Date and Timestamp Types#

The parser supports date and timestamp types for parsing Unix timestamps (signed seconds since epoch). There are 12 type keywords:

32-bit timestamps (Date):

date - Native endian, UTC
ldate - Native endian, local time
bedate - Big-endian, UTC
beldate - Big-endian, local time
ledate - Little-endian, UTC
leldate - Little-endian, local time

64-bit timestamps (QDate):

qdate - Native endian, UTC
qldate - Native endian, local time
beqdate - Big-endian, UTC
beqldate - Big-endian, local time
leqdate - Little-endian, UTC
leqldate - Little-endian, local time

The parser creates TypeKind::Date or TypeKind::QDate variants with appropriate endianness and UTC flags. During evaluation, timestamps are formatted as strings in the format "Www Mmm DD HH:MM:SS YYYY" to match GNU file output.

Regex Type#

The parser supports regular expression matching through the regex keyword, enabling POSIX-extended regex patterns against file contents:

Type Keyword:

regex - Regular expression match → TypeKind::Regex { flags, count }

Flag Support:

Regex rules accept three modifier flags via the /[csl] suffix:

/c - Case-insensitive matching → RegexFlags::case_insensitive = true
/s - Advance anchor to match-start instead of match-end → RegexFlags::start_offset = true
/l - Line-based scan window → collapsed into RegexCount::Lines(count) by the grammar layer (it is NOT a flag field)

Flags can be combined in any order (/cl and /lc are equivalent, as are /csl and /lsc). The parser also accepts interleaved flag-and-count syntax matching GNU file semantics: regex/1l and regex/l1 parse identically. Duplicate counts (regex/1l2l, regex/1c2l, regex/l1l2) are parse errors.

Count and Scan Window:

The count (if any) and the /l flag collapse into a single RegexCount enum variant:

regex → RegexCount::Default — scan 8192 bytes (default) or until buffer ends
regex/N → RegexCount::Bytes(N) — scan at most N bytes, clamped at 8192
regex/Nl → RegexCount::Lines(Some(N)) — scan from offset through the end of the Nth line terminator (LF, CRLF, or bare CR), capped at 8192 bytes
regex/l → RegexCount::Lines(None) — behaviorally equivalent to Default (walks the full 8192-byte capped window)

The 8192-byte hard cap matches GNU file's FILE_REGEX_MAX constant and prevents runaway regex scans against large buffers.
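
The flag and count rules above can be sketched as a single pass over the suffix. This is a hypothetical illustration: `Flags`, `Count`, and `parse_regex_suffix_sketch` only mirror the shapes described here, the input is assumed to be the text after `regex/`, and the string error type is a placeholder.

```rust
use std::num::NonZeroU32;

#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Flags {
    pub case_insensitive: bool,
    pub start_offset: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Count {
    Default,
    Bytes(NonZeroU32),
    Lines(Option<NonZeroU32>),
}

/// Parses the text after "regex/", for example "1l" or "cs256".
pub fn parse_regex_suffix_sketch(suffix: &str) -> Result<(Flags, Count), String> {
    if suffix.is_empty() {
        return Err("bare regex/ with no modifier".to_string());
    }
    let mut flags = Flags::default();
    let mut line_mode = false;
    let mut count: Option<NonZeroU32> = None;
    let mut chars = suffix.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            'c' => {
                flags.case_insensitive = true;
                chars.next();
            }
            's' => {
                flags.start_offset = true;
                chars.next();
            }
            'l' => {
                line_mode = true;
                chars.next();
            }
            '0'..='9' => {
                // A second digit run anywhere in the suffix is a duplicate count.
                if count.is_some() {
                    return Err("duplicate count".to_string());
                }
                let mut n: u32 = 0;
                while let Some(d) = chars.peek().and_then(|ch| ch.to_digit(10)) {
                    n = n
                        .checked_mul(10)
                        .and_then(|v| v.checked_add(d))
                        .ok_or("count overflow")?;
                    chars.next();
                }
                count = Some(NonZeroU32::new(n).ok_or("zero count")?);
            }
            other => return Err(format!("unknown regex modifier {:?}", other)),
        }
    }
    // /l is not stored as a flag; it changes which Count variant is produced.
    let count = match (line_mode, count) {
        (true, n) => Count::Lines(n),
        (false, Some(n)) => Count::Bytes(n),
        (false, None) => Count::Default,
    };
    Ok((flags, count))
}
```

Because flags and digits are handled in one loop, `1l` and `l1` naturally produce the same result.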

Parsing Examples:

// Plain regex (no flags, default 8192-byte scan window)
parse_type_and_operator("regex")
// → TypeKind::Regex { flags: RegexFlags::default(), count: RegexCount::Default }

// Case-insensitive flag
parse_type_and_operator("regex/c")
// → TypeKind::Regex {
// flags: RegexFlags { case_insensitive: true, start_offset: false },
// count: RegexCount::Default,
// }

// Line-based with explicit count
parse_type_and_operator("regex/1l")
// → TypeKind::Regex {
// flags: RegexFlags::default(),
// count: RegexCount::Lines(NonZeroU32::new(1)),
// }

// Byte count with case-insensitive + start-offset flags
parse_type_and_operator("regex/cs256")
// → TypeKind::Regex {
// flags: RegexFlags { case_insensitive: true, start_offset: true },
// count: RegexCount::Bytes(NonZeroU32::new(256).unwrap()),
// }

Usage in Magic Rules:

// Match lines starting with a digit
0 regex "^[0-9]" numeric prefix

// Case-insensitive JSON detection
0 regex/c "\\{.*\"[^\"]+\"" possible JSON

// Scan first line only for version string
>1 regex/1l "version [0-9]+" version line

Regex Semantics:

Multi-line regex mode is always enabled (matching libmagic's unconditional REG_NEWLINE), so ^ and $ match at line boundaries and . does not match \n. This behavior is independent of the /l flag; /l controls the scan window (line-based vs byte-based), not the regex compilation mode.
The scan window is always capped at 8192 bytes regardless of the count value.
Zero-width matches (for example ^, a*, or .{0}) are preserved as Value::String("") and distinguished from genuine misses. The Rust regex crate does not support look-around assertions (lookaheads or lookbehinds); those are deliberately excluded to preserve its linear-time matching guarantees.
Regex rules only support Operator::Equal and Operator::NotEqual; other comparison operators are rejected at evaluation time.
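
The scan-window rules can be pictured as a slicing step that runs before any regex is applied. The sketch below is an assumption, not the evaluator's code: `Window`, `REGEX_MAX`, and `scan_window_sketch` are hypothetical names, and counts are plain `usize` values for brevity.

```rust
/// Hard cap on the scan window, as described above.
const REGEX_MAX: usize = 8192;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Window {
    Default,
    Bytes(usize),
    Lines(Option<usize>),
}

/// Returns the slice of `buf` that a regex rule at `offset` would scan.
pub fn scan_window_sketch(buf: &[u8], offset: usize, window: Window) -> &[u8] {
    // An offset past the end yields an empty window.
    let tail = buf.get(offset..).unwrap_or(&[]);
    // The cap applies regardless of the requested count.
    let capped = &tail[..tail.len().min(REGEX_MAX)];
    match window {
        Window::Default | Window::Lines(None) => capped,
        Window::Bytes(n) => &capped[..capped.len().min(n)],
        Window::Lines(Some(n)) => {
            let mut seen = 0;
            let mut i = 0;
            while i < capped.len() && seen < n {
                match capped[i] {
                    b'\n' => {
                        seen += 1;
                        i += 1;
                    }
                    b'\r' => {
                        seen += 1;
                        i += 1;
                        // CRLF counts as a single terminator.
                        if capped.get(i) == Some(&b'\n') {
                            i += 1;
                        }
                    }
                    _ => i += 1,
                }
            }
            // The window runs through the end of the Nth terminator.
            &capped[..i]
        }
    }
}
```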

Features:

✅ regex keyword recognition with suffix parsing
✅ Three modifier flags (/c, /s, /l) with arbitrary combination order
✅ Optional numeric count parameter (interleaved with flags per GNU file semantics)
✅ Duplicate regex counts rejected with clear parse errors
✅ 8192-byte scan window cap matching FILE_REGEX_MAX
✅ Bare regex/ with no valid modifier is a parse error
✅ regex/0 is rejected (zero count has no valid semantics)
✅ RegexFlags struct representation for clean flag management

Search Type#

The parser supports bounded literal byte sequence searching through the search keyword with optional modifier flags:

Type Keyword:

search - Multi-byte pattern search within bounded range → TypeKind::Search { range, flags }

Mandatory Range Parameter:

Search rules require a decimal range suffix specifying the scan window width in bytes:

/N - Scan up to N bytes for the literal pattern, stored as NonZeroUsize

Per GNU file magic(5) specification, the range is mandatory. Bare search (no /N suffix) and search/0 are both rejected at parse time.

SearchFlags Structure:

The SearchFlags struct contains nine boolean fields corresponding to the flag suffixes. Each field defaults to false (byte-exact comparison, match-END anchor):

start_anchor (/s) - Anchor advances to match-START instead of match-END (required for TGA footer patterns, sfnt name tables)
ignore_lowercase (/c) - Asymmetric case-insensitive: lowercase pattern bytes match either case in buffer
ignore_uppercase (/C) - Asymmetric case-insensitive: uppercase pattern bytes match either case in buffer
compact_optional_whitespace (/w) - Pattern whitespace matches zero-or-more buffer whitespace
compact_whitespace (/W) - Pattern whitespace requires ≥1 buffer whitespace, then absorbs greedily
trim (/T) - Trim leading/trailing ASCII whitespace from pattern at evaluation time
full_word (/f) - Post-match word-boundary check (byte after match must be non-word or end-of-buffer)
text_test (/t) - Hint for text files (captured for MIME-output integration, no current comparison effect)
bin_test (/b) - Hint for binary files (captured for MIME-output integration, no current comparison effect)

The /B flag is accepted as a synonym for /b in search rules (distinct from pstring's /B which is the 1-byte length-width letter).

Flags can be combined in any order (/cs, /sWcT, etc.). Duplicate letters are accepted idempotently (for example search/256/cc sets ignore_lowercase once with no side effect).
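
A compact way to picture the suffix rules is a two-step parse: a mandatory non-zero range, then an optional run of flag letters. The sketch below uses hypothetical names (`SearchFlagsSketch`, `parse_search_suffix_sketch`), assumes the input is the text after `search/`, and returns `None` where the real parser would report a parse error.

```rust
use std::num::NonZeroUsize;

/// Hypothetical mirror of the nine boolean fields listed above.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct SearchFlagsSketch {
    pub start_anchor: bool,
    pub ignore_lowercase: bool,
    pub ignore_uppercase: bool,
    pub compact_optional_whitespace: bool,
    pub compact_whitespace: bool,
    pub trim: bool,
    pub full_word: bool,
    pub text_test: bool,
    pub bin_test: bool,
}

/// Parses the text after "search/", for example "256" or "256/cs".
pub fn parse_search_suffix_sketch(suffix: &str) -> Option<(NonZeroUsize, SearchFlagsSketch)> {
    // Split "<range>[/<flags>]".
    let (range_text, flag_text) = match suffix.split_once('/') {
        Some((range, flags)) => (range, flags),
        None => (suffix, ""),
    };
    // The range must be plain decimal digits and must not be zero.
    if range_text.is_empty() || !range_text.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    let range = NonZeroUsize::new(range_text.parse::<usize>().ok()?)?;
    let mut flags = SearchFlagsSketch::default();
    for letter in flag_text.chars() {
        // Setting a field twice is harmless, so duplicates are idempotent.
        match letter {
            's' => flags.start_anchor = true,
            'c' => flags.ignore_lowercase = true,
            'C' => flags.ignore_uppercase = true,
            'w' => flags.compact_optional_whitespace = true,
            'W' => flags.compact_whitespace = true,
            'T' => flags.trim = true,
            'f' => flags.full_word = true,
            't' => flags.text_test = true,
            'b' | 'B' => flags.bin_test = true,
            _ => return None,
        }
    }
    Some((range, flags))
}
```

Storing the range as `NonZeroUsize` means a zero-width scan cannot be constructed at all, so later stages never need to check for it.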

Parsing Examples:

// 256-byte search window, no flags
parse_type_and_operator("search/256")
// → TypeKind::Search {
// range: NonZeroUsize(256),
// flags: SearchFlags::default(),
// }

// Match-start anchor flag
parse_type_and_operator("search/256/s")
// → TypeKind::Search {
// range: NonZeroUsize(256),
// flags: SearchFlags {
// start_anchor: true,
// ..Default::default()
// },
// }

// Multiple flags: case-insensitive lowercase + start anchor
parse_type_and_operator("search/256/cs")
// → TypeKind::Search {
// range: NonZeroUsize(256),
// flags: SearchFlags {
// ignore_lowercase: true,
// start_anchor: true,
// ..Default::default()
// },
// }

// Bare search is a parse error (range is mandatory)
parse_type_and_operator("search")
// → Err(...)

// Zero-range search is rejected
parse_type_and_operator("search/0")
// → Err(...)

Usage in Magic Rules:

// Scan up to 256 bytes for DOS MZ header
0 search/256 "MZ" DOS executable

// Match-start anchor for TGA footer (signature at end, anchor at start)
0 search/18/s TRUEVISION-XFILE TGA image

// Case-insensitive search
0 search/1024/c "content-type:" HTTP header

// Multiple flags: optional whitespace + trim
0 search/512/wT "version" version string

Evaluation Semantics:

The parser stores flag letters in the SearchFlags struct via parse_search_suffix, which returns (NonZeroUsize, SearchFlags).
Unlike TypeKind::String (which only matches at the exact offset), search scans forward up to range bytes for the first occurrence of the literal pattern.
When SearchFlags::needs_byte_compare() returns true (any of /c, /C, /w, /W, /T, /f is set), the evaluator uses a byte-by-byte walk through compare_string_with_flags. When only anchor-only or metadata-only flags (/s, /t, /b) are set, the SIMD-accelerated memchr::memmem::find fast path is preserved.
The anchor advance is controlled by start_anchor: when true, the anchor lands at the match-START index (matching libmagic's STRING_SEARCHEND behavior); when false, the anchor lands at match-END (default, matching FILE_SEARCH in softmagic.c::moffset()).
Search rules only support Operator::Equal and Operator::NotEqual; other comparison operators are rejected at evaluation time.
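
The difference between the match-START and match-END anchors is easiest to see in code. The sketch below makes two assumptions that the text above does not settle: the whole match must lie inside the `range`-byte window, and a naive `windows` scan stands in for `memchr::memmem::find`. `search_anchor_sketch` is a hypothetical name.

```rust
/// Scans `range` bytes from `offset` for `pattern` and returns the new
/// anchor position, or `None` when there is no match.
pub fn search_anchor_sketch(
    buf: &[u8],
    offset: usize,
    range: usize,
    pattern: &[u8],
    start_anchor: bool,
) -> Option<usize> {
    // `windows(0)` panics, so an empty pattern is treated as a miss here.
    if pattern.is_empty() {
        return None;
    }
    let tail = buf.get(offset..)?;
    // Assumption: the window is at most `range` bytes wide and the match
    // must fit entirely inside it.
    let window = &tail[..tail.len().min(range)];
    // First occurrence wins.
    let hit = window.windows(pattern.len()).position(|w| w == pattern)?;
    let match_start = offset + hit;
    Some(if start_anchor {
        match_start // with /s the anchor stays at the match start
    } else {
        match_start + pattern.len() // default: anchor lands after the match
    })
}
```

For the buffer `xxMZyy` and the pattern `MZ`, the default anchor is 4 while the `/s` anchor is 2, so child rules with relative offsets read from different places.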

Features:

✅ search keyword recognition with mandatory /N suffix
✅ NonZeroUsize range representation (zero-width scan unrepresentable)
✅ Bare search and search/0 rejected at parse time
✅ Nine modifier flags (/s, /c, /C, /w, /W, /T, /t, /b, /f) parsed into SearchFlags struct
✅ Duplicate flag letters accepted idempotently
✅ Fast-path preservation for anchor-only flags (SIMD memchr when needs_byte_compare() is false)
✅ Binary-safe literal matching via memchr::memmem::find or compare_string_with_flags

Meta-type Directives (`name`, `use`, `default`, `clear`, `indirect`, `offset`)#

The parser supports six meta-type directives that represent control-flow rather than buffer reads. They all parse into the TypeKind::Meta(MetaType) AST variant and carry no endianness or width.

Type Keywords and MetaType Variants:

| Keyword | `MetaType` Variant | Role |
|---|---|---|
| `name <id>` | `MetaType::Name(String)` | Declares a named subroutine; children form the subroutine body |
| `use <id>` | `MetaType::Use(String)` | Invokes a named subroutine at the resolved offset |
| `default` | `MetaType::Default` | Fires only when no sibling at the same level has matched |
| `clear` | `MetaType::Clear` | Resets the per-level sibling-matched flag |
| `indirect` | `MetaType::Indirect` | Re-applies the root rule set at the resolved offset |
| `offset` | `MetaType::Offset` | Emits the resolved file position as `Value::Uint` for printf-style formatting |

Meta-types have bit_width() == None because they consume zero on-disk bytes.

ParsedMagic Return Type (Breaking Change):

parse_text_magic_file, load_magic_file, and load_magic_directory now return Result<ParsedMagic, ParseError> (not Result<Vec<MagicRule>, ParseError>). The ParsedMagic struct carries both the top-level rules and a name table:

pub struct ParsedMagic {
    pub rules: Vec<MagicRule>,
    pub(crate) name_table: NameTable,
}

Callers must destructure at the boundary:

use libmagic_rs::parser::parse_text_magic_file;

let magic = r#"0 string \x7fELF ELF file
>4 byte 1 32-bit"#;

let parsed = parse_text_magic_file(magic)?;
assert_eq!(parsed.rules.len(), 1); // One root rule
assert_eq!(parsed.rules[0].children.len(), 1); // One child rule
// parsed.name_table holds any `name <id>` blocks extracted at load time

Load-time Name Extraction:

Top-level name <id> rules are hoisted out of ParsedMagic::rules by parser::name_table::extract_name_table and placed into name_table keyed by identifier. As a result:

name rules do not appear in ParsedMagic::rules at all — only use <id> invocations remain to drive subroutine dispatch at evaluation time.
Duplicate name declarations keep the first definition and emit a warn!.
name rules that appear as children (not at level 0) are not well-defined in magic(5); they are scrubbed from the tree with a warn! during extraction.
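
The three extraction rules above can be sketched over a simplified rule tree. Everything here is hypothetical: `Kind`, `Rule`, and `extract_names_sketch` stand in for the real AST and `extract_name_table`, the name table is a plain `HashMap`, and the `warn!` logging is omitted.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the rule type kinds that matter here.
#[derive(Debug, Clone, PartialEq)]
pub enum Kind {
    Name(String),
    Use(String),
    Other,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Rule {
    pub kind: Kind,
    pub children: Vec<Rule>,
}

/// Removes `name` rules at any depth below the top level.
fn scrub_nested_names(children: &mut Vec<Rule>) {
    children.retain(|child| !matches!(child.kind, Kind::Name(_)));
    for child in children.iter_mut() {
        scrub_nested_names(&mut child.children);
    }
}

/// Splits top-level rules into the remaining rule list and a name table
/// that maps each identifier to its subroutine body.
pub fn extract_names_sketch(rules: Vec<Rule>) -> (Vec<Rule>, HashMap<String, Vec<Rule>>) {
    let mut table = HashMap::new();
    let mut kept = Vec::new();
    for mut rule in rules {
        // Misplaced nested `name` rules are dropped from every subtree.
        scrub_nested_names(&mut rule.children);
        match rule.kind {
            // Hoist the subroutine; `or_insert` keeps the first definition.
            Kind::Name(id) => {
                table.entry(id).or_insert(rule.children);
            }
            // Everything else, including `use`, stays in the rule list.
            _ => kept.push(rule),
        }
    }
    (kept, table)
}
```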

Features:

✅ All six keywords recognized by parse_type_keyword + type_keyword_to_kind
✅ Round-trip through serialize_type_kind in codegen.rs
✅ Top-level name extraction into NameTable
✅ Defensive scrubbing of misplaced nested name rules
✅ First-wins merge across directory loads

Parser Design Principles#

Error Handling#

All parsers use nom's IResult type for consistent error handling:

pub fn parse_number(input: &str) -> IResult<&str, i64> {
    // Implementation with proper error propagation
}

Error Categories:

Syntax Errors: Invalid characters or malformed input
Overflow Errors: Numbers too large for target type
Format Errors: Invalid hex digits, unterminated strings, etc.

Memory Safety#

All parsing operations are memory-safe with no unsafe code:

Bounds Checking: All buffer access is bounds-checked
Overflow Protection: Numeric parsing includes overflow detection
Resource Management: No manual memory management required

Performance Optimization#

The parser is designed for efficiency:

Zero-Copy: String slices used where possible to avoid allocations
Early Termination: Parsers fail fast on invalid input
Minimal Backtracking: Parser combinators designed to minimize backtracking

Testing Strategy#

Each parser component has comprehensive test coverage:

Test Categories#

Basic Functionality: Core parsing behavior
Edge Cases: Boundary values, empty input, etc.
Error Conditions: Invalid input handling
Whitespace Handling: Leading/trailing whitespace tolerance
Remaining Input: Proper handling of unconsumed input

Example Test Structure#

#[test]
fn test_parse_number_positive() {
    assert_eq!(parse_number("123"), Ok(("", 123)));
    assert_eq!(parse_number("0x1a"), Ok(("", 26)));
}

#[test]
fn test_parse_number_with_remaining_input() {
    assert_eq!(parse_number("123abc"), Ok(("abc", 123)));
    assert_eq!(parse_number("0xFF rest"), Ok((" rest", 255)));
}

#[test]
fn test_parse_number_edge_cases() {
    assert_eq!(parse_number("0"), Ok(("", 0)));
    assert_eq!(parse_number("-0"), Ok(("", 0)));
    assert!(parse_number("").is_err());
    assert!(parse_number("abc").is_err());
}

Complete Magic File Parsing#

The parser provides complete magic file parsing through the parse_text_magic_file() function:

use libmagic_rs::parser::parse_text_magic_file;

let magic_content = r#"
# ELF file format
0 string \x7fELF ELF executable
>4 byte 1 32-bit
>4 byte 2 64-bit
"#;

let parsed = parse_text_magic_file(magic_content)?;
assert_eq!(parsed.rules.len(), 1); // One root rule
assert_eq!(parsed.rules[0].children.len(), 2); // Two child rules
// parsed.name_table holds any top-level `name <id>` subroutine blocks

The parser distinguishes between signed and unsigned type variants (e.g., byte vs ubyte, leshort vs uleshort), mapping them to the signed field in TypeKind::Byte { signed: bool } and similar type variants. Unprefixed types default to signed in accordance with libmagic conventions. Float and double types do not have signed/unsigned variants; IEEE 754 handles sign internally.

Format Detection#

The parser automatically detects magic file formats:

use libmagic_rs::parser::{detect_format, MagicFileFormat};

match detect_format(path)? {
    MagicFileFormat::Text => { /* Parse as text magic file */ }
    MagicFileFormat::Directory => { /* Load all files from Magdir */ }
    MagicFileFormat::Binary => { /* Show helpful error (not yet supported) */ }
}

Current Limitations#

Not Yet Implemented#

Binary .mgc Format: Compiled magic database format

Planned Enhancements#

Better Error Messages: More descriptive error reporting with source locations
Performance Optimization: Specialized parsers for common patterns
Streaming Support: Incremental parsing for large magic files

Integration Points#

The parser provides a complete pipeline from text to AST:

use libmagic_rs::parser::{parse_text_magic_file, detect_format, MagicFileFormat};

// Detect format and parse accordingly
let parsed = match detect_format(path)? {
    MagicFileFormat::Text => {
        let content = std::fs::read_to_string(path)?;
        parse_text_magic_file(&content)?
    }
    MagicFileFormat::Directory => {
        // Load and merge all files in directory (rules + merged name table)
        load_magic_directory(path)?
    }
    MagicFileFormat::Binary => {
        return Err(ParseError::UnsupportedFormat { ... });
    }
};
// parsed.rules is the top-level rule list; parsed.name_table holds the `name` subroutine bodies that `use` rules invoke

The hierarchical structure is automatically built from indentation levels (> prefixes), enabling parent-child rule relationships for detailed file type identification.
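
How `>` prefixes turn into a tree can be shown with a small stack-based builder. This is a sketch under assumptions: `Node`, `fold`, and `build_hierarchy_sketch` are hypothetical names, each rule line is kept as raw text instead of being parsed, and a line that skips a level is attached to the deepest open rule.

```rust
/// A rule line with the `>` prefix stripped, plus its nested children.
#[derive(Debug, PartialEq)]
pub struct Node {
    pub text: String,
    pub children: Vec<Node>,
}

/// Pops the deepest open node and attaches it to its parent, or to the
/// root list when no parent is open.
fn fold(stack: &mut Vec<Node>, roots: &mut Vec<Node>) {
    if let Some(done) = stack.pop() {
        match stack.last_mut() {
            Some(parent) => parent.children.push(done),
            None => roots.push(done),
        }
    }
}

/// Builds a rule tree from magic text using the number of leading `>`
/// characters as the nesting level.
pub fn build_hierarchy_sketch(source: &str) -> Vec<Node> {
    let mut roots = Vec::new();
    // stack[i] is the currently open node at level i.
    let mut stack: Vec<Node> = Vec::new();
    for line in source.lines() {
        let line = line.trim();
        // Skip blank lines and comments.
        if line.is_empty() || line.starts_with('#') {
            continue;
        }
        let level = line.chars().take_while(|&c| c == '>').count();
        // Close every open node at this level or deeper.
        while stack.len() > level {
            fold(&mut stack, &mut roots);
        }
        stack.push(Node {
            text: line[level..].to_string(),
            children: Vec::new(),
        });
    }
    // Close whatever is still open at end of input.
    while !stack.is_empty() {
        fold(&mut stack, &mut roots);
    }
    roots
}
```

Fed the ELF example from the section above, this yields one root node with two children, matching the `parsed.rules` assertions shown there.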
