Magic File Compatibility Status#
libmagic-rs is a pure-Rust clean-room implementation of libmagic, the C library that powers the Unix file command for identifying file types. As of version 0.4.0 (released March 2026), the project is in early development with fundamental file identification capabilities operational. The implementation currently achieves 0% compatibility (0/81 tests passing) against the third-party GNU file test corpus, but follows a structured milestone-based roadmap targeting 95%+ compatibility with GNU file by version 1.0.0.
The implementation emphasizes memory safety with zero unsafe code, thread-safe design, and modern Rust error handling patterns. Development is organized through five strategic epics that address operator completeness, type system expansion, offset resolution mechanisms, specification compliance, and final compatibility validation. The project uses a multi-stage parser architecture built on nom combinators and implements hierarchical rule evaluation with graceful error handling.
This article provides comprehensive tracking of implemented versus unsupported features, magic file directive support status, version milestones spanning v0.1.0 through v1.0.0, and the detailed enhancement roadmap specific to the libmagic-rs implementation.
Current Implementation Status (v0.4.x)#
Implemented Features#
The v0.4.x releases provide foundational file identification capabilities across data types, operators, offsets, nested rules, and string matching. While limited in scope compared to GNU file, these features establish the architectural patterns for future expansion.
Data Types#
The evaluator currently supports fourteen basic data types with endianness variants:
- Byte: Single byte values (8-bit signed or unsigned)
- Short: 16-bit integers with native, little-endian, big-endian support (signed and unsigned)
- Long: 32-bit integers with endianness variants (signed and unsigned)
- Quad: 64-bit integers with endianness variants (signed and unsigned)
- Float: IEEE-754 32-bit floating point with endianness variants (float/befloat/lefloat). Implementation in `src/evaluator/types/float.rs` (PR #162 / v0.3.0) with comparison tolerance for approximate matching.
- Double: IEEE-754 64-bit floating point with endianness variants (double/bedouble/ledouble). Implementation in `src/evaluator/types/float.rs` (PR #162 / v0.3.0) with comparison tolerance for approximate matching.
- Date: 32-bit Unix timestamps (date/ldate/bedate/beldate/ledate/leldate) with endianness variants and UTC/local time formatting. Implementation in `src/evaluator/types/date.rs` uses the `chrono` crate for timestamp formatting with format string `"%a %b %e %H:%M:%S %Y"`, matching GNU file output.
- QDate: 64-bit Unix timestamps (qdate/qldate/beqdate/beqldate/leqdate/leqldate) with endianness variants and UTC/local time formatting. Shares formatting implementation with Date type for consistent output.
- String: Null-terminated or length-limited strings with UTF-8 conversion using SIMD-accelerated null scanning
- String16: UCS-2 (16-bit Unicode) strings with explicit byte order. Backs the magic(5) `lestring16` (little-endian) and `bestring16` (big-endian) keywords. Each character occupies two bytes; the reader stops at a U+0000 terminator (encoded as `0x00 0x00`) or at the buffer end. Implementation in `src/evaluator/types/string.rs` decodes code units to a Rust `String`, with surrogate pairs replaced by U+FFFD.
- PString: Pascal-style length-prefixed strings (pstring) with 1/2/4-byte length prefixes plus `/J` self-inclusive-length flag support. Implementation in `src/evaluator/types/string.rs` supports combinable flags (e.g., `pstring/HJ` for 2-byte length prefix with self-inclusive semantics) and provides bounds checking for both the length field and string data.
- Regex: POSIX-extended regular expression matching using `regex::bytes::Regex` for binary-safe pattern matching. Implementation in `src/evaluator/types/regex.rs` supports case-insensitive (`/c`), start-offset (`/s`), and line-based (`/l`) flags with scan window capped at 8192 bytes, matching GNU `file` behavior.
- Search: Bounded literal byte pattern search within a mandatory range using `memchr::memmem::find`. Implementation in `src/evaluator/types/search.rs` scans forward from the rule offset up to the specified range for the first occurrence of the literal pattern.
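The String16 behaviour described in the list can be illustrated with a short standard-library sketch. It is not the code in `src/evaluator/types/string.rs`; the names `Ucs2Order` and `read_string16` are hypothetical, and ignoring a trailing odd byte is an assumption made here for simplicity.

```rust
/// Byte order of a 16-bit string (hypothetical type mirroring the
/// lestring16 / bestring16 keywords).
#[derive(Debug, Clone, Copy)]
pub enum Ucs2Order {
    Little,
    Big,
}

/// Decode UCS-2 code units from `buf`, stopping at a U+0000 terminator
/// (bytes 0x00 0x00) or at the end of the buffer.
/// Assumption: a trailing odd byte is ignored.
pub fn read_string16(buf: &[u8], order: Ucs2Order) -> String {
    let mut out = String::new();
    for pair in buf.chunks_exact(2) {
        let unit = match order {
            Ucs2Order::Little => u16::from_le_bytes([pair[0], pair[1]]),
            Ucs2Order::Big => u16::from_be_bytes([pair[0], pair[1]]),
        };
        if unit == 0 {
            break; // U+0000 terminator reached
        }
        // For 16-bit values, char::from_u32 fails only on surrogate code
        // units (0xD800..=0xDFFF); those become U+FFFD.
        out.push(std::char::from_u32(u32::from(unit)).unwrap_or('\u{FFFD}'));
    }
    out
}
```

Calling `read_string16(b"H\0i\0\0\0", Ucs2Order::Little)` yields `"Hi"`, and a lone surrogate unit such as `0xD800` comes back as the replacement character.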
Operators#
Eighteen operators are fully implemented in evaluation:
- Equal (=, ==): Equality comparison with cross-type integer coercion
- NotEqual (!, !=, <>): Inequality comparison. Accepts bare `!` (magic(5) canonical form), `!=`, and `<>` as aliases.
- LessThan (<): Less-than comparison with cross-type integer coercion
- GreaterThan (>): Greater-than comparison with cross-type integer coercion
- LessEqual (<=): Less-than-or-equal comparison with cross-type integer coercion
- GreaterEqual (>=): Greater-than-or-equal comparison with cross-type integer coercion
- BitwiseAnd (&): Bitwise AND pattern matching (returns true if result is non-zero)
- BitwiseAndMask (&0xMASK): Applies mask to value before equality comparison
- BitwiseXor (^): Bitwise XOR pattern matching (returns true if result is non-zero)
- BitwiseNot (~): Applies bitwise complement to value before equality comparison
- AnyValue (x): Unconditional match that always returns true
The comparison operators provide version checks and range matching capabilities. The XOR, NOT, and any-value operators enable advanced binary pattern matching and unconditional match patterns.
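The operator semantics listed above can be sketched as a single match over an enum. This is an illustration rather than the project's `operators.rs`: the `Operator` enum shape, the `apply` function, and the `width_bits` parameter (used to keep the complement within the type's width) are assumptions. Widening both operands to `i128` before comparing is the coercion approach the article attributes to the evaluator.

```rust
/// Hypothetical operator enum covering the operators listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operator {
    Equal,
    NotEqual,
    LessThan,
    GreaterThan,
    LessEqual,
    GreaterEqual,
    BitwiseAnd,
    BitwiseAndMask(u64),
    BitwiseXor,
    BitwiseNot,
    AnyValue,
}

/// Evaluate `op` on a value read from the file (`actual`) and the value
/// written in the rule (`expected`). Both are widened to i128 by the caller
/// so signed and unsigned inputs compare without wrap-around.
/// `width_bits` (8..=64) is an assumed parameter used only by BitwiseNot.
pub fn apply(op: Operator, actual: i128, expected: i128, width_bits: u32) -> bool {
    match op {
        Operator::Equal => actual == expected,
        Operator::NotEqual => actual != expected,
        Operator::LessThan => actual < expected,
        Operator::GreaterThan => actual > expected,
        Operator::LessEqual => actual <= expected,
        Operator::GreaterEqual => actual >= expected,
        // True when any bit is shared between the two operands.
        Operator::BitwiseAnd => (actual & expected) != 0,
        // Mask the file value first, then compare for equality.
        Operator::BitwiseAndMask(mask) => (actual & i128::from(mask)) == expected,
        // True when the operands differ in at least one bit.
        Operator::BitwiseXor => (actual ^ expected) != 0,
        // Complement within the type's width, then compare for equality.
        Operator::BitwiseNot => {
            let width_mask = (1i128 << width_bits) - 1;
            (!actual & width_mask) == expected
        }
        Operator::AnyValue => true,
    }
}
```

With this shape, a signed byte holding -1 compares as less than an unsigned byte holding 255, because both are widened before the comparison.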
Offsets#
Four offset types are fully operational, including complete indirect and relative support:
- Absolute offsets: Positive offsets from file start; negative offsets treated as offsets from file end
- FromEnd offsets: Explicit offsets from file end with comprehensive bounds checking
- Relative offsets: Resolve as `last_match_end + delta` against the previous-match anchor, following GNU `file`/libmagic semantics. The evaluator threads the anchor through `EvaluationContext`, advancing it after each successful match by the bytes consumed (variable-width types include c-string NUL terminators and pstring length prefixes). Top-level relative offsets resolve from anchor 0. Fully evaluated and implemented in PR #211.
- Indirect offsets: Pointer dereferencing where a value read from one offset specifies the test location. Critical for PE executables and Office documents. Fully evaluated and implemented in PR #42.
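A minimal sketch of the resolution rules for absolute, from-end, and relative offsets follows. The `Offset` enum and `resolve` function are hypothetical stand-ins, not the evaluator's actual types; the sketch assumes that `FromEnd` carries a zero or negative value and that a resolved position equal to the buffer length is out of bounds. Indirect offsets are left out because they also need a pointer read.

```rust
/// Hypothetical offset kinds (indirect offsets omitted).
#[derive(Debug, Clone, Copy)]
pub enum Offset {
    /// Non-negative: from file start. Negative: from file end.
    Absolute(i64),
    /// Explicit distance from file end (assumed zero or negative).
    FromEnd(i64),
    /// Delta applied to the end of the previous match.
    Relative(i64),
}

/// Resolve `offset` to an index inside a buffer of `buf_len` bytes, or
/// return None when the result falls outside the buffer.
/// `last_match_end` is the anchor; top-level rules pass 0.
pub fn resolve(offset: Offset, buf_len: usize, last_match_end: usize) -> Option<usize> {
    // i128 arithmetic cannot overflow for any i64/usize inputs.
    let len = buf_len as i128;
    let pos: i128 = match offset {
        Offset::Absolute(n) if n >= 0 => i128::from(n),
        Offset::Absolute(n) | Offset::FromEnd(n) => len + i128::from(n),
        Offset::Relative(delta) => last_match_end as i128 + i128::from(delta),
    };
    if (0..len).contains(&pos) {
        Some(pos as usize)
    } else {
        None
    }
}
```

For a 10-byte buffer, `Absolute(-2)` resolves to index 8, and `Relative(3)` with an anchor of 4 resolves to index 7.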
Nested Rules#
The evaluator implements hierarchical rule evaluation where child rules are only evaluated if parent rules match. The implementation includes:
- Stack-based hierarchy construction from flat rule lists
- Recursion depth tracking with configurable limits
- Graceful error handling where individual rule failures are skipped while critical errors propagate
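Stack-based hierarchy construction can be sketched as follows. The `Rule` struct and `build_hierarchy` function are hypothetical simplifications (a real rule also carries an offset, type, operator, and value); the input is a flat list of `(level, message)` pairs in file order, where the level corresponds to the number of leading `>` characters in a magic file.

```rust
/// Hypothetical, simplified rule node.
#[derive(Debug, PartialEq)]
pub struct Rule {
    pub level: usize,
    pub message: String,
    pub children: Vec<Rule>,
}

/// Move a finished rule into its parent (the new stack top) or, if the
/// stack is empty, into the list of top-level rules.
fn attach(done: Rule, stack: &mut Vec<Rule>, roots: &mut Vec<Rule>) {
    match stack.last_mut() {
        Some(parent) => parent.children.push(done),
        None => roots.push(done),
    }
}

/// Build a parent/child tree from a flat, ordered list of (level, message).
pub fn build_hierarchy(flat: Vec<(usize, &str)>) -> Vec<Rule> {
    let mut roots = Vec::new();
    let mut stack: Vec<Rule> = Vec::new();
    for (level, message) in flat {
        // Close every open rule at the same or a deeper level.
        while stack.last().map_or(false, |top| top.level >= level) {
            let done = stack.pop().expect("stack is non-empty here");
            attach(done, &mut stack, &mut roots);
        }
        stack.push(Rule {
            level,
            message: message.to_string(),
            children: Vec::new(),
        });
    }
    // Flush whatever is still open at end of input.
    while let Some(done) = stack.pop() {
        attach(done, &mut stack, &mut roots);
    }
    roots
}
```

A rule is attached to its parent only once it is popped, so each node is moved exactly once and no shared ownership is needed.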
String Matching#
String evaluation implements:
- Null-termination detection: Reads until the first NUL byte or buffer end
- Length constraints: Respects optional `max_length` parameters
- UTF-8 conversion: Uses `String::from_utf8_lossy` to replace invalid sequences with replacement characters
- SIMD optimization: Leverages the `memchr` crate for efficient null byte scanning
- Byte-exact comparison: For rules with comparison values (e.g., `0 string PATTERN`), reads exactly the pattern length from the file with no NUL truncation, matching magic(5) semantics where embedded NULs in patterns must match corresponding bytes in the file
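The two string paths described above (NUL-terminated reads and byte-exact pattern comparison) can be sketched with the standard library alone. The function names are hypothetical, and a plain iterator scan stands in for the `memchr` crate so the example has no dependencies.

```rust
/// Read a string starting at `offset`, stopping at the first NUL byte, at
/// `max_length` bytes, or at the end of the buffer, whichever comes first.
/// Returns None when `offset` lies beyond the buffer.
pub fn read_c_string(buf: &[u8], offset: usize, max_length: Option<usize>) -> Option<String> {
    let tail = buf.get(offset..)?;
    let window = match max_length {
        Some(max) => &tail[..tail.len().min(max)],
        None => tail,
    };
    // Stand-in for memchr: locate the first NUL byte.
    let end = window.iter().position(|&b| b == 0).unwrap_or(window.len());
    // Invalid UTF-8 sequences become U+FFFD.
    Some(String::from_utf8_lossy(&window[..end]).into_owned())
}

/// Byte-exact comparison: read exactly `pattern.len()` bytes at `offset`
/// with no NUL truncation, so embedded NULs in the pattern must be present
/// in the file as well.
pub fn matches_pattern(buf: &[u8], offset: usize, pattern: &[u8]) -> bool {
    offset
        .checked_add(pattern.len())
        .and_then(|end| buf.get(offset..end))
        .map_or(false, |slice| slice == pattern)
}
```

The second function is what lets a pattern such as `PK\0\0` match: a NUL-terminated read would stop after `PK` and never see the trailing bytes.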
Parsed But Not Yet Evaluated#
This section previously tracked indirect offsets and relative offsets. Both offset types are fully implemented and evaluated in the evaluator/offset/ submodule:
- Indirect offsets: Working implementation in `src/evaluator/offset/indirect.rs` (PR #42)
- Relative offsets: Working implementation in `src/evaluator/offset/relative.rs` (PR #211)
Directive Support Status#
Magic file directives control rule behavior and output formatting. The current implementation fully supports strength modification but lacks support for MIME type specification, file extension hints, and named test composition.
Fully Implemented: !:strength#
The parser fully supports the `!:strength` directive for modifying rule confidence scores. Supported operations include:
- Addition: `!:strength +N`
- Subtraction: `!:strength -N`
- Multiplication: `!:strength *N`
- Division: `!:strength /N`
- Absolute set: `!:strength =N` or `!:strength N`
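The five operations can be modelled as a small enum plus a parse-and-apply pair. This is a sketch under assumptions: `StrengthModifier`, `parse_strength`, and `apply_strength` are hypothetical names, and the choices to saturate on overflow and to leave the score unchanged on division by zero are made here for illustration, not taken from the project.

```rust
/// Hypothetical representation of a `!:strength` directive.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StrengthModifier {
    Add(i32),
    Subtract(i32),
    Multiply(i32),
    Divide(i32),
    Set(i32),
}

/// Parse a line such as "!:strength +10" or "!:strength 50".
/// A bare number is treated as an absolute set, like "=N".
pub fn parse_strength(line: &str) -> Option<StrengthModifier> {
    let rest = line.trim().strip_prefix("!:strength")?.trim();
    let first = rest.chars().next()?;
    let (op, digits) = match first {
        '+' | '-' | '*' | '/' | '=' => (first, rest[1..].trim()),
        _ => ('=', rest),
    };
    let n: i32 = digits.parse().ok()?;
    Some(match op {
        '+' => StrengthModifier::Add(n),
        '-' => StrengthModifier::Subtract(n),
        '*' => StrengthModifier::Multiply(n),
        '/' => StrengthModifier::Divide(n),
        _ => StrengthModifier::Set(n),
    })
}

/// Apply a modifier to a base score.
/// Assumptions: arithmetic saturates, and a division that cannot be
/// performed (by zero, or overflowing) leaves the score unchanged.
pub fn apply_strength(base: i32, modifier: StrengthModifier) -> i32 {
    match modifier {
        StrengthModifier::Add(n) => base.saturating_add(n),
        StrengthModifier::Subtract(n) => base.saturating_sub(n),
        StrengthModifier::Multiply(n) => base.saturating_mul(n),
        StrengthModifier::Divide(n) => base.checked_div(n).unwrap_or(base),
        StrengthModifier::Set(n) => n,
    }
}
```

For example, parsing `!:strength *2` and applying it to a base score of 70 gives 140.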
Implemented: Meta-Type Directives#
The parser and evaluator fully support six meta-type directives for control-flow and rule composition, implemented in PR #42 with evaluator dispatch and printf-style format substitution (%d, %u, %x, %X, %o, %s, %c) in message rendering:
- name: Named test definitions for rule composition
- use: References to named subroutines with endian-flip support
- default: Conditional execution when no sibling at the same level matched
- clear: Resets the default flag for subsequent siblings
- indirect: Re-applies root rules at the resolved offset
- offset: Reports the file position as a value (for message substitution)
Hex specifiers mask to the type's natural bit width, avoiding sign-extended renderings.
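The masking behaviour can be shown in a few lines. The helper name `format_hex` and its `width_bits` parameter are hypothetical; the point is only that a negative value stored in a wide integer must be truncated to the width of the type that was read before it is printed.

```rust
/// Render `value` as lowercase hex, keeping only the low `width_bits` bits.
/// `width_bits` is the natural width of the magic type (8, 16, 32, or 64).
pub fn format_hex(value: i64, width_bits: u32) -> String {
    let mask: u64 = if width_bits >= 64 {
        u64::MAX
    } else {
        (1u64 << width_bits) - 1
    };
    format!("{:x}", (value as u64) & mask)
}
```

Without the mask, a signed byte holding -1 that has been widened to 64 bits would render as `ffffffffffffffff`; with it, `format_hex(-1, 8)` renders as `ff`.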
Not Yet Implemented#
The following directives are parsed and silently skipped at preprocessing time. Only !:strength is parsed and evaluated among !: directives:
- !:mime: MIME type specification for structured output (planned for v0.6.0 Directive extension point)
- !:ext: File extension suggestions (planned for v0.6.0 Directive extension point)
- !:apple: Apple-specific metadata annotations (planned for v0.6.0 Directive extension point)
These directives parse cleanly (so magic files using them can load) but are not evaluated. They are silently dropped during preprocessing rather than causing errors, allowing system magic databases like /usr/share/file/magic/filesystems to load end-to-end.
Feature Gaps and Development Epics#
Development is organized into five epics that systematically address compatibility gaps. Each epic targets specific version milestones and has defined success criteria.
Epic #53: Operator Completeness (v0.2.0 + v0.4.0)#
Status: Complete. All 18 required operators are implemented.
Implemented operators:
- Comparison operators (<, >, <=, >=) implemented in PR #104 - These operators enable version checks and range matching, unlocking compatibility with magic rules that test numeric ranges. Released in v0.2.0.
- Bitwise XOR (^), NOT (~), and any-value (x) operators implemented in PR #145 - Resolves Issue #35. Required for advanced binary pattern matching and unconditional match patterns. Released in v0.4.0.
Epic #54: Type System Expansion (v0.2.0 + v0.3.0)#
Status: 14 of 33+ types implemented. The type system expansion is split across two releases to manage code complexity.
Version 0.2.0 targets:
- Quad (64-bit integer) with endian variants - Implemented. Supports signed and unsigned 64-bit integers with full endianness support for modern binary formats
- Date and timestamp types tracked in Issue #41 - Implemented. Supports 32-bit (Date: date/ldate/bedate/beldate/ledate/leldate) and 64-bit (QDate: qdate/qldate/beqdate/beqldate/leqdate/leqldate) Unix timestamps with endianness variants and UTC/local time formatting
- Pstring (Pascal string) tracked in Issue #43 - Implemented. Length-prefixed strings (pstring) with 1/2/4-byte length prefixes plus `/J` self-inclusive-length flag support
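A hedged sketch of a pstring read with bounds checks on both the length field and the string data is given below. `LenPrefix` and `read_pstring` are hypothetical names; the prefix variants follow the magic(5) flag letters (`B`, `H`, `h`, `L`, `l`), and `self_inclusive` models the `/J` flag, where the stored length also counts the length field itself. The sketch assumes a 32-bit or wider target.

```rust
/// Width and byte order of the length prefix (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub enum LenPrefix {
    OneByte, // /B (default)
    TwoBe,   // /H
    TwoLe,   // /h
    FourBe,  // /L
    FourLe,  // /l
}

/// Return the string bytes that follow the length prefix, or None when
/// either the prefix or the declared data would run past the buffer.
pub fn read_pstring(buf: &[u8], prefix: LenPrefix, self_inclusive: bool) -> Option<&[u8]> {
    let width = match prefix {
        LenPrefix::OneByte => 1,
        LenPrefix::TwoBe | LenPrefix::TwoLe => 2,
        LenPrefix::FourBe | LenPrefix::FourLe => 4,
    };
    // Bounds check on the length field itself.
    let head = buf.get(..width)?;
    let declared = match prefix {
        LenPrefix::OneByte => head[0] as usize,
        LenPrefix::TwoBe => u16::from_be_bytes([head[0], head[1]]) as usize,
        LenPrefix::TwoLe => u16::from_le_bytes([head[0], head[1]]) as usize,
        LenPrefix::FourBe => u32::from_be_bytes([head[0], head[1], head[2], head[3]]) as usize,
        LenPrefix::FourLe => u32::from_le_bytes([head[0], head[1], head[2], head[3]]) as usize,
    };
    // With /J the declared length includes the prefix bytes.
    let data_len = if self_inclusive {
        declared.checked_sub(width)?
    } else {
        declared
    };
    // Bounds check on the string data.
    buf.get(width..width.checked_add(data_len)?)
}
```

Under `pstring/HJ`, a buffer beginning `00 07` followed by `hello` decodes to `hello`, since the declared 7 bytes include the 2-byte prefix.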
Version 0.3.0 targets:
- Float and double with endian variants tracked in Issue #40 - Implemented in PR #162. IEEE-754 32-bit (float/befloat/lefloat) and 64-bit (double/bedouble/ledouble) floating point with comparison tolerance for scientific data and image metadata
- Regex and search types tracked in Issue #39 - Implemented. Binary-safe regex matching via `regex::bytes::Regex` with `/c`, `/s`, `/l` flag support, and bounded literal search via `memchr::memmem::find` for text file detection (JSON, scripts, XML)
- Meta-types (default, clear, name, use, indirect, offset) tracked in Issue #42 - Implemented in PR #42. Supports control-flow directives for rule composition, conditional execution, subroutines, and root rule re-entry
- UCS-2 string types (lestring16, bestring16) tracked in Issue #232 - Implemented. 16-bit Unicode strings with explicit byte order, NUL-termination detection, and surrogate pair handling
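The float and double entries above mention a comparison tolerance. The article does not state how that tolerance is defined, so the following is only one plausible shape: a relative tolerance scaled by the larger operand, with the names `approx_eq` and `read_lefloat` and the `rel_tol` parameter all being assumptions.

```rust
/// Approximate equality with a relative tolerance (assumed definition).
/// NaN never matches; infinities match only themselves.
pub fn approx_eq(actual: f64, expected: f64, rel_tol: f64) -> bool {
    if actual.is_nan() || expected.is_nan() {
        return false;
    }
    if actual.is_infinite() || expected.is_infinite() {
        return actual == expected;
    }
    // Scale by the larger magnitude, with a floor of 1.0 near zero.
    let scale = actual.abs().max(expected.abs()).max(1.0);
    (actual - expected).abs() <= rel_tol * scale
}

/// Read a little-endian 32-bit float (the `lefloat` type) at `offset`.
pub fn read_lefloat(buf: &[u8], offset: usize) -> Option<f32> {
    let end = offset.checked_add(4)?;
    let b = buf.get(offset..end)?;
    Some(f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}
```

A tolerance matters because decimal constants in a magic file rarely have an exact binary representation: `0.1 + 0.2` and `0.3` differ in their last bits yet should be treated as the same value.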
Epic #55: Offset Resolution (v0.3.0)#
Status: Complete. Both relative and indirect offset mechanisms are fully implemented.
- Relative offset resolution tracked in Issue #38 - Implemented in PR #211. Resolves `Relative(delta)` as `last_match_end + delta` with bounds checking. Follows GNU `file` semantics where the anchor is global-monotonic across child recursion (no save/restore). Magic-file parser syntax (&+N/&-N) remains TODO.
- Indirect offset resolution tracked in Issue #37 - Implemented in PR #42. Pointer dereferencing where a value read from one offset specifies the test location, critical for PE executables and Office documents.
Submodule files will be pre-created as placeholders (offset_indirect.rs, offset_relative.rs) to prevent the evaluator module from becoming oversized.
Epic #56: Core Flow Spec Compliance (v0.5.0)#
Status: 7 of 12 flows complete. This epic ensures the library and CLI behave exactly as documented in the Core Flows specification. Version 0.5.0 has been released, and work on the v0.5.x series is in progress.
Completed flows:
- Flow 1 (CLI Single File), Flow 2 (CLI Multiple Files), Flow 4 (Library Simple Usage), Flow 6 (Public Evaluation APIs), Flow 7 (Error Communication), Flow 9 (Hierarchical Matching), Flow 10 (Stdin Input)
Open gaps:
- Flow 3: Magic Discovery tracked in Issue #49 - Error messages lack actionable suggestions when magic files cannot be found
- Flow 5: Advanced API tracked in Issue #45 - Builder pattern for configuration not yet implemented
- Flow 8: JSON Output tracked in Issue #46 - Metadata object missing from JSON output structure
- Flow 11: Corrupted Files tracked in Issue #47 - Parse warnings are silently swallowed instead of being reported
- Flow 12: Timeout Handling tracked in Issue #44 - Returns hard error instead of partial results when timeout occurs
Epic #57: Compatibility Validation & v1.0 (v1.0.0)#
Status: 0/81 tests passing (0%). This epic validates compatibility against the GNU file test corpus and serves as the final gate for production release. All prerequisites from epics #53-#56 must complete before validation can begin.
Current test corpus results:
- 76 tests return "data" indicating no matching rule
- 5 tests detect container format but miss specific subtype (e.g., ZIP detected but not DOCX)
Compatibility targets by format category:
| Category | Tests | Target Pass Rate | Current |
|---|---|---|---|
| Binary formats (RPM, zstd, PGP, etc.) | ~45 | 43+ (96%) | 0 |
| Text formats (JSON, scripts, PNM) | ~15 | 14+ (93%) | 0 |
| Audio formats (MP3, DSD) | ~5 | 5 (100%) | 0 |
| ZIP subtypes (DOCX, XLSX, HWP) | ~5 | 4+ (80%) | 0 |
| Custom magic tests | ~4 | 4 (100%) | 0 |
| Filesystem images | ~3 | 3 (100%) | 0 |
Version Milestones and Release History#
The project follows semantic versioning with incremental feature releases building toward full GNU file compatibility at version 1.0.0.
v0.1.0 - Released February 15, 2026#
The first public release established baseline functionality with these components:
- CLI tool (`rmagic`) supporting file and stdin input
- Built-in magic rules for 10 common file formats: ELF, PE, ZIP, TAR, GZIP, JPEG, PNG, GIF, BMP, and PDF
- Text magic file parser with basic data types and operators
- Core library API including `MagicDatabase`, `evaluate_file()`, and `evaluate_buffer()`
- Dual output formats: JSON and human-readable text
- Comprehensive test coverage at 94.32%
v0.1.1 - Released February 15, 2026#
An API-compatible maintenance release addressing miscellaneous tasks and regenerating the changelog to fix duplicate entries. No breaking changes or new features.
v0.2.0 - Released March 1, 2026#
Focus: Operator completeness and comparison capabilities. This release unlocks significant compatibility gains through enhanced numeric comparison capabilities.
Implemented features:
- Comparison operators (<, >, <=, >=) via PR #104 - Enables version checks and range matching for numeric values, unlocking magic rules that test numeric ranges
- Quad (64-bit integer) type via PR #133 - Provides signed and unsigned 64-bit integer support with endianness variants for modern binary formats
- Date and timestamp types via PR #165 - Provides 32-bit (Date) and 64-bit (QDate) Unix timestamp support with endianness variants and UTC/local time formatting for archives and filesystem images
v0.3.0 - Planned#
Focus: Offset resolution and type system expansion. This release adds support for indirect offsets, text file detection, and advanced numeric types.
Completed features:
- Relative offset resolution via Issue #38 - Implemented in PR #211. Resolves `Relative(delta)` as `last_match_end + delta` following GNU `file` semantics.
- Float and double with endian variants via Issue #40 - Implemented in PR #162. IEEE-754 32-bit (float/befloat/lefloat) and 64-bit (double/bedouble/ledouble) floating point with comparison tolerance for scientific data and image metadata.
- Regex and search type matching via Issue #39 - Implemented in PR #214. Binary-safe regex matching with case-insensitive, start-offset, and line-based flags, plus bounded literal search. Enables text file detection (JSON, XML, scripts).
- Indirect offsets and meta-type directives via Issue #37 and #42 - Implemented in PR #42. Pointer-chasing for PE headers and compound documents, plus control-flow directives (`default`, `clear`, `name`, `use`, `indirect`, `offset`) for rule composition and conditional execution.
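The bounded literal search mentioned in the list can be sketched without external crates. This stand-in uses `slice::windows` where the project is said to use `memchr::memmem::find`; the function name `bounded_search` is hypothetical, and interpreting the range as the number of candidate start positions is an assumption.

```rust
/// Find the first occurrence of `pattern` whose start lies within `range`
/// positions of `offset`. Returns the absolute index of the match.
/// Assumption: `range` counts candidate start positions, not bytes scanned.
pub fn bounded_search(buf: &[u8], offset: usize, range: usize, pattern: &[u8]) -> Option<usize> {
    if pattern.is_empty() {
        return None; // windows(0) would panic; treat as no match
    }
    let tail = buf.get(offset..)?;
    tail.windows(pattern.len())
        .take(range)
        .position(|window| window == pattern)
        .map(|index| offset + index)
}
```

With this interpretation, searching `xx<?xml` for `<?xml` from offset 0 succeeds with a range of 3 and fails with a range of 2, because the match starts at index 2.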
Planned features:
- ZIP content inspection enhancements
Mandatory prerequisite refactoring:
- Pre-create evaluator submodules via Issue #62 to prevent `operators.rs` from exceeding the 1,620-line complexity threshold
- Extract CLI tests to integration tests via Issue #61 to reduce `main.rs` from 1,647 lines to under 600 lines
- Convert evaluator/types.rs to directory module via Issue #63 to prevent the file from exceeding the 3,500-4,000 line complexity threshold before adding new types
v0.4.0 - Released March 6, 2026#
Focus: Bitwise operator completion. This release completes the operator implementation roadmap with advanced bitwise operations.
Implemented features:
- Bitwise XOR (^), NOT (~), and any-value (x) operators via PR #145 - Enables advanced binary pattern matching and unconditional match patterns
Compatibility impact:
- Increased operator coverage to 100% (18 of 18 operators implemented)
- Enhanced support for magic files using bitwise operations
- Breaking change: Added new Operator enum variants (requires handling in exhaustive matches)
v0.5.0 - Released#
Focus: Core flow specification compliance. This release ensures all documented behaviors work exactly as specified. Work on the v0.5.x series is in progress.
Implementation goals:
- Complete all 12 Core Flows as specified with no deviations
- Builder pattern API for advanced configuration
- Enhanced JSON output with metadata object
- Improved error messages with actionable troubleshooting suggestions
- Timeout handling that returns partial results instead of hard errors
v1.0.0 - Planned#
Focus: Production-ready compatibility validation. The 1.0.0 release marks the project as production-ready with validated GNU file compatibility.
Release criteria:
- Achieve 95%+ compatibility with GNU file (minimum 77/81 tests passing)
- All Core Flows implemented and validated per specification
- Maintain >85% test coverage across the codebase
- Stable, documented public API suitable for production use
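The 77/81 minimum in the first criterion follows from rounding 95% of 81 (76.95) up to the next whole test. A small helper, with the hypothetical name `min_passing`, makes the arithmetic explicit using integer math only:

```rust
/// Smallest number of passing tests that reaches `target_percent` of
/// `total`, i.e. ceil(total * target_percent / 100).
pub fn min_passing(total: u32, target_percent: u32) -> u32 {
    (total * target_percent + 99) / 100
}
```

For the current corpus, `min_passing(81, 95)` is 77; 76 passing tests would be 93.8%, just short of the target.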
Required dependencies:
- Completion of all epics #53 through #56
- ZIP content inspection implementation via Issue #51
- Compatibility baseline measurement and CI tracking via Issue #48
Enhancement Roadmap and Development Strategy#
The project follows a carefully sequenced development approach that prioritizes prerequisite work and code quality maintenance alongside feature implementation.
Sequential Development Phases#
The roadmap emphasizes mandatory prerequisite completion ordering to prevent technical debt:
- v0.1.0: Baseline release with core functionality - Completed February 2026
- v0.2.0: Comparison operators - Completed March 2026
- v0.4.0: Bitwise operators - Completed March 2026
- v0.3.0: Refactoring tasks (#61, #62, #63) must complete before implementing offset resolution (#37, #38) and additional type features
- v0.5.0: All prior feature phases must complete before beginning specification compliance validation
- v1.0.0: All epics (#53-#56) must complete before final compatibility validation and production release
Critical Path Features#
These features have the highest impact on overall compatibility and unblock the most test cases:
- Comparison operators via PR #104 - Released in v0.2.0. Enable version checks and range matching for numeric values.
- Relative offset resolution via PR #211 - Implemented. Resolves offsets relative to previous match locations, essential for nested structures.
- Regex and search types via PR #214 - Implemented. Binary-safe regex matching with `/c`, `/s`, `/l` flags and bounded literal search, enabling text file detection (JSON, XML, scripts).
- Meta-type directives via PR #42 - Implemented. Control-flow directives (`default`, `clear`, `name`, `use`, `indirect`, `offset`) for rule composition, conditional execution, and subroutines. Required for advanced magic database patterns.
- Indirect offsets via PR #42 - Implemented. Pointer dereferencing where a value read from one offset specifies the test location, required for detecting PE executables, Office documents, and formats using internal pointers.
Code Complexity Management#
The project enforces specific code complexity thresholds that trigger mandatory refactoring before feature additions:
- `src/evaluator/operators.rs`: Must not exceed 1,620 lines
- `src/evaluator/types.rs`: Must not exceed 3,500-4,000 lines
- `src/main.rs`: Target maximum of 600 lines
These thresholds prevent individual files from becoming unmaintainable as features expand and encourage proper modularization.
Implementation Architecture#
The libmagic-rs codebase is organized into distinct modules handling parsing, evaluation, and rule management. Key implementation files and their current status:
| File | Purpose | Implementation Status |
|---|---|---|
| src/parser/grammar.rs | Magic file parsing using nom combinators | 2,448 lines; implements parsing for basic types, operators, offsets, and the !:strength directive |
| src/parser/ast.rs | Abstract syntax tree definitions | Defines all offset types including indirect/relative; includes TODO notes for validation method additions |
| src/parser/hierarchy.rs | Stack-based hierarchy construction | Converts flat rule lists parsed from magic files into parent-child tree structures |
| src/evaluator/mod.rs | Core rule evaluation engine | 474 lines; implements hierarchical rule evaluation with graceful error handling and timeout checking |
| src/evaluator/offset.rs | Offset resolution logic | Implements absolute and FromEnd offsets; delegates to indirect.rs (PR #42) and relative.rs (PR #211) submodules for full offset evaluation |
| src/evaluator/types.rs | Type reading and coercion | Implements byte, short, long, quad, string with SIMD optimizations; target for directory module conversion |
| src/evaluator/operators.rs | Operator evaluation | Implements =, !=, <, >, <=, >=, &, &mask, ^, ~, x with cross-type integer coercion using i128 intermediate type |
| docs/src/compatibility.md | Compatibility tracking document | Comprehensive comparison of libmagic versus libmagic-rs features with implementation status |
Architectural Differences from Original libmagic#
libmagic-rs diverges from the original C implementation in several fundamental design decisions that improve safety and maintainability at the cost of current feature parity.
Memory Safety#
- Original libmagic: Manual memory management with potential for leaks and corruption
- libmagic-rs: Zero unsafe code with `unsafe_code = "forbid"` lint enforced project-wide, automatic memory management via Rust's ownership system
Error Handling#
- Original libmagic: Integer error codes with global error state
- libmagic-rs: Typed Result-based errors with structured error types including ParseError, EvaluationError, ConfigError, and Timeout
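The contrast with integer error codes can be illustrated with a typed error enum. The sketch below is not the library's actual error type: the enum name `SketchError`, its variant names, and its fields are hypothetical, chosen only to echo the four error categories named above.

```rust
use std::fmt;

/// Hypothetical error type with one variant per category.
#[derive(Debug)]
pub enum SketchError {
    Parse { line: usize, message: String },
    Evaluation(String),
    Config(String),
    Timeout { timeout_ms: u64 },
}

impl fmt::Display for SketchError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SketchError::Parse { line, message } => {
                write!(f, "parse error at line {}: {}", line, message)
            }
            SketchError::Evaluation(msg) => write!(f, "evaluation error: {}", msg),
            SketchError::Config(msg) => write!(f, "config error: {}", msg),
            SketchError::Timeout { timeout_ms } => {
                write!(f, "timeout after {} ms", timeout_ms)
            }
        }
    }
}

impl std::error::Error for SketchError {}

/// Callers match on the variant instead of consulting global error state.
pub fn describe(result: Result<String, SketchError>) -> String {
    match result {
        Ok(description) => description,
        Err(SketchError::Timeout { timeout_ms }) => {
            format!("timed out after {} ms", timeout_ms)
        }
        Err(other) => format!("error: {}", other),
    }
}
```

Because each variant carries its own data, a caller can treat a timeout differently from a malformed magic file without parsing message strings.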
Thread Safety#
- Original libmagic: Requires external synchronization for concurrent use
- libmagic-rs: MagicDatabase is safe to share across threads via Arc with no additional synchronization needed
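The sharing pattern looks roughly like the following. To keep the example self-contained, `StubDatabase` is a hypothetical stand-in for `MagicDatabase` with a toy prefix-matching `evaluate_buffer`; only the `Arc` usage is the point, and the real method signatures may differ.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the real database: a list of
/// (magic prefix, description) pairs that is read-only after construction.
pub struct StubDatabase {
    pub rules: Vec<(&'static [u8], &'static str)>,
}

impl StubDatabase {
    /// Toy evaluation: first rule whose prefix matches, else "data".
    pub fn evaluate_buffer(&self, buf: &[u8]) -> &'static str {
        self.rules
            .iter()
            .find(|(magic, _)| buf.starts_with(magic))
            .map_or("data", |(_, name)| *name)
    }
}

/// Identify each input on its own thread, sharing one database via Arc.
/// No Mutex is needed because evaluation only takes `&self`.
pub fn identify_in_parallel(db: Arc<StubDatabase>, inputs: Vec<Vec<u8>>) -> Vec<&'static str> {
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|buf| {
            let db = Arc::clone(&db);
            thread::spawn(move || db.evaluate_buffer(&buf))
        })
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}
```

Results are collected in input order because the join handles are drained in the order the threads were spawned.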
Performance Characteristics#
Preliminary benchmarks show competitive or superior performance despite early development status:
- Single ELF file identification: 1.17× faster than GNU file
- Batch processing (1000 small files): 1.09× faster
- Large file processing (1GB): 1.07× faster
- Magic file database loading: 1.5× faster
- Base memory footprint: ~1.5MB versus ~2MB for GNU file
- Large file memory usage: ~2MB versus ~16MB (benefits from memory-mapped I/O)
Related Topics#
- Magic File Format Specification: The text-based format used to define file identification rules with offsets, types, operators, and messages
- GNU file Command: The reference implementation that libmagic-rs aims to replace while maintaining compatibility
- File Type Detection: Techniques for identifying file formats through magic numbers, headers, and structural patterns
- Rust Systems Programming: Memory-safe alternatives to traditional C system utilities
- nom Parser Combinators: The Rust parser combinator library used to implement the magic file grammar parser