String Rule Parser-Evaluator Bug Interaction Pattern

String Rule Parser-Evaluator Bug Interaction Pattern#

This pattern documents a class of multi-layer silent failures in libmagic-rs where independently correct components conspire to break end-to-end string rule matching. The canonical instance (PR #233) involved three bugs across parser, evaluator, and operator layers that caused every magic string rule containing backslash escapes — \0, \177ELF, \xNN hex sequences — to silently return "data" instead of the correct file type. No errors, no warnings, no log output at any level. Synthetic-fixture conformance against GNU file was 0/5 before the fix.

The pattern is distinct from the Comparison Operators and compare_values Helper Pattern (which covers operator architecture) and the Cross-Type Integer Coercion Pattern (integer-specific).

The Three-Bug Chain#

The DSL promise for string PATTERN is: do the first len(PATTERN) bytes of the buffer equal the byte sequence the rule author wrote? Three independent invariants must hold simultaneously to honor that contract:

Parser stores the right bytes. The pattern in the AST must equal the byte sequence the author wrote, including NULs encoded as \0, control bytes as \NNN octal, and arbitrary bytes as \xNN hex.
Evaluator reads the right number of bytes. The buffer-side read must produce exactly len(pattern) bytes, regardless of content.
Operator compares bytes, not Value discriminants. Equality must compare underlying byte sequences, not Rust enum variants.

Bug 1 — bareword escapes not interpreted (parse_bare_string_value): the original implementation used take_while to grab the raw token and stored literal text. \0 was stored as the two characters \ + 0 (length 10 for PNCIHISK\0) instead of a 9-byte sequence ending in a NUL. The fallback was originally introduced for unquoted ASCII identifiers in the regex/search work; escape interpretation was YAGNI'd out, not deliberately excluded. The fix rewrites the function to walk character-by-character, dispatching \xNN hex via value::parse_hex_byte_with_prefix (which parse_escape_sequence does not recognize), then octal/control escapes via parse_escape_sequence, falling through to a literal backslash for unrecognized sequences.

Bug 2 — read_string always NUL-truncated (read_string): read_string uses memchr::memchr(0, ...) to find the first NUL and truncate. For a 9-byte pattern ending in \0, this read 8 bytes from the buffer, producing a length mismatch before any byte comparison occurred. NUL-stopping was correct for the original quoted-string path (C-string-style comparison); it broke once bareword patterns started containing real NULs after Bug 1's fix. The fix adds read_string_exact that reads exactly length bytes via buffer.get(offset..end) with no NUL truncation. The dispatcher in read_typed_value_with_pattern selects between modes: (Some(n), _) → exact read; (None, Value::String(p)) → read_string_exact(buffer, offset, p.len()); (None, Value::Bytes(b)) → read_string_exact(buffer, offset, b.len()); (None, _) → NUL-stopping read_string (the x any-value scan path).

Bug 3 — strict-variant equality (apply_equal): parse_mixed_hex_ascii produces Value::Bytes for patterns like \177ELF (some byte sequences can't roundtrip through Rust String), while read_string_exact always returns Value::String. The old apply_equal delegated to compare_values, which returned None for cross-variant pairs — meaning Value::Bytes vs Value::String always compared unequal regardless of byte content. The fix adds a cross-variant byte comparison before falling through: (Value::String(s), Value::Bytes(b)) | (Value::Bytes(b), Value::String(s)) => s.as_bytes() == b.as_slice() .

How they chained: The Norton string PNCIHISK\0 rule exercised Bugs 1 and 2 (NUL in pattern). The ELF string \177ELF rule exercised Bugs 1 and 3 (non-UTF-8 byte, no NUL). A naive single-bug fix would have left the other failures undiagnosed.

Why Local Tests Passed#

Each layer had unit tests verifying its local contract, and those contracts were internally consistent:

Parser tests verified parse_bare_string_value("PNCIHISK\\0") returned Value::String("PNCIHISK\\0") — correct per the old contract.
Evaluator tests verified read_string stopped at NUL — correct behavior for that function.
Operator tests verified apply_equal(Bytes, String) returned false — correct under strict-variant semantics.

Every test passed because every test was checking the wrong thing. End-to-end conformance against GNU file would have caught all three immediately.

Dual Read Mode: The Sync Requirement#

GOTCHAS S6.4 documents the now-codified rule: any code path that needs string bytes from the buffer must choose consciously:

Pattern comparison → read_string_exact. No NUL truncation.
Printable-prefix scan / x operator → read_string. Stops at NUL.

read_string, read_string_exact, and string_bytes_consumed form a triad: adding a new read path or changing length semantics in one must be reflected in the others. This mirrors the "keep dual-purpose helpers in sync" rule from the pstring-anchor-poisoning pattern.

Cross-Value Byte Equality: Policy, Not Accident#

GOTCHAS S2.3 records this as an explicit design policy: apply_equal deliberately compares Value::String(s) against Value::Bytes(b) by underlying byte sequence. Any new Value variant that can hold the same byte sequence as Value::String must extend this cross-equality rather than locking into strict-variant comparison. S2.3 also notes the trichotomy requirement: compare_values in src/evaluator/operators/comparison.rs must mirror the cross-variant arms or apply_equal returns true while apply_less_than and apply_greater_than both return false, breaking the </==/> invariant.

Tests pinning this policy: test_apply_equal_bytes_vs_string and test_apply_equal_edge_cases.

Prevention: GNU `file` Conformance Harness#

The compatibility test suite (tests/compatibility_tests.rs) runs rmagic against synthetic fixture files and compares output against expected results. This is the primary gate that catches end-to-end regressions the unit test layers cannot see.

Additional prevention rules from the post-mortem:

Round-trip property tests: for any DSL value type, write a property test pairing (rule_text, matching_buffer) and assert that parsing + evaluating produces a match. The Norton bug would have failed this test immediately.
"Corpus loads cleanly" ≠ "rules match": a magic database that parses without errors is necessary but not sufficient. The next gate: every rule, when fed its canonical sample fixture, must produce non-"data" output.
Future Value variants: if Value::String is replaced or augmented with a byte-string variant uniformly, the parser should emit a single value type for all string-family rules, eliminating the Bytes/String divergence entirely (tracked as a v0.6.0 surface change).

Key Files#

File	Role
`src/parser/grammar/mod.rs`	`parse_bare_string_value` — bareword escape interpretation
`src/evaluator/types/string.rs`	`read_string` (NUL-stop) and `read_string_exact` (exact-length)
`src/evaluator/types/mod.rs`	`read_typed_value_with_pattern` dispatcher — selects read mode
`src/evaluator/operators/equality.rs`	`apply_equal` — cross-variant byte comparison
`GOTCHAS.md` S2.3	Cross-type `String`/`Bytes` equality policy
`GOTCHAS.md` S6.4	Dual read mode selection rule
`tests/compatibility_tests.rs`	GNU `file` conformance harness