Variable-Width TypeKind Dispatch Architecture
Overview
TypeKind::Regex and TypeKind::Search are the two variable-width, pattern-bearing type variants in libmagic-rs. They differ from every other TypeKind variant because evaluation requires both the buffer and the rule's value operand (the pattern) at read time — fixed-width types only need the buffer. This constraint makes the standard read_typed_value(buffer, offset, type_kind) signature insufficient, and drove a dedicated parallel function family.
The architecture was established when implementing these types for issue #39. Five interlocking problems surfaced: a stale regex crate feature flag, an insufficient dispatch signature, a missing anchor-advance path for variable-width matches, a build.rs exhaustive-match failure that appears before library errors, and a clippy doc_markdown lint on module docs. See the solutions document for the full root-cause breakdown.
Primary source files:
| File | Role |
|---|---|
| src/evaluator/types/mod.rs | Function entry points: read_typed_value_with_pattern, bytes_consumed_with_pattern, read_pattern_match |
| src/evaluator/types/regex.rs | Regex reader: build_regex, compute_window, read_regex, regex_bytes_consumed |
| src/evaluator/types/search.rs | Search reader: read_search, search_bytes_consumed |
| src/evaluator/engine/mod.rs | Engine dispatch: evaluate_single_rule_with_anchor, evaluate_pattern_rule |
| src/parser/codegen.rs | Build-time codegen: serialize_type_kind exhaustive match |
AST Shape
Both variants live in TypeKind in src/parser/ast.rs:
TypeKind::Regex { flags: RegexFlags, count: RegexCount }
TypeKind::Search { range: NonZeroUsize }
RegexFlags carries the /c and /s suffix modifiers:
pub struct RegexFlags {
    pub case_insensitive: bool, // /c
    pub start_offset: bool,     // /s: advance anchor to match-start instead of match-end
}
The /l modifier is not a flag. It selects the RegexCount::Lines variant so that byte-count and line-count are mutually exclusive at the type level:
pub enum RegexCount {
    Default,                   // plain `regex`: 8192-byte window
    Bytes(NonZeroU32),         // `regex/N`: N bytes, capped at 8192
    Lines(Option<NonZeroU32>), // `regex/Nl` / `regex/l`
}
RegexCount::Lines(None) is behaviorally equivalent to Default (both use the full 8192-byte capped window) but is kept distinct at the AST level for round-tripping. RegexCount::Bytes and RegexCount::Lines are mutually exclusive, so regex/1l2l is a parse error.
TypeKind::Search takes a mandatory NonZeroUsize range. Bare search (no suffix) and search/0 are both parse errors, making invalid states unrepresentable.
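As a rough illustration of how a NonZeroUsize range rules out the invalid states described above, the sketch below mirrors the AST shapes and adds a small parser. The parse_search helper, its error strings, and the derived traits are assumptions made for this example; the real parser in libmagic-rs is not shown in this document and may handle more suffix forms.

```rust
use std::num::{NonZeroU32, NonZeroUsize};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RegexFlags {
    pub case_insensitive: bool, // /c
    pub start_offset: bool,     // /s
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RegexCount {
    Default,
    Bytes(NonZeroU32),
    Lines(Option<NonZeroU32>),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TypeKind {
    Regex { flags: RegexFlags, count: RegexCount },
    Search { range: NonZeroUsize },
}

/// Hypothetical helper (not the crate's parser): accepts only `search/N`
/// with N > 0 and rejects everything else.
pub fn parse_search(spec: &str) -> Result<TypeKind, String> {
    // Bare `search` has no `/N` suffix, so it fails here.
    let suffix = spec
        .strip_prefix("search/")
        .ok_or_else(|| format!("`{spec}`: search requires a /N range"))?;
    let n: usize = suffix
        .parse()
        .map_err(|_| format!("`{spec}`: range is not a number"))?;
    // NonZeroUsize::new returns None for 0, so a zero range can never
    // reach the AST.
    let range = NonZeroUsize::new(n)
        .ok_or_else(|| format!("`{spec}`: range must be non-zero"))?;
    Ok(TypeKind::Search { range })
}
```

Because the range field cannot hold zero, code downstream of the parser never needs a runtime check for an empty search window.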
The Sibling Function Pattern
The existing read_typed_value signature has no slot for a pattern operand. Extending it would churn ~30 call sites for fixed-width types that never need a pattern. The solution is a sibling family: new functions that carry the extra argument, with the original becoming a zero-cost wrapper.
Three siblings in src/evaluator/types/mod.rs:
read_typed_value_with_pattern — main dispatch for all types; the original read_typed_value forwards here with pattern: None:
pub fn read_typed_value_with_pattern(
    buffer: &[u8],
    offset: usize,
    type_kind: &TypeKind,
    pattern: Option<&Value>,
) -> Result<Value, TypeReadError>
bytes_consumed_with_pattern — anchor-advance computation; pattern-bearing types need the operand to re-run the match:
pub(crate) fn bytes_consumed_with_pattern(
    buffer: &[u8],
    offset: usize,
    type_kind: &TypeKind,
    pattern: Option<&Value>,
) -> usize
read_pattern_match — the engine's direct entry point for Regex/Search; returns Result<Option<Value>, TypeReadError> instead of Result<Value, _> so the caller can distinguish a zero-width match from a genuine miss. Non-pattern types return TypeReadError::UnsupportedType.
The sibling pattern keeps the addition narrow. Callers that only deal with fixed-width types are unaffected; the engine opts into the extended API where needed.
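The forwarding relationship can be sketched roughly as follows. The Value, TypeKind, and TypeReadError definitions here are cut-down stand-ins, the find helper is hypothetical, and returning UnsupportedType when a pattern is missing is an assumption of this sketch rather than documented behavior.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Uint(u64),
    String(String),
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TypeKind {
    Byte,
    Search { range: usize }, // simplified: the real AST uses NonZeroUsize
}

#[derive(Debug, PartialEq)]
pub enum TypeReadError {
    OutOfBounds,
    UnsupportedType,
}

/// Hypothetical byte-substring search standing in for memchr::memmem::find.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Extended sibling: carries the optional pattern operand.
pub fn read_typed_value_with_pattern(
    buffer: &[u8],
    offset: usize,
    type_kind: &TypeKind,
    pattern: Option<&Value>,
) -> Result<Value, TypeReadError> {
    match type_kind {
        // Fixed-width arm: the pattern argument is ignored.
        TypeKind::Byte => buffer
            .get(offset)
            .map(|b| Value::Uint(u64::from(*b)))
            .ok_or(TypeReadError::OutOfBounds),
        TypeKind::Search { range } => {
            // Assumption: a missing or non-string pattern is an error here.
            let Some(Value::String(needle)) = pattern else {
                return Err(TypeReadError::UnsupportedType);
            };
            let needle = needle.as_bytes();
            let start = offset.min(buffer.len());
            let end = start.saturating_add(*range).min(buffer.len());
            let window = &buffer[start..end];
            // A miss collapses to an empty string in this back-compatible shape.
            let text = find(window, needle).map_or_else(String::new, |i| {
                String::from_utf8_lossy(&window[i..i + needle.len()]).into_owned()
            });
            Ok(Value::String(text))
        }
    }
}

/// Original three-argument entry point, now a thin wrapper.
pub fn read_typed_value(
    buffer: &[u8],
    offset: usize,
    type_kind: &TypeKind,
) -> Result<Value, TypeReadError> {
    read_typed_value_with_pattern(buffer, offset, type_kind, None)
}
```

Existing callers keep calling read_typed_value unchanged; only code that handles pattern-bearing types needs to know the four-argument form exists.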
Option<Value> None-as-No-Match Sentinel
Pattern-bearing types cannot use Value::String(String::new()) to signal "no match." A zero-width regex (e.g., ^, a*, .{0}) produces an empty string on a successful match, which is identical to the proposed sentinel. Using the same value for two meanings would make zero-width matches indistinguishable from genuine misses.
read_pattern_match returns Result<Option<Value>, TypeReadError>:
- Ok(Some(value)): successful match, including zero-width
- Ok(None): genuine miss (pattern not found in window)
- Err(TypeReadError::UnsupportedType): called on a non-pattern type
read_typed_value_with_pattern still collapses both Some(empty) and None to Value::String(String::new()) to preserve the back-compatible Value return shape. The engine bypasses this and calls read_pattern_match directly, so it always has the full Option.
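The ambiguity is easy to reproduce in miniature. In the sketch below, a toy substring matcher stands in for the regex engine, and an empty needle plays the role of a zero-width pattern such as ^. Both functions are hypothetical simplifications: they return Option<String> and String rather than the crate's Result and Value types.

```rust
/// Toy matcher (hypothetical): Some(matched text) on a hit, None on a miss.
/// An empty needle always matches at index 0, like a zero-width regex.
pub fn read_pattern_match(window: &[u8], needle: &[u8]) -> Option<String> {
    let idx = if needle.is_empty() {
        0
    } else {
        window.windows(needle.len()).position(|w| w == needle)?
    };
    Some(String::from_utf8_lossy(&window[idx..idx + needle.len()]).into_owned())
}

/// Back-compatible collapse: a miss and a zero-width hit both become "".
pub fn collapse(found: Option<String>) -> String {
    found.unwrap_or_default()
}
```

After collapse, the two outcomes compare equal, which is why the engine needs the Option form to decide whether the rule matched.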
Engine Dispatch: Pattern vs Value Path
Inside evaluate_single_rule_with_anchor, the engine splits on type kind before touching the buffer:
match &rule.typ {
    TypeKind::Regex { .. } | TypeKind::Search { .. } => {
        evaluate_pattern_rule(rule, buffer, absolute_offset)?
    }
    _ => evaluate_value_rule(rule, buffer, absolute_offset)?,
}
Pattern path (evaluate_pattern_rule): calls read_pattern_match, maps Some(_) → Equal and None → NotEqual. No call to apply_operator. Any operator other than Equal/NotEqual on a pattern type returns TypeReadError::UnsupportedType — running a pattern type through ordering operators would compare the matched text lexicographically against the pattern source string, which produces nonsense.
Value path (evaluate_value_rule): calls read_typed_value_with_pattern (with pattern: None effectively, since fixed-width types ignore it), then calls apply_operator to compare the read value with the rule's expected value.
After a successful match on either path, the engine advances the anchor:
let consumed = types::bytes_consumed_with_pattern(
    buffer, absolute_offset, &rule.typ, Some(&rule.value),
);
context.set_last_match_end(absolute_offset.saturating_add(consumed));
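The operator handling on the pattern path might look something like the sketch below. It assumes one reading of the mapping described above: an Equal rule matches when the pattern was found, a NotEqual rule matches when it was not, and every other operator is rejected. The Operator enum and the pattern_rule_matches and advance_anchor helpers are hypothetical names introduced for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operator {
    Equal,
    NotEqual,
    LessThan,
    GreaterThan,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TypeReadError {
    UnsupportedType,
}

/// Hypothetical decision step: `found` is the Option produced by a
/// read_pattern_match-style call (Some on a hit, even a zero-width one).
pub fn pattern_rule_matches(op: Operator, found: Option<&str>) -> Result<bool, TypeReadError> {
    match op {
        Operator::Equal => Ok(found.is_some()),
        Operator::NotEqual => Ok(found.is_none()),
        // Ordering operators are meaningless against a pattern source string.
        Operator::LessThan | Operator::GreaterThan => Err(TypeReadError::UnsupportedType),
    }
}

/// Anchor update after a match: saturating so a huge offset cannot overflow.
pub fn advance_anchor(absolute_offset: usize, consumed: usize) -> usize {
    absolute_offset.saturating_add(consumed)
}
```

Note that a zero-width hit (Some("")) still counts as a match for Equal, which is the case the Option return type exists to preserve.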
Anchor-Advance Semantics
bytes_consumed_with_pattern is the source of truth for EvaluationContext::last_match_end (GOTCHAS S3.8). Variable-width types must have explicit arms — a catch-all _ => fires debug_assert in dev/test builds but silently corrupts the anchor in release, breaking all subsequent relative-offset children.
Regex: match-end vs match-start
regex_bytes_consumed re-runs the compiled regex over the window returned by compute_window (which applies the 8192-byte FILE_REGEX_MAX cap) and dispatches on flags.start_offset:
regex.find(window).map_or(0, |m| {
    if flags.start_offset { m.start() } else { m.end() }
})
- Default (/s not set): advance to m.end(), the byte after the last matched character
- /s set: advance to m.start(), the byte where the match begins (libmagic's REGEX_OFFSET_START)
The compiled Regex is retrieved from a thread-local cache populated by the preceding read_regex call, so regex_bytes_consumed pays a cheap HashMap::get + Arc::clone instead of recompiling.
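The window cap and the /s dispatch can be exercised without the regex crate by passing the match span in directly. In this sketch the (start, end) tuple stands in for what Match::start() and Match::end() would report, and both function bodies are simplified assumptions rather than the crate's code (for instance, the real compute_window presumably also honors RegexCount).

```rust
/// Window cap named in the text above.
const FILE_REGEX_MAX: usize = 8192;

/// Hypothetical stand-in for compute_window: the slice starting at `offset`,
/// clamped to the buffer and capped at FILE_REGEX_MAX bytes.
pub fn compute_window(buffer: &[u8], offset: usize) -> &[u8] {
    let start = offset.min(buffer.len());
    let end = start.saturating_add(FILE_REGEX_MAX).min(buffer.len());
    &buffer[start..end]
}

/// `span` is the (start, end) of the match relative to the window,
/// or None when the regex did not match.
pub fn regex_bytes_consumed(span: Option<(usize, usize)>, start_offset: bool) -> usize {
    span.map_or(0, |(start, end)| if start_offset { start } else { end })
}
```

For a match spanning bytes 3..7 of the window, the default advance is 7 and the /s advance is 3; a miss advances by 0 in both modes.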
Search: match-end, not window-end
search_bytes_consumed re-runs memchr::memmem::find and returns match_idx + pattern.len():
memchr::memmem::find(window, pattern).map_or(0, |idx| idx + pattern.len())
An earlier implementation advanced by the full window size (range). That was wrong: it caused relative-offset children to land far past the intended byte. The GNU file contract (softmagic.c FILE_SEARCH path) is anchor += match_idx + pattern.len(), which points at the byte just past the matched needle, not the end of the scan window.
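A small worked case makes the difference concrete. The sketch below uses a standard-library substring search in place of memchr::memmem::find and takes the window as a simple offset-plus-range slice; both choices, along with the find helper and the four-argument signature, are simplifications assumed for this example.

```rust
/// Hypothetical stand-in for memchr::memmem::find using only std.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Bytes to advance the anchor by: match index plus needle length,
/// or 0 when the needle is not inside the window.
pub fn search_bytes_consumed(buffer: &[u8], offset: usize, range: usize, needle: &[u8]) -> usize {
    let start = offset.min(buffer.len());
    let end = start.saturating_add(range).min(buffer.len());
    find(&buffer[start..end], needle).map_or(0, |idx| idx + needle.len())
}
```

With the buffer "....MAGIC-rest-of-file", offset 0, and a range of 16, the needle MAGIC sits at index 4, so the anchor advances by 4 + 5 = 9. Advancing by the window size would instead put the anchor at 16, seven bytes past the needle.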
build.rs Codegen Exhaustive-Match Trap
src/parser/codegen.rs is included by build.rs via #[path] (GOTCHAS S1.2). This means serialize_type_kind — the function that emits Rust source for built-in rule data — is part of a separate compilation unit from the library.
Adding a TypeKind variant without updating serialize_type_kind breaks the exhaustive match in that function. Cargo compiles build.rs first, so the error surfaces as a build-script failure before the library compiles at all. The error message points at the codegen crate, not the library, which is disorienting if you don't know about the #[path] include.
serialize_type_kind covers all TypeKind variants explicitly — no wildcards. Pattern-bearing types require serialization of RegexFlags and RegexCount, including special handling for NonZeroU32/NonZeroUsize wrappers to avoid unwrap() calls in generated code.
Mitigation: when adding a TypeKind variant, follow GOTCHAS S2.1 in order. The checklist calls out serialize_type_kind as an easy-to-forget site. Run cargo clean && cargo check early to surface build-script failures before sinking time into library-side changes.
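To show why a wildcard-free match catches a forgotten variant at compile time, here is a reduced sketch of a serializer that emits Rust source text. The TypeKind here is heavily trimmed, and the way the NonZeroUsize value is emitted (a match on NonZeroUsize::new in the generated code) is only one possible approach assumed for illustration; the document does not say how the real serialize_type_kind avoids unwrap().

```rust
use std::num::NonZeroUsize;

/// Trimmed stand-in for the real AST enum.
pub enum TypeKind {
    Byte,
    Regex { case_insensitive: bool, start_offset: bool },
    Search { range: NonZeroUsize },
}

/// Emits Rust source for one TypeKind. There is deliberately no `_ =>` arm:
/// adding a variant to TypeKind makes this function fail to compile until
/// the new arm is written.
pub fn serialize_type_kind(kind: &TypeKind) -> String {
    match kind {
        TypeKind::Byte => "TypeKind::Byte".to_string(),
        TypeKind::Regex { case_insensitive, start_offset } => format!(
            "TypeKind::Regex {{ case_insensitive: {case_insensitive}, start_offset: {start_offset} }}"
        ),
        // Assumption: rebuild the NonZeroUsize in generated code with a match
        // so the emitted source contains no unwrap() call.
        TypeKind::Search { range } => format!(
            "TypeKind::Search {{ range: match ::std::num::NonZeroUsize::new({}) {{ Some(n) => n, None => unreachable!() }} }}",
            range.get()
        ),
    }
}
```

Since the value came from an existing NonZeroUsize, the None arm in the generated code can never run; it exists only to satisfy the type checker.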
Prevention Rules for Future Variable-Width Types
Sourced from the solutions document and GOTCHAS S2.1:
- Verify crate features before adding them. regex v1.12+ exposes regex::bytes::RegexBuilder unconditionally; features = ["bytes"] now references a nonexistent feature and will cause a cargo error. Check https://docs.rs/<crate>/<version>/ for the exact feature list.
- Walk GOTCHAS S2.1 in order when adding a TypeKind variant. The hidden site is serialize_type_kind in src/parser/codegen.rs. Then run cargo clean && cargo check to surface build-script failures early.
- Every variable-width variant must have an explicit arm in bytes_consumed_with_pattern. A catch-all _ => fires debug_assert in dev/test but silently corrupts the GNU file anchor in release builds.
- Use sibling functions, not signature extensions, when the new concern is narrow. The original 3-arg read_typed_value is undisturbed; ~30 existing call sites are unaffected.
- Never overload Value::String("") as a "no match" sentinel. Use Result<Option<Value>, _>. A zero-width regex match (^, a*) produces an empty string on success, identical to the sentinel.
- Search advances by match-end, not window-end. The contract is anchor += match_idx + pattern.len(). Advancing by range (window size) is wrong and has been fixed.
- Pattern-bearing types reject non-equality operators. Return TypeReadError::UnsupportedType for anything other than Equal/NotEqual.
- Backtick every Rust identifier in doc comments individually. The doc_markdown clippy lint flags unquoted names like TypeKind or RegexCount.