Variable-Width TypeKind Dispatch Architecture
Type: Topic
Status: Published
Created: Apr 25, 2026
Updated: Apr 25, 2026
Created by: Dosu Bot
Updated by: Dosu Bot

Variable-Width TypeKind Dispatch Architecture#

Overview#

TypeKind::Regex and TypeKind::Search are the two variable-width, pattern-bearing type variants in libmagic-rs. They differ from every other TypeKind variant because evaluation requires both the buffer and the rule's value operand (the pattern) at read time — fixed-width types only need the buffer. This constraint makes the standard read_typed_value(buffer, offset, type_kind) signature insufficient, and drove a dedicated parallel function family.

The architecture was established when implementing these types for issue #39. Five interlocking problems surfaced: a stale regex crate feature flag, an insufficient dispatch signature, a missing anchor-advance path for variable-width matches, a build.rs exhaustive-match failure that appears before library errors, and a clippy doc_markdown lint on module docs. See the solutions document for the full root-cause breakdown.

Primary source files:

File                            Role
src/evaluator/types/mod.rs      Function entry points: read_typed_value_with_pattern, bytes_consumed_with_pattern, read_pattern_match
src/evaluator/types/regex.rs    Regex reader: build_regex, compute_window, read_regex, regex_bytes_consumed
src/evaluator/types/search.rs   Search reader: read_search, search_bytes_consumed
src/evaluator/engine/mod.rs     Engine dispatch: evaluate_single_rule_with_anchor, evaluate_pattern_rule
src/parser/codegen.rs           Build-time codegen: serialize_type_kind exhaustive match

AST Shape#

Both variants live in TypeKind in src/parser/ast.rs:

TypeKind::Regex { flags: RegexFlags, count: RegexCount }
TypeKind::Search { range: NonZeroUsize }

RegexFlags carries the /c and /s suffix modifiers:

pub struct RegexFlags {
    pub case_insensitive: bool, // /c
    pub start_offset: bool, // /s — advance anchor to match-start instead of match-end
}

The /l modifier is not a flag. It selects the RegexCount::Lines variant so that byte-count and line-count are mutually exclusive at the type level:

pub enum RegexCount {
    Default, // plain `regex` — 8192-byte window
    Bytes(NonZeroU32), // `regex/N` — N bytes, capped at 8192
    Lines(Option<NonZeroU32>), // `regex/Nl` / `regex/l`
}

RegexCount::Lines(None) is behaviorally equivalent to Default (both use the full window, capped at 8192 bytes) but is kept distinct at the AST level for round-tripping. Because RegexCount::Bytes and RegexCount::Lines are mutually exclusive, regex/1l2l is a parse error.

TypeKind::Search takes a mandatory NonZeroUsize range. Bare search (no suffix) and search/0 are both parse errors, making invalid states unrepresentable.
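
As a minimal sketch of how these types constrain the window, the helper below maps a RegexCount to a byte cap. The names byte_cap and parse_search_range are hypothetical and not taken from libmagic-rs, and the sketch assumes that a line limit is enforced separately inside the same 8192-byte cap.

```rust
use std::num::{NonZeroU32, NonZeroUsize};

/// Upper bound on the regex scan window in bytes (8192, as stated above).
const FILE_REGEX_MAX: usize = 8192;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RegexCount {
    Default,                   // plain `regex`
    Bytes(NonZeroU32),         // `regex/N`
    Lines(Option<NonZeroU32>), // `regex/Nl` or `regex/l`
}

/// Hypothetical helper: byte cap of the scan window for a given count.
/// Assumption: any line limit is applied separately, inside this byte cap.
pub fn byte_cap(count: RegexCount) -> usize {
    match count {
        RegexCount::Default | RegexCount::Lines(_) => FILE_REGEX_MAX,
        RegexCount::Bytes(n) => (n.get() as usize).min(FILE_REGEX_MAX),
    }
}

/// Hypothetical helper: `search/0` has no `NonZeroUsize` representation,
/// so a parser built on this returns `None` and can report a parse error.
pub fn parse_search_range(raw: usize) -> Option<NonZeroUsize> {
    NonZeroUsize::new(raw)
}
```

Because the zero case is rejected at construction, downstream code never needs a runtime check for an empty search range.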

The Sibling Function Pattern#

The existing read_typed_value signature has no slot for a pattern operand. Extending it would churn ~30 call sites for fixed-width types that never need a pattern. The solution is a sibling family: new functions that carry the extra argument, with the original becoming a zero-cost wrapper.

Three siblings in src/evaluator/types/mod.rs:

read_typed_value_with_pattern — main dispatch for all types; the original read_typed_value forwards here with pattern: None:

pub fn read_typed_value_with_pattern(
    buffer: &[u8],
    offset: usize,
    type_kind: &TypeKind,
    pattern: Option<&Value>,
) -> Result<Value, TypeReadError>

bytes_consumed_with_pattern — anchor-advance computation; pattern-bearing types need the operand to re-run the match:

pub(crate) fn bytes_consumed_with_pattern(
    buffer: &[u8],
    offset: usize,
    type_kind: &TypeKind,
    pattern: Option<&Value>,
) -> usize

read_pattern_match — the engine's direct entry point for Regex/Search; returns Result<Option<Value>, TypeReadError> instead of Result<Value, _> so the caller can distinguish a zero-width match from a genuine miss. Non-pattern types return TypeReadError::UnsupportedType.

The sibling pattern keeps the addition narrow. Callers that only deal with fixed-width types are unaffected; the engine opts into the extended API where needed.
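
The forwarding relationship can be sketched as follows. This is a simplified illustration, not the library's code: the two-variant TypeKind, the BufferOverrun error variant, and the behavior when Search is called without a pattern are all assumptions made for the example.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Uint(u64),
    String(String),
}

#[derive(Debug, PartialEq)]
pub enum TypeReadError {
    BufferOverrun, // hypothetical variant for this sketch
    UnsupportedType,
}

/// Simplified stand-in for the real `TypeKind`.
pub enum TypeKind {
    Byte,   // fixed-width: needs only the buffer
    Search, // pattern-bearing: also needs the operand
}

/// Extended sibling: carries the optional pattern operand.
pub fn read_typed_value_with_pattern(
    buffer: &[u8],
    offset: usize,
    type_kind: &TypeKind,
    pattern: Option<&Value>,
) -> Result<Value, TypeReadError> {
    match type_kind {
        TypeKind::Byte => buffer
            .get(offset)
            .map(|b| Value::Uint(u64::from(*b)))
            .ok_or(TypeReadError::BufferOverrun),
        TypeKind::Search => {
            // Assumption: a missing or non-string pattern is an error here.
            let Some(Value::String(needle)) = pattern else {
                return Err(TypeReadError::UnsupportedType);
            };
            let window = buffer.get(offset..).ok_or(TypeReadError::BufferOverrun)?;
            let found = needle.is_empty()
                || window.windows(needle.len()).any(|w| w == needle.as_bytes());
            // Back-compatible shape: a miss collapses to the empty string.
            Ok(Value::String(if found { needle.clone() } else { String::new() }))
        }
    }
}

/// Original 3-argument entry point: a thin wrapper that forwards `None`.
pub fn read_typed_value(
    buffer: &[u8],
    offset: usize,
    type_kind: &TypeKind,
) -> Result<Value, TypeReadError> {
    read_typed_value_with_pattern(buffer, offset, type_kind, None)
}
```

Existing callers keep calling read_typed_value unchanged; only code that handles pattern-bearing types needs to know the four-argument form exists.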

Option<Value> None-as-No-Match Sentinel#

Pattern-bearing types cannot use Value::String(String::new()) to signal "no match." A zero-width regex (e.g., ^, a*, .{0}) produces an empty string on a successful match, which is identical to the proposed sentinel. Using the same value for two meanings would make zero-width matches indistinguishable from genuine misses.

read_pattern_match returns Result<Option<Value>, TypeReadError>:

  • Ok(Some(value)) — successful match, including zero-width
  • Ok(None) — genuine miss (pattern not found in window)
  • Err(TypeReadError::UnsupportedType) — called on a non-pattern type

read_typed_value_with_pattern still collapses both Some(empty) and None to Value::String(String::new()) to preserve the back-compatible Value return shape. The engine bypasses this and calls read_pattern_match directly, so it always has the full Option.
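
The ambiguity is easy to reproduce in a small sketch. The matcher below is a std-only stand-in (not a regex engine) in which an empty needle plays the role of a zero-width pattern such as ^; find_match and collapse are hypothetical names used only for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    String(String),
}

/// Stand-in matcher: returns the matched text, or `None` on a miss.
/// An empty needle matches with zero width, like the regex `^` would.
pub fn find_match(window: &[u8], needle: &[u8]) -> Option<Value> {
    if needle.is_empty() {
        return Some(Value::String(String::new()));
    }
    window
        .windows(needle.len())
        .find(|w| *w == needle)
        .map(|w| Value::String(String::from_utf8_lossy(w).into_owned()))
}

/// Back-compatible collapse: both a zero-width hit and a miss become "".
pub fn collapse(found: Option<Value>) -> Value {
    found.unwrap_or_else(|| Value::String(String::new()))
}
```

Before the collapse, a zero-width hit is Some(Value::String("")) and a miss is None, so the caller can tell them apart. After the collapse, both are the same empty string, which is why the engine needs the Option-returning entry point.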

Engine Dispatch: Pattern vs Value Path#

Inside evaluate_single_rule_with_anchor, the engine splits on type kind before touching the buffer:

match &rule.typ {
    TypeKind::Regex { .. } | TypeKind::Search { .. } => {
        evaluate_pattern_rule(rule, buffer, absolute_offset)?
    }
    _ => evaluate_value_rule(rule, buffer, absolute_offset)?,
}

Pattern path (evaluate_pattern_rule): calls read_pattern_match and maps Some(_) to Equal and None to NotEqual, without calling apply_operator. Any operator other than Equal/NotEqual on a pattern type returns TypeReadError::UnsupportedType, because running a pattern type through ordering operators would compare the matched text lexicographically against the pattern source string, which produces nonsense.

Value path (evaluate_value_rule): calls read_typed_value_with_pattern (the pattern argument is effectively None, since fixed-width types ignore it), then calls apply_operator to compare the value read with the rule's expected value.
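
One way to read the pattern-path mapping is sketched below: a hit satisfies an Equal rule, a miss satisfies a NotEqual rule, and every other operator is rejected. The Operator enum and the pattern_rule_matches helper are hypothetical simplifications, not the engine's actual types.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Operator {
    Equal,
    NotEqual,
    LessThan,
    GreaterThan,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TypeReadError {
    UnsupportedType,
}

/// Hypothetical helper: decide whether a pattern rule matches, given the
/// `Option` returned by the pattern reader. No value comparison happens.
pub fn pattern_rule_matches<T>(
    op: Operator,
    found: Option<T>,
) -> Result<bool, TypeReadError> {
    match op {
        Operator::Equal => Ok(found.is_some()),
        Operator::NotEqual => Ok(found.is_none()),
        // Ordering operators are meaningless for pattern-bearing types.
        Operator::LessThan | Operator::GreaterThan => {
            Err(TypeReadError::UnsupportedType)
        }
    }
}
```

Note that a zero-width hit (Some of an empty string) still counts as a match under Equal, which is the case the Option return type exists to preserve.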

After a successful match on either path, the engine advances the anchor:

let consumed = types::bytes_consumed_with_pattern(
    buffer, absolute_offset, &rule.typ, Some(&rule.value),
);
context.set_last_match_end(absolute_offset.saturating_add(consumed));

Anchor-Advance Semantics#

bytes_consumed_with_pattern is the source of truth for EvaluationContext::last_match_end (GOTCHAS S3.8). Variable-width types must have explicit arms — a catch-all _ => fires debug_assert in dev/test builds but silently corrupts the anchor in release, breaking all subsequent relative-offset children.

Regex: match-end vs match-start#

regex_bytes_consumed re-runs the compiled regex over the window produced by compute_window (which applies the 8192-byte FILE_REGEX_MAX cap) and dispatches on flags.start_offset:

regex.find(window).map_or(0, |m| {
    if flags.start_offset { m.start() } else { m.end() }
})

  • Default (/s not set): advance to m.end(), the byte after the last matched character
  • /s set: advance to m.start(), the byte where the match begins (libmagic's REGEX_OFFSET_START)
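
A worked sketch of the two cases, using a plain Span struct as a stand-in for the regex crate's match type so the example needs only std. Span, regex_consumed, and next_anchor are hypothetical names.

```rust
/// Stand-in for a regex match: a half-open byte span inside the window.
#[derive(Debug, Clone, Copy)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// Bytes to advance past `absolute_offset`: match start when `/s` is set,
/// match end otherwise, and 0 when nothing matched.
pub fn regex_consumed(found: Option<Span>, start_offset: bool) -> usize {
    found.map_or(0, |m| if start_offset { m.start } else { m.end })
}

/// New anchor position; saturates instead of overflowing.
pub fn next_anchor(absolute_offset: usize, consumed: usize) -> usize {
    absolute_offset.saturating_add(consumed)
}
```

For a match spanning bytes 3..7 of a window that starts at absolute offset 10, the default rule moves the anchor to 17, while /s moves it to 13.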

The compiled Regex is retrieved from a thread-local cache populated by the preceding read_regex call, so regex_bytes_consumed pays a cheap HashMap::get + Arc::clone instead of recompiling.
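
A thread-local cache of this shape might look like the sketch below. It is an illustration only: Compiled is a stand-in for a compiled regex, the cache is keyed by pattern source alone (the real key may also include flags), and get_or_compile and compile_count are hypothetical names.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for a compiled regex.
pub struct Compiled {
    pub source: String,
}

thread_local! {
    // Assumption: keyed by pattern source only.
    static CACHE: RefCell<HashMap<String, Arc<Compiled>>> =
        RefCell::new(HashMap::new());
    // Counts compilations so the cache hit is observable.
    static COMPILE_COUNT: RefCell<usize> = RefCell::new(0);
}

fn compile(source: &str) -> Compiled {
    COMPILE_COUNT.with(|c| *c.borrow_mut() += 1);
    Compiled { source: source.to_owned() }
}

/// Returns the cached entry if present, otherwise compiles and stores it.
pub fn get_or_compile(source: &str) -> Arc<Compiled> {
    CACHE.with(|cache| {
        // Cheap path: one `HashMap::get` plus one `Arc::clone`.
        let cached = cache.borrow().get(source).map(Arc::clone);
        if let Some(hit) = cached {
            return hit;
        }
        let fresh = Arc::new(compile(source));
        cache.borrow_mut().insert(source.to_owned(), Arc::clone(&fresh));
        fresh
    })
}

/// Number of compilations performed on the current thread.
pub fn compile_count() -> usize {
    COMPILE_COUNT.with(|c| *c.borrow())
}
```

With this arrangement, the read call pays for compilation once and the follow-up bytes-consumed call on the same thread reuses the same Arc.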

Search: match-end, not window-end#

search_bytes_consumed re-runs memchr::memmem::find and returns match_idx + pattern.len():

memchr::memmem::find(window, pattern).map_or(0, |idx| idx + pattern.len())

An earlier implementation advanced by the full window size (range). That was wrong: it caused relative-offset children to land far past the intended byte. The GNU file contract (softmagic.c FILE_SEARCH path) is anchor += match_idx + pattern.len(), which is the byte just past the matched needle, not the end of the scan window.
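
The difference between match-end and window-end can be checked with a small std-only sketch. It swaps in a naive substring scan for memchr::memmem::find so that it has no dependencies, and it assumes the window is at most range bytes starting at offset; search_consumed and find are hypothetical names.

```rust
/// Naive std-only substring scan, standing in for `memchr::memmem::find`.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Bytes consumed by a search match: index of the hit plus needle length,
/// or 0 on a miss. Assumption: the window is `range` bytes from `offset`,
/// clamped to the end of the buffer.
pub fn search_consumed(
    buffer: &[u8],
    offset: usize,
    range: usize,
    needle: &[u8],
) -> usize {
    let Some(tail) = buffer.get(offset..) else {
        return 0;
    };
    let window = &tail[..tail.len().min(range)];
    find(window, needle).map_or(0, |idx| idx + needle.len())
}
```

For the buffer xxabcyyyyy with needle abc and a range of 8, the match starts at index 2, so the anchor advances by 5. Advancing by the window size would move it by 8 and skip three bytes that a relative-offset child may need to read.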

build.rs Codegen Exhaustive-Match Trap#

src/parser/codegen.rs is included by build.rs via #[path] (GOTCHAS S1.2). This means serialize_type_kind — the function that emits Rust source for built-in rule data — is part of a separate compilation unit from the library.

Adding a TypeKind variant without updating serialize_type_kind breaks the exhaustive match in that function. Cargo compiles build.rs first, so the error surfaces as a build-script failure before the library compiles at all. The error message points at the codegen crate, not the library, which is disorienting if you don't know about the #[path] include.

serialize_type_kind covers all TypeKind variants explicitly — no wildcards. Pattern-bearing types require serialization of RegexFlags and RegexCount, including special handling for NonZeroU32/NonZeroUsize wrappers to avoid unwrap() calls in generated code.

Mitigation: when adding a TypeKind variant, follow GOTCHAS S2.1 in order. The checklist calls out serialize_type_kind as an easy-to-forget site. Run cargo clean && cargo check early to surface build-script failures before sinking time into library-side changes.
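
A wildcard-free serializer along these lines might look like the following sketch. The three-variant TypeKind is a simplified stand-in, and the generated form for NonZero values (a match with an unreachable arm) is only one possible unwrap-free encoding; this document does not show what the real codegen emits.

```rust
use std::num::{NonZeroU32, NonZeroUsize};

pub struct RegexFlags {
    pub case_insensitive: bool,
    pub start_offset: bool,
}

pub enum RegexCount {
    Default,
    Bytes(NonZeroU32),
    Lines(Option<NonZeroU32>),
}

/// Simplified stand-in for the real `TypeKind`.
pub enum TypeKind {
    Byte,
    Regex { flags: RegexFlags, count: RegexCount },
    Search { range: NonZeroUsize },
}

/// Hypothetical: emit a NonZero constructor expression without `unwrap()`.
fn nonzero_expr(ty: &str, n: u64) -> String {
    format!("match ::std::num::{ty}::new({n}) {{ Some(v) => v, None => unreachable!() }}")
}

fn serialize_regex_count(count: &RegexCount) -> String {
    match count {
        RegexCount::Default => "RegexCount::Default".to_string(),
        RegexCount::Bytes(n) => format!(
            "RegexCount::Bytes({})",
            nonzero_expr("NonZeroU32", u64::from(n.get()))
        ),
        RegexCount::Lines(None) => "RegexCount::Lines(None)".to_string(),
        RegexCount::Lines(Some(n)) => format!(
            "RegexCount::Lines(Some({}))",
            nonzero_expr("NonZeroU32", u64::from(n.get()))
        ),
    }
}

/// No `_ =>` arm: adding a variant to `TypeKind` fails to compile here.
pub fn serialize_type_kind(typ: &TypeKind) -> String {
    match typ {
        TypeKind::Byte => "TypeKind::Byte".to_string(),
        TypeKind::Regex { flags, count } => format!(
            "TypeKind::Regex {{ flags: RegexFlags {{ case_insensitive: {}, start_offset: {} }}, count: {} }}",
            flags.case_insensitive,
            flags.start_offset,
            serialize_regex_count(count)
        ),
        TypeKind::Search { range } => format!(
            "TypeKind::Search {{ range: {} }}",
            nonzero_expr("NonZeroUsize", range.get() as u64)
        ),
    }
}
```

Leaving out the wildcard arm is what turns a forgotten variant into a compile error in the build script rather than silently wrong generated code.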

Prevention Rules for Future Variable-Width Types#

Sourced from the solutions document and GOTCHAS S2.1:

  • Verify crate features before adding them. regex v1.12+ exposes regex::bytes::RegexBuilder unconditionally; features = ["bytes"] now references a nonexistent feature and will cause a cargo error. Check https://docs.rs/<crate>/<version>/ for the exact feature list.
  • Walk GOTCHAS S2.1 in order when adding a TypeKind variant. The hidden site is serialize_type_kind in src/parser/codegen.rs. Then run cargo clean && cargo check to surface build-script failures early.
  • Every variable-width variant must have an explicit arm in bytes_consumed_with_pattern. A catch-all _ => fires debug_assert in dev/test but silently corrupts the GNU file anchor in release builds.
  • Use sibling functions, not signature extensions, when the new concern is narrow. The original 3-arg read_typed_value is undisturbed; ~30 existing call sites are unaffected.
  • Never overload Value::String("") as a "no match" sentinel. Use Result<Option<Value>, _>. A zero-width regex match (^, a*) produces an empty string on success — identical to the sentinel.
  • Search advances by match-end, not window-end. The contract is anchor += match_idx + pattern.len(). Advancing by range (window size) is wrong and has been fixed.
  • Pattern-bearing types reject non-equality operators. Return TypeReadError::UnsupportedType for anything other than Equal/NotEqual.
  • Backtick every Rust identifier in doc comments individually. The doc_markdown clippy lint flags unquoted names like TypeKind or RegexCount.