Nom Alt Combinator Branch Limit And Nesting Pattern#
Lead Section#
The Nom Alt Combinator Branch Limit And Nesting Pattern is an architectural design pattern used in parser implementation to address potential scaling constraints when using the nom parser combinator library. The pattern involves organizing parser alternatives into hierarchical groups (typically by type family or bit-width) and nesting alt() combinators to ensure each individual alt() call stays within nom's compile-time limitations while maintaining parser correctness through careful ordering conventions.
In libmagic-rs, this pattern was identified as a critical architectural consideration during planning for type system expansion. The current implementation in src/parser/types.rs uses nested alt() combinators with 23 type name tokens organized into 5 family groups (64-bit, 32-bit, 16-bit, 8-bit, and string), but the roadmap targets expansion to ~31 types by v1.0.0, which would require adding additional family groups to avoid exceeding tuple size constraints.
The pattern emphasizes two ordering conventions: longer prefixes before shorter (to prevent premature matching, e.g., lelong before long) and unsigned before signed (for documentation clarity and consistent family grouping). These conventions ensure correct tokenization when type names share common prefixes and maintain logical organization within type families.
Nom's Alt Combinator Constraints#
The nom parser combinator library implements alt() using Rust tuples, which have compile-time size limits based on trait implementations. While the exact limit varies by nom version and Rust compiler version, practical limitations typically manifest around 21 alternatives due to tuple trait bounds in the nom implementation.
Evidence in libmagic-rs: No explicit documentation of hitting this limit exists in the current codebase. The type parsing implementation in src/parser/types.rs uses nested alt() calls with 23 type name tokens organized into 5 family groups, and the parse_operator function uses manual if let checks rather than alt(), though this appears to be for explicit precedence control rather than to work around tuple limits.
The constraint becomes relevant when considering the type system expansion roadmap:
- Current: 5 type variants (Byte, Short, Long, Quad, String) expanding to 23 type name tokens
- v0.3.0+: Add Float, Double, Date, Pstring, Regex, Search types with endianness variants (20+ additional tokens)
- Target: ~31 of 33 libmagic types by v1.0.0
This expansion would result in 40+ type name tokens to parse, far exceeding practical tuple limits for a single alt() call.
Family-Based Nesting Architecture#
Type Family Organization#
The nesting pattern groups types by bit-width families to create logical parser organization:
64-bit Family (currently implemented):
32-bit Family (currently implemented):
16-bit Family (currently implemented):
8-bit Family (currently implemented):
String Family (currently implemented):
Floating-Point Family (planned for v0.3.0):
Date/Time Family (planned for v0.3.0):
Current Implementation Structure#
The current type parsing implementation (in src/parser/types.rs) uses a nested structure with 23 type name tokens organized into family groups. The quad types (6 variants) were added as a nested alt() within the outer structure:
alt((
// 64-bit types (6 branches)
alt((
tag("ubequad"),
tag("ulequad"),
tag("uquad"),
tag("bequad"),
tag("lequad"),
tag("quad"),
)),
// 32-bit types (6 branches)
alt((
tag("ubelong"),
tag("ulelong"),
tag("ulong"),
tag("belong"),
tag("lelong"),
tag("long"),
)),
// 16-bit types (6 branches)
alt((
tag("ubeshort"),
tag("uleshort"),
tag("ushort"),
tag("beshort"),
tag("leshort"),
tag("short"),
)),
// 8-bit types (2 branches)
alt((tag("ubyte"), tag("byte"))),
// String types (1 branch)
tag("string"),
))
.parse(input)
This structure demonstrates the nesting pattern in action: the 64-bit quad family was added as its own nested alt() group, keeping each inner alt() to 6 or fewer branches and the outer alt() to 5 families.
Nested Pattern for Future Expansion#
When types exceed practical tuple limits, the pattern would restructure to:
let (input, type_name) = alt((
// 64-bit family (6 types)
alt((
tag("ubequad"),
tag("ulequad"),
tag("uquad"),
tag("bequad"),
tag("lequad"),
tag("quad"),
)),
// 32-bit integer family (6 types)
alt((
tag("ubelong"),
tag("ulelong"),
tag("ulong"),
tag("belong"),
tag("lelong"),
tag("long"),
)),
// Floating-point family (6 types)
alt((
tag("bedouble"),
tag("ledouble"),
tag("double"),
tag("befloat"),
tag("lefloat"),
tag("float"),
)),
// 16-bit family (6 types)
alt((
tag("ubeshort"),
tag("uleshort"),
tag("ushort"),
tag("beshort"),
tag("leshort"),
tag("short"),
)),
// Date family (8+ types)
alt((
tag("beqdate"),
tag("leqdate"),
tag("qdate"),
tag("beqldate"),
tag("leqldate"),
tag("qldate"),
tag("bedate"),
tag("ledate"),
tag("date"),
tag("beldate"),
tag("leldate"),
tag("ldate"),
)),
// 8-bit and string family (3 types)
alt((
tag("ubyte"),
tag("byte"),
tag("string"),
tag("pstring"),
)),
))
.parse(input)?;
This nested structure keeps each inner alt() to 6-12 branches and the outer alt() to 6 families, well within tuple limits.
Ordering Conventions#
Longer Prefixes Before Shorter#
Type names must be ordered longest-to-shortest within each family to prevent premature matching. The nom tag() combinator consumes exact string matches, but when alternatives are tried in sequence, the first successful match wins.
Example collision scenarios:
- Parsing
"lelong"would incorrectly matchtag("long")iflongcame first - Parsing
"ubeshort"would incorrectly matchtag("short")ifshortcame first - Parsing
"lequad"would incorrectly matchtag("quad")ifquadcame first
Correct ordering:
ubequad (7 chars) → bequad (6 chars) → quad (4 chars)
ulequad (7 chars) → lequad (6 chars) → quad (4 chars)
ubelong (7 chars) → belong (6 chars) → long (4 chars)
ulelong (7 chars) → lelong (6 chars) → long (4 chars)
ubeshort (8 chars) → beshort (7 chars) → short (5 chars)
uleshort (8 chars) → leshort (7 chars) → short (5 chars)
Unsigned Before Signed Within Families#
The current implementation places unsigned variants (u-prefixed) before signed variants within each bit-width family. This convention provides:
- Consistent family grouping: All unsigned variants for a bit-width appear together, then all signed variants
- Documentation clarity: Readers can quickly identify all variants of each signedness
- Future extension: New types can be added to the appropriate position without disrupting family structure
Note: Due to the longest-to-shortest rule, this is not strictly enforced character-by-character. For example, ubelong (7 chars) comes before lelong (6 chars) even though lelong is signed with a prefix. The practical implementation is: unsigned-with-endian → unsigned-native → signed-with-endian → signed-native within each bit-width family.
Combined Ordering Strategy#
Within each nested family alt():
- Group by signedness (unsigned, then signed)
- Within each signedness group, order by length (longest first)
- This ensures both prefix disambiguation and logical organization
Example for 32-bit family:
alt((
tag("ubelong"), // unsigned, big-endian, 7 chars
tag("ulelong"), // unsigned, little-endian, 7 chars
tag("ulong"), // unsigned, native, 5 chars
tag("belong"), // signed, big-endian, 6 chars
tag("lelong"), // signed, little-endian, 6 chars
tag("long"), // signed, native, 4 chars
))
Integration With Type System Architecture#
Exhaustive Match Synchronization#
The nesting pattern interacts with the Enum Extension And Exhaustive Match Synchronization pattern. When adding a new type variant:
- AST Definition: Add variant to
TypeKindenum in src/parser/ast.rs - Parser Grammar: Add type name tokens to appropriate family's nested
alt()in src/parser/types.rs, maintaining ordering conventions - Type Reading: Update read_typed_value() dispatch in src/evaluator/types.rs
- Strength Scoring: Update pattern match in
src/evaluator/strength.rs - Build-Time Serialization: Update serialization in
src/parser/codegen.rs - Property Tests: Extend test strategies in
tests/property_tests.rs
Rust's exhaustive pattern matching and Clippy warnings catch missed updates at compile time.
Type Name to Enum Mapping#
After parsing the type name string, a match statement maps it to TypeKind enum variants (lines 1587-1640):
let type_kind = match type_name {
"byte" => TypeKind::Byte { signed: true },
"ubyte" => TypeKind::Byte { signed: false },
"short" => TypeKind::Short {
endian: Endianness::Native,
signed: true,
},
"ushort" => TypeKind::Short {
endian: Endianness::Native,
signed: false,
},
// ... 13 more cases
_ => unreachable!("Parser should only match known types"),
};
The mapping must remain synchronized with the alt() alternatives. The nested pattern doesn't change this requirement, but it does mean that adding a new type to the appropriate family's nested alt() requires only local changes within that family's parser code.
Rationale and Benefits#
Separation of Concerns#
Grouping by type family creates logical boundaries:
- Related types stay together: All 64-bit types, all floating-point types, etc.
- Independent extension: Adding a new 64-bit type only touches the 64-bit family's nested
alt() - Clear intent: The nested structure documents which types belong to which families
Extensibility Without Refactoring#
The nested pattern allows the type system to scale incrementally:
- Quad types (64-bit) were added using the nested structure, creating a new family group without requiring a full parser refactor
- Adding Float/Double types (v0.3.0) requires introducing the floating-point family nested
alt() - Each addition stays within the existing nested structure without requiring a full refactor
Performance Characteristics#
Nested alt() has minimal performance impact:
- Early exit: nom's
alt()implementation tries alternatives in sequence and returns on first match - Family-level dispatch: The outer
alt()dispatches to families, short-circuiting unnecessary checks - Type name strings: Since type names begin with different characters (
qfor quad,l/b/ufor endian variants), string matching is fast
Compile-Time Guarantees#
The pattern maintains Rust's compile-time guarantees:
- Type safety: Each nested
alt()returns the same type (&strfor type name) - Tuple limits: Each
alt()level stays well within tuple trait bounds - Zero runtime cost: nom combinators compile to efficient matching code
When to Apply This Pattern#
Indicators That Nesting Is Required#
- Approaching tuple limits: More than 15-18 alternatives in a single
alt()call - Type system expansion plans: Roadmap indicates 25+ types to be added
- Logical groupings exist: Types naturally organize into families (bit-width, numeric/string/pattern)
- Prefix relationships: Many type names share common prefixes requiring ordering
Alternative Approaches#
Before adopting nested alt():
- Manual dispatch: Use
if letchains like parse_operator does (lines 173-248), but this sacrifices composability - Function pointers: Create arrays of parsing functions, but this loses compile-time type checking
- Macro generation: Generate nested
alt()calls via macros, but this reduces readability
Nested alt() provides the best balance of composability, type safety, and maintainability for the type parsing use case.
Related Topics#
- Enum Extension And Exhaustive Match Synchronization: Ensures type additions are synchronized across the codebase
- Type System And Operator Coverage: Tracks the roadmap from 4 types (12%) to 31 types (94%)
- Type Signedness Defaults And Unsigned Type Variants: Explains signedness conventions in type naming and parsing
- Parser-Evaluator Architecture: Describes the three-layer architecture (AST → Parser → Evaluator)
Relevant Code Files#
| File | Description | Key Lines |
|---|---|---|
| src/parser/types.rs | Type parsing implementation with nested alt() structure | 43-76 |
| src/parser/ast.rs | TypeKind enum definition with 5 current variants including Quad | 80-117 |
| src/evaluator/types.rs | Type reading dispatch requiring exhaustive matches | 350-361 |
| ROADMAP.md | Type system expansion plan showing 31-type target | 28-37 |
| docs/src/magic-format.md | Magic file format documentation listing not-yet-supported types | 449-456 |