parse_number Scope and Unsigned Value Parsing#
In the libmagic-rs parser, parse_number and parse_unsigned_number are two separate functions that enforce a critical architectural constraint: parse_number returns i64 and must never be widened to u64. This design separation mirrors the libmagic type system's signed-by-default semantics, where unprefixed type names (byte, short, long, quad) default to signed interpretation, while u-prefixed variants (ubyte, ushort, ulong, uquad) explicitly request unsigned interpretation.
parse_number is a public function that handles signed integers with optional minus signs, returning i64 values used throughout the parser for offset specifications and signed numeric literals. parse_unsigned_number is module-private and returns u64 values, supporting the full unsigned 64-bit range required for types like uquad and for parsing high-bit magic numbers in file format signatures—for example, JPEG (0xffd8) and PNG (0x89504e47).
This constraint prevents type confusion downstream and keeps the signed: bool fields of the TypeKind enum semantically correct throughout the parsing and evaluation pipeline. The separation enforces a clear distinction: offsets are inherently signed (supporting negative values for offsets from file end), while numeric test values can span the full u64 range when paired with unsigned types.
Core Constraint: Why parse_number Must Stay i64#
Function Signatures and Scoping#
parse_number is a public function with signature:
pub fn parse_number(input: &str) -> IResult<&str, i64>
parse_unsigned_number is module-private with signature:
fn parse_unsigned_number(input: &str) -> IResult<&str, u64>
The visibility difference reinforces the design intent: parse_number is the general-purpose entry point for signed integer parsing (used by offset parsing, strength modifiers, and signed numeric values), while parse_unsigned_number is a specialized helper for contexts requiring the full u64 range.
Implementation Logic#
parse_number handles both positive and negative numbers:
- Optionally parses a leading minus sign using opt(char('-'))
- Determines format (hexadecimal vs decimal) after consuming the sign
- Uses checked_neg() for safe negation with overflow detection
- Returns i64 values suitable for offset specifications, which can be negative
parse_unsigned_number only handles non-negative numbers:
- Checks for 0x prefix to determine hexadecimal vs decimal format
- Delegates to format-specific parsers with u64 overflow protection
- Does not handle minus signs; callers detect signs before dispatching to this function
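The two step lists above can be sketched without nom, using only the standard library. This is an illustrative stand-in, not the crate's implementation: the names split_digits, sketch_parse_signed, and sketch_parse_unsigned are hypothetical, and the Option return type replaces nom's IResult for brevity.

```rust
/// Hypothetical helper: splits `input` into (digit run, remaining input, radix).
/// A `0x` prefix selects hexadecimal; anything else is treated as decimal.
fn split_digits(input: &str) -> Option<(&str, &str, u32)> {
    let (body, radix) = match input.strip_prefix("0x") {
        Some(hex) => (hex, 16),
        None => (input, 10),
    };
    // Index of the first character that is not a digit in this radix.
    let end = body
        .find(|c: char| !c.is_digit(radix))
        .unwrap_or(body.len());
    if end == 0 {
        return None; // no digits at all (this also rejects a leading '-')
    }
    Some((&body[..end], &body[end..], radix))
}

/// Sketch of the unsigned path: no sign handling, full u64 range.
fn sketch_parse_unsigned(input: &str) -> Option<(&str, u64)> {
    let (digits, rest, radix) = split_digits(input)?;
    // from_str_radix reports u64 overflow as Err, surfaced here as None.
    let value = u64::from_str_radix(digits, radix).ok()?;
    Some((rest, value))
}

/// Sketch of the signed path: optional '-', then hex or decimal, then negation.
fn sketch_parse_signed(input: &str) -> Option<(&str, i64)> {
    // Consume the sign first, then decide the format.
    let (negative, unsigned_part) = match input.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, input),
    };
    let (digits, rest, radix) = split_digits(unsigned_part)?;
    let magnitude = i64::from_str_radix(digits, radix).ok()?;
    // checked_neg mirrors the step described above; with a non-negative
    // i64 magnitude it cannot actually overflow in this sketch.
    let value = if negative { magnitude.checked_neg()? } else { magnitude };
    Some((rest, value))
}
```

Because the sketch parses the magnitude as i64 before negating, it rejects i64::MIN; whether the real parse_number accepts that value is not something this page states.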
Why This Matters#
Widening parse_number to u64 would break the parser in three critical ways:
- Break offset semantics: Offsets feed into OffsetSpec::Absolute(i64), which requires signed values for negative offsets from file end
- Collapse type distinctions: The distinction between signed and unsigned numeric literals would be lost at the AST level, breaking the Value::Int / Value::Uint separation
- Silently introduce bugs: Downstream code assumes parse_number returns signed values; widening would require exhaustive pattern match updates across the evaluator
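To make the sign-confusion risk concrete, the following sketch shows what happens when a u64 is forced into an i64 slot such as an offset. Both function names are hypothetical and exist only for illustration.

```rust
/// Hypothetical checked conversion: refuses values that do not fit in i64.
fn offset_from_unsigned(raw: u64) -> Option<i64> {
    i64::try_from(raw).ok()
}

/// What a careless `as` cast does instead: it reinterprets the bits,
/// so large unsigned values silently become negative offsets.
fn offset_from_unsigned_lossy(raw: u64) -> i64 {
    raw as i64
}
```

With the lossy cast, u64::MAX becomes -1, which would read as "one byte before the end of the file" rather than an out-of-range value.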
Offset Parsing: parse_number in Action#
Offset parsing follows a clean composition pattern that demonstrates why parse_number must return i64. The pipeline flows from high-level magic rule parsing down to low-level number extraction:
parse_rule_offset → parse_offset → parse_number → i64 → OffsetSpec::Absolute(i64)
The parse_offset function directly calls parse_number and wraps the result in the OffsetSpec::Absolute variant:
pub fn parse_offset(input: &str) -> IResult<&str, OffsetSpec> {
    let (input, _) = multispace0(input)?;
    let (input, offset_value) = parse_number(input)?;
    let (input, _) = multispace0(input)?;
    Ok((input, OffsetSpec::Absolute(offset_value)))
}
Examples#
Offset parsing handles both positive and negative values:
- "0" → OffsetSpec::Absolute(0)
- "16" → OffsetSpec::Absolute(16)
- "0x10" → OffsetSpec::Absolute(16)
- "-4" → OffsetSpec::Absolute(-4)
- "-0xFF" → OffsetSpec::Absolute(-255)
Negative offsets are semantically meaningful in libmagic. For example, offsets from file end use negative values to position the read pointer relative to the end of the file. This makes i64 the correct return type for parse_number in offset contexts.
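As an illustration of why a signed offset is useful, here is a minimal sketch of turning an i64 offset into a buffer position. The function resolve_absolute is hypothetical; the actual evaluator may resolve offsets differently, and this only shows the end-relative arithmetic.

```rust
/// Hypothetical resolver: maps a signed offset onto a position in a
/// buffer of `file_len` bytes. Negative offsets count back from the end.
fn resolve_absolute(offset: i64, file_len: usize) -> Option<usize> {
    if offset >= 0 {
        let pos = usize::try_from(offset).ok()?;
        // Positions past the end of the buffer are rejected.
        (pos <= file_len).then_some(pos)
    } else {
        // unsigned_abs avoids overflow even for i64::MIN.
        let back = usize::try_from(offset.unsigned_abs()).ok()?;
        // checked_sub fails when the offset reaches before the start.
        file_len.checked_sub(back)
    }
}
```

For a 100-byte buffer, an offset of -4 resolves to position 96, while -101 is rejected because it would land before the start.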
Value Parsing: Contextual Signed/Unsigned Dispatch#
While offset parsing always uses parse_number, value parsing in magic rules requires contextual dispatch between signed and unsigned paths. The parse_numeric_value function implements this logic by checking for a leading minus sign:
fn parse_numeric_value(input: &str) -> IResult<&str, Value> {
    let (input, _) = multispace0(input)?;
    let (input, value) = if input.starts_with('-') {
        // Negative: parse as i64
        let (input, number) = parse_number(input)?;
        (input, Value::Int(number))
    } else {
        // Non-negative: parse as u64 to support full unsigned 64-bit range
        let (input, number) = parse_unsigned_number(input)?;
        (input, Value::Uint(number))
    };
    let (input, _) = multispace0(input)?;
    Ok((input, value))
}
The dispatch logic:
- Checks if input starts with a minus sign
- For negative numbers: calls parse_number and wraps as Value::Int(i64)
- For non-negative numbers: calls parse_unsigned_number and wraps as Value::Uint(u64)
This design ensures that non-negative literals can represent values up to u64::MAX (18,446,744,073,709,551,615), which is essential for uquad type matching against 64-bit unsigned file data.
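The dispatch can be approximated with the standard library alone. This is a sketch under assumptions: the Value enum below is a two-variant stand-in for the real AST type, sketch_numeric_value is a hypothetical name, and the function accepts only a whole literal (no trailing input), unlike the nom-based parser.

```rust
/// Minimal stand-in for the AST value type; the real enum has more variants.
#[derive(Debug, PartialEq)]
enum Value {
    Int(i64),
    Uint(u64),
}

/// Hypothetical dispatcher: a leading '-' selects the signed i64 path,
/// everything else goes through the unsigned u64 path.
fn sketch_numeric_value(literal: &str) -> Option<Value> {
    let literal = literal.trim();
    let (negative, body) = match literal.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, literal),
    };
    let (digits, radix): (&str, u32) = match body.strip_prefix("0x") {
        Some(hex) => (hex, 16),
        None => (body, 10),
    };
    // from_str_radix tolerates a leading '+' or '-', so insist on a digit.
    if !digits.starts_with(|c: char| c.is_digit(radix)) {
        return None;
    }
    if negative {
        // The magnitude is non-negative here, so plain negation cannot overflow.
        let magnitude = i64::from_str_radix(digits, radix).ok()?;
        Some(Value::Int(-magnitude))
    } else {
        u64::from_str_radix(digits, radix).ok().map(Value::Uint)
    }
}
```

The non-negative branch is what lets a literal such as 18446744073709551615 survive parsing; the same digits with a minus sign are rejected because they do not fit in i64.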
Additional Usage: Bitmask Parsing#
parse_unsigned_number is also used in parse_type_and_operator for parsing mask values in bitwise AND operators:
// Example: lelong&0xf0000000
let (rest, mask) = parse_unsigned_number(after_amp).map_err(|_| {
    nom::Err::Error(nom::error::Error::new(input, nom::error::ErrorKind::MapRes))
})?;
(rest, Some(Operator::BitwiseAndMask(mask)))
This ensures full u64 mask range support (e.g., 0xffffffffffffffff).
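For context, a mask parsed this way would typically be applied to the value read from the file before comparison. The sketch below assumes that behavior; read_lelong and masked_equals are hypothetical helpers, not functions from the crate.

```rust
/// Hypothetical reader: interprets four bytes as an unsigned little-endian
/// 32-bit value and widens it to u64 so it can be combined with a u64 mask.
fn read_lelong(bytes: [u8; 4]) -> u64 {
    u64::from(u32::from_le_bytes(bytes))
}

/// Hypothetical comparison step for a rule such as `lelong&0xf0000000`:
/// apply the mask to the raw value, then compare with the expected value.
fn masked_equals(raw: u64, mask: u64, expected: u64) -> bool {
    (raw & mask) == expected
}
```

Keeping the mask as u64 means an all-ones mask such as 0xffffffffffffffff needs no special casing.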
Libmagic Type Semantics: Signed by Default#
The parse_number / parse_unsigned_number separation mirrors the libmagic type system's signed-by-default convention. All integer TypeKind variants include explicit signed: bool fields that control how bytes are interpreted during evaluation:
pub enum TypeKind {
    Byte { signed: bool },
    Short { endian: Endianness, signed: bool },
    Long { endian: Endianness, signed: bool },
    Quad { endian: Endianness, signed: bool },
    // ...
}
Type Name to TypeKind Mapping#
The type_keyword_to_kind function maps type name strings to TypeKind variants with appropriate signedness:
Signed Variants (unprefixed, default to signed: true):
- "byte" → TypeKind::Byte { signed: true }
- "short" → TypeKind::Short { endian: Endianness::Native, signed: true }
- "long" → TypeKind::Long { endian: Endianness::Native, signed: true }
- "quad" → TypeKind::Quad { endian: Endianness::Native, signed: true }
Unsigned Variants (u-prefix, explicitly set signed: false):
- "ubyte" → TypeKind::Byte { signed: false }
- "ushort" → TypeKind::Short { endian: Endianness::Native, signed: false }
- "ulong" → TypeKind::Long { endian: Endianness::Native, signed: false }
- "uquad" → TypeKind::Quad { endian: Endianness::Native, signed: false }
Endianness prefixes (le, be) combine with signedness—for example, uleshort specifies little-endian unsigned, while beshort specifies big-endian signed.
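The prefix rules can be sketched as a small decomposition: strip an optional u, then an optional le or be, then match the base name. This is not the crate's type_keyword_to_kind; the function name is hypothetical, the enums are trimmed-down stand-ins, and the Little and Big variant names are assumptions (only Endianness::Native appears above).

```rust
/// Stand-in for the crate's endianness type; `Little` and `Big` are assumed names.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Endianness {
    Little,
    Big,
    Native,
}

/// Stand-in for the integer variants of TypeKind shown above.
#[derive(Debug, PartialEq)]
enum TypeKind {
    Byte { signed: bool },
    Short { endian: Endianness, signed: bool },
    Long { endian: Endianness, signed: bool },
    Quad { endian: Endianness, signed: bool },
}

/// Hypothetical mapping from a type keyword to a TypeKind.
fn sketch_keyword_to_kind(keyword: &str) -> Option<TypeKind> {
    // A leading 'u' requests unsigned; otherwise the type is signed by default.
    let (signed, rest) = match keyword.strip_prefix('u') {
        Some(rest) => (false, rest),
        None => (true, keyword),
    };
    // An optional "le" or "be" prefix selects the byte order.
    let (endian, base) = if let Some(base) = rest.strip_prefix("le") {
        (Endianness::Little, base)
    } else if let Some(base) = rest.strip_prefix("be") {
        (Endianness::Big, base)
    } else {
        (Endianness::Native, rest)
    };
    match (base, endian) {
        // A single byte has no byte order, so only the unprefixed form is accepted.
        ("byte", Endianness::Native) => Some(TypeKind::Byte { signed }),
        ("short", _) => Some(TypeKind::Short { endian, signed }),
        ("long", _) => Some(TypeKind::Long { endian, signed }),
        ("quad", _) => Some(TypeKind::Quad { endian, signed }),
        _ => None,
    }
}
```

A table-driven match over full keyword strings would work equally well; the decomposition is used here only because it makes the signed-by-default rule visible in one place.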
Why High-Bit Magic Numbers Require Unsigned Types#
File format signatures frequently contain bytes with high bits set. Without unsigned interpretation, these values would be misinterpreted as negative numbers:
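- JPEG (0 ubeshort 0xffd8): both bytes have the high bit set (0xFF = 255 and 0xD8 = 216 decimal, each exceeding the signed byte maximum of 127)
- PNG (0 ubelong 0x89504e47): the signature begins with 0x89 (137 decimal), which exceeds the signed byte maximum of 127
- GZIP (0 beshort 0x1f8b): the second byte 0x8B (139 decimal) exceeds the signed byte maximum, although the 16-bit value as a whole (0x1f8b = 8075) still fits in a signed short because its leading byte 0x1F leaves the sign bit clear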
Without the u-prefix convention and the parse_unsigned_number path supporting full u64 range, these detections would fail. The parser must preserve signedness information from type names through to evaluation.
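The effect of signedness on high-bit signatures can be checked directly with the standard library's byte conversions. The helper read_be_short is hypothetical; it simply reads two big-endian bytes as either a signed or an unsigned 16-bit value and widens the result.

```rust
/// Hypothetical reader: interprets two big-endian bytes as a signed or
/// unsigned 16-bit value, widened to i64 for easy comparison.
fn read_be_short(bytes: [u8; 2], signed: bool) -> i64 {
    if signed {
        i64::from(i16::from_be_bytes(bytes))
    } else {
        i64::from(u16::from_be_bytes(bytes))
    }
}

/// Same idea for four big-endian bytes.
fn read_be_long(bytes: [u8; 4], signed: bool) -> i64 {
    if signed {
        i64::from(i32::from_be_bytes(bytes))
    } else {
        i64::from(u32::from_be_bytes(bytes))
    }
}
```

Read as signed, the JPEG marker bytes FF D8 come out as -40 rather than 0xffd8, and the PNG bytes 89 50 4E 47 come out negative, so a comparison against the non-negative literal would not match. The GZIP bytes 1F 8B yield 0x1f8b under either interpretation because the top bit of the 16-bit value is clear.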
Design Rationale#
The separation of parse_number (i64) and parse_unsigned_number (u64) enforces architectural constraints that prevent subtle bugs and maintain semantic correctness:
- Preserve libmagic semantics: Signed-by-default types require infrastructure that distinguishes signed from unsigned parsing at the grammar level
- Prevent type confusion: Offsets (inherently signed) and values (contextually typed) operate in different semantic domains and must not be conflated
- Enable high-bit detection: u-prefix types require the full u64 range to correctly match file signatures with bytes exceeding 127
- Maintain exhaustive pattern matching: The TypeKind enum's signed: bool field propagates through the entire evaluator; collapsing the parser's distinction would silently break downstream assumptions
Relevant Code Files#
| File Path | Purpose |
|---|---|
| src/parser/grammar/mod.rs | parse_number (lines 133-154), parse_unsigned_number (lines 103-109), parse_numeric_value (lines 574-594), parse_offset (lines 179-185) |
| src/parser/ast.rs | TypeKind enum with signed fields (lines 80-151), Value enum (lines 307-329), OffsetSpec enum (lines 13-78) |
| src/parser/types.rs | parse_type_keyword (lines 43-87), type_keyword_to_kind mapping (lines 121-232) |
Related Topics#
- Type Signedness Defaults: The signed field pattern in TypeKind and its propagation through the evaluator
- Parser-Evaluator Architecture: How parse_number constraints flow through the AST to evaluation
- Magic File Format Specification: libmagic syntax conventions for type names and numeric literals