Parse_Number Scope And Unsigned Value Parsing
Type: Topic
Status: Published
Created: Mar 7, 2026
Updated: Mar 7, 2026
Created by: Dosu Bot
Updated by: Dosu Bot

Parse_Number Scope And Unsigned Value Parsing#

In the libmagic-rs parser, parse_number and parse_unsigned_number are two separate functions that enforce a critical architectural constraint: parse_number returns i64 and must never be widened to u64. This design separation mirrors the libmagic type system's signed-by-default semantics, where unprefixed type names (byte, short, long, quad) default to signed interpretation, while u-prefixed variants (ubyte, ushort, ulong, uquad) explicitly request unsigned interpretation.

parse_number is a public function that parses signed integers with an optional leading minus sign. It returns i64 values, which are used throughout the parser for offset specifications and signed numeric literals. parse_unsigned_number is module-private and returns u64 values. It covers the full unsigned 64-bit range required by types such as uquad, and it parses the high-bit magic numbers found in file format signatures, for example JPEG (0xffd8) and PNG (0x89504e47).

This constraint prevents type confusion downstream and ensures that the signed: bool fields of the TypeKind enum stay semantically correct throughout the parsing and evaluation pipeline. The separation draws a clear line: offsets are inherently signed (negative values express offsets from the end of the file), while numeric test values can span the full u64 range when paired with unsigned types.

Core Constraint: Why parse_number Must Stay i64#

Function Signatures and Scoping#

parse_number is a public function with signature:

pub fn parse_number(input: &str) -> IResult<&str, i64>

parse_unsigned_number is module-private with signature:

fn parse_unsigned_number(input: &str) -> IResult<&str, u64>

The visibility difference reinforces the design intent: parse_number is the general-purpose entry point for signed integer parsing (used by offset parsing, strength modifiers, and signed numeric values), while parse_unsigned_number is a specialized helper for contexts requiring the full u64 range.

Implementation Logic#

parse_number handles both positive and negative numbers:

  • Optionally parses a leading minus sign using opt(char('-'))
  • Determines format (hexadecimal vs decimal) after consuming the sign
  • Uses checked_neg() for safe negation with overflow detection
  • Returns i64 values suitable for offset specifications, which can be negative
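
The four steps above can be sketched without any parser-combinator dependency. This is a minimal standard-library illustration, not the libmagic-rs implementation: the name parse_signed_sketch and the Option-based return type are assumptions made for the example, whereas the real function uses nom and returns IResult.

```rust
/// Hypothetical std-only sketch of the steps listed above.
/// Returns the unparsed remainder and the value, or None on failure.
fn parse_signed_sketch(input: &str) -> Option<(&str, i64)> {
    // Step 1: consume an optional leading minus sign.
    let (negative, rest) = match input.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, input),
    };
    // Step 2: choose hexadecimal or decimal only after the sign is consumed.
    let (radix, digits) = match rest.strip_prefix("0x") {
        Some(hex) => (16, hex),
        None => (10, rest),
    };
    // Take the longest run of digits that are valid in the chosen radix.
    let end = digits
        .find(|c: char| !c.is_digit(radix))
        .unwrap_or(digits.len());
    if end == 0 {
        return None; // no digits at all
    }
    // Step 3: parse the magnitude; values above i64::MAX are rejected here.
    let magnitude = i64::from_str_radix(&digits[..end], radix).ok()?;
    // Step 4: negate with overflow detection instead of a bare unary minus.
    let value = if negative { magnitude.checked_neg()? } else { magnitude };
    Some((&digits[end..], value))
}
```

With this sketch, "-0xFF" yields -255 and "16 rest" yields 16 with " rest" left unparsed. Because the magnitude is parsed before negation, this particular sketch rejects i64::MIN itself; whether the real parser does the same is not stated here.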

parse_unsigned_number only handles non-negative numbers:

  • Checks for 0x prefix to determine hexadecimal vs decimal format
  • Delegates to format-specific parsers with u64 overflow protection
  • Does not handle minus signs—callers detect signs before dispatching to this function
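
The unsigned path can be sketched the same way. As before, this is a simplified standard-library illustration under assumed names (parse_unsigned_sketch, Option return type), not the crate's nom-based code. It shows the two properties described above: overflow beyond u64::MAX is rejected, and a leading minus sign is simply not accepted.

```rust
/// Hypothetical std-only sketch of unsigned literal parsing.
/// Returns the unparsed remainder and the value, or None on failure.
fn parse_unsigned_sketch(input: &str) -> Option<(&str, u64)> {
    // A `0x` prefix selects hexadecimal; anything else is treated as decimal.
    let (radix, digits) = match input.strip_prefix("0x") {
        Some(hex) => (16, hex),
        None => (10, input),
    };
    // Take the longest run of digits valid in the chosen radix.
    // A minus sign is not a digit, so "-1" produces an empty run and fails.
    let end = digits
        .find(|c: char| !c.is_digit(radix))
        .unwrap_or(digits.len());
    if end == 0 {
        return None;
    }
    // from_str_radix returns an error on u64 overflow, which becomes None.
    let value = u64::from_str_radix(&digits[..end], radix).ok()?;
    Some((&digits[end..], value))
}
```

Here "0xffffffffffffffff" parses to u64::MAX, while "18446744073709551616" (one past the maximum) and "-1" are both rejected.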

Why This Matters#

Widening parse_number to u64 would break the parser in three critical ways:

  1. Break offset semantics: Offsets feed into OffsetSpec::Absolute(i64), which needs signed values to express negative offsets measured from the end of the file
  2. Collapse type distinctions: The distinction between signed and unsigned numeric literals would be lost at the AST level, breaking the Value::Int / Value::Uint separation
  3. Silently introduce bugs: Downstream code assumes parse_number returns signed values; widening would require auditing and updating pattern matches across the evaluator
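
The first point is easy to see in isolation. The two helper functions below are hypothetical and exist only for illustration: one forces a signed offset into u64 with a cast, the other refuses negative input.

```rust
/// Hypothetical helper: reinterpret a signed offset as unsigned with a cast.
/// Negative inputs wrap around to very large positive numbers.
fn widen_lossy(offset: i64) -> u64 {
    offset as u64
}

/// Hypothetical helper: convert only when the offset is non-negative.
fn widen_checked(offset: i64) -> Option<u64> {
    if offset >= 0 {
        Some(offset as u64)
    } else {
        None
    }
}
```

widen_lossy(-4) returns 18446744073709551612, so "four bytes before the end" would turn into an absurd forward offset, while widen_checked(-4) returns None and loses the offset entirely. Neither outcome preserves the meaning that an i64 carries directly.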

Offset Parsing: parse_number in Action#

Offset parsing follows a clean composition pattern that demonstrates why parse_number must return i64. The pipeline flows from high-level magic rule parsing down to low-level number extraction:

parse_rule_offset → parse_offset → parse_number → i64 → OffsetSpec::Absolute(i64)

The parse_offset function directly calls parse_number and wraps the result in the OffsetSpec::Absolute variant:

pub fn parse_offset(input: &str) -> IResult<&str, OffsetSpec> {
    let (input, _) = multispace0(input)?;
    let (input, offset_value) = parse_number(input)?;
    let (input, _) = multispace0(input)?;
    Ok((input, OffsetSpec::Absolute(offset_value)))
}

Examples#

Offset parsing handles both positive and negative values:

  • "0" → OffsetSpec::Absolute(0)
  • "16" → OffsetSpec::Absolute(16)
  • "0x10" → OffsetSpec::Absolute(16)
  • "-4" → OffsetSpec::Absolute(-4)
  • "-0xFF" → OffsetSpec::Absolute(-255)

Negative offsets are semantically meaningful in libmagic. For example, offsets from file end use negative values to position the read pointer relative to the end of the file. This makes i64 the correct return type for parse_number in offset contexts.
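
To make the end-relative reading concrete, the sketch below shows one way an evaluator could turn a signed absolute offset into a byte position. The function resolve_offset, its signature, and its bounds policy are assumptions for illustration; the source does not show how libmagic-rs resolves offsets.

```rust
/// Hypothetical resolver: map a signed offset to a position in a file of
/// `file_len` bytes. Negative offsets count back from the end of the file.
fn resolve_offset(offset: i64, file_len: u64) -> Option<u64> {
    if offset >= 0 {
        let pos = offset as u64;
        // Allow a position equal to the length (an empty read at end of file).
        if pos <= file_len { Some(pos) } else { None }
    } else {
        // unsigned_abs avoids overflow when offset is i64::MIN;
        // checked_sub fails if the offset reaches before the start of the file.
        file_len.checked_sub(offset.unsigned_abs())
    }
}
```

For a 100-byte file, resolve_offset(-4, 100) gives Some(96) and resolve_offset(-255, 100) gives None, because the offset points before the start of the file.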

Value Parsing: Contextual Signed/Unsigned Dispatch#

While offset parsing always uses parse_number, value parsing in magic rules requires contextual dispatch between signed and unsigned paths. The parse_numeric_value function implements this logic by checking for a leading minus sign:

fn parse_numeric_value(input: &str) -> IResult<&str, Value> {
    let (input, _) = multispace0(input)?;

    let (input, value) = if input.starts_with('-') {
        // Negative: parse as i64
        let (input, number) = parse_number(input)?;
        (input, Value::Int(number))
    } else {
        // Non-negative: parse as u64 to support full unsigned 64-bit range
        let (input, number) = parse_unsigned_number(input)?;
        (input, Value::Uint(number))
    };

    let (input, _) = multispace0(input)?;
    Ok((input, value))
}

The dispatch logic:

  1. Checks if input starts with a minus sign
  2. For negative numbers: calls parse_number and wraps as Value::Int(i64)
  3. For non-negative numbers: calls parse_unsigned_number and wraps as Value::Uint(u64)

This design ensures that non-negative literals can represent values up to u64::MAX (18,446,744,073,709,551,615), which is essential for uquad type matching against 64-bit unsigned file data.
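
The effect of the dispatch can be reproduced with a small standard-library sketch. The Value enum here is a stand-in with only the two variants discussed in this section, classify_literal is a hypothetical name, and hexadecimal input is left out to keep the example short.

```rust
/// Stand-in for the two numeric variants discussed above.
#[derive(Debug, PartialEq)]
enum Value {
    Int(i64),
    Uint(u64),
}

/// Hypothetical decimal-only version of the sign-based dispatch.
fn classify_literal(text: &str) -> Option<Value> {
    let text = text.trim();
    if text.starts_with('-') {
        // Negative literal: must fit in i64.
        text.parse::<i64>().ok().map(Value::Int)
    } else {
        // Non-negative literal: may use the full u64 range.
        text.parse::<u64>().ok().map(Value::Uint)
    }
}
```

With this sketch, "-1" becomes Value::Int(-1) and "18446744073709551615" becomes Value::Uint(u64::MAX). If every literal went through the i64 path instead, the second input would be rejected as out of range.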

Additional Usage: Bitmask Parsing#

parse_unsigned_number is also used in parse_type_and_operator for parsing mask values in bitwise AND operators:

// Example: lelong&0xf0000000
let (rest, mask) = parse_unsigned_number(after_amp).map_err(|_| {
    nom::Err::Error(nom::error::Error::new(input, nom::error::ErrorKind::MapRes))
})?;
(rest, Some(Operator::BitwiseAndMask(mask)))

This ensures full u64 mask range support (e.g., 0xffffffffffffffff).
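
In libmagic syntax, a mask written after the type is ANDed with the value read from the file before the comparison takes place. The helper below is a hypothetical illustration of that step; how the libmagic-rs evaluator actually applies Operator::BitwiseAndMask is not shown in this document.

```rust
/// Hypothetical helper: apply a parsed mask to a raw value, then compare
/// the masked result with the expected test value.
fn matches_masked(raw: u64, mask: u64, expected: u64) -> bool {
    (raw & mask) == expected
}
```

For a rule such as lelong&0xf0000000 with an expected value of 0x10000000, a raw value of 0x12345678 matches because only the top nibble survives the mask. Keeping the mask as u64 means that an all-ones 64-bit mask is representable without any sign handling.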

Libmagic Type Semantics: Signed by Default#

The parse_number / parse_unsigned_number separation mirrors the libmagic type system's signed-by-default convention. All integer TypeKind variants include explicit signed: bool fields that control how bytes are interpreted during evaluation:

pub enum TypeKind {
    Byte { signed: bool },
    Short { endian: Endianness, signed: bool },
    Long { endian: Endianness, signed: bool },
    Quad { endian: Endianness, signed: bool },
    // ...
}

Type Name to TypeKind Mapping#

The type_keyword_to_kind function maps type name strings to TypeKind variants with appropriate signedness:

Signed Variants (unprefixed, default to signed: true):

  • "byte" → TypeKind::Byte { signed: true }
  • "short" → TypeKind::Short { endian: Endianness::Native, signed: true }
  • "long" → TypeKind::Long { endian: Endianness::Native, signed: true }
  • "quad" → TypeKind::Quad { endian: Endianness::Native, signed: true }

Unsigned Variants (u-prefix, explicitly set signed: false):

  • "ubyte" → TypeKind::Byte { signed: false }
  • "ushort" → TypeKind::Short { endian: Endianness::Native, signed: false }
  • "ulong" → TypeKind::Long { endian: Endianness::Native, signed: false }
  • "uquad" → TypeKind::Quad { endian: Endianness::Native, signed: false }

Endianness prefixes (le, be) combine with signedness—for example, uleshort specifies little-endian unsigned, while beshort specifies big-endian signed.
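
The mapping rules can be condensed into a short sketch that peels off the prefixes in order. This is not the crate's type_keyword_to_kind: the function name keyword_to_kind_sketch, the Endianness variant names Little and Big, and the prefix-stripping approach are assumptions made for the example, and only the four integer types listed above are covered.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Endianness {
    Native,
    Little, // assumed variant name
    Big,    // assumed variant name
}

#[derive(Debug, PartialEq)]
enum TypeKind {
    Byte { signed: bool },
    Short { endian: Endianness, signed: bool },
    Long { endian: Endianness, signed: bool },
    Quad { endian: Endianness, signed: bool },
}

/// Hypothetical mapping from a type keyword to a TypeKind.
fn keyword_to_kind_sketch(keyword: &str) -> Option<TypeKind> {
    // A leading `u` requests unsigned; its absence means signed.
    let (signed, rest) = match keyword.strip_prefix('u') {
        Some(rest) => (false, rest),
        None => (true, keyword),
    };
    // An optional `le` or `be` prefix selects byte order; otherwise native.
    let (endian, base) = if let Some(base) = rest.strip_prefix("le") {
        (Endianness::Little, base)
    } else if let Some(base) = rest.strip_prefix("be") {
        (Endianness::Big, base)
    } else {
        (Endianness::Native, rest)
    };
    match (base, endian) {
        // A single byte has no byte order, so only the native form is accepted.
        ("byte", Endianness::Native) => Some(TypeKind::Byte { signed }),
        ("short", _) => Some(TypeKind::Short { endian, signed }),
        ("long", _) => Some(TypeKind::Long { endian, signed }),
        ("quad", _) => Some(TypeKind::Quad { endian, signed }),
        _ => None,
    }
}
```

In this sketch "uleshort" maps to a little-endian Short with signed: false, and "beshort" maps to a big-endian Short with signed: true, matching the convention described above.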

Why High-Bit Magic Numbers Require Unsigned Types#

File format signatures frequently contain bytes with high bits set. Without unsigned interpretation, these values would be misinterpreted as negative numbers:

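  • JPEG: 0 ubeshort 0xffd8. Both bytes have the high bit set (0xFF = 255 and 0xD8 = 216, each above the signed byte maximum of 127), so a signed 16-bit read would yield -40
  • PNG: 0 ubelong 0x89504e47. The leading byte 0x89 (137 decimal) exceeds the signed byte maximum of 127, so a signed 32-bit read would yield a negative number
  • GZIP: 0 beshort 0x1f8b. The second byte 0x8B (139 decimal) exceeds 127, but the top bit of the 16-bit value is clear, so the signed beshort type reads it as the same positive value (8075)

Without the u-prefix convention and the parse_unsigned_number path supporting the full u64 range, detections that depend on a set high bit, such as the JPEG and PNG rules above, would fail. The parser must preserve signedness information from type names through to evaluation.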
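
The difference between the two interpretations can be checked directly on the JPEG signature bytes. The helper read_be_short is hypothetical and uses only standard-library conversions. It reads two big-endian bytes either as a signed or as an unsigned 16-bit value and widens the result to i64 so that both outcomes can be compared.

```rust
/// Hypothetical helper: interpret two big-endian bytes as a signed or
/// unsigned 16-bit integer, widened to i64 for comparison.
fn read_be_short(bytes: [u8; 2], signed: bool) -> i64 {
    if signed {
        i64::from(i16::from_be_bytes(bytes))
    } else {
        i64::from(u16::from_be_bytes(bytes))
    }
}
```

For the bytes 0xFF 0xD8, the signed read returns -40 and the unsigned read returns 65496 (0xffd8). A rule that compares against the literal 0xffd8 therefore only matches when the type is unsigned.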

Design Rationale#

The separation of parse_number (i64) and parse_unsigned_number (u64) enforces architectural constraints that prevent subtle bugs and maintain semantic correctness:

  1. Preserve libmagic semantics: Signed-by-default types require infrastructure that distinguishes signed from unsigned parsing at the grammar level
  2. Prevent type confusion: Offsets (inherently signed) and values (contextually typed) operate in different semantic domains and must not be conflated
  3. Enable high-bit detection: u-prefix types require the full u64 range to correctly match file signatures with bytes exceeding 127
  4. Maintain exhaustive pattern matching: The TypeKind enum's signed: bool field propagates through the entire evaluator; collapsing the parser's distinction would silently break downstream assumptions

Relevant Code Files#

| File Path | Purpose |
| --- | --- |
| src/parser/grammar/mod.rs | parse_number (lines 133-154), parse_unsigned_number (lines 103-109), parse_numeric_value (lines 574-594), parse_offset (lines 179-185) |
| src/parser/ast.rs | TypeKind enum with signed fields (lines 80-151), Value enum (lines 307-329), OffsetSpec enum (lines 13-78) |
| src/parser/types.rs | parse_type_keyword (lines 43-87), type_keyword_to_kind mapping (lines 121-232) |

Related Topics#

  • Type Signedness Defaults: The signed field pattern in TypeKind and its propagation through the evaluator
  • Parser-Evaluator Architecture: How parse_number constraints flow through the AST to evaluation
  • Magic File Format Specification: libmagic syntax conventions for type names and numeric literals