Type Signedness Defaults And Unsigned Type Variants#
The libmagic-rs type system implements explicit signedness control through TypeKind enum variants with signed: bool fields. This design provides libmagic-compatible type interpretation for multi-byte integers (Short, Long, and Quad types), enabling correct detection of file formats whose magic numbers have high bits set. The parser recognizes endian-prefixed type names (beshort, leshort, belong, lelong, bequad, lequad) and maps them to TypeKind variants with appropriate endianness and signedness settings.
All integer TypeKind variants (Byte, Short, Long, Quad) include explicit signed: bool fields that control whether values are interpreted via signed casts (i8/i16/i32/i64) or unsigned conversions (u8/u16/u32/u64). The String variant has no signedness field. The read_byte function returns Value::Uint(u64) for unsigned bytes and Value::Int(i64) for signed bytes. This distinction matters for formats like JPEG (0xffd8) and PNG (0x89504e47), where unsigned interpretation prevents misreading high-bit magic numbers as negative values.
The signedness field cascades through multiple subsystems: type reading functions in src/evaluator/types.rs, strength scoring in src/evaluator/strength.rs, serialization in build.rs and src/build_helpers.rs, and property test strategies in tests/property_tests.rs. Rust's exhaustive pattern matching ensures all code paths handle signedness consistently.
TypeKind Enum Structure#
The TypeKind enum in src/parser/ast.rs defines five data type variants for interpreting bytes in magic rules:
pub enum TypeKind {
Byte {
signed: bool,
},
Short {
endian: Endianness,
signed: bool,
},
Long {
endian: Endianness,
signed: bool,
},
Quad {
endian: Endianness,
signed: bool,
},
String {
max_length: Option<usize>,
},
}
Variant Details#
Byte: Single 8-bit byte with signed: bool field. When signed: true, values are cast to i8 and returned as Value::Int(i64). When signed: false, values remain u8 and return as Value::Uint(u64).
Short: 16-bit integer with endian: Endianness and signed: bool fields. When signed: true, values are cast to i16 and returned as Value::Int(i64). When signed: false, values remain u16 and return as Value::Uint(u64).
Long: 32-bit integer with endian: Endianness and signed: bool fields. Similar to Short, signed: true triggers casting to i32 and returns Value::Int(i64), while signed: false uses u32 and returns Value::Uint(u64).
Quad: 64-bit integer with endian: Endianness and signed: bool fields. When signed: true, values are cast to i64 and returned as Value::Int(i64). When signed: false, values remain u64 and return as Value::Uint(u64).
String: Variable-length string with optional max_length constraint.
The Endianness enum provides three byte order options: Little (LSB first), Big (MSB first), and Native (system-dependent).
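As a minimal sketch of how the three byte orders differ, the snippet below decodes the same two bytes under each mode. The `decode_u16` helper is hypothetical and only illustrates the idea; it is not taken from the project.

```rust
/// Byte order options, mirroring the three modes described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Endianness {
    Little,
    Big,
    Native,
}

/// Hypothetical helper: decode two bytes as an unsigned 16-bit value
/// using the requested byte order.
pub fn decode_u16(bytes: [u8; 2], endian: Endianness) -> u16 {
    match endian {
        // First byte is the least significant.
        Endianness::Little => u16::from_le_bytes(bytes),
        // First byte is the most significant.
        Endianness::Big => u16::from_be_bytes(bytes),
        // Whatever order the compilation target uses.
        Endianness::Native => u16::from_ne_bytes(bytes),
    }
}
```

With the bytes `[0x34, 0x12]`, Little yields 0x1234 and Big yields 0x3412. Native matches one of the two depending on the target.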
Parser Grammar and Type Name Mapping#
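The parser in src/parser/grammar.rs recognizes type name keywords using nom's alt() and tag() combinators. The excerpt below shows eight of them. The full set, including quad and the u-prefixed names, is listed in the mapping table that follows. Type names are listed from longest to shortest to prevent premature matching of prefixes: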
let (input, type_name) = alt((
tag("lelong"),
tag("belong"),
tag("leshort"),
tag("beshort"),
tag("long"),
tag("short"),
tag("byte"),
tag("string"),
))
.parse(input)?;
Type Name to TypeKind Mapping#
The parser maps type names to TypeKind variants with the following signedness and endianness patterns:
| Type Name | TypeKind Variant | Endianness | Signedness |
|---|---|---|---|
| byte | TypeKind::Byte | N/A | signed: true |
| ubyte | TypeKind::Byte | N/A | signed: false |
| short | TypeKind::Short | Native | signed: true |
| ushort | TypeKind::Short | Native | signed: false |
| leshort | TypeKind::Short | Little | signed: true |
| uleshort | TypeKind::Short | Little | signed: false |
| beshort | TypeKind::Short | Big | signed: true |
| ubeshort | TypeKind::Short | Big | signed: false |
| long | TypeKind::Long | Native | signed: true |
| ulong | TypeKind::Long | Native | signed: false |
| lelong | TypeKind::Long | Little | signed: true |
| ulelong | TypeKind::Long | Little | signed: false |
| belong | TypeKind::Long | Big | signed: true |
| ubelong | TypeKind::Long | Big | signed: false |
| quad | TypeKind::Quad | Native | signed: true |
| uquad | TypeKind::Quad | Native | signed: false |
| lequad | TypeKind::Quad | Little | signed: true |
| ulequad | TypeKind::Quad | Little | signed: false |
| bequad | TypeKind::Quad | Big | signed: true |
| ubequad | TypeKind::Quad | Big | signed: false |
| string | TypeKind::String | N/A | N/A |
Following libmagic conventions, unprefixed type names (byte, short, long, quad, beshort, belong, leshort, lelong, bequad, lequad) default to signed interpretation. The u-prefixed variants (ubyte, ushort, ulong, uquad, ubeshort, ubelong, uleshort, ulelong, ubequad, ulequad) explicitly request unsigned interpretation.
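The naming convention in the table can be sketched as a plain lookup. This is an illustration only: the project's parser is built on nom, and the function name `type_kind_from_name`, the `Option` return type, and the choice of `max_length: None` for `string` are assumptions made for this example.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Endianness {
    Little,
    Big,
    Native,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TypeKind {
    Byte { signed: bool },
    Short { endian: Endianness, signed: bool },
    Long { endian: Endianness, signed: bool },
    Quad { endian: Endianness, signed: bool },
    String { max_length: Option<usize> },
}

/// Hypothetical lookup from a type name to a TypeKind (not the nom parser).
pub fn type_kind_from_name(name: &str) -> Option<TypeKind> {
    if name == "string" {
        // Assumption: no length limit unless the rule specifies one.
        return Some(TypeKind::String { max_length: None });
    }
    // A leading `u` requests unsigned; every other name defaults to signed.
    let (signed, rest) = match name.strip_prefix('u') {
        Some(rest) => (false, rest),
        None => (true, name),
    };
    // An optional `le` or `be` prefix selects the byte order.
    let (endian, base) = if let Some(base) = rest.strip_prefix("le") {
        (Endianness::Little, base)
    } else if let Some(base) = rest.strip_prefix("be") {
        (Endianness::Big, base)
    } else {
        (Endianness::Native, rest)
    };
    match base {
        // `byte` has no byte order, so names like `lebyte` are rejected.
        "byte" if endian == Endianness::Native => Some(TypeKind::Byte { signed }),
        "short" => Some(TypeKind::Short { endian, signed }),
        "long" => Some(TypeKind::Long { endian, signed }),
        "quad" => Some(TypeKind::Quad { endian, signed }),
        _ => None,
    }
}
```

Stripping the prefixes in a fixed order (`u`, then `le` or `be`) reproduces every row of the table without listing all 21 names by hand.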
Type Reading and Signed/Unsigned Conversion#
The read_typed_value function in src/evaluator/types.rs dispatches to specialized reading functions based on TypeKind:
match type_kind {
TypeKind::Byte { signed } => read_byte(buffer, offset, *signed),
TypeKind::Short { endian, signed } => read_short(buffer, offset, *endian, *signed),
TypeKind::Long { endian, signed } => read_long(buffer, offset, *endian, *signed),
TypeKind::Quad { endian, signed } => read_quad(buffer, offset, *endian, *signed),
TypeKind::String { max_length } => read_string(buffer, offset, *max_length),
}
Signedness Conversion Pattern#
All four integer reading functions (read_byte, read_short, read_long, read_quad) follow an identical pattern:
- Read raw bytes as unsigned (u8, u16, u32, or u64) using the appropriate endianness
- If signed == false: convert to u64 and return Value::Uint(u64)
- If signed == true: cast to signed type (value as i8 / value as i16 / value as i32 / value as i64), convert to i64, and return Value::Int(i64)
The casting approach reinterprets bit patterns as signed integers using two's complement. For example:
- 0xFF as u8 = 255 → cast to i8 = -1
- 0xFFFF as u16 = 65,535 → cast to i16 = -1
- 0xFFFFFFFF as u32 = 4,294,967,295 → cast to i32 = -1
- 0xFFFFFFFFFFFFFFFF as u64 = 18,446,744,073,709,551,615 → cast to i64 = -1
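The conversion steps can be sketched for the 16-bit case as follows. This is a simplified illustration rather than the project's read_short: the function name `short_value` is hypothetical, it takes two already-extracted bytes instead of a buffer and offset, and the `Value` enum is reduced to its two integer variants.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Endianness {
    Little,
    Big,
    Native,
}

/// Reduced to the two integer variants for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Value {
    Uint(u64),
    Int(i64),
}

/// Hypothetical conversion of two raw bytes into a Value.
pub fn short_value(bytes: [u8; 2], endian: Endianness, signed: bool) -> Value {
    // Step 1: always read the raw bytes as unsigned.
    let raw: u16 = match endian {
        Endianness::Little => u16::from_le_bytes(bytes),
        Endianness::Big => u16::from_be_bytes(bytes),
        Endianness::Native => u16::from_ne_bytes(bytes),
    };
    if signed {
        // `as i16` reinterprets the bit pattern (two's complement),
        // then the value is sign-extended to i64.
        Value::Int(i64::from(raw as i16))
    } else {
        // Zero-extend to u64.
        Value::Uint(u64::from(raw))
    }
}
```

Reading as unsigned first and casting afterwards keeps the byte-order logic in one place, so the signed and unsigned paths cannot drift apart.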
Return Value Types#
Functions return variants of the Value enum:
- read_byte: Returns Value::Int(i64) if signed: true, otherwise Value::Uint(u64)
- read_short: Returns Value::Int(i64) if signed: true, otherwise Value::Uint(u64)
- read_long: Returns Value::Int(i64) if signed: true, otherwise Value::Uint(u64)
- read_quad: Returns Value::Int(i64) if signed: true, otherwise Value::Uint(u64)
- read_string: Returns Value::String(String)
Built-in Magic Rules and High-Bit Magic Numbers#
The src/builtin_rules.magic file contains detection rules for common file formats. Several formats require unsigned type interpretation because their magic numbers have high bits set, which would be misread as negative values with signed interpretation.
JPEG Detection#
0 ubeshort 0xffd8 JPEG image data
Uses ubeshort (big-endian unsigned short) to match the JPEG start-of-image marker 0xFFD8. The most significant bit of the 16-bit value is set, so a signed read would yield -40 rather than 65,496, which is why the rule requires unsigned interpretation.
PNG Detection#
0 ubelong 0x89504e47 PNG image data
Uses ubelong (big-endian unsigned long) to match the PNG signature 0x89504E47 (byte sequence 0x89 'P' 'N' 'G'). The first byte 0x89 (137 decimal) exceeds the signed byte maximum of 127, making unsigned interpretation essential.
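To make the effect concrete for the JPEG and PNG values above, the following sketch widens each big-endian read to i128 under both interpretations. The helper names are hypothetical, and the comparison assumes the rule's literal (for example 0xffd8) is parsed as a positive number.

```rust
/// Hypothetical helper: widen a big-endian 16-bit read to i128.
pub fn be16_as_i128(bytes: [u8; 2], signed: bool) -> i128 {
    if signed {
        i128::from(i16::from_be_bytes(bytes))
    } else {
        i128::from(u16::from_be_bytes(bytes))
    }
}

/// Hypothetical helper: widen a big-endian 32-bit read to i128.
pub fn be32_as_i128(bytes: [u8; 4], signed: bool) -> i128 {
    if signed {
        i128::from(i32::from_be_bytes(bytes))
    } else {
        i128::from(u32::from_be_bytes(bytes))
    }
}

/// First two bytes of a JPEG file and first four bytes of a PNG file.
pub const JPEG_MAGIC: [u8; 2] = [0xFF, 0xD8];
pub const PNG_MAGIC: [u8; 4] = [0x89, 0x50, 0x4E, 0x47];
```

Under these assumptions, the unsigned reads equal 0xFFD8 and 0x89504E47, while the signed reads are -40 and -1,991,225,785, neither of which equals the positive literal in the rule.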
GZIP Detection#
0 beshort 0x1f8b gzip compressed data
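This rule uses the signed beshort type. The GZIP magic 0x1F8B has bit 7 set in its second byte (0x8B = 139 decimal), but the most significant bit of the 16-bit big-endian value is clear (0x1F8B = 8,075 decimal), so signed and unsigned interpretation give the same result and no unsigned variant is needed.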
Magic rule entries follow the structure: offset type value message, where hierarchical rules use > prefix for nesting.
Cascading Implementation Impacts#
The signed: bool field in all integer TypeKind variants (Byte, Short, Long, Quad) requires pattern match updates across multiple modules. Rust's compiler enforces this through exhaustiveness checking: a match that omits a variant, or a struct pattern that omits the signed field without `..`, fails to compile instead of misbehaving at run time.
Core Subsystems#
AST Definition: src/parser/ast.rs lines 80-104 defines the TypeKind enum with signed fields in all integer variants (Byte, Short, Long, Quad).
Parser Grammar: src/parser/grammar.rs lines 1460-1488 maps type name strings to TypeKind variants with explicit signed values.
Type Reading: src/evaluator/types.rs lines 122-209 implements read_byte, read_short, read_long, and read_quad with signed parameter handling.
Strength Scoring: src/evaluator/strength.rs lines 72-86 matches on TypeKind variants (signedness doesn't affect score, but pattern must match structure). Quad types receive a base strength of 16, the highest among integer types.
Build System Duplication#
The most critical synchronization requirement involves code generation:
- build.rs lines 270-290: serialize_type_kind function generates Rust code strings for built-in rules
- src/build_helpers.rs lines 212-232: Identical serialize_type_kind function for testing
Both functions must be updated identically because Rust build scripts cannot import from the crate being built. The duplication exists to enable comprehensive testing of build logic through the #[cfg(any(test, doc))] conditionally-compiled module. Both serialization functions handle all four integer types (Byte, Short, Long, Quad).
Example serialization output:
TypeKind::Byte { signed: false }
TypeKind::Short { endian: Endianness::Big, signed: false }
TypeKind::Long { endian: Endianness::Little, signed: true }
TypeKind::Quad { endian: Endianness::Big, signed: false }
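A serializer that produces strings of this shape could look like the sketch below. It is an illustration, not a copy of build.rs: the name `serialize_type_kind_sketch`, the helper `endianness_path`, and the handling of the String variant are assumptions.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Endianness {
    Little,
    Big,
    Native,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum TypeKind {
    Byte { signed: bool },
    Short { endian: Endianness, signed: bool },
    Long { endian: Endianness, signed: bool },
    Quad { endian: Endianness, signed: bool },
    String { max_length: Option<usize> },
}

/// Hypothetical helper: the Rust path for an Endianness value.
fn endianness_path(endian: Endianness) -> &'static str {
    match endian {
        Endianness::Little => "Endianness::Little",
        Endianness::Big => "Endianness::Big",
        Endianness::Native => "Endianness::Native",
    }
}

/// Sketch: render a TypeKind as Rust source text for generated code.
pub fn serialize_type_kind_sketch(kind: &TypeKind) -> String {
    // `{{` and `}}` emit literal braces in format strings.
    match kind {
        TypeKind::Byte { signed } => format!("TypeKind::Byte {{ signed: {} }}", signed),
        TypeKind::Short { endian, signed } => format!(
            "TypeKind::Short {{ endian: {}, signed: {} }}",
            endianness_path(*endian), signed
        ),
        TypeKind::Long { endian, signed } => format!(
            "TypeKind::Long {{ endian: {}, signed: {} }}",
            endianness_path(*endian), signed
        ),
        TypeKind::Quad { endian, signed } => format!(
            "TypeKind::Quad {{ endian: {}, signed: {} }}",
            endianness_path(*endian), signed
        ),
        // Assumption: Option<usize> is rendered via Debug ("None" / "Some(n)").
        TypeKind::String { max_length } => {
            format!("TypeKind::String {{ max_length: {:?} }}", max_length)
        }
    }
}
```

Because the match has no wildcard arm, adding a variant or field to TypeKind makes this function fail to compile until it is updated, which is the synchronization property the two duplicated copies rely on.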
Testing Infrastructure#
Property Tests: tests/property_tests.rs lines 28-54 contains the arb_type_kind() strategy that generates all four integer type variants (Byte, Short, Long, Quad) with any::<bool>() for signedness, producing all combinations of endianness and signedness.
Unit Tests: src/evaluator/types.rs contains dedicated tests comparing signed vs unsigned interpretation (e.g., 0xFFFF as 65,535 unsigned vs -1 signed). Quad type tests verify reading of 64-bit values including values above i64::MAX.
Build Helper Tests: src/build_helpers.rs lines 439-467 validates serialization for Byte, Short, Long, and Quad variants.
Design Rationale and Compatibility#
libmagic Compatibility Requirements#
The signedness control design addresses several libmagic compatibility challenges:
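High-bit magic numbers: File formats like JPEG (0xFFD8) and PNG (0x89504E47) use magic numbers whose most significant bit is set. Signed interpretation would read these as negative values, breaking detection. GZIP (0x1F8B) is not affected because the top bit of its 16-bit value is clear, which is why its rule can use signed beshort.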
Two's complement interpretation: The casting pattern (value as i16 / value as i32) correctly handles two's complement representation, where 0xFFFF becomes -1 for signed shorts and 65,535 for unsigned shorts.
Cross-type comparison safety: The operator evaluation system uses i128 intermediate values to safely compare mixed Value::Int and Value::Uint operands without overflow.
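The cross-type comparison idea can be sketched as follows. Every i64 and every u64 fits in an i128, so widening both sides first makes the comparison exact. The function names are hypothetical, and the `Value` enum is reduced to its integer variants for this example.

```rust
use std::cmp::Ordering;

/// Reduced to the two integer variants for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Value {
    Uint(u64),
    Int(i64),
}

/// Widen either integer variant to i128 without loss.
fn widen(value: &Value) -> i128 {
    match value {
        Value::Uint(u) => i128::from(*u),
        Value::Int(i) => i128::from(*i),
    }
}

/// Hypothetical comparison of two integer Values of any signedness.
pub fn compare_integers(left: &Value, right: &Value) -> Ordering {
    widen(left).cmp(&widen(right))
}
```

A direct cast would get this wrong: `-1_i64 as u64` equals `u64::MAX`, so a naive comparison would treat `Value::Int(-1)` and `Value::Uint(u64::MAX)` as equal, whereas the widened comparison orders them correctly.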
Current Implementation Status#
The parser supports both signed and unsigned type variants through u-prefixed type names (ubyte, ushort, ubeshort, ulong, ubelong, ulelong, uleshort, uquad, ubequad, ulequad). Unprefixed type names default to signed interpretation following libmagic conventions. The built-in rules use unsigned types (ubeshort, ubelong) for file formats with high-bit magic numbers like JPEG and PNG.
Memory Safety and Type Safety#
Safe buffer access: All reading functions use .get() for bounds checking instead of direct indexing, preventing panics.
Exhaustive matching: Rust's compiler enforces exhaustive pattern matching on TypeKind variants, ensuring all code paths handle signedness fields.
Type preservation: The Value enum preserves signedness through the evaluation pipeline, with Value::Int(i64) for signed values and Value::Uint(u64) for unsigned values.
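The safe buffer access point can be illustrated with a short sketch. The function names and the `Option` return type are assumptions for this example; the project's functions may report errors differently.

```rust
/// Reduced to the two integer variants for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Value {
    Uint(u64),
    Int(i64),
}

/// Sketch of a bounds-checked single-byte read.
pub fn read_byte_sketch(buffer: &[u8], offset: usize, signed: bool) -> Option<Value> {
    // `.get()` returns None instead of panicking when offset is out of range.
    let byte = *buffer.get(offset)?;
    Some(if signed {
        Value::Int(i64::from(byte as i8))
    } else {
        Value::Uint(u64::from(byte))
    })
}

/// Sketch of a bounds-checked big-endian 32-bit read.
pub fn read_u32_be_sketch(buffer: &[u8], offset: usize) -> Option<u32> {
    // checked_add guards against overflow when offset is near usize::MAX.
    let end = offset.checked_add(4)?;
    match buffer.get(offset..end)? {
        &[a, b, c, d] => Some(u32::from_be_bytes([a, b, c, d])),
        _ => None,
    }
}
```

Both functions return `None` for a truncated buffer or an out-of-range offset, so a malformed or short input file cannot trigger a panic.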
Relevant Code Files#
| File | Purpose | Key Content |
|---|---|---|
| src/parser/ast.rs | AST definitions | TypeKind enum with signed: bool fields in all integer variants (Byte, Short, Long, Quad) |
| src/parser/grammar.rs | Parser implementation | Type name parsing and mapping to TypeKind variants |
| src/evaluator/types.rs | Type reading functions | read_byte, read_short, read_long, read_quad with signedness handling |
| src/evaluator/operators.rs | Operator evaluation | Cross-type integer coercion via i128 |
| src/evaluator/strength.rs | Match scoring | TypeKind pattern matching for strength calculation |
| build.rs | Build script | serialize_type_kind code generation function |
| src/build_helpers.rs | Build testing | Duplicate serialize_type_kind for test coverage |
| tests/property_tests.rs | Property testing | arb_type_kind() strategy with signedness generation |
| src/builtin_rules.magic | Magic rules | Detection rules for JPEG, PNG, GZIP, and other formats |
Related Topics#
- Value Enum and Type System: The four-variant Value enum (Uint, Int, Bytes, String) that represents typed values throughout the evaluation pipeline
- Endianness Handling: The three endianness modes (Little, Big, Native) used with multi-byte integer types
- MagicRule Structure: How TypeKind integrates into the complete magic rule AST
- Operator Evaluation and Cross-Type Comparison: Integer coercion via i128 for safe signed/unsigned comparisons
- Enum Extension and Exhaustive Match Synchronization: Pattern for maintaining consistency when modifying enum variants across the codebase
See Also#
- Type System And Operator Coverage - Knowledge base article on type system implementation
- Enum Extension And Exhaustive Match Synchronization - Pattern for maintaining enum consistency
- ARCHITECTURE - Three-layer architecture documentation
- Magic File Compatibility Status - libmagic compatibility status
- AST Structures - Abstract syntax tree design