Pstring Max_Length Bounds Validation Pattern#
Lead Section#
The Pstring Max_Length Bounds Validation Pattern is a deliberate implementation design in libmagic-rs that validates Pascal-style length-prefixed string (pstring) bounds against a capped length constraint rather than the raw length byte value. This pattern ensures memory-safe buffer access while maintaining compatibility with GNU file's behavior for handling truncated or malformed data.
The core principle: when a max_length parameter is specified, bounds checking validates against min(length_byte, max_length) rather than the raw length byte. This allows magic rules to succeed even when the length byte indicates more data than actually exists in the buffer, provided the capped portion is available. For example, if a length byte claims 10 bytes but max_length caps to 3 and 3+ bytes are available, the read succeeds and returns the first 3 bytes.
This design matches GNU libmagic's behavior and enables file type detection on truncated files, partially downloaded data, and files whose length fields reference more data than exists. The same approach applies to any file format parser that must read length-prefixed strings from incomplete or untrusted input.
Background: Pascal Strings (pstring)#
A PString (Pascal string) is a length-prefixed string format where the first byte contains the string length (0-255), followed by that many bytes of string data. Unlike C-style null-terminated strings, Pascal strings store length explicitly and can contain null bytes within the string data.
Structure:
- Length byte: 1 byte indicating string length (0-255)
- String data: The number of bytes specified by the length byte
Example magic file syntax:
0 pstring =JPEG JPEG image (Pascal string)
The max_length parameter is an optional constraint that caps the maximum bytes read from the string data, regardless of what the length byte claims.
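To make the layout concrete, here is a minimal sketch that splits a buffer into its length byte and string data. The name `split_pstring` is hypothetical and not part of libmagic-rs, and the sketch deliberately ignores `max_length`, so it only shows the plain format.

```rust
/// Hypothetical helper (not part of libmagic-rs): splits a 1-byte-prefixed
/// Pascal string at `offset` into (length byte, string data).
/// Returns `None` if the length byte or the data it announces is missing.
fn split_pstring(buffer: &[u8], offset: usize) -> Option<(u8, &[u8])> {
    // The first byte holds the string length (0-255).
    let length_byte = *buffer.get(offset)?;
    // String data starts immediately after the length byte.
    let start = offset.checked_add(1)?;
    let end = start.checked_add(usize::from(length_byte))?;
    // `get` returns None instead of panicking when the range is out of bounds.
    let data = buffer.get(start..end)?;
    Some((length_byte, data))
}
```

For the buffer `[4, b'J', b'P', b'E', b'G']` this yields a length byte of 4 and the data `JPEG`. A buffer such as `[10, b'a']` yields `None`, because the length byte announces more data than is present.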
The Validation Pattern#
Design Principle#
The implementation validates pstring bounds using the minimum of two values:
- length_byte: The value read from the first byte (0-255)
- max_length: An optional parameter capping the maximum string read
let actual_length = if let Some(max_len) = max_length {
    std::cmp::min(string_length, max_len)
} else {
    string_length
};
Rationale#
The code includes an explicit NOTE comment explaining this behavior:
"We intentionally validate bounds against actual_length (after capping), not against the raw length byte. This matches GNU file's behavior: if the length byte claims 10 bytes but max_length caps to 3 and 3+ bytes exist, the read succeeds. Validating against the raw length byte would reject valid magic rules where max_length is used precisely to handle truncated data."
This design handles real-world scenarios where:
- Files may be truncated or partially downloaded
- Length bytes may reference more data than actually exists
- Magic rules use max_length as a safety mechanism for potentially malicious or malformed length fields
- Format detection should succeed on incomplete but valid-enough data
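The difference between the two validation strategies can be sketched as a pair of predicates. All function names here are hypothetical and chosen for illustration; only the `min(length_byte, max_length)` rule comes from the text.

```rust
/// Hypothetical helper: true when `needed` string bytes fit after the
/// 1-byte length prefix located at `offset`.
fn data_fits(buffer_len: usize, offset: usize, needed: usize) -> bool {
    offset
        .checked_add(1)
        .and_then(|start| start.checked_add(needed))
        .map_or(false, |end| end <= buffer_len)
}

/// Strict variant: validates against the raw length byte.
/// This is the behavior the pattern deliberately avoids.
fn fits_raw(buffer: &[u8], offset: usize, length_byte: u8) -> bool {
    data_fits(buffer.len(), offset, usize::from(length_byte))
}

/// Capped variant: validates against min(length_byte, max_length).
fn fits_capped(
    buffer: &[u8],
    offset: usize,
    length_byte: u8,
    max_length: Option<usize>,
) -> bool {
    let raw = usize::from(length_byte);
    let actual = max_length.map_or(raw, |max| raw.min(max));
    data_fits(buffer.len(), offset, actual)
}
```

With a buffer of one length byte (value 10) followed by three data bytes, `fits_raw` rejects the read, while `fits_capped` with `Some(3)` accepts it. Without a cap, both variants agree and reject.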
Implementation Details#
The read_pstring() function in src/evaluator/types/string.rs follows this validation sequence:
- Read length byte: buffer.get(offset) reads the first byte at the given offset, returning TypeReadError::BufferOverrun if out of bounds
- Calculate capped length: Apply min(length_byte, max_length) to get actual_length
- Validate string data bounds: Use checked arithmetic to verify offset + 1 + actual_length ≤ buffer.len()
- Extract and convert: Safe slicing &buffer[string_start..string_end] followed by String::from_utf8_lossy() to handle invalid UTF-8
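The four steps above can be sketched as a single function. This is an illustration under stated assumptions, not the code in src/evaluator/types/string.rs: the signature, the name `read_pstring_sketch`, and the `SketchReadError` type with its field names are all stand-ins, since the text does not show the real definitions.

```rust
/// Simplified stand-in for the project's `TypeReadError`.
/// The field names are assumptions made for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SketchReadError {
    BufferOverrun { offset: usize, buffer_len: usize },
}

/// Sketch of the validation sequence; signature is an assumption.
fn read_pstring_sketch(
    buffer: &[u8],
    offset: usize,
    max_length: Option<usize>,
) -> Result<String, SketchReadError> {
    let overrun = SketchReadError::BufferOverrun {
        offset,
        buffer_len: buffer.len(),
    };

    // 1. Read the length byte; `get` fails cleanly if offset is out of bounds.
    let string_length = usize::from(*buffer.get(offset).ok_or(overrun)?);

    // 2. Cap the length when max_length is given.
    let actual_length = max_length.map_or(string_length, |max_len| string_length.min(max_len));

    // 3. Validate offset + 1 + actual_length <= buffer.len() with checked arithmetic.
    let string_start = offset.checked_add(1).ok_or(overrun)?;
    let string_end = string_start.checked_add(actual_length).ok_or(overrun)?;
    if string_end > buffer.len() {
        return Err(overrun);
    }

    // 4. Slice (now known to be in bounds) and convert, replacing invalid UTF-8.
    Ok(String::from_utf8_lossy(&buffer[string_start..string_end]).into_owned())
}
```

Note that the bounds check in step 3 uses `actual_length`, not `string_length`, which is the whole point of the pattern: a capped read succeeds whenever the capped portion is present.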
Behavioral Examples#
Example 1: Truncated file with max_length
- Length byte: 10
- max_length: 5
- Available buffer: 5 bytes after length byte
- Result: Success - reads and returns 5 bytes
This is validated by the test case test_read_pstring_max_length_caps_when_buffer_short.
Example 2: Truncated file without max_length
- Length byte: 10
- max_length: None
- Available buffer: 3 bytes after length byte
- Result: Failure - returns TypeReadError::BufferOverrun
This is validated by the test case test_read_pstring_buffer_overrun_length_exceeds_data.
Example 3: Well-formed string
- Length byte: 5
- max_length: 10
- Available buffer: 10 bytes after length byte
- Result: Success - reads and returns 5 bytes (capped by length byte)
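The three examples can be checked mechanically. The sketch below assumes a hypothetical compact helper, `capped_pstring`, that reads a pstring at offset 0 and returns `None` where the text describes `TypeReadError::BufferOverrun`. The helper names and the filler byte are illustrative only.

```rust
/// Hypothetical compact reader for a pstring at the start of `buffer`.
/// Returns the capped string bytes, or `None` on a buffer overrun.
fn capped_pstring(buffer: &[u8], max_length: Option<usize>) -> Option<&[u8]> {
    // Split off the length byte; fails on an empty buffer.
    let (&length_byte, data) = buffer.split_first()?;
    let raw = usize::from(length_byte);
    let actual = max_length.map_or(raw, |max| raw.min(max));
    // Bounds are validated against the capped length, not the raw one.
    data.get(..actual)
}

/// Builds a buffer holding `length_byte` followed by `available` filler bytes.
fn example_buffer(length_byte: u8, available: usize) -> Vec<u8> {
    let mut buf = vec![length_byte];
    buf.extend(std::iter::repeat(b'x').take(available));
    buf
}

/// Number of bytes read in Examples 1-3, in order (`None` means failure).
fn example_outcomes() -> [Option<usize>; 3] {
    // (length byte, max_length, bytes available after the length byte)
    let cases = [(10u8, Some(5usize), 5usize), (10, None, 3), (5, Some(10), 10)];
    cases.map(|(len, max, avail)| {
        capped_pstring(&example_buffer(len, avail), max).map(|s| s.len())
    })
}
```

Under these assumptions `example_outcomes()` evaluates to `[Some(5), None, Some(5)]`, matching the three results listed above.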
GNU File Compatibility#
When a pstring length byte exceeds max_length, libmagic-rs truncates to max_length rather than returning an error. This truncation strategy matches GNU file's behavior for handling:
- Corrupted or truncated files: Files damaged in transit or storage
- Partially downloaded files: Network transfers interrupted mid-stream
- Malicious length fields: Files with intentionally oversized length claims
- Format probing with incomplete data: Detecting file types from file headers alone
By validating against the capped length, libmagic-rs maintains the same detection capabilities as GNU file on real-world imperfect data.
Architectural Context#
Type System Integration#
PString is part of libmagic-rs's broader type system, which includes:
- String types: read_string() (null-terminated C strings) and read_pstring() (length-prefixed Pascal strings)
- Numeric types: read_byte(), read_short(), read_long(), read_quad()
- Float types: read_float(), read_double()
- Date types: read_date(), read_qdate()
All type implementations share the same TypeReadError enum with BufferOverrun and UnsupportedType variants.
Reusable Safety Pattern#
The pattern of using checked arithmetic (checked_add()) for all offset calculations is consistent across all type implementations. This prevents integer overflow vulnerabilities when computing buffer ranges:
let end = offset.checked_add(SIZE).ok_or(TypeReadError::BufferOverrun { ... })?;
let bytes = buffer.get(offset..end).ok_or(TypeReadError::BufferOverrun { ... })?;
PString applies this same pattern, adapted for variable-length data.
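As a self-contained illustration of the fixed-width case, the following sketch reads a little-endian 32-bit value using the same two guards. The function name is hypothetical, and `Option` stands in for the project's `Result<_, TypeReadError>` because the fields of `BufferOverrun` are not shown in the text.

```rust
/// Hypothetical fixed-width reader showing the checked-arithmetic pattern.
/// Returns `None` where the project code would report a buffer overrun.
fn read_u32_le_sketch(buffer: &[u8], offset: usize) -> Option<u32> {
    const SIZE: usize = 4;
    // checked_add yields None on overflow instead of wrapping or panicking.
    let end = offset.checked_add(SIZE)?;
    // get(range) yields None when the range runs past the end of the buffer.
    let bytes = buffer.get(offset..end)?;
    // The slice is exactly SIZE bytes long here, so indexing cannot panic.
    Some(u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}
```

An offset of `usize::MAX` is rejected by `checked_add` before any slice access happens, and a short buffer is rejected by `get`. Neither path can panic.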
Multi-Byte Length Prefix Extensions#
- pstring/B: 1-byte length prefix (0-255)
- pstring/H: 2-byte little-endian length prefix (0-65535), using u16::from_le_bytes
- pstring/L: 4-byte little-endian length prefix (0-4294967295), using u32::from_le_bytes
Each variant applies the same validation pattern: bounds checking validates offset + prefix_width + min(length_value, max_length) ≤ buffer.len(), where prefix_width is 1, 2, or 4 bytes respectively.
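A minimal sketch of how one reader might cover all three widths follows. The `PrefixWidth` enum and `read_prefixed_sketch` function are hypothetical names, and `Option` stands in for the project's error type. The little-endian decoding follows the description above.

```rust
use std::convert::TryFrom;

/// Hypothetical enum naming the three prefix widths (/B, /H, /L).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum PrefixWidth {
    One,
    Two,
    Four,
}

impl PrefixWidth {
    /// Width of the length prefix in bytes.
    fn bytes(self) -> usize {
        match self {
            PrefixWidth::One => 1,
            PrefixWidth::Two => 2,
            PrefixWidth::Four => 4,
        }
    }
}

/// Returns the capped string bytes, or `None` on a buffer overrun.
fn read_prefixed_sketch(
    buffer: &[u8],
    offset: usize,
    width: PrefixWidth,
    max_length: Option<usize>,
) -> Option<&[u8]> {
    // Locate and read the prefix itself with checked arithmetic.
    let data_start = offset.checked_add(width.bytes())?;
    let prefix = buffer.get(offset..data_start)?;
    let length_value = match width {
        PrefixWidth::One => usize::from(prefix[0]),
        PrefixWidth::Two => usize::from(u16::from_le_bytes([prefix[0], prefix[1]])),
        PrefixWidth::Four => {
            let raw = u32::from_le_bytes([prefix[0], prefix[1], prefix[2], prefix[3]]);
            usize::try_from(raw).ok()?
        }
    };
    // Validate offset + prefix_width + min(length_value, max_length) <= buffer.len().
    let actual = max_length.map_or(length_value, |max| length_value.min(max));
    let data_end = data_start.checked_add(actual)?;
    buffer.get(data_start..data_end)
}
```

Only the prefix decoding differs between variants; the capping and bounds check are shared.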
Testing and Validation#
The implementation includes comprehensive test coverage ensuring:
- Edge cases: Empty strings (length byte = 0), maximum length (255), truncated files
- Boundary conditions: File too short, exact fit, extra data
- Overflow protection: test_read_pstring_offset_overflow verifies offset=usize::MAX is caught by checked_add
- UTF-8 handling: Valid sequences, invalid bytes, mixed content with lossy conversion
- Operator behavior: All comparison operators (=, !=, <, >, <=, >=) across different string types
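To illustrate what lossy conversion means for the UTF-8 cases, here is a small sketch built on the standard library's `String::from_utf8_lossy`. The wrapper name `bytes_to_string` is hypothetical. Each invalid byte is replaced with U+FFFD rather than causing an error.

```rust
/// Hypothetical wrapper around the standard-library lossy conversion.
fn bytes_to_string(bytes: &[u8]) -> String {
    // Invalid UTF-8 sequences become U+FFFD (the replacement character);
    // valid text, including embedded null bytes, passes through unchanged.
    String::from_utf8_lossy(bytes).into_owned()
}
```

For instance, the bytes `[b'a', 0xFF, b'b']` convert to `"a\u{FFFD}b"`, so a read never fails solely because the string data is not valid UTF-8.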
Usage Example#
In a magic file, a pstring rule with max_length might look like:
# Match JPEG files by reading Pascal-style length-prefixed marker
0 pstring/10 =JFIF JPEG image with JFIF header
The /10 suffix specifies max_length=10, meaning:
- Read the length byte at offset 0
- Cap the read to minimum of (length_byte, 10)
- If at least that many bytes are available, read them and compare to "JFIF"
- This succeeds even if the length byte claims more than 10 bytes
Design Trade-offs#
Benefits#
- Robustness: Handles real-world truncated and malformed files gracefully
- Compatibility: Matches GNU file behavior for established magic rule semantics
- Security: max_length provides a cap on potentially malicious length fields
- Memory safety: Validating against the capped length prevents buffer overruns
Considerations#
- Semantic ambiguity: A successful read doesn't guarantee the length byte was accurate
- Partial data matching: Magic rules may match on incomplete strings
- Testing complexity: Requires careful test design to validate capped vs. raw length behavior
- Documentation burden: The pattern is non-obvious and requires explicit explanation
Related Patterns#
- Null-Terminated String Validation: read_string() uses a similar optional max_length pattern but searches for null terminators
- Checked Arithmetic for Buffer Safety: All type readers use checked_add() to prevent integer overflow
- Shared Error Types: All type implementations use unified TypeReadError for consistent error handling
Relevant Code Files#
| File | Description |
|---|---|
| src/evaluator/types/string.rs | Implementation of read_pstring() with max_length bounds validation |
| src/evaluator/types/mod.rs | Type system module with shared TypeReadError enum |
| src/evaluator/types/numeric.rs | Numeric type implementations demonstrating reusable checked arithmetic pattern |
| src/parser/ast.rs | AST definition of TypeKind::PString with optional max_length field |
See Also#
- Type System and Operator Coverage - Comprehensive overview of libmagic-rs type system
- Checked Arithmetic for Buffer Offset Safety - Project-wide pattern for preventing integer overflow
- Issue #43: Parser: implement pstring (Pascal string) type - Original pstring implementation discussion
- Issue #171: Parser: implement pstring multi-byte length prefix variants - Multi-byte length prefix extensions