Stringy Binary Analyzer Specification
Type: Document · Status: Published · Created: Nov 10, 2025 · Updated: Nov 10, 2025 by Dosu Bot

Overview

The Stringy Binary Analyzer is a modular, developer-focused tool for extracting and classifying meaningful strings from binary files. Unlike generic string dumpers, it leverages format-specific knowledge to separate valuable data from noise, supporting the ELF, PE, and Mach-O formats. The system is organized into focused modules: container (format detection and parsing), extraction (string extraction algorithms), classification (semantic analysis and tagging), output (result formatting), and types (core data structures and error handling) [src/lib.rs].


Architecture and Major Modules

The architecture is highly modular:

  • container: Handles binary format detection and parsing, with format-specific enhancements for ELF, PE, and Mach-O [src/container/mod.rs].
  • extraction: Framework for string extraction algorithms (planned).
  • classification: Types and infrastructure for semantic analysis and tagging (planned).
  • output: Interfaces for result formatting (planned).
  • types: Core data structures and error handling [src/types.rs].

Format Detection

Format detection is performed using the goblin crate's Object::parse, which inspects the binary data and returns a BinaryFormat enum: Elf, Pe, MachO, or Unknown. This enables automatic routing to the appropriate parser [src/container/mod.rs].

pub fn detect_format(data: &[u8]) -> BinaryFormat {
    match Object::parse(data) {
        Ok(Object::Elf(_)) => BinaryFormat::Elf,
        Ok(Object::PE(_)) => BinaryFormat::Pe,
        Ok(Object::Mach(_)) => BinaryFormat::MachO,
        _ => BinaryFormat::Unknown,
    }
}

Section Classification and Numeric Weighting

Sections are classified using the SectionType enum: StringData, ReadOnlyData, WritableData, Code, Debug, Resources, and Other. Each section is described by a SectionInfo struct, which includes its name, offset, size, RVA, type, executability, writability, and a numeric weight indicating the likelihood of containing meaningful strings [src/types.rs].

Section weights are assigned numerically based on section type and name. For example, string data sections like .rodata, .rdata, and __cstring receive the highest weights (up to 10.0), while code and debug sections receive lower weights. This prioritization guides downstream string extraction [src/container/elf.rs, src/container/pe.rs, src/container/macho.rs].
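As an illustration of this scheme (the exact weight values and name tables here are assumptions; the real tables live in src/container/elf.rs, pe.rs, and macho.rs), a name-based classifier and weight function might look like:

```rust
// Hypothetical sketch of name-based section classification and weighting.
#[derive(Clone, Copy, Debug, PartialEq)]
enum SectionType {
    StringData,
    ReadOnlyData,
    WritableData,
    Code,
    Debug,
    Resources,
    Other,
}

fn classify_by_name(name: &str) -> SectionType {
    match name {
        // ELF, PE, and Mach-O string-bearing sections respectively
        ".rodata" | ".rdata" | "__cstring" => SectionType::StringData,
        ".data" | "__data" => SectionType::WritableData,
        ".text" | "__text" => SectionType::Code,
        ".rsrc" => SectionType::Resources,
        n if n.starts_with(".debug") => SectionType::Debug,
        _ => SectionType::Other,
    }
}

fn section_weight(ty: SectionType) -> f32 {
    // Illustrative values only: string-bearing sections score highest (up to 10.0).
    match ty {
        SectionType::StringData => 10.0,
        SectionType::ReadOnlyData => 7.0,
        SectionType::Resources => 5.0,
        SectionType::WritableData => 3.0,
        SectionType::Code => 1.0,
        SectionType::Debug => 0.5,
        SectionType::Other => 0.1,
    }
}
```

Downstream extraction can then simply visit sections in descending weight order.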


ELF, PE, and Mach-O Container Parsing and Symbol Extraction

ELF

The ElfParser uses goblin to parse ELF binaries. It classifies sections by name and flags, assigns weights, and extracts imports and exports from both the dynamic and static symbol tables. Symbols are mapped to libraries using version information and filtered by type and visibility [src/container/elf.rs].
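The type-and-visibility filtering can be sketched with the standard ELF symbol constants; the Sym struct and predicates below are illustrative stand-ins, not the actual goblin-based code:

```rust
// Standard ELF symbol-table constants (see the System V ABI).
const STT_OBJECT: u8 = 1; // data symbol
const STT_FUNC: u8 = 2;   // function symbol
const STB_GLOBAL: u8 = 1;
const STB_WEAK: u8 = 2;
const STV_DEFAULT: u8 = 0;
const SHN_UNDEF: u16 = 0; // undefined section index => imported symbol

// Minimal stand-in for goblin's symbol type.
struct Sym {
    st_info: u8,  // high nibble: binding, low nibble: type
    st_other: u8, // low two bits: visibility
    st_shndx: u16,
}

fn sym_type(s: &Sym) -> u8 { s.st_info & 0x0f }
fn sym_bind(s: &Sym) -> u8 { s.st_info >> 4 }
fn sym_vis(s: &Sym) -> u8 { s.st_other & 0x03 }

/// A defined, global-or-weak function/object with default visibility
/// is a plausible export.
fn is_export(s: &Sym) -> bool {
    s.st_shndx != SHN_UNDEF
        && matches!(sym_type(s), STT_FUNC | STT_OBJECT)
        && matches!(sym_bind(s), STB_GLOBAL | STB_WEAK)
        && sym_vis(s) == STV_DEFAULT
}

/// An undefined global-or-weak symbol is an import.
fn is_import(s: &Sym) -> bool {
    s.st_shndx == SHN_UNDEF && matches!(sym_bind(s), STB_GLOBAL | STB_WEAK)
}
```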

PE

The PeParser parses PE binaries, classifies sections by name and characteristics, assigns weights, and extracts imports/exports from the import/export tables. Imports include symbol name, library, and address; exports include name, address, and ordinal [src/container/pe.rs].
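Classification by section characteristics can be illustrated with the standard PE/COFF section flags; the function name and enum below are illustrative, not the actual src/container/pe.rs code:

```rust
// Standard PE/COFF section characteristic flags.
const IMAGE_SCN_CNT_CODE: u32 = 0x0000_0020;
const IMAGE_SCN_CNT_INITIALIZED_DATA: u32 = 0x0000_0040;
const IMAGE_SCN_MEM_EXECUTE: u32 = 0x2000_0000;
const IMAGE_SCN_MEM_READ: u32 = 0x4000_0000;
const IMAGE_SCN_MEM_WRITE: u32 = 0x8000_0000;

#[derive(Debug, PartialEq)]
enum SectionType { ReadOnlyData, WritableData, Code, Resources, Other }

fn classify_pe_section(name: &str, characteristics: u32) -> SectionType {
    if name == ".rsrc" {
        SectionType::Resources
    } else if characteristics & (IMAGE_SCN_CNT_CODE | IMAGE_SCN_MEM_EXECUTE) != 0 {
        SectionType::Code
    } else if characteristics & IMAGE_SCN_CNT_INITIALIZED_DATA != 0 {
        // Writable data is a weaker string source than read-only data.
        if characteristics & IMAGE_SCN_MEM_WRITE != 0 {
            SectionType::WritableData
        } else {
            SectionType::ReadOnlyData
        }
    } else {
        SectionType::Other
    }
}
```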

Mach-O

The MachoParser supports both single-architecture and fat binaries. It classifies sections by segment and section names, assigns weights, and extracts imports/exports from symbol tables. Import extraction identifies undefined symbols; export extraction filters for meaningful, defined symbols [src/container/macho.rs].
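Fat-versus-thin detection can be illustrated with the standard Mach-O magic numbers; the real parser delegates this to goblin's Mach type, so the following is only a sketch:

```rust
// Standard Mach-O magic numbers.
const FAT_MAGIC: u32 = 0xcafe_babe;   // fat (multi-architecture) header
const FAT_CIGAM: u32 = 0xbeba_feca;   // byte-swapped fat header
const MH_MAGIC_64: u32 = 0xfeed_facf; // thin 64-bit Mach-O
const MH_CIGAM_64: u32 = 0xcffa_edfe; // byte-swapped thin 64-bit Mach-O

#[derive(Debug, PartialEq)]
enum MachoKind { Fat, Thin64, Unknown }

fn macho_kind(data: &[u8]) -> MachoKind {
    if data.len() < 4 {
        return MachoKind::Unknown;
    }
    // Read the first four bytes as big-endian and accept either byte order.
    let magic = u32::from_be_bytes([data[0], data[1], data[2], data[3]]);
    match magic {
        FAT_MAGIC | FAT_CIGAM => MachoKind::Fat,
        MH_MAGIC_64 | MH_CIGAM_64 => MachoKind::Thin64,
        _ => MachoKind::Unknown,
    }
}
```

A fat binary would then be split into per-architecture slices, each parsed as a thin Mach-O.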


String Extraction

String extraction is represented by the FoundString struct, which includes the string text, encoding (Ascii, Utf8, Utf16Le, Utf16Be), offset, RVA, section, length, semantic tags, a relevance score, and the source of the string. The extraction module's framework is in place, but the implementation of extraction algorithms is planned for future development [src/types.rs, src/lib.rs].
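Since the extraction algorithms are still planned, the following is only a minimal illustration of what an ASCII-run extractor could look like, recording the offsets a FoundString would carry (minimum length is the only heuristic shown):

```rust
/// Collect printable-ASCII runs of at least `min_len` bytes,
/// returning (offset, text) pairs.
fn extract_ascii(data: &[u8], min_len: usize) -> Vec<(usize, String)> {
    let mut out = Vec::new();
    let mut start = None;
    for (i, &b) in data.iter().enumerate() {
        let printable = (0x20..0x7f).contains(&b);
        match (printable, start) {
            // A printable byte opens a new run.
            (true, None) => start = Some(i),
            // A non-printable byte closes the current run.
            (false, Some(s)) => {
                if i - s >= min_len {
                    out.push((s, String::from_utf8_lossy(&data[s..i]).into_owned()));
                }
                start = None;
            }
            _ => {}
        }
    }
    // Flush a run that reaches the end of the buffer.
    if let Some(s) = start {
        if data.len() - s >= min_len {
            out.push((s, String::from_utf8_lossy(&data[s..]).into_owned()));
        }
    }
    out
}
```

A real implementation would additionally handle UTF-8 and UTF-16 encodings and attach section, RVA, and tag information.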


Semantic Classification and Ranking

Semantic classification uses the Tag enum, which includes tags such as Url, Domain, IPv4, IPv6, FilePath, RegistryPath, Guid, Email, Base64, FormatString, UserAgent, Import, Export, Version, Manifest, and Resource. Each extracted string can be tagged with one or more semantic labels. Ranking is supported via a score field in FoundString, which can be used to order results by relevance. The infrastructure is defined, but the criteria and algorithms for semantic classification and ranking are planned [src/types.rs].
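To make the tagging idea concrete, here is a deliberately simple, assumed set of heuristics over a subset of the Tag variants (the real criteria are still planned and would be more careful):

```rust
// Subset of the Tag enum from src/types.rs, with toy matching rules.
#[derive(Debug, PartialEq)]
enum Tag { Url, Email, IPv4, FilePath }

fn classify(s: &str) -> Vec<Tag> {
    let mut tags = Vec::new();
    if s.starts_with("http://") || s.starts_with("https://") {
        tags.push(Tag::Url);
    }
    // Crude email check: an '@' with a dotted domain after it.
    if s.contains('@') && s.rsplit('@').next().map_or(false, |d| d.contains('.')) {
        tags.push(Tag::Email);
    }
    // Four dot-separated decimal octets, each 0-255.
    let octets: Vec<_> = s.split('.').collect();
    if octets.len() == 4 && octets.iter().all(|o| o.parse::<u8>().is_ok()) {
        tags.push(Tag::IPv4);
    }
    // Unix absolute path or Windows drive path.
    if s.starts_with('/') || s.contains(":\\") {
        tags.push(Tag::FilePath);
    }
    tags
}
```

Tag counts could then feed the score field, e.g. strings carrying more (or rarer) tags ranking higher.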


Output Formatting

The output module is designed to support result formatting, with interfaces ready for future implementation. Planned output formats include user-friendly and machine-readable options such as JSON and tables. The details of output formatting are to be developed [src/lib.rs].
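As one possible shape for the machine-readable path (a real implementation would likely use serde; the struct here is a simplified subset of FoundString's fields):

```rust
// Simplified subset of FoundString for output purposes.
struct FoundString {
    text: String,
    offset: u64,
    section: String,
    score: f32,
}

// Escape the two characters JSON strings cannot contain raw.
fn escape(s: &str) -> String {
    s.chars()
        .flat_map(|c| match c {
            '"' => vec!['\\', '"'],
            '\\' => vec!['\\', '\\'],
            c => vec![c],
        })
        .collect()
}

/// Emit results as a JSON array, one object per string.
fn to_json(strings: &[FoundString]) -> String {
    let items: Vec<String> = strings
        .iter()
        .map(|f| {
            format!(
                "{{\"text\":\"{}\",\"offset\":{},\"section\":\"{}\",\"score\":{}}}",
                escape(&f.text), f.offset, escape(&f.section), f.score
            )
        })
        .collect();
    format!("[{}]", items.join(","))
}
```

A table formatter would consume the same slice, keeping formatting fully decoupled from extraction.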


CLI Interface

The CLI interface is defined using the clap crate. The command is named stringy and accepts a single input file argument. The main extraction pipeline is not yet implemented in main.rs, but the CLI scaffolding is in place [src/main.rs].

Example usage:

stringy <FILE>

Memory Mapping

Parsing is performed on byte slices, which could be memory mapped externally for efficiency. Explicit memory mapping logic is not present in the parser code, but the design supports efficient file handling for large binaries [src/container/macho.rs].
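Because the parsers take a plain &[u8], the loading strategy is the caller's choice. A std-only loader is sketched below; swapping std::fs::read for a memory map (for example the memmap2 crate's Mmap, which also dereferences to &[u8]) would require no parser changes:

```rust
use std::fs;
use std::io;

/// Load a binary into memory as a byte vector. Reads the whole file;
/// a memory map would avoid the copy for large binaries.
fn load_binary(path: &str) -> io::Result<Vec<u8>> {
    fs::read(path)
}
```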


Testing Strategies

Comprehensive unit tests cover format detection, section classification, weight calculation, symbol filtering, and parser creation for all supported formats. Tests verify classification logic, weight assignment, and symbol extraction constants. Integration and property-based testing are not yet implemented but are natural extensions for future work [src/container/elf.rs, src/container/pe.rs, src/container/macho.rs].


Pipeline Orchestration

The pipeline orchestration is modular and extensible. Each major analysis step is encapsulated in its own module, with clear interfaces for detection, parsing, extraction, classification, and output. Errors are surfaced through the shared error types defined in the types module. The main orchestration logic is planned for future development, building on the current modular foundation [src/lib.rs].
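A hypothetical sketch of how the planned orchestration could chain the stages with Result-based error propagation (all names and stubs below are illustrative, not the actual src/lib.rs API):

```rust
#[derive(Debug)]
enum StringyError {
    UnknownFormat,
}

#[derive(Debug)]
enum BinaryFormat {
    Elf,
    Unknown,
}

// Stand-in for the container module's detection (ELF-only here).
fn detect_format(data: &[u8]) -> BinaryFormat {
    if data.starts_with(b"\x7fELF") {
        BinaryFormat::Elf
    } else {
        BinaryFormat::Unknown
    }
}

// Stand-in for the extraction module.
fn extract(data: &[u8]) -> Vec<String> {
    vec![format!("{} bytes scanned", data.len())]
}

/// Orchestrate detection then extraction, turning an unrecognized
/// format into an error the CLI can report.
fn run(data: &[u8]) -> Result<Vec<String>, StringyError> {
    match detect_format(data) {
        BinaryFormat::Unknown => Err(StringyError::UnknownFormat),
        _ => Ok(extract(data)),
    }
}
```

The real pipeline would insert parsing, classification, and output stages between these two steps, each returning Result so failures propagate cleanly to main.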


Task Breakdown

Completed:

  • Modular architecture and core data structures
  • Format detection for ELF, PE, Mach-O
  • Section classification and numeric weighting
  • Container parsing and symbol extraction for ELF, PE, Mach-O
  • CLI scaffolding
  • Unit tests for detection, classification, weighting, and symbol extraction

In Progress / Planned:

  • String extraction algorithms (framework ready)
  • Semantic classification and ranking (types defined)
  • Output formatting (interfaces ready)
  • Main pipeline orchestration and error propagation
  • Integration and property-based testing
  • Memory mapping optimizations

Example: Section Classification and Weighting

// Example: ELF section classification and weighting
let section_type = ElfParser::classify_section(&section, &name);
let weight = ElfParser::calculate_section_weight(section_type, &name);

Example: CLI Usage

stringy mybinary

Example: Data Structures

pub struct SectionInfo {
    /// Section name, e.g. ".rodata" or "__cstring"
    pub name: String,
    /// File offset of the section's contents
    pub offset: u64,
    /// Size of the section in bytes
    pub size: u64,
    /// Relative virtual address, if the format provides one
    pub rva: Option<u64>,
    /// Classification (StringData, Code, Debug, ...)
    pub section_type: SectionType,
    pub is_executable: bool,
    pub is_writable: bool,
    /// Likelihood of containing meaningful strings (higher = more likely)
    pub weight: f32,
}

For further details, see the StringyMcStringFace repository.