Numeric Section Weighting Heuristics and Implementation#
String extraction from binaries is guided by a numeric section weighting system that prioritizes areas most likely to contain meaningful strings. Each section type receives a base weight reflecting its relevance. Mach-O section weights are now normalized to a 0.0–1.0 scale for consistency, while ELF and PE currently use a 1–10 or 0–40 scale (normalization for those formats is planned).
| Section Type | Mach-O Weight | ELF/PE Weight | Rationale |
|---|---|---|---|
| StringData | 1.0 | 40 | Dedicated string storage (.rodata, __cstring) |
| Resources | 0.7 | 35 | PE resources, version info, manifests |
| ReadOnlyData | 0.4 | 25 | Read-only after loading (.data.rel.ro) |
| Debug | 0.2 | 15 | Debug symbols, build info |
| WritableData | 0.3 | 10 | Runtime data, less reliable |
| Code | 0.1 | 5 | Occasional embedded strings |
| Other | 0.1 | 0 | Unknown or irrelevant sections |
Mach-O High-Priority Sections:
- __TEXT,__cstring (primary string section)
- __TEXT,__objc_methname (Objective-C method names)
- __TEXT,__objc_classname (Objective-C class names)
- __TEXT,__const (string constants)
- __TEXT,__ustring (Unicode string literals)
- __DATA_CONST,__cfstring (Core Foundation strings)
Format-specific bonuses further refine prioritization. For example, ELF .rodata.str1.1, PE .rsrc, and Mach-O __TEXT,__cstring and related sections each receive the highest priority due to their high likelihood of containing strings.
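The weight table can be read as a plain lookup from section type to base weight. The sketch below is illustrative only: `SectionType`, `macho_weight`, and `elf_pe_weight` are hypothetical names chosen for this example, not the project's actual API, and the values are copied from the table above.

```rust
/// Hypothetical section categories mirroring the rows of the weight table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SectionType {
    StringData,
    Resources,
    ReadOnlyData,
    Debug,
    WritableData,
    Code,
    Other,
}

/// Normalized Mach-O weight (0.0 to 1.0) for a section type.
pub fn macho_weight(section: SectionType) -> f32 {
    match section {
        SectionType::StringData => 1.0,
        SectionType::Resources => 0.7,
        SectionType::ReadOnlyData => 0.4,
        SectionType::WritableData => 0.3,
        SectionType::Debug => 0.2,
        SectionType::Code | SectionType::Other => 0.1,
    }
}

/// ELF/PE weight on the 0-40 scale from the same table.
pub fn elf_pe_weight(section: SectionType) -> u32 {
    match section {
        SectionType::StringData => 40,
        SectionType::Resources => 35,
        SectionType::ReadOnlyData => 25,
        SectionType::Debug => 15,
        SectionType::WritableData => 10,
        SectionType::Code => 5,
        SectionType::Other => 0,
    }
}
```

Keeping both scales behind separate functions makes the pending normalization of ELF and PE a local change rather than one that touches every caller.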
The final relevance score for each string is computed as:
Final Score = SectionWeight + EncodingConfidence + SemanticBoost - NoisePenalty
This score is clamped to the range 0–100. Encoding confidence depends on the string's encoding and quality: for example, ASCII/UTF-8 strings with high printability receive 10 points, while low-confidence UTF-16 strings may receive as few as 2. Semantic boosts are applied for strings matching patterns such as URLs (+25), GUIDs (+20), file paths (+15), version information (+12), code artifacts (+10), and symbols (+8). Context-aware boosts increase scores by 20% for strings in high-value sections and add 5 points if the string is in a symbol context. Noise penalties of 6 to 20 points are subtracted for high entropy, excessive length, repeated patterns, or known noise patterns such as padding or hex dumps. Performance optimizations include caching of entropy calculations and regex matches, and batch processing with parallel iteration followed by sorting by score.
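As a rough illustration of the formula, the sketch below combines the four components and clamps the result. The names `Semantic`, `semantic_boost`, and `final_score` are hypothetical, the inputs are assumed to be integer points, and the noise penalty is passed as a positive magnitude. The context-aware adjustments are left out because the text does not say where in the formula they apply.

```rust
/// Hypothetical semantic categories with the boost values quoted in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Semantic {
    Url,
    Guid,
    FilePath,
    Version,
    CodeArtifact,
    Symbol,
}

/// Boost in points for a single semantic match.
pub fn semantic_boost(kind: Semantic) -> i32 {
    match kind {
        Semantic::Url => 25,
        Semantic::Guid => 20,
        Semantic::FilePath => 15,
        Semantic::Version => 12,
        Semantic::CodeArtifact => 10,
        Semantic::Symbol => 8,
    }
}

/// SectionWeight + EncodingConfidence + SemanticBoost - NoisePenalty,
/// clamped to 0..=100. `noise_penalty` is a positive magnitude (assumption).
pub fn final_score(
    section_weight: i32,
    encoding_confidence: i32,
    semantic_boost: i32,
    noise_penalty: i32,
) -> i32 {
    (section_weight + encoding_confidence + semantic_boost - noise_penalty).clamp(0, 100)
}
```

For instance, a URL found in a section weighted 40 with encoding confidence 10 and no noise penalty would score 40 + 10 + 25 = 75 under these assumptions.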
See detailed ranking heuristics and implementation.
Section-Aware String Extraction#
Extraction strategies are tailored to section priority. High-priority sections (e.g., ELF .rodata, PE .rdata, Mach-O __TEXT,__cstring, __TEXT,__objc_methname, __TEXT,__objc_classname, __TEXT,__const, __TEXT,__ustring, __DATA_CONST,__cfstring) use aggressive extraction with minimal noise filtering and shorter minimum string lengths. Medium-priority sections (e.g., ELF .data.rel.ro, PE .data) apply more conservative extraction and enhanced noise filtering. Low-priority sections (e.g., writable data) use strict filtering and longer minimum lengths to reduce false positives. Resource sections like PE .rsrc have specialized extraction routines for version info, string tables, and manifests.
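One way to picture the priority-dependent minimum length is a printable-run scanner whose threshold varies by section priority. This is a minimal sketch, not the project's extractor: `Priority`, `min_len`, and `extract_ascii` are hypothetical names, and the thresholds 4, 6, and 8 are assumed values (the text only states that high-priority sections use shorter minimums).

```rust
/// Hypothetical priority buckets for sections.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Priority {
    High,
    Medium,
    Low,
}

/// Assumed minimum string lengths; shorter for higher-priority sections.
pub fn min_len(priority: Priority) -> usize {
    match priority {
        Priority::High => 4,
        Priority::Medium => 6,
        Priority::Low => 8,
    }
}

/// Collects runs of printable ASCII bytes that are at least `min_len` long.
pub fn extract_ascii(data: &[u8], min_len: usize) -> Vec<String> {
    let mut out = Vec::new();
    let mut run: Vec<u8> = Vec::new();
    for &b in data {
        if b.is_ascii_graphic() || b == b' ' {
            run.push(b);
        } else {
            // A non-printable byte ends the current run.
            if run.len() >= min_len {
                out.push(String::from_utf8_lossy(&run).into_owned());
            }
            run.clear();
        }
    }
    // Flush a run that reaches the end of the buffer.
    if run.len() >= min_len {
        out.push(String::from_utf8_lossy(&run).into_owned());
    }
    out
}
```

The same buffer yields more candidates when scanned as a high-priority section than as a low-priority one, which is the trade-off between recall and false positives described above.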
For Mach-O binaries, in addition to section-based extraction, the system extracts and classifies load command strings, including library dependency paths, runtime search paths, and framework paths, using a dedicated extraction module. These strings are tagged according to their semantic role (e.g., DylibPath, Rpath, RpathVariable, FrameworkPath).
See section-aware extraction strategies.
Enhancements to ELF Import Discovery and Symbol Extraction#
ELF import discovery has been enhanced to provide comprehensive symbol extraction from both .dynsym and .symtab tables. The implementation supports multiple symbol types, including functions, objects, TLS variables, and indirect functions. Import detection identifies all undefined symbols (SHN_UNDEF) requiring runtime resolution, handling both global and weak bindings. Export detection extracts all globally visible defined symbols, filtering out hidden and internal symbols.
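The import and export rules can be sketched over the raw symbol fields. The constants below are the standard ELF values; the `Sym` struct and the `is_import` and `is_export` functions are hypothetical and exist only for this example. Treating weak symbols as exportable is an assumption, since the text says only "globally visible".

```rust
pub const SHN_UNDEF: u16 = 0;
pub const STB_GLOBAL: u8 = 1;
pub const STB_WEAK: u8 = 2;
pub const STV_INTERNAL: u8 = 1;
pub const STV_HIDDEN: u8 = 2;

/// Minimal, hypothetical view of an ELF symbol table entry.
pub struct Sym {
    pub name: String,
    pub st_info: u8,
    pub st_other: u8,
    pub st_shndx: u16,
}

impl Sym {
    /// Binding is stored in the upper four bits of st_info.
    pub fn binding(&self) -> u8 {
        self.st_info >> 4
    }

    /// Visibility is stored in the low two bits of st_other.
    pub fn visibility(&self) -> u8 {
        self.st_other & 0x3
    }
}

/// Undefined symbol with global or weak binding: needs runtime resolution.
pub fn is_import(sym: &Sym) -> bool {
    sym.st_shndx == SHN_UNDEF && matches!(sym.binding(), STB_GLOBAL | STB_WEAK)
}

/// Defined symbol that is not hidden or internal. Accepting weak bindings
/// here is an assumption of this sketch.
pub fn is_export(sym: &Sym) -> bool {
    sym.st_shndx != SHN_UNDEF
        && matches!(sym.binding(), STB_GLOBAL | STB_WEAK)
        && !matches!(sym.visibility(), STV_HIDDEN | STV_INTERNAL)
}
```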
See ELF symbol extraction details.
Symbol-to-library mapping leverages ELF version tables (versym and verneed) to attribute imported symbols to their providing libraries. The process maps a symbol to its version index, then to a verneed entry, and finally to the library filename. For unversioned symbols, fallback heuristics are used. This approach improves the accuracy of library attribution, especially when version information is present.
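The lookup chain (symbol index, then version index, then verneed entry, then library filename) can be sketched with plain collections. `VerneedEntry`, `build_version_index`, and `library_for_symbol` are hypothetical names for this example; a real parser would read these tables from the .gnu.version and .gnu.version_r sections. Version indices 0 and 1 are reserved for local and global unversioned symbols, so the sketch returns `None` for them and leaves the fallback heuristics to the caller.

```rust
use std::collections::HashMap;

/// Hypothetical, already-parsed verneed entry: one library plus the
/// (version index, version name) pairs it provides.
pub struct VerneedEntry {
    pub file: String,
    pub aux: Vec<(u16, String)>,
}

/// Builds a map from version index to the providing library filename.
pub fn build_version_index(verneed: &[VerneedEntry]) -> HashMap<u16, String> {
    let mut map = HashMap::new();
    for entry in verneed {
        for (index, _version_name) in &entry.aux {
            map.insert(*index, entry.file.clone());
        }
    }
    map
}

/// Resolves a dynamic symbol index to a library via versym and verneed.
pub fn library_for_symbol(
    sym_index: usize,
    versym: &[u16],
    by_version: &HashMap<u16, String>,
) -> Option<String> {
    let raw = *versym.get(sym_index)?;
    let index = raw & 0x7fff; // strip the "hidden" flag bit
    if index <= 1 {
        return None; // unversioned: caller applies fallback heuristics
    }
    by_version.get(&index).cloned()
}
```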
See symbol-to-library mapping process.
Impact on Accuracy and Speed#
These enhancements improve downstream processing by focusing extraction and analysis on the most relevant binary sections, reducing noise and false positives. Numeric section weighting and section-aware extraction ensure that high-value strings are prioritized, while noise penalties and conservative strategies in low-priority sections minimize irrelevant results. Comprehensive symbol extraction and accurate library attribution enable more precise dependency analysis and string classification.
Performance optimizations include memory mapping for large files, parallel processing of sections, and regex caching. Planned improvements such as SIMD acceleration, incremental analysis, and GPU acceleration will further enhance speed and scalability.
See performance and optimization strategies.
Supported Binary Formats#
The system supports ELF (Linux/Unix), PE (Windows), and Mach-O (macOS/iOS) binaries. Each format has dedicated logic for section classification, string extraction, and symbol handling:
- ELF: Prioritizes .rodata, .rodata.str1.1, and .data.rel.ro for string extraction; uses symbol tables and versioning for import/export analysis.
- PE: Focuses on .rdata and .rsrc for strings and resources; processes import/export tables and resource directories.
- Mach-O: Targets __TEXT,__cstring, __TEXT,__objc_methname, __TEXT,__objc_classname, __TEXT,__const, __TEXT,__ustring, and __DATA_CONST,__cfstring for string data. Additionally, parses load commands to extract library dependency paths (LC_LOAD_DYLIB, LC_LOAD_WEAK_DYLIB, LC_REEXPORT_DYLIB), runtime search paths (LC_RPATH), and framework paths, with robust tagging (DylibPath, Rpath, RpathVariable, FrameworkPath).
See format-specific extraction and classification.
Mach-O Load Command String Extraction and Tagging#
Mach-O binaries contain important metadata in their load commands, including library dependency paths and runtime search paths. The system provides a dedicated extraction module for these strings, supporting robust classification and tagging.
Extracted Load Command Strings#
- Library Dependency Paths: Extracted from LC_LOAD_DYLIB, LC_LOAD_WEAK_DYLIB, and LC_REEXPORT_DYLIB commands. Tagged as DylibPath and FilePath. If the path contains .framework, also tagged as FrameworkPath.
- Runtime Search Paths: Extracted from LC_RPATH commands. Tagged as Rpath. If the path contains @rpath, @executable_path, or @loader_path, also tagged as RpathVariable. Framework paths are also tagged as FrameworkPath.
Tagging System#
The tagging system for load command strings includes:
- DylibPath: Indicates a library dependency path
- Rpath: Indicates a runtime search path
- RpathVariable: Indicates the presence of an @-variable in the path
- FrameworkPath: Indicates a macOS framework path
- FilePath: Indicates a file system path (always present for DylibPath)
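The tagging rules are simple enough to express as two small functions. The sketch below uses a local stand-in `Tag` enum and hypothetical helpers `tag_dylib` and `tag_rpath`; it mirrors the rules as described here and is not taken from the stringy source.

```rust
/// Local stand-in for the tag names described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tag {
    DylibPath,
    Rpath,
    RpathVariable,
    FrameworkPath,
    FilePath,
}

/// Tags for a path taken from a dylib load command.
pub fn tag_dylib(path: &str) -> Vec<Tag> {
    // FilePath always accompanies DylibPath.
    let mut tags = vec![Tag::DylibPath, Tag::FilePath];
    if path.contains(".framework") {
        tags.push(Tag::FrameworkPath);
    }
    tags
}

/// Tags for a path taken from an LC_RPATH command.
pub fn tag_rpath(path: &str) -> Vec<Tag> {
    const VARIABLES: [&str; 3] = ["@rpath", "@executable_path", "@loader_path"];
    let mut tags = vec![Tag::Rpath];
    if VARIABLES.iter().any(|v| path.contains(*v)) {
        tags.push(Tag::RpathVariable);
    }
    if path.contains(".framework") {
        tags.push(Tag::FrameworkPath);
    }
    tags
}
```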
Usage Example#
```rust
use stringy::extraction::extract_load_command_strings;
use stringy::types::Tag;

// Assumes an enclosing function that returns a Result, so `?` can be used.
let macho_data = std::fs::read("example.dylib")?;
let strings = extract_load_command_strings(&macho_data);

// Filter dylib paths
let dylib_paths: Vec<_> = strings.iter()
    .filter(|s| s.tags.contains(&Tag::DylibPath))
    .collect();

// Filter rpaths
let rpaths: Vec<_> = strings.iter()
    .filter(|s| s.tags.contains(&Tag::Rpath))
    .collect();

// Filter framework paths
let framework_paths: Vec<_> = strings.iter()
    .filter(|s| s.tags.contains(&Tag::FrameworkPath))
    .collect();
```
Implementation Notes#
- Handles both single-architecture and universal (fat) Mach-O binaries
- All extracted strings are UTF-8 and have StringSource::LoadCommand
- Tagging is robust and validated by comprehensive tests
See the integration tests for real-world examples and expected output.
PE Resource Extraction and Section Classification (PE Format)#
The PE (Portable Executable) parser now provides comprehensive support for resource extraction and section classification, enabling advanced string analysis and metadata extraction from Windows binaries.
Section Classification and Weighting#
PE sections are classified and weighted to prioritize string extraction:
| Section Type | Weight | Rationale |
|---|---|---|
| StringData (.rdata) | 10.0 | Primary string storage |
| Resources (.rsrc) | 9.0 | Version info, string tables |
| ReadOnlyData | 7.0 | May contain constants |
| WritableData (.data) | 5.0 | Runtime state, lower priority |
| Code (.text) | 1.0 | Unlikely to contain strings |
| Debug | 2.0 | Internal metadata |
| Other | 1.0 | Minimal priority |
Section classification uses both section names (e.g., .rdata, .rsrc, .data) and PE section flags to determine type and weight. Exception handling sections like .pdata and .xdata are classified as Debug.
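A name-first, flags-second classifier along these lines might look like the sketch below. The `IMAGE_SCN_*` values are the standard PE section characteristics; `PeSection`, `classify`, and `weight` are hypothetical names, and the order in which the flags are checked is an assumption rather than the parser's documented behavior.

```rust
pub const IMAGE_SCN_CNT_CODE: u32 = 0x0000_0020;
pub const IMAGE_SCN_MEM_EXECUTE: u32 = 0x2000_0000;
pub const IMAGE_SCN_MEM_READ: u32 = 0x4000_0000;
pub const IMAGE_SCN_MEM_WRITE: u32 = 0x8000_0000;

/// Hypothetical section categories mirroring the PE weight table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PeSection {
    StringData,
    Resources,
    ReadOnlyData,
    WritableData,
    Code,
    Debug,
    Other,
}

/// Classifies by well-known name first, then falls back to section flags.
pub fn classify(name: &str, characteristics: u32) -> PeSection {
    match name {
        ".rdata" => return PeSection::StringData,
        ".rsrc" => return PeSection::Resources,
        ".pdata" | ".xdata" => return PeSection::Debug,
        ".data" => return PeSection::WritableData,
        ".text" => return PeSection::Code,
        _ => {}
    }
    // Flag precedence below is an assumption of this sketch.
    if (characteristics & (IMAGE_SCN_CNT_CODE | IMAGE_SCN_MEM_EXECUTE)) != 0 {
        PeSection::Code
    } else if (characteristics & IMAGE_SCN_MEM_WRITE) != 0 {
        PeSection::WritableData
    } else if (characteristics & IMAGE_SCN_MEM_READ) != 0 {
        PeSection::ReadOnlyData
    } else {
        PeSection::Other
    }
}

/// Weight for each category, copied from the table above.
pub fn weight(section: PeSection) -> f32 {
    match section {
        PeSection::StringData => 10.0,
        PeSection::Resources => 9.0,
        PeSection::ReadOnlyData => 7.0,
        PeSection::WritableData => 5.0,
        PeSection::Debug => 2.0,
        PeSection::Code | PeSection::Other => 1.0,
    }
}
```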
Import/Export Extraction#
- Imports: Extracted from the PE import directory, including function name, DLL name, RVA, and ordinal (if present). Ordinal imports are named as ordinal_{value}.
- Exports: Extracted from the PE export directory, including function name (or synthesized name for unnamed exports), address, and ordinal (calculated from the export directory's base ordinal). Forwarded exports are detected and marked accordingly.
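The naming and ordinal rules above reduce to a few lines. In this sketch, `import_name`, `export_ordinal`, and `is_forwarded` are hypothetical helpers. The forwarder check relies on the PE convention that an export whose RVA falls inside the export directory itself points to a forwarder string rather than to code.

```rust
/// Returns the import's name, or `ordinal_{value}` for ordinal-only imports.
pub fn import_name(name: Option<&str>, ordinal: u16) -> String {
    match name {
        Some(n) => n.to_string(),
        None => format!("ordinal_{}", ordinal),
    }
}

/// Export ordinal = base ordinal + index into the export address table.
pub fn export_ordinal(base: u32, table_index: u32) -> u32 {
    base + table_index
}

/// An export is a forwarder when its RVA lies within the export directory.
pub fn is_forwarded(export_rva: u32, dir_rva: u32, dir_size: u32) -> bool {
    export_rva >= dir_rva && export_rva - dir_rva < dir_size
}
```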
Resource Extraction (Phase 2 Complete)#
PE resources are a rich source of strings. The parser now extracts:
- VERSIONINFO: All StringFileInfo key-value pairs (e.g., CompanyName, FileDescription, ProductName, FileVersion, etc.) using pelite's version_info() API. Supports multiple language variants. All strings are UTF-16LE and tagged as Version and Resource.
- STRINGTABLE: Parses RT_STRING resources (type 6), extracting all non-empty strings from all blocks and languages. Strings are grouped in blocks of 16, each entry is UTF-16LE. Tagged as Resource.
- MANIFEST: Extracts RT_MANIFEST resources (type 24), automatically detecting encoding (UTF-8, UTF-16LE, UTF-16BE) and returning the full XML manifest content. Tagged as Manifest and Resource.
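To make the STRINGTABLE layout concrete: each RT_STRING block holds 16 slots, and each slot is a little-endian u16 length (in UTF-16 code units) followed by that many code units, with a zero length marking an empty slot. The sketch below decodes one block under that layout; `parse_string_block` is a hypothetical helper, not the parser's actual function, and it stops quietly on truncated input instead of panicking.

```rust
/// Decodes one RT_STRING block into (slot index, string) pairs,
/// skipping empty slots. Truncated input ends parsing early.
pub fn parse_string_block(data: &[u8]) -> Vec<(usize, String)> {
    let mut out = Vec::new();
    let mut pos = 0usize;
    for slot in 0..16 {
        // Length prefix: number of UTF-16 code units in this slot.
        let len = match data.get(pos..pos + 2) {
            Some(b) => u16::from_le_bytes([b[0], b[1]]) as usize,
            None => break,
        };
        pos += 2;
        if len == 0 {
            continue; // empty slot
        }
        let raw = match data.get(pos..pos + len * 2) {
            Some(r) => r,
            None => break, // declared length runs past the buffer
        };
        let units: Vec<u16> = raw
            .chunks_exact(2)
            .map(|c| u16::from_le_bytes([c[0], c[1]]))
            .collect();
        out.push((slot, String::from_utf16_lossy(&units)));
        pos += len * 2;
    }
    out
}
```

The slot index matters because a string's resource ID is derived from its block and its position within that block.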
Usage Example#
```rust
use stringy::extraction::extract_resource_strings;
use stringy::types::Tag;

// Assumes an enclosing function that returns a Result, so `?` can be used.
let pe_data = std::fs::read("example.exe")?;
let strings = extract_resource_strings(&pe_data);

// Filter version info strings
let version_strings: Vec<_> = strings.iter()
    .filter(|s| s.tags.contains(&Tag::Version))
    .collect();

// Filter string table entries: tagged Resource, but neither Version nor
// Manifest (manifests also carry the Resource tag)
let ui_strings: Vec<_> = strings.iter()
    .filter(|s| {
        s.tags.contains(&Tag::Resource)
            && !s.tags.contains(&Tag::Version)
            && !s.tags.contains(&Tag::Manifest)
    })
    .collect();
```
Limitations and Future Enhancements#
- Dialog Resources: RT_DIALOG parsing not yet implemented
- Menu Resources: RT_MENU parsing not yet implemented
- Icon Strings: RT_ICON metadata extraction not yet implemented
Planned enhancements include dialog/menu resource parsing, icon/cursor metadata, and accelerator table string extraction.
Testing and Robustness#
The implementation includes comprehensive unit and integration tests covering malformed PE data, missing or empty resource directories, multiple language variants, and edge cases. All error paths degrade gracefully, returning empty results rather than panicking.
Summary#
These enhancements provide robust PE resource extraction and section-aware string analysis, enabling accurate extraction of version info, UI strings, and manifest data from Windows binaries. This improves both the quality and coverage of string analysis for PE files.
Future Plans for Library Attribution and Format Support#
Planned enhancements include extending support to additional formats such as WebAssembly (WASM), Java Class Files, and Android APK/DEX. Resource extraction will be enhanced across all formats, and architecture-specific features will be added for ARM64, x86-64, and RISC-V. Improvements to library attribution will focus on more robust symbol-to-library mapping, especially for binaries lacking version information or with complex linking scenarios.
See roadmap and future enhancements.