# Architecture Overview
Stringy is built as a modular Rust library with a clear separation of concerns. The architecture follows a pipeline approach where binary data flows through several processing stages.
## High-Level Architecture
```text
Binary File → Format Detection → Container Parsing → String Extraction → Classification → Ranking → Output
```
## Core Components

### 1. Container Module (`src/container/`) ✅ Implemented

Handles binary format detection and parsing using the `goblin` crate, with comprehensive section analysis.
- Format Detection: Automatically identifies ELF, PE, and Mach-O formats via `goblin::Object::parse()` (see the sketch below)
- Section Classification: Categorizes sections by string likelihood with weighted scoring
- Metadata Extraction: Collects imports, exports, and detailed structural information
- Cross-Platform Support: Handles platform-specific section characteristics and naming conventions
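As a rough illustration of the detection step, here is a minimal sketch built directly on `goblin`; the `BinaryFormat` enum here is assumed to mirror the one this document describes, not a verbatim copy of the shipped code:

```rust
use goblin::Object;

/// Assumed stand-in for the crate's own BinaryFormat enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BinaryFormat {
    Elf,
    Pe,
    MachO,
    Unknown,
}

/// Minimal sketch: let goblin parse the header and map the
/// resulting Object variant onto our format enum.
pub fn detect_format(data: &[u8]) -> BinaryFormat {
    match Object::parse(data) {
        Ok(Object::Elf(_)) => BinaryFormat::Elf,
        Ok(Object::PE(_)) => BinaryFormat::Pe,
        Ok(Object::Mach(_)) => BinaryFormat::MachO,
        _ => BinaryFormat::Unknown,
    }
}
```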
### Supported Formats
| Format | Parser | Key Sections (Weight) | Import/Export Support |
|---|---|---|---|
| ELF | `ElfParser` | `.rodata` (10.0), `.comment` (9.0), `.data.rel.ro` (7.0) | ✅ Dynamic & Static |
| PE | `PeParser` | `.rdata` (10.0), `.rsrc` (9.0), read-only `.data` (7.0) | ✅ Import/Export Tables |
| Mach-O | `MachoParser` | `__TEXT,__cstring` (10.0), `__TEXT,__const` (9.0) | ✅ Symbol Tables |
### Section Weight System
The parsers implement intelligent section prioritization:
```rust
// Example: ELF section weights (illustrative; glob-style names use prefix guards)
fn section_weight(name: &str) -> f64 {
    match name {
        n if n == ".rodata" || n.starts_with(".rodata.str") => 10.0, // highest priority
        n if n == ".comment" || n.starts_with(".note") => 9.0, // build info, very likely strings
        ".data.rel.ro" => 7.0, // read-only data
        ".data" => 5.0,        // writable data
        ".text" => 1.0,        // code sections (low priority)
        _ => 3.0,              // assumed default for other sections
    }
}
```
### 2. Extraction Module (`src/extraction/`) 🚧 Framework Ready
Implements encoding-aware string extraction algorithms with configurable parameters.
- ASCII/UTF-8: Scans for printable character sequences with noise filtering (see the sketch after this list)
- UTF-16: Detects little-endian and big-endian wide strings with confidence scoring
- Deduplication: Canonicalizes strings while preserving complete metadata
- Section-Aware: Uses container parser weights to prioritize extraction areas
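To make the ASCII path concrete, here is a minimal sketch of a printable-run scanner; the `min_len` parameter stands in for the configurable parameters mentioned above, and the real extractor additionally applies noise filtering and section weighting:

```rust
/// Sketch: scan a byte slice for runs of printable ASCII of at least
/// `min_len` bytes, returning (offset, string) pairs.
fn extract_ascii(data: &[u8], min_len: usize) -> Vec<(usize, String)> {
    let mut found = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &b) in data.iter().enumerate() {
        let printable = (0x20..0x7f).contains(&b) || b == b'\t';
        match (printable, start) {
            (true, None) => start = Some(i), // run begins
            (false, Some(s)) => {
                // Run ends: keep it if it is long enough
                if i - s >= min_len {
                    found.push((s, String::from_utf8_lossy(&data[s..i]).into_owned()));
                }
                start = None;
            }
            _ => {}
        }
    }
    // Flush a run that reaches the end of the data
    if let Some(s) = start {
        if data.len() - s >= min_len {
            found.push((s, String::from_utf8_lossy(&data[s..]).into_owned()));
        }
    }
    found
}
```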
### 3. Classification Module (`src/classification/`) 🚧 Types Defined
Applies semantic analysis to extracted strings with comprehensive tagging system.
- Pattern Matching: Uses regex to identify URLs, IPs, paths, GUIDs, etc. (see the sketch after the tag table below)
- Symbol Processing: Demangles Rust symbols and processes imports/exports
- Context Analysis: Considers section context and source type for classification
- Extensible Tags: Supports 15+ semantic categories from network indicators to code artifacts
### Supported Classification Tags
| Category | Tags | Examples |
|---|---|---|
| Network | `url`, `domain`, `ipv4`, `ipv6` | `https://api.com`, `example.com`, `192.168.1.1` |
| Filesystem | `filepath`, `regpath` | `/usr/bin/app`, `HKEY_LOCAL_MACHINE\...` |
| Identifiers | `guid`, `email`, `user-agent` | `{12345678-...}`, `user@domain.com` |
| Code | `fmt`, `b64`, `import`, `export` | `Error: %s`, `SGVsbG8=`, `CreateFileW` |
| Resources | `version`, `manifest`, `resource` | `v1.2.3`, XML config, UI strings |
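As an illustration of the pattern-matching approach for a single tag, here is a minimal sketch; the function name and pattern are assumptions for this example, and the real classifier covers the full tag table and caches its compiled regexes:

```rust
use regex::Regex;

/// Sketch: single-tag check, does this string look like a URL?
fn looks_like_url(s: &str) -> bool {
    // Compiled inline for brevity; see the Performance section for
    // the compile-once caching pattern.
    let re = Regex::new(r"^https?://[^\s]+$").expect("valid pattern");
    re.is_match(s)
}

fn main() {
    assert!(looks_like_url("https://api.com/v1"));
    assert!(!looks_like_url("Error: %s"));
}
```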
### 4. Ranking Module (`src/classification/ranking.rs`) 🚧 Algorithm Designed
Implements the scoring algorithm to prioritize relevant strings using multiple factors.
```text
Score = SectionWeight + EncodingConfidence + SemanticBoost - NoisePenalty
```

Scoring Components (a code sketch of this formula follows the list):
- Section Weight: 1.0-10.0 based on section classification
- Encoding Confidence: Higher for clean UTF-8/ASCII vs. noisy UTF-16
- Semantic Boost: +20 to +50 points for URLs, GUIDs, imports/exports
- Noise Penalty: -10 to -30 for high entropy, excessive length, repeated patterns
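A minimal sketch of how these components might combine, with placeholder field names and ranges rather than the shipped tuning:

```rust
/// Placeholder inputs mirroring the four scoring components above.
struct ScoreInputs {
    section_weight: f64,      // 1.0..=10.0 from section classification
    encoding_confidence: f64, // higher for clean UTF-8/ASCII
    semantic_boost: f64,      // e.g. +20.0..=50.0 for URLs, GUIDs, imports
    noise_penalty: f64,       // e.g. 10.0..=30.0 for high entropy
}

/// Score = SectionWeight + EncodingConfidence + SemanticBoost - NoisePenalty
fn score(s: &ScoreInputs) -> i32 {
    (s.section_weight + s.encoding_confidence + s.semantic_boost - s.noise_penalty)
        .round() as i32
}
```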
### 5. Output Module (`src/output/`) 🚧 Interfaces Defined
Formats results for different use cases with consistent data structures.
- Human-readable: Sorted tables with score, offset, section, tags, and truncated strings
- JSONL: Complete structured data including all metadata fields (see the sketch after this list)
- YARA: Properly escaped strings with hex alternatives and confidence grouping
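For the JSONL path, a minimal sketch using `serde_json`, assuming the record type derives `Serialize` as `FoundString` does later in this document:

```rust
use serde::Serialize;

/// Sketch: one JSON object per line, the JSONL convention.
fn format_jsonl<T: Serialize>(records: &[T]) -> serde_json::Result<String> {
    let mut out = String::new();
    for record in records {
        out.push_str(&serde_json::to_string(record)?);
        out.push('\n');
    }
    Ok(out)
}
```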
## Data Flow

### 1. Binary Analysis Phase ✅ Implemented
```rust
// Format detection using goblin
let format = detect_format(&data);   // Returns BinaryFormat enum
let parser = create_parser(format)?; // Creates the appropriate parser

// Container parsing with full metadata extraction
let container_info = parser.parse(&data)?;
// Returns: sections with weights, imports, exports, format info
```
Current Implementation:

- Automatic format detection via `goblin::Object::parse()`
- Trait-based parser creation with `Box<dyn ContainerParser>`
- Comprehensive section analysis with classification and weighting
- Complete import/export symbol extraction
### 2. String Extraction Phase 🚧 Framework Ready
```rust
// Extract strings from prioritized sections (by weight)
let mut all_strings = Vec::new();
for section in container_info.sections.iter().filter(|s| s.weight > 5.0) {
    let strings = extract_strings(&data, section, &config)?;
    all_strings.extend(strings);
}

// Include import/export names as high-value strings
all_strings.extend(extract_symbol_strings(&container_info));

// Deduplicate while preserving all metadata
let mut unique_strings = deduplicate(all_strings);
```
### 3. Classification Phase 🚧 Types Ready
```rust
// Apply semantic classification with context awareness
for string in &mut unique_strings {
    let context = StringContext {
        section_type: string.section_type,
        source: string.source,
        encoding: string.encoding,
    };
    string.tags = classify_string(&string.text, &context);
    string.score = calculate_score(string, &context);
}
```
### 4. Output Phase 🚧 Interfaces Defined
```rust
// Sort by relevance score (descending)
unique_strings.sort_by_key(|s| std::cmp::Reverse(s.score));

// Apply user filters and limits
let filtered = apply_filters(&unique_strings, &config);

// Format according to requested output type
let output = match config.format {
    OutputFormat::Human => format_human_readable(&filtered),
    OutputFormat::Json => format_jsonl(&filtered),
    OutputFormat::Yara => format_yara_rules(&filtered),
};
```
## Current Implementation Details

### Container Parser Architecture
The container parsing system is fully implemented with a trait-based design:
```rust
pub trait ContainerParser {
    fn detect(data: &[u8]) -> bool
    where
        Self: Sized;

    fn parse(&self, data: &[u8]) -> Result<ContainerInfo>;
}
```
Format Detection Pipeline:

- `detect_format()` uses `goblin::Object::parse()` to identify the format
- `create_parser()` returns the appropriate `Box<dyn ContainerParser>` (sketched below)
- The parser extracts sections, imports, and exports with full metadata
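A sketch of what the trait-object construction might look like; the parser types are the ones named in the format table above, and the error variant is an assumption, since the exact `StringyError` variants are not spelled out here:

```rust
/// Sketch of parser construction behind Box<dyn ContainerParser>.
fn create_parser(format: BinaryFormat) -> Result<Box<dyn ContainerParser>, StringyError> {
    match format {
        BinaryFormat::Elf => Ok(Box::new(ElfParser)),
        BinaryFormat::Pe => Ok(Box::new(PeParser)),
        BinaryFormat::MachO => Ok(Box::new(MachoParser)),
        // Assumed error variant, for illustration only.
        BinaryFormat::Unknown => Err(StringyError::UnsupportedFormat),
    }
}
```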
### Section Classification System
Each parser implements intelligent section classification:
```rust
use goblin::elf::section_header::{SectionHeader, SHF_EXECINSTR};

// ELF example: executable sections are code; otherwise classify by name
// (glob-style names are matched with prefix guards)
fn classify_section(section: &SectionHeader, name: &str) -> SectionType {
    if section.sh_flags & u64::from(SHF_EXECINSTR) != 0 {
        return SectionType::Code;
    }
    match name {
        n if n == ".rodata" || n.starts_with(".rodata.str") => SectionType::StringData,
        n if n == ".comment" || n.starts_with(".note") => SectionType::StringData,
        ".data.rel.ro" => SectionType::ReadOnlyData,
        // ... more classifications
        _ => SectionType::Data, // assumed fallback for illustration
    }
}
```
Weight Calculation:
- String data sections: 8.0-10.0 (highest priority)
- Read-only data: 7.0
- Resources: 8.0-9.0
- Writable data: 5.0
- Code: 1.0 (lowest priority)
### Symbol Extraction
All parsers extract import/export information:
- ELF: Dynamic symbol table (`dynsyms`) and static symbols (`syms`), as sketched below
- PE: Import/export tables with library names and ordinals
- Mach-O: Symbol tables with undefined/defined symbol filtering
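For the ELF case, a minimal sketch of reading dynamic-symbol names with `goblin`; error handling and the static-symbol pass are omitted, and the function name is illustrative:

```rust
use goblin::elf::Elf;

/// Sketch: collect non-empty dynamic symbol names from a parsed ELF.
fn dynamic_symbol_names(elf: &Elf) -> Vec<String> {
    elf.dynsyms
        .iter()
        .filter_map(|sym| elf.dynstrtab.get_at(sym.st_name))
        .filter(|name| !name.is_empty())
        .map(str::to_owned)
        .collect()
}
```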
### Data Structures
Core types are fully defined and serializable:
```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FoundString {
    pub text: String,
    pub encoding: Encoding,
    pub offset: u64,
    pub rva: Option<u64>,
    pub section: Option<String>,
    pub length: u32,
    pub tags: Vec<Tag>,
    pub score: i32,
    pub source: StringSource,
}
```
- Tag System: 15+ semantic categories ready for classification
- Error Handling: Comprehensive `StringyError` enum with context
- Cross-Platform: Handles platform-specific binary characteristics
## Key Design Decisions

### Memory Efficiency
- Uses memory mapping (`memmap2`) for large files, as sketched below
- Lazy evaluation for optional features
- Efficient regex compilation and caching
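A minimal sketch of the memory-mapped read path with `memmap2`; the function name is illustrative:

```rust
use memmap2::Mmap;
use std::fs::File;
use std::io;
use std::path::Path;

/// Sketch: map a file read-only so large binaries are not copied into RAM.
fn map_file(path: &Path) -> io::Result<Mmap> {
    let file = File::open(path)?;
    // SAFETY: read-only mapping; we assume the file is not truncated
    // or modified while mapped.
    unsafe { Mmap::map(&file) }
}
```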
### Error Handling
- Comprehensive error types with context
- Graceful degradation for partially corrupted binaries
- Clear error messages for debugging
### Extensibility
- Trait-based architecture for easy format addition
- Pluggable classification systems
- Configurable output formats
### Performance
- Section-aware extraction reduces scan time
- Regex caching for repeated pattern matching (see the sketch after this list)
- Parallel processing where beneficial
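One way the compile-once regex caching could look, using `std::sync::OnceLock`; this is an assumed pattern, not necessarily the crate's exact mechanism:

```rust
use regex::Regex;
use std::sync::OnceLock;

/// Sketch: compile the pattern once, then reuse it on every call.
fn url_regex() -> &'static Regex {
    static RE: OnceLock<Regex> = OnceLock::new();
    RE.get_or_init(|| Regex::new(r"https?://\S+").expect("valid pattern"))
}
```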
## Module Dependencies
```text
main.rs
├── lib.rs (public API)
├── types.rs (core data structures)
├── container/
│   ├── mod.rs (format detection)
│   ├── elf.rs (ELF parser)
│   ├── pe.rs (PE parser)
│   └── macho.rs (Mach-O parser)
├── extraction/
│   ├── mod.rs (extraction traits)
│   ├── ascii.rs (ASCII/UTF-8)
│   ├── utf16.rs (UTF-16LE/BE)
│   └── dedup.rs (deduplication)
├── classification/
│   ├── mod.rs (classification framework)
│   ├── semantic.rs (pattern matching)
│   ├── symbols.rs (symbol processing)
│   └── ranking.rs (scoring algorithm)
└── output/
    ├── mod.rs (output traits)
    ├── json.rs (JSONL format)
    ├── human.rs (table format)
    └── yara.rs (YARA format)
```
## External Dependencies

### Core Dependencies

- `goblin`: Multi-format binary parsing
- `serde` + `serde_json`: Serialization
- `thiserror`: Error handling
- `clap`: CLI argument parsing
### Optional Dependencies

- `regex`: Pattern matching for classification
- `rustc-demangle`: Rust symbol demangling
- `memmap2`: Memory-mapped file I/O
- `pelite`: Enhanced PE resource extraction
## Testing Strategy

### Unit Tests
- Each module has comprehensive unit tests
- Mock data for parser testing (example below)
- Edge case coverage for string extraction
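A sketch of the mock-data style of unit test, reusing the assumed `detect_format` from the sketch earlier in this document:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    /// Data that is not a valid binary should yield Unknown.
    #[test]
    fn rejects_non_binary_data() {
        let not_a_binary = b"plain text, not an executable";
        assert_eq!(detect_format(not_a_binary), BinaryFormat::Unknown);
    }
}
```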
### Integration Tests
- End-to-end CLI functionality
- Real binary file testing
- Cross-platform validation
### Performance Tests
- Benchmarks for critical path components
- Memory usage profiling
- Large file handling validation
This architecture provides a solid foundation for reliable, efficient, and extensible binary string analysis.