String Extraction#
Stringy's string extraction engine is designed to find meaningful strings while avoiding noise and false positives. The extraction process is encoding-aware, section-aware, and configurable.
Extraction Pipeline#
Binary Data → Section Analysis → Encoding Detection → String Scanning → Deduplication → Classification
Encoding Support#
ASCII Extraction ✅#
ASCII is the most common encoding found in binaries. ASCII extraction provides the foundational string pass, with configurable minimum-length thresholds.
UTF-16LE Extraction ✅#
UTF-16LE extraction is implemented and is particularly relevant for extracting strings from Windows PE binaries. It includes confidence scoring and integrates with noise filtering.
Algorithm#
- Scan for printable sequences: Characters in range 0x20-0x7E (strict printable ASCII)
- Length filtering: Configurable minimum length (default: 4 characters)
- Null termination: Respect null terminators but don't require them
- Section awareness: Integrate with section metadata for context-aware filtering
Basic Extraction#
use stringy::extraction::ascii::{extract_ascii_strings, AsciiExtractionConfig};
let data = b"Hello\0World\0Test123";
let config = AsciiExtractionConfig::default();
let strings = extract_ascii_strings(data, &config);
for string in strings {
    println!("Found: {} at offset {}", string.text, string.offset);
}
Configuration#
use stringy::extraction::ascii::AsciiExtractionConfig;
// Default configuration (min_length: 4, no max_length)
let config = AsciiExtractionConfig::default();
// Custom minimum length
let config = AsciiExtractionConfig::new(8);
// Custom minimum and maximum length
let mut config = AsciiExtractionConfig::default();
config.max_length = Some(256);
UTF-8 Extraction#
UTF-8 extraction builds on ASCII extraction and handles multi-byte characters. See the main extraction module for UTF-8 support.
Implementation Details#
fn is_printable_ascii(byte: u8) -> bool {
    (0x20..=0x7E).contains(&byte)
}

fn extract_ascii_strings(data: &[u8], min_len: usize) -> Vec<RawString> {
    let mut strings = Vec::new();
    let mut current_string = Vec::new();
    let mut start_offset = 0;
    for (i, &byte) in data.iter().enumerate() {
        if is_printable_ascii(byte) {
            if current_string.is_empty() {
                start_offset = i;
            }
            current_string.push(byte);
        } else {
            if current_string.len() >= min_len {
                strings.push(RawString {
                    data: current_string.clone(),
                    offset: start_offset,
                    encoding: Encoding::Ascii,
                });
            }
            current_string.clear();
        }
    }
    // Flush a trailing string that runs to the end of the buffer.
    if current_string.len() >= min_len {
        strings.push(RawString {
            data: current_string,
            offset: start_offset,
            encoding: Encoding::Ascii,
        });
    }
    strings
}
Noise Filtering#
Stringy implements a multi-layered heuristic filtering system to reduce false positives and identify noise in extracted strings. The filtering system uses a combination of entropy analysis, character distribution, linguistic patterns, length checks, repetition detection, and context-aware filtering.
Filter Architecture#
The noise filtering system consists of multiple independent filters that can be combined with configurable weights:
- Character Distribution Filter: Detects abnormal character frequency distributions
- Entropy Filter: Uses Shannon entropy to detect padding/repetition and random binary
- Linguistic Pattern Filter: Analyzes vowel-to-consonant ratios and common bigrams
- Length Filter: Penalizes excessively long strings and very short strings in low-weight sections
- Repetition Filter: Detects repeated character patterns and repeated substrings
- Context-Aware Filter: Boosts confidence for strings in high-weight sections
Character Distribution Analysis#
Detects strings with abnormal character distributions:
- Excessive punctuation (>80%): Low confidence (0.2)
- Excessive repetition (>90% same character): Very low confidence (0.1)
- Excessive non-alphanumeric (>70%): Low confidence (0.3)
- Reasonable distribution: High confidence (1.0)
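The thresholds above can be sketched as a standalone check. This is a minimal sketch: the function name and the ordering of the checks are assumptions for illustration, not Stringy's actual API, but the ratios mirror the list above.

```rust
use std::collections::HashMap;

// Illustrative character-distribution check; the function name and check
// ordering are assumptions, but the thresholds mirror the documented list.
fn char_distribution_confidence(s: &str) -> f32 {
    let total = s.chars().count();
    if total == 0 {
        return 0.0;
    }
    let mut freq: HashMap<char, usize> = HashMap::new();
    for c in s.chars() {
        *freq.entry(c).or_insert(0) += 1;
    }
    let max_same = freq.values().copied().max().unwrap_or(0) as f32;
    let punct = s.chars().filter(|c| c.is_ascii_punctuation()).count() as f32;
    let non_alnum = s.chars().filter(|c| !c.is_alphanumeric()).count() as f32;
    let t = total as f32;
    if max_same / t > 0.9 {
        0.1 // excessive repetition: >90% the same character
    } else if punct / t > 0.8 {
        0.2 // excessive punctuation
    } else if non_alnum / t > 0.7 {
        0.3 // excessive non-alphanumeric
    } else {
        1.0 // reasonable distribution
    }
}

fn main() {
    println!("{}", char_distribution_confidence("Hello, World!")); // 1.0
    println!("{}", char_distribution_confidence("aaaaaaaaaaaa")); // 0.1
}
```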
Entropy-Based Filtering#
Uses Shannon entropy (bits per byte) to classify strings:
- Very low entropy (<1.5 bits/byte): Likely padding or repetition (confidence: 0.1)
- Very high entropy (>7.5 bits/byte): Likely random binary (confidence: 0.2)
- Optimal range (3.5-6.0 bits/byte): High confidence (1.0)
- Acceptable range (2.0-7.0 bits/byte): Moderate confidence (0.4-0.7)
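Shannon entropy in bits per byte is straightforward to compute from byte frequencies; this sketch shows the calculation the thresholds above refer to (the function name is illustrative, not necessarily the crate's internal name).

```rust
// Shannon entropy in bits per byte: sum of -p * log2(p) over the byte
// frequency distribution. The function name is an illustrative assumption.
fn shannon_entropy(data: &[u8]) -> f32 {
    if data.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &b in data {
        counts[b as usize] += 1;
    }
    let len = data.len() as f32;
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f32 / len;
            -p * p.log2()
        })
        .sum()
}

fn main() {
    // Pure padding scores 0.0 bits/byte; two symbols at 50/50 score 1.0.
    println!("{}", shannon_entropy(b"AAAAAAAA"));
    println!("{}", shannon_entropy(b"abababab"));
}
```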
Linguistic Pattern Detection#
Analyzes text for word-like patterns:
- Vowel-to-consonant ratio: Reasonable range 0.2-0.8 for English
- Common bigrams: Detects common English patterns (th, he, in, er, an, re, on, at, en, nd)
- Handles non-English: Gracefully handles non-English strings without over-penalizing
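A sketch of the linguistic heuristics above, combining the vowel-to-consonant ratio with common-bigram hits. The exact thresholds, scores, and the fallback for letter-free strings are assumptions for illustration.

```rust
// Illustrative linguistic-pattern check; thresholds and scores are
// assumptions, the vowel ratio and bigram list come from the docs above.
fn linguistic_confidence(s: &str) -> f32 {
    let lower = s.to_lowercase();
    let vowels = lower.chars().filter(|c| "aeiou".contains(*c)).count();
    let consonants = lower
        .chars()
        .filter(|c| c.is_ascii_alphabetic() && !"aeiou".contains(*c))
        .count();
    if consonants == 0 {
        return 0.5; // no consonants: avoid over-penalizing
    }
    let ratio = vowels as f32 / consonants as f32;
    let bigrams = ["th", "he", "in", "er", "an", "re", "on", "at", "en", "nd"];
    let bigram_hits = bigrams.iter().filter(|b| lower.contains(*b)).count();
    if (0.2..=0.8).contains(&ratio) || bigram_hits >= 2 {
        1.0
    } else {
        0.4
    }
}

fn main() {
    println!("{}", linguistic_confidence("Hello")); // word-like
    println!("{}", linguistic_confidence("xzqwv")); // consonant cluster
}
```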
Length-Based Filtering#
Applies penalties based on string length:
- Excessively long (>200 characters): Low confidence (0.3) - likely table data
- Very short in low-weight sections (<4 chars, weight <0.5): Moderate confidence (0.5)
- Normal length (4-100 characters): High confidence (1.0)
Repetition Detection#
Identifies repetitive patterns:
- Repeated characters (e.g., "AAAA", "0000"): Very low confidence (0.1)
- Repeated substrings (e.g., "abcabcabc"): Low confidence (0.2)
- Normal strings: High confidence (1.0)
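The two repetition cases above can be detected cheaply: a single repeated byte, and a short substring tiled across the whole string. This is a minimal sketch; the name and detection strategy are illustrative.

```rust
// Minimal repetition heuristic: single repeated character, then any period
// that divides the length and tiles the string. Names are illustrative.
fn repetition_confidence(s: &str) -> f32 {
    let bytes = s.as_bytes();
    let n = bytes.len();
    if n < 2 {
        return 1.0;
    }
    // "AAAA", "0000": every byte equals the first.
    if bytes.iter().all(|&b| b == bytes[0]) {
        return 0.1;
    }
    // "abcabcabc": some period that divides the length tiles the string.
    for period in 1..=n / 2 {
        if n % period == 0 && bytes.chunks(period).all(|c| c == &bytes[..period]) {
            return 0.2;
        }
    }
    1.0
}

fn main() {
    println!("{}", repetition_confidence("AAAA"));      // repeated character
    println!("{}", repetition_confidence("abcabcabc")); // repeated substring
    println!("{}", repetition_confidence("Hello"));     // normal string
}
```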
Context-Aware Filtering#
Boosts or reduces confidence based on section context:
- String data sections (.rodata, .rdata, __cstring): High confidence (0.9-1.0)
- Read-only data sections: High confidence (0.9)
- Resource sections: Maximum confidence (1.0) - known-good sources
- Code sections: Lower confidence (0.3-0.5)
- Writable data sections: Moderate confidence (0.6)
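The context ranges above amount to a lookup from section kind to a base confidence. The enum and the exact values in this sketch are assumptions chosen to sit inside the documented ranges.

```rust
// Hypothetical mapping from section context to base confidence, mirroring
// the documented ranges; the enum and exact values are assumptions.
enum SectionContext {
    StringData, // .rodata, .rdata, __cstring
    ReadOnlyData,
    Resource,
    Code,
    WritableData,
}

fn context_confidence(ctx: &SectionContext) -> f32 {
    match ctx {
        SectionContext::StringData => 0.95,
        SectionContext::ReadOnlyData => 0.9,
        SectionContext::Resource => 1.0, // known-good sources
        SectionContext::Code => 0.4,
        SectionContext::WritableData => 0.6,
    }
}

fn main() {
    println!("{}", context_confidence(&SectionContext::Resource));
    println!("{}", context_confidence(&SectionContext::Code));
}
```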
Configuration#
use stringy::extraction::config::{NoiseFilterConfig, FilterWeights};
// Default configuration
let config = NoiseFilterConfig::default();
// Customize thresholds
let mut config = NoiseFilterConfig::default();
config.entropy_min = 2.0;
config.entropy_max = 7.0;
config.max_length = 150;
// Customize filter weights
config.filter_weights = FilterWeights {
    entropy_weight: 0.3,
    char_distribution_weight: 0.25,
    linguistic_weight: 0.2,
    length_weight: 0.15,
    repetition_weight: 0.05,
    context_weight: 0.05,
};
Using Noise Filters#
use stringy::extraction::config::NoiseFilterConfig;
use stringy::extraction::filters::{CompositeNoiseFilter, FilterContext};
use stringy::types::SectionType;
let filter_config = NoiseFilterConfig::default();
let filter = CompositeNoiseFilter::new(&filter_config);
let context = FilterContext::default();
let confidence = filter.calculate_confidence("Hello, World!", &context);
if confidence >= 0.5 {
    // String passed filtering threshold
}
Confidence Scoring#
Each string is assigned a confidence score (0.0-1.0) indicating how likely it is to be legitimate:
- 1.0: Maximum confidence (strings from known-good sources like imports, exports, resources)
- 0.7-0.9: High confidence (likely legitimate strings)
- 0.5-0.7: Moderate confidence (may need review)
- 0.0-0.5: Low confidence (likely noise, filtered out by default)
The confidence score is separate from the score field used for final ranking. Confidence specifically represents the noise filtering assessment.
Performance#
Noise filtering is designed to add minimal overhead (<10% per acceptance criteria). Individual filters are optimized for performance, and the composite filter allows enabling/disabling specific filters to balance accuracy and speed.
UTF-16 Extraction ✅#
Critical for Windows binaries and some resources. Supports both UTF-16LE (Little-Endian) and UTF-16BE (Big-Endian) with automatic byte order detection.
UTF-16LE (Little-Endian) ✅#
The most common byte order on Windows platforms. Default minimum length: 3 characters.
Detection heuristics:
- Even-length sequences (2-byte alignment required)
- Low byte printable, high byte mostly zero
- Null termination patterns (0x00 0x00)
- Advanced confidence scoring with multiple heuristics
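The first two heuristics above can be sketched as a scan over 2-byte units, collecting runs where the low byte is printable ASCII and the high byte is zero. A real extractor also applies confidence scoring; the helper name here is illustrative, not the crate's API.

```rust
// Sketch of the UTF-16LE scan heuristic: walk 2-byte units, collect runs of
// (printable-ASCII low byte, zero high byte). Illustrative only.
fn scan_utf16le_ascii(data: &[u8], min_chars: usize) -> Vec<String> {
    let mut found = Vec::new();
    let mut current = String::new();
    for unit in data.chunks_exact(2) {
        let (lo, hi) = (unit[0], unit[1]);
        if hi == 0x00 && (0x20..=0x7E).contains(&lo) {
            current.push(lo as char);
        } else {
            if current.chars().count() >= min_chars {
                found.push(current.clone());
            }
            current.clear();
        }
    }
    // Flush a run that reaches the end of the buffer.
    if current.chars().count() >= min_chars {
        found.push(current);
    }
    found
}

fn main() {
    // "Hi!" encoded as UTF-16LE, then a null terminator and binary noise.
    let data = b"H\x00i\x00!\x00\x00\x00\x01\x02";
    println!("{:?}", scan_utf16le_ascii(data, 3)); // ["Hi!"]
}
```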
UTF-16BE (Big-Endian) ✅#
Found in Java .class files, network protocols, some cross-platform binaries.
Detection heuristics:
- Even-length sequences
- High byte printable, low byte mostly zero
- Reverse byte order from UTF-16LE
- Same advanced confidence scoring as UTF-16LE
Automatic Byte Order Detection ✅#
The ByteOrder::Auto mode automatically detects and extracts both UTF-16LE and UTF-16BE strings from the same data, avoiding duplicates and correctly identifying the encoding of each string.
Implementation#
UTF-16 extraction is implemented in src/extraction/utf16.rs following the pattern established in the ASCII extractor. The implementation provides:
- extract_utf16_strings(): Main extraction function supporting both byte orders
- extract_utf16le_strings(): UTF-16LE-specific extraction (backward compatibility)
- extract_from_section(): Section-aware extraction with proper metadata population
- Utf16ExtractionConfig: Configuration for minimum/maximum character count, byte order selection, and confidence thresholds
- ByteOrder enum: Control which byte order(s) to scan (LE, BE, Auto)
Usage Example:
use stringy::extraction::utf16::{extract_utf16_strings, Utf16ExtractionConfig, ByteOrder};
// Extract UTF-16LE strings from Windows PE binary
let config = Utf16ExtractionConfig {
    byte_order: ByteOrder::LE,
    min_length: 3,
    confidence_threshold: 0.6,
    ..Default::default()
};
let strings = extract_utf16_strings(data, &config);
// Extract both UTF-16LE and UTF-16BE with auto-detection
let config = Utf16ExtractionConfig {
    byte_order: ByteOrder::Auto,
    ..Default::default()
};
let strings = extract_utf16_strings(data, &config);
Configuration:
use stringy::extraction::utf16::{Utf16ExtractionConfig, ByteOrder};
// Default configuration (min_length: 3, byte_order: Auto, confidence_threshold: 0.5)
let config = Utf16ExtractionConfig::default();
// Custom minimum character length
let config = Utf16ExtractionConfig::new(5);
// Custom configuration
let mut config = Utf16ExtractionConfig::default();
config.min_length = 3;
config.max_length = Some(256);
config.byte_order = ByteOrder::LE;
config.confidence_threshold = 0.6;
UTF-16-Specific Confidence Scoring#
UTF-16 extraction uses advanced confidence scoring to detect false positives from null-interleaved binary data. The confidence score combines multiple heuristics:
- Valid Unicode range check: Validates code points are in valid Unicode ranges (U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF); penalizes private use areas and invalid surrogates
- Printable character ratio: Calculates the ratio of printable characters, including common Unicode ranges
- ASCII ratio: Boosts confidence for ASCII-heavy strings (>50% of characters in the ASCII printable range)
- Null pattern detection: Flags suspicious patterns such as:
  - Excessive nulls (>30% of characters)
  - Regular null intervals (every 2nd, 4th, 8th position)
  - Fixed-offset nulls indicating structured binary data
- Byte order consistency: Verifies byte order is consistent throughout the string (for Auto mode)
Confidence Formula:
confidence = (valid_unicode_weight × valid_ratio)
           + (printable_weight × printable_ratio)
           + (ascii_weight × ascii_ratio)
           - (null_pattern_penalty)
           - (invalid_range_penalty)
The result is clamped to 0.0-1.0 range.
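In code, the formula reduces to a weighted sum minus penalties, clamped to range. The weight constants in this sketch are placeholder assumptions, not Stringy's actual defaults.

```rust
// Weighted combination per the formula above; the weight constants are
// placeholder assumptions, not the crate's actual defaults.
fn utf16_confidence(
    valid_ratio: f32,
    printable_ratio: f32,
    ascii_ratio: f32,
    null_pattern_penalty: f32,
    invalid_range_penalty: f32,
) -> f32 {
    const VALID_W: f32 = 0.4;
    const PRINTABLE_W: f32 = 0.4;
    const ASCII_W: f32 = 0.2;
    (VALID_W * valid_ratio + PRINTABLE_W * printable_ratio + ASCII_W * ascii_ratio
        - null_pattern_penalty
        - invalid_range_penalty)
        .clamp(0.0, 1.0)
}

fn main() {
    // Fully valid, fully printable, no penalties: top of the range.
    println!("{}", utf16_confidence(1.0, 1.0, 1.0, 0.0, 0.0));
    // Heavy null-pattern and range penalties clamp the score to the floor.
    println!("{}", utf16_confidence(0.3, 0.3, 0.0, 0.5, 0.5));
}
```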
Examples:
- High confidence: "Microsoft Corporation" (>90% printable, valid Unicode, no null patterns)
- Medium confidence: "Test123" (>70% printable, valid Unicode)
- Low confidence: Null-interleaved binary table data (excessive nulls, regular patterns)
The UTF-16-specific confidence score is combined with general noise filtering confidence when noise filtering is enabled, using the minimum of both scores.
False Positive Prevention#
UTF-16 extraction is prone to false positives because binary data with null bytes can look like UTF-16 strings. The confidence scoring system mitigates this by:
- Detecting null-interleaved patterns: Binary tables with numeric data (e.g., [0x01, 0x00, 0x02, 0x00]) are flagged as suspicious
- Penalizing regular null patterns: Data with nulls at fixed intervals (every 2nd, 4th, 8th byte) receives lower confidence
- Validating Unicode ranges: Invalid code points and surrogate pairs reduce confidence
- Configurable threshold: The utf16_confidence_threshold (default 0.5) can be tuned to balance recall and precision
Recommendations:
- For Windows PE binaries: Use ByteOrder::LE with confidence_threshold: 0.6
- For Java .class files: Use ByteOrder::BE with confidence_threshold: 0.5
- For unknown formats: Use ByteOrder::Auto with confidence_threshold: 0.5
- For high-precision extraction: Increase confidence_threshold to 0.7-0.8
Performance Considerations#
UTF-16 scanning adds overhead compared to ASCII/UTF-8 extraction:
- Scanning both byte orders: Auto mode doubles the work by scanning for both LE and BE
- Confidence scoring: The multi-heuristic confidence calculation adds computational cost
Recommendations:
- Use a specific byte order (LE or BE) when the target format is known
- Auto mode is best for unknown or mixed-format binaries
- Consider disabling UTF-16 extraction for formats that don't use it (e.g., pure ELF binaries)
Section-Aware Extraction#
Different sections have different string extraction strategies.
High-Priority Sections#
ELF: .rodata and variants#
- Strategy: Aggressive extraction, low noise filtering
- Encodings: ASCII/UTF-8 primary, UTF-16 secondary
- Minimum length: 3 characters
PE: .rdata#
- Strategy: Balanced extraction
- Encodings: ASCII and UTF-16LE equally
- Minimum length: 4 characters
Mach-O: __TEXT,__cstring#
- Strategy: High confidence, null-terminated focus
- Encodings: UTF-8 primary
- Minimum length: 3 characters
Medium-Priority Sections#
ELF: .data.rel.ro#
- Strategy: Conservative extraction
- Noise filtering: Enhanced
- Minimum length: 5 characters
PE: .data (read-only)#
- Strategy: Moderate extraction
- Context checking: Enhanced validation
Low-Priority Sections#
Writable data sections#
- Strategy: Very conservative
- High noise filtering: Skip obvious runtime data
- Minimum length: 6+ characters
Resource Sections#
PE Resources (.rsrc)#
- VERSIONINFO: Extract version strings, product names
- STRINGTABLE: Localized UI strings
- RT_MANIFEST: XML manifest data
fn extract_pe_resources(pe: &PE, data: &[u8]) -> Vec<RawString> {
    let mut strings = Vec::new();
    // Extract version info
    if let Some(version_info) = extract_version_info(pe, data) {
        strings.extend(version_info);
    }
    // Extract string tables
    if let Some(string_tables) = extract_string_tables(pe, data) {
        strings.extend(string_tables);
    }
    strings
}
Deduplication Strategy#
Canonicalization#
Strings are canonicalized while preserving important metadata:
- Normalize whitespace: Convert tabs/newlines to spaces
- Trim boundaries: Remove leading/trailing whitespace
- Case preservation: Maintain original case for analysis
- Encoding normalization: Convert to UTF-8 for comparison
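The first three steps above fit in a small helper. This is a minimal canonicalize() sketch; encoding normalization is assumed to have happened upstream, since the input is already a `&str`.

```rust
// Minimal canonicalization per the steps above: tabs/newlines/carriage
// returns become spaces, boundaries are trimmed, case is preserved.
fn canonicalize(s: &str) -> String {
    s.chars()
        .map(|c| if matches!(c, '\t' | '\n' | '\r') { ' ' } else { c })
        .collect::<String>()
        .trim()
        .to_string()
}

fn main() {
    println!("{:?}", canonicalize("  Hello\tWorld\n")); // "Hello World"
}
```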
Metadata Preservation#
When duplicates are found:
struct DeduplicatedString {
    canonical_text: String,
    occurrences: Vec<StringOccurrence>,
    primary_encoding: Encoding,
    best_section: Option<String>,
}

struct StringOccurrence {
    offset: u64,
    section: Option<String>,
    encoding: Encoding,
    length: u32,
}
Deduplication Algorithm#
fn deduplicate_strings(strings: Vec<RawString>) -> Vec<DeduplicatedString> {
    let mut map: HashMap<String, DeduplicatedString> = HashMap::new();
    for string in strings {
        let canonical = canonicalize(&string.text);
        map.entry(canonical.clone())
            .or_insert_with(|| DeduplicatedString::new(canonical))
            .add_occurrence(string);
    }
    map.into_values().collect()
}
Configuration Options#
Extraction Configuration#
use stringy::extraction::{ByteOrder, Encoding, ExtractionConfig};
pub struct ExtractionConfig {
    pub min_ascii_length: usize,          // Default: 4
    pub min_wide_length: usize,           // Default: 3 (for UTF-16)
    pub enabled_encodings: Vec<Encoding>, // Default: ASCII, UTF-8
    pub noise_filtering_enabled: bool,    // Default: true
    pub min_confidence_threshold: f32,    // Default: 0.5
    pub utf16_min_confidence: f32,        // Default: 0.7 (for UTF-16LE)
    pub utf16_byte_order: ByteOrder,      // Default: Auto
    pub utf16_confidence_threshold: f32,  // Default: 0.5 (UTF-16-specific)
}
UTF-16 Configuration Examples:
use stringy::extraction::{ExtractionConfig, Encoding, ByteOrder};
// Extract UTF-16LE strings from Windows PE binary
let mut config = ExtractionConfig::default();
config.min_wide_length = 3;
config.utf16_confidence_threshold = 0.6;
config.utf16_byte_order = ByteOrder::LE;
config.enabled_encodings.push(Encoding::Utf16Le);
// Extract both UTF-16LE and UTF-16BE with auto-detection
let mut config = ExtractionConfig::default();
config.enabled_encodings.push(Encoding::Utf16Le);
config.enabled_encodings.push(Encoding::Utf16Be);
config.utf16_byte_order = ByteOrder::Auto;
Noise Filter Configuration#
use stringy::extraction::config::NoiseFilterConfig;
pub struct NoiseFilterConfig {
    pub entropy_min: f32,              // Default: 1.5
    pub entropy_max: f32,              // Default: 7.5
    pub max_length: usize,             // Default: 200
    pub max_repetition_ratio: f32,     // Default: 0.7
    pub min_vowel_ratio: f32,          // Default: 0.1
    pub max_vowel_ratio: f32,          // Default: 0.9
    pub filter_weights: FilterWeights, // Default: balanced weights
}
Filter Weights#
use stringy::extraction::config::FilterWeights;
pub struct FilterWeights {
    pub entropy_weight: f32,           // Default: 0.25
    pub char_distribution_weight: f32, // Default: 0.20
    pub linguistic_weight: f32,        // Default: 0.20
    pub length_weight: f32,            // Default: 0.15
    pub repetition_weight: f32,        // Default: 0.10
    pub context_weight: f32,           // Default: 0.10
}
All weights must sum to 1.0. The configuration validates this automatically.
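The sum check can be sketched as a small validation with a float tolerance; the standalone struct and `validate` method name here are assumptions for illustration, not the crate's actual API.

```rust
// Sketch of the sum-to-1.0 validation, with a small float tolerance.
// The `validate` method name is an illustrative assumption.
struct FilterWeights {
    entropy_weight: f32,
    char_distribution_weight: f32,
    linguistic_weight: f32,
    length_weight: f32,
    repetition_weight: f32,
    context_weight: f32,
}

impl FilterWeights {
    fn validate(&self) -> Result<(), String> {
        let sum = self.entropy_weight
            + self.char_distribution_weight
            + self.linguistic_weight
            + self.length_weight
            + self.repetition_weight
            + self.context_weight;
        if (sum - 1.0).abs() > 1e-4 {
            Err(format!("filter weights sum to {sum}, expected 1.0"))
        } else {
            Ok(())
        }
    }
}

fn main() {
    let defaults = FilterWeights {
        entropy_weight: 0.25,
        char_distribution_weight: 0.20,
        linguistic_weight: 0.20,
        length_weight: 0.15,
        repetition_weight: 0.10,
        context_weight: 0.10,
    };
    println!("{:?}", defaults.validate());
}
```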
Encoding Selection#
pub enum EncodingFilter {
    All,
    Specific(Vec<Encoding>),
    AsciiOnly,
    Utf16Only,
}
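A filter like this is typically applied as a predicate over candidate encodings. The `allows` helper and the Encoding variants in this sketch are illustrative assumptions, not the crate's actual API.

```rust
// Hypothetical predicate showing how an EncodingFilter selection might be
// applied; `allows` and these Encoding variants are illustrative.
#[derive(PartialEq, Clone, Copy)]
enum Encoding {
    Ascii,
    Utf8,
    Utf16Le,
    Utf16Be,
}

enum EncodingFilter {
    All,
    Specific(Vec<Encoding>),
    AsciiOnly,
    Utf16Only,
}

impl EncodingFilter {
    fn allows(&self, enc: Encoding) -> bool {
        match self {
            EncodingFilter::All => true,
            EncodingFilter::Specific(list) => list.contains(&enc),
            EncodingFilter::AsciiOnly => enc == Encoding::Ascii,
            EncodingFilter::Utf16Only => {
                matches!(enc, Encoding::Utf16Le | Encoding::Utf16Be)
            }
        }
    }
}

fn main() {
    let filter = EncodingFilter::Utf16Only;
    println!("{}", filter.allows(Encoding::Utf16Le)); // true
    println!("{}", filter.allows(Encoding::Ascii));   // false
}
```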
Section Filtering#
pub struct SectionFilter {
    pub include_sections: Option<Vec<String>>,
    pub exclude_sections: Option<Vec<String>>,
    pub include_debug: bool,
    pub include_resources: bool,
}
Performance Optimizations#
Memory Mapping#
Large files use memory mapping for efficient access:
use memmap2::Mmap;
fn extract_from_large_file(path: &Path) -> Result<Vec<RawString>> {
    let file = File::open(path)?;
    let mmap = unsafe { Mmap::map(&file)? };
    extract_strings(&mmap[..])
}
Parallel Processing#
Section extraction can be parallelized:
use rayon::prelude::*;
fn extract_parallel(sections: &[SectionInfo], data: &[u8]) -> Vec<RawString> {
    sections
        .par_iter()
        .flat_map(|section| extract_from_section(section, data))
        .collect()
}
Regex Caching#
Pattern matching uses cached regex compilation:
lazy_static! {
    static ref URL_REGEX: Regex = Regex::new(r"https?://[^\s]+").unwrap();
    // Match the 8-4-4-4-12 GUID layout rather than any 36 hex/hyphen chars
    static ref GUID_REGEX: Regex = Regex::new(
        r"\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}"
    ).unwrap();
}
Quality Assurance#
Validation Heuristics#
The noise filtering system implements comprehensive validation:
- Entropy checking: Uses Shannon entropy to detect padding/repetition and random binary data
- Language detection: Analyzes vowel-to-consonant ratios and common bigrams
- Context validation: Considers section type, weight, and permissions
- Character distribution: Detects abnormal frequency distributions
- Repetition detection: Identifies repeated patterns and padding
False Positive Reduction#
The multi-layered filtering system targets common sources of false positives:
- Padding detection: Identifies repeated character sequences (e.g., "AAAA", "\x00\x00\x00\x00")
- Table data: Filters excessively long strings likely to be structured data
- Binary noise: High-entropy strings are flagged as likely random binary
- Context awareness: Strings in code sections receive lower confidence scores
Performance Characteristics#
Noise filtering is designed for minimal overhead:
- Target overhead: <10% compared to extraction without filtering
- Optimized filters: Each filter is independently optimized
- Configurable: Can enable/disable individual filters to balance accuracy and speed
- Scalable: Handles large binaries efficiently
Examples#
Basic Extraction with Filtering#
use stringy::extraction::ascii::{extract_ascii_strings, AsciiExtractionConfig};
use stringy::extraction::config::NoiseFilterConfig;
use stringy::extraction::filters::{CompositeNoiseFilter, FilterContext};
let data = b"Hello World\0AAAA\0Test123";
let config = AsciiExtractionConfig::default();
let strings = extract_ascii_strings(data, &config);
let filter_config = NoiseFilterConfig::default();
let filter = CompositeNoiseFilter::new(&filter_config);
let context = FilterContext::default();
let filtered: Vec<_> = strings
    .into_iter()
    .filter(|s| filter.calculate_confidence(&s.text, &context) >= 0.5)
    .collect();
Custom Filter Configuration#
use stringy::extraction::config::{NoiseFilterConfig, FilterWeights};
let mut config = NoiseFilterConfig::default();
config.entropy_min = 2.0;
config.entropy_max = 7.0;
config.max_length = 150;
config.filter_weights = FilterWeights {
    entropy_weight: 0.4,
    char_distribution_weight: 0.3,
    linguistic_weight: 0.15,
    length_weight: 0.1,
    repetition_weight: 0.03,
    context_weight: 0.02,
};
This comprehensive extraction system ensures high-quality string extraction while maintaining performance and minimizing false positives through multi-layered noise filtering.