Ranking Algorithm#
Stringy's ranking system prioritizes strings by relevance, helping analysts focus on the most important findings first. The algorithm combines multiple factors to produce a comprehensive relevance score.
Scoring Formula#
Final Score = SectionWeight + EncodingConfidence + SemanticBoost - NoisePenalty
Each component contributes to the overall relevance assessment.
Section Weight#
Different sections have varying likelihood of containing meaningful strings.
Weight Values#
| Section Type | Weight | Rationale |
|---|---|---|
| StringData | 40 | Dedicated string storage (.rodata, __cstring) |
| Resources | 35 | PE resources, version info, manifests |
| ReadOnlyData | 25 | Read-only after loading (.data.rel.ro) |
| Debug | 15 | Debug symbols, build info |
| WritableData | 10 | Runtime data, less reliable |
| Code | 5 | Occasional embedded strings |
| Other | 0 | Unknown or irrelevant sections |
Format-Specific Adjustments#
fn calculate_section_weight(
section_type: SectionType,
format: BinaryFormat,
section_name: &str,
) -> i32 {
let base_weight = match section_type {
SectionType::StringData => 40,
SectionType::Resources => 35,
SectionType::ReadOnlyData => 25,
SectionType::Debug => 15,
SectionType::WritableData => 10,
SectionType::Code => 5,
SectionType::Other => 0,
};
// Format-specific bonuses
let format_bonus = match (format, section_name) {
(BinaryFormat::Elf, ".rodata.str1.1") => 5, // Aligned strings
(BinaryFormat::Pe, ".rsrc") => 5, // Rich resources
(BinaryFormat::MachO, "__TEXT,__cstring") => 5, // Dedicated strings
_ => 0,
};
base_weight + format_bonus
}
Encoding Confidence#
Different encodings have varying reliability indicators.
Confidence Factors#
ASCII/UTF-8#
- High confidence (10 points): All printable, reasonable length
- Medium confidence (7 points): Mostly printable, some control chars
- Low confidence (3 points): Mixed printable/non-printable
UTF-16#
- High confidence (8 points): >90% valid chars, proper null termination
- Medium confidence (5 points): >70% valid chars, reasonable length
- Low confidence (2 points): >50% valid chars, may be coincidental
fn calculate_encoding_confidence(string: &FoundString) -> i32 {
match string.encoding {
Encoding::Ascii | Encoding::Utf8 => {
let printable_ratio = calculate_printable_ratio(&string.text);
if printable_ratio > 0.95 {
10
} else if printable_ratio > 0.80 {
7
} else {
3
}
}
Encoding::Utf16Le | Encoding::Utf16Be => {
let confidence = calculate_utf16_confidence(&string);
if confidence > 0.90 {
8
} else if confidence > 0.70 {
5
} else {
2
}
}
}
}
Semantic Boost#
Strings with semantic meaning receive significant score boosts.
Boost Values#
| Tag Category | Boost | Examples |
|---|---|---|
| Network (URL, Domain, IP) | +25 | https://api.evil.com |
| Identifiers (GUID, Email) | +20 | {12345678-1234-...} |
| File System (Path, Registry) | +15 | C:\Windows\System32\evil.dll |
| Code Artifacts (Format, Base64) | +10 | Error: %s at line %d |
| Symbols (Import, Export) | +8 | CreateFileW, main |
| Version/Manifest | +12 | MyApp v1.2.3 |
Multi-Tag Bonuses#
Strings with multiple semantic tags receive additional boosts:
fn calculate_semantic_boost(tags: &[Tag]) -> i32 {
let mut boost = 0;
for tag in tags {
boost += match tag {
Tag::Url | Tag::Domain | Tag::IPv4 | Tag::IPv6 => 25,
Tag::Guid | Tag::Email => 20,
Tag::FilePath | Tag::RegistryPath => 15,
Tag::Version | Tag::Manifest => 12,
Tag::FormatString | Tag::Base64 => 10,
Tag::Import | Tag::Export => 8,
Tag::UserAgent => 15,
Tag::Resource => 5,
};
}
// Multi-tag bonus (diminishing returns)
if tags.len() > 1 {
boost += (tags.len() as i32 - 1) * 3;
}
boost
}
Context-Aware Boosts#
Semantic boosts are adjusted based on context:
fn apply_context_boost(base_boost: i32, context: &StringContext) -> i32 {
let mut adjusted_boost = base_boost;
// Boost for strings in high-value sections
if matches!(
context.section_type,
SectionType::StringData | SectionType::Resources
) {
adjusted_boost = (adjusted_boost as f32 * 1.2) as i32;
}
// Boost for import/export context
if context.is_symbol_context {
adjusted_boost += 5;
}
adjusted_boost
}
Noise Penalty#
Various factors indicate low-quality or noisy strings.
Penalty Categories#
High Entropy#
Strings with high randomness are likely binary data:
fn calculate_entropy_penalty(text: &str) -> i32 {
let entropy = calculate_shannon_entropy(text);
if entropy > 4.5 {
-15 // Very high entropy
} else if entropy > 3.8 {
-8 // High entropy
} else {
0 // Normal entropy
}
}
Excessive Length#
Very long strings are often noise:
fn calculate_length_penalty(length: usize) -> i32 {
match length {
0..=50 => 0,
51..=200 => -2,
201..=500 => -5,
501..=1000 => -10,
_ => -20,
}
}
Repeated Patterns#
Strings with excessive repetition:
fn calculate_repetition_penalty(text: &str) -> i32 {
let repetition_ratio = detect_repetition_ratio(text);
if repetition_ratio > 0.7 {
-12 // Highly repetitive
} else if repetition_ratio > 0.5 {
-6 // Moderately repetitive
} else {
0 // Normal variation
}
}
Common Noise Patterns#
Known noise patterns receive penalties:
fn calculate_noise_pattern_penalty(text: &str) -> i32 {
// Padding patterns
if text.chars().all(|c| c == ' ' || c == '\0' || c == '\x20') {
return -20;
}
// Hex dump patterns
if text.matches(char::is_ascii_hexdigit).count() as f32 / text.len() as f32 > 0.8 {
return -10;
}
// Table-like data
if text.matches('\t').count() > 3 || text.matches(',').count() > 5 {
return -8;
}
0
}
Complete Scoring Implementation#
pub struct RankingEngine {
config: RankingConfig,
}
impl RankingEngine {
pub fn calculate_score(&self, string: &FoundString, context: &StringContext) -> i32 {
let section_weight = self.calculate_section_weight(context);
let encoding_confidence = self.calculate_encoding_confidence(string);
let semantic_boost = self.calculate_semantic_boost(&string.tags, context);
let noise_penalty = self.calculate_noise_penalty(string);
let raw_score = section_weight + encoding_confidence + semantic_boost + noise_penalty;
// Clamp to valid range
raw_score.max(0).min(100)
}
fn calculate_noise_penalty(&self, string: &FoundString) -> i32 {
let entropy_penalty = calculate_entropy_penalty(&string.text);
let length_penalty = calculate_length_penalty(string.length as usize);
let repetition_penalty = calculate_repetition_penalty(&string.text);
let pattern_penalty = calculate_noise_pattern_penalty(&string.text);
entropy_penalty + length_penalty + repetition_penalty + pattern_penalty
}
}
Score Interpretation#
Score Ranges#
| Range | Interpretation | Typical Content |
|---|---|---|
| 90-100 | Extremely High | URLs, GUIDs in .rdata |
| 80-89 | Very High | File paths, API names |
| 70-79 | High | Format strings, version info |
| 60-69 | Medium-High | Import names, long strings |
| 50-59 | Medium | Short strings in good sections |
| 40-49 | Medium-Low | Strings in data sections |
| 30-39 | Low | Short or noisy strings |
| 0-29 | Very Low | Likely false positives |
Filtering Recommendations#
- Interactive analysis: Show scores ≥ 50
- Automated processing: Use scores ≥ 70
- YARA rules: Focus on scores ≥ 80
- High-confidence indicators: Scores ≥ 90
Configuration Options#
pub struct RankingConfig {
pub section_weights: HashMap<SectionType, i32>,
pub semantic_boosts: HashMap<Tag, i32>,
pub entropy_threshold: f32,
pub length_penalty_threshold: usize,
pub repetition_threshold: f32,
}
impl Default for RankingConfig {
fn default() -> Self {
Self {
section_weights: default_section_weights(),
semantic_boosts: default_semantic_boosts(),
entropy_threshold: 4.5,
length_penalty_threshold: 200,
repetition_threshold: 0.5,
}
}
}
Performance Considerations#
Caching#
- Pre-calculate entropy for reused strings
- Cache regex matches for pattern detection
- Memoize expensive calculations
Batch Processing#
pub fn rank_strings_batch(strings: &mut [FoundString], contexts: &[StringContext]) {
strings
.par_iter_mut()
.zip(contexts.par_iter())
.for_each(|(string, context)| {
string.score = self.calculate_score(string, context);
});
// Sort by score (highest first)
strings.sort_by(|a, b| b.score.cmp(&a.score));
}
This comprehensive ranking system ensures that the most relevant and actionable strings appear first in Stringy's output, dramatically improving analysis efficiency.