Classification System#
Stringy's classification system applies semantic analysis to extracted strings, identifying patterns that indicate specific types of data. This helps analysts quickly focus on the most relevant information.
Classification Pipeline#
Raw String → Pattern Matching → Context Analysis → Tag Assignment → Confidence Scoring
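The stages above could chain together roughly as follows. This is a minimal sketch using only the standard library: the names `Tag`, `Classified`, `match_patterns`, and `classify_pipeline` are hypothetical, the prefix checks stand in for Stringy's regex patterns, and the 0.2 context boost is an assumed value.

```rust
/// Hypothetical tag set; the real `Tag` enum has more variants.
#[derive(Debug, Clone, PartialEq)]
enum Tag {
    Url,
    FilePath,
}

/// Hypothetical pipeline output: a tag plus its confidence score.
#[derive(Debug, Clone, PartialEq)]
struct Classified {
    tag: Tag,
    confidence: f32,
}

/// Pattern matching stage (prefix checks stand in for regexes).
fn match_patterns(raw: &str) -> Vec<Tag> {
    let mut tags = Vec::new();
    if raw.starts_with("http://") || raw.starts_with("https://") {
        tags.push(Tag::Url);
    } else if raw.starts_with('/') && raw.len() > 1 {
        tags.push(Tag::FilePath);
    }
    tags
}

/// Context analysis, tag assignment, and confidence scoring stages.
fn classify_pipeline(raw: &str, in_string_section: bool) -> Vec<Classified> {
    match_patterns(raw)
        .into_iter()
        .map(|tag| {
            let mut confidence: f32 = 0.5; // base confidence
            if in_string_section {
                confidence += 0.2; // context boost (assumed value)
            }
            Classified { tag, confidence: confidence.min(1.0) }
        })
        .collect()
}
```

Strings that match no pattern produce an empty result, so later stages never see them.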
Semantic Categories#
Network Indicators#
URLs#
- Pattern: `https?://[^\s]+`
- Examples: `https://api.example.com/v1/users`, `http://malware.com/payload`
- Confidence factors: Valid TLD, path structure, parameter format
- Security relevance: High - indicates network communication
Domain Names#
- Pattern: `[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
- Examples: `api.example.com`, `malware-c2.net`
- Validation: TLD checking, DNS format compliance
- Security relevance: High - C2 domains, legitimate services
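The validation step for domains might look like the sketch below. It is an illustration under stated assumptions, not Stringy's implementation: the function names are hypothetical and the four-entry TLD list is a placeholder for a complete list.

```rust
/// Checks DNS format rules: total length <= 253, each label 1..=63
/// characters of ASCII letters, digits, or hyphens, and no label
/// starting or ending with a hyphen.
fn has_dns_format(name: &str) -> bool {
    if name.is_empty() || name.len() > 253 {
        return false;
    }
    name.split('.').all(|label| {
        !label.is_empty()
            && label.len() <= 63
            && !label.starts_with('-')
            && !label.ends_with('-')
            && label.chars().all(|c| c.is_ascii_alphanumeric() || c == '-')
    })
}

/// Checks the final label against a tiny, illustrative TLD list.
/// A real implementation would load a complete list instead.
fn has_known_tld(name: &str) -> bool {
    const SAMPLE_TLDS: [&str; 4] = ["com", "net", "org", "io"];
    match name.rsplit_once('.') {
        Some((_, tld)) => SAMPLE_TLDS.contains(&tld.to_ascii_lowercase().as_str()),
        None => false,
    }
}

/// Combines both checks into a single plausibility test.
fn is_plausible_domain(name: &str) -> bool {
    has_dns_format(name) && has_known_tld(name)
}
```

A string such as `version.1.2` matches the domain regex but fails the TLD check, which is the kind of false positive this step is meant to catch.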
IP Addresses#
- IPv4 Pattern: `\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b`
- IPv6 Pattern: `\b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\b` (matches the full eight-group form only; compressed addresses need additional handling)
- Examples: `192.168.1.1`, `2001:db8::1`
- Validation: Range checking, reserved address detection
- Security relevance: High - infrastructure indicators
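Range checking and reserved-address detection can lean on the standard library's address parsers. The sketch below assumes a hypothetical `IpKind` enum and `classify_ip` function; only the `std::net` calls are real APIs.

```rust
use std::net::{Ipv4Addr, Ipv6Addr};

/// Hypothetical classification of an IP-like string.
#[derive(Debug, PartialEq)]
enum IpKind {
    PublicV4,
    ReservedV4,
    V6,
    Invalid,
}

fn classify_ip(text: &str) -> IpKind {
    // Parsing enforces range checking: each IPv4 octet must be 0..=255.
    if let Ok(v4) = text.parse::<Ipv4Addr>() {
        let reserved = v4.is_private()
            || v4.is_loopback()
            || v4.is_link_local()
            || v4.is_broadcast()
            || v4.is_unspecified()
            || v4.is_documentation();
        return if reserved { IpKind::ReservedV4 } else { IpKind::PublicV4 };
    }
    // The standard parser also accepts compressed IPv6 forms.
    if text.parse::<Ipv6Addr>().is_ok() {
        return IpKind::V6;
    }
    IpKind::Invalid
}
```

Running the regex first and the parser second keeps the cheap check on the hot path while rejecting matches such as `999.1.1.1` that only look like addresses.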
File System Indicators#
File Paths#
- POSIX Pattern: `/[^\0\n\r]*`
- Windows Pattern: `[A-Za-z]:\\[^\0\n\r]*`
- Examples: `/usr/bin/malware`, `C:\Windows\System32\evil.dll`
- Context: Section type, surrounding strings
- Security relevance: Medium-High - persistence locations
Registry Paths#
- Pattern: `HKEY_[A-Z_]+\\[^\0\n\r]*`
- Examples: `HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Run`
- Security relevance: High - persistence mechanisms
Identifiers#
GUIDs/UUIDs#
- Pattern: `\{[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\}`
- Examples: `{12345678-1234-1234-1234-123456789abc}`
- Validation: Format compliance, version checking
- Security relevance: Medium - component identification
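Format compliance and version checking for GUIDs can be done without a regex. In this sketch the function names are hypothetical, and accepting only versions 1 through 5 is an assumed policy rather than something the source specifies.

```rust
/// Returns the UUID version nibble if `text` is a braced GUID in the
/// 8-4-4-4-12 hexadecimal layout, or `None` if the format is wrong.
fn guid_version(text: &str) -> Option<u32> {
    let inner = text.strip_prefix('{')?.strip_suffix('}')?;
    let groups: Vec<&str> = inner.split('-').collect();
    let expected_lengths = [8, 4, 4, 4, 12];
    if groups.len() != expected_lengths.len() {
        return None;
    }
    for (group, &len) in groups.iter().zip(expected_lengths.iter()) {
        if group.len() != len || !group.chars().all(|c| c.is_ascii_hexdigit()) {
            return None;
        }
    }
    // The version is the first hex digit of the third group.
    groups[2].chars().next()?.to_digit(16)
}

/// Treats versions 1 through 5 as well-formed (an assumed policy).
fn is_versioned_guid(text: &str) -> bool {
    matches!(guid_version(text), Some(1..=5))
}
```

For the example above, the third group is `1234`, so the extracted version is 1.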
Email Addresses#
- Pattern: `[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}`
- Examples: `admin@malware.com`, `support@legitimate.org`
- Validation: RFC compliance, domain validation
- Security relevance: Medium - contact information
Code Artifacts#
Format Strings#
- Pattern: `%[sdxo]|%\d+[sdxo]|\{\d+\}`
- Examples: `Error: %s at line %d`, `User {0} logged in`
- Context: Proximity to other format strings
- Security relevance: Low-Medium - debugging information
Base64 Data#
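- Pattern: `[A-Za-z0-9+/]{20,}={0,2}`
- Examples: `SGVsbG8gV29ybGQ=` (illustrative; at 16 characters it is shorter than the 20-character minimum the pattern requires)
- Validation: Length divisibility, padding correctness
- Security relevance: Variable - encoded payloads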
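The two validation rules named above, length divisibility and padding correctness, can be checked structurally before attempting any decode. The function name below is hypothetical, and the sketch covers the standard Base64 alphabet only.

```rust
/// Checks that the total length is divisible by four and that at most
/// two `=` characters appear, and only at the end.
fn valid_base64_structure(text: &str) -> bool {
    if text.is_empty() || text.len() % 4 != 0 {
        return false;
    }
    let body = text.trim_end_matches('=');
    let padding = text.len() - body.len();
    padding <= 2
        && body
            .chars()
            .all(|c| c.is_ascii_alphanumeric() || c == '+' || c == '/')
}
```

This check is deliberately separate from the minimum-length rule in the regex, so the same function can validate short values found inside other strings.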
User Agents#
- Pattern: `Mozilla/[0-9.]+|Chrome/[0-9.]+|Safari/[0-9.]+`
- Examples: `Mozilla/5.0 (Windows NT 10.0; Win64; x64)`
- Security relevance: Medium - network fingerprinting
Implementation Details#
Pattern Matching Engine#
use regex::Regex;
pub struct SemanticClassifier {
    url_regex: Regex,
    domain_regex: Regex,
    ipv4_regex: Regex,
    ipv6_regex: Regex,
    guid_regex: Regex,
    email_regex: Regex,
    format_regex: Regex,
    base64_regex: Regex,
}
impl SemanticClassifier {
    pub fn classify(&self, text: &str, context: &StringContext) -> Vec<Tag> {
        let mut tags = Vec::new();
        // Network indicators
        if self.url_regex.is_match(text) {
            tags.push(Tag::Url);
        }
        // Tag a bare domain only if the string was not already tagged as a URL
        if self.domain_regex.is_match(text) && !tags.contains(&Tag::Url) {
            tags.push(Tag::Domain);
        }
        // File system
        if self.is_file_path(text) {
            tags.push(Tag::FilePath);
        }
        if self.is_registry_path(text) {
            tags.push(Tag::RegistryPath);
        }
        // Continue for other patterns...
        tags
    }
}
Context-Aware Classification#
Classification considers the context where strings are found:
pub struct StringContext {
    pub section_type: SectionType,
    pub section_name: Option<String>,
    pub surrounding_strings: Vec<String>,
    pub binary_format: BinaryFormat,
    pub encoding: Encoding,
}
impl SemanticClassifier {
    fn classify_with_context(&self, text: &str, context: &StringContext) -> Vec<Tag> {
        let mut tags = self.classify_patterns(text);
        // Boost confidence based on context
        match context.section_type {
            SectionType::Resources => {
                if self.looks_like_version_string(text) {
                    tags.push(Tag::Version);
                }
            }
            SectionType::StringData => {
                // Higher confidence for semantic patterns
                self.boost_pattern_confidence(&mut tags);
            }
            _ => {}
        }
        tags
    }
}
Symbol Classification#
Import and export symbols get special handling:
use std::collections::HashSet;
pub struct SymbolClassifier {
    known_apis: HashSet<String>,
    crypto_apis: HashSet<String>,
    network_apis: HashSet<String>,
}
impl SymbolClassifier {
    pub fn classify_symbol(&self, name: &str, is_import: bool) -> Vec<Tag> {
        let mut tags = Vec::new();
        if is_import {
            tags.push(Tag::Import);
        } else {
            tags.push(Tag::Export);
        }
        // Add semantic tags based on API name
        if self.crypto_apis.contains(name) {
            tags.push(Tag::Crypto);
        }
        if self.network_apis.contains(name) {
            tags.push(Tag::Network);
        }
        tags
    }
}
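To show how the API sets drive tagging, here is a self-contained sketch with a reduced `Tag` enum. The `with_sample_apis` constructor and the `to_set` helper are hypothetical, and the API names are a small illustrative sample rather than Stringy's actual lists.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone, PartialEq)]
enum Tag {
    Import,
    Export,
    Crypto,
    Network,
}

struct SymbolClassifier {
    crypto_apis: HashSet<String>,
    network_apis: HashSet<String>,
}

/// Converts a slice of names into an owned set.
fn to_set(names: &[&str]) -> HashSet<String> {
    names.iter().map(|name| name.to_string()).collect()
}

impl SymbolClassifier {
    /// Builds a classifier from small, illustrative API lists.
    fn with_sample_apis() -> Self {
        SymbolClassifier {
            crypto_apis: to_set(&["CryptEncrypt", "CryptDecrypt"]),
            network_apis: to_set(&["connect", "WSAStartup"]),
        }
    }

    /// Tags the symbol as import or export, then adds semantic tags.
    fn classify_symbol(&self, name: &str, is_import: bool) -> Vec<Tag> {
        let mut tags = vec![if is_import { Tag::Import } else { Tag::Export }];
        if self.crypto_apis.contains(name) {
            tags.push(Tag::Crypto);
        }
        if self.network_apis.contains(name) {
            tags.push(Tag::Network);
        }
        tags
    }
}
```

With these sample sets, an imported `CryptEncrypt` yields `[Import, Crypto]`, while an unrecognized export yields only `[Export]`.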
Rust Symbol Demangling#
use rustc_demangle::try_demangle;
pub fn classify_rust_symbol(mangled: &str) -> Vec<Tag> {
    let mut tags = vec![Tag::Export];
    // `try_demangle` returns a `Result`, so names that are not valid
    // Rust-mangled symbols are skipped (plain `demangle` never fails)
    if let Ok(demangled) = try_demangle(mangled) {
        let demangled_str = demangled.to_string();
        // Look for common Rust patterns
        if demangled_str.contains("::main") {
            tags.push(Tag::EntryPoint);
        }
        if demangled_str.contains("panic") {
            tags.push(Tag::ErrorHandling);
        }
    }
    tags
}
Confidence Scoring#
Each classification receives a confidence score:
pub struct ClassificationResult {
    pub tag: Tag,
    pub confidence: f32, // 0.0 to 1.0
    pub evidence: Vec<String>,
}
impl SemanticClassifier {
    fn calculate_confidence(&self, text: &str, tag: &Tag, context: &StringContext) -> f32 {
        // Explicit type so that `.min(1.0)` resolves on a concrete float type
        let mut confidence: f32 = 0.5; // Base confidence
        match tag {
            Tag::Url => {
                if text.starts_with("https://") {
                    confidence += 0.3;
                }
                if self.has_valid_tld(text) {
                    confidence += 0.2;
                }
            }
            Tag::FilePath => {
                if context.section_type == SectionType::StringData {
                    confidence += 0.2;
                }
                if self.has_valid_path_structure(text) {
                    confidence += 0.2;
                }
            }
            // ... other tag-specific confidence calculations
            _ => {}
        }
        // Clamp so that stacked boosts never exceed 1.0
        confidence.min(1.0)
    }
}
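The `evidence` field suggests that each boost can be recorded alongside the score. The sketch below is a hypothetical, standalone variant of the URL rule: `score_url` and its `known_tlds` parameter are assumptions, while the 0.5, 0.3, and 0.2 values come from the code above.

```rust
/// Hypothetical evidence-tracking variant of the URL scoring rule.
/// Returns the clamped score together with the reasons that raised it.
fn score_url(text: &str, known_tlds: &[&str]) -> (f32, Vec<String>) {
    let mut confidence: f32 = 0.5; // base confidence
    let mut evidence = Vec::new();

    if text.starts_with("https://") {
        confidence += 0.3;
        evidence.push("uses https scheme".to_string());
    }

    // Host: the text between "://" and the next '/', '?', ':' or '#'.
    let host = text
        .split_once("://")
        .map(|(_, rest)| rest)
        .unwrap_or(text)
        .split(|c: char| matches!(c, '/' | '?' | ':' | '#'))
        .next()
        .unwrap_or("");
    if let Some((_, tld)) = host.rsplit_once('.') {
        if known_tlds.contains(&tld) {
            confidence += 0.2;
            evidence.push(format!("recognized TLD .{tld}"));
        }
    }

    (confidence.min(1.0), evidence)
}
```

An HTTPS URL with a recognized TLD reaches 0.5 + 0.3 + 0.2 = 1.0, while a plain HTTP URL with a recognized TLD stops at 0.7.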
Advanced Classification Features#
Multi-Pattern Matching#
Some strings match multiple patterns:
fn classify_multi_pattern(&self, text: &str) -> Vec<Tag> {
    let mut tags = Vec::new();
    // A string can be both a URL and contain Base64
    if self.url_regex.is_match(text) {
        tags.push(Tag::Url);
        // Check if URL contains Base64 parameters
        if let Some(query) = self.extract_url_query(text) {
            if self.base64_regex.is_match(query) {
                tags.push(Tag::Base64);
            }
        }
    }
    tags
}
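The source does not show `extract_url_query`, so the following is one possible shape for it, written as free functions using only the standard library. The `base64_like_params` helper and its `min_len` parameter are hypothetical additions for illustration.

```rust
/// Hypothetical stand-in for `extract_url_query`: returns the text
/// between the first `?` and any `#` fragment.
fn extract_url_query(url: &str) -> Option<&str> {
    let (_, rest) = url.split_once('?')?;
    let query = rest.split('#').next().unwrap_or(rest);
    if query.is_empty() {
        None
    } else {
        Some(query)
    }
}

/// Returns the parameter values that are at least `min_len` characters
/// long and drawn entirely from the Base64 alphabet (plus padding).
fn base64_like_params(url: &str, min_len: usize) -> Vec<&str> {
    let query = match extract_url_query(url) {
        Some(query) => query,
        None => return Vec::new(),
    };
    query
        .split('&')
        .filter_map(|pair| pair.split_once('=').map(|(_, value)| value))
        .filter(|value| {
            value.len() >= min_len
                && value
                    .chars()
                    .all(|c| c.is_ascii_alphanumeric() || matches!(c, '+' | '/' | '='))
        })
        .collect()
}
```

Checking each parameter value separately avoids tagging a URL as Base64 merely because its whole query string happens to be long and alphanumeric.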
Language-Specific Patterns#
Different programming languages have distinct string patterns:
pub enum LanguageHint {
    Rust,
    Go,
    DotNet,
    Native,
}
impl SemanticClassifier {
    fn classify_with_language_hint(&self, text: &str, hint: LanguageHint) -> Vec<Tag> {
        match hint {
            LanguageHint::Rust => self.classify_rust_patterns(text),
            LanguageHint::Go => self.classify_go_patterns(text),
            LanguageHint::DotNet => self.classify_dotnet_patterns(text),
            LanguageHint::Native => self.classify_native_patterns(text),
        }
    }
}
False Positive Reduction#
Several techniques reduce false positives:
- Length thresholds: Very short matches are filtered out
- Context validation: Surrounding data must make sense
- Entropy checking: High-entropy strings are likely binary data
- Whitelist/blacklist: Known good/bad patterns
fn is_likely_false_positive(&self, text: &str, tag: &Tag) -> bool {
    match tag {
        Tag::Domain => {
            // Too short or invalid TLD
            text.len() < 4 || !self.has_valid_tld(text)
        }
        Tag::Base64 => {
            // Too short or invalid padding
            text.len() < 8 || !self.valid_base64_padding(text)
        }
        _ => false,
    }
}
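The entropy check from the list above is not shown in the source. A common way to implement it is Shannon entropy over the byte distribution, sketched here with hypothetical function names and a caller-supplied threshold.

```rust
/// Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0).
fn shannon_entropy(data: &[u8]) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    let mut counts = [0usize; 256];
    for &byte in data {
        counts[byte as usize] += 1;
    }
    let total = data.len() as f64;
    counts
        .iter()
        .filter(|&&count| count > 0)
        .map(|&count| {
            let p = count as f64 / total;
            -p * p.log2()
        })
        .sum()
}

/// Flags strings whose entropy exceeds `threshold` bits per byte.
/// The threshold is a tunable assumption, not a value from Stringy.
fn looks_like_binary(text: &str, threshold: f64) -> bool {
    shannon_entropy(text.as_bytes()) > threshold
}
```

Four distinct bytes of equal frequency give exactly 2 bits per byte, and a single repeated byte gives 0, so ordinary text and random data separate clearly once strings are long enough.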
Performance Considerations#
Regex Compilation Caching#
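use lazy_static::lazy_static;
lazy_static! {
    // Built once on first access, then shared by every classification call
    static ref COMPILED_PATTERNS: SemanticClassifier = SemanticClassifier::new();
}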
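The same build-once behavior is available from the standard library through `std::sync::OnceLock`. This sketch swaps that in for `lazy_static` and uses a hypothetical `PatternSet` of string prefixes in place of compiled regexes. The `BUILD_COUNT` counter exists only to make the single initialization observable.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Counts how many times the stand-in pattern set has been built.
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical stand-in for a struct holding compiled regexes.
struct PatternSet {
    url_prefixes: Vec<&'static str>,
}

impl PatternSet {
    fn new() -> Self {
        BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
        PatternSet { url_prefixes: vec!["http://", "https://"] }
    }

    fn is_url(&self, text: &str) -> bool {
        self.url_prefixes.iter().any(|prefix| text.starts_with(*prefix))
    }
}

/// Returns the shared pattern set, building it on first use only.
fn patterns() -> &'static PatternSet {
    static PATTERNS: OnceLock<PatternSet> = OnceLock::new();
    PATTERNS.get_or_init(PatternSet::new)
}
```

Every call to `patterns()` after the first returns the existing instance, so the expensive construction cost is paid once per process.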
Parallel Classification#
use rayon::prelude::*;
fn classify_batch(strings: &[RawString]) -> Vec<ClassifiedString> {
    strings.par_iter().map(|s| classify_single(s)).collect()
}
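For readers without rayon, the same data-parallel idea can be sketched with `std::thread::scope`. This is a different technique from the rayon version above: it splits the input into fixed chunks instead of using work stealing. The `classify_single` body and the `workers` parameter are hypothetical.

```rust
use std::thread;

/// Hypothetical per-string classifier: returns a label for each input.
fn classify_single(text: &str) -> &'static str {
    if text.starts_with("http://") || text.starts_with("https://") {
        "url"
    } else if text.starts_with('/') {
        "path"
    } else {
        "other"
    }
}

/// Splits the input into roughly equal chunks, classifies each chunk on
/// its own scoped thread, and concatenates the results in input order.
fn classify_batch(strings: &[&str], workers: usize) -> Vec<&'static str> {
    let workers = workers.max(1);
    let chunk_size = ((strings.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // Spawn every worker before joining any, so the chunks run concurrently.
        let handles: Vec<_> = strings
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().map(|s| classify_single(s)).collect::<Vec<_>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Because classification of one string never depends on another, results can be joined in chunk order without any locking.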
Memory Efficiency#
- Reuse regex objects across classifications
- Use string interning for common patterns
- Lazy evaluation for expensive validations
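String interning, mentioned in the list above, can be sketched with a set of reference-counted strings. The `Interner` type is hypothetical. A multi-threaded classifier would need `Arc<str>` and synchronization instead of `Rc<str>`.

```rust
use std::collections::HashSet;
use std::rc::Rc;

/// Minimal interner: equal strings share one reference-counted allocation.
#[derive(Default)]
struct Interner {
    pool: HashSet<Rc<str>>,
}

impl Interner {
    /// Returns the shared copy of `text`, allocating only on first sight.
    fn intern(&mut self, text: &str) -> Rc<str> {
        if let Some(existing) = self.pool.get(text) {
            return Rc::clone(existing);
        }
        let shared: Rc<str> = Rc::from(text);
        self.pool.insert(Rc::clone(&shared));
        shared
    }

    /// Number of distinct strings stored.
    fn len(&self) -> usize {
        self.pool.len()
    }
}
```

Interning the same text twice returns pointers to the same allocation, so repeated strings cost one allocation plus a reference count.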
Together, pattern matching, context analysis, and confidence scoring let Stringy automatically identify and categorize the most relevant strings in a binary, so analysts can start with the highest-value findings.