Output Formatting And Tag Extraction#
Output Formatting and Tag Extraction in libmagic-rs is a multi-layered system that transforms raw file evaluation results into structured, human-readable text or machine-processable JSON formats. The system comprises three core components: output data structures (MatchResult, EvaluationResult, EvaluationMetadata), JSON serialization types with hex-encoded values, and GNU file-compatible text formatting. All output types derive serde Serialize and Deserialize traits, enabling integration with any serde-compatible format.
The tag enrichment algorithm extracts 16 semantic keywords (executable, archive, image, video, audio, document, compressed, encrypted, text, binary, data, script, font, database, spreadsheet, presentation) from match descriptions through case-insensitive substring matching. Tags populate the rule_path field in MatchResult, providing hierarchical classification for machine processing. When individual match messages do not yield tags, the system enriches the first match using tags extracted from the overall evaluation description.
JSON output includes hex-encoded value representations using little-endian byte ordering for integers. The system provides three JSON formatting modes: pretty-printed JSON for single-file human inspection (format_json_output), compact JSON for single-file machine processing (format_json_output_compact), and JSON Lines format for batch processing multiple files (format_json_line_output). The CLI tool rmagic automatically selects between single-file and JSON Lines formats based on the number of input files.
Architecture#
Module Organization#
The output system is organized across three primary modules:
- src/output/mod.rs – Core data structures and conversion layer, with a static LazyLock<TagExtractor> for tag enrichment
- src/output/json.rs – JSON-specific types and formatting functions with hex encoding
- src/output/text.rs – Text formatting functions producing filename: description output
Data Structure Hierarchy#
The output layer defines three core data structures that mirror but extend the evaluator layer:
MatchResult represents a single magic rule match:
pub struct MatchResult {
    pub message: String, // Human-readable description
    pub offset: usize, // Byte offset where match occurred
    pub length: usize, // Number of bytes examined
    pub value: Value, // Matched value (Bytes, String, Uint, Int)
    pub rule_path: Vec<String>, // Hierarchical tags from rule chain
    pub confidence: u8, // Score 0-100 (clamped)
    pub mime_type: Option<String>, // Optional MIME type
}
EvaluationResult wraps complete file evaluation:
pub struct EvaluationResult {
    pub filename: PathBuf,
    pub matches: Vec<MatchResult>,
    pub metadata: EvaluationMetadata,
    pub error: Option<String>,
}
EvaluationMetadata provides diagnostic information:
pub struct EvaluationMetadata {
    pub file_size: u64,
    pub evaluation_time_ms: f64,
    pub rules_evaluated: u32,
    pub rules_matched: u32,
}
Tag Enrichment System#
Keyword Set#
The tag enrichment system uses a static HashSet containing 16 classification keywords:
executable, archive, image, video, audio, document, compressed, encrypted, text, binary, data, script, font, database, spreadsheet, presentation
The TagExtractor struct wraps this keyword set and provides extraction methods. The output module instantiates a single TagExtractor via LazyLock to avoid repeated allocation.
Extraction Methods#
Two distinct methods handle tag extraction:
extract_tags(description: &str) -> Vec<String> performs case-insensitive substring matching:
pub fn extract_tags(&self, description: &str) -> Vec<String> {
    let lower = description.to_lowercase();
    let mut tags: Vec<String> = self
        .keywords
        .iter()
        .filter(|keyword| lower.contains(keyword.as_str()))
        .cloned()
        .collect();
    tags.sort();
    tags
}
For the description "PNG image data, 1920 x 1080", this returns ["data", "image"].
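The matching rule can be exercised in isolation. The following is a minimal, self-contained sketch that rebuilds the keyword set listed above; the constructor name new is an assumption made for illustration and may not match the crate's actual API. Because matching is by substring rather than by whole word, a description containing "database" also yields "data".

```rust
use std::collections::HashSet;

/// Simplified stand-in for the TagExtractor described in this section.
pub struct TagExtractor {
    keywords: HashSet<String>,
}

impl TagExtractor {
    /// Builds the extractor with the 16 keywords listed above.
    /// (`new` is a hypothetical constructor name.)
    pub fn new() -> Self {
        let keywords = [
            "executable", "archive", "image", "video", "audio", "document",
            "compressed", "encrypted", "text", "binary", "data", "script",
            "font", "database", "spreadsheet", "presentation",
        ]
        .iter()
        .map(|k| k.to_string())
        .collect();
        Self { keywords }
    }

    /// Case-insensitive substring matching. HashSet iteration order is
    /// unspecified, so the result is sorted to make it deterministic.
    pub fn extract_tags(&self, description: &str) -> Vec<String> {
        let lower = description.to_lowercase();
        let mut tags: Vec<String> = self
            .keywords
            .iter()
            .filter(|keyword| lower.contains(keyword.as_str()))
            .cloned()
            .collect();
        tags.sort();
        tags
    }
}
```

With this sketch, extract_tags("SQLite 3.x database") returns both "data" and "database", while a description with no keyword (for example "ELF 64-bit LSB") returns an empty vector.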
extract_rule_path(messages) -> Vec<String> normalizes messages into hyphenated identifiers:
pub fn extract_rule_path<'a, I>(&self, messages: I) -> Vec<String>
where
    I: IntoIterator<Item = &'a str>,
{
    messages
        .into_iter()
        .map(|msg| {
            msg.to_lowercase()
                .replace(' ', "-")
                .chars()
                .filter(|c| c.is_alphanumeric() || *c == '-')
                .collect()
        })
        .collect()
}
For "ELF 64-bit LSB executable", this returns ["elf-64-bit-lsb-executable"].
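To see how punctuation is handled, the normalization can be tried as a free function. This is a sketch under the assumption that the method needs no state from TagExtractor (the body shown above does not read self); writing it as a free function is a simplification for the example only.

```rust
/// Free-function sketch of the normalization shown above: lowercase,
/// spaces to hyphens, then drop everything that is not alphanumeric
/// or a hyphen. One output entry is produced per input message.
pub fn extract_rule_path<'a, I>(messages: I) -> Vec<String>
where
    I: IntoIterator<Item = &'a str>,
{
    messages
        .into_iter()
        .map(|msg| {
            msg.to_lowercase()
                .replace(' ', "-")
                .chars()
                .filter(|c| c.is_alphanumeric() || *c == '-')
                .collect::<String>()
        })
        .collect()
}
```

Commas and other punctuation are removed after the space replacement, so "x86-64, dynamically linked" becomes "x86-64-dynamically-linked". An empty message still produces an (empty) entry, which means the returned vector has the same length as the input.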
Enrichment Pipeline#
Tag extraction occurs at two critical conversion points:
- Per-match extraction: MatchResult::from_evaluator_match() calls extract_rule_path on each match message
- Overall enrichment: EvaluationResult::from_library_result() calls extract_tags on the overall description if the first match has an empty rule_path
This two-stage approach prefers the specific rule path identifiers and falls back to keyword extraction. The fallback applies only to the first match: later matches with an empty rule_path are left unchanged.
Text Output Formatting#
Text formatting produces output compatible with the GNU file command. The system formats results as filename: description, with multiple matches joined by ", ".
Formatting Functions#
format_text_result(result: &MatchResult) -> String returns the match message:
pub fn format_text_result(result: &MatchResult) -> String {
    result.message.clone()
}
format_text_output(results: &[MatchResult]) -> String joins multiple matches:
pub fn format_text_output(results: &[MatchResult]) -> String {
    if results.is_empty() {
        return "data".to_string(); // Default fallback for unknown files
    }
    results
        .iter()
        .map(|result| result.message.as_str())
        .collect::<Vec<&str>>()
        .join(", ")
}
format_evaluation_result(evaluation: &EvaluationResult) -> String produces complete output:
pub fn format_evaluation_result(evaluation: &EvaluationResult) -> String {
    let filename = evaluation
        .filename
        .file_name()
        .and_then(|name| name.to_str())
        .unwrap_or("unknown");
    let description = if evaluation.matches.is_empty() {
        if let Some(ref error) = evaluation.error {
            format!("ERROR: {error}")
        } else {
            "data".to_string()
        }
    } else {
        format_text_output(&evaluation.matches)
    };
    format!("{filename}: {description}")
}
Text Output Examples#
Single file, single match:
photo.png: PNG image data
Single file, multiple matches:
ls: ELF 64-bit LSB executable, x86-64, dynamically linked
No matches (unknown file type):
unknown.bin: data
No matches, with an evaluation error:
missing.txt: ERROR: File not found
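The three outcomes above (matches, no matches, error) can be reproduced with a reduced model of the output types. This is only a sketch: the structs below are trimmed stand-ins that keep just the fields the text formatter reads, and the match expression is an equivalent restatement of the nested if/else shown earlier, not the crate's source.

```rust
use std::path::PathBuf;

/// Reduced stand-in: only the field the text formatter reads.
pub struct MatchResult {
    pub message: String,
}

/// Reduced stand-in: metadata is left out because the formatter
/// shown above never reads it.
pub struct EvaluationResult {
    pub filename: PathBuf,
    pub matches: Vec<MatchResult>,
    pub error: Option<String>,
}

pub fn format_evaluation_result(evaluation: &EvaluationResult) -> String {
    // Only the final path component is printed; directories are dropped.
    let filename = evaluation
        .filename
        .file_name()
        .and_then(|name| name.to_str())
        .unwrap_or("unknown");

    let description = match (evaluation.matches.is_empty(), &evaluation.error) {
        // The error is reported only when there are no matches to show.
        (true, Some(error)) => format!("ERROR: {error}"),
        (true, None) => "data".to_string(),
        (false, _) => evaluation
            .matches
            .iter()
            .map(|m| m.message.as_str())
            .collect::<Vec<_>>()
            .join(", "),
    };
    format!("{filename}: {description}")
}
```

Note that a path such as /usr/bin/ls is printed as ls, because only file_name() is used for the prefix.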
JSON Output Formatting#
JSON Data Types#
JsonMatchResult represents a single match in JSON:
pub struct JsonMatchResult {
    pub text: String, // Match description
    pub offset: usize, // Byte offset
    pub value: String, // Hex-encoded matched bytes
    pub tags: Vec<String>, // Classification tags
    pub score: u8, // Confidence 0-100
}
JsonOutput wraps matches for single-file output:
pub struct JsonOutput {
    pub matches: Vec<JsonMatchResult>,
}
JsonLineOutput adds filename for batch processing:
pub struct JsonLineOutput {
    pub filename: String,
    pub matches: Vec<JsonMatchResult>,
}
Hex Encoding Strategy#
The format_value_as_hex(value: &Value) -> String function converts Value types to lowercase hex strings:
| Value Type | Encoding | Example |
|---|---|---|
| Bytes(vec) | Direct hex encoding of each byte | [0x7f, 0x45, 0x4c, 0x46] → "7f454c46" |
| String(s) | Hex encoding of UTF-8 bytes | "PNG" → "504e47" |
| Uint(n) | Little-endian u64 bytes (16 hex chars) | 0x1234 → "3412000000000000" |
| Int(n) | Little-endian i64 bytes (16 hex chars) | -1 → "ffffffffffffffff" |
Integers are always encoded in little-endian order regardless of the host's native endianness, so the same value produces the same hex string on every platform.
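The encoding table can be checked with a short sketch. The Value enum below is a local stand-in with the four variants named in this section, and the hex helper is a hypothetical name introduced for the example; the crate's own implementation may be structured differently.

```rust
/// Local stand-in for the Value enum, limited to the four variants
/// named in this section.
pub enum Value {
    Bytes(Vec<u8>),
    String(String),
    Uint(u64),
    Int(i64),
}

/// Hypothetical helper: lowercase, two hex digits per byte.
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

pub fn format_value_as_hex(value: &Value) -> String {
    match value {
        Value::Bytes(bytes) => hex(bytes),
        Value::String(s) => hex(s.as_bytes()),
        // to_le_bytes() always yields 8 bytes for u64/i64, so integer
        // values are 16 hex characters with the low byte first.
        Value::Uint(n) => hex(&n.to_le_bytes()),
        Value::Int(n) => hex(&n.to_le_bytes()),
    }
}
```

Because the integer is widened to 64 bits before encoding, a consumer that wants the numeric value back has to read the hex string as eight little-endian bytes rather than as a big-endian number.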
JSON Formatting Functions#
format_json_output(match_results: &[MatchResult]) -> Result<String, serde_json::Error> produces pretty-printed JSON:
{
  "matches": [
    {
      "text": "ELF 64-bit LSB executable",
      "offset": 0,
      "value": "7f454c46",
      "tags": [
        "executable",
        "elf"
      ],
      "score": 90
    }
  ]
}
format_json_output_compact(match_results: &[MatchResult]) -> Result<String, serde_json::Error> produces single-line JSON:
{"matches":[{"text":"ELF 64-bit LSB executable","offset":0,"value":"7f454c46","tags":["executable","elf"],"score":90}]}
format_json_line_output(filename: &Path, match_results: &[MatchResult]) -> Result<String, serde_json::Error> produces JSON Lines format:
{"filename":"file1.bin","matches":[{"text":"ELF executable","offset":0,"value":"7f454c46","tags":["executable"],"score":90}]}
{"filename":"file2.bin","matches":[{"text":"PNG image data","offset":0,"value":"89504e47","tags":["image"],"score":85}]}
Each file produces exactly one line, making the output suitable for streaming and line-oriented processing.
Usage Examples#
Library Integration (Rust API)#
Converting evaluator results to output format:
use std::path::Path;
use libmagic_rs::output::EvaluationResult;
use libmagic_rs::MagicDatabase;
// Evaluate a file
let db = MagicDatabase::load_default()?;
let result = db.evaluate_file("example.bin")?;
// Convert to output format
let output = EvaluationResult::from_library_result(&result, Path::new("example.bin"));
// Format as text
let text = libmagic_rs::output::text::format_evaluation_result(&output);
println!("{}", text); // "example.bin: ELF 64-bit LSB executable"
// Format as JSON
let json = libmagic_rs::output::json::format_json_output(&output.matches)?;
println!("{}", json);
Creating custom output with confidence scoring:
use libmagic_rs::output::MatchResult;
use libmagic_rs::parser::ast::Value;
let mut match_result = MatchResult::new(
    "PNG image data".to_string(),
    0,
    Value::Bytes(vec![0x89, 0x50, 0x4e, 0x47]),
);
match_result.set_confidence(95);
match_result.add_rule_path("image".to_string());
match_result.add_rule_path("png".to_string());
CLI Usage (rmagic)#
Single file, text output (default):
rmagic photo.png
# Output: photo.png: PNG image data
Single file, JSON output:
rmagic --json photo.png
# Output (pretty-printed):
# {
#   "matches": [
#     {
#       "text": "PNG image data",
#       "offset": 0,
#       "value": "89504e47",
#       "tags": ["image"],
#       "score": 85
#     }
#   ]
# }
Multiple files, JSON Lines output (automatic):
rmagic --json file1.bin file2.bin file3.bin
# Output (one line per file):
# {"filename":"file1.bin","matches":[...]}
# {"filename":"file2.bin","matches":[...]}
# {"filename":"file3.bin","matches":[...]}
Strict error handling mode:
rmagic --strict --json *.bin
# Exits with non-zero code if any file fails to evaluate
Conversion Pipeline#
Evaluator to Output Layer#
The conversion from evaluator types (crate::evaluator::MatchResult) to output types follows this structure:
evaluator::MatchResult ──from_evaluator_match()──> output::MatchResult
│                                                  │
│ message: String                                  │ message: String
│ offset: usize                                    │ offset: usize
│ confidence: f64 (0.0-1.0)  ─────>                │ confidence: u8 (0-100)
│ value: Value                                     │ value: Value
│                            ─────>                │ rule_path: Vec<String> [extracted]
│                                                  │ length: usize [calculated]
│                            ─────>                │ mime_type: Option<String>
MatchResult::from_evaluator_match() performs three key transformations:
- Confidence scaling: Converts f64 0.0-1.0 to u8 0-100 via (confidence * 100.0).min(100.0) as u8
- Tag extraction: Calls extract_rule_path() on the match message to populate rule_path
- Length calculation: Derives the length field from the Value type
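The confidence expression has a few edge cases worth seeing in isolation. The sketch below wraps the exact expression quoted above in a function; scale_confidence is a hypothetical name used only here, and the length calculation is not covered because its per-variant rules are not given in this section.

```rust
/// Wraps the scaling expression quoted above. The function name is an
/// assumption for this example.
pub fn scale_confidence(confidence: f64) -> u8 {
    // `.min(100.0)` caps the upper bound. The `as u8` cast truncates
    // toward zero and saturates, so negative inputs become 0.
    // f64::min ignores NaN, so a NaN input ends up as 100.
    (confidence * 100.0).min(100.0) as u8
}
```

The cast truncates rather than rounds, so a product that lands just below a whole number because of floating-point error is reported one point lower than a rounded value would be.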
Library to Output Layer#
EvaluationResult::from_library_result() converts library-level results:
pub fn from_library_result(
    result: &crate::EvaluationResult,
    filename: &std::path::Path,
) -> Self {
    let mut output_matches: Vec<MatchResult> = result
        .matches
        .iter()
        .map(|m| MatchResult::from_evaluator_match(m, result.mime_type.as_deref()))
        .collect();
    // Enrich the first match with tags from overall description
    if let Some(first) = output_matches.first_mut() {
        if first.rule_path.is_empty() {
            first.rule_path = DEFAULT_TAG_EXTRACTOR.extract_tags(&result.description);
        }
    }
    // Convert metadata...
}
This enrichment step gives the first match classification tags even when its own message yields no rule path. It does not modify any later match.
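The scope of the fallback is easiest to see with a reduced model. In the sketch below, MatchResult is trimmed to two fields, the keyword list is abbreviated to three entries, and enrich_first_match is a hypothetical helper that isolates the first_mut() step from the function above.

```rust
/// Trimmed stand-in for the output-layer MatchResult.
pub struct MatchResult {
    pub message: String,
    pub rule_path: Vec<String>,
}

/// Abbreviated keyword list (kept sorted so the output is sorted too);
/// the full set has 16 entries.
const KEYWORDS: [&str; 3] = ["data", "executable", "image"];

fn extract_tags(description: &str) -> Vec<String> {
    let lower = description.to_lowercase();
    KEYWORDS
        .iter()
        .filter(|k| lower.contains(**k))
        .map(|k| k.to_string())
        .collect()
}

/// Hypothetical helper isolating the enrichment step: only the first
/// match is touched, and only when its rule_path is empty.
pub fn enrich_first_match(matches: &mut [MatchResult], description: &str) {
    if let Some(first) = matches.first_mut() {
        if first.rule_path.is_empty() {
            first.rule_path = extract_tags(description);
        }
    }
}
```

A second match with an empty rule_path stays empty after this call, and a first match that already carries a rule path is not overwritten.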
Output to JSON Layer#
JsonMatchResult::from_match_result() converts to JSON format:
pub fn from_match_result(match_result: &MatchResult) -> Self {
    Self {
        text: match_result.message.clone(),
        offset: match_result.offset,
        value: format_value_as_hex(&match_result.value),
        tags: match_result.rule_path.clone(),
        score: match_result.confidence,
    }
}
The key transformation is hex encoding of the value field.
CLI Integration#
Output Format Selection#
The rmagic CLI defines mutually exclusive output format flags:
/// Output results in JSON format
#[arg(long, conflicts_with = "text")]
pub json: bool,
/// Output results in text format (default)
#[arg(long)]
pub text: bool,
The output_format() method determines the output format, defaulting to text when neither flag is specified.
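The body of output_format() is not shown here, so the following is only a guess at its shape: with two boolean flags and text as the default, checking json is sufficient. The OutputFormat enum and the plain Args struct (without the clap attributes) are assumptions made to keep the example self-contained.

```rust
/// Hypothetical result type; the real CLI may name this differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OutputFormat {
    Text,
    Json,
}

/// Plain stand-in for the CLI arguments, without the clap derive.
pub struct Args {
    pub json: bool,
    pub text: bool,
}

impl Args {
    /// Text is the default, so `--text` and "no flag" behave the same;
    /// only `--json` changes the outcome.
    pub fn output_format(&self) -> OutputFormat {
        if self.json {
            OutputFormat::Json
        } else {
            OutputFormat::Text
        }
    }
}
```

In this sketch the text field is never read: it exists so that users can state the default explicitly, while conflicts_with rejects the contradictory combination before this method runs.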
Single vs Multiple File Mode#
The CLI detects file count via args.files.len() > 1 and selects the appropriate JSON formatting function:
let json_result = if is_multiple_files {
    format_json_line_output(file_path, &output_result.matches)
} else {
    format_json_output(&output_result.matches)
};
This automatic mode selection enables efficient batch processing without requiring users to specify formatting modes.
Error Handling#
The CLI implements per-file error handling with optional strict mode:
let mut first_error: Option<LibmagicError> = None;
for file_or_stdin in &args.files {
    match process_file(file_or_stdin, &db, args) {
        Ok(()) => {}
        Err(e) => {
            eprintln!("Error processing {}: {}", file_or_stdin.filename(), e);
            if first_error.is_none() {
                first_error = Some(e);
            }
        }
    }
}
if let Some(error) = first_error {
    if args.strict {
        return Err(error);
    }
}
Without --strict, errors are printed to stderr, processing continues, and the exit code is 0. With --strict, the remaining files are still processed, but the first error recorded is returned after the loop and produces a non-zero exit code.
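The control flow can be modelled without the CLI types. In this sketch, run_all is a hypothetical function, errors are plain strings instead of LibmagicError, and per-file work is injected as a closure so the behavior can be tested.

```rust
/// Model of the loop above: every file is processed, the first error is
/// remembered, and it is returned only in strict mode.
pub fn run_all<F>(files: &[&str], strict: bool, mut process_file: F) -> Result<(), String>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut first_error: Option<String> = None;
    for &file in files {
        if let Err(e) = process_file(file) {
            eprintln!("Error processing {file}: {e}");
            // Later errors are still printed but not retained.
            if first_error.is_none() {
                first_error = Some(e);
            }
        }
    }
    match first_error {
        Some(error) if strict => Err(error),
        _ => Ok(()),
    }
}
```

Deferring the return until after the loop means strict mode changes only the exit status, not which files get processed or printed.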
Serialization Patterns#
All output types derive serde's Serialize and Deserialize traits, enabling:
- Direct JSON serialization via serde_json
- Format flexibility for any serde-compatible format (YAML, TOML, MessagePack, etc.)
- Round-trip conversion: serialized output can be deserialized back into the same types
- API integration in downstream Rust applications
The output module provides complete serialization support without tying users to JSON exclusively. Library consumers can serialize MatchResult and EvaluationResult to any format their applications require.
Relevant Code Files#
| File | Purpose | Key Types/Functions |
|---|---|---|
| src/output/mod.rs | Core output data structures and conversion layer | MatchResult, EvaluationResult, EvaluationMetadata, from_evaluator_match(), from_library_result() |
| src/output/json.rs | JSON serialization and hex encoding | JsonMatchResult, JsonOutput, JsonLineOutput, format_value_as_hex(), format_json_output(), format_json_line_output() |
| src/output/text.rs | GNU file-compatible text formatting | format_text_result(), format_text_output(), format_evaluation_result() |
| src/tags.rs | Tag extraction and rule path normalization | TagExtractor, extract_tags(), extract_rule_path() |
| src/main.rs | CLI integration and output format selection | output_format(), output_result(), error handling |
Related Topics#
- Magic Rule Evaluation – The evaluator layer produces MatchResult instances that feed the output system
- Value Types – The Value enum (Bytes, String, Uint, Int) supports diverse matched data representations
- CLI Usage – The rmagic binary integrates output formatting with user-facing command-line options
- Serialization – Serde integration enables format-agnostic output beyond JSON and text