Output Formatting And Tag Extraction#

Output Formatting and Tag Extraction in libmagic-rs is a multi-layered system that transforms raw file evaluation results into structured, human-readable text or machine-processable JSON formats. The system comprises three core components: output data structures (MatchResult, EvaluationResult, EvaluationMetadata), JSON serialization types with hex-encoded values, and GNU file-compatible text formatting. All output types derive serde Serialize and Deserialize traits, enabling integration with any serde-compatible format.

The tag enrichment algorithm extracts 16 semantic keywords (executable, archive, image, video, audio, document, compressed, encrypted, text, binary, data, script, font, database, spreadsheet, presentation) from match descriptions through case-insensitive substring matching. Tags populate the rule_path field in MatchResult, providing hierarchical classification for machine processing. When individual match messages do not yield tags, the system enriches the first match using tags extracted from the overall evaluation description.

JSON output includes hex-encoded value representations using little-endian byte ordering for integers. The system provides three JSON formatting modes: pretty-printed JSON for single-file human inspection (format_json_output), compact JSON for single-file machine processing (format_json_output_compact), and JSON Lines format for batch processing multiple files (format_json_line_output). The CLI tool rmagic automatically selects between single-file and JSON Lines formats based on the number of input files.

Architecture#

Module Organization#

The output system is organized across three primary modules:

src/output/mod.rs – Core data structures and conversion layer with static LazyLock<TagExtractor> for tag enrichment
src/output/json.rs – JSON-specific types and formatting functions with hex encoding
src/output/text.rs – Text formatting functions producing filename: description output

Data Structure Hierarchy#

The output layer defines three core data structures that mirror but extend the evaluator layer:

MatchResult represents a single magic rule match:

pub struct MatchResult {
    pub message: String, // Human-readable description
    pub offset: usize, // Byte offset where match occurred
    pub length: usize, // Number of bytes examined
    pub value: Value, // Matched value (Bytes, String, Uint, Int)
    pub rule_path: Vec<String>, // Hierarchical tags from rule chain
    pub confidence: u8, // Score 0-100 (clamped)
    pub mime_type: Option<String>, // Optional MIME type
}

EvaluationResult wraps complete file evaluation:

pub struct EvaluationResult {
    pub filename: PathBuf,
    pub matches: Vec<MatchResult>,
    pub metadata: EvaluationMetadata,
    pub error: Option<String>,
}

EvaluationMetadata provides diagnostic information:

pub struct EvaluationMetadata {
    pub file_size: u64,
    pub evaluation_time_ms: f64,
    pub rules_evaluated: u32,
    pub rules_matched: u32,
}

Tag Enrichment System#

Keyword Set#

The tag enrichment system uses a static HashSet containing 16 classification keywords:

executable, archive, image, video, audio, document, compressed, encrypted, text, binary, data, script, font, database, spreadsheet, presentation

The TagExtractor struct wraps this keyword set and provides extraction methods. The output module instantiates a single TagExtractor via LazyLock to avoid repeated allocation.

Extraction Methods#

Two distinct methods handle tag extraction:

extract_tags(description: &str) -> Vec<String> performs case-insensitive substring matching:

pub fn extract_tags(&self, description: &str) -> Vec<String> {
    let lower = description.to_lowercase();
    let mut tags: Vec<String> = self
        .keywords
        .iter()
        .filter(|keyword| lower.contains(keyword.as_str()))
        .cloned()
        .collect();
    tags.sort();
    tags
}

For the description "PNG image data, 1920 x 1080", this returns ["data", "image"].
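Because matching is by substring rather than by whole word, a keyword also matches when it appears inside a longer word. The sketch below illustrates this with a simplified stand-in for TagExtractor. The trimmed keyword list and the DEMO_EXTRACTOR name are assumptions made for the example, not the crate's actual definitions.

```rust
use std::collections::HashSet;
use std::sync::LazyLock;

/// Simplified stand-in for the crate's `TagExtractor` (illustrative only).
struct TagExtractor {
    keywords: HashSet<String>,
}

impl TagExtractor {
    fn new() -> Self {
        // Only a subset of the 16 documented keywords, to keep the sketch short.
        let keywords = ["archive", "data", "database", "image", "text"]
            .iter()
            .map(|k| (*k).to_string())
            .collect();
        Self { keywords }
    }

    /// Same logic as the `extract_tags` method shown above.
    fn extract_tags(&self, description: &str) -> Vec<String> {
        let lower = description.to_lowercase();
        let mut tags: Vec<String> = self
            .keywords
            .iter()
            .filter(|keyword| lower.contains(keyword.as_str()))
            .cloned()
            .collect();
        // HashSet iteration order is unspecified, so sorting makes the result stable.
        tags.sort();
        tags
    }
}

/// Hypothetical name; the extractor is built once on first use and then shared.
static DEMO_EXTRACTOR: LazyLock<TagExtractor> = LazyLock::new(TagExtractor::new);

fn main() {
    // Matches "data" and "image".
    println!("{:?}", DEMO_EXTRACTOR.extract_tags("PNG image data, 1920 x 1080"));
    // Matches both "data" and "database", because "database" contains "data".
    println!("{:?}", DEMO_EXTRACTOR.extract_tags("SQLite 3.x DATABASE"));
}
```

Consumers that need exact categories may want to treat overlapping tags such as data and database as a single classification.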

extract_rule_path(messages) -> Vec<String> normalizes messages into hyphenated identifiers:

pub fn extract_rule_path<'a, I>(&self, messages: I) -> Vec<String>
where
    I: IntoIterator<Item = &'a str>,
{
    messages
        .into_iter()
        .map(|msg| {
            msg.to_lowercase()
                .replace(' ', "-")
                .chars()
                .filter(|c| c.is_alphanumeric() || *c == '-')
                .collect()
        })
        .collect()
}

For "ELF 64-bit LSB executable", this returns ["elf-64-bit-lsb-executable"].
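To see how the normalization treats punctuation and repeated spaces, the per-message step can be isolated into a free function. The helper name normalize_message is hypothetical and exists only for this sketch; the body follows the closure shown above.

```rust
/// Free-function version of the per-message normalization used by
/// `extract_rule_path` (hypothetical helper name, for illustration only).
fn normalize_message(msg: &str) -> String {
    msg.to_lowercase()
        .replace(' ', "-")
        .chars()
        .filter(|c| c.is_alphanumeric() || *c == '-')
        .collect()
}

fn main() {
    // The comma is dropped; the hyphen produced by the space after it survives.
    println!("{}", normalize_message("PNG image data, 1920 x 1080"));
    // Consecutive spaces each become a hyphen; they are not collapsed.
    println!("{}", normalize_message("gzip  compressed"));
    // A message without alphanumeric characters normalizes to an empty string.
    println!("{:?}", normalize_message("(...)"));
}
```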

Enrichment Pipeline#

Tag extraction occurs at two critical conversion points:

Per-match extraction: MatchResult::from_evaluator_match() calls extract_rule_path on each match message
Overall enrichment: EvaluationResult::from_library_result() calls extract_tags on the overall description if the first match has an empty rule_path

This two-stage approach prefers specific rule path identifiers and falls back to keyword extraction. The fallback applies only to the first match: later matches whose rule_path is empty are left without tags.

Text Output Formatting#

Text formatting produces output compatible with the GNU file command. The system formats results as filename: description, with multiple matches joined by ", ".

Formatting Functions#

format_text_result(result: &MatchResult) -> String returns the match message:

pub fn format_text_result(result: &MatchResult) -> String {
    result.message.clone()
}

format_text_output(results: &[MatchResult]) -> String joins multiple matches:

pub fn format_text_output(results: &[MatchResult]) -> String {
    if results.is_empty() {
        return "data".to_string(); // Default fallback for unknown files
    }
    results
        .iter()
        .map(|result| result.message.as_str())
        .collect::<Vec<&str>>()
        .join(", ")
}

format_evaluation_result(evaluation: &EvaluationResult) -> String produces complete output:

pub fn format_evaluation_result(evaluation: &EvaluationResult) -> String {
    let filename = evaluation
        .filename
        .file_name()
        .and_then(|name| name.to_str())
        .unwrap_or("unknown");

    let description = if evaluation.matches.is_empty() {
        if let Some(ref error) = evaluation.error {
            format!("ERROR: {error}")
        } else {
            "data".to_string()
        }
    } else {
        format_text_output(&evaluation.matches)
    };

    format!("{filename}: {description}")
}

Text Output Examples#

Single file, single match:

photo.png: PNG image data

Single file, multiple matches:

ls: ELF 64-bit LSB executable, x86-64, dynamically linked

No matches (unknown file type):

unknown.bin: data

Error case:

missing.txt: ERROR: File not found
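The four cases above can be reproduced with a small self-contained sketch. The structs here are trimmed-down stand-ins that keep only the fields the text formatter reads, and the eval helper is hypothetical; the formatting logic follows format_evaluation_result as shown earlier.

```rust
use std::path::PathBuf;

/// Trimmed-down stand-ins for the output types (illustrative only).
struct MatchResult {
    message: String,
}

struct EvaluationResult {
    filename: PathBuf,
    matches: Vec<MatchResult>,
    error: Option<String>,
}

fn format_evaluation_result(evaluation: &EvaluationResult) -> String {
    // Only the final path component is printed, not the full path.
    let filename = evaluation
        .filename
        .file_name()
        .and_then(|name| name.to_str())
        .unwrap_or("unknown");

    let description = if evaluation.matches.is_empty() {
        match &evaluation.error {
            Some(error) => format!("ERROR: {error}"),
            None => "data".to_string(),
        }
    } else {
        evaluation
            .matches
            .iter()
            .map(|m| m.message.as_str())
            .collect::<Vec<_>>()
            .join(", ")
    };

    format!("{filename}: {description}")
}

/// Hypothetical helper for building inputs in this sketch.
fn eval(path: &str, messages: &[&str], error: Option<&str>) -> EvaluationResult {
    EvaluationResult {
        filename: PathBuf::from(path),
        matches: messages
            .iter()
            .map(|m| MatchResult { message: (*m).to_string() })
            .collect(),
        error: error.map(str::to_string),
    }
}

fn main() {
    let cases = [
        eval("/usr/bin/ls", &["ELF 64-bit LSB executable", "x86-64"], None),
        eval("unknown.bin", &[], None),
        eval("missing.txt", &[], Some("File not found")),
    ];
    for case in &cases {
        println!("{}", format_evaluation_result(case));
    }
}
```

One consequence of this logic is that the error field is shown only when there are no matches; if both are present, the matches win.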

JSON Output Formatting#

JSON Data Types#

JsonMatchResult represents a single match in JSON:

pub struct JsonMatchResult {
    pub text: String, // Match description
    pub offset: usize, // Byte offset
    pub value: String, // Hex-encoded matched bytes
    pub tags: Vec<String>, // Classification tags
    pub score: u8, // Confidence 0-100
}

JsonOutput wraps matches for single-file output:

pub struct JsonOutput {
    pub matches: Vec<JsonMatchResult>,
}

JsonLineOutput adds filename for batch processing:

pub struct JsonLineOutput {
    pub filename: String,
    pub matches: Vec<JsonMatchResult>,
}

Hex Encoding Strategy#

The format_value_as_hex(value: &Value) -> String function converts Value types to lowercase hex strings:

Value Type	Encoding	Example
`Bytes(vec)`	Direct hex encoding of each byte	`[0x7f, 0x45, 0x4c, 0x46]` → `"7f454c46"`
`String(s)`	Hex encoding of UTF-8 bytes	`"PNG"` → `"504e47"`
`Uint(n)`	Little-endian u64 bytes (16 hex chars)	`0x1234` → `"3412000000000000"`
`Int(n)`	Little-endian i64 bytes (16 hex chars)	`-1` → `"ffffffffffffffff"`

Using a fixed little-endian order, rather than the host's native byte order, keeps the hex output identical across platforms.
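The table rows can be checked with a minimal sketch of the encoding. The Value enum below is a local stand-in and the hex helper is hypothetical; this is one plausible way to implement the documented behavior, not necessarily how the crate does it.

```rust
/// Local stand-in for the crate's `Value` enum (illustrative only).
enum Value {
    Bytes(Vec<u8>),
    String(String),
    Uint(u64),
    Int(i64),
}

/// Lowercase hex for a byte slice, two characters per byte (hypothetical helper).
fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

/// Sketch of the encoding described in the table above.
fn format_value_as_hex(value: &Value) -> String {
    match value {
        Value::Bytes(bytes) => hex(bytes),
        Value::String(s) => hex(s.as_bytes()),
        // Integers are always written as 8 little-endian bytes (16 hex chars).
        Value::Uint(n) => hex(&n.to_le_bytes()),
        Value::Int(n) => hex(&n.to_le_bytes()),
    }
}

fn main() {
    let samples = [
        Value::Bytes(vec![0x7f, 0x45, 0x4c, 0x46]),
        Value::String("PNG".to_string()),
        Value::Uint(0x1234),
        Value::Int(-1),
    ];
    for sample in &samples {
        println!("{}", format_value_as_hex(sample));
    }
}
```

Note that a small integer such as 0x1234 reads "backwards" in the output because the least significant byte comes first.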

JSON Formatting Functions#

format_json_output(match_results: &[MatchResult]) -> Result<String, serde_json::Error> produces pretty-printed JSON:

{
  "matches": [
    {
      "text": "ELF 64-bit LSB executable",
      "offset": 0,
      "value": "7f454c46",
      "tags": [
        "executable",
        "elf"
      ],
      "score": 90
    }
  ]
}

format_json_output_compact(match_results: &[MatchResult]) -> Result<String, serde_json::Error> produces single-line JSON:

{"matches":[{"text":"ELF 64-bit LSB executable","offset":0,"value":"7f454c46","tags":["executable","elf"],"score":90}]}

format_json_line_output(filename: &Path, match_results: &[MatchResult]) -> Result<String, serde_json::Error> produces JSON Lines format:

{"filename":"file1.bin","matches":[{"text":"ELF executable","offset":0,"value":"7f454c46","tags":["executable"],"score":90}]}
{"filename":"file2.bin","matches":[{"text":"PNG image data","offset":0,"value":"89504e47","tags":["image"],"score":85}]}

Each file produces exactly one line, making the output suitable for streaming and line-oriented processing.

Usage Examples#

Library Integration (Rust API)#

Converting evaluator results to output format:

use std::path::Path;

use libmagic_rs::output::EvaluationResult;
use libmagic_rs::MagicDatabase;

// Evaluate a file
let db = MagicDatabase::load_default()?;
let result = db.evaluate_file("example.bin")?;

// Convert to output format
let output = EvaluationResult::from_library_result(&result, Path::new("example.bin"));

// Format as text
let text = libmagic_rs::output::text::format_evaluation_result(&output);
println!("{}", text); // "example.bin: ELF 64-bit LSB executable"

// Format as JSON
let json = libmagic_rs::output::json::format_json_output(&output.matches)?;
println!("{}", json);

Creating custom output with confidence scoring:

use libmagic_rs::output::MatchResult;
use libmagic_rs::parser::ast::Value;

let mut match_result = MatchResult::new(
    "PNG image data".to_string(),
    0,
    Value::Bytes(vec![0x89, 0x50, 0x4e, 0x47])
);
match_result.set_confidence(95);
match_result.add_rule_path("image".to_string());
match_result.add_rule_path("png".to_string());

CLI Usage (rmagic)#

Single file, text output (default):

rmagic photo.png
# Output: photo.png: PNG image data

Single file, JSON output:

rmagic --json photo.png
# Output (pretty-printed):
# {
#   "matches": [
#     {
#       "text": "PNG image data",
#       "offset": 0,
#       "value": "89504e47",
#       "tags": ["image"],
#       "score": 85
#     }
#   ]
# }

Multiple files, JSON Lines output (automatic):

rmagic --json file1.bin file2.bin file3.bin
# Output (one line per file):
# {"filename":"file1.bin","matches":[...]}
# {"filename":"file2.bin","matches":[...]}
# {"filename":"file3.bin","matches":[...]}

Strict error handling mode:

rmagic --strict --json *.bin
# Exits with non-zero code if any file fails to evaluate

Conversion Pipeline#

Evaluator to Output Layer#

The conversion from evaluator types (crate::evaluator::MatchResult) to output types follows this structure:

evaluator::MatchResult ──from_evaluator_match()──> output::MatchResult
      │                                                  │
      │ message: String                                  │ message: String
      │ offset: usize                                    │ offset: usize
      │ confidence: f64 (0.0-1.0)            ─────>      │ confidence: u8 (0-100)
      │ value: Value                                     │ value: Value
      │                                      ─────>      │ rule_path: Vec<String> [extracted]
      │                                                  │ length: usize [calculated]
      │                                      ─────>      │ mime_type: Option<String>

MatchResult::from_evaluator_match() performs three key transformations:

Confidence scaling: Converts f64 0.0-1.0 to u8 0-100 via (confidence * 100.0).min(100.0) as u8
Tag extraction: Calls extract_rule_path() on the match message to populate rule_path
Length calculation: Derives length field from the Value type
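The confidence scaling expression has a few edge cases worth knowing. The sketch below isolates it in a hypothetical scale_confidence helper (the name is an assumption; the expression is the one quoted above) and exercises out-of-range inputs.

```rust
/// Hypothetical helper isolating the documented scaling expression.
fn scale_confidence(confidence: f64) -> u8 {
    // `min` caps the upper bound; the `as u8` cast truncates toward zero
    // and saturates, so negative inputs become 0.
    (confidence * 100.0).min(100.0) as u8
}

fn main() {
    for c in [0.0, 0.25, 0.999, 1.0, 1.5, -0.2, f64::NAN] {
        println!("{c} -> {}", scale_confidence(c));
    }
}
```

Two details follow from Rust's semantics rather than from anything specific to this crate: the cast truncates instead of rounding (0.999 maps to 99), and f64::min returns the non-NaN operand, so a NaN confidence maps to 100 with this expression.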

Library to Output Layer#

EvaluationResult::from_library_result() converts library-level results:

pub fn from_library_result(
    result: &crate::EvaluationResult,
    filename: &std::path::Path,
) -> Self {
    let mut output_matches: Vec<MatchResult> = result
        .matches
        .iter()
        .map(|m| MatchResult::from_evaluator_match(m, result.mime_type.as_deref()))
        .collect();

    // Enrich the first match with tags from overall description
    if let Some(first) = output_matches.first_mut() {
        if first.rule_path.is_empty() {
            first.rule_path = DEFAULT_TAG_EXTRACTOR.extract_tags(&result.description);
        }
    }

    // Convert metadata...
}

This enrichment step gives the first match classification tags when its own rule_path is empty. It does not touch later matches, and the result can still be empty if the overall description contains none of the keywords.

Output to JSON Layer#

JsonMatchResult::from_match_result() converts to JSON format:

pub fn from_match_result(match_result: &MatchResult) -> Self {
    Self {
        text: match_result.message.clone(),
        offset: match_result.offset,
        value: format_value_as_hex(&match_result.value),
        tags: match_result.rule_path.clone(),
        score: match_result.confidence,
    }
}

The key transformation is hex encoding of the value field.

CLI Integration#

Output Format Selection#

The rmagic CLI defines mutually exclusive output format flags:

/// Output results in JSON format
#[arg(long, conflicts_with = "text")]
pub json: bool,

/// Output results in text format (default)
#[arg(long)]
pub text: bool,

The output_format() method determines the output format, defaulting to text when neither flag is specified.

Single vs Multiple File Mode#

The CLI detects file count via args.files.len() > 1 and selects the appropriate JSON formatting function:

let json_result = if is_multiple_files {
    format_json_line_output(file_path, &output_result.matches)
} else {
    format_json_output(&output_result.matches)
};

This automatic mode selection enables efficient batch processing without requiring users to specify formatting modes.

Error Handling#

The CLI implements per-file error handling with optional strict mode:

let mut first_error: Option<LibmagicError> = None;

for file_or_stdin in &args.files {
    match process_file(file_or_stdin, &db, args) {
        Ok(()) => {}
        Err(e) => {
            eprintln!("Error processing {}: {}", file_or_stdin.filename(), e);
            if first_error.is_none() {
                first_error = Some(e);
            }
        }
    }
}

if let Some(error) = first_error {
    if args.strict {
        return Err(error);
    }
}

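In both modes every file is processed, and each error is printed to stderr as it occurs. Without --strict, the run then finishes with exit code 0. With --strict, the first recorded error is returned once the loop completes, which produces a non-zero exit code.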
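The control flow of the loop above can be modeled without the real CLI types. In this sketch, process_file and run are hypothetical stand-ins and errors are plain strings; it is meant only to show that all files are attempted and that strict mode changes the final result, not the iteration.

```rust
/// Hypothetical stand-in for per-file processing: fails for names starting with "bad".
fn process_file(name: &str) -> Result<(), String> {
    if name.starts_with("bad") {
        Err(format!("cannot read {name}"))
    } else {
        Ok(())
    }
}

/// Mirrors the loop above: every file is attempted and only the first error is kept.
/// Returns the number of files attempted alongside the final result.
fn run(files: &[&str], strict: bool) -> (usize, Result<(), String>) {
    let mut first_error: Option<String> = None;
    let mut attempted = 0;

    for &file in files {
        attempted += 1;
        if let Err(e) = process_file(file) {
            eprintln!("Error processing {file}: {e}");
            if first_error.is_none() {
                first_error = Some(e);
            }
        }
    }

    match first_error {
        // Strict mode surfaces the first failure after all files were tried.
        Some(error) if strict => (attempted, Err(error)),
        _ => (attempted, Ok(())),
    }
}

fn main() {
    let files = ["a.bin", "bad1.bin", "bad2.bin", "b.bin"];
    println!("{:?}", run(&files, false));
    println!("{:?}", run(&files, true));
}
```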

Serialization Patterns#

All output types derive serde's Serialize and Deserialize traits, enabling:

Direct JSON serialization via serde_json
Format flexibility for any serde-compatible format (YAML, TOML, MessagePack, etc.)
Round-trip conversion, so a serialized value can be deserialized back into the same type
API integration in downstream Rust applications

The output module provides complete serialization support without tying users to JSON exclusively. Library consumers can serialize MatchResult and EvaluationResult to any format their applications require.

Relevant Code Files#

File	Purpose	Key Types/Functions
`src/output/mod.rs`	Core output data structures and conversion layer	`MatchResult`, `EvaluationResult`, `EvaluationMetadata`, `from_evaluator_match()`, `from_library_result()`
`src/output/json.rs`	JSON serialization and hex encoding	`JsonMatchResult`, `JsonOutput`, `JsonLineOutput`, `format_value_as_hex()`, `format_json_output()`, `format_json_line_output()`
`src/output/text.rs`	GNU file-compatible text formatting	`format_text_result()`, `format_text_output()`, `format_evaluation_result()`
`src/tags.rs`	Tag extraction and rule path normalization	`TagExtractor`, `extract_tags()`, `extract_rule_path()`
`src/main.rs`	CLI integration and output format selection	`output_format()`, `output_result()`, error handling

Related topics:

Magic Rule Evaluation – The evaluator layer produces MatchResult instances that feed the output system
Value Types – The Value enum (Bytes, String, Uint, Int) supports diverse matched data representations
CLI Usage – The rmagic binary integrates output formatting with user-facing command-line options
Serialization – Serde integration enables format-agnostic output beyond JSON and text