Deterministic Execution Tracing System#

The deterministic execution tracing system in insideLLMs provides reproducible, contract-validated traces of model execution. It enables reliable comparison of model behaviours, detection of regressions, and enforcement of structural contracts across runs. The system is composed of three main layers: event recording, configuration/contract compilation, and pure contract validation. It integrates with the model pipeline via trace-aware middleware and surfaces trace drift and contract violations in CLI reports.

Design Overview#

Ordered Event Recording#

The core of the tracing system is the TraceRecorder, which records a sequence of trace events during model execution. Each event is assigned a deterministic sequence number, independent of wall-clock time, ensuring that traces are stable and comparable across runs. Event kinds include generate_start, generate_end, chat_start, chat_end, stream_start, stream_chunk, stream_end, tool_call_start, tool_result, error, and others. Each event captures a payload with relevant data for that step in the execution flow.
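
The ordering idea can be illustrated with a minimal sketch. It is not the library's TraceRecorder: the names SketchEvent and SketchRecorder are hypothetical, and the only point shown is that the sequence number comes from the event count rather than from a clock.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class SketchEvent:
    """Hypothetical event shape: a sequence number, a kind, and a payload."""
    seq: int
    kind: str
    payload: dict


@dataclass
class SketchRecorder:
    """Hypothetical recorder; the real TraceRecorder may differ."""
    run_id: str
    events: list = field(default_factory=list)

    def record(self, kind: str, payload: Optional[dict] = None) -> SketchEvent:
        # The sequence number is the current event count, never a timestamp,
        # so two runs that emit the same events get the same numbering.
        event = SketchEvent(seq=len(self.events), kind=kind, payload=dict(payload or {}))
        self.events.append(event)
        return event


recorder = SketchRecorder(run_id="run_001")
recorder.record("generate_start", {"prompt": "Hello"})
recorder.record("generate_end", {"response": "Hi"})
print([(e.seq, e.kind) for e in recorder.events])
# prints [(0, 'generate_start'), (1, 'generate_end')]
```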

Tool-Using Agent Tracing:

The tracing system is tightly integrated with tool-using agent probes via the AgentProbe class. When running an agent that calls tools (e.g., search, summarize), AgentProbe records each tool call and result as trace events, enabling fine-grained analysis and contract validation of agent reasoning steps. This makes it possible to enforce and audit the sequence and correctness of tool invocations in complex agent workflows.

See tracing.py.

TraceConfig Dataclasses#

Trace behaviour is configured via the TraceConfig dataclass, which is loaded from YAML and compiled into internal validator formats. The main configuration classes are:

TraceConfig: Top-level configuration, including version, enabled flag, store, fingerprint, contracts, and violation handling.
TraceStoreConfig: Controls trace data persistence (mode: none, compact, full), event limits, payload inclusion, and legacy redaction.
FingerprintConfig: Enables/disables fingerprinting, selects hash algorithm, and configures payload normalisation.
NormaliserConfig: Specifies how payloads are normalised before fingerprinting (drop keys, regex drops, hash paths, string length hashing).
TraceContractsConfig: Enables/disables contract validators for generate, stream, tool payloads, tool order, and tool results, with detailed sub-configurations.
OnViolationConfig: Configures how violations are handled (e.g., record, fail-fast).

These configuration classes, along with validation helpers and the AgentProbe class for tool-using agents, are now part of the public API and can be imported directly from the package. Trace configuration is used to control contract validation for both standard probes and agent probes.
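
As a rough illustration of how nested configuration dataclasses of this kind can be built from a parsed YAML mapping, consider the sketch below. The class names, default values, and the from_dict helper are assumptions made for the example; they cover only a subset of the fields listed above and are not the actual TraceConfig implementation.

```python
from dataclasses import dataclass, field
from typing import Any

ALLOWED_MODES = {"none", "compact", "full"}


@dataclass
class SketchStoreConfig:
    mode: str = "full"             # illustrative default
    max_events: int = 1000         # illustrative default
    include_payloads: bool = True  # illustrative default


@dataclass
class SketchFingerprintConfig:
    enabled: bool = True
    algorithm: str = "sha256"


@dataclass
class SketchTraceConfig:
    version: int = 1
    enabled: bool = True
    store: SketchStoreConfig = field(default_factory=SketchStoreConfig)
    fingerprint: SketchFingerprintConfig = field(default_factory=SketchFingerprintConfig)

    @classmethod
    def from_dict(cls, raw: dict) -> "SketchTraceConfig":
        # `raw` is the mapping found under the top-level `trace:` key,
        # restricted to the fields this sketch knows about.
        store = SketchStoreConfig(**raw.get("store", {}))
        if store.mode not in ALLOWED_MODES:
            raise ValueError(f"unknown store mode: {store.mode!r}")
        fingerprint = SketchFingerprintConfig(**raw.get("fingerprint", {}))
        return cls(
            version=raw.get("version", 1),
            enabled=raw.get("enabled", True),
            store=store,
            fingerprint=fingerprint,
        )


config = SketchTraceConfig.from_dict(
    {"version": 1, "store": {"mode": "compact", "max_events": 500}}
)
print(config.store.mode, config.store.max_events, config.fingerprint.algorithm)
```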

See trace_config.py.

YAML Configuration Format#

A typical YAML configuration for tracing might look like:

```yaml
trace:
  version: 1
  enabled: true
  store:
    mode: full
    max_events: 1000
    include_payloads: true
    redact:
      enabled: false
  fingerprint:
    enabled: true
    algorithm: sha256
    normaliser:
      kind: builtin
      name: structural_v1
      config:
        drop_keys: ["request_id", "timestamp", "latency_ms"]
        hash_paths: ["result", "raw"]
        hash_strings_over: 512
  contracts:
    enabled: true
    fail_fast: false
    generate_boundaries:
      enabled: true
    stream_boundaries:
      enabled: true
    tool_results:
      enabled: true
    tool_payloads:
      enabled: true
      tools:
        my_tool:
          args_schema:
            required: ["input"]
            properties:
              input:
                type: string
    tool_order:
      enabled: true
      must_precede:
        my_tool: ["other_tool"]
      forbidden_sequences:
        - ["my_tool", "my_tool"]
  on_violation:
    mode: record
```

This configuration records full traces, fingerprints them with SHA-256 after structural payload normalisation, enables all five contract validators, and records violations instead of failing fast.
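
To make the tool_payloads entry concrete, the sketch below applies the args_schema for my_tool from the configuration above to a few argument dictionaries. The function name, the type mapping, and the message strings are hypothetical, and only required keys and a small subset of types are checked; the library's validate_tool_payloads may behave differently.

```python
from typing import Any

# Mirrors the args_schema shape from the YAML configuration above.
TOOL_SCHEMAS = {
    "my_tool": {
        "required": ["input"],
        "properties": {"input": {"type": "string"}},
    }
}

# Assumed mapping from schema type names to Python types (subset only).
TYPE_MAP = {"string": str, "boolean": bool, "object": dict, "array": list}


def sketch_check_tool_args(tool_name: str, args: dict) -> list:
    """Return a list of problems; an empty list means the payload passes."""
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        return []  # no schema configured for this tool, nothing to check
    problems = []
    for key in schema.get("required", []):
        if key not in args:
            problems.append(f"missing required argument '{key}'")
    for key, spec in schema.get("properties", {}).items():
        expected = TYPE_MAP.get(spec.get("type"))
        if key in args and expected is not None and not isinstance(args[key], expected):
            problems.append(f"argument '{key}' should be of type {spec['type']}")
    return problems


print(sketch_check_tool_args("my_tool", {"input": "hello"}))  # prints []
print(sketch_check_tool_args("my_tool", {}))
print(sketch_check_tool_args("my_tool", {"input": 3}))
```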

Trace Contract Compilation and Validation#

Trace contracts are compiled from the YAML configuration using the to_contracts() method of TraceConfig. This produces validator inputs such as tool schemas, tool order rules, toggles for each validator, and the fail_fast flag. Validation is performed by validate_with_config(events, config), which runs only the enabled validators and returns a list of violations sorted by event sequence. Each validator checks a specific contract:

validate_generate_boundaries: Ensures generate_start and generate_end events are properly paired and not nested.
validate_stream_boundaries: Checks stream_start precedes stream_chunk and stream_end, chunk indices are sequential, and no chunks occur after stream_end.
validate_tool_payloads: Validates tool call payloads against schemas, ensuring required arguments are present and types match.
validate_tool_order: Enforces ordering constraints on tool calls (must_precede, must_follow, forbidden sequences).
validate_tool_results: Ensures each tool_call_start has a corresponding tool_result and that results do not appear before calls.

Violations are represented by the Violation dataclass, which includes a code, event sequence, detail, event kind, and context. Violation codes are standardised (e.g., STREAM_NO_START, TOOL_INVALID_ARGUMENTS, GENERATE_NO_END).
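
The following sketch shows what a pure boundary validator of this kind might look like. It is an illustration rather than the code in trace_contracts.py: the SketchViolation class and the event dictionaries are assumed shapes, and of the codes used only GENERATE_NO_END appears in this document; GENERATE_NESTED and GENERATE_NO_START are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class SketchViolation:
    """Assumed shape, following the fields described above."""
    code: str
    event_seq: int
    detail: str
    event_kind: str
    context: dict = field(default_factory=dict)


def sketch_validate_generate_boundaries(events: list) -> list:
    """Check that generate_start/generate_end events are paired and not nested."""
    violations = []
    open_start = None  # the currently unmatched generate_start, if any
    for event in events:
        seq, kind = event["seq"], event["kind"]
        if kind == "generate_start":
            if open_start is not None:
                violations.append(SketchViolation(
                    "GENERATE_NESTED", seq, "generate_start inside an open generate", kind))
            open_start = event
        elif kind == "generate_end":
            if open_start is None:
                violations.append(SketchViolation(
                    "GENERATE_NO_START", seq, "generate_end without generate_start", kind))
            open_start = None
    if open_start is not None:
        violations.append(SketchViolation(
            "GENERATE_NO_END", open_start["seq"], "generate_start never closed", "generate_start"))
    # Return violations ordered by event sequence, as described above.
    return sorted(violations, key=lambda v: v.event_seq)


events = [
    {"seq": 0, "kind": "generate_start"},
    {"seq": 1, "kind": "generate_start"},
    {"seq": 2, "kind": "generate_end"},
    {"seq": 3, "kind": "generate_end"},
    {"seq": 4, "kind": "generate_start"},
]
codes = [v.code for v in sketch_validate_generate_boundaries(events)]
print(codes)
# prints ['GENERATE_NESTED', 'GENERATE_NO_START', 'GENERATE_NO_END']
```

Because the function only reads its input and returns a new list, it can be run repeatedly on stored traces without side effects.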

AgentProbe Integration:

The AgentProbe class for tool-using agents automatically records tool call and result events and validates them using the trace contract system. This enables contract enforcement and drift detection for agent workflows involving multiple tool invocations.

The validate_with_config helper is available in the public API for custom validation workflows.

See trace_contracts.py.

Payload Redaction and Normalisation#

Payload redaction and normalisation are critical for deterministic fingerprinting and privacy. The TracePayloadNormaliser applies event-kind-aware transformations to payloads before fingerprinting, including dropping specified keys, hashing large blobs or specific paths, summarising stream chunks, and applying legacy redaction if configured. This ensures that volatile or sensitive fields (like timestamps or request IDs) do not affect trace fingerprints or leak in stored traces.
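
A simplified sketch of two of these transformations, dropping volatile keys and hashing long strings, is shown below. The function name and constants are hypothetical and mirror the drop_keys and hash_strings_over settings from the YAML example; path-based hashing, stream chunk summaries, and legacy redaction are left out, so this is not the TracePayloadNormaliser itself.

```python
import hashlib
from typing import Any

# Values taken from the YAML example; the names are assumptions.
DROP_KEYS = {"request_id", "timestamp", "latency_ms"}
HASH_STRINGS_OVER = 512


def sketch_normalise(value: Any) -> Any:
    """Recursively drop volatile keys and replace long strings with a digest."""
    if isinstance(value, dict):
        return {k: sketch_normalise(v) for k, v in value.items() if k not in DROP_KEYS}
    if isinstance(value, list):
        return [sketch_normalise(v) for v in value]
    if isinstance(value, str) and len(value) > HASH_STRINGS_OVER:
        digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
        return f"sha256:{digest}"
    return value


run_a = {"request_id": "r-1", "timestamp": 1, "result": {"text": "ok", "latency_ms": 12}}
run_b = {"request_id": "r-2", "timestamp": 2, "result": {"text": "ok", "latency_ms": 99}}
print(sketch_normalise(run_a) == sketch_normalise(run_b))  # prints True
```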

Trace Fingerprinting and Drift Detection#

The trace_fingerprint function computes a stable SHA-256 hash of the canonical JSON serialisation of the trace events after normalisation. This fingerprint makes trace drift detectable: if two runs on the same input produce different fingerprints, their execution traces differ, even if the output text is identical. This is essential for regression detection and reproducibility.
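
The idea of hashing a canonical serialisation can be sketched with the standard library alone. The function below is an assumption-laden stand-in, not the library's trace_fingerprint: it takes already-normalised events as plain dictionaries, and the "sha256:" prefix follows the format shown in the CLI output later in this document.

```python
import hashlib
import json


def sketch_trace_fingerprint(events: list) -> str:
    """Hash a canonical JSON form so that key order does not affect the result."""
    canonical = json.dumps(events, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


events_a = [{"seq": 0, "kind": "generate_start", "payload": {"prompt": "Hi"}}]
events_b = [{"payload": {"prompt": "Hi"}, "kind": "generate_start", "seq": 0}]
events_c = [{"seq": 0, "kind": "generate_start", "payload": {"prompt": "Hello"}}]

fp_a = sketch_trace_fingerprint(events_a)
fp_b = sketch_trace_fingerprint(events_b)
fp_c = sketch_trace_fingerprint(events_c)

print(fp_a == fp_b)  # prints True: same content, different key order
print(fp_a == fp_c)  # prints False: the payload changed, so the trace drifted
```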

Using Trace-Aware Middleware#

To capture execution traces, add the TraceMiddleware to your model pipeline. This middleware wraps generate, chat, and stream operations, recording all relevant events in a deterministic sequence.

Example: Synchronous Usage

```python
from insideLLMs.pipeline import ModelPipeline, TraceMiddleware
from insideLLMs.models import OpenAIModel

trace_mw = TraceMiddleware(run_id="run_001")
pipeline = ModelPipeline(OpenAIModel("gpt-4"), middlewares=[trace_mw])

response = pipeline.generate("Hello, world!")
events = trace_mw.recorder.events
print(f"Captured {len(events)} trace events")
```

Example: Asynchronous Usage

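```python
import asyncio

async def main():
    # Uses the same imports as the synchronous example.
    trace_mw = TraceMiddleware(run_id="run_002")
    pipeline = ModelPipeline(OpenAIModel("gpt-4"), middlewares=[trace_mw])
    response = await pipeline.agenerate("Hello, async world!")
    return trace_mw.recorder.events

events = asyncio.run(main())
```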

The TraceMiddleware exposes a .recorder property for accessing the recorded events. You can reset the recorder for a new execution with trace_mw.reset(run_id="new_run").
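
Conceptually, a trace-aware wrapper records a start event before delegating to the model and an end or error event afterwards. The sketch below shows that pattern for generate only. SketchTraceMiddleware and wrap_generate are hypothetical names, and the real TraceMiddleware plugs into ModelPipeline rather than wrapping a bare callable.

```python
from typing import Callable


class SketchTraceMiddleware:
    """Hypothetical stand-in that wraps a generate callable with trace events."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id
        self.events = []

    def _record(self, kind: str, payload: dict) -> None:
        # Sequence numbers follow recording order, not wall-clock time.
        self.events.append({"seq": len(self.events), "kind": kind, "payload": payload})

    def wrap_generate(self, generate: Callable[[str], str]) -> Callable[[str], str]:
        def traced(prompt: str) -> str:
            self._record("generate_start", {"prompt": prompt})
            try:
                response = generate(prompt)
            except Exception as exc:
                self._record("error", {"type": type(exc).__name__})
                raise
            self._record("generate_end", {"response": response})
            return response
        return traced

    def reset(self, run_id: str) -> None:
        # Start a fresh trace for a new execution.
        self.run_id = run_id
        self.events = []


mw = SketchTraceMiddleware(run_id="run_001")
traced_generate = mw.wrap_generate(lambda prompt: prompt.upper())
result = traced_generate("hello")
kinds = [e["kind"] for e in mw.events]
print(result, kinds)
# prints HELLO ['generate_start', 'generate_end']
```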

AgentProbe: Tool-Using Agent Tracing#

The AgentProbe class provides a framework-agnostic way to test tool-using LLM agents with full trace integration. It runs a tool-calling loop, records tool calls and results as trace events, and validates the trace using the configured contracts. This enables deterministic, contract-validated testing of agent workflows that involve multiple tool invocations.

Example Usage:

```python
from insideLLMs import AgentProbe, ToolDefinition, TraceConfig

# Define available tools
search_tool = ToolDefinition(
    name="search",
    description="Search for information",
    parameters={"type": "object", "required": ["query"], "properties": {"query": {"type": "string"}}},
    handler=lambda args: {"results": ["result1", "result2"]},
)

# Create an agent probe with trace config
probe = AgentProbe(
    tools={"search": search_tool.handler},
    trace_config=TraceConfig(),
)

# Run the probe on a model and input
# (`model` is a model instance constructed beforehand; it is not defined in this snippet)
result = probe.run(model, {"question": "Find Python tutorials"})

# Access trace events and violations
trace_events = result.metadata["custom"]["trace"]["events"]
violations = result.metadata["custom"]["trace"]["violations"]
```

The AgentProbe automatically records all tool calls and results, validates the trace according to the configured contracts, and surfaces any violations or trace drift in the result metadata. This makes it easy to enforce and audit agent reasoning steps in CI/CD workflows.

CLI: Trace Drift and Violation Reports#

The CLI provides several commands for working with deterministic traces:

run / harness: Executes experiments and writes run artefacts (including manifest.json, records.jsonl, and config.resolved.yaml) to a run directory.
validate: Validates a run directory, checking that traces and outputs conform to schemas and contracts.
diff: Compares two run directories, reporting regressions, improvements, changes, trace drifts, and trace violation increases.

New CLI Flags for CI Enforcement#

The diff command now supports additional flags for stricter CI/CD enforcement:

--fail-on-trace-drift: Exit with a nonzero status if any trace fingerprints differ between baseline and candidate runs (even if output text is unchanged). This enforces strict behavioural determinism.
--fail-on-trace-violations: Exit with a nonzero status if the number of trace contract violations increases for any record in the candidate run compared to the baseline. This helps catch regressions in structural correctness.

These flags can be combined with existing options to enforce trace-based quality gates in automated workflows.

Interpreting Trace Drift and Violation Reports#

When running insidellms diff <run_dir_a> <run_dir_b>, the CLI compares trace fingerprints and violation counts for each record (identified by model, probe, and example). Output sections include:

Trace Drifts: Indicates records where the trace fingerprint changed between runs, e.g.:
```
Trace Drifts
  gpt-4 | logic | example 42: trace sha256:abc123... -> sha256:def456...
```
This means the execution trace structure changed, even if the output text did not.
Trace Violation Increases: Indicates records where the number of contract violations increased, e.g.:
```
Trace Violation Increases
  gpt-4 | logic | example 42: violations 0 -> 2
```
This means the candidate run produced more contract violations than the baseline.
Other Sections: Regressions, improvements, and changes are also reported, along with missing or new records.

You can limit the number of items shown with --limit, and use --fail-on-trace-drift or --fail-on-trace-violations to make the CLI exit nonzero if such issues are detected.
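
The comparison behind these report sections can be approximated as follows. This is a sketch under stated assumptions, not the CLI's implementation: records are modelled as dictionaries keyed by (model, probe, example), the function names are hypothetical, and the exit code 1 is an assumed choice for "nonzero".

```python
def sketch_diff(baseline: dict, candidate: dict) -> tuple:
    """Return (drifted keys, keys with more violations) for records present in both runs."""
    drifts, increases = [], []
    for key in sorted(baseline.keys() & candidate.keys()):
        base, cand = baseline[key], candidate[key]
        if base["fingerprint"] != cand["fingerprint"]:
            drifts.append(key)
        if cand["violations"] > base["violations"]:
            increases.append(key)
    return drifts, increases


def sketch_exit_code(drifts, increases, fail_on_trace_drift=False, fail_on_trace_violations=False) -> int:
    # Mirrors the intent of --fail-on-trace-drift and --fail-on-trace-violations.
    if fail_on_trace_drift and drifts:
        return 1
    if fail_on_trace_violations and increases:
        return 1
    return 0


baseline = {("gpt-4", "logic", 42): {"fingerprint": "sha256:abc123", "violations": 0}}
candidate = {("gpt-4", "logic", 42): {"fingerprint": "sha256:def456", "violations": 2}}

drifts, increases = sketch_diff(baseline, candidate)
print(drifts, increases)
print(sketch_exit_code(drifts, increases, fail_on_trace_drift=True))  # prints 1
```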

Purpose and Rationale#

Ordered Event Recording ensures that traces are stable, reproducible, and comparable across runs, independent of timing or concurrency.
Payload Redaction/Normalisation removes volatile or sensitive data, enabling deterministic fingerprinting and privacy.
Contract Validation enforces structural invariants on traces, catching regressions and unexpected behaviours early.
Trace Drift Detection provides a robust signal for behavioural changes, even when outputs are superficially similar.

For further details, see the relevant modules: tracing.py, trace_config.py, trace_contracts.py, and pipeline.py.