Probes Framework

Overview

The insideLLMs probes framework provides a structured, extensible system for evaluating large language models (LLMs) and agent behaviors. A probe is a modular evaluation unit that tests one aspect of model behavior, such as logic, bias, factuality, or instruction following, by presenting inputs to a model and analyzing its outputs. The framework supports deterministic, reproducible experiments, batch execution, and reporting of results as JSON, Markdown, HTML, or CSV.

Base Probe Classes

All probes inherit from the abstract Probe base class, which defines the core interface for running evaluations, scoring results, and reporting metadata. The framework provides several base probe classes to support different evaluation paradigms:

Probe: The foundational abstract class. Subclasses must implement the run(model, data, **kwargs) method, which executes the probe on a model with given data and returns results. The base class also provides run_batch for batch execution, score for aggregate metrics, and info for probe metadata.
ScoredProbe: Extends Probe for tasks with reference answers and correctness evaluation. Subclasses implement evaluate_single(model_output, reference, input_data), which compares model output to a reference and returns evaluation metrics (e.g., is_correct, score). The score method aggregates correctness and accuracy.
ComparativeProbe: Extends Probe for comparative tasks, such as bias detection or A/B testing. Subclasses implement run_comparison(model, input_a, input_b, **kwargs) and compare_responses(response_a, response_b, input_a, input_b), returning comparison metrics.
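
The relationship between the base classes can be sketched in plain Python. This is a minimal illustration of the pattern described above, not the insideLLMs source: the class names `SketchProbe`, `SketchScoredProbe`, `ExactMatchProbe`, and `EchoModel` are hypothetical, and the `{"input": ..., "reference": ...}` item shape is an assumption made for the example.

```python
from abc import ABC, abstractmethod


class SketchProbe(ABC):
    """Hypothetical stand-in for the Probe interface described above."""

    @abstractmethod
    def run(self, model, data, **kwargs):
        """Execute the probe on one input and return its result."""

    def run_batch(self, model, dataset, **kwargs):
        # Batch execution is run() applied to each item, in order.
        return [self.run(model, item, **kwargs) for item in dataset]


class SketchScoredProbe(SketchProbe):
    """Hypothetical stand-in for ScoredProbe: compares output to a reference."""

    @abstractmethod
    def evaluate_single(self, model_output, reference, input_data):
        """Return evaluation metrics for one output, e.g. {'is_correct': bool}."""

    def run(self, model, data, **kwargs):
        output = model.generate(data["input"])
        return self.evaluate_single(output, data["reference"], data)

    def score(self, results):
        # Aggregate per-item correctness into an accuracy figure.
        correct = sum(1 for r in results if r["is_correct"])
        return {"accuracy": correct / len(results) if results else 0.0}


class ExactMatchProbe(SketchScoredProbe):
    def evaluate_single(self, model_output, reference, input_data):
        return {"is_correct": model_output.strip() == reference.strip()}


class EchoModel:
    """Toy model that upper-cases its prompt."""

    def generate(self, prompt):
        return prompt.upper()


probe = ExactMatchProbe()
results = probe.run_batch(
    EchoModel(),
    [{"input": "a", "reference": "A"}, {"input": "b", "reference": "x"}],
)
summary = probe.score(results)
```

The subclass only supplies `evaluate_single`; batching and aggregation come from the base classes, which is the division of labor the descriptions above suggest.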

Specialized Probes

Specialized probes inherit from the base classes and implement domain-specific evaluation logic.

BiasProbe

Detects unfair or discriminatory model behavior across dimensions such as gender, race, age, and political bias. Inherits from ComparativeProbe. It is configured with a bias_dimension (e.g., "gender") and can perform sentiment analysis on responses. The probe compares model outputs for prompt pairs differing only in a protected characteristic, analyzing metrics like length difference, word overlap, and sentiment difference.
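
Two of the comparison metrics named above, length difference and word overlap, can be illustrated as follows. This is a sketch under assumptions: the function below is hypothetical, it takes only the two responses, and it measures word overlap as Jaccard similarity, which may differ from what BiasProbe actually computes. Sentiment difference is omitted because it needs a sentiment model.

```python
def compare_responses(response_a: str, response_b: str) -> dict:
    """Illustrative comparison metrics for a pair of responses."""
    words_a = set(response_a.lower().split())
    words_b = set(response_b.lower().split())
    union = words_a | words_b
    # Jaccard similarity: shared words divided by all distinct words.
    overlap = len(words_a & words_b) / len(union) if union else 1.0
    return {
        "length_diff": len(response_a) - len(response_b),
        "word_overlap": overlap,
    }


# Responses to two prompts that differ only in a gendered pronoun.
metrics = compare_responses(
    "He is a strong candidate",
    "She is a strong candidate",
)
```

A large length difference or a low overlap between responses to otherwise identical prompts is a signal worth inspecting, not proof of bias on its own.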

LogicProbe

Tests zero-shot logical reasoning, including deductive reasoning, mathematical logic, syllogisms, and puzzles. Inherits from ScoredProbe. It formats logic problems using a prompt template, sends them to the model, and evaluates correctness by extracting and comparing the final answer to a reference. Additional metrics include reasoning presence and response length.
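
The extract-and-compare step might look like the sketch below. The extraction rule (take the text after the last "Answer:" marker, else the last non-empty line) is an assumption chosen for illustration; the actual heuristics in LogicProbe may differ, and both function signatures here are simplified.

```python
import re


def extract_final_answer(response: str) -> str:
    """Return the text after the last 'Answer:' marker, else the last non-empty line."""
    matches = re.findall(r"answer\s*[:=]\s*(.+)", response, flags=re.IGNORECASE)
    if matches:
        return matches[-1].strip().rstrip(".")
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return lines[-1].rstrip(".") if lines else ""


def evaluate_single(model_output: str, reference: str) -> dict:
    """Compare the extracted answer to the reference, ignoring case."""
    answer = extract_final_answer(model_output)
    return {
        "is_correct": answer.lower() == reference.strip().lower(),
        "extracted_answer": answer,
        "response_length": len(model_output),
    }


result = evaluate_single(
    "The differences are 3, 5, 7, so the next is 9.\nAnswer: 25.",
    "25",
)
```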

FactualityProbe

Assesses factual accuracy by running factual questions against a model and collecting responses. Inherits from Probe. It formats prompts to encourage factual answers and extracts direct answers from model responses for evaluation.

InstructionFollowingProbe

Evaluates the model’s ability to follow explicit instructions, including format compliance, length constraints, content restrictions, and multi-step task completion. Inherits from ScoredProbe. It builds prompts with explicit constraints and evaluates compliance against those constraints, supporting both strict and averaged scoring modes.
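
The difference between strict and averaged scoring can be shown with a small example. This sketch assumes that strict mode means "all constraints must pass" and averaged mode means "fraction of constraints passed"; the function name and the check names are hypothetical.

```python
def score_compliance(checks: dict, strict: bool) -> float:
    """Combine per-constraint pass/fail flags into one score."""
    if not checks:
        return 1.0
    if strict:
        # Strict mode: any failed constraint zeroes the score.
        return 1.0 if all(checks.values()) else 0.0
    # Averaged mode: fraction of constraints satisfied.
    return sum(checks.values()) / len(checks)


response = "- apples\n- pears\n- plums"
checks = {
    "is_bulleted": all(line.startswith("- ") for line in response.splitlines()),
    "max_20_chars": len(response) <= 20,
    "avoids_word_banana": "banana" not in response.lower(),
}
strict_score = score_compliance(checks, strict=True)
averaged_score = score_compliance(checks, strict=False)
```

Here the response breaks only the length constraint, so the strict score is zero while the averaged score gives partial credit.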

Other Specialized Probes

MultiStepTaskProbe: Tests multi-step task completion, checking for step decomposition, context maintenance, and coherence.
ConstraintComplianceProbe: Tests compliance with constraints like word/character/sentence limits or custom validators.
AgentProbe: Evaluates tool-using LLM agents, integrating with a tracing system for deterministic execution and contract validation.
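
The limit checks listed for ConstraintComplianceProbe can be sketched as a single function. The function name, parameter names, and the naive sentence splitter are assumptions made for illustration and are not taken from the library.

```python
import re
from typing import Callable, List, Optional


def find_violations(
    text: str,
    max_words: Optional[int] = None,
    max_chars: Optional[int] = None,
    max_sentences: Optional[int] = None,
    validator: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Return the names of the constraints that `text` violates."""
    violations = []
    if max_words is not None and len(text.split()) > max_words:
        violations.append("max_words")
    if max_chars is not None and len(text) > max_chars:
        violations.append("max_chars")
    if max_sentences is not None:
        # Naive split on ., ! and ?; abbreviations will be miscounted.
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        if len(sentences) > max_sentences:
            violations.append("max_sentences")
    if validator is not None and not validator(text):
        violations.append("custom_validator")
    return violations


text = "One two three. Four five."
violations = find_violations(
    text,
    max_words=4,
    max_sentences=2,
    validator=lambda t: "six" not in t.lower(),
)
```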

AgentProbe

AgentProbe is a specialized probe for evaluating tool-using LLM agents. It supports deterministic trace recording, contract validation, and detailed analysis of agent behaviors involving tool calls, tool results, and final answers. AgentProbe integrates with the insideLLMs tracing system, enabling CI/CD workflows that enforce behavioral determinism and trace contract compliance.

Key Features:

Runs a configurable agent loop that parses model outputs as JSON actions (tool calls or final answers)
Supports a dictionary of tool functions, which are invoked when the agent requests a tool
Records all agent actions, tool calls, and results via the TraceRecorder, producing a deterministic trace
Validates traces against configurable contracts (e.g., tool argument schemas, tool call order, required results)
Supports trace redaction and canonicalization for CI/CD and reproducibility
Returns rich metadata, including trace fingerprints and violation details, in each result
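
The agent loop described in the first three features can be sketched without the library. Everything in this block is hypothetical: the JSON action shapes (`{"tool": ..., "args": ...}` and `{"final_answer": ...}`), the trace event fields, and the `run_agent` function are assumptions for illustration, and the real TraceRecorder format may differ.

```python
import json


def run_agent(model, tools, question, max_steps=4):
    """Hypothetical agent loop: returns (final_answer, trace)."""
    trace = []
    context = question
    for step in range(max_steps):
        raw = model(context)
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            trace.append({"step": step, "kind": "error", "detail": "invalid JSON"})
            break
        if "final_answer" in action:
            trace.append({"step": step, "kind": "final", "answer": action["final_answer"]})
            return action["final_answer"], trace
        name, args = action["tool"], action.get("args", {})
        trace.append({"step": step, "kind": "tool_call", "tool": name, "args": args})
        if name not in tools:
            trace.append({"step": step, "kind": "error", "detail": f"unknown tool {name!r}"})
            break
        result = tools[name](args)
        trace.append({"step": step, "kind": "tool_result", "tool": name, "result": result})
        # Feed the tool result back to the model as the next input.
        context = json.dumps({"question": question, "tool_result": result})
    return None, trace


def scripted_model(context):
    # Toy model: call the search tool first, then answer once a result is present.
    if "tool_result" in context:
        return json.dumps({"final_answer": "Cats are mammals."})
    return json.dumps({"tool": "search", "args": {"query": "cats"}})


def search_tool(args):
    return {"results": ["example result"]}


answer, trace = run_agent(scripted_model, {"search": search_tool}, "Search for cats")
```

Because the scripted model and the tool are both deterministic, repeated runs produce the same sequence of trace events, which is the property the trace-based checks rely on.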

Example Usage:

```
from insideLLMs import AgentProbe, ProbeRunner

def search_tool(args):
    return {"results": ["example result"]}

tools = {"search": search_tool}
probe = AgentProbe(tools=tools, max_steps=4)
# `model` is a model instance created elsewhere
runner = ProbeRunner(model, probe)
results = runner.run([{"question": "Search for cats"}])
# Each result includes trace metadata and any contract violations
```

AgentProbe suits agents that call tools, APIs, or functions, and pipelines that need to enforce behavioral determinism in CI/CD. Trace configuration and contract validation can be customized through the trace_config argument or through YAML config files.
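
Trace canonicalization, fingerprinting, and a tool-order contract can be approximated as below. This is a sketch of the general technique (canonical JSON plus SHA-256, and a subsequence check), not the insideLLMs tracing implementation; the event fields, the `timestamp` redaction key, and both function names are hypothetical.

```python
import hashlib
import json


def trace_fingerprint(trace, redact_keys=("timestamp",)):
    """Hash a canonical JSON form of the trace, ignoring volatile keys."""
    cleaned = [
        {key: value for key, value in event.items() if key not in redact_keys}
        for event in trace
    ]
    # Sorted keys and fixed separators make the serialization order-independent.
    canonical = json.dumps(cleaned, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_tool_order(trace, required_order):
    """Return a violation for each required tool not called in the given order."""
    called = [e["tool"] for e in trace if e.get("kind") == "tool_call"]
    position = 0
    violations = []
    for tool in required_order:
        try:
            position = called.index(tool, position) + 1
        except ValueError:
            violations.append(f"tool '{tool}' not called in required order")
    return violations


run_a = [
    {"kind": "tool_call", "tool": "search", "args": {"q": "cats"}, "timestamp": 1.0},
    {"kind": "final", "answer": "ok", "timestamp": 2.0},
]
run_b = [
    {"timestamp": 9.5, "args": {"q": "cats"}, "tool": "search", "kind": "tool_call"},
    {"timestamp": 9.9, "answer": "ok", "kind": "final"},
]
same_behavior = trace_fingerprint(run_a) == trace_fingerprint(run_b)
violations = check_tool_order(run_a, ["search", "fetch"])
```

A CI job could store the fingerprint of a known-good run and fail when a later run produces a different one or a non-empty violation list.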

Probe Configuration

Probes can be configured in several ways:

Programmatically: Instantiate probe classes directly with parameters.

```
from insideLLMs import LogicProbe
probe = LogicProbe(prompt_template="Solve: {problem}")
```

Configuration Files: Use YAML or JSON files specifying probe type and arguments.

```
probe:
  type: bias
  args:
    bias_dimension: gender
    analyze_sentiment: true
```

Registries: Discover and construct probes by name using probe_registry.

```
from insideLLMs import probe_registry
probe = probe_registry.get("logic")
```

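
How a configuration file and a registry fit together can be sketched as follows. The `Registry` and `DemoBiasProbe` classes are hypothetical stand-ins, not the `probe_registry` implementation, and the example parses JSON from a string so that it runs with the standard library alone; a YAML file with the same structure would be handled the same way after parsing.

```python
import json


class Registry:
    """Hypothetical name-to-factory registry."""

    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        if name in self._factories:
            raise ValueError(f"{name!r} is already registered")
        self._factories[name] = factory

    def get(self, name, **kwargs):
        try:
            factory = self._factories[name]
        except KeyError:
            raise KeyError(f"unknown probe type {name!r}") from None
        # Construct the probe, forwarding any configured arguments.
        return factory(**kwargs)


class DemoBiasProbe:
    """Placeholder probe that only stores its configuration."""

    def __init__(self, bias_dimension="gender", analyze_sentiment=False):
        self.bias_dimension = bias_dimension
        self.analyze_sentiment = analyze_sentiment


registry = Registry()
registry.register("bias", DemoBiasProbe)

config_text = """
{"probe": {"type": "bias",
           "args": {"bias_dimension": "gender", "analyze_sentiment": true}}}
"""
config = json.loads(config_text)
probe = registry.get(config["probe"]["type"], **config["probe"]["args"])
```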

Probe Execution

Probes are executed using runner classes:

ProbeRunner: Synchronous execution. Orchestrates probe runs on datasets, handles configuration, error recovery, progress tracking, and result aggregation.
```
from insideLLMs import ProbeRunner
runner = ProbeRunner(model, probe)
results = runner.run(dataset)
```
AsyncProbeRunner: Asynchronous execution for parallel/concurrent runs, useful for API-based models.
```
# Call from inside an async function, since `await` is not valid at module level
runner = AsyncProbeRunner(model, probe)
results = await runner.run(dataset, concurrency=10)
```
Experiment/Harness Runs: Run full experiments or cross-model harnesses from configuration files for reproducibility.
```
insidellms run experiment.yaml
insidellms harness harness.yaml
```

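
Bounded-concurrency execution with per-item error recovery, as the runners above are described as providing, can be sketched with `asyncio`. The `run_concurrently` function, the result dictionary fields, and the toy model are hypothetical; this illustrates the general pattern and does not reproduce AsyncProbeRunner.

```python
import asyncio


async def run_concurrently(model, items, concurrency=10):
    """Run an async model over items with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(item):
        async with semaphore:
            try:
                output = await model(item)
                return {"input": item, "output": output, "status": "success"}
            except Exception as exc:
                # Record the failure instead of aborting the whole run.
                return {"input": item, "output": None, "status": "error", "error": str(exc)}

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_model(prompt):
    await asyncio.sleep(0)
    if prompt == "bad":
        raise ValueError("simulated failure")
    return prompt[::-1]


results = asyncio.run(run_concurrently(fake_model, ["abc", "bad", "xyz"], concurrency=2))
```

The semaphore caps the number of simultaneous requests, which matters for API-based models with rate limits.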

Results, Reporting, and Visualization

Result Structure

ProbeResult: Captures the result of a single probe run, including input, output, status, error, latency, and metadata.
ExperimentResult: Aggregates all results for a probe/model/dataset run, including scores, timestamps, configuration, and metadata.
ProbeScore: Contains metrics such as accuracy, error rate, mean latency, and custom probe-specific metrics.
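
The shape of a per-item result and the aggregate metrics can be illustrated with a dataclass. `SketchResult` and `summarize` are hypothetical, the field names only mirror the description above, and computing accuracy over successful runs only is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Any, Optional


@dataclass
class SketchResult:
    """Hypothetical per-item result record."""

    input: Any
    output: Any = None
    status: str = "success"
    error: Optional[str] = None
    latency_ms: float = 0.0
    metadata: dict = field(default_factory=dict)


def summarize(results):
    """Aggregate accuracy, error rate, and mean latency over a run."""
    total = len(results)
    if total == 0:
        return {"accuracy": 0.0, "error_rate": 0.0, "mean_latency_ms": 0.0}
    succeeded = [r for r in results if r.status == "success"]
    correct = sum(1 for r in succeeded if r.metadata.get("is_correct"))
    return {
        # Assumption: accuracy is measured over successful runs only.
        "accuracy": correct / len(succeeded) if succeeded else 0.0,
        "error_rate": (total - len(succeeded)) / total,
        "mean_latency_ms": mean(r.latency_ms for r in results),
    }


results = [
    SketchResult("q1", "a1", latency_ms=100.0, metadata={"is_correct": True}),
    SketchResult("q2", "a2", latency_ms=200.0, metadata={"is_correct": True}),
    SketchResult("q3", "a3", latency_ms=300.0, metadata={"is_correct": False}),
    SketchResult("q4", status="error", error="timeout", latency_ms=400.0),
]
score = summarize(results)
```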

Reporting and Export

Results and reports can be exported in multiple formats:

JSON: Machine-readable, for downstream analysis.
Markdown/HTML: Human-readable reports with tables, summaries, and metrics.
CSV: For spreadsheet analysis.

Statistical reports can be generated across experiments, including metrics like accuracy, error rate, latency, and custom probe metrics. Reports can be generated via CLI or programmatically.

```
from insideLLMs.results import save_results_markdown, generate_statistical_report
save_results_markdown(results, "report.md")
report = generate_statistical_report([experiment], format="html")
```

```
insidellms report ./my_run
```

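
To make the three export formats concrete, the sketch below serializes the same rows as JSON, CSV, and a Markdown table using only the standard library. The helper names and the row fields are hypothetical; the library's own exporters (such as `save_results_markdown`) may produce different layouts.

```python
import csv
import io
import json


def to_json(rows):
    """Machine-readable export with stable key order."""
    return json.dumps(rows, indent=2, sort_keys=True)


def to_csv(rows):
    """Spreadsheet-friendly export; columns follow the first row's keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_markdown(rows):
    """Human-readable table export."""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)


rows = [
    {"input": "2+2", "output": "4", "status": "success"},
    {"input": "3*3", "output": "9", "status": "success"},
]
json_text = to_json(rows)
csv_text = to_csv(rows)
markdown_text = to_markdown(rows)
```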

Visualization

Markdown and HTML reports include experiment metadata, summary statistics, scores, timing, and results tables. Statistical reports provide overall and per-model/probe performance, confidence intervals, and rankings, with visual styling in HTML.
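
The text does not say how the confidence intervals are computed. As one common choice for an accuracy figure, the sketch below uses the Wilson score interval for a binomial proportion; treating this as the method used by the statistical reports would be an assumption, and the function name is hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total == 0:
        return (0.0, 1.0)
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half_width), min(1.0, center + half_width))


# 80 correct answers out of 100 probe items.
low, high = wilson_interval(80, 100)
```

Overlapping intervals between two models suggest that a difference in accuracy may not be meaningful at that sample size.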

Practical Usage Examples

Running a Logic Probe:

```
from insideLLMs import LogicProbe, ProbeRunner, model_registry

model = model_registry.get("openai", model_name="gpt-4o")
probe = LogicProbe()
runner = ProbeRunner(model, probe)
results = runner.run(["What comes next: 1, 4, 9, 16, ?"])
```

Running from Configuration (YAML):

```
insidellms run experiment.yaml
```

Listing Available Probes:

```
insidellms list probes
```

Generating a Report:

```
insidellms report ./my_run
```


Extensibility

To implement a custom probe, subclass Probe (or one of the other base classes, ScoredProbe or ComparativeProbe) and implement the methods that class requires:

```
from insideLLMs import Probe

class MyProbe(Probe[str]):
    def run(self, model, data, **kwargs) -> str:
        return model.generate(str(data))
```


For more details, see the insideLLMs documentation and source code.