Overview#
The insideLLMs probes framework provides a structured, extensible system for evaluating large language models (LLMs) and agent behaviors. Probes are modular evaluation units that test specific aspects of model behavior—such as logic, bias, factuality, and instruction following—by presenting inputs and analyzing outputs. The framework supports deterministic, reproducible experiments, batch execution, and comprehensive reporting and visualization of results.
Base Probe Classes#
All probes inherit from the abstract Probe base class, which defines the core interface for running evaluations, scoring results, and reporting metadata. The framework provides several base probe classes to support different evaluation paradigms:
- Probe: The foundational abstract class. Subclasses must implement the run(model, data, **kwargs) method, which executes the probe on a model with given data and returns results. The base class also provides run_batch for batch execution, score for aggregate metrics, and info for probe metadata.
- ScoredProbe: Extends Probe for tasks with reference answers and correctness evaluation. Subclasses implement evaluate_single(model_output, reference, input_data), which compares model output to a reference and returns evaluation metrics (e.g., is_correct, score). The score method aggregates correctness and accuracy.
- ComparativeProbe: Extends Probe for comparative tasks, such as bias detection or A/B testing. Subclasses implement run_comparison(model, input_a, input_b, **kwargs) and compare_responses(response_a, response_b, input_a, input_b), returning comparison metrics.
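The way run, run_batch, score, and info relate to each other can be pictured with a small stand-in. The sketch below is not the insideLLMs source: SketchProbe, UppercaseProbe, toy_model, and the dictionary shapes they return are hypothetical and only illustrate the interface described above.

```python
from abc import ABC, abstractmethod
from typing import Any


class SketchProbe(ABC):
    """Hypothetical stand-in for the abstract Probe interface (not library code)."""

    name = "sketch"

    @abstractmethod
    def run(self, model: Any, data: Any, **kwargs: Any) -> Any:
        """Execute the probe on a single input and return its result."""

    def run_batch(self, model: Any, dataset: list, **kwargs: Any) -> list:
        # Assumed behavior: apply run() to each item and record failures.
        results = []
        for item in dataset:
            try:
                output = self.run(model, item, **kwargs)
                results.append({"input": item, "output": output, "error": None})
            except Exception as exc:
                results.append({"input": item, "output": None, "error": str(exc)})
        return results

    def score(self, results: list) -> dict:
        # Aggregate metric: fraction of items that raised an error.
        errors = sum(1 for r in results if r["error"] is not None)
        return {"error_rate": errors / len(results) if results else 0.0}

    def info(self) -> dict:
        return {"name": self.name, "type": type(self).__name__}


class UppercaseProbe(SketchProbe):
    name = "uppercase"

    def run(self, model, data, **kwargs):
        return model(str(data))


def toy_model(prompt: str) -> str:
    # Stand-in model: a plain callable that rejects empty prompts.
    if not prompt:
        raise ValueError("empty prompt")
    return prompt.upper()


probe = UppercaseProbe()
batch = probe.run_batch(toy_model, ["a", "", "b"])
```

Only run is abstract here, which matches the description that subclasses supply the per-item logic while the base class handles batching and aggregation.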
Specialized Probes#
Specialized probes inherit from the base classes and implement domain-specific evaluation logic.
BiasProbe#
Detects unfair or discriminatory model behavior across dimensions such as gender, race, age, and political bias. Inherits from ComparativeProbe. It is configured with a bias_dimension (e.g., "gender") and can perform sentiment analysis on responses. The probe compares model outputs for prompt pairs differing only in a protected characteristic, analyzing metrics like length difference, word overlap, and sentiment difference.
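As a rough illustration of the comparison metrics mentioned above, the following sketch computes a length difference and a word overlap for one prompt pair. The function body, the Jaccard definition of overlap, and the metric names are assumptions made for this example, and sentiment analysis is left out entirely.

```python
def compare_responses(response_a: str, response_b: str) -> dict:
    """Hypothetical comparison metrics for a prompt pair (illustrative only)."""
    words_a = set(response_a.lower().split())
    words_b = set(response_b.lower().split())
    union = words_a | words_b
    # Jaccard overlap: shared words divided by all distinct words.
    overlap = len(words_a & words_b) / len(union) if union else 1.0
    return {
        # Signed character-count difference (A minus B).
        "length_diff": len(response_a) - len(response_b),
        "word_overlap": overlap,
    }


metrics = compare_responses("He is a strong leader", "She is a strong leader")
```

A large length difference or a low overlap for prompts that differ only in a protected characteristic is a signal worth inspecting, not proof of bias on its own.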
LogicProbe#
Tests zero-shot logical reasoning, including deductive reasoning, mathematical logic, syllogisms, and puzzles. Inherits from ScoredProbe. It formats logic problems using a prompt template, sends them to the model, and evaluates correctness by extracting and comparing the final answer to a reference. Additional metrics include reasoning presence and response length.
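One plausible way to extract a final answer and compare it to a reference is sketched below. The "Answer:" marker convention, the fallback to the last non-empty line, and the has_reasoning heuristic are assumptions for illustration. The library's actual extraction rules may differ.

```python
import re


def extract_final_answer(response: str) -> str:
    """Return the text after the last 'Answer:' marker, else the last non-empty line."""
    matches = re.findall(r"answer\s*[:=]\s*(.+)", response, flags=re.IGNORECASE)
    if matches:
        return matches[-1].strip()
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return lines[-1] if lines else ""


def evaluate_single(model_output: str, reference: str, input_data=None) -> dict:
    # Normalize case and a trailing period before comparing.
    answer = extract_final_answer(model_output).lower().rstrip(".")
    return {
        "is_correct": answer == reference.strip().lower(),
        # Crude heuristic: more than one line suggests some reasoning was shown.
        "has_reasoning": len(model_output.splitlines()) > 1,
        "response_length": len(model_output),
    }


result = evaluate_single("The squares continue: 5*5 = 25.\nAnswer: 25", "25")
```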
FactualityProbe#
Assesses factual accuracy by running factual questions against a model and collecting responses. Inherits from Probe. It formats prompts to encourage factual answers and extracts direct answers from model responses for evaluation.
InstructionFollowingProbe#
Evaluates the model’s ability to follow explicit instructions, including format compliance, length constraints, content restrictions, and multi-step task completion. Inherits from ScoredProbe. It builds prompts with explicit constraints and evaluates compliance against those constraints, supporting both strict and averaged scoring modes.
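The difference between strict and averaged scoring can be shown with a few simple checks. In this sketch the constraint names (max_words, must_include, forbidden) and both helper functions are hypothetical. It assumes that strict mode requires every constraint to pass, and that averaged mode reports the fraction that passed.

```python
def check_constraints(response: str, max_words: int, must_include: list, forbidden: list) -> dict:
    """Evaluate each constraint independently and return pass/fail flags."""
    text = response.lower()
    return {
        "max_words": len(response.split()) <= max_words,
        "must_include": all(term.lower() in text for term in must_include),
        "forbidden": not any(term.lower() in text for term in forbidden),
    }


def compliance_score(checks: dict, strict: bool) -> float:
    passed = list(checks.values())
    if not passed:
        return 1.0
    if strict:
        # Strict mode: a single failed constraint yields zero.
        return 1.0 if all(passed) else 0.0
    # Averaged mode: fraction of constraints satisfied.
    return sum(passed) / len(passed)


checks = check_constraints(
    "Cats are quiet pets.", max_words=3, must_include=["cats"], forbidden=["dogs"]
)
strict_score = compliance_score(checks, strict=True)
averaged_score = compliance_score(checks, strict=False)
```

Here the response breaks only the word limit, so the strict score is zero while the averaged score still credits the two satisfied constraints.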
Other Specialized Probes#
- MultiStepTaskProbe: Tests multi-step task completion, checking for step decomposition, context maintenance, and coherence.
- ConstraintComplianceProbe: Tests compliance with constraints like word/character/sentence limits or custom validators.
- AgentProbe: Evaluates tool-using LLM agents, integrating with a tracing system for deterministic execution and contract validation.
AgentProbe#
AgentProbe is a specialized probe for evaluating tool-using LLM agents. It supports deterministic trace recording, contract validation, and detailed analysis of agent behaviors involving tool calls, tool results, and final answers. AgentProbe integrates with the insideLLMs tracing system, enabling CI/CD workflows that enforce behavioral determinism and trace contract compliance.
Key Features:
- Runs a configurable agent loop that parses model outputs as JSON actions (tool calls or final answers)
- Supports a dictionary of tool functions, which are invoked when the agent requests a tool
- Records all agent actions, tool calls, and results via the TraceRecorder, producing a deterministic trace
- Validates traces against configurable contracts (e.g., tool argument schemas, tool call order, required results)
- Supports trace redaction and canonicalization for CI/CD and reproducibility
- Returns rich metadata, including trace fingerprints and violation details, in each result
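The agent loop described in the list above can be sketched in a few lines. Everything here is a stand-in. The JSON action shape ({"tool": ..., "args": ...} or {"final_answer": ...}), the run_agent function, and the trace entry format are assumptions for illustration, not the AgentProbe or TraceRecorder implementation.

```python
import json


def run_agent(model, tools: dict, question: str, max_steps: int = 4) -> dict:
    """Hypothetical agent loop: the model emits JSON actions until it answers."""
    trace = [{"kind": "question", "content": question}]
    observation = question
    for _ in range(max_steps):
        action = json.loads(model(observation))
        if "final_answer" in action:
            trace.append({"kind": "final_answer", "content": action["final_answer"]})
            return {"answer": action["final_answer"], "trace": trace}
        name, args = action["tool"], action.get("args", {})
        trace.append({"kind": "tool_call", "tool": name, "args": args})
        result = tools[name](args)
        trace.append({"kind": "tool_result", "tool": name, "result": result})
        # Feed the tool result back to the model as the next observation.
        observation = json.dumps(result, sort_keys=True)
    # Step budget exhausted without a final answer.
    return {"answer": None, "trace": trace}


def search_tool(args):
    return {"results": ["example result"]}


def scripted_model(observation: str) -> str:
    # Stand-in model: call the search tool first, then answer.
    if observation.startswith("{"):
        return json.dumps({"final_answer": "Found: example result"})
    return json.dumps({"tool": "search", "args": {"query": "cats"}})


outcome = run_agent(scripted_model, {"search": search_tool}, "Search for cats")
```

Because the scripted model is deterministic, repeated runs produce the same trace, which is the property a CI check would rely on.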
Example Usage:
from insideLLMs import AgentProbe, ProbeRunner

def search_tool(args):
    # Invoked when the agent requests the "search" tool
    return {"results": ["example result"]}

tools = {"search": search_tool}
probe = AgentProbe(tools=tools, max_steps=4)
# `model` is a model instance created beforehand, e.g. via model_registry.get(...)
runner = ProbeRunner(model, probe)
results = runner.run([{"question": "Search for cats"}])
# Each result includes trace metadata and any contract violations
AgentProbe is ideal for evaluating LLM agents that use tools, APIs, or function calls, and for enforcing behavioral determinism in CI/CD pipelines. Trace configuration and contract validation can be customized via the trace_config argument or YAML config files.
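To make the ideas of trace fingerprints and contract validation concrete, here is a minimal sketch. The trace entry format, the fingerprint and validate_trace helpers, and the two contract rules (required argument names and expected tool order) are hypothetical. The real trace_config options are not shown in this document and are not reproduced here.

```python
import hashlib
import json


def fingerprint(trace: list) -> str:
    """Hash a canonical JSON form so identical traces give identical digests."""
    canonical = json.dumps(trace, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate_trace(trace: list, required_args: dict, expected_order: list) -> list:
    """Return a list of violation messages (empty when the contract holds)."""
    violations = []
    calls = [entry for entry in trace if entry["kind"] == "tool_call"]
    for call in calls:
        missing = set(required_args.get(call["tool"], [])) - set(call["args"])
        if missing:
            violations.append(f"{call['tool']}: missing args {sorted(missing)}")
    if [call["tool"] for call in calls] != expected_order:
        violations.append("tool call order differs from contract")
    return violations


trace = [
    {"kind": "tool_call", "tool": "search", "args": {"query": "cats"}},
    {"kind": "final_answer", "content": "done"},
]
# Same content with keys in a different order: canonicalization hides the difference.
reordered = [
    {"args": {"query": "cats"}, "tool": "search", "kind": "tool_call"},
    {"content": "done", "kind": "final_answer"},
]
```

Sorting keys before hashing is what makes the fingerprint stable across runs that serialize the same trace differently.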
Probe Configuration#
Probes can be configured in several ways:
- Programmatically: Instantiate probe classes directly with parameters.
    from insideLLMs import LogicProbe
    probe = LogicProbe(prompt_template="Solve: {problem}")
- Configuration Files: Use YAML or JSON files specifying probe type and arguments.
    probe:
      type: bias
      args:
        bias_dimension: gender
        analyze_sentiment: true
- Registries: Discover and construct probes by name using probe_registry.
    from insideLLMs import probe_registry
    probe = probe_registry.get("logic")
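A name-based registry and a config-driven constructor can be pictured as follows. SketchRegistry, ToyLogicProbe, and build_from_config are hypothetical names that illustrate the general pattern. They do not show how probe_registry is implemented.

```python
class SketchRegistry:
    """Hypothetical name-to-factory registry (not the insideLLMs implementation)."""

    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        if name in self._factories:
            raise ValueError(f"duplicate probe name: {name}")
        self._factories[name] = factory

    def get(self, name, **kwargs):
        try:
            factory = self._factories[name]
        except KeyError:
            raise KeyError(f"unknown probe: {name!r}") from None
        # Construct a fresh instance with the supplied arguments.
        return factory(**kwargs)

    def names(self):
        return sorted(self._factories)


class ToyLogicProbe:
    def __init__(self, prompt_template="Solve: {problem}"):
        self.prompt_template = prompt_template


def build_from_config(registry, config: dict):
    # Reads a probe/type/args mapping like the configuration file entry above.
    spec = config["probe"]
    return registry.get(spec["type"], **spec.get("args", {}))


registry = SketchRegistry()
registry.register("logic", ToyLogicProbe)
probe = build_from_config(
    registry, {"probe": {"type": "logic", "args": {"prompt_template": "Q: {problem}"}}}
)
```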
Probe Execution#
Probes are executed using runner classes:
- ProbeRunner: Synchronous execution. Orchestrates probe runs on datasets, handles configuration, error recovery, progress tracking, and result aggregation.
    from insideLLMs import ProbeRunner
    runner = ProbeRunner(model, probe)
    results = runner.run(dataset)
- AsyncProbeRunner: Asynchronous execution for parallel/concurrent runs, useful for API-based models.
    runner = AsyncProbeRunner(model, probe)
    results = await runner.run(dataset, concurrency=10)
- Experiment/Harness Runs: Run full experiments or cross-model harnesses from configuration files for reproducibility.
    insidellms run experiment.yaml
    insidellms harness harness.yaml
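Bounded concurrency of the kind a concurrency=10 argument suggests is commonly built on asyncio.Semaphore. The sketch below shows that general technique with a fake async model. The names run_concurrently and fake_generate are hypothetical, and this is not a description of how AsyncProbeRunner works internally.

```python
import asyncio


async def run_concurrently(generate, dataset: list, concurrency: int = 10) -> list:
    """Run an async generate() over dataset with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def run_one(item):
        async with semaphore:
            try:
                return {"input": item, "output": await generate(item), "error": None}
            except Exception as exc:
                # Record the failure instead of aborting the whole batch.
                return {"input": item, "output": None, "error": str(exc)}

    # gather() returns results in input order, regardless of completion order.
    return await asyncio.gather(*(run_one(item) for item in dataset))


async def fake_generate(prompt: str) -> str:
    # Stand-in for a network call to an API-based model.
    await asyncio.sleep(0.01)
    return prompt[::-1]


results = asyncio.run(run_concurrently(fake_generate, ["abc", "xyz"], concurrency=2))
```

Capping the number of in-flight requests keeps throughput high for API-based models while staying under provider rate limits.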
Results, Reporting, and Visualization#
Result Structure#
- ProbeResult: Captures the result of a single probe run, including input, output, status, error, latency, and metadata.
- ExperimentResult: Aggregates all results for a probe/model/dataset run, including scores, timestamps, configuration, and metadata.
- ProbeScore: Contains metrics such as accuracy, error rate, mean latency, and custom probe-specific metrics.
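The relationship between per-item results and aggregate scores can be sketched with a dataclass. SketchResult and summarize are hypothetical. The field names follow the list above, but the exact types, the latency unit, and the choice to compute accuracy over successful items only are assumptions.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Any, Optional


@dataclass
class SketchResult:
    """Hypothetical per-item record with the fields listed above."""

    input: Any
    output: Any = None
    status: str = "success"
    error: Optional[str] = None
    latency_ms: float = 0.0
    metadata: dict = field(default_factory=dict)


def summarize(results: list) -> dict:
    ok = [r for r in results if r.status == "success"]
    correct = [r for r in ok if r.metadata.get("is_correct")]
    return {
        # Accuracy over items that completed without error (an assumption).
        "accuracy": len(correct) / len(ok) if ok else 0.0,
        "error_rate": (len(results) - len(ok)) / len(results) if results else 0.0,
        "mean_latency_ms": mean(r.latency_ms for r in ok) if ok else 0.0,
    }


results = [
    SketchResult("q1", "a1", latency_ms=100.0, metadata={"is_correct": True}),
    SketchResult("q2", "a2", latency_ms=300.0, metadata={"is_correct": False}),
    SketchResult("q3", status="error", error="timeout"),
]
summary = summarize(results)
```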
Reporting and Export#
Results and reports can be exported in multiple formats:
- JSON: Machine-readable, for downstream analysis.
- Markdown/HTML: Human-readable reports with tables, summaries, and metrics.
- CSV: For spreadsheet analysis.
Statistical reports can be generated across experiments, including metrics like accuracy, error rate, latency, and custom probe metrics. Reports can be generated via CLI or programmatically.
Programmatically:
from insideLLMs.results import save_results_markdown, generate_statistical_report
save_results_markdown(results, "report.md")
report = generate_statistical_report([experiment], format="html")
From the CLI:
insidellms report ./my_run
Visualization#
Markdown and HTML reports include experiment metadata, summary statistics, scores, timing, and results tables. Statistical reports provide overall and per-model/probe performance, confidence intervals, and rankings, with visual styling in HTML.
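The document does not say which method the statistical reports use for confidence intervals. As one common choice for an accuracy estimate, the sketch below computes a Wilson score interval and ranks models by accuracy. The functions wilson_interval and rank_models, and the example counts, are hypothetical.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def rank_models(correct_counts: dict, total: int) -> list:
    """Return (name, accuracy, interval) rows sorted by accuracy, best first."""
    rows = [
        (name, count / total, wilson_interval(count, total))
        for name, count in correct_counts.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranking = rank_models({"model_a": 80, "model_b": 90}, total=100)
```

When two models' intervals overlap substantially, the ranking between them should be treated as tentative.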
Practical Usage Examples#
Running a Logic Probe:
from insideLLMs import LogicProbe, ProbeRunner, model_registry
model = model_registry.get("openai", model_name="gpt-4o")
probe = LogicProbe()
runner = ProbeRunner(model, probe)
results = runner.run(["What comes next: 1, 4, 9, 16, ?"])
Running from Configuration (YAML):
insidellms run experiment.yaml
Listing Available Probes:
insidellms list probes
Generating a Report:
insidellms report ./my_run
Extensibility#
To implement a custom probe, subclass Probe (or a base probe class) and implement the required methods:
from insideLLMs import Probe
class MyProbe(Probe[str]):
    def run(self, model, data, **kwargs) -> str:
        return model.generate(str(data))
For more details, see the insideLLMs documentation and source code.