CLI Diff-Gating and Fail-On-Changes Feature

CLI Diff Command Enhancements#

The insidellms diff command compares two run directories, each containing a records.jsonl file, and reports regressions, improvements, and other behavioral changes between them. Recent enhancements add stricter validation options, most notably the --fail-on-changes flag, which is intended for CI gating and validation workflows.

`--fail-on-changes` Flag#

The --fail-on-changes flag causes the diff command to exit with a non-zero status if any differences are detected between the baseline and candidate runs. This includes regressions, other changes (such as metric mismatches or output differences), and missing or extra records. This strict mode is particularly useful for continuous integration (CI) pipelines, where any behavioral drift must be surfaced and gated before merging changes.

Other related flags include:

--fail-on-regressions: Fails only if regressions are detected.
--fail-on-trace-violations: Fails if trace violations increase.
--fail-on-trace-drift: Fails if trace fingerprints differ, even if output text is unchanged.

The diff command supports both human-readable text output and machine-readable JSON output (with --format json and --output <file> options) for integration with CI systems and dashboards. The command summarizes counts and details for regressions, improvements, other changes, trace drifts, and trace violation increases, with configurable limits on displayed items.
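As a rough illustration of the JSON output mode, the sketch below wraps the diff call so that a report file is always written. The function name write_diff_report and the DIFF_CLI variable are hypothetical and exist only for this example; the --format json and --output flags are the ones named above.

```shell
# write_diff_report is a hypothetical wrapper, not part of insideLLMs.
# DIFF_CLI is an illustrative variable that lets the CLI be swapped for a
# stub; it defaults to the insidellms command.
write_diff_report() {
    baseline_dir=$1
    candidate_dir=$2
    report_path=$3
    # Left unquoted on purpose so DIFF_CLI may hold a command plus arguments.
    ${DIFF_CLI:-insidellms} diff "$baseline_dir" "$candidate_dir" \
        --format json --output "$report_path"
}

# Intended use (requires the insidellms CLI on PATH):
#   write_diff_report ./runs/baseline ./runs/candidate diff.json
```

The wrapper returns whatever status the diff command returns, so it can still be used as a gate when one of the fail-on flags is appended inside it.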

Example: Strict Diff Validation#

insidellms diff ./runs/baseline ./runs/candidate --fail-on-changes

If any differences are found, the command exits with code 2, causing the CI job to fail and preventing the merge of a pull request that introduces behavioral drift [source].
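In a CI script, the gate comes down to checking the exit status of the diff command. The following is a minimal sketch, not the project's own tooling: run_gate is a hypothetical helper, and the only behavior it relies on is that a non-zero status signals detected differences.

```shell
# run_gate is a hypothetical helper: it runs the command it is given and
# reports pass or fail based on that command's exit status.
run_gate() {
    if "$@"; then
        echo "gate passed"
        return 0
    else
        gate_status=$?
        echo "gate failed (exit status $gate_status)"
        return "$gate_status"
    fi
}

# Intended use in CI (requires the insidellms CLI on PATH):
#   run_gate insidellms diff ./runs/baseline ./runs/candidate --fail-on-changes
```

Because the helper passes the original status through, the CI job still fails with the same code the diff command produced.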

Example: Trace Drift and Violation Gating#

The diff command now supports additional flags for even stricter CI/CD gating:

--fail-on-trace-drift: Fails if any trace fingerprints differ between baseline and candidate runs, even if the output text is unchanged. This detects any behavioral drift at the trace level.
--fail-on-trace-violations: Fails if the number of trace violations increases in the candidate run compared to the baseline. This ensures that new contract violations are surfaced and gated.

Example: Enforcing Trace Drift and Violation Checks#

insidellms diff ./runs/baseline ./runs/candidate --fail-on-trace-drift --fail-on-trace-violations

If any trace fingerprints change or if trace violations increase, the command exits with a non-zero status (exit code 3 or 4), causing the CI job to fail. This provides an additional layer of determinism and contract enforcement beyond output text comparison.

These flags can be combined with --fail-on-changes or used independently, depending on the desired level of strictness in your CI/CD pipeline.
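When several fail-on flags are combined, a CI log is easier to read if the exit status is translated into a message. The sketch below uses only the codes mentioned in this document (0 for a pass, 2 for --fail-on-changes, 3 or 4 for the trace flags). The text does not say which of 3 and 4 belongs to which trace flag, so the two are grouped; describe_diff_exit is a hypothetical helper name.

```shell
# describe_diff_exit is a hypothetical helper that labels an exit status.
# Assumption: only the codes named in the surrounding text are known.
describe_diff_exit() {
    case "$1" in
        0)   echo "no gated differences" ;;
        2)   echo "differences detected (--fail-on-changes)" ;;
        3|4) echo "trace drift or trace violation increase" ;;
        *)   echo "unexpected exit status: $1" ;;
    esac
}

# Intended use after a diff run:
#   insidellms diff ./runs/baseline ./runs/candidate --fail-on-changes
#   describe_diff_exit "$?"
```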

CI Diff-Gating Workflow#

The CI diff-gating workflow leverages the deterministic nature of the insideLLMs harness and diff tooling to ensure reproducible, reliable validation of pull requests. The workflow is designed to catch any unintended changes in model behavior or outputs before code is merged.

Workflow Steps#

Baseline Generation: Run and save a baseline using the harness. The baseline's records.jsonl is either committed to the repository or stored as a CI artifact.
Candidate Run: On each pull request, run the harness again on the candidate code (the PR branch), producing a new records.jsonl.
Validation: Optionally, validate both baseline and candidate run directories using insidellms validate to ensure schema compliance.
Diff and Gate: Use the diff command with --fail-on-changes to compare the baseline and candidate runs. If any differences are detected, the CI job fails.

Example: CI Diff-Gating Commands#

# Run harness on baseline (typically on main branch or a known-good SHA)
insidellms harness ci/harness.yaml --run-dir .tmp/runs/base --skip-report

# Run harness on candidate (pull request branch)
insidellms harness ci/harness.yaml --run-dir .tmp/runs/head --skip-report

# Optionally validate outputs
insidellms validate .tmp/runs/base
insidellms validate .tmp/runs/head

# Diff and gate on any changes
insidellms diff .tmp/runs/base .tmp/runs/head --fail-on-changes

If the diff command detects any differences, it exits with a non-zero status, causing the CI pipeline to fail and blocking the PR [source].

GitHub Actions Integration#

For teams using GitHub Actions, insideLLMs provides a pre-built composite action that automates the entire diff-gating workflow. This eliminates the need to write manual bash scripts in your CI workflows and provides a turnkey solution.

The action is located at .github/actions/diff-gate/ and handles:

Automated baseline and candidate harness execution: Runs the harness on both the baseline and candidate commits automatically.
Diff computation with configurable fail-on modes: Supports regressions (default), changes, and none modes to control when the CI job should fail.
Optional PR comment posting: Automatically posts a formatted diff summary as a PR comment, including counts of regressions, improvements, and other changes.
Configurable baseline ref: By default, compares against the PR base, but can be overridden to compare against a specific branch or commit (e.g., origin/main).

The action includes the diff-gate.sh script that performs the baseline run, candidate run, and diff analysis.

Example: Using the Diff-Gate Action#

name: Diff Gate
on:
  pull_request:

permissions:
  contents: read
  pull-requests: write

jobs:
  diff-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history needed for baseline checkout

      - uses: ./.github/actions/diff-gate
        with:
          config: ci/harness.yaml
          fail-on: regressions # or 'changes' for stricter gating

Action Inputs#

config: Path to the harness YAML config file (default: ci/harness.yaml)
baseline-ref: Git ref for the baseline run (default: PR base)
fail-on: Failure mode, one of regressions (default), changes, or none
python-version: Python version to use (default: 3.12)
extra-pip-args: Extra arguments passed to pip install
harness-args: Extra arguments passed to insidellms harness (default: --skip-report --deterministic-artifacts)
comment: Whether to post a PR comment with the diff summary (default: true)

Action Outputs#

exit-code: Exit code from the diff command (0 = pass, non-zero = fail)
diff-json: Path to the JSON diff report file

This GitHub Actions integration is implemented in the repository and complements the manual CLI approach documented above, giving teams that use GitHub Actions a more streamlined setup.

Deterministic Harness-Based Validation#

The harness and diff workflow is deterministic by design, and this determinism extends to run IDs and timestamps. Repeated runs with the same configuration and data therefore produce identical outputs, which makes the diff-gating process reliable and reproducible [source].
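One quick way to spot-check this property is to run the harness twice into separate directories and compare the resulting records.jsonl files byte for byte. This is a sketch under the assumption that "identical outputs" means byte-identical files; same_records is a hypothetical helper, and insidellms diff remains the intended comparison tool.

```shell
# same_records is a hypothetical helper: it succeeds only when both run
# directories contain byte-identical records.jsonl files.
same_records() {
    cmp -s "$1/records.jsonl" "$2/records.jsonl"
}

# Intended use after two harness runs with the same config:
#   same_records .tmp/runs/first .tmp/runs/second && echo "runs are identical"
```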

Artifacts and Reporting#

Typical artifacts in this workflow include:

records.jsonl: Canonical run log (one JSON object per example/model/probe).
summary.json: Aggregated metrics and confidence intervals.
report.html: Human-readable comparison report.
diff.json: Machine-readable diff report (produced with --format json --output diff.json).

Integration with Testing and Validation Strategy#

The diff-gating workflow is part of a broader, deterministic testing and validation strategy in insideLLMs. The typical lifecycle is:

Run/Harness: Execute experiments or harnesses to generate run artifacts.
Validate: Check run directories for schema compliance.
Report: Generate human- or machine-readable reports.
Diff: Compare baseline and candidate runs for behavioral drift.
Decide: Gate merges or deployments based on diff results.

Test coverage includes dedicated tests for diff gating and the --fail-on-changes flag, ensuring correctness and reliability of these features [source].

For more details, see the README CI Diff-Gating section and Results and Reports documentation.
