CI Diff-Gating Workflow for Deterministic PR Validation#

The CI diff-gating workflow in insideLLMs enforces deterministic validation of pull requests by running a minimal, reproducible harness on both the base commit (typically the main branch) and the candidate PR commit. It then compares the behavioral outputs to detect any drift, including regressions, changes, or missing/extra records. This process ensures that all behavioral changes introduced by a PR are surfaced and reviewed before merging, supporting reproducibility and preventing unintended model or logic regressions.

This workflow is implemented at .github/workflows/diff-gate.yml and uses the composite action at .github/actions/diff-gate/ to automate baseline and candidate harness execution, diff computation, and PR comment posting. The workflow was added as part of the synthesized branch effort to standardize CI quality gates.

Workflow Overview#

The workflow performs the following steps:

Baseline Run: The CI system checks out the PR base commit and runs the harness using the fixed configuration (ci/harness.yaml), saving results to .tmp/diff-gate/baseline.
Candidate Run: The workflow returns to the PR head commit and runs the same harness, saving results to .tmp/diff-gate/candidate.
Diffing: The outputs of the two runs are compared using the insidellms diff command with the configured failure mode (default: --fail-on-regressions).
Gating: If the diff reports differences covered by the configured failure mode, the CI job fails, blocking the PR until the differences are reviewed and addressed.
PR Comment: A summary of the diff results is posted as a comment on the pull request (configurable via the comment input).

Example workflow commands executed by the composite action:

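insidellms harness ci/harness.yaml --run-dir .tmp/diff-gate/baseline --skip-report --deterministic-artifacts
insidellms harness ci/harness.yaml --run-dir .tmp/diff-gate/candidate --skip-report --deterministic-artifacts
insidellms diff .tmp/diff-gate/baseline .tmp/diff-gate/candidate --fail-on-regressions --format json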
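
The harness and diff invocations can also be assembled programmatically, which is convenient for reproducing the gate locally. The following Python sketch only builds the command lines; the helper names `harness_cmd` and `diff_cmd` are hypothetical and not part of insideLLMs, and only flags that appear in this document are used.

```python
from pathlib import PurePosixPath

# Run directory root used by the workflow, as described in this document.
RUN_ROOT = PurePosixPath(".tmp/diff-gate")


def harness_cmd(run_name: str, config: str = "ci/harness.yaml") -> list[str]:
    """Build the harness invocation for one run ("baseline" or "candidate")."""
    return [
        "insidellms", "harness", config,
        "--run-dir", str(RUN_ROOT / run_name),
        "--skip-report", "--deterministic-artifacts",
    ]


def diff_cmd(fail_flag: str = "--fail-on-regressions") -> list[str]:
    """Build the diff invocation comparing the baseline and candidate runs."""
    return [
        "insidellms", "diff",
        str(RUN_ROOT / "baseline"), str(RUN_ROOT / "candidate"),
        fail_flag, "--format", "json",
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them (hypothetical dry run).
    for cmd in (harness_cmd("baseline"), harness_cmd("candidate"), diff_cmd()):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` after checking out the corresponding commit; the checkout itself is left out of this sketch.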

Deterministic Harness and Artifacts#

The harness configuration used for CI diff-gating (ci/harness.yaml) is designed for determinism. It uses a DummyModel (no API keys required) and a small, fixed dataset (ci/harness_dataset.jsonl) with four probes (logic, attack, instruction_following, code_generation). The harness spine (run, records, report, diff) is deterministic, including run IDs and timestamps, which makes diffing in CI reliable and reproducible.
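
One common way to make run IDs reproducible is to derive them from the content of the resolved configuration rather than from a clock or a random source. The sketch below illustrates that general idea only; it is an assumption about the technique, not insideLLMs' actual implementation, and `deterministic_run_id` is a hypothetical helper.

```python
import hashlib
import json


def deterministic_run_id(config: dict, length: int = 16) -> str:
    """Hash a canonical JSON encoding of the config into a short hex ID."""
    # sort_keys and fixed separators make the encoding independent of key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


# The same content written in a different key order.
config_a = {"models": [{"type": "dummy", "args": {}}], "max_examples": 3}
config_b = {"max_examples": 3, "models": [{"args": {}, "type": "dummy"}]}

same = deterministic_run_id(config_a) == deterministic_run_id(config_b)
print(same)  # True: the ID depends only on content, not on key order
```

Because the ID is a pure function of the configuration, two runs of the same config on different commits produce directly comparable artifacts.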

Example ci/harness.yaml excerpt:

models:
  - type: dummy
    args: {}
probes:
  - type: logic
    args: {}
  - type: attack
    args:
      attack_type: prompt_injection
  - type: instruction_following
    args: {}
  - type: code_generation
    args:
      language: python
dataset:
  format: jsonl
  path: ci/harness_dataset.jsonl
max_examples: 3
confidence_level: 0.95
report_title: CI Diff Gate Harness

How `--fail-on-changes` Enforces Stricter Validation#

The --fail-on-changes flag is used with the insidellms diff command to enforce strict gating on any behavioral drift between the base and candidate runs. When this flag is set, the diff command will exit with a non-zero status (exit code 2) if it detects any of the following:

Regressions (worse scores or failed statuses)
Changes (metric mismatches, output changes, status changes)
Records only present in the baseline or only in the candidate (missing/extra records)

This is stricter than gating only on regressions because it surfaces all behavioral differences, not just negative ones. The flag is defined in the CLI as follows:

diff_parser.add_argument(
    "--fail-on-changes",
    action="store_true",
    help=(
        "Exit with non-zero status if any differences are detected "
        "(regressions, changes, or missing/extra records)"
    ),
)

If any differences are found, the command returns exit code 2, causing the CI job to fail and blocking the PR until the differences are reviewed and resolved. See CLI implementation.
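
The gating decision can be summarized as a small function. This is a minimal sketch under stated assumptions: the `DiffSummary` fields and the `gate_exit_code` helper are hypothetical names chosen for illustration, and only the exit code 2 and the two failure modes come from the text above.

```python
from dataclasses import dataclass


@dataclass
class DiffSummary:
    """Hypothetical counts of differences found between two runs."""
    regressions: int = 0
    changes: int = 0
    only_baseline: int = 0   # records missing from the candidate
    only_candidate: int = 0  # records extra in the candidate


def gate_exit_code(summary: DiffSummary, fail_on_regressions: bool = False,
                   fail_on_changes: bool = False) -> int:
    """Return 2 when the selected failure mode is triggered, else 0."""
    any_difference = any((summary.regressions, summary.changes,
                          summary.only_baseline, summary.only_candidate))
    if fail_on_changes and any_difference:
        return 2
    if fail_on_regressions and summary.regressions > 0:
        return 2
    return 0
```

Under these assumptions, a run with one output change and no regressions passes when gating on regressions only and fails when gating on changes.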

Types of Behavioral Changes Detected#

The diff command compares two run directories and reports:

Regressions: Cases where the candidate run performs worse than the baseline (e.g., lower scores, failed statuses).
Improvements: Cases where the candidate run performs better.
Changes: Metric mismatches, output changes, or status changes that are not strictly regressions or improvements.
Missing/Extra Records: Records present only in the baseline or only in the candidate.
Trace Drifts and Violations: (If enabled) Differences in execution traces or contract violations.

All of these are surfaced in the diff report, and with --fail-on-changes, any of them will fail the CI job.
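
To make the categories concrete, the sketch below buckets records from two runs. The record schema (a mapping from record ID to `score`, `status`, and `output`), the convention that a higher score is better, and the `classify` helper are all assumptions made for illustration; the real diff command may use different fields and rules, and trace drifts are not modeled here.

```python
def classify(baseline: dict, candidate: dict) -> dict:
    """Bucket record IDs by how the candidate differs from the baseline."""
    buckets = {"regressions": [], "improvements": [], "changes": [],
               "only_baseline": [], "only_candidate": []}
    for rid in sorted(baseline.keys() | candidate.keys()):
        if rid not in candidate:
            buckets["only_baseline"].append(rid)
        elif rid not in baseline:
            buckets["only_candidate"].append(rid)
        else:
            b, c = baseline[rid], candidate[rid]
            failed_now = b["status"] == "success" and c["status"] != "success"
            fixed_now = b["status"] != "success" and c["status"] == "success"
            if failed_now or c["score"] < b["score"]:
                buckets["regressions"].append(rid)
            elif fixed_now or c["score"] > b["score"]:
                buckets["improvements"].append(rid)
            elif c != b:
                # Same score and status, but something else (the output) differs.
                buckets["changes"].append(rid)
    return buckets


ok = "success"
baseline = {
    "a": {"score": 1.0, "status": ok, "output": "x"},
    "b": {"score": 0.5, "status": ok, "output": "y"},
    "c": {"score": 1.0, "status": ok, "output": "z"},
    "d": {"score": 1.0, "status": ok, "output": "w"},
}
candidate = {
    "a": {"score": 0.5, "status": ok, "output": "x"},   # lower score
    "b": {"score": 0.9, "status": ok, "output": "y"},   # higher score
    "c": {"score": 1.0, "status": ok, "output": "zz"},  # output changed
    "e": {"score": 1.0, "status": ok, "output": "v"},   # new record
}
result = classify(baseline, candidate)
```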

Contributor Guidance: Handling Diff-Gating Failures#

If your PR fails due to the diff-gating workflow, it means that behavioral changes have been detected between your branch and the base branch. To resolve:

Review the diff output in the CI logs to identify the specific changes.
Determine whether the changes are intended and justified. If so, document the rationale in your PR description.
If the changes are unintended, update your code to restore behavioral consistency.
Re-run the CI workflow to ensure the diff-gating check passes.

This process ensures that all behavioral changes are visible, reviewed, and justified before merging.

GitHub Actions Integration#

The repository includes a turnkey GitHub Actions workflow at .github/workflows/diff-gate.yml that uses the composite action at .github/actions/diff-gate/. This composite action automates baseline and candidate harness execution, diff computation, and optional PR comment posting.

Inputs#

config: Path to the harness YAML config file (default: ci/harness.yaml)
baseline-ref: Git ref for the baseline run (default: PR base)
fail-on: Failure mode controlling when the action fails:
- regressions: Fail only on regressions (default)
- changes: Fail on any detected changes (regressions, improvements, other changes, missing/extra records)
- none: Never fail based on diff results
python-version: Python version to use (default: 3.12)
extra-pip-args: Additional arguments passed to pip install
harness-args: Additional CLI arguments passed to insidellms harness (default: --skip-report --deterministic-artifacts)
comment: Whether to post a PR comment with the diff summary (default: true)
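
The fail-on input presumably translates into a flag on the diff command. The mapping below is a sketch of that translation: the two flag names appear in this document, but how the composite action actually performs the mapping is an assumption, and `diff_flags` is a hypothetical helper.

```python
# Assumed mapping from the action's fail-on input to diff CLI flags.
FAIL_ON_FLAGS = {
    "regressions": ["--fail-on-regressions"],
    "changes": ["--fail-on-changes"],
    "none": [],  # never fail based on diff results
}


def diff_flags(fail_on: str = "regressions") -> list[str]:
    """Return the diff flags for a fail-on mode, rejecting unknown modes."""
    try:
        return list(FAIL_ON_FLAGS[fail_on])
    except KeyError:
        valid = ", ".join(FAIL_ON_FLAGS)
        raise ValueError(f"unknown fail-on mode {fail_on!r}; expected one of: {valid}") from None
```

Rejecting unknown modes early avoids a silently ungated run when the input is misspelled.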

Outputs#

exit-code: Exit code from the diff command (0=pass, 2=fail)
diff-json: Path to the JSON diff report

Live Workflow#

The workflow is configured at .github/workflows/diff-gate.yml and triggers on pull_request and workflow_dispatch events:

name: Diff Gate

on:
  pull_request:
  workflow_dispatch:

permissions:
  contents: read
  pull-requests: write

jobs:
  diff-gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # full history needed to check out base ref

      - uses: ./.github/actions/diff-gate
        with:
          config: ci/harness.yaml
          fail-on: regressions

The workflow runs automatically on every pull request and can be manually triggered via workflow_dispatch. It uses the composite action to:

Check out the baseline ref (PR base) and run the harness, saving results to .tmp/diff-gate/baseline
Return to the PR head commit and run the harness again, saving results to .tmp/diff-gate/candidate
Run insidellms diff with the configured fail-on mode (default: regressions)
Generate a Markdown summary and post it as a PR comment (if enabled)
Exit with the diff command's exit code, failing the workflow when the diff triggers the configured fail-on mode
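
The Markdown summary step could look roughly like the following. The shape of the `counts` mapping, the heading text, and the `render_summary` helper are hypothetical; the actual comment format produced by the composite action is not specified in this document.

```python
def render_summary(counts: dict[str, int], exit_code: int) -> str:
    """Render hypothetical diff counts as a Markdown table for a PR comment."""
    status = "passed" if exit_code == 0 else "failed"
    lines = [
        f"### Diff Gate: {status}",
        "",
        "| Category | Count |",
        "| --- | --- |",
    ]
    for name, count in counts.items():
        lines.append(f"| {name} | {count} |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_summary({"regressions": 1, "changes": 0}, exit_code=2))
```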

By default, the workflow fails only on regressions. To enforce stricter validation that blocks on any behavioral change, set fail-on: changes in the workflow configuration.

For more details, see the CI Diff-Gating documentation and the CLI implementation.