Quick Start

insideLLMs is a Python library and CLI for comparing large language model (LLM) behaviour across models using shared probes and datasets. It is designed for deterministic, reproducible evaluation and reporting, making it suitable for research, benchmarking, and CI workflows. This guide covers installation, an offline smoke test that needs no API keys, and the core baseline, candidate, and diff workflow.

Prerequisites

insideLLMs requires Python 3.10 or higher. API keys are only required if you want to use hosted models (e.g., OpenAI, Anthropic); you can run offline tests with the built-in DummyModel.
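
A setup script can check the interpreter version before installing anything. The following is a minimal sketch using only the standard library; `meets_minimum` is a hypothetical helper name chosen for illustration and is not part of insideLLMs.

```python
import sys

# Minimum version stated in the prerequisites above.
MINIMUM_PYTHON = (3, 10)


def meets_minimum(version_info=None, minimum=MINIMUM_PYTHON):
    """Return True if the (major, minor, ...) version is at least `minimum`."""
    if version_info is None:
        # Default to the interpreter that is running this script.
        version_info = sys.version_info
    # Compare only major and minor; patch level does not matter here.
    return tuple(version_info[:2]) >= tuple(minimum)
```

Calling `meets_minimum()` with no arguments checks the running interpreter, so a bootstrap script could exit early with a clear message when it returns False.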

Installation

Clone the repository, create and activate a virtual environment, and install insideLLMs:

git clone https://github.com/dr-gareth-roberts/insideLLMs.git
cd insideLLMs
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -e ".[dev]"

Optional extras are available for narrower installs:

pip install -e ".[openai]" # OpenAI provider
pip install -e ".[anthropic]" # Anthropic provider
pip install -e ".[nlp]" # NLP probes (nltk, spacy)
pip install -e ".[visualization]" # Charts and reports
pip install -e ".[providers]" # All providers at once


Offline Quick Start (No API Keys Required)

The fastest way to validate your setup is the built-in deterministic golden path:

# 1) Install development environment
pip install -e ".[dev]"

# 2) Run deterministic harness + diff cycle using DummyModel
make golden-path

This runs a complete harness and diff cycle with DummyModel. Expected result: the diff exits cleanly (no unexpected changes) and writes artefacts under .tmp/runs/.

Alternatively, use the welcome command to see a friendly introduction and quick-start options:

insidellms welcome

Or use the init command to generate a sample configuration. When run without arguments in an interactive terminal, it prompts you for configuration choices (output path, template, model, and probe):

insidellms init

You can also pass options directly or use the --interactive flag:

insidellms init --output experiment.yaml --template basic --model dummy --probe logic
insidellms init --interactive

Typical Workflow

The core workflow is baseline → candidate → diff:

1. Create a Baseline Run

insidellms harness ci/harness.yaml --run-dir runs/baseline --overwrite

This produces run artefacts in runs/baseline/, including manifest.json, records.jsonl, and config.resolved.yaml.
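
Before moving on to the candidate run, it can help to confirm that the baseline directory contains the artefacts named above. This is a minimal sketch, not something insideLLMs provides: `missing_artefacts` is a hypothetical helper, and the file list is taken only from the description above, so a real run may contain additional files.

```python
from pathlib import Path

# File names taken from the description above; a real run may contain more.
EXPECTED_ARTEFACTS = ("manifest.json", "records.jsonl", "config.resolved.yaml")


def missing_artefacts(run_dir, expected=EXPECTED_ARTEFACTS):
    """Return the expected artefact names that are absent from `run_dir`."""
    run_path = Path(run_dir)
    # Keep the original order so the report is stable between calls.
    return [name for name in expected if not (run_path / name).is_file()]


if __name__ == "__main__":
    missing = missing_artefacts("runs/baseline")
    if missing:
        print("Missing artefacts:", ", ".join(missing))
    else:
        print("All expected artefacts are present.")
```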

2. Create a Candidate Run

insidellms harness ci/harness.yaml --run-dir runs/candidate --overwrite

This produces run artefacts in runs/candidate/.

3. Compare and Optionally Gate

Compare the two runs:

insidellms diff runs/baseline runs/candidate

Fail your pipeline if behaviour changed:

insidellms diff runs/baseline runs/candidate --fail-on-changes
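
In a Python-driven pipeline, the same gate can be invoked through `subprocess`. The sketch below only wraps the command shown above. The helper names `build_diff_command` and `run_gate` are hypothetical, and it assumes that the CLI signals a detected change through a non-zero exit code, which is the usual convention for a failing pipeline step but is not spelled out here.

```python
import subprocess


def build_diff_command(baseline, candidate, fail_on_changes=True):
    """Build the argument list for the `insidellms diff` command shown above."""
    command = ["insidellms", "diff", str(baseline), str(candidate)]
    if fail_on_changes:
        command.append("--fail-on-changes")
    return command


def run_gate(baseline, candidate):
    """Run the diff and return its exit code.

    Assumption: the CLI exits non-zero when --fail-on-changes detects a change.
    """
    # check=False so the caller decides what to do with a non-zero exit code.
    completed = subprocess.run(build_diff_command(baseline, candidate), check=False)
    return completed.returncode
```

A CI script could end with `raise SystemExit(run_gate("runs/baseline", "runs/candidate"))` so that the job inherits the exit code of the diff.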

Example Harness Configuration

A minimal harness YAML file might look like:

models:
  - type: openai
    args:
      model_name: gpt-4o
probes:
  - type: logic
    args: {}
dataset:
  format: jsonl
  path: data/questions.jsonl
max_examples: 20
output_dir: results

Run it with:

insidellms harness harness.yaml

This produces records.jsonl, summary.json, and report.html in the output directory.
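
Because records.jsonl is a JSON Lines file (one JSON value per line), it can be inspected with the standard library alone. The following is a minimal sketch with hypothetical helper names. The fields inside each record are not described here, so the code treats records as opaque values and only counts them.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a JSON Lines file."""
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                yield json.loads(line)


def count_records(path):
    """Count the records in a JSON Lines file without loading them all at once."""
    return sum(1 for _ in read_jsonl(path))
```

With the configuration above, where `output_dir` is `results`, a call such as `count_records("results/records.jsonl")` gives a quick sanity check against `max_examples`.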

Setting Up API Keys

To use hosted models, set the appropriate environment variables before running insideLLMs:

OPENAI_API_KEY
ANTHROPIC_API_KEY
GOOGLE_API_KEY
CO_API_KEY or COHERE_API_KEY
HUGGINGFACEHUB_API_TOKEN (optional, for private HuggingFace models)

For example:

export OPENAI_API_KEY=sk-...
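
A script can fail fast when a key is missing instead of waiting for the first API call to be rejected. The sketch below uses only the variable names listed above. The provider labels used as dictionary keys and the helper `providers_without_keys` are hypothetical, chosen for illustration, and are not insideLLMs identifiers.

```python
import os

# Variable names from the list above, grouped under illustrative provider labels.
# Cohere accepts either of two names, so any one of them counts as set.
PROVIDER_KEYS = {
    "openai": ("OPENAI_API_KEY",),
    "anthropic": ("ANTHROPIC_API_KEY",),
    "google": ("GOOGLE_API_KEY",),
    "cohere": ("CO_API_KEY", "COHERE_API_KEY"),
}


def providers_without_keys(providers, env=None):
    """Return the providers for which none of the accepted variables is set."""
    env = os.environ if env is None else env
    return [
        provider
        for provider in providers
        # An empty string is treated the same as an unset variable.
        if not any(env.get(name) for name in PROVIDER_KEYS[provider])
    ]
```

Passing a plain dictionary as `env` keeps the check easy to test; omitting it reads the real process environment.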


Using insideLLMs as a Python Library

You can use insideLLMs programmatically. For example:

from insideLLMs.models import OpenAIModel
from insideLLMs.probes import LogicProbe
from insideLLMs.runner import run_probe

model = OpenAIModel(model_name="gpt-3.5-turbo")
probe = LogicProbe()
results = run_probe(model, probe, ["What is 2+2?"])


Exploring Models, Probes, and Datasets

List available resources:

insidellms list models
insidellms list probes
insidellms list datasets

You can also use insidellms info model <name> or insidellms info probe <name> for details.

Additional CLI Commands

insidellms welcome — friendly introduction for new users with quick-start examples
insidellms init — generate sample configuration (supports interactive mode with --interactive or -i)
insidellms benchmark — run comprehensive benchmarks across models and probes
insidellms compare — compare multiple models on the same inputs
insidellms interactive — start an interactive exploration session

For a full list of commands and options, run:

insidellms --help

Further Documentation

Coloured output can be disabled by setting NO_COLOR=1 in your environment.

insideLLMs supports local models (e.g., Ollama, llama.cpp) and offline tests using DummyModel. For more advanced configuration, see the documentation and example configs in the repository.