insideLLMs is a Python library and CLI for comparing large language model (LLM) behaviour across models using shared probes and datasets. It is designed for deterministic, reproducible evaluation and reporting, which makes it suitable for research, benchmarking, and CI workflows. This guide shows how to install insideLLMs and get a first run working.
Prerequisites#
insideLLMs requires Python 3.10 or higher. API keys are only required if you want to use hosted models (e.g., OpenAI, Anthropic); you can run offline tests with the built-in DummyModel.
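The interpreter requirement can be confirmed before installing with a small version check. This is an illustrative sketch, not part of insideLLMs; the helper name meets_minimum_python is hypothetical.

```python
import sys

MINIMUM_PYTHON = (3, 10)  # insideLLMs requires Python 3.10 or higher


def meets_minimum_python(version_info=None):
    """Return True if the given (or current) interpreter is new enough."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); the patch level does not matter here.
    return tuple(version_info[:2]) >= MINIMUM_PYTHON


if __name__ == "__main__":
    if meets_minimum_python():
        print("Python version OK")
    else:
        print("Python 3.10 or higher is required")
```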
Installation#
Clone the repository, create and activate a virtual environment, and install insideLLMs:
git clone https://github.com/dr-gareth-roberts/insideLLMs.git
cd insideLLMs
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -e ".[dev]"
Optional extras are available for narrower installs:
pip install -e ".[openai]" # OpenAI provider
pip install -e ".[anthropic]" # Anthropic provider
pip install -e ".[nlp]" # NLP probes (nltk, spacy)
pip install -e ".[visualization]" # Charts and reports
pip install -e ".[providers]" # All providers at once
Offline Quick Start (No API Keys Required)#
The fastest way to validate your setup is the built-in deterministic golden path:
# 1) Install development environment
pip install -e ".[dev]"
# 2) Run deterministic harness + diff cycle using DummyModel
make golden-path
This runs a complete harness and diff cycle with DummyModel. Expected result: the diff exits cleanly (no unexpected changes) and writes artefacts under .tmp/runs/.
Alternatively, use the welcome command to see a friendly introduction and quick-start options:
insidellms welcome
Or use the init command to generate a sample configuration. When run without arguments in an interactive terminal, it prompts you for configuration choices (output path, template, model, and probe):
insidellms init
You can also pass options directly or use the --interactive flag:
insidellms init --output experiment.yaml --template basic --model dummy --probe logic
insidellms init --interactive
Typical Workflow#
The core workflow is baseline → candidate → diff:
1. Create a Baseline Run#
insidellms harness ci/harness.yaml --run-dir runs/baseline --overwrite
This produces run artefacts in runs/baseline/, including manifest.json, records.jsonl, and config.resolved.yaml.
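The artefacts named above can be inspected with the standard library alone. The sketch below assumes records.jsonl is in JSON Lines form (one JSON value per line), as the extension suggests; it makes no assumption about the fields inside each record. The function names load_records and summarise_run are hypothetical helpers, not insideLLMs APIs.

```python
import json
from pathlib import Path


def load_records(run_dir):
    """Parse records.jsonl from a run directory into a list of JSON values."""
    records = []
    path = Path(run_dir) / "records.jsonl"
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


def summarise_run(run_dir):
    """Report which expected artefacts exist and how many records were written."""
    run_path = Path(run_dir)
    expected = ["manifest.json", "records.jsonl", "config.resolved.yaml"]
    present = {name: (run_path / name).is_file() for name in expected}
    count = len(load_records(run_path)) if present["records.jsonl"] else 0
    return {"present": present, "record_count": count}
```

Calling summarise_run("runs/baseline") after the harness finishes gives a quick sanity check that the run directory is complete.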
2. Create a Candidate Run#
insidellms harness ci/harness.yaml --run-dir runs/candidate --overwrite
This produces run artefacts in runs/candidate/.
3. Compare and Optionally Gate#
Compare the two runs:
insidellms diff runs/baseline runs/candidate
Fail your pipeline if behaviour changed:
insidellms diff runs/baseline runs/candidate --fail-on-changes
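The three steps above can also be driven from a Python script, for example inside a CI job. This is a minimal sketch that uses only the commands shown in this guide; it assumes that insidellms diff with --fail-on-changes signals a behaviour change through a non-zero exit status. The function names are hypothetical.

```python
import subprocess


def build_workflow_commands(config="ci/harness.yaml",
                            baseline="runs/baseline",
                            candidate="runs/candidate"):
    """Return the three CLI invocations for baseline -> candidate -> diff."""
    return [
        ["insidellms", "harness", config, "--run-dir", baseline, "--overwrite"],
        ["insidellms", "harness", config, "--run-dir", candidate, "--overwrite"],
        ["insidellms", "diff", baseline, candidate, "--fail-on-changes"],
    ]


def run_workflow(commands, runner=subprocess.run):
    """Run each command in order; stop at the first non-zero exit code."""
    for command in commands:
        result = runner(command)
        if result.returncode != 0:
            # Assumption: a non-zero code means a failure or a detected change.
            return result.returncode
    return 0
```

Passing the return value of run_workflow(build_workflow_commands()) to sys.exit lets the calling pipeline fail when the diff step does.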
Example Harness Configuration#
A minimal harness YAML file might look like:
models:
  - type: openai
    args:
      model_name: gpt-4o
probes:
  - type: logic
    args: {}
dataset:
  format: jsonl
  path: data/questions.jsonl
max_examples: 20
output_dir: results
Run it with:
insidellms harness harness.yaml
This produces records.jsonl, summary.json, and report.html in the output directory.
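The configuration above reads its inputs from data/questions.jsonl. This guide does not specify the record schema, so the sketch below is illustrative only: the "question" key is an assumption, and the repository's example datasets should be treated as the authority on the actual field names.

```python
import json
from pathlib import Path


def write_jsonl(path, rows):
    """Write one JSON object per line, creating parent directories as needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
    return target


# Hypothetical schema: the "question" key is an assumption, not a documented field.
EXAMPLE_ROWS = [
    {"question": "What is 2+2?"},
    {"question": "If all cats are mammals and Tom is a cat, is Tom a mammal?"},
]

if __name__ == "__main__":
    write_jsonl("data/questions.jsonl", EXAMPLE_ROWS)
```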
Setting Up API Keys#
To use hosted models, set the appropriate environment variables before running insideLLMs:
OPENAI_API_KEY
ANTHROPIC_API_KEY
GOOGLE_API_KEY
CO_API_KEY or COHERE_API_KEY
HUGGINGFACEHUB_API_TOKEN (optional, for private HuggingFace models)
For example:
export OPENAI_API_KEY=sk-...
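Before starting a run against hosted models, it can help to check which of these variables are actually set. The sketch below uses only the variable names listed above; the provider labels used as dictionary keys are illustrative and are not necessarily the identifiers insideLLMs uses.

```python
import os

# Environment variable names as listed above; Cohere accepts either name.
PROVIDER_KEYS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "google": ["GOOGLE_API_KEY"],
    "cohere": ["CO_API_KEY", "COHERE_API_KEY"],
    "huggingface": ["HUGGINGFACEHUB_API_TOKEN"],  # optional
}


def configured_providers(environ=None):
    """Return provider labels for which at least one key is set and non-empty."""
    if environ is None:
        environ = os.environ
    return sorted(
        provider
        for provider, names in PROVIDER_KEYS.items()
        if any(environ.get(name) for name in names)
    )


if __name__ == "__main__":
    print("Configured providers:", configured_providers())
```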
Using insideLLMs as a Python Library#
You can use insideLLMs programmatically. For example:
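from insideLLMs.models import OpenAIModel
from insideLLMs.probes import LogicProbe
from insideLLMs.runner import run_probe

# OpenAIModel is a hosted model, so OPENAI_API_KEY must be set in the environment.
model = OpenAIModel(model_name="gpt-3.5-turbo")
probe = LogicProbe()

# Run the probe against a list of prompts.
results = run_probe(model, probe, ["What is 2+2?"])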
Exploring Models, Probes, and Datasets#
List available resources:
insidellms list models
insidellms list probes
insidellms list datasets
You can also use insidellms info model <name> or insidellms info probe <name> for details.
Additional CLI Commands#
insidellms welcome: friendly introduction for new users with quick-start examples
insidellms init: generate a sample configuration (supports interactive mode with --interactive or -i)
insidellms benchmark: run comprehensive benchmarks across models and probes
insidellms compare: compare multiple models on the same inputs
insidellms interactive: start an interactive exploration session
For a full list of commands and options, run:
insidellms --help
Further Documentation#
Coloured output can be disabled by setting NO_COLOR=1 in your environment.
insideLLMs supports local models (e.g., Ollama, llama.cpp) and offline tests using DummyModel. For more advanced configuration, see the documentation and example configs in the repository.