Project Structure and Development Setup

Project Directory Layout

The insideLLMs project is organized to support modular development, clear separation of concerns, and ease of extension. The key directories are:

insideLLMs/
├── insideLLMs/  # Main Python package and CLI entrypoint
│   ├── models/  # Model implementations (OpenAI, Anthropic, Gemini, Cohere, HuggingFace, local, etc.)
│   ├── probes/  # Probe implementations (logic, bias, attack, code, etc.)
│   ├── nlp/  # NLP utilities (dependencies.py, feature_extraction.py, similarity.py, etc.)
│   ├── cli/  # CLI package (parser + command modules)
│   │   ├── __init__.py  # Main entrypoint
│   │   └── commands/  # Individual CLI commands (harness, run, report, attest, etc.)
│   ├── runtime/  # Execution/runtime package
│   │   └── runner.py  # Probe execution engine (canonical path)
│   ├── registry.py  # Plugin registry system for models, probes, datasets
│   ├── caching.py  # Unified caching infrastructure
│   ├── types.py  # Type definitions
│   ├── exceptions.py  # Exception hierarchy
│   └── ...  # Additional core modules (infra, templates, etc.)
├── tests/  # Pytest suite and shared fixtures (test_*.py, conftest.py)
├── examples/  # Runnable scripts and example configs (example_quickstart.py, harness.yaml)
├── data/  # Datasets and assets for examples and experiments
├── ci/  # Deterministic harness inputs for CI diff-gating (harness.yaml, harness_dataset.jsonl)
├── docs/, wiki/  # Documentation and planning notes
├── benchmarks/  # Benchmark assets and run artifacts
└── compliance_intelligence/  # Multi-agent AML/KYC demo (LangGraph, separate scope)

See: AGENTS.md, CONTRIBUTING.md, README.md

Directory Purposes

insideLLMs/: Core library code, CLI, and extension points.
insideLLMs/cli/: CLI package with main entrypoint and command modules (harness, run, report, attest, etc.).
insideLLMs/runtime/: Execution/runtime package with probe runner and workflow orchestration.
insideLLMs/nlp/: NLP-specific utilities (resource management, feature extraction, similarity metrics).
tests/: All test code, organized by module, with fixtures in conftest.py.
examples/: Quickstart scripts and configuration files to demonstrate usage.
data/: Example datasets for running probes and experiments.
ci/: Minimal configs and datasets for CI-based behavioral diff-gating.
benchmarks/: Benchmark definitions and run outputs.
compliance_intelligence/: Multi-agent AML/KYC demo (LangGraph, separate scope).
docs/, wiki/: User and developer documentation.

Rationale Behind Refactoring

Recent refactoring focused on improving modularity, maintainability, and extensibility. Key changes include:

Lazy loading for optional and heavy dependencies (e.g., HuggingFace, local model backends) to reduce startup overhead and make optional features truly optional.
Expanded registries for models and probes, supporting new providers (Gemini, Cohere, LlamaCpp, Ollama, VLLM, OpenRouter) and new probe types (code, instruction following, jailbreak, judge-based evaluation, etc.).
Clarified CLI and reporting workflows to make behavioral diff-gating and artifact management more robust and deterministic.
Improved test coverage and artifact isolation, with coverage enforced at 95% and new tests for all major subsystems.
Pipeline/middleware architecture (ongoing) to standardize execution, enable composable middleware (retry, rate limiting, caching, cost tracking), and make batch/async execution explicit.
Artifact management improvements, including new .gitignore entries for local and benchmark run outputs.
Module reorganization (part of the branch synthesis work): the CLI moved to insideLLMs/cli/, the runner was consolidated into insideLLMs/runtime/, and caching was unified in insideLLMs/caching.py.
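
The lazy-loading item in the list above can be illustrated with a minimal sketch. The helper name lazy_import is hypothetical and is not taken from the insideLLMs codebase; the point is only that a heavy module is imported on first use rather than when the package itself is imported.

```python
import importlib
from functools import lru_cache
from types import ModuleType


@lru_cache(maxsize=None)
def lazy_import(module_name: str) -> ModuleType:
    """Import a module on first use and cache it (hypothetical helper)."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Turn a bare ImportError into an actionable message.
        raise ImportError(
            f"Optional dependency {module_name!r} is not installed; "
            "install the matching extra to enable this feature."
        ) from exc


# Stand-in: "json" plays the role of a heavy optional package here.
backend = lazy_import("json")
```

Because the import happens inside the function call, a missing optional package only causes an error when the feature that needs it is actually used.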

These changes make it easier to add new models, probes, and datasets, and to maintain and extend the system as new LLM providers and evaluation techniques emerge.
See: ARCHITECTURE.md, PR #8, PR #9, PR #12, PR #26
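
As a rough illustration of the composable-middleware idea mentioned in the list above, the following sketch wraps a model call in retry and caching layers. All names here (Handler, Middleware, with_retry, with_cache, build_pipeline) are hypothetical; the pipeline work is described as ongoing, so the project's real interfaces may look quite different.

```python
import time
from typing import Callable

# A handler maps a prompt to a completion; a middleware wraps a handler.
Handler = Callable[[str], str]
Middleware = Callable[[Handler], Handler]


def with_retry(max_attempts: int = 3, delay_s: float = 0.0) -> Middleware:
    """Retry the wrapped handler on any exception, up to max_attempts times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")

    def middleware(next_handler: Handler) -> Handler:
        def handler(prompt: str) -> str:
            for attempt in range(1, max_attempts + 1):
                try:
                    return next_handler(prompt)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay_s)
            raise AssertionError("unreachable")

        return handler

    return middleware


def with_cache() -> Middleware:
    """Memoize responses by prompt in an in-memory dict."""

    def middleware(next_handler: Handler) -> Handler:
        cache: dict[str, str] = {}

        def handler(prompt: str) -> str:
            if prompt not in cache:
                cache[prompt] = next_handler(prompt)
            return cache[prompt]

        return handler

    return middleware


def build_pipeline(core: Handler, middlewares: list[Middleware]) -> Handler:
    """Compose middlewares so the first in the list is the outermost wrapper."""
    handler = core
    for mw in reversed(middlewares):
        handler = mw(handler)
    return handler
```

Placing with_cache() before with_retry() in the list means a cached prompt never reaches the retry layer or the underlying model at all.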

Navigating the Codebase

Core logic is in insideLLMs/, with models/ and probes/ as the main extension points.
CLI commands are in insideLLMs/cli/commands/ (harness, run, report, attest, sign, verify, trend, etc.).
Runner and orchestration logic is in insideLLMs/runtime/runner.py (the canonical path; the old insideLLMs/runner.py has been removed).
Caching is unified in insideLLMs/caching.py (renamed from the old insideLLMs/caching_unified.py).
NLP utilities are in insideLLMs/nlp/ (e.g., dependencies.py for resource management, feature_extraction.py for vectorization, similarity.py for text similarity).
Registries for models, probes, and datasets are managed in insideLLMs/registry.py.
Type definitions are in insideLLMs/types.py.
Examples and quickstart configs are in examples/.
Tests are in tests/, organized by module.
Datasets for experiments are in data/.
Documentation is in docs/ and the GitHub Wiki.

To add a new model or probe, create a new file in the appropriate subdirectory, export it in the module’s __init__.py, register it in registry.py, and add tests in tests/ (see CONTRIBUTING.md).
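
To make the registration step concrete, here is a minimal sketch of a name-to-factory registry. The Registry class, its register and create methods, and the EchoModel toy class are all hypothetical; the actual API in insideLLMs/registry.py is not shown in this document and may differ.

```python
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class Registry(Generic[T]):
    """Maps a string key to a factory (hypothetical; not the insideLLMs API)."""

    def __init__(self, kind: str) -> None:
        self._kind = kind
        self._factories: dict[str, Callable[..., T]] = {}

    def register(self, name: str, factory: Callable[..., T]) -> None:
        # Refuse silent overrides so two plugins cannot claim the same name.
        if name in self._factories:
            raise ValueError(f"{self._kind} {name!r} is already registered")
        self._factories[name] = factory

    def create(self, name: str, **kwargs) -> T:
        try:
            factory = self._factories[name]
        except KeyError:
            known = ", ".join(sorted(self._factories)) or "<none>"
            raise KeyError(f"unknown {self._kind} {name!r}; known: {known}") from None
        return factory(**kwargs)


class EchoModel:
    """Toy model used only for this illustration."""

    def __init__(self, prefix: str = "") -> None:
        self.prefix = prefix

    def generate(self, prompt: str) -> str:
        return f"{self.prefix}{prompt}"


model_registry: Registry[EchoModel] = Registry("model")
model_registry.register("echo", EchoModel)
```

In this sketch, create(name, **kwargs) takes a type name plus keyword arguments, which is one plausible way a type/args pair from a config file could be turned into an object.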

Import paths: Always use canonical paths for new code. Import from insideLLMs.runtime.runner (not insideLLMs.runner), insideLLMs.caching (not insideLLMs.caching_unified), etc. See docs/IMPORT_PATHS.md for the full migration matrix.
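
One way to catch stale import paths during a migration is a small static check. The sketch below is not part of the project; it covers only the two renames named above, and the function name find_deprecated_imports is made up for illustration. It parses source text with the standard ast module, so nothing is actually imported.

```python
import ast

# Old module path -> canonical replacement, as listed in the text above.
DEPRECATED_MODULES = {
    "insideLLMs.runner": "insideLLMs.runtime.runner",
    "insideLLMs.caching_unified": "insideLLMs.caching",
}


def find_deprecated_imports(source: str) -> list[tuple[int, str, str]]:
    """Return (line, old_path, canonical_path) for each deprecated import."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Covers both "from pkg.old import x" and "from pkg import old".
            names = [node.module] + [f"{node.module}.{a.name}" for a in node.names]
        else:
            continue
        for name in names:
            if name in DEPRECATED_MODULES:
                hits.append((node.lineno, name, DEPRECATED_MODULES[name]))
    return sorted(hits)
```

Running this over each file's text, for example via pathlib.Path.read_text(), gives a list of lines to update against the migration matrix.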

Development Environment Setup

The project requires Python 3.10 or higher.

Create and activate a virtual environment, then install the project in editable mode with all optional dependencies:

python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -e ".[all]"

Alternatively, you can install only specific extras:

pip install -e ".[nlp]"
pip install -e ".[visualization]"
pip install -e ".[dev]"

See: README.md, Getting Started Wiki

Install pre-commit hooks to enforce code quality:

pre-commit install

Run insidellms doctor as a post-install sanity check. It reports missing optional dependencies as warnings rather than errors: in a [dev]-only environment you may see nltk- or pydantic-related warnings, but the command itself does not fail.

Makefile targets default to PYTHON=python3. If your interpreter is exposed as python instead, override the variable per invocation with make <target> PYTHON=python:

make check-fast PYTHON=python

Running Tests

Run the test suite with:

pytest

For coverage:

pytest --cov=insideLLMs --cov-report=term

To skip slow or integration tests:

pytest -m "not slow and not integration"

Tests that create run artifacts should use an isolated root directory, e.g.:

INSIDELLMS_RUN_ROOT=.tmp/insidellms_runs pytest
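
The same isolation can be applied per test rather than per session. The sketch below is a standard-library-only context manager; the name isolated_run_root is hypothetical, and it assumes the library reads INSIDELLMS_RUN_ROOT at the time artifacts are written, which this document does not confirm.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def isolated_run_root() -> Iterator[Path]:
    """Point INSIDELLMS_RUN_ROOT at a fresh temp dir, then restore the old value."""
    previous = os.environ.get("INSIDELLMS_RUN_ROOT")
    with tempfile.TemporaryDirectory(prefix="insidellms_runs_") as tmp:
        os.environ["INSIDELLMS_RUN_ROOT"] = tmp
        try:
            yield Path(tmp)
        finally:
            # Restore the caller's environment exactly as it was.
            if previous is None:
                os.environ.pop("INSIDELLMS_RUN_ROOT", None)
            else:
                os.environ["INSIDELLMS_RUN_ROOT"] = previous
```

Inside pytest, the built-in monkeypatch and tmp_path fixtures achieve the same effect with monkeypatch.setenv("INSIDELLMS_RUN_ROOT", str(tmp_path)).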

Coverage is enforced in CI (minimum 95%). The test suite uses pytest and pytest-asyncio, with markers for slow and integration tests.
See: CONTRIBUTING.md, AGENTS.md

Contributing

Follow the conventional commit style (feat(scope): ..., fix: ..., test: ..., docs: ..., chore: ...). Keep commits atomic and PRs focused. Add or adjust tests for any behavior changes. PRs should follow the template: clear description, linked issue (if any), test notes, and screenshots for UI/report changes.

Coding style guidelines:

Python 3.10, 4-space indentation, Ruff formatting (line length 100)
Type hints on public APIs
Explicit config surfaces (often via dataclasses)
snake_case for modules/functions, PascalCase for types
Sorted imports via Ruff

Never commit credentials. Configure providers via environment variables (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, CO_API_KEY/COHERE_API_KEY, HUGGINGFACEHUB_API_TOKEN). For vulnerability reports, follow SECURITY.md (do not open public issues for security findings).
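
A quick way to see which providers are configured without printing any secret is to check only for the presence of the variables. In the sketch below the variable names come from the paragraph above, while the provider labels and the function name configured_providers are illustrative groupings, not identifiers from the project.

```python
import os

# Environment variable names taken from the text above; a provider counts as
# configured if any one of its variables is set to a non-empty value.
PROVIDER_ENV_VARS = {
    "openai": ("OPENAI_API_KEY",),
    "anthropic": ("ANTHROPIC_API_KEY",),
    "google": ("GOOGLE_API_KEY",),
    "cohere": ("CO_API_KEY", "COHERE_API_KEY"),
    "huggingface": ("HUGGINGFACEHUB_API_TOKEN",),
}


def configured_providers(env=None) -> list[str]:
    """Return the sorted provider labels whose credentials appear to be set."""
    env = os.environ if env is None else env
    return sorted(
        provider
        for provider, names in PROVIDER_ENV_VARS.items()
        if any(env.get(name) for name in names)
    )
```

The function returns labels only and never the key values, so its output is safe to paste into logs or issue reports.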

For more details, see CONTRIBUTING.md and AGENTS.md.

Example: Running the Harness

Create a config file (e.g., harness.yaml):

models:
  - type: openai
    args:
      model_name: gpt-4o
probes:
  - type: logic
    args: {}
dataset:
  format: jsonl
  path: data/questions.jsonl
max_examples: 20
output_dir: results

Run the harness and generate a report:

insidellms harness harness.yaml --run-dir ./runs/candidate
insidellms report ./runs/candidate

This produces records.jsonl, summary.json, and report.html for analysis and comparison.
See: README.md, Getting Started Wiki
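
For programmatic analysis, the run artifacts can be loaded with the standard library alone. This sketch assumes records.jsonl and summary.json sit directly inside the run directory and that summary.json holds a JSON object; the record schema is not documented here, so the code makes no assumptions about field names. The helper name load_run is hypothetical.

```python
import json
from pathlib import Path


def load_run(run_dir: str) -> tuple[list[dict], dict]:
    """Load records.jsonl and summary.json from a run directory."""
    root = Path(run_dir)
    with (root / "records.jsonl").open(encoding="utf-8") as fh:
        # One JSON document per line; skip blank lines.
        records = [json.loads(line) for line in fh if line.strip()]
    summary = json.loads((root / "summary.json").read_text(encoding="utf-8"))
    return records, summary
```

For example, records, summary = load_run("./runs/candidate") followed by len(records) gives the number of recorded examples in that run.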