Project Directory Layout
The insideLLMs project is organized to support modular development, clear separation of concerns, and ease of extension. The key directories are:
insideLLMs/
├── insideLLMs/ # Main Python package and CLI entrypoint
│ ├── models/ # Model implementations (OpenAI, Anthropic, Gemini, Cohere, HuggingFace, local, etc.)
│ ├── probes/ # Probe implementations (logic, bias, attack, code, etc.)
│ ├── nlp/ # NLP utilities (dependencies.py, feature_extraction.py, similarity.py, etc.)
│ ├── cli/ # CLI package (parser + command modules)
│ │ ├── __init__.py # Main entrypoint
│ │ └── commands/ # Individual CLI commands (harness, run, report, attest, etc.)
│ ├── runtime/ # Execution/runtime package
│ │ └── runner.py # Probe execution engine (canonical path)
│ ├── registry.py # Plugin registry system for models, probes, datasets
│ ├── caching.py # Unified caching infrastructure
│ ├── types.py # Type definitions
│ ├── exceptions.py # Exception hierarchy
│ └── ... # Additional core modules (infra, templates, etc.)
├── tests/ # Pytest suite and shared fixtures (test_*.py, conftest.py)
├── examples/ # Runnable scripts and example configs (example_quickstart.py, harness.yaml)
├── data/ # Datasets and assets for examples and experiments
├── ci/ # Deterministic harness inputs for CI diff-gating (harness.yaml, harness_dataset.jsonl)
├── docs/, wiki/ # Documentation and planning notes
├── benchmarks/ # Benchmark assets and run artifacts
├── compliance_intelligence/ # Multi-agent AML/KYC demo (LangGraph, separate scope)
See: AGENTS.md, CONTRIBUTING.md, README.md
Directory Purposes
- insideLLMs/: Core library code, CLI, and extension points.
- insideLLMs/cli/: CLI package with main entrypoint and command modules (harness, run, report, attest, etc.).
- insideLLMs/runtime/: Execution/runtime package with probe runner and workflow orchestration.
- insideLLMs/nlp/: NLP-specific utilities (resource management, feature extraction, similarity metrics).
- tests/: All test code, organized by module, with fixtures in conftest.py.
- examples/: Quickstart scripts and configuration files to demonstrate usage.
- data/: Example datasets for running probes and experiments.
- ci/: Minimal configs and datasets for CI-based behavioral diff-gating.
- benchmarks/: Benchmark definitions and run outputs.
- compliance_intelligence/: Multi-agent AML/KYC demo (LangGraph, separate scope).
- docs/, wiki/: User and developer documentation.
Rationale Behind Refactoring
Recent refactoring focused on improving modularity, maintainability, and extensibility. Key changes include:
- Lazy loading for optional and heavy dependencies (e.g., HuggingFace, local model backends) to reduce startup overhead and make optional features truly optional.
- Expanded registries for models and probes, supporting new providers (Gemini, Cohere, LlamaCpp, Ollama, VLLM, OpenRouter) and new probe types (code, instruction following, jailbreak, judge-based evaluation, etc.).
- Clarified CLI and reporting workflows to make behavioral diff-gating and artifact management more robust and deterministic.
- Improved test coverage and artifact isolation, with coverage enforced at 95% and new tests for all major subsystems.
- Pipeline/middleware architecture (ongoing) to standardize execution, enable composable middleware (retry, rate limiting, caching, cost tracking), and make batch/async execution explicit.
- Artifact management improvements, including new .gitignore entries for local and benchmark run outputs.
- Module reorganization as part of branch synthesis work to improve code organization: CLI moved to insideLLMs/cli/, runner consolidated into insideLLMs/runtime/, and caching unified into insideLLMs/caching.py.
These changes make it easier to add new models, probes, and datasets, and to maintain and extend the system as new LLM providers and evaluation techniques emerge.
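The lazy-loading point in the list above can be illustrated with a minimal sketch. It is not taken from the insideLLMs source: lazy_import and HypotheticalLocalModel are hypothetical names, and the standard-library json module stands in for a heavy optional backend so the example runs anywhere.

```python
import importlib
from types import ModuleType
from typing import Optional


def lazy_import(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency only at the point where it is needed."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Turn a bare ImportError into an actionable install hint.
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f'Install it with: pip install -e ".[{extra}]"'
        ) from exc


class HypotheticalLocalModel:
    """Illustrative wrapper only; not an actual insideLLMs class."""

    def __init__(self, backend_module: str = "json") -> None:
        # Nothing heavy is imported at construction time.
        self._backend_module = backend_module
        self._backend: Optional[ModuleType] = None

    @property
    def backend(self) -> ModuleType:
        # Import on first access, then reuse the cached module.
        if self._backend is None:
            self._backend = lazy_import(self._backend_module, extra="all")
        return self._backend
```

With this shape, importing the package stays cheap, and a missing optional dependency only surfaces when the feature that needs it is actually used.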
See: ARCHITECTURE.md, PR #8, PR #9, PR #12, PR #26
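The pipeline/middleware item in the list above describes composable wrappers such as retry and caching. The following is a generic sketch of that idea, assuming a model call can be reduced to a function from prompt to response. The names Handler, Middleware, retry, cache, and build_pipeline are hypothetical and do not reflect the project's actual (still ongoing) design.

```python
import time
from typing import Callable, Dict, List

Handler = Callable[[str], str]
Middleware = Callable[[Handler], Handler]


def retry(max_attempts: int = 3, delay_s: float = 0.0) -> Middleware:
    """Retry the wrapped handler when it raises RuntimeError."""
    def wrap(next_handler: Handler) -> Handler:
        def handler(prompt: str) -> str:
            last_error = None
            for _ in range(max_attempts):
                try:
                    return next_handler(prompt)
                except RuntimeError as exc:
                    last_error = exc
                    time.sleep(delay_s)
            raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
        return handler
    return wrap


def cache() -> Middleware:
    """Memoize responses by prompt in an in-memory dict."""
    def wrap(next_handler: Handler) -> Handler:
        store: Dict[str, str] = {}

        def handler(prompt: str) -> str:
            if prompt not in store:
                store[prompt] = next_handler(prompt)
            return store[prompt]
        return handler
    return wrap


def build_pipeline(model_call: Handler, middlewares: List[Middleware]) -> Handler:
    """Compose middlewares so the first one in the list is the outermost."""
    handler = model_call
    for mw in reversed(middlewares):
        handler = mw(handler)
    return handler
```

Placing cache() before retry() in the list means a cached prompt never reaches the retry logic or the model at all; reversing the order would instead cache inside each retry attempt.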
Navigating the Codebase
- Core logic is in insideLLMs/, with models/ and probes/ as the main extension points.
- CLI commands are in insideLLMs/cli/commands/ (harness, run, report, attest, sign, verify, trend, etc.).
- Runner and orchestration logic is in insideLLMs/runtime/runner.py (canonical path; old insideLLMs/runner.py has been removed).
- Caching is unified in insideLLMs/caching.py (old insideLLMs/caching_unified.py was renamed).
- NLP utilities are in insideLLMs/nlp/ (e.g., dependencies.py for resource management, feature_extraction.py for vectorization, similarity.py for text similarity).
- Registries for models, probes, and datasets are managed in insideLLMs/registry.py.
- Type definitions are in insideLLMs/types.py.
- Examples and quickstart configs are in examples/.
- Tests are in tests/, organized by module.
- Datasets for experiments are in data/.
- Documentation is in docs/ and the GitHub Wiki.
To add a new model or probe, create a new file in the appropriate subdirectory, export it in the module’s __init__.py, register it in registry.py, and add tests in tests/ (see CONTRIBUTING.md).
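The registration step can be pictured with a generic name-to-factory registry. This is a sketch of the pattern only: the Registry class, its register/get methods, and EchoModel are hypothetical and are not the API of insideLLMs/registry.py, so check that module and CONTRIBUTING.md for the real calls.

```python
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class Registry(Generic[T]):
    """Hypothetical registry mapping a short name to a factory."""

    def __init__(self, kind: str) -> None:
        self._kind = kind
        self._factories: Dict[str, Callable[..., T]] = {}

    def register(self, name: str, factory: Callable[..., T]) -> None:
        # Refuse silent overwrites so name clashes are caught early.
        if name in self._factories:
            raise ValueError(f"{self._kind} '{name}' is already registered")
        self._factories[name] = factory

    def get(self, name: str, **kwargs) -> T:
        try:
            factory = self._factories[name]
        except KeyError:
            known = sorted(self._factories)
            raise KeyError(f"unknown {self._kind} '{name}'; known: {known}") from None
        return factory(**kwargs)


class EchoModel:
    """Toy model used only to demonstrate registration."""

    def __init__(self, prefix: str = "") -> None:
        self.prefix = prefix

    def generate(self, prompt: str) -> str:
        return f"{self.prefix}{prompt}"


model_registry: Registry[EchoModel] = Registry("model")
model_registry.register("echo", EchoModel)
```

A config entry such as `type: echo` with `args: {prefix: "> "}` would then resolve to `model_registry.get("echo", prefix="> ")` under this sketch.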
Import paths: Always use canonical paths for new code. Import from insideLLMs.runtime.runner (not insideLLMs.runner), insideLLMs.caching (not insideLLMs.caching_unified), etc. See docs/IMPORT_PATHS.md for the full migration matrix.
Development Environment Setup
The project requires Python 3.10 or higher.
Create and activate a virtual environment, then install the project in editable mode with all optional dependencies:
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -e ".[all]"
Alternatively, you can install only specific extras:
pip install -e ".[nlp]"
pip install -e ".[visualization]"
pip install -e ".[dev]"
See: README.md, Getting Started Wiki
Install pre-commit hooks to enforce code quality:
pre-commit install
Run insidellms doctor as a post-install sanity check. The command reports optional dependency gaps as warnings rather than errors: in a [dev]-only environment you may see nltk/pydantic-related warnings, but the command will not fail outright.
The Makefile defaults to PYTHON=python3. If your interpreter is available as python instead (for example via a symlink), override the variable for any target with make <target> PYTHON=python:
make check-fast PYTHON=python
Running Tests
Run the test suite with:
pytest
For coverage:
pytest --cov=insideLLMs --cov-report=term
To skip slow or integration tests:
pytest -m "not slow and not integration"
Tests that create run artifacts should use an isolated root directory, e.g.:
INSIDELLMS_RUN_ROOT=.tmp/insidellms_runs pytest
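The same isolation can be applied from inside Python. The helper below is a standard-library sketch with a hypothetical name (isolated_run_root); it assumes only that the code under test reads INSIDELLMS_RUN_ROOT from the environment, as the command above implies.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def isolated_run_root() -> Iterator[Path]:
    """Point INSIDELLMS_RUN_ROOT at a throwaway directory, then restore it."""
    previous = os.environ.get("INSIDELLMS_RUN_ROOT")
    with tempfile.TemporaryDirectory(prefix="insidellms_runs_") as tmp:
        os.environ["INSIDELLMS_RUN_ROOT"] = tmp
        try:
            yield Path(tmp)
        finally:
            # Restore the caller's environment exactly as it was.
            if previous is None:
                os.environ.pop("INSIDELLMS_RUN_ROOT", None)
            else:
                os.environ["INSIDELLMS_RUN_ROOT"] = previous
```

Inside a pytest test, the built-in fixtures give the same effect with less code: `monkeypatch.setenv("INSIDELLMS_RUN_ROOT", str(tmp_path))`.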
Coverage is enforced in CI (minimum 95%). The test suite uses pytest and pytest-asyncio, with markers for slow and integration tests.
See: CONTRIBUTING.md, AGENTS.md
Contributing
Follow the conventional commit style (feat(scope): ..., fix: ..., test: ..., docs: ..., chore: ...). Keep commits atomic and PRs focused. Add or adjust tests for any behavior changes. PRs should follow the template: clear description, linked issue (if any), test notes, and screenshots for UI/report changes.
Coding style guidelines:
- Python 3.10, 4-space indentation, Ruff formatting (line length 100)
- Type hints on public APIs
- Explicit config surfaces (often via dataclasses)
- snake_case for modules/functions, PascalCase for types
- Sorted imports via Ruff
Never commit credentials. Configure providers via environment variables (e.g., OPENAI_API_KEY, ANTHROPIC_API_KEY, GOOGLE_API_KEY, CO_API_KEY/COHERE_API_KEY, HUGGINGFACEHUB_API_TOKEN). For vulnerability reports, follow SECURITY.md (do not open public issues for security findings).
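Before a run, it can help to see which provider credentials are present without printing their values. The sketch below uses the variable names listed above; the provider labels on the left and the pairing of GOOGLE_API_KEY with "gemini" are assumptions made for illustration, and configured_providers is a hypothetical helper.

```python
import os
from typing import Dict, Mapping, Tuple

# Provider label -> environment variables that may hold its credential.
# Labels are illustrative; only the variable names come from the docs.
PROVIDER_ENV_VARS: Dict[str, Tuple[str, ...]] = {
    "openai": ("OPENAI_API_KEY",),
    "anthropic": ("ANTHROPIC_API_KEY",),
    "gemini": ("GOOGLE_API_KEY",),
    "cohere": ("CO_API_KEY", "COHERE_API_KEY"),
    "huggingface": ("HUGGINGFACEHUB_API_TOKEN",),
}


def configured_providers(env: Mapping[str, str] = os.environ) -> Dict[str, bool]:
    """Report which providers have a non-empty credential set.

    Only booleans are returned, so secrets never reach logs or output.
    """
    return {
        provider: any(env.get(name) for name in names)
        for provider, names in PROVIDER_ENV_VARS.items()
    }


if __name__ == "__main__":
    for provider, ok in configured_providers().items():
        print(f"{provider}: {'set' if ok else 'missing'}")
```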
For more details, see CONTRIBUTING.md and AGENTS.md.
Example: Running the Harness
Create a config file (e.g., harness.yaml):
models:
  - type: openai
    args:
      model_name: gpt-4o
probes:
  - type: logic
    args: {}
dataset:
  format: jsonl
  path: data/questions.jsonl
max_examples: 20
output_dir: results
Run the harness and generate a report:
insidellms harness harness.yaml --run-dir ./runs/candidate
insidellms report ./runs/candidate
This produces records.jsonl, summary.json, and report.html for analysis and comparison.
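For ad hoc analysis, the two JSON artifacts can be loaded with the standard library. This sketch assumes only that records.jsonl holds one JSON object per line and that summary.json is a single JSON document; it makes no assumption about field names, and iter_records and load_summary are hypothetical helpers rather than part of the insideLLMs API.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_records(run_dir: Path) -> Iterator[Dict[str, Any]]:
    """Yield one parsed record per non-empty line of records.jsonl."""
    with (run_dir / "records.jsonl").open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def load_summary(run_dir: Path) -> Dict[str, Any]:
    """Load summary.json from the run directory."""
    return json.loads((run_dir / "summary.json").read_text(encoding="utf-8"))


if __name__ == "__main__":
    run_dir = Path("runs/candidate")
    records = list(iter_records(run_dir))
    print(f"{len(records)} records")
    print("summary keys:", sorted(load_summary(run_dir)))
```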
See: README.md, Getting Started Wiki