Testing Strategy and Coverage

Comprehensive Testing Approach

The insideLLMs project uses a modular, CI-enforced testing strategy that covers all major subsystems, including probes, models, NLP utilities, structured output extraction, visualization, deployment endpoints, and schema validation. Tests are organized by component in the tests/ directory. Each file follows the test_*.py naming pattern and is named for the subsystem under test (e.g., test_probes.py, test_models.py, test_nlp_text_cleaning.py, test_structured_output.py, test_visualization.py, test_deployment.py, test_output_schemas.py) [source].

Probes

Probe tests validate creation, attribute correctness, and the ability to run with various input types (strings, dicts, batches). They check output evaluation, scoring, and edge cases using a DummyModel to isolate probe logic. Each probe type (logic, bias, attack, code generation, instruction following, multi-step, constraint compliance) has dedicated test classes and methods that assert expected behaviors and output types [source].
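The shape of such a probe test can be sketched as follows. This is a minimal illustration, not code from the repository: EchoProbe and this particular DummyModel are hypothetical stand-ins, and the real probe classes and their run signatures may differ.

```python
class DummyModel:
    """Hypothetical stand-in model that returns a deterministic response."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


class EchoProbe:
    """Hypothetical probe used only to show the shape of a probe test."""

    name = "echo"

    def run(self, model, data):
        # Batches are handled by running each item in turn.
        if isinstance(data, list):
            return [self.run(model, item) for item in data]
        # Dict inputs carry the prompt under a "prompt" key; strings are used as-is.
        prompt = data["prompt"] if isinstance(data, dict) else data
        return model.generate(prompt)


class TestEchoProbe:
    def test_attributes(self):
        assert EchoProbe().name == "echo"

    def test_run_with_string(self):
        assert EchoProbe().run(DummyModel(), "hi") == "echo: hi"

    def test_run_with_dict(self):
        assert EchoProbe().run(DummyModel(), {"prompt": "hi"}) == "echo: hi"

    def test_run_with_batch(self):
        outputs = EchoProbe().run(DummyModel(), ["a", {"prompt": "b"}])
        assert outputs == ["echo: a", "echo: b"]
```

Because the dummy model is deterministic, each assertion exercises only the probe's input handling rather than any real model behavior.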

Models

Model tests cover instantiation, method functionality (generate, chat, stream), protocol compliance, wrapper features (caching, retry logic), and edge cases such as missing environment variables. Tests use unittest.mock.patch to mock dependencies and environment variables, and verify that models conform to expected interfaces and handle errors gracefully [source].
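As a rough sketch of the environment-variable case, the test below uses unittest.mock.patch.dict to control os.environ. ApiKeyModel and the EXAMPLE_API_KEY variable are hypothetical names chosen for illustration; the project's model classes and variable names are not shown in this section.

```python
import os
from unittest.mock import patch


class ApiKeyModel:
    """Hypothetical model wrapper that reads its API key from the environment."""

    ENV_VAR = "EXAMPLE_API_KEY"  # hypothetical variable name

    def __init__(self):
        key = os.environ.get(self.ENV_VAR)
        if not key:
            raise ValueError(f"{self.ENV_VAR} is not set")
        self.api_key = key


def test_missing_api_key_raises():
    # clear=True empties os.environ for the duration of the block only.
    with patch.dict(os.environ, {}, clear=True):
        try:
            ApiKeyModel()
        except ValueError as exc:
            assert "EXAMPLE_API_KEY" in str(exc)
        else:
            raise AssertionError("expected ValueError")


def test_api_key_is_read_from_environment():
    # patch.dict restores the original environment when the block exits.
    with patch.dict(os.environ, {"EXAMPLE_API_KEY": "test-key"}):
        assert ApiKeyModel().api_key == "test-key"
```

In a pytest suite the try/except would normally be written with pytest.raises; the plain form is used here only to keep the sketch free of third-party imports.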

NLP Utilities

NLP utility tests are organized by function, with classes such as TestRemoveHtmlTags, TestRemoveUrls, TestRemovePunctuation, etc. Each class contains multiple methods that assert correct behavior for a variety of input cases, including edge cases like empty strings and text without the target pattern. Coverage includes text cleaning, normalization, contraction expansion, emoji/number/punctuation removal, and repeated character replacement [source].
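The per-function layout might look like the sketch below. The remove_html_tags implementation here is a simple regex-based assumption written for this example; the project's actual function may behave differently on malformed markup.

```python
import re

_TAG_RE = re.compile(r"<[^>]+>")


def remove_html_tags(text: str) -> str:
    """Assumed implementation: strip anything that looks like an HTML tag."""
    return _TAG_RE.sub("", text)


class TestRemoveHtmlTags:
    def test_strips_simple_tags(self):
        assert remove_html_tags("<p>Hello <b>world</b></p>") == "Hello world"

    def test_empty_string(self):
        # Edge case: empty input should come back unchanged.
        assert remove_html_tags("") == ""

    def test_text_without_tags_is_unchanged(self):
        # Edge case: input that does not contain the target pattern.
        assert remove_html_tags("plain text") == "plain text"
```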

Structured Output Extraction

Tests for structured output extraction validate JSON extraction from text, parsing to Python types, dataclasses, and Pydantic models, schema generation, and error handling. Integration tests use MagicMock to simulate model responses. The suite ensures that structured output generation, batch processing, and error handling (e.g., retries, parsing errors) work as intended. Schema generation and validation are tested for both dataclasses and Pydantic models, including nested and list fields [source].
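An integration-style test of this kind could be sketched as follows. The extract_json helper and the Person dataclass are hypothetical and exist only for this example; the project's extraction API is not reproduced here.

```python
import json
from dataclasses import dataclass
from unittest.mock import MagicMock


@dataclass
class Person:
    name: str
    age: int


def extract_json(text: str) -> dict:
    """Hypothetical helper: decode the first JSON object embedded in text."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = text.find("{", start + 1)
    raise ValueError("no JSON object found in text")


def test_extracts_dataclass_from_model_response():
    # MagicMock simulates a model whose reply wraps JSON in extra prose.
    model = MagicMock()
    model.generate.return_value = 'Sure! {"name": "Ada", "age": 36} Hope that helps.'
    person = Person(**extract_json(model.generate("Describe Ada")))
    assert person == Person(name="Ada", age=36)
    model.generate.assert_called_once_with("Describe Ada")


def test_missing_json_raises():
    try:
        extract_json("no json here")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```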

Visualization

Visualization tests cover text-based and interactive charts, histograms, summary statistics, HTML report generation, and edge cases such as empty data, Unicode, and long labels. Tests check for correct output, dependency handling (matplotlib, plotly, ipywidgets), and file generation. Helper functions generate mock experiment results for testing. Tests also verify that missing dependencies raise appropriate errors and that all public visualization functions are documented [source].
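One way to test the missing-dependency path without uninstalling anything is to set the module's entry in sys.modules to None, which makes the import statement raise ImportError. The sketch below assumes a hypothetical plot_histogram helper with a lazy matplotlib import; it is not the project's function.

```python
import sys
from unittest.mock import patch


def plot_histogram(values):
    """Hypothetical plotting helper that imports matplotlib lazily."""
    try:
        import matplotlib.pyplot as plt
    except ImportError as exc:
        raise ImportError("matplotlib is required for plot_histogram") from exc
    fig, ax = plt.subplots()
    ax.hist(values)
    return fig


def test_missing_matplotlib_raises_helpful_error():
    # A None entry in sys.modules makes "import matplotlib" fail,
    # whether or not the package is actually installed.
    blocked = {"matplotlib": None, "matplotlib.pyplot": None}
    with patch.dict(sys.modules, blocked):
        try:
            plot_histogram([1, 2, 3])
        except ImportError as exc:
            assert "matplotlib is required" in str(exc)
        else:
            raise AssertionError("expected ImportError")
```

patch.dict restores sys.modules when the block exits, so later tests see the real import state again.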

Deployment Endpoints

Deployment tests cover endpoint configuration, rate limiting, API key authentication, request logging, metrics collection, health checks, batch endpoints, and FastAPI integration. Tests use unittest.mock.AsyncMock and Mock to simulate both synchronous and asynchronous model methods. Async test functions are marked with pytest.mark.asyncio. Tests validate error handling, batch size limits, thread safety, and configuration serialization [source].

Async Mocks

Async mocks are needed to test asynchronous endpoints and model methods. Use AsyncMock from unittest.mock to mock async methods. A plain Mock() creates any attribute on first access, so when a mock should represent a synchronous-only model, pass the spec= argument to keep attributes such as agenerate from being created automatically. Mark async test functions with @pytest.mark.asyncio so that pytest-asyncio runs them on an event loop. Example:

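from unittest.mock import AsyncMock, Mock
import pytest

# ModelEndpoint is the project's endpoint class; its import is not shown in the source.

@pytest.mark.asyncio
async def test_generate_async_model():
    # Attach an explicit AsyncMock so that awaiting agenerate returns the canned response.
    model = Mock()
    model.agenerate = AsyncMock(return_value="Async response")
    endpoint = ModelEndpoint(model)
    result = await endpoint.generate("Test prompt")
    assert result["response"] == "Async response"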

[source]
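The effect of spec= can be shown with the standard library alone. In this sketch SyncOnlyModel is a hypothetical interface, and asyncio.run stands in for the event loop that @pytest.mark.asyncio would otherwise provide.

```python
import asyncio
from unittest.mock import AsyncMock, Mock


class SyncOnlyModel:
    """Hypothetical interface exposing only a synchronous generate()."""

    def generate(self, prompt: str) -> str:
        raise NotImplementedError


def test_spec_blocks_unwanted_attributes():
    # An unconstrained Mock invents any attribute on first access.
    loose = Mock()
    assert hasattr(loose, "agenerate")

    # With spec=, only attributes of SyncOnlyModel can be read.
    strict = Mock(spec=SyncOnlyModel)
    assert hasattr(strict, "generate")
    assert not hasattr(strict, "agenerate")


def test_async_mock_is_awaitable():
    # spec= restricts reads, not writes, so an async method can be added explicitly.
    model = Mock(spec=SyncOnlyModel)
    model.agenerate = AsyncMock(return_value="Async response")
    result = asyncio.run(model.agenerate("Test prompt"))
    assert result == "Async response"
    model.agenerate.assert_awaited_once_with("Test prompt")
```

If writes should be restricted as well, spec_set= is the stricter variant of the same argument.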

Schema Validation

Schema validation tests use OutputValidator and SchemaRegistry to validate outputs from ModelBenchmark, ProbeBenchmark, and comparison reports against registered schemas. Tests include round-trip JSON serialization to ensure that both in-memory and serialized forms are accepted. Because pydantic>=2.0.0 is included in the [dev] extras, most pydantic-related tests run rather than skip when the project is installed with .[dev] [source].
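The round-trip idea can be illustrated with a toy validator. The validate function and REQUIRED_FIELDS schema below are assumptions made for this sketch and do not reflect the OutputValidator or SchemaRegistry interfaces.

```python
import json

# Hypothetical schema: field name mapped to the expected Python type.
REQUIRED_FIELDS = {"model": str, "score": float}


def validate(payload: dict) -> dict:
    """Toy validator: every required field must be present with the right type."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(payload.get(field), expected_type):
            raise ValueError(f"field {field!r} must be of type {expected_type.__name__}")
    return payload


def test_round_trip_is_accepted():
    in_memory = {"model": "dummy", "score": 0.5}
    # The in-memory form must validate...
    validate(in_memory)
    # ...and so must the form obtained by serializing and parsing it again.
    serialized = json.loads(json.dumps(in_memory))
    assert validate(serialized) == in_memory


def test_wrong_type_is_rejected():
    try:
        validate({"model": "dummy", "score": "high"})
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError")
```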

Running Tests

The project uses pytest and pytest-asyncio as the primary testing frameworks. To run all tests, use:

pytest

To run with coverage reporting:

pytest --cov=insideLLMs --cov-report=term

To run a specific test file:

pytest tests/test_models.py

To skip slow or integration tests:

pytest -m "not slow and not integration"

Test coverage is enforced in CI with a minimum threshold (e.g., --cov-fail-under=80). The CI test job installs optional provider extras to ensure OpenAI, Anthropic, and NLP tests run rather than skip:

pip install -e ".[dev,openai,anthropic,nlp]"

If tests create run artifacts, set the environment variable INSIDELLMS_RUN_ROOT to an isolated directory (e.g., .tmp/insidellms_runs) [source].

Provider-specific modules that require optional dependencies not present in the base [dev] environment are excluded from coverage measurement ([tool.coverage.run] omit in pyproject.toml):

insideLLMs/models/openai.py, anthropic.py, huggingface.py, cohere.py, gemini.py, openrouter.py
insideLLMs/nlp/*, insideLLMs/publish/*, insideLLMs/crypto/*
insideLLMs/datasets/tuf_client.py, insideLLMs/integrations/*, insideLLMs/contrib/*

Type coverage is enforced via scripts/check_type_coverage.py with a minimum threshold of 90%.

Adding New Tests and Maintaining Coverage

To add a new test, create or update the relevant test file in tests/, following the naming convention test_*.py. For new models or probes, add the implementation, export and register it, and add corresponding tests in tests/test_models.py or tests/test_probes.py. Keep test methods atomic and focused, and follow the conventional commit style used in the repository (e.g., test: ..., feat(scope): ...). All pull requests must include or update tests for any behavior changes, and coverage is checked in CI [source].

Shared fixtures live in tests/conftest.py. Use them to set up environment variables and temporary directories for tests. For example, a fixture that sets INSIDELLMS_RUN_ROOT keeps run artifacts isolated per test session.
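A fixture along these lines could be sketched as below. The fixture name isolated_run_root and the directory name are hypothetical, and this version is function-scoped for simplicity because pytest's built-in monkeypatch fixture is itself function-scoped; the project's own conftest.py may be organized differently.

```python
import os
from pathlib import Path

import pytest


@pytest.fixture
def isolated_run_root(tmp_path: Path, monkeypatch: pytest.MonkeyPatch) -> Path:
    """Point INSIDELLMS_RUN_ROOT at a fresh temporary directory."""
    run_root = tmp_path / "insidellms_runs"
    run_root.mkdir()
    # monkeypatch undoes the environment change when the test finishes.
    monkeypatch.setenv("INSIDELLMS_RUN_ROOT", str(run_root))
    return run_root


def test_run_root_points_at_temp_dir(isolated_run_root: Path) -> None:
    assert os.environ["INSIDELLMS_RUN_ROOT"] == str(isolated_run_root)
    assert isolated_run_root.is_dir()
```

Any test that requests the fixture by name then writes its artifacts under the temporary directory instead of the working tree.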

Skipped tests are typically dependency-gated (e.g., visualization or schema validation tests that require optional libraries). The suite is large and spans all major subsystems: more than 4,600 tests pass, and over 100 are skipped when optional dependencies are missing [source].

Handling Optional Dependencies

The codebase uses two complementary patterns for skipping tests when optional dependencies are absent. They differ in granularity:

1. pytest.importorskip() at module level — skips the entire module if the package is missing. This prevents test collection failures in CI when optional dependencies are not installed:

import pytest

fernet_module = pytest.importorskip("cryptography.fernet")
Fernet = fernet_module.Fernet

def test_encryption():
    key = Fernet.generate_key()
    # ... test implementation

When pytest encounters importorskip for a missing package, it marks the entire module as skipped rather than failing during collection. This pattern is used in tests/test_models_openai.py, tests/test_models_anthropic.py, tests/test_models_huggingface.py, tests/test_nlp_tokenization.py, and others. The example above, taken from tests/test_encryption.py, applies it to the cryptography.fernet module [source].

2. pytest.mark.skipif with importlib.util.find_spec() — applied as a class- or function-level decorator for more explicit, granular skipping. This pattern is used in tests/test_model_error_handling.py and tests/test_probes_models_coverage.py:

import importlib.util
import pytest

_openai_available = importlib.util.find_spec("openai") is not None

@pytest.mark.skipif(not _openai_available, reason="openai not installed")
class TestOpenAIModelErrorHandling:
    ...  # tests that need the openai package go here

Both approaches ensure the test suite can run in minimal environments. Use pytest.importorskip() at module level when all tests in a file share the same optional dependency, and pytest.mark.skipif with importlib.util.find_spec() when only specific classes or functions within a file depend on an optional package.

Best Practices

Organize tests by subsystem in the tests/ directory.
Use mocks and async mocks to isolate units under test.
Mark async tests with pytest.mark.asyncio.
Use environment variables and fixtures to isolate test artifacts.
Ensure all new features and bug fixes include appropriate tests.
Maintain atomic commits and focused pull requests.
Monitor and enforce coverage thresholds in CI.
Use pytest.importorskip() at module level for optional dependencies to prevent collection failures, or pytest.mark.skipif with importlib.util.find_spec() for class- or function-level skipping.

This approach ensures that insideLLMs remains reliable, maintainable, and easy to extend as new features and components are added.