Model Backend Support

Supported LLM Providers: Gemini, Cohere, LlamaCpp, Ollama, and VLLM#

insideLLMs supports a unified interface for multiple LLM providers, including Gemini, Cohere, LlamaCpp, Ollama, and VLLM. All providers implement the same core methods (generate, chat, stream), enabling seamless integration with probes and tracing features. Models are registered in the system and can be instantiated directly or via the model_registry by name. Configuration is typically managed via YAML files or programmatically.

Common Interface and Usage#

All providers support the following methods:

generate(prompt, **kwargs): Single-turn text generation.
chat(messages, **kwargs): Multi-turn chat with a list of message dicts.
stream(prompt, **kwargs): Streaming text generation.

Models can be instantiated via the registry:

from insideLLMs import model_registry

model = model_registry.get("gemini", model_name="gemini-pro")

Or configured in YAML:

models:
  - type: gemini
    args:
      model_name: gemini-pro

Probes interact with models through this interface, allowing any probe to be run against any supported provider. Tracing (call counts, latency, metadata) is handled by the base model class and is available for all providers via methods like generate_with_metadata and the CLI harness output files Probes and Models.

Gemini#

Description: Hosted LLM provider using the Google Gemini API via the google-generativeai SDK.

Configuration:

Requires GOOGLE_API_KEY environment variable or explicit api_key parameter.
Install dependency: pip install google-generativeai

YAML example:

models:
  - type: gemini
    args:
      model_name: gemini-pro

Programmatic example:

from insideLLMs import model_registry
model = model_registry.get("gemini", model_name="gemini-pro")

Provider-Specific Options:

model_name: e.g., "gemini-pro", "gemini-1.5-flash"
safety_settings, generation_config: Optional advanced configuration.
Streaming and chat are supported.
Lists available models via list_models().

Considerations:

API key is required; raises an error if missing.
Hosted API—ensure network access.
Token counting and model listing are supported.

Cohere#

Description: Hosted LLM provider using the Cohere API via the cohere Python package.

Configuration:

Requires CO_API_KEY or COHERE_API_KEY environment variable or explicit api_key parameter.
Install dependency: pip install cohere

YAML example:

models:
  - type: cohere
    args:
      model_name: command-r-plus

Programmatic example:

from insideLLMs import model_registry
model = model_registry.get("cohere", model_name="command-r-plus")

Provider-Specific Options:

model_name: e.g., "command-r-plus"
default_preamble: Optional system prompt for chat.
Supports embeddings (embed) and reranking (rerank) in addition to generation and chat.
Streaming and chat are supported.

Considerations:

API key is required; raises an error if missing.
Hosted API—ensure network access.

LlamaCpp#

Description: Local LLM runner using llama-cpp-python and GGUF model files.

Configuration:

Requires path to a GGUF model file (model_path).
Install dependency: pip install llama-cpp-python

YAML example:

models:
  - type: llamacpp
    args:
      model_path: /path/to/model.gguf
      n_ctx: 4096

Programmatic example:

from insideLLMs import model_registry
model = model_registry.get("llamacpp", model_path="/path/to/model.gguf", n_ctx=4096)

Provider-Specific Options:

n_ctx, n_gpu_layers, seed, f16_kv, verbose: Control context size, GPU usage, reproducibility, and logging.
Streaming and chat are supported.

Considerations:

No API key required.
Model file must be present locally.
Suitable for running Llama, Mistral, and compatible models.

Ollama#

Description: Local LLM runner interfacing with models managed by a local Ollama server via the ollama Python package.

Configuration:

Requires model_name and optionally base_url (default: http://localhost:11434).
Ollama server must be running locally.
Install dependency: pip install ollama

YAML example:

models:
  - type: ollama
    args:
      model_name: llama3.2

Programmatic example:

from insideLLMs import model_registry
model = model_registry.get("ollama", model_name="llama3.2")

Provider-Specific Options:

base_url: Change if Ollama server runs on a non-default port or host.
Can pull models (pull()), list available models (list_models()), and show model info (show_model_info()).
Streaming and chat are supported.

Considerations:

No API key required.
Ollama server and models must be available locally.

VLLM#

Description: Local or remote LLM runner connecting to a vLLM server using the OpenAI-compatible API via the openai Python package.

Configuration:

Requires model_name and optionally base_url (default: http://localhost:8000).
vLLM server must be running and accessible.
Install dependency: pip install openai

YAML example:

models:
  - type: vllm
    args:
      model_name: meta-llama/Llama-3.1-8B-Instruct
      base_url: http://localhost:8000

Programmatic example:

from insideLLMs import model_registry
model = model_registry.get("vllm", model_name="meta-llama/Llama-3.1-8B-Instruct")

Provider-Specific Options:

base_url: Change if vLLM server runs on a non-default port or host.
Streaming and chat are supported.

Considerations:

No API key required for local servers.
vLLM server must be running and accessible.

Probes and Tracing#

All providers are compatible with insideLLMs probes, which evaluate model behavior (logic, bias, safety, factuality, code, etc.) through the unified interface. Probes can be selected and run against any registered model. Tracing (call counts, latency, and metadata) is handled by the base model class and is available for all providers. Use generate_with_metadata for detailed tracing, and review output files like records.jsonl and summary.json for run metadata.

Example probe usage:

from insideLLMs import model_registry, probe_registry, ProbeRunner

model = model_registry.get("gemini", model_name="gemini-pro")
probe = probe_registry.get("logic")
runner = ProbeRunner(model, probe)
results = runner.run(["What comes next: 1, 4, 9, 16, ?"])

Troubleshooting and Tips#

Hosted providers (Gemini, Cohere) require valid API keys and network access.
Local runners (LlamaCpp, Ollama, VLLM) require local services or model files to be available and running.
Ensure all required Python dependencies are installed for the chosen provider.
Use the info() method on any model to inspect provider, model ID, and supported features.
For advanced usage, consult the provider-specific methods (e.g., embed, rerank for Cohere, list_models for Gemini and Ollama).

For more details, see the Probes and Models documentation and the Getting Started guide.