LLM Provider Integrations

Overview of Supported LLM Backends#

insideLLMs supports multiple large language model (LLM) backends, including both hosted providers and local runners. All models implement a shared interface with methods such as generate, chat, and stream, enabling uniform interaction regardless of backend. Models are instantiated via a global registry, allowing dynamic selection and configuration by name. This document describes the integration, configuration, and special considerations for the following backends: Gemini, Cohere, LlamaCpp, Ollama, and VLLM.

Integration Mechanism#

Each backend is implemented as a Python class that subclasses the base Model interface. These classes are registered in a global model_registry using a lazy import factory pattern, which defers importing heavy dependencies until the model is instantiated. Models can be created directly or via the registry:

from insideLLMs import model_registry

model = model_registry.get("gemini", model_name="gemini-pro")

Models can also be loaded from configuration files specifying the type and arguments:

model:
  type: gemini
  args:
    model_name: gemini-pro

Backend Details#

Gemini#

Integration:
GeminiModel wraps Google's Gemini models via the google-generativeai SDK. It supports text generation, multi-turn chat, and streaming responses.

Configuration:
Requires a Google API key, provided via the api_key argument or the GOOGLE_API_KEY environment variable. The model_name (e.g., "gemini-pro", "gemini-1.5-flash") must be specified.

Special Considerations:
Requires the google-generativeai Python package. Streaming and chat are supported. Token counting and model listing are available via the SDK.
Source

Cohere#

Integration:
CohereModel integrates with Cohere's API using the cohere Python package. It supports text generation, chat, streaming, embeddings, and reranking.

Configuration:
Requires a Cohere API key, provided via the api_key argument or the CO_API_KEY or COHERE_API_KEY environment variables. The model_name (e.g., "command-r-plus") must be specified.

Special Considerations:
Requires the cohere Python package. Streaming and chat are supported. Additional methods for embeddings and reranking are available.
Source

LlamaCpp#

Integration:
LlamaCppModel wraps local LLMs using llama-cpp-python for GGUF models (e.g., Llama, Mistral). It supports streaming and chat.

Configuration:
Requires a model_path to a GGUF model file. Additional parameters include n_ctx (context size), n_gpu_layers, seed, and f16_kv (float16 cache).

Special Considerations:
Requires the llama-cpp-python package. No API key is needed. The model file must be available locally.
Source

Ollama#

Integration:
OllamaModel connects to models managed by a locally running Ollama server via HTTP API. It supports streaming and chat.

Configuration:
Requires a model_name (e.g., "llama3.2") and optionally a base_url (default: http://localhost:11434). No API key is required.

Special Considerations:
Requires the ollama Python package. The Ollama server must be running locally with the desired model pulled.
Source

VLLM#

Integration:
VLLMModel connects to a vLLM server for high-performance inference using the OpenAI-compatible API via the openai Python client. It supports streaming and chat.

Configuration:
Requires a model_name and optionally a base_url (default: http://localhost:8000). An API key can be provided if the server requires authentication.

Special Considerations:
Requires the openai Python package. The vLLM server must be running and accessible.
Source

Comparison Table#

Backend	Hosted/Local	API Key Required	Python Package	Streaming	Chat	Special Notes
Gemini	Hosted	Yes	google-generativeai	Yes	Yes	Google API key required
Cohere	Hosted	Yes	cohere	Yes	Yes	Embeddings, reranking
LlamaCpp	Local	No	llama-cpp-python	Yes	Yes	GGUF model file required
Ollama	Local	No	ollama	Yes	Yes	Ollama server must be running
VLLM	Local	Optional	openai	Yes	Yes	vLLM server must be running

Extending with New Providers#

To add a new LLM backend:

Subclass the Model Interface:
Implement a new class inheriting from Model and define required methods such as generate, chat, and stream.

from insideLLMs import Model

class MyCustomModel(Model):
    def generate(self, prompt: str, **kwargs) -> str:
        # Implement model inference logic
        return "custom response"

Register the Model:
Register the new model in the model_registry, optionally using a lazy import factory if the backend has heavy dependencies.

from insideLLMs.registry import model_registry

model_registry.register("mycustom", MyCustomModel)

For lazy imports:

def _lazy_import_factory(module_path: str, class_name: str):
    def factory(**kwargs):
        import importlib
        module = importlib.import_module(module_path)
        cls = getattr(module, class_name)
        return cls(**kwargs)
    return factory

model_registry.register("mycustom", _lazy_import_factory("my_module", "MyCustomModel"))

Configuration:
Add the new model to your configuration files using its registered name and required arguments.
```
model:
  type: mycustom
  args:
    model_name: my-model
```

Notes and Special Considerations#

Hosted providers (Gemini, Cohere) require API keys and their respective Python SDKs.
Local runners (LlamaCpp, Ollama, VLLM) require local model files or servers and their Python packages.
The info method on each model provides metadata such as model_id, provider, and capabilities, which may differ between backends.
All models support streaming and chat if the underlying backend supports these features.
The registry and configuration system allows for flexible extension and dynamic backend selection.

For more details, see the Probes and Models documentation and the source code for model integration.