Overview of Supported LLM Backends#
insideLLMs supports multiple large language model (LLM) backends, including both hosted providers and local runners. All models implement a shared interface with methods such as generate, chat, and stream, enabling uniform interaction regardless of backend. Models are instantiated via a global registry, allowing dynamic selection and configuration by name. This document describes the integration, configuration, and special considerations for the following backends: Gemini, Cohere, LlamaCpp, Ollama, and VLLM.
Integration Mechanism#
Each backend is implemented as a Python class that subclasses the base Model interface. These classes are registered in a global model_registry using a lazy import factory pattern, which defers importing heavy dependencies until the model is instantiated. Models can be created directly or via the registry:
from insideLLMs import model_registry
model = model_registry.get("gemini", model_name="gemini-pro")
Models can also be loaded from configuration files specifying the type and arguments:
model:
type: gemini
args:
model_name: gemini-pro
Backend Details#
Gemini#
Integration:
GeminiModel wraps Google's Gemini models via the google-generativeai SDK. It supports text generation, multi-turn chat, and streaming responses.
Configuration:
Requires a Google API key, provided via the api_key argument or the GOOGLE_API_KEY environment variable. The model_name (e.g., "gemini-pro", "gemini-1.5-flash") must be specified.
Special Considerations:
Requires the google-generativeai Python package. Streaming and chat are supported. Token counting and model listing are available via the SDK.
Source
Cohere#
Integration:
CohereModel integrates with Cohere's API using the cohere Python package. It supports text generation, chat, streaming, embeddings, and reranking.
Configuration:
Requires a Cohere API key, provided via the api_key argument or the CO_API_KEY or COHERE_API_KEY environment variables. The model_name (e.g., "command-r-plus") must be specified.
Special Considerations:
Requires the cohere Python package. Streaming and chat are supported. Additional methods for embeddings and reranking are available.
Source
LlamaCpp#
Integration:
LlamaCppModel wraps local LLMs using llama-cpp-python for GGUF models (e.g., Llama, Mistral). It supports streaming and chat.
Configuration:
Requires a model_path to a GGUF model file. Additional parameters include n_ctx (context size), n_gpu_layers, seed, and f16_kv (float16 cache).
Special Considerations:
Requires the llama-cpp-python package. No API key is needed. The model file must be available locally.
Source
Ollama#
Integration:
OllamaModel connects to models managed by a locally running Ollama server via HTTP API. It supports streaming and chat.
Configuration:
Requires a model_name (e.g., "llama3.2") and optionally a base_url (default: http://localhost:11434). No API key is required.
Special Considerations:
Requires the ollama Python package. The Ollama server must be running locally with the desired model pulled.
Source
VLLM#
Integration:
VLLMModel connects to a vLLM server for high-performance inference using the OpenAI-compatible API via the openai Python client. It supports streaming and chat.
Configuration:
Requires a model_name and optionally a base_url (default: http://localhost:8000). An API key can be provided if the server requires authentication.
Special Considerations:
Requires the openai Python package. The vLLM server must be running and accessible.
Source
Comparison Table#
| Backend | Hosted/Local | API Key Required | Python Package | Streaming | Chat | Special Notes |
|---|---|---|---|---|---|---|
| Gemini | Hosted | Yes | google-generativeai | Yes | Yes | Google API key required |
| Cohere | Hosted | Yes | cohere | Yes | Yes | Embeddings, reranking |
| LlamaCpp | Local | No | llama-cpp-python | Yes | Yes | GGUF model file required |
| Ollama | Local | No | ollama | Yes | Yes | Ollama server must be running |
| VLLM | Local | Optional | openai | Yes | Yes | vLLM server must be running |
Extending with New Providers#
To add a new LLM backend:
-
Subclass the Model Interface:
Implement a new class inheriting fromModeland define required methods such asgenerate,chat, andstream.from insideLLMs import Model class MyCustomModel(Model): def generate(self, prompt: str, **kwargs) -> str: # Implement model inference logic return "custom response" -
Register the Model:
Register the new model in themodel_registry, optionally using a lazy import factory if the backend has heavy dependencies.from insideLLMs.registry import model_registry model_registry.register("mycustom", MyCustomModel)For lazy imports:
def _lazy_import_factory(module_path: str, class_name: str): def factory(**kwargs): import importlib module = importlib.import_module(module_path) cls = getattr(module, class_name) return cls(**kwargs) return factory model_registry.register("mycustom", _lazy_import_factory("my_module", "MyCustomModel")) -
Configuration:
Add the new model to your configuration files using its registered name and required arguments.model: type: mycustom args: model_name: my-model
Notes and Special Considerations#
- Hosted providers (Gemini, Cohere) require API keys and their respective Python SDKs.
- Local runners (LlamaCpp, Ollama, VLLM) require local model files or servers and their Python packages.
- The
infomethod on each model provides metadata such as model_id, provider, and capabilities, which may differ between backends. - All models support streaming and chat if the underlying backend supports these features.
- The registry and configuration system allows for flexible extension and dynamic backend selection.
For more details, see the Probes and Models documentation and the source code for model integration.