Documents
LLM Provider Integrations
LLM Provider Integrations
Type
Document
Status
Published
Created
Jan 23, 2026
Updated
Jan 23, 2026
Updated by
Dosu Bot

Overview of Supported LLM Backends#

insideLLMs supports multiple large language model (LLM) backends, including both hosted providers and local runners. All models implement a shared interface with methods such as generate, chat, and stream, enabling uniform interaction regardless of backend. Models are instantiated via a global registry, allowing dynamic selection and configuration by name. This document describes the integration, configuration, and special considerations for the following backends: Gemini, Cohere, LlamaCpp, Ollama, and VLLM.

Integration Mechanism#

Each backend is implemented as a Python class that subclasses the base Model interface. These classes are registered in a global model_registry using a lazy import factory pattern, which defers importing heavy dependencies until the model is instantiated. Models can be created directly or via the registry:

from insideLLMs import model_registry

model = model_registry.get("gemini", model_name="gemini-pro")

Models can also be loaded from configuration files specifying the type and arguments:

model:
  type: gemini
  args:
    model_name: gemini-pro

Backend Details#

Gemini#

Integration:
GeminiModel wraps Google's Gemini models via the google-generativeai SDK. It supports text generation, multi-turn chat, and streaming responses.

Configuration:
Requires a Google API key, provided via the api_key argument or the GOOGLE_API_KEY environment variable. The model_name (e.g., "gemini-pro", "gemini-1.5-flash") must be specified.

Special Considerations:
Requires the google-generativeai Python package. Streaming and chat are supported. Token counting and model listing are available via the SDK.
Source

Cohere#

Integration:
CohereModel integrates with Cohere's API using the cohere Python package. It supports text generation, chat, streaming, embeddings, and reranking.

Configuration:
Requires a Cohere API key, provided via the api_key argument or the CO_API_KEY or COHERE_API_KEY environment variables. The model_name (e.g., "command-r-plus") must be specified.

Special Considerations:
Requires the cohere Python package. Streaming and chat are supported. Additional methods for embeddings and reranking are available.
Source

LlamaCpp#

Integration:
LlamaCppModel wraps local LLMs using llama-cpp-python for GGUF models (e.g., Llama, Mistral). It supports streaming and chat.

Configuration:
Requires a model_path to a GGUF model file. Additional parameters include n_ctx (context size), n_gpu_layers, seed, and f16_kv (float16 cache).

Special Considerations:
Requires the llama-cpp-python package. No API key is needed. The model file must be available locally.
Source

Ollama#

Integration:
OllamaModel connects to models managed by a locally running Ollama server via HTTP API. It supports streaming and chat.

Configuration:
Requires a model_name (e.g., "llama3.2") and optionally a base_url (default: http://localhost:11434). No API key is required.

Special Considerations:
Requires the ollama Python package. The Ollama server must be running locally with the desired model pulled.
Source

VLLM#

Integration:
VLLMModel connects to a vLLM server for high-performance inference using the OpenAI-compatible API via the openai Python client. It supports streaming and chat.

Configuration:
Requires a model_name and optionally a base_url (default: http://localhost:8000). An API key can be provided if the server requires authentication.

Special Considerations:
Requires the openai Python package. The vLLM server must be running and accessible.
Source

Comparison Table#

BackendHosted/LocalAPI Key RequiredPython PackageStreamingChatSpecial Notes
GeminiHostedYesgoogle-generativeaiYesYesGoogle API key required
CohereHostedYescohereYesYesEmbeddings, reranking
LlamaCppLocalNollama-cpp-pythonYesYesGGUF model file required
OllamaLocalNoollamaYesYesOllama server must be running
VLLMLocalOptionalopenaiYesYesvLLM server must be running

Extending with New Providers#

To add a new LLM backend:

  1. Subclass the Model Interface:
    Implement a new class inheriting from Model and define required methods such as generate, chat, and stream.

    from insideLLMs import Model
    
    class MyCustomModel(Model):
        def generate(self, prompt: str, **kwargs) -> str:
            # Implement model inference logic
            return "custom response"
    
  2. Register the Model:
    Register the new model in the model_registry, optionally using a lazy import factory if the backend has heavy dependencies.

    from insideLLMs.registry import model_registry
    
    model_registry.register("mycustom", MyCustomModel)
    

    For lazy imports:

    def _lazy_import_factory(module_path: str, class_name: str):
        def factory(**kwargs):
            import importlib
            module = importlib.import_module(module_path)
            cls = getattr(module, class_name)
            return cls(**kwargs)
        return factory
    
    model_registry.register("mycustom", _lazy_import_factory("my_module", "MyCustomModel"))
    
  3. Configuration:
    Add the new model to your configuration files using its registered name and required arguments.

    model:
      type: mycustom
      args:
        model_name: my-model
    

Notes and Special Considerations#

  • Hosted providers (Gemini, Cohere) require API keys and their respective Python SDKs.
  • Local runners (LlamaCpp, Ollama, VLLM) require local model files or servers and their Python packages.
  • The info method on each model provides metadata such as model_id, provider, and capabilities, which may differ between backends.
  • All models support streaming and chat if the underlying backend supports these features.
  • The registry and configuration system allows for flexible extension and dynamic backend selection.

For more details, see the Probes and Models documentation and the source code for model integration.