LLM Providers

Zeph supports multiple LLM backends. Choose based on your needs:

Provider	Type	Embeddings	Vision	Streaming	Best For
Ollama	Local	Yes	Yes	Yes	Privacy, free, offline
Claude	Cloud	No	Yes	Yes	Quality, reasoning, prompt caching
OpenAI	Cloud	Yes	Yes	Yes	Ecosystem, GPT-4o, GPT-5
Gemini	Cloud	Yes	Yes	Yes	Google ecosystem, long context, extended thinking
Compatible	Cloud	Varies	Varies	Varies	Together AI, Groq, Fireworks
Candle	Local	No	No	No	Minimal footprint

Claude does not support embeddings natively. To pair Claude chat with embeddings, use a multi-provider setup and set embed = true on an Ollama or OpenAI provider entry. Gemini supports embeddings through the text-embedding-004 model; set embedding_model in the Gemini [[llm.providers]] entry to enable them.

Quick Setup

Ollama (default — no API key needed):

ollama pull mistral:7b
ollama pull qwen3-embedding
zeph

Claude:

ZEPH_LLM_PROVIDER=claude ZEPH_CLAUDE_API_KEY=sk-ant-... zeph

OpenAI:

ZEPH_LLM_PROVIDER=openai ZEPH_OPENAI_API_KEY=sk-... zeph

Gemini:

ZEPH_LLM_PROVIDER=gemini ZEPH_GEMINI_API_KEY=AIza... zeph

Gemini

Zeph supports Google Gemini as a first-class LLM backend. Gemini is a strong choice when you want access to Google’s latest models (Gemini 2.5 Pro, Gemini 2.0 Flash), very long context windows, extended thinking, or native multimodal reasoning.

Why Gemini

Google’s Gemini 2.5 family brings extended thinking (visible as streaming Thinking chunks in Zeph’s TUI), native tool use, vision, and embeddings. For tasks that require deep reasoning over large codebases or long documents, Gemini’s context capacity complements Zeph’s existing RAG pipeline.

Integration Overview

The GeminiProvider translates Zeph’s internal message format to Gemini’s generateContent API:

The system prompt becomes a top-level systemInstruction field (Gemini’s required format).
The assistant role is mapped to "model" (Gemini’s terminology for the model turn).
Consecutive messages with the same role are automatically merged — Gemini requires strict user/model alternation.
If the conversation starts with a model turn, a synthetic empty user message is prepended to satisfy the API contract.
Tool definitions are converted to Gemini functionDeclarations with JSON schema normalization ($ref inlining, anyOf/oneOf → nullable, type name uppercasing).
Vision inputs are sent as inlineData parts with base64-encoded image data.
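
The role mapping, merging, and alternation rules above can be sketched as follows. This is a minimal illustration, not Zeph's actual GeminiProvider code: the Role, Message, and Content types and the to_gemini function are hypothetical, each turn is reduced to a single text part, and joining merged text with a newline is an assumption.

```rust
/// Hypothetical, simplified message types; Zeph's real types are not shown here.
#[derive(Debug, Clone, PartialEq)]
enum Role {
    System,
    User,
    Assistant,
}

#[derive(Debug, Clone, PartialEq)]
struct Message {
    role: Role,
    text: String,
}

/// One entry of Gemini's `contents` array, reduced to a role and one text part.
#[derive(Debug, Clone, PartialEq)]
struct Content {
    role: &'static str, // "user" or "model"
    text: String,
}

/// Returns (system_instruction, contents).
fn to_gemini(messages: &[Message]) -> (Option<String>, Vec<Content>) {
    let mut system: Option<String> = None;
    let mut contents: Vec<Content> = Vec::new();

    for msg in messages {
        let role = match msg.role {
            Role::System => {
                // System text is lifted out of the turn list into systemInstruction.
                let buf = system.get_or_insert_with(String::new);
                if !buf.is_empty() {
                    buf.push('\n'); // assumption: newline-joined
                }
                buf.push_str(&msg.text);
                continue;
            }
            Role::User => "user",
            Role::Assistant => "model", // Gemini's name for the assistant turn
        };

        // Merge into the previous turn when the role repeats.
        let repeats = contents.last().map(|c| c.role) == Some(role);
        if repeats {
            if let Some(last) = contents.last_mut() {
                last.text.push('\n'); // assumption: newline-joined
                last.text.push_str(&msg.text);
            }
        } else {
            contents.push(Content { role, text: msg.text.clone() });
        }
    }

    // A conversation that opens with a model turn gets a synthetic empty user turn.
    if contents.first().map(|c| c.role) == Some("model") {
        contents.insert(0, Content { role: "user", text: String::new() });
    }

    (system, contents)
}
```

Given a system message, an assistant message, and two consecutive user messages, this sketch returns the system text separately and three turns: an empty user turn, the model turn, and one merged user turn.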

Streaming uses streamGenerateContent?alt=sse. Thinking parts (returned with thought: true by Gemini 2.5 models) are surfaced as StreamChunk::Thinking and shown in the TUI sidebar.

Configuration

[llm]
[[llm.providers]]
type = "gemini"
model = "gemini-2.0-flash"           # default; use "gemini-2.5-pro" for extended thinking
max_tokens = 8192
# embedding_model = "text-embedding-004"  # enable Gemini embeddings (optional)
# thinking_level = "medium"              # minimal, low, medium, high (Gemini 2.5+)
# thinking_budget = 8192                 # token budget for thinking; -1 = dynamic, 0 = off
# include_thoughts = true                # surface thinking chunks in TUI
# base_url = "https://generativelanguage.googleapis.com/v1beta"  # default

Store the API key in the vault (recommended):

zeph vault set ZEPH_GEMINI_API_KEY AIza...

Or export it as an environment variable:

export ZEPH_GEMINI_API_KEY=AIza...

Alternatively, run zeph init and choose Gemini as the provider. The wizard prompts for the thinking level and generates a complete config with all Gemini parameters.

Capabilities

Feature	Gemini 2.0 Flash	Gemini 2.5 Pro
Chat	Yes	Yes
Streaming (SSE)	Yes	Yes
Tool use	Yes	Yes
Streaming tool use	Yes	Yes
Vision	Yes	Yes
Embeddings	Yes (`text-embedding-004`)	Yes (`text-embedding-004`)
Extended thinking	No	Yes (`thinking_level` / `thinking_budget`)
Remote model discovery	Yes	Yes

Embeddings

Set embedding_model in the Gemini [[llm.providers]] entry to enable Gemini embeddings. When it is set, supports_embeddings() returns true and Zeph calls POST /v1beta/models/{model}:embedContent for semantic memory and skill matching, so no Ollama dependency is required.

[[llm.providers]]
type = "gemini"
model = "gemini-2.0-flash"
embedding_model = "text-embedding-004"
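
To show how the configured base_url and model names relate to the endpoints named on this page, here is a small sketch. The helper names (endpoint, embed_url, stream_url) are hypothetical and not part of Zeph. The default base URL is taken from the commented base_url line in the configuration example, and authentication is left out.

```rust
/// Default taken from the commented `base_url` in the configuration example.
const DEFAULT_BASE_URL: &str = "https://generativelanguage.googleapis.com/v1beta";

/// Hypothetical helper: joins base URL, model, and method into an endpoint URL.
fn endpoint(base_url: &str, model: &str, method: &str) -> String {
    // Tolerate a trailing slash in a user-supplied base URL.
    format!("{}/models/{}:{}", base_url.trim_end_matches('/'), model, method)
}

/// URL for embedding requests (sent with POST).
fn embed_url(base_url: &str, embedding_model: &str) -> String {
    endpoint(base_url, embedding_model, "embedContent")
}

/// URL for streaming chat requests; `alt=sse` selects server-sent events.
fn stream_url(base_url: &str, model: &str) -> String {
    format!("{}?alt=sse", endpoint(base_url, model, "streamGenerateContent"))
}
```

With the defaults, embed_url(DEFAULT_BASE_URL, "text-embedding-004") yields a URL ending in /v1beta/models/text-embedding-004:embedContent.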

Streaming and Thinking

When streaming is active, Zeph emits chunks as they arrive from the SSE stream (streamGenerateContent?alt=sse). For Gemini 2.5 models that return thinking parts, the TUI shows a “Thinking…” indicator while the model reasons, then switches to the response stream. Streaming and non-streaming requests share the same retry infrastructure (send_with_retry): HTTP 429 (rate limit) and 503 (service unavailable) responses trigger automatic backoff and retry.
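
The retry behaviour can be illustrated with a short sketch. It is not the body of send_with_retry. The exponential schedule, the 500 ms base delay, the 30 s cap, and the function names are all assumptions chosen for the example; only the two retryable status codes come from the text above.

```rust
use std::time::Duration;

/// Statuses named as retryable: 429 (rate limit) and 503 (service unavailable).
fn is_retryable(status: u16) -> bool {
    matches!(status, 429 | 503)
}

/// Exponential backoff: base * 2^attempt, capped. The schedule is an assumption.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Calls `send` until it returns a non-retryable status or retries run out.
/// `sleep` is injected so the loop can be exercised without real waiting.
fn retry_with_backoff<S, W>(mut send: S, mut sleep: W, max_retries: u32) -> u16
where
    S: FnMut() -> u16,
    W: FnMut(Duration),
{
    let base = Duration::from_millis(500); // assumed base delay
    let cap = Duration::from_secs(30); // assumed upper bound
    let mut attempt = 0;
    loop {
        let status = send();
        if !is_retryable(status) || attempt >= max_retries {
            return status;
        }
        sleep(backoff_delay(attempt, base, cap));
        attempt += 1;
    }
}
```

In production code the sleep argument would wrap a real timer such as std::thread::sleep. Passing it in keeps the retry policy separate from the waiting mechanism.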

Configure thinking via thinking_level (categorical) or thinking_budget (token count). Both fields are optional and apply only to Gemini 2.5+ models.
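
As a rough sketch of how the two settings could be validated, the code below encodes the values listed in the configuration comments (-1 for dynamic, 0 for off, a positive token count otherwise, and the four level names). The ThinkingBudget type and both functions are hypothetical. Rejecting other negative values is an assumption, as the page does not say how Zeph treats them.

```rust
/// Hypothetical representation of the `thinking_budget` setting.
#[derive(Debug, PartialEq)]
enum ThinkingBudget {
    Dynamic,     // -1: budget chosen dynamically
    Off,         // 0: thinking disabled
    Tokens(u32), // n > 0: fixed token budget
}

/// Maps the raw integer from the config to a budget, rejecting other values.
fn parse_thinking_budget(raw: i64) -> Result<ThinkingBudget, String> {
    match raw {
        -1 => Ok(ThinkingBudget::Dynamic),
        0 => Ok(ThinkingBudget::Off),
        n if n > 0 && n <= i64::from(u32::MAX) => Ok(ThinkingBudget::Tokens(n as u32)),
        n => Err(format!("invalid thinking_budget: {}", n)),
    }
}

/// Accepts the four `thinking_level` values listed in the config comment.
fn is_valid_thinking_level(level: &str) -> bool {
    matches!(level, "minimal" | "low" | "medium" | "high")
}
```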

Streaming Tool Use

Gemini delivers functionCall parts as complete objects within a single SSE event (not incrementally chunked). The SSE parser collects all functionCall parts from the event’s parts array and emits a single StreamChunk::ToolUse with all tool calls. When an event contains both text and function call parts, tool calls take priority and any text in that event is dropped (matching the non-streaming behavior).

Streaming tool use is available on all Gemini models that support function calling, including Gemini 2.0 Flash.
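
The priority rule for mixed events can be sketched like this. The Part and ToolCall types, the StreamChunk::Content variant name, and the payload shapes are hypothetical. Only the Thinking and ToolUse chunk names come from this page. Dropping thought parts alongside text when tool calls are present is also an assumption.

```rust
/// Hypothetical, simplified view of one part in a Gemini SSE event.
#[derive(Debug, Clone, PartialEq)]
enum Part {
    Text(String),
    Thought(String),
    FunctionCall { name: String, args_json: String },
}

#[derive(Debug, Clone, PartialEq)]
struct ToolCall {
    name: String,
    args_json: String,
}

/// Chunk names follow the text where given; `Content` is a made-up name.
#[derive(Debug, Clone, PartialEq)]
enum StreamChunk {
    Content(String),
    Thinking(String),
    ToolUse(Vec<ToolCall>),
}

/// Converts the parts of one SSE event into stream chunks.
fn chunks_for_event(parts: &[Part]) -> Vec<StreamChunk> {
    // Function calls arrive complete, so they can be collected in one pass.
    let calls: Vec<ToolCall> = parts
        .iter()
        .filter_map(|p| match p {
            Part::FunctionCall { name, args_json } => Some(ToolCall {
                name: name.clone(),
                args_json: args_json.clone(),
            }),
            _ => None,
        })
        .collect();

    if !calls.is_empty() {
        // Tool calls take priority; text in the same event is dropped.
        return vec![StreamChunk::ToolUse(calls)];
    }

    // No tool calls: pass text and thinking parts through in order.
    parts
        .iter()
        .filter_map(|p| match p {
            Part::Text(t) => Some(StreamChunk::Content(t.clone())),
            Part::Thought(t) => Some(StreamChunk::Thinking(t.clone())),
            Part::FunctionCall { .. } => None,
        })
        .collect()
}
```

An event with one text part and two function calls produces a single ToolUse chunk holding both calls. An event with only thought and text parts produces one chunk per part.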

Switching Providers

Change the type field in the [[llm.providers]] entry and set model to a model that provider serves. All skills, memory, and tools work the same regardless of which provider is active.

[llm]
[[llm.providers]]
type = "claude"   # ollama, claude, openai, gemini, candle, compatible
model = "claude-sonnet-4-6"

Response Caching

Enable SQLite-backed response caching to avoid redundant LLM calls for identical requests. The cache key is a blake3 hash of the full message history and model name. Streaming responses bypass the cache.

[llm]
response_cache_enabled = true
response_cache_ttl_secs = 3600  # 1 hour (default)

See Memory and Context — LLM Response Cache for details.
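
To illustrate how a cache key can cover both the message history and the model name, here is a minimal sketch. It uses the standard library's DefaultHasher as a stand-in for blake3, which is not in the standard library, so the key here is a 64-bit value rather than a blake3 digest. The function names, the (role, text) pair representation, and the length-prefixed field encoding are assumptions, not Zeph's actual scheme.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Feeds one field with a length prefix so ("ab", "c") and ("a", "bc") differ.
fn write_field(hasher: &mut DefaultHasher, field: &str) {
    hasher.write_u64(field.len() as u64);
    hasher.write(field.as_bytes());
}

/// Stand-in for the blake3 key: covers the model name and every
/// (role, text) pair of the history, in order.
fn cache_key(model: &str, history: &[(&str, &str)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    write_field(&mut hasher, model);
    for (role, text) in history {
        write_field(&mut hasher, role);
        write_field(&mut hasher, text);
    }
    hasher.finish()
}
```

Identical history and model produce the same key, so a repeated request can be served from the cache. Changing the model or any message produces a different key.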

Deep Dives

Use a Cloud Provider — Claude, OpenAI, and compatible API setup
Model Orchestrator — multi-provider routing with fallback chains
Adaptive Inference — Thompson Sampling and EMA-based provider routing
Local Inference (Candle) — HuggingFace GGUF models
