LLM Providers

Zeph supports multiple LLM backends. Choose based on your needs:

| Provider | Type | Embeddings | Vision | Streaming | Best For |
|---|---|---|---|---|---|
| Ollama | Local | Yes | Yes | Yes | Privacy, free, offline |
| Claude | Cloud | No | Yes | Yes | Quality, reasoning, prompt caching |
| OpenAI | Cloud | Yes | Yes | Yes | Ecosystem, GPT-4o, GPT-5 |
| Gemini | Cloud | Yes | Yes | Yes | Google ecosystem, long context, extended thinking |
| Compatible | Cloud | Varies | Varies | Varies | Together AI, Groq, Fireworks |
| Candle | Local | No | No | No | Minimal footprint |

Claude does not support embeddings natively. To combine Claude chat with embeddings, use a multi-provider setup and set embed = true on an Ollama or OpenAI provider entry. Gemini supports embeddings via the text-embedding-004 model; set embedding_model in the Gemini [[llm.providers]] entry to enable them.

Quick Setup

Ollama (default — no API key needed):

ollama pull mistral:7b
ollama pull qwen3-embedding
zeph

Claude:

ZEPH_CLAUDE_API_KEY=sk-ant-... zeph

OpenAI:

ZEPH_LLM_PROVIDER=openai ZEPH_OPENAI_API_KEY=sk-... zeph

Gemini:

ZEPH_LLM_PROVIDER=gemini ZEPH_GEMINI_API_KEY=AIza... zeph

Gemini

Zeph supports Google Gemini as a first-class LLM backend. Gemini is a strong choice when you want access to Google’s latest models (Gemini 2.5 Pro, Gemini 2.0 Flash), very long context windows, extended thinking, or native multimodal reasoning.

Why Gemini

Google’s Gemini 2.5 family brings extended thinking (visible as streaming Thinking chunks in Zeph’s TUI), native tool use, vision, and embeddings. For tasks that require deep reasoning over large codebases or long documents, Gemini’s context capacity complements Zeph’s existing RAG pipeline.

Integration Overview

The GeminiProvider translates Zeph’s internal message format to Gemini’s generateContent API:

  • The system prompt becomes a top-level systemInstruction field (Gemini’s required format).
  • The assistant role is mapped to "model" (Gemini’s terminology for the model turn).
  • Consecutive messages with the same role are automatically merged — Gemini requires strict user/model alternation.
  • If the conversation starts with a model turn, a synthetic empty user message is prepended to satisfy the API contract.
  • Tool definitions are converted to Gemini functionDeclarations with JSON schema normalization ($ref inlining, anyOf/oneOf → nullable, type name uppercasing).
  • Vision inputs are sent as inlineData parts with base64-encoded image data.
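
The role mapping, merging, and leading-turn rules in the list above can be sketched roughly as follows. This is a std-only illustration, not Zeph's actual code: the Role, Message, and to_gemini_contents names are hypothetical, messages are reduced to plain text, and joining merged text with a newline is an assumption of the sketch.

```rust
/// Hypothetical stand-in for Zeph's internal roles (not the actual Zeph types).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    System,
    User,
    Assistant,
}

/// Hypothetical internal message: a role plus plain text.
#[derive(Debug, Clone)]
pub struct Message {
    pub role: Role,
    pub text: String,
}

/// Returns (system_instruction, contents). Each contents entry is a
/// (gemini_role, text) pair whose role is either "user" or "model".
pub fn to_gemini_contents(messages: &[Message]) -> (Option<String>, Vec<(&'static str, String)>) {
    let mut system_parts: Vec<&str> = Vec::new();
    let mut contents: Vec<(&'static str, String)> = Vec::new();

    for msg in messages {
        let role = match msg.role {
            // System text is lifted out into the top-level systemInstruction.
            Role::System => {
                system_parts.push(msg.text.as_str());
                continue;
            }
            Role::User => "user",
            // Gemini calls the assistant turn "model".
            Role::Assistant => "model",
        };

        // Merge consecutive same-role messages to keep strict alternation.
        // Joining with a newline is an assumption of this sketch.
        if let Some((last_role, last_text)) = contents.last_mut() {
            if *last_role == role {
                last_text.push('\n');
                last_text.push_str(&msg.text);
                continue;
            }
        }
        contents.push((role, msg.text.clone()));
    }

    // A conversation must not open with a model turn: prepend an empty user turn.
    if matches!(contents.first(), Some((role, _)) if *role == "model") {
        contents.insert(0, ("user", String::new()));
    }

    let system = if system_parts.is_empty() {
        None
    } else {
        Some(system_parts.join("\n"))
    };
    (system, contents)
}
```

Because merging happens before the leading-turn check, the output always alternates user/model and always starts with a user turn, whatever order the input arrives in.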

Streaming uses streamGenerateContent?alt=sse. Thinking parts (returned with thought: true by Gemini 2.5 models) are surfaced as StreamChunk::Thinking and shown in the TUI sidebar.

Configuration

[llm]
[[llm.providers]]
type = "gemini"
model = "gemini-2.0-flash"           # default; use "gemini-2.5-pro" for extended thinking
max_tokens = 8192
# embedding_model = "text-embedding-004"  # enable Gemini embeddings (optional)
# thinking_level = "medium"              # minimal, low, medium, high (Gemini 2.5+)
# thinking_budget = 8192                 # token budget for thinking; -1 = dynamic, 0 = off
# include_thoughts = true                # surface thinking chunks in TUI
# base_url = "https://generativelanguage.googleapis.com/v1beta"  # default

Store the API key in the vault (recommended):

zeph vault set ZEPH_GEMINI_API_KEY AIza...

Or export it as an environment variable:

export ZEPH_GEMINI_API_KEY=AIza...

Run zeph init and choose Gemini as the provider to have the wizard generate a complete config with all Gemini parameters, including the thinking level prompt.

Capabilities

| Feature | Gemini 2.0 Flash | Gemini 2.5 Pro |
|---|---|---|
| Chat | Yes | Yes |
| Streaming (SSE) | Yes | Yes |
| Tool use | Yes | Yes |
| Streaming tool use | Yes | Yes |
| Vision | Yes | Yes |
| Embeddings | Yes (text-embedding-004) | Yes (text-embedding-004) |
| Extended thinking | No | Yes (thinking_level / thinking_budget) |
| Remote model discovery | Yes | Yes |

Embeddings

Set embedding_model in the Gemini [[llm.providers]] entry to enable Gemini embeddings. When set, supports_embeddings() returns true and Zeph uses POST /v1beta/models/{model}:embedContent for semantic memory and skill matching — no Ollama dependency required.

[[llm.providers]]
type = "gemini"
model = "gemini-2.0-flash"
embedding_model = "text-embedding-004"

Streaming and Thinking

When streaming is active, Zeph emits chunks as they arrive from the SSE stream (streamGenerateContent?alt=sse). For Gemini 2.5 models that return thinking parts, the TUI shows a “Thinking…” indicator while the model reasons and then switches to the response stream. The streaming and non-streaming paths share the same retry infrastructure (send_with_retry): HTTP 429 (rate limit) and 503 (service unavailable) responses trigger automatic backoff and retry.
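
The retry behavior can be pictured with a small std-only sketch. The text names send_with_retry but does not show its signature or backoff policy, so everything here is an assumption for illustration: the retry_with_backoff and is_retryable names are hypothetical, the delay simply doubles after each retryable response, and the request is modeled as a closure returning an HTTP status code.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Status codes treated as transient, per the text above.
pub fn is_retryable(status: u16) -> bool {
    matches!(status, 429 | 503)
}

/// Calls `send` until it returns a non-retryable status or `max_attempts`
/// is reached, doubling the delay after each retryable response.
/// At least one call is always made. Returns the final status and the
/// number of attempts. The doubling policy is an assumption of this sketch.
pub fn retry_with_backoff<F>(mut send: F, max_attempts: u32, base_delay: Duration) -> (u16, u32)
where
    F: FnMut() -> u16,
{
    let mut delay = base_delay;
    let mut attempt = 0;
    loop {
        attempt += 1;
        let status = send();
        if !is_retryable(status) || attempt >= max_attempts {
            return (status, attempt);
        }
        sleep(delay);
        delay = delay.saturating_mul(2);
    }
}
```

A production policy would typically also add jitter and honor any Retry-After header; both are left out to keep the sketch short.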

Configure thinking via thinking_level (categorical) or thinking_budget (token count). Both fields are optional and apply only to Gemini 2.5+ models.

Streaming Tool Use

Gemini delivers functionCall parts as complete objects within a single SSE event (not incrementally chunked). The SSE parser collects all functionCall parts from the event’s parts array and emits a single StreamChunk::ToolUse with all tool calls. When an event contains both text and function call parts, tool calls take priority and any text in that event is dropped (matching the non-streaming behavior).

Streaming tool use is available on all Gemini models that support function calling, including Gemini 2.0 Flash.
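
The priority rule for mixed events can be sketched as below. The Part and StreamChunk types here are simplified stand-ins (the real StreamChunk has other variants and richer payloads), chunk_from_parts is a hypothetical name, and function arguments are kept as raw JSON strings to avoid external crates.

```rust
/// Simplified stand-in for one entry of a Gemini event's `parts` array.
#[derive(Debug, Clone, PartialEq)]
pub enum Part {
    Text(String),
    /// Function name plus raw JSON arguments, kept as a string in this sketch.
    FunctionCall { name: String, args_json: String },
}

/// Simplified stand-in for Zeph's stream chunk type (hypothetical payloads).
#[derive(Debug, Clone, PartialEq)]
pub enum StreamChunk {
    Text(String),
    /// (name, args_json) pairs for every call found in the event.
    ToolUse(Vec<(String, String)>),
}

/// Maps the parts of a single SSE event to at most one chunk.
pub fn chunk_from_parts(parts: &[Part]) -> Option<StreamChunk> {
    // Collect every functionCall part from the event.
    let calls: Vec<(String, String)> = parts
        .iter()
        .filter_map(|part| match part {
            Part::FunctionCall { name, args_json } => Some((name.clone(), args_json.clone())),
            Part::Text(_) => None,
        })
        .collect();

    // Tool calls take priority: any text in the same event is dropped.
    if !calls.is_empty() {
        return Some(StreamChunk::ToolUse(calls));
    }

    // No tool calls: concatenate the text parts, if any.
    let text: String = parts
        .iter()
        .filter_map(|part| match part {
            Part::Text(t) => Some(t.as_str()),
            Part::FunctionCall { .. } => None,
        })
        .collect();

    if text.is_empty() {
        None
    } else {
        Some(StreamChunk::Text(text))
    }
}
```

Since Gemini sends each functionCall as a complete object, no cross-event accumulation of partial arguments is needed, which is why a pure function over one event's parts is enough here.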

Switching Providers

Change the type field in the [[llm.providers]] entry. All skills, memory, and tools work the same regardless of which provider is active.

[llm]
[[llm.providers]]
type = "claude"   # ollama, claude, openai, gemini, candle, compatible
model = "claude-sonnet-4-6"

Response Caching

Enable SQLite-backed response caching to avoid redundant LLM calls for identical requests. The cache key is a blake3 hash of the full message history and model name. Streaming responses bypass the cache.

[llm]
response_cache_enabled = true
response_cache_ttl_secs = 3600  # 1 hour (default)

See Memory and Context — LLM Response Cache for details.
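
To illustrate why the cache key covers both the message history and the model name, here is a std-only sketch. It deliberately swaps blake3 for the standard library's DefaultHasher so that it runs without external crates; the cache_key function and its (role, text) history representation are hypothetical and do not reflect Zeph's actual key layout.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derives a cache key from the model name and the full message history.
/// NOTE: DefaultHasher is a stand-in for blake3 in this sketch; it yields a
/// 64-bit value and is not suitable as a real content hash.
pub fn cache_key(model: &str, history: &[(&str, &str)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    model.hash(&mut hasher);
    // Hash the length so histories of different sizes cannot alias.
    history.len().hash(&mut hasher);
    for (role, text) in history {
        role.hash(&mut hasher);
        text.hash(&mut hasher);
    }
    hasher.finish()
}
```

Identical requests to the same model map to the same key, while changing the model or adding a turn to the history produces a different key and therefore a cache miss.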

Deep Dives