LLM Providers
Zeph supports multiple LLM backends. Choose based on your needs:
| Provider | Type | Embeddings | Vision | Streaming | Best For |
|---|---|---|---|---|---|
| Ollama | Local | Yes | Yes | Yes | Privacy, free, offline |
| Claude | Cloud | No | Yes | Yes | Quality, reasoning, prompt caching |
| OpenAI | Cloud | Yes | Yes | Yes | Ecosystem, GPT-4o, GPT-5 |
| Gemini | Cloud | Yes | Yes | Yes | Google ecosystem, long context, extended thinking |
| Compatible | Cloud | Varies | Varies | Varies | Together AI, Groq, Fireworks |
| Candle | Local | No | No | No | Minimal footprint |
Claude does not support embeddings natively. Use a multi-provider setup with embed = true on an Ollama or OpenAI provider entry to pair Claude chat with embeddings from that provider. Gemini supports embeddings via the text-embedding-004 model; set embedding_model in the Gemini [[llm.providers]] entry to enable them.
Quick Setup
Ollama (default — no API key needed):
ollama pull mistral:7b
ollama pull qwen3-embedding
zeph
Claude:
ZEPH_CLAUDE_API_KEY=sk-ant-... zeph
OpenAI:
ZEPH_LLM_PROVIDER=openai ZEPH_OPENAI_API_KEY=sk-... zeph
Gemini:
ZEPH_LLM_PROVIDER=gemini ZEPH_GEMINI_API_KEY=AIza... zeph
Gemini
Zeph supports Google Gemini as a first-class LLM backend. Gemini is a strong choice when you want access to Google’s latest models (Gemini 2.5 Pro, Gemini 2.0 Flash), very long context windows, extended thinking, or native multimodal reasoning.
Why Gemini
Google’s Gemini 2.5 family brings extended thinking (visible as streaming Thinking chunks in Zeph’s TUI), native tool use, vision, and embeddings. For tasks that require deep reasoning over large codebases or long documents, Gemini’s context capacity complements Zeph’s existing RAG pipeline.
Integration Overview
The GeminiProvider translates Zeph’s internal message format to Gemini’s generateContent API:
- The system prompt becomes a top-level systemInstruction field (Gemini’s required format).
- The assistant role is mapped to "model" (Gemini’s terminology for the model turn).
- Consecutive messages with the same role are automatically merged, because Gemini requires strict user/model alternation.
- If the conversation starts with a model turn, a synthetic empty user message is prepended to satisfy the API contract.
- Tool definitions are converted to Gemini functionDeclarations with JSON schema normalization ($ref inlining, anyOf/oneOf → nullable, type name uppercasing).
- Vision inputs are sent as inlineData parts with base64-encoded image data.
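The first four translation rules above can be sketched in a few lines of Rust. This is a minimal illustration, not Zeph’s actual GeminiProvider code: the Role, Message, and GeminiTurn types and the to_gemini function are hypothetical names, messages are reduced to plain text, and joining merged messages with a newline is an assumption.

```rust
/// Hypothetical stand-ins for Zeph's internal message types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Role {
    System,
    User,
    Assistant,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Message {
    pub role: Role,
    pub text: String,
}

/// One turn in Gemini's contents array; role is "user" or "model".
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GeminiTurn {
    pub role: &'static str,
    pub text: String,
}

/// Returns (system_instruction, contents).
pub fn to_gemini(messages: &[Message]) -> (Option<String>, Vec<GeminiTurn>) {
    let mut system: Vec<&str> = Vec::new();
    let mut turns: Vec<GeminiTurn> = Vec::new();

    for msg in messages {
        let role = match msg.role {
            // System text is lifted out into the top-level systemInstruction.
            Role::System => {
                system.push(&msg.text);
                continue;
            }
            Role::User => "user",
            // Gemini calls the assistant turn "model".
            Role::Assistant => "model",
        };
        match turns.last_mut() {
            // Merge consecutive same-role messages to keep strict alternation.
            // Joining with a newline is an assumption of this sketch.
            Some(prev) if prev.role == role => {
                prev.text.push('\n');
                prev.text.push_str(&msg.text);
            }
            _ => turns.push(GeminiTurn { role, text: msg.text.clone() }),
        }
    }

    // A conversation may not open with a model turn: prepend an empty user turn.
    if turns.first().map(|t| t.role) == Some("model") {
        turns.insert(0, GeminiTurn { role: "user", text: String::new() });
    }

    let system = if system.is_empty() { None } else { Some(system.join("\n")) };
    (system, turns)
}
```

Given a history of system, assistant, user, user, this sketch yields one system instruction and three turns: an empty user turn, the model turn, and a single merged user turn.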
Streaming uses streamGenerateContent?alt=sse. Thinking parts (returned with thought: true by Gemini 2.5 models) are surfaced as StreamChunk::Thinking and shown in the TUI sidebar.
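As a rough illustration of how the thought flag might drive chunk selection, the sketch below routes a text part to one of two chunk kinds. TextPart and classify are hypothetical names, and this StreamChunk is a reduced stand-in: only the Thinking variant is named in this document, and Content is assumed here for ordinary response text.

```rust
/// Simplified view of one text entry in a Gemini response's parts array.
/// Hypothetical type: only the fields needed for this sketch are modelled.
pub struct TextPart {
    pub text: String,
    /// Mirrors the thought flag that Gemini 2.5 models set on thinking parts.
    pub thought: bool,
}

/// Reduced stand-in for Zeph's StreamChunk; Content is an assumed variant name.
#[derive(Debug, PartialEq, Eq)]
pub enum StreamChunk {
    Thinking(String),
    Content(String),
}

/// Thinking parts go to the Thinking chunk; everything else is response text.
pub fn classify(part: TextPart) -> StreamChunk {
    if part.thought {
        StreamChunk::Thinking(part.text)
    } else {
        StreamChunk::Content(part.text)
    }
}
```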
Configuration
[llm]
[[llm.providers]]
type = "gemini"
model = "gemini-2.0-flash" # default; use "gemini-2.5-pro" for extended thinking
max_tokens = 8192
# embedding_model = "text-embedding-004" # enable Gemini embeddings (optional)
# thinking_level = "medium" # minimal, low, medium, high (Gemini 2.5+)
# thinking_budget = 8192 # token budget for thinking; -1 = dynamic, 0 = off
# include_thoughts = true # surface thinking chunks in TUI
# base_url = "https://generativelanguage.googleapis.com/v1beta" # default
Store the API key in the vault (recommended):
zeph vault set ZEPH_GEMINI_API_KEY AIza...
Or export it as an environment variable:
export ZEPH_GEMINI_API_KEY=AIza...
Run zeph init and choose Gemini as the provider to have the wizard generate a complete config with all Gemini parameters, including the thinking level prompt.
Capabilities
| Feature | Gemini 2.0 Flash | Gemini 2.5 Pro |
|---|---|---|
| Chat | Yes | Yes |
| Streaming (SSE) | Yes | Yes |
| Tool use | Yes | Yes |
| Streaming tool use | Yes | Yes |
| Vision | Yes | Yes |
| Embeddings | Yes (text-embedding-004) | Yes (text-embedding-004) |
| Extended thinking | No | Yes (thinking_level / thinking_budget) |
| Remote model discovery | Yes | Yes |
Embeddings
Set embedding_model in the Gemini [[llm.providers]] entry to enable Gemini embeddings. When set, supports_embeddings() returns true and Zeph uses POST /v1beta/models/{model}:embedContent for semantic memory and skill matching — no Ollama dependency required.
[[llm.providers]]
type = "gemini"
model = "gemini-2.0-flash"
embedding_model = "text-embedding-004"
Streaming and Thinking
When streaming is active, Zeph emits chunks as they arrive from the SSE stream (streamGenerateContent?alt=sse). For Gemini 2.5 models that return thinking parts, the TUI shows a “Thinking…” indicator while the model reasons and then switches to the response stream. Both paths use the same retry infrastructure (send_with_retry) — HTTP 429 (rate limit) and 503 (service unavailable) responses trigger automatic backoff and retry.
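The retry behavior described above can be pictured with a small exponential-backoff loop. This is a sketch under stated assumptions, not the real send_with_retry: the function names, the doubling schedule, the delay cap, and the attempt limit are all hypothetical, and the request is reduced to a closure that returns an HTTP status code.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Status codes the text names as retryable: rate limit and service unavailable.
pub fn is_retryable(status: u16) -> bool {
    matches!(status, 429 | 503)
}

/// Exponential backoff: base * 2^attempt, capped at max.
/// The doubling schedule is an assumption of this sketch.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(max)
}

/// Calls send until it returns a non-retryable status or attempts run out.
/// Always calls send at least once and returns the last status seen.
pub fn retry_with_backoff<F>(max_attempts: u32, base: Duration, max: Duration, mut send: F) -> u16
where
    F: FnMut() -> u16,
{
    let mut attempt = 0;
    loop {
        let status = send();
        attempt += 1;
        if !is_retryable(status) || attempt >= max_attempts {
            return status;
        }
        // Wait longer after each failed attempt before trying again.
        sleep(backoff_delay(attempt - 1, base, max));
    }
}
```

With a 100 ms base and a 5 s cap, the waits in this sketch grow as 100 ms, 200 ms, 400 ms, and so on until they reach the cap.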
Configure thinking via thinking_level (categorical) or thinking_budget (token count). Both fields are optional and apply only to Gemini 2.5+ models.
Streaming Tool Use
Gemini delivers functionCall parts as complete objects within a single SSE event (not incrementally chunked). The SSE parser collects all functionCall parts from the event’s parts array and emits a single StreamChunk::ToolUse with all tool calls. When an event contains both text and function call parts, tool calls take priority and any text in that event is dropped (matching the non-streaming behavior).
Streaming tool use is available on all Gemini models that support function calling, including Gemini 2.0 Flash.
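The priority rule for mixed events can be sketched as follows. The Part and ToolCall types and the event_to_chunk function are hypothetical, StreamChunk is a reduced stand-in with an assumed Content variant, and function arguments are kept as raw JSON text so the example needs no JSON library.

```rust
/// Simplified model of one entry in an SSE event's parts array (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Part {
    Text(String),
    /// args holds raw JSON text to avoid a JSON dependency in this sketch.
    FunctionCall { name: String, args: String },
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToolCall {
    pub name: String,
    pub args: String,
}

/// Reduced stand-in for the relevant StreamChunk variants.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StreamChunk {
    Content(String),
    ToolUse(Vec<ToolCall>),
}

/// Turns the parts of one SSE event into at most one chunk.
pub fn event_to_chunk(parts: Vec<Part>) -> Option<StreamChunk> {
    let mut text = String::new();
    let mut calls = Vec::new();
    for part in parts {
        match part {
            Part::Text(t) => text.push_str(&t),
            Part::FunctionCall { name, args } => calls.push(ToolCall { name, args }),
        }
    }
    if !calls.is_empty() {
        // Tool calls take priority; any text in the same event is dropped.
        Some(StreamChunk::ToolUse(calls))
    } else if !text.is_empty() {
        Some(StreamChunk::Content(text))
    } else {
        None
    }
}
```

An event carrying one text part and two function calls therefore produces a single ToolUse chunk with both calls and no text.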
Switching Providers
Change the type field in the [[llm.providers]] entry. All skills, memory, and tools work the same regardless of which provider is active.
[llm]
[[llm.providers]]
type = "claude" # ollama, claude, openai, gemini, candle, compatible
model = "claude-sonnet-4-6"
Response Caching
Enable SQLite-backed response caching to avoid redundant LLM calls for identical requests. The cache key is a blake3 hash of the full message history and model name. Streaming responses bypass the cache.
[llm]
response_cache_enabled = true
response_cache_ttl_secs = 3600 # 1 hour (default)
See Memory and Context — LLM Response Cache for details.
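To see why only identical requests hit the cache, consider a key derived from the model name plus every message in the history. The sketch below uses the standard library’s DefaultHasher purely as a dependency-free stand-in; Zeph uses blake3, and the cache_key function and its tuple-based history are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derives a cache key from the model name and the full message history.
/// NOTE: DefaultHasher is a std-only stand-in; the text says Zeph uses blake3.
/// Each history entry is a (role, text) pair.
pub fn cache_key(model: &str, history: &[(&str, &str)]) -> u64 {
    let mut hasher = DefaultHasher::new();
    model.hash(&mut hasher);
    // Hashing the slice covers its length and every (role, text) pair in order.
    history.hash(&mut hasher);
    hasher.finish()
}
```

Changing the model or adding, removing, or editing any message changes the key, so the request goes to the provider instead of being served from the cache.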
Deep Dives
- Use a Cloud Provider — Claude, OpenAI, and compatible API setup
- Model Orchestrator — multi-provider routing with fallback chains
- Adaptive Inference — Thompson Sampling and EMA-based provider routing
- Local Inference (Candle) — HuggingFace GGUF models