Problem
Users who configure a local model as their main provider (Ollama, vLLM, llama.cpp) and have no cloud API keys still get billed on OpenRouter. The auxiliary model system (compression, vision, memory flush) hardcodes `google/gemini-3-flash-preview` as the fallback at `auxiliary_client.py:127-128`:
```python
_OPENROUTER_MODEL = "google/gemini-3-flash-preview"
_NOUS_MODEL = "google/gemini-3-flash-preview"
```
If the user's local model is slow to respond for auxiliary tasks, the system falls through to this hardcoded OpenRouter model, even when the user never configured an OpenRouter key for this setup (the client picks up a leftover key from a previous setup in `.env`).
User reports
From Reddit: "I woke up this morning with a three digits hole of intense gemini flash calls on open router. With a local model configured for compression in the yaml, but in JIT. Hermes don't like it — it fallback on Gemini Flash if it's not very fast, even if you populate the yaml for a local compression."
Another user built a skill that patches the hardcoded values on every update:
```bash
sed -i 's/_OPENROUTER_MODEL = "google\/gemini-3-flash-preview"/_OPENROUTER_MODEL = "minimax\/minimax-m2.5"/' agent/auxiliary_client.py
```
Expected behavior
If `auxiliary.compression.provider: custom` is set in `config.yaml` with a `base_url`, the system should use only that endpoint, with no silent fallback to OpenRouter. If the local model is slow, wait for it. If it fails, raise an error; don't silently bill a cloud provider the user didn't authorize. An illustrative config is sketched below.
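For concreteness, this is the shape of the configuration at issue, shown as the dict a resolver would see after loading `config.yaml`. The `provider` and `base_url` keys come from this report; the concrete values are placeholders:

```python
# Illustrative only: key names follow the report; values are placeholders.
config = {
    "auxiliary": {
        "compression": {
            "provider": "custom",                     # explicit user choice
            "base_url": "http://localhost:11434/v1",  # e.g. a local Ollama endpoint
        },
    },
}
```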
Current behavior
The fallback chain at `auxiliary_client.py:754-776` (`_resolve_auto`):
OpenRouter → Nous Portal → Custom endpoint → Codex → API key provider
Even with an explicit `auxiliary.compression.provider: custom` in config, if the custom endpoint is slow or times out, the system falls through to the next provider in the chain, as the sketch below illustrates.
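A minimal, self-contained sketch of the failure mode. The names are hypothetical stand-ins for the real `_resolve_auto` internals, not the actual code:

```python
# Hypothetical stand-in for the chain at auxiliary_client.py:754-776.
_CHAIN = ["openrouter", "nous", "custom", "codex", "api_key"]

class ProviderError(Exception):
    """Stands in for timeouts / HTTP errors from a provider."""

def try_provider(provider: str, task_cfg: dict) -> str:
    # Placeholder for the real request logic; assume a slow local
    # endpoint surfaces here as ProviderError.
    raise ProviderError(provider)

def resolve(task_cfg: dict) -> str:
    """Current behavior, simplified: an explicit provider choice only
    changes where the walk starts, not whether it keeps walking."""
    chosen = task_cfg.get("provider", "auto")
    start = _CHAIN.index(chosen) if chosen in _CHAIN else 0
    for provider in _CHAIN[start:] + _CHAIN[:start]:
        try:
            return try_provider(provider, task_cfg)
        except ProviderError:
            continue  # a slow "custom" silently hands off to a paid provider
    raise RuntimeError("no auxiliary provider available")
```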
Suggested fix
When a user explicitly configures `auxiliary.{task}.provider: custom`, respect it as a hard constraint and don't fall back to cloud providers. The fallback chain should only apply when `provider: auto` (the default), as in the sketch below.
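One way to express that constraint, continuing the hypothetical sketch above (a shape for the fix, not the actual patch to `_resolve_auto`):

```python
def resolve_fixed(task_cfg: dict) -> str:
    """Suggested behavior: an explicit provider is a hard constraint;
    the fallback chain only applies to provider: auto."""
    chosen = task_cfg.get("provider", "auto")
    if chosen != "auto":
        # Explicit choice: use it or fail loudly. No cloud fallback,
        # so a slow local endpoint can never become a surprise bill.
        return try_provider(chosen, task_cfg)
    for provider in _CHAIN:
        try:
            return try_provider(provider, task_cfg)
        except ProviderError:
            continue
    raise RuntimeError("no auxiliary provider available")
```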
Related