
fix(models): preserve /api/models cache metadata#1239

Closed
franksong2702 wants to merge 2 commits into nesquena:master from franksong2702:franksong2702/model-cache-metadata

Conversation

@franksong2702
Contributor

Thinking Path

  • Hermes WebUI uses /api/models as more than a display list: the frontend reads it for provider identity, the configured default model, live-model enrichment, and model mismatch checks.
  • The cold model catalog build returns the full shape: active_provider, default_model, and groups.
  • The disk cache only persisted groups, and the load path accepted any cache with groups as valid.
  • That meant a disk-cache hit could serve an incomplete /api/models payload even though config.yaml, /api/settings, and /api/providers still had the correct provider/default-model state.
  • This PR fixes the cache schema at the backend source-of-truth layer so cached model catalogs keep the metadata the frontend depends on.
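For concreteness, the difference between the full catalog payload and the legacy cached shape can be sketched as follows. Field values and group structure here are illustrative, not the actual Hermes payload contents:

```python
# Hypothetical full /api/models payload, as produced by a cold catalog build.
full_payload = {
    "active_provider": "openrouter",       # illustrative provider name
    "default_model": "example-default",    # illustrative model id
    "groups": [{"name": "Chat", "models": ["model-a", "model-b"]}],
}

# The legacy disk cache persisted only the display groups, silently dropping
# the metadata the frontend relies on for provider identity and default model.
legacy_payload = {"groups": full_payload["groups"]}

# Which metadata a legacy cache hit loses:
missing = sorted(set(full_payload) - set(legacy_payload))
print(missing)
```

A legacy cache hit therefore serves a payload missing `active_provider` and `default_model`, even though `config.yaml` and the other endpoints still hold the correct state.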

What Changed

  • Added validation for the /api/models disk-cache payload shape.
  • Disk cache is now valid only when it includes:
    • active_provider
    • default_model
    • groups
  • Legacy groups-only cache files are rejected and rebuilt instead of being served.
  • The cache write path now preserves the full metadata payload instead of writing only groups.
  • Added focused regression coverage for:
    • preserving metadata when saving cache
    • rejecting legacy groups-only cache files
    • rebuilding through get_available_models() instead of returning the legacy cache
  • Added an Unreleased changelog entry.

Why It Matters

Without this metadata, the frontend can lose the configured model/provider source of truth and fall back to weaker state such as the current dropdown value, stale localStorage, or static picker ordering.

This is intentionally a small reliability fix. It does not redesign the model picker or provider settings; it makes the existing /api/models contract stable across restarts and disk-cache hits.

Verification

  • python -m pytest tests/test_model_cache_metadata.py
  • python -m pytest tests/test_sprint11.py tests/test_ttl_cache.py tests/test_profile_switch_1200.py
  • python -m py_compile api/config.py
  • git diff --check

Risks / Follow-ups

Model Used

OpenAI Codex, GPT-5. AI-assisted implementation with a delegated worker sub-agent, followed by coordinator review and verification.

@nesquena-hermes
Collaborator

Thanks for this targeted reliability fix, @franksong2702! The cache metadata preservation problem is real — a groups-only cache hit silently breaking active_provider and default_model for the frontend is exactly the kind of subtle state drift that causes hard-to-reproduce model picker issues.

The approach looks correct:

Cache schema validation
Requiring all three keys (active_provider, default_model, groups) before a cache file is considered valid, and rejecting legacy groups-only files for rebuild, is the right call. Better to do a cold rebuild than serve a partially valid payload.

A few things to confirm before merge:

TTL cache interaction
get_available_models() uses a TTL in-memory cache alongside the disk cache. When a legacy disk cache is rejected and a rebuild is triggered, does the TTL in-memory cache also get invalidated/refreshed? Or could the rebuilt disk cache be correct while the in-memory TTL cache still holds the old groups-only shape for the duration of the TTL window?

Cache invalidation on active_provider change
If the user switches providers (updating config.yaml), the disk cache now stores active_provider. When they switch back or change provider, is the old cached active_provider value detected as stale, or does the cache only get invalidated based on TTL/file age? A cache hit with a stale active_provider would re-introduce the same class of bug this PR is fixing.

Test coverage for partial metadata
The PR covers the groups-only legacy case. Is there test coverage for a disk cache that has active_provider and groups but is missing default_model (or vice versa)? The validation should reject any partial combination.

Changelog entry
The Unreleased entry is appreciated. Confirming it's under the correct section and doesn't duplicate an existing entry.

The test file looks focused and the change to api/config.py is appropriately scoped. This looks close to ready — the TTL cache interaction question is the main thing worth confirming before merge. Thanks for the detailed Thinking Path!

🤖 Automated triage via nesquena-hermes

@franksong2702
Contributor Author

Addressed the cache-shape and invalidation edge cases from review in f3babd1.

What changed:

  • TTL memory cache now goes through the same shape validation before being served. If a legacy groups-only payload or partial metadata payload is still in memory, WebUI discards it and rebuilds instead of serving it for the 24h TTL window.
  • Config mtime changes now also discard any disk payload that was preloaded before the config reload path, so stale active_provider / default_model metadata cannot be served after config.yaml changes.
  • Added regression coverage for partial metadata payloads: missing active_provider, missing default_model, missing groups, and invalid field types.
  • Left the changelog entry under [Unreleased] / Fixed, which matches this patch as a user-visible bug fix.
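The first two items above can be sketched together as an in-memory cache that re-validates both payload shape and config freshness on every read. Class and method names are hypothetical, not the actual implementation in `api/config.py`:

```python
import os
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # the 24h TTL window mentioned above

class ModelCatalogCache:
    """Illustrative in-memory TTL cache for the /api/models payload."""

    def __init__(self, config_path):
        self.config_path = config_path
        self._payload = None
        self._stored_at = 0.0
        self._config_mtime = 0.0

    def put(self, payload):
        self._payload = payload
        self._stored_at = time.monotonic()
        self._config_mtime = os.path.getmtime(self.config_path)

    def get(self, validate):
        if self._payload is None:
            return None
        if time.monotonic() - self._stored_at > CACHE_TTL_SECONDS:
            return None  # TTL expired -> rebuild
        if os.path.getmtime(self.config_path) != self._config_mtime:
            return None  # config.yaml changed -> metadata may be stale
        if not validate(self._payload):
            return None  # legacy/partial shape -> discard and rebuild
        return self._payload
```

The key property is that a cache hit is only served after passing the same shape validation as the disk path, so a stale groups-only payload cannot survive in memory for the TTL window.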

Verification:

  • uv run --with pytest --with pyyaml python -m pytest tests/test_model_cache_metadata.py -> 6 passed
  • uv run --with pytest --with pyyaml python -m pytest tests/test_model_cache_metadata.py tests/test_sprint11.py tests/test_ttl_cache.py tests/test_profile_switch_1200.py -> 28 passed
  • python3 -m py_compile api/config.py
  • git diff --check

@nesquena-hermes
Collaborator

Thanks for the follow-up commit f3babd1, @franksong2702 — this directly addresses all three open questions from the previous review.

Confirming the resolutions:

  • TTL memory cache shape validation — in-memory cache now validates the full three-key shape before serving. A groups-only payload in memory is discarded and rebuilt rather than served for the TTL window. This closes the race between disk cache rebuild and in-memory TTL.
  • Config mtime invalidation — config.yaml changes now also discard disk-preloaded payloads with stale active_provider / default_model. The class of bug this PR is fixing cannot re-enter through a provider switch.
  • Partial metadata regression coverage — four new cases: missing active_provider, missing default_model, missing groups, invalid field types. The validation rejects all partial combinations correctly.
  • Changelog — [Unreleased] / Fixed is the correct section for a user-visible bug fix.

The 28-test pass across test_model_cache_metadata.py, test_sprint11.py, test_ttl_cache.py, and test_profile_switch_1200.py is a solid cross-suite confirmation that the fix doesn't regress the surrounding model cache behavior.

This looks ready for merge. The implementation is clean and the test coverage is thorough.

🤖 Automated triage via nesquena-hermes

@nesquena-hermes
Collaborator

Merged in v0.50.237 via #1243. Thank you @franksong2702! 🎉
