Skip to content

Harden auto title generation for reasoning models#1026

Closed
nesquena-hermes wants to merge 1 commit intomasterfrom
integrate/pr1002
Closed

Harden auto title generation for reasoning models#1026
nesquena-hermes wants to merge 1 commit intomasterfrom
integrate/pr1002

Conversation

@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Attribution

This is a fresh PR carrying the exact code from @franksong2702's original PR #1002, which was closed when the repo briefly blocked new PRs. Full credit to Frank for the work.

Original PR: #1002


Summary

This fixes the concrete first-exchange title-generation failure discussed in #869 for reasoning-style models such as Ollama Cloud kimi-k2.6.

The root cause is not title quality. It is an output-budget / response-shape mismatch: reasoning models can spend the completion budget in reasoning and return empty content with finish_reason: length, so Hermes falls back to the first-message local summary.

Changes

  • Raises the title-generation completion budget to a reasoning-safe 512 tokens.
  • Retries once with 1024 tokens when the response is empty because of finish_reason: length or reasoning-only output.
  • Applies the same retry policy to both routes:
    • configured auxiliary.title_generation route
    • active-agent fallback route
  • Preserves MiniMax reasoning_split as a provider-specific enhancement, but no longer treats MiniMax as the only reasoning-model case.
  • Improves diagnostics: when local fallback is used, title_status now preserves the underlying reason, e.g. local_summary:llm_length_aux.
  • Keeps the existing auxiliary.title_generation path intact so users can still route title generation to a cheaper/faster dedicated model.

Scope

This PR targets the bug side of #869: provisional first-message titles not being replaced after the first assistant reply when title generation returns empty content.

It does not implement continuous title re-summarization as the conversation evolves. That remains a separate feature/design scope.

Verification

  • python3 -m py_compile api/streaming.py
  • pytest tests/test_title_aux_routing.py -q
  • pytest tests/test_sprint41.py tests/test_title_sanitization.py tests/test_issues_853_857.py tests/test_title_aux_routing.py -q

@nesquena-hermes nesquena-hermes added bug Something isn't working merge soon Reviewed and queued for next release batch labels Apr 25, 2026
Copy link
Copy Markdown
Owner

@nesquena nesquena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — end-to-end ✅ (clean, no fixes needed)

Traced against upstream hermes-agent

Fresh nousresearch/hermes-agent tarball pulled. Pure WebUI change inside api/streaming.pyagent.auxiliary_client.call_llm (the agent-side surface) is called as before, just with a larger max_tokens and a retry loop. Agent code untouched. ✓

What this PR does

Hardens auto title generation against reasoning models that route output into message.reasoning and finish with finish_reason: length and empty content. Direct fix for the concrete bug surfaced by @TUBEOF in #869 with kimi-k2.6 on Ollama Cloud.

Six surgical changes in api/streaming.py:

  1. Title budget bumped 160 → 512. Removes the MiniMax-only 384 special case; everyone gets the reasoning-safe baseline.
  2. _title_retry_completion_budget() returns 1024 — used on retry when the first call returned empty due to length/reasoning.
  3. _extract_title_response(resp, *, aux=False) — new helper. Returns (content, empty_status). Distinguishes llm_length, llm_empty_reasoning, and generic llm_empty based on finish_reason and presence of reasoning/reasoning_content/thinking fields.
  4. Retry loop in both generate_title_raw_via_aux and generate_title_raw_via_agent. Inner for budget_idx, max_tokens in enumerate(budgets) runs once at 512, then if _title_retry_status(last_status) triggers (length / empty_reasoning), appends 1024 and runs again. Retries are scoped to the same prompt index — outer for prompt in prompts continues unchanged.
  5. last_status tracking — both routes now return the most specific failure status (llm_length_aux, llm_empty_reasoning_aux, etc.) instead of the generic llm_error_aux on exhaustion.
  6. local_summary:{llm_status} — when the local summary fallback is used, _put_title_status includes the underlying LLM failure reason. The frontend console.info('[title]', …) listener already logs reason verbatim, so existing diagnostics flows pick this up automatically.

The MiniMax reasoning_split: true extra_body is preserved as a provider-specific enhancement (not removed alongside the budget change).

End-to-end trace (the #869 happy path)

User in Hermes WebUI: types first message, sends. Background title gen fires.

  1. generate_title_raw_via_aux calls agent.auxiliary_client.call_llm with max_tokens=512.
  2. Provider returns {choices[0]: {message: {content: "", reasoning: "..."}, finish_reason: "length"}}.
  3. _extract_title_response('', 'llm_length_aux'). last_status = 'llm_length_aux'.
  4. _title_retry_status('llm_length_aux') is True → append 1024 to budgets.
  5. Loop runs again with max_tokens=1024.
  6. Provider now has enough headroom — emits {message: {content: "Useful Title"}, finish_reason: "stop"}.
  7. Returns ('Useful Title', 'llm_aux_retry'). Sidebar gets the real title.

If retry also fails, last_status propagates up. _run_background_title_update sees source='fallback', builds fallback_reason = 'local_summary:llm_length_aux', emits via _put_title_status. Frontend console: [title] {status: 'fallback', reason: 'local_summary:llm_length_aux', ...} — operator can now distinguish "retry exhausted on length" from "provider exception" without reading server logs.

Edge-case trace

Scenario Behaviour
Non-reasoning model (most providers) First call at 512 returns content → no retry. Status llm_aux. ✅
Reasoning model with adequate budget First call at 512 returns content → no retry. Status llm_aux. ✅
Reasoning model that needs more headroom (the #869 case) First call empty+length → retry at 1024 → content. Status llm_aux_retry. ✅
Reasoning model that needs >1024 tokens Both calls empty → status llm_length_aux → fallback to local summary with reason local_summary:llm_length_aux. ✅
Provider returns 5xx First call raises → last_status = 'llm_error_aux', no retry, fallback with local_summary:llm_error_aux. ✅
Provider returns empty content + finish_reason=stop (no reasoning) llm_empty_aux, no retry (not in _title_retry_status set). Falls back. ✅ — correct, retrying length would just waste tokens.
Active-agent path (no auxiliary.title_generation configured) Same retry logic mirrored in generate_title_raw_via_agent for OpenAI, anthropic_messages, codex_responses api modes. The OpenAI branch uses _extract_title_response(resp) for status; codex/anthropic branches use a generic llm_empty (less precise but safe — these api modes don't typically return reasoning fields). ✅
MiniMax (uses reasoning_split extra_body) Still gets reasoning_split=True. Budget went from 384 → 512 (mild bump). Retry at 1024 still applies. ✅
Budget cost concern Title gen runs once per session (after first exchange), and only when prior call hit length/empty-reasoning. Worst case: 512+1024 = 1536 reasoning-tokens per session. For a typical 10-message session at 100K tokens, this is <2% overhead. Acceptable. ✅
_safe_obj_value MagicMock detection Audited: production code paths only ever pass real provider response objects (module.startswith('unittest.mock') is False). Detection only fires when legacy MagicMock test fixtures are used; new tests use dicts/SimpleNamespace. Mild ugliness in production code but no functional risk. Not worth pushing a fix. ✅

Backwards compatibility

  • Frontend static/messages.js:658 only console.infos the title_status payload — no exact-match comparisons on reason. Adding :{llm_status} suffix is purely additive diagnostic info. ✅
  • Existing test tests/test_sprint41.py::test_streaming_emits_title_status_sse_event checks for put_event('title_status', payload) literal — unaffected. ✅
  • No protocol changes, no new fields, no DB migrations.

Tests

  • tests/test_title_aux_routing.py22/22 pass (5 new tests added: budget defaults, aux retry, aux exhausted status, agent retry, fallback reason preservation)
  • tests/test_sprint41.py, test_title_sanitization.py, test_issues_853_857.py73/73 pass
  • Full local suite: 2090 passed, 47 skipped, 1 deselected, 0 unrelated failed (test_workspace_add_rejects_system_paths fails on master too — pre-existing)
  • python3 -c "import ast; ast.parse(open('api/streaming.py').read())" — clean

Security audit

  • No new input paths. max_tokens and extra_body payloads are server-internal constants.
  • _extract_title_response does string-only comparisons on finish_reason (lowercased) and bounded getattr on response objects. No injection vector.
  • _safe_text_value uses str(value or '').strip() — safe coercion.
  • Fallback reason local_summary:{llm_status} only includes server-controlled status strings (llm_length, llm_empty_reasoning, etc.) — no user-controlled content. ✓

Recommendation

This is the cleanest of the three "integrate" PRs. Surgical fix for a real reproducible bug (TUBEOF's curl output in #869 confirmed kimi-k2.6 returns reasoning + empty content + length). The fix matches the diagnosis exactly: bump budget, retry on length/empty-reasoning, preserve diagnostic reason on fallback.

The MiniMax-specific 384 budget is replaced with a uniform 512 baseline plus a generic retry. This is a net architectural simplification — fewer provider-specific branches.

Credit: @franksong2702 for the original work in #1002 and the systematic diagnosis in the #869 thread.

Closes the bug-fix portion of #869. The continuous re-summarization feature request (titles that adapt as the conversation evolves) remains a separate scope, as the PR description correctly notes.

Approved. Ready for merge. Independent of #1024 and #1025 — no merge-order constraints.

@nesquena-hermes
Copy link
Copy Markdown
Collaborator Author

Thanks for the detailed writeup, @franksong2702 (via @nesquena-hermes reconstruction) — this is a clean, targeted bug fix for the reasoning-model title-generation failure.

What looks good:

  • Root cause is correctly identified: reasoning models spend their completion budget in reasoning tokens and return empty content with finish_reason: length — not a title quality issue, but a budget mismatch.
  • The two-stage budget strategy (512 → retry at 1024 on length failure) is proportionate and avoids over-allocating on every call.
  • Applying the same retry policy to both routes (configured auxiliary.title_generation and the active-agent fallback) ensures consistent behavior regardless of routing.
  • Preserving the MiniMax reasoning_split path while generalizing the retry logic is the right call — no regressions for users already relying on that path.
  • title_status: local_summary:llm_length_aux diagnostic makes it much easier to distinguish "LLM returned empty due to budget" from other fallback reasons.
  • test_title_aux_routing.py covers the core new behavior.

One minor note:

The retry cap is 1024 tokens. For very large reasoning model contexts this may still hit finish_reason: length — but this is an edge case and the fallback behavior (local summary) is reasonable. A follow-up could expose title_max_tokens as a config option if needed, but that's out of scope here.

This looks ready. ✅

nesquena-hermes added a commit that referenced this pull request Apr 25, 2026
…n UX, folder create, model errors, session speed, title gen (#1031)

* fix: remove orphaned i18n keys from top-level LOCALES object

Three Traditional Chinese translation keys (cmd_status, memory_saved,
profile_delete_title) were placed outside any locale block between the
en and ru blocks in static/i18n.js. They became top-level properties
of the LOCALES object, causing them to appear as invalid language
options in the Settings > Preferences dropdown.

The correct translations already exist in the zh-Hant locale block.

Fixes #1008

* fix: block stale SSE events from polluting new session's DOM

- appendThinking(): guard with !S.session||!S.activeStreamId to drop
  events from a previous session's SSE stream during a session switch
- appendLiveToolCard(): same guard for consistency
- finalizeThinkingCard(): scroll thinking-card-body to top when
  scroll is pinned, so completed response is immediately visible
- appendThinking(): auto-scroll thinking card body to bottom while
  streaming if user is watching (scroll pinned)

* Fix empty agent sessions in sidebar

* fix: resolve cron UI UX issues — icon ambiguity, toast overlap, running status

Fixes #995 — three sub-issues in the Cron Jobs UI:

1. Dual play icons ambiguous: Resume button now shows a distinct
   play+bar icon (play triangle + vertical line) instead of the
   identical triangle used by Run now.

2. Toast notification overlapping header buttons: Added
   position:relative; z-index:10 to .main-view-header so it
   stacks above the fixed toast (z-index:100 within its layer).

3. No running status after trigger: After triggering a job, the
   status badge immediately shows 'running…' with a CSS spinner
   animation, and polls the cron list every 3s (up to 30s) to
   refresh when the job completes.

- Added cron_status_running i18n key in all 5 locales (en, es, de, ru, zh, zh-Hant)
- Added .detail-badge.running CSS class with spinner animation
- New functions: _setCronDetailStatus(), _startCronRunningPoll()

* fix(#1011): address review feedback — poll cleanup, badge persistence, 30s fallback

- _clearCronDetail() now clears _cronRunningPoll interval on navigation
- Poll re-applies 'running' badge after loadCrons() re-render (prevents flicker)
- When poll ends (30s max), detail re-renders with actual status as fallback

* feat: create folder and add space directly from UI (#782)

- After creating a folder via the file tree New folder button, offer to add it as a space via confirm dialog
- Add Create folder if it doesnt exist checkbox in the New Space form
- Backend: support create flag in /api/workspaces/add to mkdir before validation
- i18n: 4 new keys (folder_add_as_space_title/msg/btn, workspace_auto_create_folder) in all 6 locales

* fix: validate workspace path before mkdir to prevent orphan directories

Review feedback (critical): the previous code called mkdir() before
validate_workspace_to_add(), which meant a rejected path (e.g. system dir)
would leave an orphan directory on disk.

New flow:
1. Resolve path and check against blocked system roots BEFORE any mutation
2. mkdir() only if path passes the blocklist check
3. Full validation (exists, is_dir) after mkdir

Also imports _workspace_blocked_roots for the pre-mutation blocklist check.

* fix(#1014): classify model-not-found errors with helpful message

- Add model_not_found error type to streaming.py exception classifier
- Detect 404, 'not found', 'does not exist', 'invalid model' patterns
- Strip HTML tags from provider error messages (nginx 404 pages, etc.)
- Add model_not_found branch to apperror handler in messages.js
- Add i18n key model_not_found_label in all 6 locales
- 15 tests covering detection, sanitization, frontend, and i18n

* feat(ui): add live TPS stat to header

Adds a TPS (Tokens Per Second) chip to the right of the header title bar
that updates live while AI output is streaming.

Metering (api/metering.py)
- Tracks per-session output + reasoning tokens via GlobalMeter singleton
- Per-session TPS = total_tokens / elapsed_time
- Global TPS = average of active sessions' TPS values
- HIGH/LOW are max/min of global_tps snapshots over a 60-minute rolling
  window (only recorded when > 0, so idle periods are excluded)
- Thread-safe with a single lock

Metering events emitted from streaming.py
- Throttled at 100ms from token/reasoning/tool callbacks so the display
  updates rapidly during fast token streams
- 1Hz ticker as fallback for slow streams (exits when no active sessions)
- Final stats emitted on stream end

Routes (api/routes.py)
- Removed POST /api/metering/interval endpoint (dynamic interval via
  focus/blur was replaced with simple always-1s-when-active approach)

UI (static/messages.js, index.html, style.css)
- TPS chip in titlebar: shows 'N.N t/s . N.N high . N.N low'
- Default: '0.0 t/s . 0.0 high' when idle
- Display updates on every metering SSE event (throttled to 100ms)

* feat: session restore speed + title gen reasoning hardening (#1025, #1026)

PR #1025 (@franksong2702): Speed up large session restore paths
- GET /api/session?messages=0 now parses only metadata before the messages array
- Metadata-only loads no longer populate the full-session LRU cache
- Frontend lazy fetch uses resolve_model=0 to avoid cold model-catalog lookup
- Hard reload no longer waits for populateModelDropdown() before restoring session

PR #1026 (@franksong2702): Harden auto title generation for reasoning models
- Raises title-gen completion budget to 512 tokens (reasoning-safe)
- Retries once with 1024 tokens on empty content / finish_reason:length
- Applies retry to both auxiliary and active-agent fallback routes
- Preserves underlying failure reason in title_status on local fallback

Co-authored-by: Frank Song <[email protected]>

* feat: session attention indicators in right slot + last_message_at timestamps (#1024)

PR #1024 (@franksong2702): Polish session attention indicators

- Streaming spinners and unread dots now reuse the right-side actions slot
- Running/unread rows hide timestamps; idle/read rows keep right-aligned timestamps
- Date group carets point down when expanded, right when collapsed
- Pinned group no longer repeats pinned-star icon per row
- Running indicators appear immediately after send (local busy state while /api/sessions catches up)
- Sidebar sorting/grouping/timestamps now prefer last_message_at (derived from last real message)
  so metadata-only saves don't make old sessions appear under Today

Co-authored-by: Frank Song <[email protected]>

* docs: v0.50.207 release notes — 10 PRs, 2169 tests (+36)

---------

Co-authored-by: bergeouss <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Frank Song <[email protected]>
Co-authored-by: nesquena-hermes <[email protected]>
@nesquena-hermes
Copy link
Copy Markdown
Collaborator Author

Merged in v0.50.207 via stage PR #1031. Thank you @franksong2702! 🎉

@nesquena-hermes nesquena-hermes deleted the integrate/pr1002 branch April 25, 2026 20:08
JKJameson pushed a commit to JKJameson/hermes-webui that referenced this pull request Apr 29, 2026
…n UX, folder create, model errors, session speed, title gen (nesquena#1031)

* fix: remove orphaned i18n keys from top-level LOCALES object

Three Traditional Chinese translation keys (cmd_status, memory_saved,
profile_delete_title) were placed outside any locale block between the
en and ru blocks in static/i18n.js. They became top-level properties
of the LOCALES object, causing them to appear as invalid language
options in the Settings > Preferences dropdown.

The correct translations already exist in the zh-Hant locale block.

Fixes nesquena#1008

* fix: block stale SSE events from polluting new session's DOM

- appendThinking(): guard with !S.session||!S.activeStreamId to drop
  events from a previous session's SSE stream during a session switch
- appendLiveToolCard(): same guard for consistency
- finalizeThinkingCard(): scroll thinking-card-body to top when
  scroll is pinned, so completed response is immediately visible
- appendThinking(): auto-scroll thinking card body to bottom while
  streaming if user is watching (scroll pinned)

* Fix empty agent sessions in sidebar

* fix: resolve cron UI UX issues — icon ambiguity, toast overlap, running status

Fixes nesquena#995 — three sub-issues in the Cron Jobs UI:

1. Dual play icons ambiguous: Resume button now shows a distinct
   play+bar icon (play triangle + vertical line) instead of the
   identical triangle used by Run now.

2. Toast notification overlapping header buttons: Added
   position:relative; z-index:10 to .main-view-header so it
   stacks above the fixed toast (z-index:100 within its layer).

3. No running status after trigger: After triggering a job, the
   status badge immediately shows 'running…' with a CSS spinner
   animation, and polls the cron list every 3s (up to 30s) to
   refresh when the job completes.

- Added cron_status_running i18n key in all 5 locales (en, es, de, ru, zh, zh-Hant)
- Added .detail-badge.running CSS class with spinner animation
- New functions: _setCronDetailStatus(), _startCronRunningPoll()

* fix(nesquena#1011): address review feedback — poll cleanup, badge persistence, 30s fallback

- _clearCronDetail() now clears _cronRunningPoll interval on navigation
- Poll re-applies 'running' badge after loadCrons() re-render (prevents flicker)
- When poll ends (30s max), detail re-renders with actual status as fallback

* feat: create folder and add space directly from UI (nesquena#782)

- After creating a folder via the file tree New folder button, offer to add it as a space via confirm dialog
- Add Create folder if it doesnt exist checkbox in the New Space form
- Backend: support create flag in /api/workspaces/add to mkdir before validation
- i18n: 4 new keys (folder_add_as_space_title/msg/btn, workspace_auto_create_folder) in all 6 locales

* fix: validate workspace path before mkdir to prevent orphan directories

Review feedback (critical): the previous code called mkdir() before
validate_workspace_to_add(), which meant a rejected path (e.g. system dir)
would leave an orphan directory on disk.

New flow:
1. Resolve path and check against blocked system roots BEFORE any mutation
2. mkdir() only if path passes the blocklist check
3. Full validation (exists, is_dir) after mkdir

Also imports _workspace_blocked_roots for the pre-mutation blocklist check.

* fix(nesquena#1014): classify model-not-found errors with helpful message

- Add model_not_found error type to streaming.py exception classifier
- Detect 404, 'not found', 'does not exist', 'invalid model' patterns
- Strip HTML tags from provider error messages (nginx 404 pages, etc.)
- Add model_not_found branch to apperror handler in messages.js
- Add i18n key model_not_found_label in all 6 locales
- 15 tests covering detection, sanitization, frontend, and i18n

* feat(ui): add live TPS stat to header

Adds a TPS (Tokens Per Second) chip to the right of the header title bar
that updates live while AI output is streaming.

Metering (api/metering.py)
- Tracks per-session output + reasoning tokens via GlobalMeter singleton
- Per-session TPS = total_tokens / elapsed_time
- Global TPS = average of active sessions' TPS values
- HIGH/LOW are max/min of global_tps snapshots over a 60-minute rolling
  window (only recorded when > 0, so idle periods are excluded)
- Thread-safe with a single lock

Metering events emitted from streaming.py
- Throttled at 100ms from token/reasoning/tool callbacks so the display
  updates rapidly during fast token streams
- 1Hz ticker as fallback for slow streams (exits when no active sessions)
- Final stats emitted on stream end

Routes (api/routes.py)
- Removed POST /api/metering/interval endpoint (dynamic interval via
  focus/blur was replaced with simple always-1s-when-active approach)

UI (static/messages.js, index.html, style.css)
- TPS chip in titlebar: shows 'N.N t/s . N.N high . N.N low'
- Default: '0.0 t/s . 0.0 high' when idle
- Display updates on every metering SSE event (throttled to 100ms)

* feat: session restore speed + title gen reasoning hardening (nesquena#1025, nesquena#1026)

PR nesquena#1025 (@franksong2702): Speed up large session restore paths
- GET /api/session?messages=0 now parses only metadata before the messages array
- Metadata-only loads no longer populate the full-session LRU cache
- Frontend lazy fetch uses resolve_model=0 to avoid cold model-catalog lookup
- Hard reload no longer waits for populateModelDropdown() before restoring session

PR nesquena#1026 (@franksong2702): Harden auto title generation for reasoning models
- Raises title-gen completion budget to 512 tokens (reasoning-safe)
- Retries once with 1024 tokens on empty content / finish_reason:length
- Applies retry to both auxiliary and active-agent fallback routes
- Preserves underlying failure reason in title_status on local fallback

Co-authored-by: Frank Song <[email protected]>

* feat: session attention indicators in right slot + last_message_at timestamps (nesquena#1024)

PR nesquena#1024 (@franksong2702): Polish session attention indicators

- Streaming spinners and unread dots now reuse the right-side actions slot
- Running/unread rows hide timestamps; idle/read rows keep right-aligned timestamps
- Date group carets point down when expanded, right when collapsed
- Pinned group no longer repeats pinned-star icon per row
- Running indicators appear immediately after send (local busy state while /api/sessions catches up)
- Sidebar sorting/grouping/timestamps now prefer last_message_at (derived from last real message)
  so metadata-only saves don't make old sessions appear under Today

Co-authored-by: Frank Song <[email protected]>

* docs: v0.50.207 release notes — 10 PRs, 2169 tests (+36)

---------

Co-authored-by: bergeouss <[email protected]>
Co-authored-by: Josh <[email protected]>
Co-authored-by: Frank Song <[email protected]>
Co-authored-by: nesquena-hermes <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working merge soon Reviewed and queued for next release batch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants