Skip to content

fix #1164: prevent .env corruption from concurrent WebUI + CLI/Telegram writes#1178

Closed
bergeouss wants to merge 2 commits intonesquena:masterfrom
bergeouss:fix/issue-1164-env-file-corruption
Closed

fix #1164: prevent .env corruption from concurrent WebUI + CLI/Telegram writes#1178
bergeouss wants to merge 2 commits intonesquena:masterfrom
bergeouss:fix/issue-1164-env-file-corruption

Conversation

@bergeouss
Copy link
Copy Markdown
Contributor

Thinking Path

Problem: When the Hermes WebUI writes to the shared ~/.hermes/.env file (onboarding, provider setup), concurrent access from the Telegram bot or CLI can corrupt the file — API keys vanish, causing No LLM provider configured errors that persist even after /reset and /model commands.

The issue has two root causes:

  1. Missing lock on onboarding's _write_env_fileapi/onboarding.py had its own copy of _write_env_file() without _ENV_LOCK, while api/providers.py's version held the lock. Concurrent writes from WebUI onboarding and the Telegram gateway could interleave, corrupting the file.

  2. Comments and key order destroyed — Both _write_env_file copies loaded the .env into a dict (dropping comments and blank lines), then rewrote the entire file with keys sorted alphabetically. User comments were silently lost on every WebUI config save.

What Changed

api/onboarding.py:

  • Removed the duplicate _write_env_file() (20 lines)
  • Added from api.providers import _write_env_file — the module-level name api.onboarding._write_env_file is preserved so all existing monkey-patch tests continue to work unchanged

api/providers.py:

  • Refactored _write_env_file() to preserve comments, blank lines, and original key order
  • Updates happen in-place on existing lines instead of rebuilding from a dict
  • New keys are appended at the end with a blank-line separator
  • Deleted keys (value=None) are removed cleanly without affecting surrounding lines
  • _ENV_LOCK from api.streaming still protects the entire load→modify→write cycle

Tests: 8 new tests in test_issue1164_env_file_corruption.py:

  • Comment preservation on key update
  • Blank line preservation between key blocks
  • Original key order maintained (not sorted)
  • New keys appended with separator
  • Key removal preserves other entries
  • Empty file handled gracefully
  • Import verification (onboarding delegates to providers)
  • Lock presence verification in providers source

Closes #1164

…I/Telegram writes

Two-part fix for the race condition where WebUI onboarding corrupts
the shared .env file, breaking CLI/Telegram integrations:

Cause 1 — Missing lock on onboarding's _write_env_file:
  onboarding.py had a duplicate _write_env_file() without _ENV_LOCK,
  while providers.py's version held the lock. Concurrent writes from
  the WebUI (onboarding flow) and the Telegram bot could corrupt the
  shared ~/.hermes/.env file, causing API keys to vanish.

  Fix: Remove the duplicate from onboarding.py and import the locked
  version from providers.py. All monkey-patching tests continue to
  work since the name api.onboarding._write_env_file is preserved.

Cause 2 — _write_env_file destroyed comments and reordered keys:
  Both _write_env_file copies loaded the .env into a dict (dropping
  comments/blank lines) and rewrote it sorted alphabetically. This
  meant any comments in the user's .env were silently lost on every
  WebUI config save.

  Fix: Refactored providers._write_env_file to preserve comments, blank
  lines, and original key order. Updates happen in-place on existing
  lines; new keys are appended with a blank-line separator; deleted
  keys are removed cleanly.

Includes 8 tests: comment preservation, blank lines, key order, key
removal, empty file, import verification, and lock presence check.
@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Thanks for this thorough fix, @bergeouss! The two-part approach is exactly right for this class of bug.

What looks good:

  • Removing the duplicate _write_env_file() in onboarding.py and importing the locked version from providers.py is the cleanest possible fix for the race condition. No new code paths, just proper code reuse.
  • The in-place line editing approach (update existing keys, append new ones, remove deleted ones) is significantly safer than the previous dict-roundtrip that silently dropped comments and reordered keys.
  • _ENV_LOCK coverage now spans both the onboarding and provider-settings write paths.
  • 8 tests covering the specific properties that were broken (comment preservation, key order, blank lines) give future refactors a solid regression net.

One question worth confirming before merge: the import from api.providers import _write_env_file means api.onboarding._write_env_file is the same function object as api.providers._write_env_file. Monkey-patch tests that do api.onboarding._write_env_file = mock should still work since they patch the name in api.onboarding's namespace. If any test instead patches api.providers._write_env_file, that patch won't be seen by callers going through api.onboarding._write_env_file after the import. The commit message says tests pass — just worth a note for anyone adding future tests.

Overall: clean fix, well-tested, closes a real multi-interface regression. Flagging for maintainer review.

…#1164 cross-process)

The contributor PR's within-process fix (_ENV_LOCK on the load→modify→write
cycle) is correct, but doesn't close the cross-process race that is the
actual root cause of nesquena#1164. The hermes-agent CLI/Telegram bot uses
hermes_cli.config.save_env_value, which writes .env atomically via
tempfile + os.replace. The WebUI's _write_env_file was using
os.open(..., O_TRUNC) — between the truncate and the complete write,
a concurrent reader sees a partial or empty file. The agent has a
_sanitize_env_lines workaround for the resulting corruption pattern
(concatenated KEY=VALUE pairs from interleaved writes / partial reads),
which is a strong signal that this race actually occurs in practice.

Fix: stage the write through tempfile.mkstemp() in the same directory
(same filesystem, so os.replace is atomic on POSIX), fsync before
rename, chmod 0600 before rename so readers never see a 0644 window,
and clean up the tempfile on any exception. Matches the exact pattern
hermes_cli/config.py:save_env_value uses.

Within-process safety from the PR's _ENV_LOCK is preserved (the whole
sequence still runs inside the with-lock block).

Also added a regression test asserting the new invariants:
- tempfile staging present
- os.replace called
- O_TRUNC pattern is gone

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copy link
Copy Markdown
Owner

@nesquena nesquena left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — end-to-end ✅ (approved after atomic-write fix pushed)

What this ships

@bergeouss's fix for #1164.env corruption when WebUI and Telegram bot/CLI write the shared ~/.hermes/.env concurrently:

  1. Removed duplicate _write_env_file in api/onboarding.py; onboarding now imports the locked version from api/providers.py (preserves the api.onboarding._write_env_file name for monkey-patching tests).
  2. Refactored _write_env_file in api/providers.py to preserve comments, blank lines, and original key order via line-based update instead of dict-rebuild-and-sort.

Traced against upstream hermes-agent

Pulled fresh nousresearch/hermes-agent tarball.

The shared file is ~/.hermes/.env. Both processes touch it:

  • Agent CLI / Telegram: writes via hermes_cli/config.py:save_env_value — uses tempfile.mkstemp + os.replace (atomic on POSIX).
  • Agent CLI: reads via load_env which runs _sanitize_env_lines on every read to fix exactly this corruption pattern: "Concatenated KEY=VALUE pairs on a single line (missing newline between entries)." The agent's sanitizer existing is a strong signal that this race occurs in practice.

So the agent does atomic writes. The WebUI's pre-PR _write_env_file used os.open(..., O_TRUNC) — non-atomic. Between the truncate and the complete write, a concurrent reader (or agent writer) sees a partial / interleaved file.

What I caught — within-process is fixed, but cross-process race remains

The PR closes the within-process leg correctly:

  • _ENV_LOCK from api.streaming (a real threading.Lock at streaming.py:34) wraps the entire load → modify → write cycle ✅
  • onboarding's duplicate is gone, so both code paths share the lock ✅
  • comment/order/blank-line preservation prevents loss-on-save corruption ✅

But it leaves the cross-process leg open: the WebUI was still doing os.open(..., O_WRONLY | O_CREAT | O_TRUNC) inside the lock. Threads can't observe the race, but a separate process (Telegram bot, CLI) can — and the agent's _sanitize_env_lines exists specifically because this race happens.

The actual user-reported symptom in #1164 ("config.yaml overwrite or .env corruption causes No LLM provider configured") is the cross-process leg.

What I pushed — 7e5cdfa

(Pushed to @bergeouss's fork via maintainer-edit access.)

Replaced the O_TRUNC write with tempfile.mkstemp + os.replace — same pattern as hermes_cli/config.py:save_env_value:

_tmp_fd, _tmp_path = _tempfile.mkstemp(
    dir=str(env_path.parent), prefix=".env_", suffix=".tmp"
)
try:
    with os.fdopen(_tmp_fd, "w", encoding="utf-8") as _f:
        _f.write(content)
        _f.flush()
        os.fsync(_f.fileno())
    os.chmod(_tmp_path, _mode)  # tighten before rename so readers see 0600
    os.replace(_tmp_path, env_path)
except BaseException:
    try:
        os.unlink(_tmp_path)
    except OSError:
        pass
    raise

Key properties:

  • Tempfile is in the same directory → same filesystem → os.replace is atomic on POSIX.
  • fsync before rename so the new content is durable.
  • chmod 0600 before rename so readers never see a 0644 window.
  • Cleanup on exception to avoid leaving stray .env_*.tmp files.

Plus a regression test (test_providers_write_env_uses_atomic_rename) asserting:

  • tempfile is referenced in the function body
  • os.replace( is called
  • O_TRUNC is NOT present (locks the bug from regressing)

CI green on 7e5cdfa: 3.11/3.12/3.13 all ✅. Local full suite: 2595 passed.

End-to-end trace

Comment & key-order preservation (PR's contribution)

providers.py:104-150 — uses existing_lines array indexed by key. Updates rewrite the existing line in place:

if key in existing_key_indices:
    output_lines[existing_key_indices[key]] = f"{key}={clean}"
else:
    new_keys.append(f"{key}={clean}")

New keys appended after a blank-line separator. Deletions use a None sentinel filtered out before write. ✅ All 6 preservation tests pass.

Onboarding delegation (PR's contribution)

onboarding.py:26:

from api.providers import _write_env_file  # shared impl with _ENV_LOCK (#1164)

api.onboarding._write_env_file and api.providers._write_env_file are now the same object — verified by test_onboarding_imports_write_env_from_providers (assertIs). Existing monkey-patching tests that patch api.onboarding._write_env_file continue to work because the name still resolves at the onboarding module level (re-export semantics). ✅

Atomic write (my contribution)

After my push, the write sequence is atomic. Cross-process readers (Telegram bot, agent CLI) see either the OLD file or the NEW file — never partial. Inotify-style watchers fire once, on the rename. ✅ Matches hermes_cli/config.py:save_env_value so both processes use the same shape for atomicity.

Edge-case trace

Scenario Expected Actual
WebUI updates a single key comment lines preserved ✅ test_comments_preserved_on_update
WebUI saves config with no key changes file unchanged ✅ output_lines==existing_lines
Two WebUI threads write concurrently serialized by _ENV_LOCK ✅ within-process safe
WebUI writes while CLI reads reader sees old or new, never partial ✅ atomic rename after my fix
WebUI writes while Telegram writes last writer wins (no corruption) ✅ both processes use atomic rename
Tempfile creation fails cleanup, raises ✅ except BaseException + os.unlink
Filesystem doesn't support atomic rename (rare) os.replace falls back to error ⚠️ edge case; not relevant on POSIX
Empty updates dict file unchanged or recreated ✅ test_empty_file_handled_gracefully
Removed key + added key in same call both ops applied ✅ deletion sentinel + new_keys append

Tests

  • PR's tests (8) + my added test (1): 9/9 pass.
  • Local full suite: 2595 passed, 47 skipped, 1 PR-unrelated pre-existing failure (macOS-only test_sprint3 system-paths quirk; verified pre-existing on master).
  • CI on PR after my push: ✅ test (3.11), ✅ test (3.12), ✅ test (3.13).

Other audit — confirmed correct

  • Permissions: 0600 set on tempfile via os.chmod BEFORE rename, so the file is never world-readable even briefly. The trailing env_path.chmod(_mode) is a belt-and-suspenders for the case where the original file had wider perms and os.replace somehow preserved them.
  • Newline injection rejection: if "\n" in clean or "\r" in clean: raise ValueError preserved at line 134 — still blocks .env injection from API key fields.
  • os.environ mutation: still inside the lock — concurrent streaming sessions reading env vars see consistent state.
  • Backward compat with monkey-patching: api.onboarding._write_env_file is api.providers._write_env_file after the import. mock.patch('api.onboarding._write_env_file') continues to work because Python attribute lookup resolves at the module level.

Minor observations (non-blocking)

  • The _load_env_file helper in api/onboarding.py is now unused (the duplicate _write_env_file was its only caller). Leaving it for now since _load_env_file may be referenced by tests; harmless dead code.
  • The PR doesn't update ~/.hermes/config.yaml writes (save_settings and friends). That file has its own write path; if it has the same O_TRUNC problem, it'd warrant a similar pass. Out of scope for #1164 which is specifically about .env.
  • The agent's _sanitize_env_lines workaround can be safely removed in a future agent release once enough time passes for in-the-wild WebUIs to upgrade. Worth a follow-up note in the agent repo.

Recommendation

Approved (after atomic-rename fix pushed). Within-process correctness from the contributor's PR is solid (lock, comment preservation, key order). I added cross-process safety via tempfile + os.replace to match the pattern hermes_cli uses for the same shared file — that closes the actual root cause symptom in the user report. Existing 8 PR tests preserved + 1 regression test for the atomic invariant. CI green on 3.11/3.12/3.13. Parked at approval — ready for the release agent's merge/tag pipeline.

nesquena added a commit that referenced this pull request Apr 27, 2026
…omic .env write

#1178 was absorbed into this batch from bergeouss's commit BEFORE the
review-pushed atomic-write fix made it onto the contributor branch.
The within-process _ENV_LOCK fix is correct, but #1164's actual root
cause is the cross-process race: hermes-agent CLI / Telegram bot uses
hermes_cli.config.save_env_value (atomic tempfile + os.replace), while
WebUI was using os.open(..., O_TRUNC) which leaves a partial file
visible to concurrent readers. The agent has _sanitize_env_lines as a
workaround for the resulting "concatenated KEY=VALUE on one line"
corruption — strong evidence that this race occurs in practice.

Replace O_TRUNC with tempfile.mkstemp + fsync + chmod 0600 + os.replace
to match the agent's pattern. Same filesystem (env_path.parent), so
os.replace is atomic on POSIX. Cross-process readers now see only OLD
or NEW, never partial. Cleanup on exception so stray .env_*.tmp don't
accumulate.

Plus a regression test asserting the new invariants (tempfile present,
os.replace called, O_TRUNC gone) so the cross-process leg can't regress.

This brings the batch's #1178 content to parity with the approved review
state at #1178 (review).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
nesquena-hermes added a commit that referenced this pull request Apr 27, 2026
…, .env (#1179)

Merged as v0.50.228. 2644 tests passing. Browser QA 21/21 (desktop 1440×900 + mobile iPhone 14). All 5 fix invariants verified live in browser.

**Fix verifications:**
- #1172 (`renderMd` pre-stash): `rawPreStash` present in function, `<pre>` blocks pass through without content rewrite ✅
- #1174 (model race guard): `syncTopbar()` contains `liveStillPending` guard ✅
- #1175 (tool card): `.tool-card-result pre` max-height=360px, `.tool-card.open .tool-card-detail` overflow=auto, cap=600px ✅  
- #1176 (empty session guard): double-click New Conversation on empty session → stays on same session, composer focused ✅
- #1178 (`.env` atomic write): `tempfile.mkstemp + os.replace` in `providers.py`, 9/9 env tests pass ✅

Thanks @bsgdigital (#1150) and @bergeouss (#1178)!
@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Merged as v0.50.228 via #1179. Thank you @bergeouss!

JKJameson pushed a commit to JKJameson/hermes-webui that referenced this pull request Apr 29, 2026
…, .env (nesquena#1179)

Merged as v0.50.228. 2644 tests passing. Browser QA 21/21 (desktop 1440×900 + mobile iPhone 14). All 5 fix invariants verified live in browser.

**Fix verifications:**
- nesquena#1172 (`renderMd` pre-stash): `rawPreStash` present in function, `<pre>` blocks pass through without content rewrite ✅
- nesquena#1174 (model race guard): `syncTopbar()` contains `liveStillPending` guard ✅
- nesquena#1175 (tool card): `.tool-card-result pre` max-height=360px, `.tool-card.open .tool-card-detail` overflow=auto, cap=600px ✅  
- nesquena#1176 (empty session guard): double-click New Conversation on empty session → stays on same session, composer focused ✅
- nesquena#1178 (`.env` atomic write): `tempfile.mkstemp + os.replace` in `providers.py`, 9/9 env tests pass ✅

Thanks @bsgdigital (nesquena#1150) and @bergeouss (nesquena#1178)!
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

WebUI usage breaks Telegram/CLI integration: config.yaml overwrite or .env corruption causes No LLM provider configured

3 participants