fix #1164: prevent .env corruption from concurrent WebUI + CLI/Telegram writes#1178
fix #1164: prevent .env corruption from concurrent WebUI + CLI/Telegram writes#1178bergeouss wants to merge 2 commits intonesquena:masterfrom
Conversation
…I/Telegram writes Two-part fix for the race condition where WebUI onboarding corrupts the shared .env file, breaking CLI/Telegram integrations: Cause 1 — Missing lock on onboarding's _write_env_file: onboarding.py had a duplicate _write_env_file() without _ENV_LOCK, while providers.py's version held the lock. Concurrent writes from the WebUI (onboarding flow) and the Telegram bot could corrupt the shared ~/.hermes/.env file, causing API keys to vanish. Fix: Remove the duplicate from onboarding.py and import the locked version from providers.py. All monkey-patching tests continue to work since the name api.onboarding._write_env_file is preserved. Cause 2 — _write_env_file destroyed comments and reordered keys: Both _write_env_file copies loaded the .env into a dict (dropping comments/blank lines) and rewrote it sorted alphabetically. This meant any comments in the user's .env were silently lost on every WebUI config save. Fix: Refactored providers._write_env_file to preserve comments, blank lines, and original key order. Updates happen in-place on existing lines; new keys are appended with a blank-line separator; deleted keys are removed cleanly. Includes 8 tests: comment preservation, blank lines, key order, key removal, empty file, import verification, and lock presence check.
|
Thanks for this thorough fix, @bergeouss! The two-part approach is exactly right for this class of bug. What looks good:
One question worth confirming before merge: the import Overall: clean fix, well-tested, closes a real multi-interface regression. Flagging for maintainer review. |
…#1164 cross-process) The contributor PR's within-process fix (_ENV_LOCK on the load→modify→write cycle) is correct, but doesn't close the cross-process race that is the actual root cause of nesquena#1164. The hermes-agent CLI/Telegram bot uses hermes_cli.config.save_env_value, which writes .env atomically via tempfile + os.replace. The WebUI's _write_env_file was using os.open(..., O_TRUNC) — between the truncate and the complete write, a concurrent reader sees a partial or empty file. The agent has a _sanitize_env_lines workaround for the resulting corruption pattern (concatenated KEY=VALUE pairs from interleaved writes / partial reads), which is a strong signal that this race actually occurs in practice. Fix: stage the write through tempfile.mkstemp() in the same directory (same filesystem, so os.replace is atomic on POSIX), fsync before rename, chmod 0600 before rename so readers never see a 0644 window, and clean up the tempfile on any exception. Matches the exact pattern hermes_cli/config.py:save_env_value uses. Within-process safety from the PR's _ENV_LOCK is preserved (the whole sequence still runs inside the with-lock block). Also added a regression test asserting the new invariants: - tempfile staging present - os.replace called - O_TRUNC pattern is gone Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
nesquena
left a comment
There was a problem hiding this comment.
Review — end-to-end ✅ (approved after atomic-write fix pushed)
What this ships
@bergeouss's fix for #1164 — .env corruption when WebUI and Telegram bot/CLI write the shared ~/.hermes/.env concurrently:
- Removed duplicate
_write_env_fileinapi/onboarding.py; onboarding now imports the locked version fromapi/providers.py(preserves theapi.onboarding._write_env_filename for monkey-patching tests). - Refactored
_write_env_fileinapi/providers.pyto preserve comments, blank lines, and original key order via line-based update instead of dict-rebuild-and-sort.
Traced against upstream hermes-agent
Pulled fresh nousresearch/hermes-agent tarball.
The shared file is ~/.hermes/.env. Both processes touch it:
- Agent CLI / Telegram: writes via
hermes_cli/config.py:save_env_value— usestempfile.mkstemp+os.replace(atomic on POSIX). - Agent CLI: reads via
load_envwhich runs_sanitize_env_lineson every read to fix exactly this corruption pattern: "Concatenated KEY=VALUE pairs on a single line (missing newline between entries)." The agent's sanitizer existing is a strong signal that this race occurs in practice.
So the agent does atomic writes. The WebUI's pre-PR _write_env_file used os.open(..., O_TRUNC) — non-atomic. Between the truncate and the complete write, a concurrent reader (or agent writer) sees a partial / interleaved file.
What I caught — within-process is fixed, but cross-process race remains
The PR closes the within-process leg correctly:
_ENV_LOCKfromapi.streaming(a realthreading.Lockat streaming.py:34) wraps the entire load → modify → write cycle ✅- onboarding's duplicate is gone, so both code paths share the lock ✅
- comment/order/blank-line preservation prevents loss-on-save corruption ✅
But it leaves the cross-process leg open: the WebUI was still doing os.open(..., O_WRONLY | O_CREAT | O_TRUNC) inside the lock. Threads can't observe the race, but a separate process (Telegram bot, CLI) can — and the agent's _sanitize_env_lines exists specifically because this race happens.
The actual user-reported symptom in #1164 ("config.yaml overwrite or .env corruption causes No LLM provider configured") is the cross-process leg.
What I pushed — 7e5cdfa
(Pushed to @bergeouss's fork via maintainer-edit access.)
Replaced the O_TRUNC write with tempfile.mkstemp + os.replace — same pattern as hermes_cli/config.py:save_env_value:
_tmp_fd, _tmp_path = _tempfile.mkstemp(
dir=str(env_path.parent), prefix=".env_", suffix=".tmp"
)
try:
with os.fdopen(_tmp_fd, "w", encoding="utf-8") as _f:
_f.write(content)
_f.flush()
os.fsync(_f.fileno())
os.chmod(_tmp_path, _mode) # tighten before rename so readers see 0600
os.replace(_tmp_path, env_path)
except BaseException:
try:
os.unlink(_tmp_path)
except OSError:
pass
raiseKey properties:
- Tempfile is in the same directory → same filesystem →
os.replaceis atomic on POSIX. fsyncbefore rename so the new content is durable.chmod 0600before rename so readers never see a 0644 window.- Cleanup on exception to avoid leaving stray
.env_*.tmpfiles.
Plus a regression test (test_providers_write_env_uses_atomic_rename) asserting:
tempfileis referenced in the function bodyos.replace(is calledO_TRUNCis NOT present (locks the bug from regressing)
CI green on 7e5cdfa: 3.11/3.12/3.13 all ✅. Local full suite: 2595 passed.
End-to-end trace
Comment & key-order preservation (PR's contribution)
providers.py:104-150 — uses existing_lines array indexed by key. Updates rewrite the existing line in place:
if key in existing_key_indices:
output_lines[existing_key_indices[key]] = f"{key}={clean}"
else:
new_keys.append(f"{key}={clean}")New keys appended after a blank-line separator. Deletions use a None sentinel filtered out before write. ✅ All 6 preservation tests pass.
Onboarding delegation (PR's contribution)
from api.providers import _write_env_file # shared impl with _ENV_LOCK (#1164)api.onboarding._write_env_file and api.providers._write_env_file are now the same object — verified by test_onboarding_imports_write_env_from_providers (assertIs). Existing monkey-patching tests that patch api.onboarding._write_env_file continue to work because the name still resolves at the onboarding module level (re-export semantics). ✅
Atomic write (my contribution)
After my push, the write sequence is atomic. Cross-process readers (Telegram bot, agent CLI) see either the OLD file or the NEW file — never partial. Inotify-style watchers fire once, on the rename. ✅ Matches hermes_cli/config.py:save_env_value so both processes use the same shape for atomicity.
Edge-case trace
| Scenario | Expected | Actual |
|---|---|---|
| WebUI updates a single key | comment lines preserved | ✅ test_comments_preserved_on_update |
| WebUI saves config with no key changes | file unchanged | ✅ output_lines==existing_lines |
| Two WebUI threads write concurrently | serialized by _ENV_LOCK | ✅ within-process safe |
| WebUI writes while CLI reads | reader sees old or new, never partial | ✅ atomic rename after my fix |
| WebUI writes while Telegram writes | last writer wins (no corruption) | ✅ both processes use atomic rename |
| Tempfile creation fails | cleanup, raises | ✅ except BaseException + os.unlink |
| Filesystem doesn't support atomic rename (rare) | os.replace falls back to error | |
| Empty updates dict | file unchanged or recreated | ✅ test_empty_file_handled_gracefully |
| Removed key + added key in same call | both ops applied | ✅ deletion sentinel + new_keys append |
Tests
- PR's tests (8) + my added test (1): 9/9 pass.
- Local full suite: 2595 passed, 47 skipped, 1 PR-unrelated pre-existing failure (macOS-only
test_sprint3system-paths quirk; verified pre-existing on master). - CI on PR after my push: ✅ test (3.11), ✅ test (3.12), ✅ test (3.13).
Other audit — confirmed correct
- Permissions: 0600 set on tempfile via
os.chmodBEFORE rename, so the file is never world-readable even briefly. The trailingenv_path.chmod(_mode)is a belt-and-suspenders for the case where the original file had wider perms andos.replacesomehow preserved them. - Newline injection rejection:
if "\n" in clean or "\r" in clean: raise ValueErrorpreserved at line 134 — still blocks.envinjection from API key fields. os.environmutation: still inside the lock — concurrent streaming sessions reading env vars see consistent state.- Backward compat with monkey-patching:
api.onboarding._write_env_file is api.providers._write_env_fileafter the import.mock.patch('api.onboarding._write_env_file')continues to work because Python attribute lookup resolves at the module level.
Minor observations (non-blocking)
- The
_load_env_filehelper inapi/onboarding.pyis now unused (the duplicate_write_env_filewas its only caller). Leaving it for now since_load_env_filemay be referenced by tests; harmless dead code. - The PR doesn't update
~/.hermes/config.yamlwrites (save_settingsand friends). That file has its own write path; if it has the sameO_TRUNCproblem, it'd warrant a similar pass. Out of scope for #1164 which is specifically about.env. - The agent's
_sanitize_env_linesworkaround can be safely removed in a future agent release once enough time passes for in-the-wild WebUIs to upgrade. Worth a follow-up note in the agent repo.
Recommendation
Approved (after atomic-rename fix pushed). Within-process correctness from the contributor's PR is solid (lock, comment preservation, key order). I added cross-process safety via tempfile + os.replace to match the pattern hermes_cli uses for the same shared file — that closes the actual root cause symptom in the user report. Existing 8 PR tests preserved + 1 regression test for the atomic invariant. CI green on 3.11/3.12/3.13. Parked at approval — ready for the release agent's merge/tag pipeline.
…omic .env write #1178 was absorbed into this batch from bergeouss's commit BEFORE the review-pushed atomic-write fix made it onto the contributor branch. The within-process _ENV_LOCK fix is correct, but #1164's actual root cause is the cross-process race: hermes-agent CLI / Telegram bot uses hermes_cli.config.save_env_value (atomic tempfile + os.replace), while WebUI was using os.open(..., O_TRUNC) which leaves a partial file visible to concurrent readers. The agent has _sanitize_env_lines as a workaround for the resulting "concatenated KEY=VALUE on one line" corruption — strong evidence that this race occurs in practice. Replace O_TRUNC with tempfile.mkstemp + fsync + chmod 0600 + os.replace to match the agent's pattern. Same filesystem (env_path.parent), so os.replace is atomic on POSIX. Cross-process readers now see only OLD or NEW, never partial. Cleanup on exception so stray .env_*.tmp don't accumulate. Plus a regression test asserting the new invariants (tempfile present, os.replace called, O_TRUNC gone) so the cross-process leg can't regress. This brings the batch's #1178 content to parity with the approved review state at #1178 (review). Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…, .env (#1179) Merged as v0.50.228. 2644 tests passing. Browser QA 21/21 (desktop 1440×900 + mobile iPhone 14). All 5 fix invariants verified live in browser. **Fix verifications:** - #1172 (`renderMd` pre-stash): `rawPreStash` present in function, `<pre>` blocks pass through without content rewrite ✅ - #1174 (model race guard): `syncTopbar()` contains `liveStillPending` guard ✅ - #1175 (tool card): `.tool-card-result pre` max-height=360px, `.tool-card.open .tool-card-detail` overflow=auto, cap=600px ✅ - #1176 (empty session guard): double-click New Conversation on empty session → stays on same session, composer focused ✅ - #1178 (`.env` atomic write): `tempfile.mkstemp + os.replace` in `providers.py`, 9/9 env tests pass ✅ Thanks @bsgdigital (#1150) and @bergeouss (#1178)!
|
Merged as v0.50.228 via #1179. Thank you @bergeouss! |
…, .env (nesquena#1179) Merged as v0.50.228. 2644 tests passing. Browser QA 21/21 (desktop 1440×900 + mobile iPhone 14). All 5 fix invariants verified live in browser. **Fix verifications:** - nesquena#1172 (`renderMd` pre-stash): `rawPreStash` present in function, `<pre>` blocks pass through without content rewrite ✅ - nesquena#1174 (model race guard): `syncTopbar()` contains `liveStillPending` guard ✅ - nesquena#1175 (tool card): `.tool-card-result pre` max-height=360px, `.tool-card.open .tool-card-detail` overflow=auto, cap=600px ✅ - nesquena#1176 (empty session guard): double-click New Conversation on empty session → stays on same session, composer focused ✅ - nesquena#1178 (`.env` atomic write): `tempfile.mkstemp + os.replace` in `providers.py`, 9/9 env tests pass ✅ Thanks @bsgdigital (nesquena#1150) and @bergeouss (nesquena#1178)!
Thinking Path
Problem: When the Hermes WebUI writes to the shared
~/.hermes/.envfile (onboarding, provider setup), concurrent access from the Telegram bot or CLI can corrupt the file — API keys vanish, causingNo LLM provider configurederrors that persist even after/resetand/modelcommands.The issue has two root causes:
Missing lock on onboarding's
_write_env_file—api/onboarding.pyhad its own copy of_write_env_file()without_ENV_LOCK, whileapi/providers.py's version held the lock. Concurrent writes from WebUI onboarding and the Telegram gateway could interleave, corrupting the file.Comments and key order destroyed — Both
_write_env_filecopies loaded the.envinto a dict (dropping comments and blank lines), then rewrote the entire file with keys sorted alphabetically. User comments were silently lost on every WebUI config save.What Changed
api/onboarding.py:_write_env_file()(20 lines)from api.providers import _write_env_file— the module-level nameapi.onboarding._write_env_fileis preserved so all existing monkey-patch tests continue to work unchangedapi/providers.py:_write_env_file()to preserve comments, blank lines, and original key order_ENV_LOCKfromapi.streamingstill protects the entire load→modify→write cycleTests: 8 new tests in
test_issue1164_env_file_corruption.py:Closes #1164