Skip to content

feat: enhance Docker setup and startup script for Hermes WebUI#1686

Open
binhpt310 wants to merge 1 commit intonesquena:masterfrom
binhpt310:docker-fixes-hanging-wait
Open

feat: enhance Docker setup and startup script for Hermes WebUI#1686
binhpt310 wants to merge 1 commit intonesquena:masterfrom
binhpt310:docker-fixes-hanging-wait

Conversation

@binhpt310
Copy link
Copy Markdown

Problem

Running ./start.sh inside the Docker container causes two failures:

  1. xz: Cannot exec: No such file or directoryxz-utils missing
  2. npm install hangs indefinitely during agent auto-install
  3. .env UID/GID variables are readonly in bash

Solution

  • Added xz-utils and git to Dockerfile
  • Pre-installed Node.js 22 LTS system-wide (no runtime download)
  • Pre-baked hermes-agent source from sibling repo into the image
  • Set HERMES_WEBUI_AGENT_DIR to skip network installs
  • Fixed start.sh to safely parse .env (skip readonly vars)
  • Auto-drop root privileges in start.sh to avoid permission issues
  • Updated docker-compose.yml build context to include agent source

Testing

docker compose build --no-cache
docker compose up -d
docker exec -it hermes-webui-hermes-webui-1 /apptoo/start.sh

- Added git and xz-utils to Dockerfile for improved build capabilities.
- Updated docker-compose.yml to specify build context and Dockerfile location.
- Enhanced start.sh to handle environment variable parsing more robustly and set Python executable path.
@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Review

Reading the diff at f4b618d against origin/master, the implementation is actually quite clean — coarse aggregate metrics, dependency-free, no process/path/argv leakage, dedicated module mirror of api/agent_health.py. But there are two issues that need addressing before merge.

🔴 Blocker: syntax error in the regression test file

tests/test_issue693_system_health_panel.py:16 has a literal *** in place of a Path expression:

ROUTES_PY = (REPO_ROOT / "api" / "routes.py").read_text(encoding="utf-8")
AUTH_PY=*** / "api" / "auth.py").read_text(encoding="utf-8")

That looks like an editor/tool artefact from a redacted diff — the test file simply will not parse. python3 -m py_compile errors at line 16. The intent is clearly:

AUTH_PY = (REPO_ROOT / "api" / "auth.py").read_text(encoding="utf-8")

…which is then asserted against at line 107 (assert '"/api/system/health"' not in AUTH_PY, "system metrics must not be public"). Please fix this single line — without it the entire test module fails to import and the suite errors out.

🟡 Authentication

The endpoint goes through check_auth in server.py:130 (do_GET calls check_auth(self, parsed) before handle_get), and /api/system/health is not in api/auth.py:21-25's PUBLIC_PATHS allowlist. So when WebUI auth is enabled (single-user password is configured), the endpoint is correctly gated. ✅ The test at line 107 asserts the endpoint name is not in auth.py — that's a fine canary, just note it doesn't actually verify cookie verification in the flow (only that nobody added the path to the public list). Worth a follow-up integration test that hits /api/system/health with is_auth_enabled=True and no cookie, expects 401. Not a blocker.

What concerns me more is the default-no-auth case: a fresh WebUI install with no password set has is_auth_enabled() == False (see api/auth.py:142-...), so /api/system/health returns CPU/RAM/disk percentages on every poll to anyone who can reach the bind address. The default bind is 127.0.0.1 so this is mostly defence-in-depth, but if anyone runs WebUI on a multi-user box or behind a misconfigured reverse proxy, the panel becomes a passive sidechannel. Two possible mitigations, both small:

  1. Gate the endpoint behind is_auth_enabled() even on the public-allow path — return 404 if auth is off (so the feature only lights up for password-protected installs). The frontend would see unavailable and hide the panel via the .system-health-panel.unavailable{display:none} rule already in static/style.css:295.
  2. Or rate-limit it — 5s polling × N tabs per session is fine, but there's no throttle.

Note: the polled /api/health/agent (api/routes.py:2491) and /api/dashboard/status (:2500) sit alongside this endpoint with the same auth posture, so this isn't a new precedent — just calling it out because it's the right time to think about it.

🟢 Implementation looks good

api/system_health.py reads only /proc/stat and /proc/meminfo and uses shutil.disk_usage("/") — no command exec, no argv, no env. The CPU sampler:

def _cpu_percent() -> float:
    start = _read_proc_stat_cpu()
    time.sleep(_CPU_SAMPLE_SECONDS)  # 0.05s
    end = _read_proc_stat_cpu()
    return _cpu_delta_percent(start, end)

50ms blocking sleep on every poll is fine in practice (BaseHTTPServer is threaded), but worth noting this will serialize behind the GIL alongside other GETs. At 5s polling per client × small instance, totally negligible.

_safe_error() strips exception messages to just the type name — good defence against /proc paths leaking when /proc/stat is unreadable on a hardened container. ✅

The frontend at static/ui.js:3068-3155 has the right shape: visibilityState gating to stop polling on hidden tabs, hard .unavailable class on macOS/Windows/non-Linux hosts where /proc/stat raises, single 5s setInterval, and the panel is already wired into index.html:79-100 above the layout. Mobile @media rules at style.css:1295-1299 collapse the bar to a column — looks reasonable.

One small nit: _systemHealthTimer is started both inline and on DOMContentLoaded — the if(_systemHealthTimer) return; guard in startSystemHealthMonitor() makes that idempotent, but the visibility listener is added via addEventListener without a removal path. Not a leak (singleton document), just inelegant.

Action items

  1. Fix the syntax error at tests/test_issue693_system_health_panel.py:16 — this is the only blocker.
  2. (Recommended) Add an early-return when is_auth_enabled() is False, or document that the panel exposes coarse host metrics on no-auth installs.
  3. (Optional) Follow-up integration test that hits /api/system/health with auth-on/no-cookie → 401.

Once the test file parses I'd be happy with this. Closes #693 — the feature itself is well-scoped and matches the feature request shape.

@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Thanks for the patch. Three of the symptoms you describe are real (xz-utils missing, UID/GID readonly trip-up, npm install running at agent install), but the fix as written has structural issues. After pulling the branch and reading the diff against origin/master, this needs a rework before it can land.

CI is red on this branch

test (3.13) fails with:

tests/test_bootstrap_dotenv.py:181: AssertionError:
start.sh should still source .env (regression guard)

The guard at tests/test_bootstrap_dotenv.py:176-188:

def test_start_sh_and_bootstrap_equivalent_env_loading(self):
    start_sh = (REPO_ROOT / "start.sh").read_text(encoding="utf-8")
    ...
    assert "source" in start_sh and ".env" in start_sh, (
        "start.sh should still source .env (regression guard)"
    )

Your replacement loop never says source (it iterates with read -r raw_line), so the guard fires. The 3.11 and 3.12 jobs were CANCELLED so we do not know their state, but 3.13 is FAILED. A clean fix that keeps the existing guard happy:

set -a
source <(grep -vE "^[[:space:]]*(export[[:space:]]+)?(UID|GID|EUID|EGID|PPID|PID)=" "${REPO_ROOT}/.env")
set +a

Same effect as your filter, half the lines, no test churn.

Compose context change is breaking

The PR rewrites docker-compose.yml to:

build:
  context: ..
  dockerfile: hermes-webui/Dockerfile

and the Dockerfile then does COPY hermes-webui/ /apptoo/ and COPY hermes-agent-desktop/hermes-agent /opt/hermes/. That assumes a parent layout with hermes-webui/ and hermes-agent-desktop/hermes-agent/ as siblings.

The canonical user flow is clone the repo, cd in, then bring up the stack. With this PR applied, that breaks: context: .. resolves to whatever happens to sit one level above the clone, and COPY hermes-agent-desktop/hermes-agent /opt/hermes/ will fail outright because that path does not exist in the standard layout. This looks like a leak from your local multi-repo workspace into the public PR.

For two-container layouts, docker-compose.two-container.yml already exists (referenced at docker_init.bash:351) and uses an -v /path/to/hermes-agent:... mount instead of a build-context expansion. Keep the single-container compose context as ..

The /app/venv early-exit in start.sh bypasses docker_init.bash

if [[ -f "/.within_container" && -x "/app/venv/bin/python" ]]; then
  export HERMES_WEBUI_PYTHON="/app/venv/bin/python"
  exec "/app/venv/bin/python" "${REPO_ROOT}/bootstrap.py" --no-browser "$@"
fi

docker_init.bash is the container CMD and does substantive work that bootstrap.py does not: WANTED_UID/GID detection from mounted volumes (docker_init.bash:60-83), the hermeswebui user switch via usermod, /uv_cache ownership setup, and ensure_hindsight_client_docker_dependency self-healing (:307). If start.sh ever runs inside a container outside the CMD path (e.g. a debug exec), this branch silently skips all of that.

Also: the Dockerfile pre-bakes /opt/hermes but does not pre-create /app/venv. That path does not exist until docker_init.bash:297 runs uv venv venv. So the [[ -x "/app/venv/bin/python" ]] guard is dead on a fresh image and only fires on subsequent restarts after init has already run once. Fragile state machine.

Pieces that look correct and should be split out

  • apt-get install ... xz-utils git: legit, needed for the node-22 tar extract; git is also useful in the existing image.
  • The UID|GID|EUID|EGID|PPID|PID|_ skip-list: real bug, real fix.
  • Trailing whitespace and line-continuation cleanup in the groupadd block: harmless.

Recommendation

Split this into two surgical PRs:

  1. fix(start.sh): skip bash readonly vars when sourcing .env -- apply the filter via the grep -vE + source <(...) form above so the regression guard at tests/test_bootstrap_dotenv.py:181 keeps passing. No Dockerfile changes.
  2. fix(docker): add xz-utils + git to base image -- just the apt-get hunk. No compose changes.

The pre-bake-agent-source plus change-compose-context piece needs a separate design conversation. docker_init.bash:330-335 already supports /opt/hermes as a fallback agent path, so if the goal is offline-friendly images, the cleaner approach is a multi-stage Dockerfile that fetches the agent by tag during build, not a build-context expansion that requires a specific filesystem layout outside the repo.

One question on the npm install hang: I could not find any npm install invocation in bootstrap.py, docker_init.bash, or start.sh. If that hang is happening inside the agent install scripts, it is a hermes-agent fix, not a webui one. Could you point at the exact log line where it hangs? That helps triage the right repo.

@nesquena-hermes
Copy link
Copy Markdown
Collaborator

Thanks @binhpt310 — your fixes for the xz-utils / git package gaps, the readonly-UID/GID .env parsing crash, and the auto-drop-privileges helper are exactly the right kind of operational hardening for the Docker setup. They're holding up nicely under review.

We need to defer this PR from the v0.51.5 release pass because the Dockerfile + docker-compose.yml shape it introduces makes the build context-dependent on a sibling repo:

# Dockerfile (lines ~80-100):
COPY hermes-agent-desktop/hermes-agent /opt/hermes/
COPY --chown=hermeswebuitoo:hermeswebuitoo hermes-webui/ /apptoo/
# docker-compose.yml:
build:
  context: ..
  dockerfile: hermes-webui/Dockerfile

This means the build now requires a parent directory containing both hermes-webui/ and hermes-agent-desktop/ siblings. A user who runs the canonical:

git clone https://github.com/nesquena/hermes-webui
cd hermes-webui
docker compose build

…will hit COPY failed: file not found in build context: hermes-agent-desktop/hermes-agent because there's no sibling repo. That breaks Docker for every standalone clone, which is the most common path.

There are a few ways forward — pick whichever fits best:

Option A — Build arg (smallest change):

ARG WITH_AGENT_SOURCE=0
RUN if [ "$WITH_AGENT_SOURCE" = "1" ]; then \
      mkdir -p /opt/hermes && \
      cp -r /tmp/hermes-agent-desktop/hermes-agent /opt/hermes/; \
    fi
  • keep docker-compose.yml's context: . (no parent-context change). Users with the sibling repo opt-in via --build-arg WITH_AGENT_SOURCE=1.

Option B — Vendor a download helper:
A docker_init.bash step that fetches hermes-agent from PyPI / npm at image build time. Closer to the existing bootstrap.py install path but at build time, so no runtime download.

Option C — Drop the agent-pre-bake and keep the rest:
Ship the xz-utils / git / readonly-UID parser / auto-drop-privileges fixes alone (those are independently valuable) and leave the agent-source baking for a follow-up that has access to a vendoring channel.

The other improvements in this PR are all valuable on their own — the .env readonly-vars parser is a real fix, the auto-drop-privileges via sudo -n -u hermeswebui is exactly the right shape, and the pre-built-container fast path through /.within_container + /app/venv/bin/python is clean.

A secondary nit (not blocking): the start.sh re-execs as hermeswebui (UID 1024) but the Dockerfile chowns /apptoo to hermeswebuitoo:hermeswebuitoo (UID 1025). I couldn't see how bootstrap.py writes .hermes/ and venv state from a user that doesn't own the tree — worth a docker compose up smoke test to confirm a successful end-to-end boot before merge.

Happy to absorb a v2 with Option A or C as soon as it lands. The infrastructure half of this is unambiguously good.

@nesquena nesquena added the hold label May 5, 2026
githb-ac pushed a commit to githb-ac-org/hermes-webui that referenced this pull request May 5, 2026
4 PRs (1 surface addition, 3 fixes):
- nesquena#1688 VPS resource health Insights panel (@Michaelyklam, closes nesquena#693)
- nesquena#1709 preserve scroll on stream completion (@Michaelyklam, closes nesquena#1690)
- nesquena#1711 hide rename tooltip on folders (@nesquena-hermes, closes nesquena#1710)
- nesquena#1712 guard localStorage.setItem against QuotaExceededError (@24601)

Tests: 4504 → 4527 (+23). Opus: SHIP, 6/6 verification clean.

Held back: nesquena#1686 (Docker enhance) — Opus flagged sibling-repo dep that
breaks standalone clones. Left open for follow-up.

Co-authored-by: Michael Lam <[email protected]>
Co-authored-by: 24601 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants