
docs: add Azure Container Apps install guide with managed identity an…#52555

Draft
kimvaddi wants to merge 4 commits into openclaw:main from kimvaddi:docs/azure-container-apps-install-guide

Conversation

@kimvaddi

Summary

Describe the problem and fix in 2–5 bullets:

  • Problem: OpenClaw docs had no Azure Container Apps install guide. The only Azure option is the VM guide (~$195/month with Bastion).
  • Why it matters: Azure Container Apps targets the free tier ($0-27/month), uses managed identity instead of admin credentials, and eliminates VM/Bastion/NSG costs entirely. Azure users need a first-party containerized deployment path.
  • What changed: Added docs/install/azure-containers.md with managed identity ACR pull, Key Vault secrets via RBAC, GHCR skip-ACR alternative, persistent Azure Files storage, and nav/redirect/hub-page wiring in docs.json, vps.md, and platforms/index.md.
  • What did NOT change (scope boundary): No Gateway/runtime code, auth logic, provider behavior, or installer execution behavior changed.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

User-visible / Behavior Changes

  • New Azure Container Apps install documentation at /install/azure-containers.
  • VPS picker page now includes an "Azure Container Apps" card (and renames the existing card from "Azure" to "Azure VM" for clarity).
  • Platforms hub page now lists Azure Container Apps alongside the existing Azure VM link.
  • Redirects added: /azure-containers and /platforms/azure-containers both resolve to /install/azure-containers.
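
Assuming a Mintlify-style docs.json (the schema below is an illustrative assumption, not copied from the PR diff), the redirect wiring described above would look roughly like:

```json
{
  "redirects": [
    { "source": "/azure-containers", "destination": "/install/azure-containers" },
    { "source": "/platforms/azure-containers", "destination": "/install/azure-containers" }
  ]
}
```

Both legacy-style paths resolve to the same canonical page, so shared links keep working after the nav change.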

Security Impact (required)

  • New permissions/capabilities? (No)
  • Secrets/tokens handling changed? (No)
  • New/changed network calls? (No)
  • Command/tool execution surface changed? (No)
  • Data access scope changed? (No)
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: Windows 11
  • Runtime/container: Azure Container Apps (consumption plan)
  • Model/provider: N/A (docs-only change)
  • Integration/channel (if any): GitHub docs/navigation
  • Relevant config (redacted): N/A

Steps

  1. Open diff main...docs/azure-container-apps-install-guide.
  2. Verify new file docs/install/azure-containers.md exists with managed identity, Key Vault, and GHCR sections.
  3. Validate docs/docs.json has nav entry under Hosting and both redirects (/azure-containers, /platforms/azure-containers).
  4. Confirm docs/vps.md has Azure Container Apps card and docs/platforms/index.md has the link.
  5. Follow the guide end-to-end in a fresh Azure resource group.
  6. Verify `az containerapp exec` runs `openclaw gateway status` successfully.
  7. Access the Control UI via the auto-assigned FQDN in a browser.
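
Steps 6–7 can be sketched as the following commands. The `RG` and `ACA_APP` variable names mirror the guide's conventions and are assumptions here:

```bash
# Step 6: run the health check inside the running container
az containerapp exec -g "${RG}" -n "${ACA_APP}" \
  --command "openclaw gateway status"

# Step 7: look up the auto-assigned ingress FQDN for the Control UI
FQDN="$(az containerapp show -g "${RG}" -n "${ACA_APP}" \
  --query properties.configuration.ingress.fqdn -o tsv)"
echo "Open https://${FQDN} in a browser"
```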

Expected

  • Azure Container Apps page reachable via docs nav at /install/azure-containers and via redirects.
  • All internal docs links resolve without broken links.
  • Azure resource group provisions: Container App, ACR, Storage Account, Key Vault.
  • OpenClaw Gateway running and accessible via HTTPS FQDN.

Actual

  • (Attach screenshots after deployment verification)

Evidence

Attach at least one:

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

What you personally verified (not just CI), and how:

  • Verified scenarios:
    • Reviewed main...HEAD diff and file-level changes.
    • Checked docs/docs.json redirects and nav entries for Azure Container Apps.
    • Ran `oxfmt --check` on all changed files — format is clean.
    • Confirmed managed identity (`--registry-identity system`), Key Vault RBAC, and GHCR fallback flows are documented correctly.
    • (Add after deploy: Deployed to Azure and verified end-to-end.)
  • Edge cases checked:
    • Redirect compatibility from /azure-containers and /platforms/azure-containers.
    • Links from vps.md and platforms/index.md resolve correctly.
    • GHCR fallback path (skip ACR) documented as alternative deployment option.
    • Existing Azure VM card renamed from "Azure" to "Azure VM" for disambiguation.
  • What you did not verify:
    • (Fill in after deployment testing)

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes)
  • Config/env changes? (No)
  • Migration needed? (No)
  • If yes, exact upgrade steps: N/A

Failure Recovery (if this breaks)

  • How to disable/revert this change quickly: Revert this PR/commit.
  • Files/config to restore: docs/docs.json, docs/vps.md, docs/platforms/index.md, and remove docs/install/azure-containers.md.
  • Known bad symptoms reviewers should watch for: Broken nav links or 404s on /install/azure-containers.

Risks and Mitigations

None — docs-only PR with no runtime code changes.

@openclaw-barnacle openclaw-barnacle bot added docs Improvements or additions to documentation size: XS labels Mar 23, 2026
@kimvaddi kimvaddi marked this pull request as draft March 23, 2026 01:18

greptile-apps bot commented Mar 23, 2026

Greptile Summary

This PR adds a new Azure Container Apps installation guide (docs/install/azure-containers.md) and wires it into the nav, redirects, VPS picker, and platforms hub. The guide is well-structured and covers a meaningful gap — managed identity ACR pulls, Key Vault secrets via RBAC, persistent Azure Files storage, and a GHCR fallback path.

Key findings from the review:

  • Step ordering bug (P1): The container app is created with --registry-identity system in Step 8, which immediately triggers an ACR image pull using the new managed identity. The AcrPull role is not assigned until Step 9, so the first pull will be denied and the gateway container will start in a failed state. The guide needs either a re-deployment step after Step 9 (e.g. az containerapp update --image ... to trigger a new revision) or a note explaining this ordering constraint.
  • Missing resource group flag (P2): az keyvault show in the "Complete onboarding" section omits -g "${RG}", which is inconsistent with the rest of the guide and can cause failures for users with multiple active subscriptions.
  • Nav placement, redirects, alphabetical ordering in docs.json, card renaming in vps.md, and the link addition in platforms/index.md are all correct.

Confidence Score: 3/5

  • The guide contains a deployment-breaking step ordering issue that will cause all users to encounter a failed initial deployment.
  • The ACR role assignment ordering bug is not a "potential" issue — every user who follows the guide end-to-end will hit it. The initial container will fail to pull from ACR, the gateway won't start, and there's no documented recovery path. This is on the primary user path (deploy and verify the gateway). Once that step ordering is corrected (or a recovery step is documented), the PR is otherwise clean and ready to merge.
  • docs/install/azure-containers.md — Steps 8 and 9 ordering (ACR pull role must be effective before the first image pull attempt).

Inline comment (docs/install/azure-containers.md, line 291):

**P2 Missing `-g` flag on `az keyvault show`**

This command omits the resource group flag, unlike every other `az` command in the guide. While Key Vault names are globally unique, omitting `-g` can cause unexpected failures when users have multiple subscriptions active or when Azure CLI subscription defaults are not set correctly. Adding `-g "${RG}"` to `az keyvault show` keeps the command consistent with the rest of the guide and avoids ambiguity.
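
The corrected command would add the flag inline. The `KV_NAME` variable and the `--query` shown are illustrative assumptions mirroring the guide's other commands:

```bash
# Resource group pinned explicitly, consistent with the rest of the guide
az keyvault show -g "${RG}" -n "${KV_NAME}" \
  --query properties.vaultUri -o tsv
```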


Comment on lines +181 to +218
```bash
--registry-identity system \
--system-assigned \
--target-port 18789 \
--ingress external \
--min-replicas 1 --max-replicas 1 \
--cpu 0.5 --memory 1Gi \
--env-vars \
"OPENCLAW_GATEWAY_PORT=18789" \
"OPENCLAW_HOME=/data/.openclaw" \
--args "gateway" "run" "--bind" "all" "--port" "18789"
```

`--registry-identity system` tells Container Apps to pull images using the app's managed identity instead of ACR admin credentials. No passwords to manage or rotate.

<Note>
Set `--min-replicas 1` to keep the Gateway always running. Scaling to 0 stops the Gateway.
OpenClaw is a single-instance gateway — do not scale above 1 replica.
Using 0.5 vCPU / 1 GiB keeps costs low. Scale up to `--cpu 1.0 --memory 2Gi` if you hit OOMs or need more concurrency.
</Note>

</Step>

<Step title="Grant ACR pull permission to the managed identity">
```bash
IDENTITY_PRINCIPAL="$(az containerapp show -g "${RG}" -n "${ACA_APP}" \
--query identity.principalId -o tsv)"

ACR_ID="$(az acr show -g "${RG}" -n "${ACR_NAME}" --query id -o tsv)"

az role assignment create \
--assignee "${IDENTITY_PRINCIPAL}" \
--role AcrPull \
--scope "${ACR_ID}"
```

</Step>

<Step title="Grant Key Vault access to the managed identity">

P1 ACR role assignment after first image pull — initial deployment will fail

Steps 8 and 9 have a chicken-and-egg ordering problem. Step 8 creates the container app with --registry-identity system, which immediately schedules a container start and causes Azure to attempt an ACR image pull using the not-yet-authorized managed identity. The AcrPull role is only assigned in Step 9 — after the pull has already been attempted and failed.

The container app resource will be created successfully (exit code 0), but the underlying container will enter a failed/crashed state with an authorization error. The user will be left with a non-running gateway and no indication of why.

Fix: After the role assignment in Step 9, document that a new revision must be triggered to retry the (now-authorized) image pull:

```bash
# After granting AcrPull in Step 9, force a new revision:
az containerapp update \
  -g "${RG}" -n "${ACA_APP}" \
  --image "${ACR_NAME}.azurecr.io/openclaw:latest"
```

Alternatively, restructure the steps to use a user-assigned managed identity (pre-created and granted AcrPull before the container app is created), which avoids the ordering problem entirely.


Author

Fixed in c00e084. Renamed the step to "Grant ACR pull permission and restart the container", added an explanatory note that the initial pull will fail, and added az containerapp update --image ... after the role assignment to force a new revision that retries the (now-authorized) image pull.

Fixed in c00e084. Added -g "${RG}" to az keyvault show for consistency with the rest of the guide.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 209af79141


Comment on lines +183 to +190
```bash
--target-port 18789 \
--ingress external \
--min-replicas 1 --max-replicas 1 \
--cpu 0.5 --memory 1Gi \
--env-vars \
"OPENCLAW_GATEWAY_PORT=18789" \
"OPENCLAW_HOME=/data/.openclaw" \
--args "gateway" "run" "--bind" "all" "--port" "18789"
```

P1 Set Control UI allowed origins before enabling external ingress

Even after fixing the bind value, this revision still won't start a public Control UI. src/gateway/server-runtime-config.ts:139-146 rejects any non-loopback Control UI unless gateway.controlUi.allowedOrigins (or the dangerous host-header fallback) is configured. This guide enables external ingress and then tells readers to open the ACA FQDN, but it never writes that FQDN into the gateway config, so the gateway process exits before the UI becomes reachable.
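
A possible fix would be to set the allowed origin to the ACA FQDN before opening the UI. The sketch below assumes a JSON config shape derived from the `gateway.controlUi.allowedOrigins` key named above; the file location and surrounding schema are not confirmed by this thread:

```json
{
  "gateway": {
    "controlUi": {
      "allowedOrigins": ["https://<your-app>.<region>.azurecontainerapps.io"]
    }
  }
}
```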

