bug(eks): investigation produces zero EKS tools when AWS integration uses IAM user credentials #723

@chaosreload

Bug

When the AWS integration is configured with IAM user credentials (env AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY, or a stored access_key_id / secret_access_key on the integration — i.e. no role_arn, no injected _backend), opensre investigate on a Kubernetes alert produces zero EKS tools in the plan. EKS pods / events / nodes are silently not investigated even when cluster_name is present in the alert annotations and the credentials work for STS / EKS directly.

Environment

  • Commit: main (reproduced on current HEAD as of 2026-04-21, also present in 0.x releases)
  • Python: 3.12
  • AWS integration: configured via opensre onboarding aws, choosing "access key / secret" (no role ARN), or via env AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_REGION.

Reproduction

  1. Configure AWS integration with IAM user credentials, no role_arn:

    opensre onboarding aws
    # enter access key / secret / region, skip role ARN
  2. Run investigate on a Kubernetes alert with cluster_name set:

    opensre investigate -i tests/e2e/kubernetes/fixtures/datadog_k8s_alert.json
  3. Expected: EKS appears in the detected sources; pods / events / node-health tools are planned.

  4. Actual: EKS is silently skipped — the plan contains no EKS tools.

Root cause

app/integrations/catalog.py::resolve_integrations already lifts IAM-user credentials into _eks_int["credentials"] (around L229-L235). But app/nodes/plan_actions/detect_sources.py only checks _eks_int.get("role_arn") or an injected _backend before entering the EKS branch (L660):

_eks_int = (resolved_integrations or {}).get("aws")
_has_injected_eks_backend = bool(_eks_int and "_backend" in _eks_int)
if _eks_int and (_eks_int.get("role_arn") or _has_injected_eks_backend):
    ...  # EKS branch

IAM-user integrations have neither role_arn nor _backend, so the EKS branch is skipped entirely — even though credentials are present and valid.
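The skip can be reproduced in isolation with a minimal sketch. The wrapper name `eks_gate_current` and the three sample dicts are hypothetical and use placeholder values; only the boolean expression is taken from the snippet above.

```python
from typing import Any, Optional


def eks_gate_current(resolved_integrations: Optional[dict[str, Any]]) -> bool:
    """Hypothetical wrapper around the existing condition in detect_sources.py."""
    _eks_int = (resolved_integrations or {}).get("aws")
    _has_injected_eks_backend = bool(_eks_int and "_backend" in _eks_int)
    return bool(_eks_int and (_eks_int.get("role_arn") or _has_injected_eks_backend))


# Illustrative integration shapes; all values are placeholders.
role_based = {"aws": {"role_arn": "arn:aws:iam::123456789012:role/example"}}
backend_injected = {"aws": {"_backend": object()}}
iam_user = {
    "aws": {
        "credentials": {
            "access_key_id": "AKIAEXAMPLE",
            "secret_access_key": "example",
        }
    }
}

print(eks_gate_current(role_based))        # True
print(eks_gate_current(backend_injected))  # True
print(eks_gate_current(iam_user))          # False: the EKS branch is skipped
```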

This gate is the only remaining site that forgets credentials; other gates already include it correctly:

  • app/integrations/models.py::_require_auth_method (self.role_arn or self.credentials)
  • app/integrations/verify.py::_build_sts_client (if role_arn: ... else credentials branch)
  • app/cli/wizard/integration_health.py — same two-branch pattern ✔

Proposed fix

Accept _eks_int.get("credentials") as a third condition for entering the EKS branch. Downstream code (L693) already tolerates an empty role_arn via _eks_int.get("role_arn", ""), so no other change is required at the planning layer.

if _eks_int and (
    _eks_int.get("role_arn") or _has_injected_eks_backend or _eks_int.get("credentials")
):
    ...

This is a strict superset of the existing behaviour: every integration previously accepted is still accepted. The new branch only activates for integrations that already carry a credentials dict produced by catalog.resolve_integrations.
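The superset claim can be checked exhaustively over small input shapes. The sketch below is only an illustration: `gate_old`, `gate_new`, `build` and the sample values are hypothetical names introduced here, and the two conditions are copied from the snippets in this issue rather than imported from the codebase.

```python
from itertools import product
from typing import Any

_ABSENT = object()  # sentinel meaning "key not set on the integration"


def gate_old(aws: dict[str, Any]) -> bool:
    """Existing condition: role_arn or injected _backend."""
    has_backend = bool(aws and "_backend" in aws)
    return bool(aws and (aws.get("role_arn") or has_backend))


def gate_new(aws: dict[str, Any]) -> bool:
    """Proposed condition: additionally accepts a non-empty credentials dict."""
    has_backend = bool(aws and "_backend" in aws)
    return bool(
        aws and (aws.get("role_arn") or has_backend or aws.get("credentials"))
    )


def build(role_arn: Any, backend: Any, credentials: Any) -> dict[str, Any]:
    """Assemble an integration dict, omitting keys marked as absent."""
    fields = {"role_arn": role_arn, "_backend": backend, "credentials": credentials}
    return {k: v for k, v in fields.items() if v is not _ABSENT}


# Placeholder values covering absent, empty, and populated fields.
role_arns = [_ABSENT, "", "arn:aws:iam::123456789012:role/example"]
backends = [_ABSENT, "fake-backend"]
creds = [_ABSENT, {}, {"access_key_id": "AKIAEXAMPLE", "secret_access_key": "example"}]

newly_accepted = []
for combo in product(role_arns, backends, creds):
    aws = build(*combo)
    old, new = gate_old(aws), gate_new(aws)
    assert new or not old  # everything the old gate accepted is still accepted
    if new and not old:
        newly_accepted.append(aws)

# Only integrations carrying non-empty credentials are newly accepted.
assert all(a.get("credentials") for a in newly_accepted)
```

A check along these lines could be turned into a regression test next to the fix, so that later edits to the gate do not drop the credentials case again.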

Related

The k8s client that actually consumes these credentials also needs to learn to honour them (it currently calls _assume_role unconditionally with an empty RoleArn). That is a separate follow-up bug; I will open a second issue + PR for it that depends on this one.

Offer to fix

I have a fix ready (one-line change in detect_sources.py, plus ruff format); happy to open a PR referencing this issue. Tested locally with ruff check, ruff format --check, and py_compile.
