fix(eks): detect EKS source when AWS integration uses IAM user credentials#724
Conversation
…tials
Problem
-------
When the AWS integration is configured with IAM user credentials
(env AK/SK or stored access_key_id/secret_access_key), opensre does
not register EKS as an investigable source. The investigation plan
produces zero EKS tool calls even though a cluster_name is present
in the alert annotations and the credentials would otherwise work.
Reproduction
------------
1. Configure AWS integration with access_key_id/secret_access_key
(no role_arn, no injected _backend):
opensre onboarding aws
# enter access key / secret / region, skip role ARN
2. Run investigate on a kubernetes alert with cluster_name set:
opensre investigate -i tests/e2e/kubernetes/fixtures/datadog_k8s_alert.json
3. Expected: EKS appears in detected sources and pods/events tools
are planned.
Actual: EKS is silently skipped; plan has no EKS tools.
Root Cause
----------
app/integrations/catalog.py (resolve_integrations) already lifts IAM
user credentials into _eks_int["credentials"] (see L229-235), but
detect_sources only checks _eks_int.get("role_arn") or an injected
_backend before entering the EKS branch. IAM-user integrations have
neither, so the branch is skipped entirely.
Fix
---
Accept _eks_int.get("credentials") as a third way to gate the EKS
branch. The downstream code already tolerates an empty role_arn
(L693 uses _eks_int.get("role_arn", "")), so no other change is
required at the planning layer.
The k8s client that actually consumes these credentials still needs
to learn to honour them — that is a separate bug tracked in the
follow-up PR on app/services/eks/eks_k8s_client.py.
Risk / Backward Compatibility
-----------------------------
Strict superset of current behaviour: any integration previously
accepted (role_arn or injected backend) is still accepted. The new
branch only activates for integrations that already carry a
credentials dict produced by catalog.resolve_integrations.
Greptile SummaryThis PR fixes a silent bug where Key changes:
Concerns:
Confidence Score: 4/5Safe to merge as a prerequisite fix; one P1 gap (credentials not forwarded in eks_params) should ideally be addressed here or guaranteed in the follow-up. The detection gate change is correct and consistent with the existing auth patterns across the codebase. The PR is a strict superset of existing behaviour and doesn't regress role_arn or _backend paths. The known downstream failure (build_k8s_clients still assumes role unconditionally) is explicitly acknowledged with a follow-up. The only unaddressed gap is that credentials are not included in eks_params, which means the follow-up PR will also need to touch detect_sources.py. This is a P1 concern worth resolving, either here or coordinated in the follow-up. app/nodes/plan_actions/detect_sources.py — specifically the eks_params dict construction (lines 682–697) where credentials should be forwarded. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[AWS integration resolved] --> B{_eks_int present?}
B -- No --> Z[Skip EKS]
B -- Yes --> C{role_arn present?\nor _backend injected?\nor credentials present?}
C -- No --> Z
C -- Yes --> D[Extract eks_cluster from annotations]
D --> E{eks_cluster found?}
E -- No --> Z
E -- Yes --> F[Build eks_params\ncluster_name, namespace, role_arn, region, ...]
F --> G{_backend injected?}
G -- Yes --> H[eks_params._backend = ...\nno connection_verified]
G -- No --> I[eks_params.connection_verified = True]
I --> J[sources.eks = eks_params]
H --> J
J --> K[EKS tools planned]
K --> L{runtime: role_arn empty\nand no credentials forwarded}
L -- role_arn auth --> M[build_k8s_clients OK]
L -- credentials only --> N[_assume_role fails\nfollowup PR needed]
|
|
Thanks for the review @greptile-apps — good catch on the Plan This PR stays scoped to the single-line gate fix. The follow-up PR (on Why not pre-land the I looked at this before opening the PR and considered two shapes for the follow-up:
The follow-up PR will take approach #2, so the Will link the follow-up PR here once this one lands. Thanks again for the review — your P1 comment pushed me to re-verify that approach #1 alone isn't sufficient for the stored-integration path, which saved a round-trip. |
|
That scoping makes sense — a half-wired field on Approach #2 is the right call for the follow-up. The ambient botocore resolver would silently miss stored-integration credentials exactly when the user is relying on them (i.e., the case this whole fix is about), so #1 alone would be a partial fix that's hard to test reliably. The gate fix here looks good to merge. Just make sure the follow-up PR links back to this one so the two changes are traceable together. Tip: You can customize Greptile's behavior for this repo with |
…ilder Fixes Tracer-Cloud#969. Follow-up to Tracer-Cloud#724. Problem ------- PR Tracer-Cloud#724 let EKS be detected when the AWS integration is configured with stored IAM user credentials (access_key_id + secret_access_key, no role_arn), but build_k8s_clients still had no way to use those stored credentials. After Tracer-Cloud#724 it supports: 1. role_arn -> STS AssumeRole 2. ambient botocore chain (env / shared config / instance profile / IRSA) The stored integration credentials (persisted by resolve_integrations and surfaced in _eks_int['credentials']) were silently dropped. On hosts that happen to have ambient credentials pointing elsewhere, this picks up the wrong principal; on hosts without ambient credentials the call fails entirely, even though the integration has perfectly usable keys. Fix (approach Tracer-Cloud#2, pre-approved by Greptile in PR Tracer-Cloud#724 review) ------------------------------------------------------------- Forward the stored credentials end-to-end through the EKS pipeline: 1. detect_sources.py: set eks_params['credentials'] = _eks_int.get( 'credentials') or None alongside the other EKS integration fields. 2. EKSListClustersTool._eks_creds (shared helper used by every EKS tool for its extract_params): include 'credentials' in the returned dict so it flows into each tool's kwargs. 3. Seven EKS tool functions accept a new `credentials` kwarg and forward it to build_k8s_clients (and surface it in their input_schema): EKSPodLogsTool, EKSListPodsTool, EKSEventsTool, EKSNodeHealthTool, EKSListDeploymentsTool, EKSDeploymentStatusTool, EKSListNamespacesTool. 4. eks_k8s_client.build_k8s_clients accepts an optional credentials kwarg. Resolution priority: 1. explicit credentials kwarg (stored-integration, new) 2. role_arn AssumeRole (existing production path) 3. ambient botocore chain (Tracer-Cloud#724 fallback) The two secondary fixes from Tracer-Cloud#724 are preserved (empty SessionToken coerced to None, regional STS endpoint). Addresses Greptile P1 --------------------- The initial commit only wired the two ends of the pipeline (detect_sources and build_k8s_clients); Greptile's review of Tracer-Cloud#976 correctly pointed out that the shared _eks_creds helper and the seven tool call sites in between were never updated, leaving the new credentials branch unreachable. This revision wires credentials through every intermediate layer so each of the seven EKS tools passes credentials=credentials to build_k8s_clients. Test plan --------- - py_compile passes on all 10 changed files. - ruff check + ruff format clean on all 10 changed files. - Regression safety: the role_arn (AssumeRole) and ambient-botocore branches are unchanged; the new explicit branch only runs when the credentials kwarg is truthy and has access_key_id + secret_access_key. Related ------- - Fixes Tracer-Cloud#969. - Follow-up to Tracer-Cloud#724. - Greptile pre-approved this approach in Tracer-Cloud#724 (comment) - Greptile P1 on Tracer-Cloud#976 that drove the helper+tools fixups: Tracer-Cloud#976 (comment)
…lients Problem ------- PR Tracer-Cloud#724 fixed the EKS source-detection gate so an AWS integration configured with stored IAM user credentials (access_key_id + secret_access_key, no role_arn) now activates the EKS branch in detect_sources. The follow-up path build_k8s_clients still does not know about those stored credentials. eks_params already carries role_arn, external_id, cluster_names, etc. but not credentials, and build_k8s_clients had no way to receive them. As a result the function unconditionally called STS AssumeRole, which fails for IAM-user-only integrations even though the credentials are persisted on the integration record. Fix (Approach Tracer-Cloud#2 from issue) ---------------------------- Forward the stored credentials through the existing eks_params pipeline: 1. detect_sources sets eks_params["credentials"] = _eks_int.get( "credentials") or None alongside the other EKS integration fields. 2. _eks_creds (the shared extract_params helper used by every k8s-API EKS tool) surfaces the credentials field. 3. The seven k8s-API EKS tools accept an optional credentials kwarg and thread it into build_k8s_clients. 4. build_k8s_clients accepts an optional credentials kwarg. New resolution priority: 1. explicit credentials kwarg (stored-integration, new) 2. role_arn AssumeRole (existing production path) Empty SessionToken is coerced to None — botocore rejects "" but accepts a missing token for IAM user credentials. Risk / Backward Compatibility ----------------------------- Strict superset of current behaviour. The new explicit branch only runs when credentials is truthy and carries both access_key_id and secret_access_key. Tools that do not receive credentials behave exactly as before (AssumeRole on the supplied role_arn). Tests ----- - tests/services/test_eks_k8s_client.py: direct coverage for _stored_credentials_to_aws_creds (full dict, empty session token, missing keys) and build_k8s_clients credential resolution priority (stored credentials skip AssumeRole; missing/incomplete fall back). - tests/tools/test_eks_list_clusters_tool.py: _eks_creds exposes the credentials field (and defaults it to None when absent). - tests/tools/test_eks_list_pods_tool.py: list_eks_pods threads the credentials kwarg into build_k8s_clients. - tests/nodes/plan_actions/test_detect_sources.py: detect_sources forwards stored credentials onto eks_params and defaults to None for role_arn-only integrations. Fixes Tracer-Cloud#969
Sets `eks_params["credentials"] = _eks_int.get("credentials") or None`
alongside the other EKS integration fields. This is the last link in
the chain that lets `build_k8s_clients` see stored IAM user keys when
the AWS integration was configured with `access_key_id` /
`secret_access_key` and no `role_arn`.
PR Tracer-Cloud#724 enabled the EKS branch in `detect_sources` for this
integration shape but did not forward the credentials themselves;
without this commit the stored keys are still silently dropped before
they reach `build_k8s_clients`.
Tests cover the populated and absent (`role_arn`-only) cases.
Fixes Tracer-Cloud#969
Sets `eks_params["credentials"] = _eks_int.get("credentials") or None`
alongside the other EKS integration fields. This is the last link in
the chain that lets `build_k8s_clients` see stored IAM user keys when
the AWS integration was configured with `access_key_id` /
`secret_access_key` and no `role_arn`.
PR Tracer-Cloud#724 enabled the EKS branch in `detect_sources` for this
integration shape but did not forward the credentials themselves;
without this commit the stored keys are still silently dropped before
they reach `build_k8s_clients`.
Tests cover the populated and absent (`role_arn`-only) cases.
Fixes Tracer-Cloud#969
…ilder (#976) * fix(eks): forward stored AWS integration credentials to k8s client builder Fixes #969. Follow-up to #724. Problem ------- PR #724 let EKS be detected when the AWS integration is configured with stored IAM user credentials (access_key_id + secret_access_key, no role_arn), but build_k8s_clients still had no way to use those stored credentials. After #724 it supports: 1. role_arn -> STS AssumeRole 2. ambient botocore chain (env / shared config / instance profile / IRSA) The stored integration credentials (persisted by resolve_integrations and surfaced in _eks_int['credentials']) were silently dropped. On hosts that happen to have ambient credentials pointing elsewhere, this picks up the wrong principal; on hosts without ambient credentials the call fails entirely, even though the integration has perfectly usable keys. Fix (approach #2, pre-approved by Greptile in PR #724 review) ------------------------------------------------------------- Forward the stored credentials end-to-end through the EKS pipeline: 1. detect_sources.py: set eks_params['credentials'] = _eks_int.get( 'credentials') or None alongside the other EKS integration fields. 2. EKSListClustersTool._eks_creds (shared helper used by every EKS tool for its extract_params): include 'credentials' in the returned dict so it flows into each tool's kwargs. 3. Seven EKS tool functions accept a new `credentials` kwarg and forward it to build_k8s_clients (and surface it in their input_schema): EKSPodLogsTool, EKSListPodsTool, EKSEventsTool, EKSNodeHealthTool, EKSListDeploymentsTool, EKSDeploymentStatusTool, EKSListNamespacesTool. 4. eks_k8s_client.build_k8s_clients accepts an optional credentials kwarg. Resolution priority: 1. explicit credentials kwarg (stored-integration, new) 2. role_arn AssumeRole (existing production path) 3. ambient botocore chain (#724 fallback) The two secondary fixes from #724 are preserved (empty SessionToken coerced to None, regional STS endpoint). Addresses Greptile P1 --------------------- The initial commit only wired the two ends of the pipeline (detect_sources and build_k8s_clients); Greptile's review of #976 correctly pointed out that the shared _eks_creds helper and the seven tool call sites in between were never updated, leaving the new credentials branch unreachable. This revision wires credentials through every intermediate layer so each of the seven EKS tools passes credentials=credentials to build_k8s_clients. Test plan --------- - py_compile passes on all 10 changed files. - ruff check + ruff format clean on all 10 changed files. - Regression safety: the role_arn (AssumeRole) and ambient-botocore branches are unchanged; the new explicit branch only runs when the credentials kwarg is truthy and has access_key_id + secret_access_key. Related ------- - Fixes #969. - Follow-up to #724. - Greptile pre-approved this approach in #724 (comment) - Greptile P1 on #976 that drove the helper+tools fixups: #976 (comment) * chore: trigger Greptile re-review after end-to-end credentials wiring The amend (ac2d841) wired credentials through _eks_creds helper + 7 EKS tool call sites per Greptile P1 on the initial commit b03928d. Pushing an empty commit to ensure Greptile re-reviews the new diff (10 files, +114/-11 vs the original 2 files, +50/-4). * chore: second re-trigger Greptile re-review First re-trigger commit (61f15ed) did not produce a Greptile re-review after ~4h. Pushing another empty commit to kick Greptile again so the end-to-end credentials wiring (helper + 7 EKS tools in ac2d841) gets a fresh review. * fix(eks): thread stored credentials through EKSClient-backed tools PR #976 wired stored AWS-integration credentials end-to-end for the seven k8s-API EKS tools that route through `build_k8s_clients`, but missed the four EKS tools that still go through `EKSClient`: `list_eks_clusters`, `describe_eks_cluster`, `describe_eks_addon`, `get_eks_nodegroup_health`. Their `_eks_creds` helper already exposes `credentials`, but each tool absorbed it via `**_kwargs` and never forwarded it, so `EKSClient._build` unconditionally called `sts.assume_role(RoleArn="", ...)` and raised `ParamValidationError` for IAM-user-only integrations at the cluster-discovery step. Greptile flagged this exact gap on the (now-closed) parallel PR #990. This commit closes it on #976. Changes ------- - `EKSClient.__init__` takes an optional `credentials` kwarg. `_build` prefers stored IAM user keys over STS AssumeRole; if neither is provided it raises `ValueError` (no silent fallthrough). - New shared helper `_stored_credentials_to_aws_creds` normalises the stored dict into the AssumeRole-shaped triplet (empty `session_token` → None, both required keys present). `build_k8s_clients` now uses the same helper so both code paths stay in sync. - The four EKSClient-backed tools accept `credentials` in the signature, expose it in `input_schema`, and pass it into the constructor. Tests ----- - New `tests/services/test_eks_client.py` covers the helper (full triplet / empty session token / missing required keys / None / empty dict) and EKSClient credential resolution (stored skips AssumeRole, empty session token coerced, partial dict falls back to AssumeRole, no creds + no role_arn raises ValueError, AssumeRole external_id passthrough stays intact). - `test_eks_list_clusters_tool.py` grows two forwarding tests: the kwarg threads into EKSClient, and stays None by default for existing role-based callers. Quality gate ------------ make lint / format-check / typecheck / test-cov all green (3328 passed, 2 skipped, 1 xfailed). Related ------- - Extends #976 - Fixes #969 - Closes Greptile P1 on #990 (#990 (comment)) * refactor(eks): address maintainer review feedback Addresses all five inline comments from @muddlebee on #976: 1. **eks_client.py** — add `from __future__ import annotations` (project style convention). 2. **utils module, public helper** — move `_stored_credentials_to_aws_creds` out of `eks_client.py` into a new `app/services/eks/utils.py` and drop the leading underscore (`stored_credentials_to_aws_creds`). The import contract between `eks_client` and `eks_k8s_client` is now explicit public API rather than two services reaching into each other for a private helper. 3. **eks_k8s_client.py** — promote the `import botocore.session` that lived in the ambient-fallback branch to the top-level imports. No circular-import concern (`botocore` is already imported at module level). 4. **eks_k8s_client.py** — normalize `frozen.token or None` at the source instead of relying on the downstream `or None` guard. This matches the behaviour of `stored_credentials_to_aws_creds` and removes the hidden dependency. 5. **All 11 EKS tool files** — tighten `credentials` parameter type annotation from `dict | None` to `dict[str, Any] | None`, matching the service-layer style and keeping mypy strict mode happy. The `tests/services/test_eks_client.py` import also switches to the new public name and module path. Quality gate ------------ make lint → All checks passed! make format-check → 906 files already formatted make typecheck → Success: no issues found in 392 source files make test-cov → 3328 passed, 2 skipped, 1 xfailed in 43.36s No behaviour changes — pure refactor of public surface + style. --------- Co-authored-by: weichao <[email protected]> Co-authored-by: chaosreload <[email protected]>
Fixes #723
Summary
When the AWS integration is configured with IAM user credentials (env AK/SK or a stored
access_key_id/secret_access_keywith norole_arn, no injected_backend),opensre investigateon a Kubernetes alert silently plans zero EKS tools — EKS pods/events/nodes are never queried even thoughcluster_nameis present in the alert and the credentials work for STS.This PR fixes that by letting
_eks_int.get("credentials")gate the EKS branch indetect_sources, matching the pattern already used bymodels.py,verify.py, andintegration_health.py.Full problem statement, reproduction, and root-cause analysis are in #723.
Change
One-line change in
app/nodes/plan_actions/detect_sources.py(plusruff formatwrapping it across three lines for line-length):Risk / Backward Compatibility
Strict superset of current behaviour:
credentialsdict produced bycatalog.resolve_integrations.detect_sources.pyL693) already tolerates an emptyrole_arnvia_eks_int.get("role_arn", ""), so the planning layer needs no other change.Same-pattern audit
I grepped the codebase for the same
role_arn/_backend-without-credentialsgate pattern.detect_sources.py:660is the only remaining site. Other auth gates already includecredentials:app/integrations/models.py::_require_auth_method—self.role_arn or self.credentials✔app/integrations/verify.py::_build_sts_client—if role_arn: ... else credentials branch✔app/cli/wizard/integration_health.py— same two-branch pattern ✔Local checks
Follow-up
The EKS k8s client that consumes these credentials still needs to learn to honour them (currently calls
_assume_role(role_arn, ...)unconditionally with an emptyRoleArn, fails before describe_cluster). A follow-up PR will address that and will explicitly declareDepends on #<this PR>in its body.