Skip to content

feat(network): NetworkTopology collector + l8k library integration#1474

Merged
mchmarny merged 1 commit into
NVIDIA:mainfrom
almaslennikov:network-collector-v2
Jun 25, 2026
Merged

feat(network): NetworkTopology collector + l8k library integration#1474
mchmarny merged 1 commit into
NVIDIA:mainfrom
almaslennikov:network-collector-v2

Conversation

@almaslennikov

@almaslennikov almaslennikov commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds a NetworkTopology Measurement collector that ingests k8s-launch-kit (l8k) cluster-config — either from a file (--cluster-config <path>) or via live discovery (--discover-network) — and emits the canonical NetworkTopology shape documented in docs/integrator/measurement-api.md.

Re-submission of #1472 (closed). The blocker on the prior PR was verify-licenses rejecting MPL-2.0 hashicorp/go-multierror, transitively pulled in via Helm SDK from pkg/networkoperatorplugin. l8k PR #104 (merged as a99c9684) split discovery into a Helm-free sub-package pkg/networkoperatorplugin/discovery; this PR pins that merge and migrates the import. The MPL-2.0 dep is gone and the aicr vendor tree shrinks by ~950 files.

Motivation / Context

aicr's snapshot model gained a NetworkTopology Measurement type in Phase 1 (shape contract in docs/integrator/measurement-api.md) but nothing populated it. This PR wires the producer side: per-hardware-group PFs (PCI/PSID/partNumber/RDMA-device/rail/NUMA), per-group capabilities (sriov/rdma/ib), linkType, kernel-module lists, machine/GPU type, recommended firmware — all captured by l8k discovery and translated into one Measurement.

Fixes: N/A
Related: closes #1472 (this PR re-submits with the verify-licenses gate unblocked); l8k PR #104 (the upstream split that made this possible)

Type of Change

  • New feature (non-breaking change that adds functionality)

Component(s) Affected

  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • CLI (cmd/aicr, pkg/cli) — new --cluster-config / --discover-network flags
  • Core libraries (pkg/errors, pkg/k8s) — added client.ResolveKubeconfigPath helper

Implementation Notes

New package pkg/collector/network with two source modes:

  • ClusterConfigPath — parse cluster-config.yaml from disk; no cluster contact.
  • DiscoverNetwork — call l8k's live Discover against the cluster. NOT read-only: writes nvidia.kubernetes-launch-kit.{machine,gpu} labels and patches NicClusterPolicy via server-side-apply.

When both are set, file path wins. When neither is set, the collector is "inactive" — Collect returns (nil, nil). Multi-group input is rejected with ErrCodeInvalidRequest + groupCount context; multi-group support is a future iteration.

Translator converts l8kconfig.LaunchKitConfigNetworkTopology Measurement with identity / capabilities / pfs / kernel-modules subtypes. Per-PF Context carries model, connectedGPU, gpuProximity alongside pciAddress, deviceID, psid, partNumber, rdmaDevice, networkInterface. Per-PF Data carries rail, numaNode, traffic.

Cross-repo dependency. go.mod pins github.com/nvidia/k8s-launch-kit v0.0.0-20260625130457-a99c9684627e — the merge of l8k PR #104. No local replace. The discovery sub-package is Helm-free by construction; go list -deps ./pkg/collector/network | grep -E 'helm\.sh|hashicorp/go-multierror' is empty.

CLI: aicr snapshot --cluster-config <path> | --discover-network with env-var bindings (AICR_CLUSTER_CONFIG_PATH, AICR_DISCOVER_NETWORK) for in-pod Job mode. In local mode (AICR_AGENT_MODE=true), the CLI seeds KUBECONFIG from --kubeconfig before calling Measure so the l8k client and aicr's K8s client resolve the same path.

Job-mode RBAC for --discover-network lives in pkg/k8s/agent/rbac.go: ensureClusterRole conditionally appends discoverNetworkClusterRules() (CRDs, namespaces, daemonsets, serviceaccounts/configmaps, roles/rolebindings, pods/exec, nodes/patch, configuration.net.nvidia.com/nicdevices, mellanox.com/nicclusterpolicies) when Config.DiscoverNetwork is true. Non-network snapshots keep the existing minimal-priv 3-rule ClusterRole.

Shared helper pkg/k8s/client.ResolveKubeconfigPath — lifted the explicit-path → KUBECONFIG env → ~/.kube/config → in-cluster chain out of BuildKubeClient so aicr's K8s client and the l8k discovery client resolve kubeconfigs identically.

Snapshotter hardening: collectSafe now skips nil Measurements (required for the network collector's inactive-mode contract; harmless to existing collectors).

Testing

make qualify   # full gate
  • go test -race ./pkg/collector/network/... — all tests pass; 9 cases in translate_test.go, 6 cases in network_test.go covering inactive contract / file-mode success / missing path / invalid YAML / multi-group rejection / file-path-wins-over-discover.
  • go test -race ./pkg/k8s/agent/... — the new discover-network RBAC subtest pins that the extra rules (mellanox.com/nicclusterpolicies patch, pods/exec create, nodes/patch) land when Config.DiscoverNetwork is set.
  • golangci-lint run ./pkg/... — 0 issues.
  • go list -deps ./pkg/collector/network | grep -E 'helm\.sh|hashicorp/go-multierror|containerd|lib/pq|jmoiron/sqlx' — empty.

End-to-end discovery smoke-tested against a 4-node ThinkSystem SR675-V3 RTX-PRO-6000 cluster — AICR_AGENT_MODE=true aicr snapshot --discover-network writes a NetworkTopology Measurement with the contracted shape including gpuProximity / connectedGPU per-PF Context overlaid from l8k preset topology. The output was captured in the closed PR #1472's commit message; behaviour is unchanged from there modulo the API migration.

Risk Assessment

  • Medium — New collector + new RBAC + cross-repo dependency on a freshly-merged l8k API. Behind opt-in CLI flags (--cluster-config / --discover-network); off-by-default for existing snapshot users. Discovery is cluster-mutating (writes node labels + patches NicClusterPolicy) but only when the user opts in.

Rollout notes: Backwards compatible for existing snapshot users — no new flags required. Job-mode --discover-network requires the new RBAC rules in ensureClusterRole; the ClusterRole is recreated on each snapshot Job deploy so existing deployments pick it up automatically. The l8k pseudo-version is the canonical merged-main commit, not a branch tip.

Not in this PR

  • ConfigMap mounting for Job-mode --cluster-config — file source works in local mode (AICR_AGENT_MODE=true) today.
  • Multi-group NetworkTopology emission. The single-group enforcement is in toMeasurement; lifting it requires multi-Measurement-per-Type consumer rewrites in constraints extractor, recipe validation, diff indexing, and fingerprint — tracked as a follow-up phase.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (golangci-lint run ./pkg/...)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed (docs/integrator/measurement-api.md updated in the schema-foundations work; this PR's new flags are CLI-only)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@github-actions

Copy link
Copy Markdown
Contributor

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a network-topology collection path that can read an l8k cluster-config file or perform live discovery, and translates the result into NetworkTopology measurements with identity, capabilities, PF, and kernel-module subtypes. The PR threads cluster-config/discovery settings through CLI flags, snapshotter env parsing, agent deployment, kubeconfig resolution, and RBAC, and adds tests plus measurement API documentation updates.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/aicr#1417: Introduced the NetworkTopology measurement schema and subtype/item conventions that this PR’s translator and docs build on.

Suggested labels

area/tests

Suggested reviewers

  • mchmarny
  • njhensley
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding a NetworkTopology collector and integrating l8k.
Description check ✅ Passed The description is directly about the NetworkTopology collector, its modes, and the related integration work.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/cli/snapshot.go`:
- Around line 550-557: The kubeconfig seeding logic in snapshot.go preserves an
inherited KUBECONFIG even when opts.kubeconfig is explicitly set, which can
cause l8k and the rest of the CLI to target different clusters. Update the
kubeconfig setup in the branch around the KUBECONFIG assignment so that an
explicit opts.kubeconfig always wins and is written into the environment before
the network collector runs, ensuring kubeclient.New and the rest of the
agent-mode flow use the same resolved kubeconfig.

In `@pkg/collector/network/network_test.go`:
- Around line 66-163: Refactor the existing Collector.Collect tests in
network_test.go into a single table-driven test with subtests, since they all
share the same setup/assertion pattern across different inputs and outcomes.
Keep the coverage for inactive, file success, missing path, invalid YAML,
multi-group rejection, and file-path-vs-discovery behavior, but move the
per-case expectations into a test table and use t.Run for each scenario. Use the
existing TestCollect_* cases, Collector.Collect, and writeTempConfig as the key
anchors while consolidating repeated error/type checks into shared helper
assertions if needed.

In `@pkg/collector/network/network.go`:
- Around line 159-163: The Discover call in network collector error handling is
wrapping every failure as ErrCodeUnavailable, which hides timeout/cancellation
signals. In the error path after l8kdisc.Discover in the network collector flow,
check for context.DeadlineExceeded and context.Canceled first and return those
directly so the 10-minute deadline still surfaces as a timeout/cancel. Only wrap
non-context errors with errors.Wrap(errors.ErrCodeUnavailable, ...) to keep real
availability failures distinct.
- Around line 113-122: Reject oversized cluster-config input before parsing by
checking the actual file size or otherwise confirming the reader hit the byte
cap in the network collector flow. In the code around l8kdisc.ParseClusterConfig
and the limited reader in network.go, add an explicit oversize guard that
returns an invalid-request error when the config exceeds
defaults.MaxClusterConfigBytes, rather than relying on parsing to fail. Keep the
fix localized to the cluster-config loading path so oversized files cannot
produce a partial NetworkTopology snapshot.
- Around line 105-109: The ClusterConfigPath open path currently maps every
stdos.Open failure to ErrCodeNotFound, so adjust the error handling around that
open call to only wrap true missing-file cases with errors.ErrCodeNotFound. In
the logic that opens c.ClusterConfigPath, inspect the returned error and
preserve non-not-found failures (such as permission or ENOTDIR) with their real
error code/context, while keeping the not-found translation only for the actual
missing-file case.

In `@pkg/collector/network/translate_test.go`:
- Around line 29-73: The PF translation test fixture currently only validates
always-present fields, so it can miss regressions in the optional PF context
keys. Update fixtureSingleGroup and the translate test assertions to cover a
populated PF case for model, connectedGPU, and gpuProximity, and also add an
explicit check that these keys are omitted when unset. Use the existing
PFConfig/LaunchKitConfig setup and the translate-related assertions in
translate_test.go to keep the coverage tied to the actual output shape.

In `@pkg/k8s/client/client.go`:
- Around line 83-85: ResolveKubeconfigPath is currently returning the raw
KUBECONFIG env value, which breaks multi-file kubeconfig setups when
clientcmd.BuildConfigFromFlags treats it as a single path. Update the kubeconfig
loading flow around ResolveKubeconfigPath and the BuildConfigFromFlags call to
either rely on client-go’s default loading rules for KUBECONFIG or explicitly
split/merge the colon-separated list before creating the config. Ensure the
client creation path in client.go handles KUBECONFIG=/base:/overlay correctly.

In `@pkg/snapshotter/agent.go`:
- Around line 162-163: The Job-mode snapshot flow in agent.go is passing
ClusterConfigPath through unchanged, which leaves the pod pointing at an
unreadable host path. Update the code that builds the Job spec or agent config
to either mount the cluster config file and rewrite ClusterConfigPath to the
in-pod location, or explicitly reject ClusterConfigPath when launching a Job
until mounting support exists. Use the Snapshotter/Job construction path and the
ClusterConfigPath/DiscoverNetwork handling in agent.go to locate the fix.

In `@pkg/snapshotter/snapshot_test.go`:
- Around line 126-131: The snapshot tests are still unable to model an inactive
network collector, so the new nil-skip path in collectSafe remains untested.
Update the mockCollector/CreateNetworkCollector test setup to support an
explicit “inactive” state or a distinct override that can return a real nil
Measurement with no error, then use that in the snapshot tests and assertions
that currently rely on networkMeasurement so the nil branch is exercised.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 3457da35-5c2b-4e0b-b083-78f3d6dcf176

📥 Commits

Reviewing files that changed from the base of the PR and between ddc4dde and 3ba60d8.

⛔ Files ignored due to path filters (89)
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/Mellanox/doca-driver-build/entrypoint/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/doca-driver-build/entrypoint/pkg/mofedmodules/mofedmodules.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/doc.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/groupversion_info.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/hostdevicenetwork_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/ipoibnetwork_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/macvlannetwork_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/nic_policy_cr.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicclusterpolicy_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicnodepolicy_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/utils.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/pkg/consts/consts.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/doc.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/groupversion_info.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicconfigurationtemplate_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicdevice_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaresource_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaretemplate_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicinterfacenametemplate.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-operator-libs/LICENSE is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/upgrade_spec.go is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/LICENSE is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/comments.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/config.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/default-config.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/exec.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/kubeclient.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/discover.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/gputopology.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/library.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/ns-product-ids is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/nvidia.ids is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/pciids.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pfutil/pfutil.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicconfigurationtemplates.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicdevices.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaresources.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaretemplates.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicinterfacenametemplates.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/daemon.yaml.tmpl is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/embed.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/bootstrap.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/teardown.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presetmatch/presetmatch.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-R760xa-H100-NVL/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE7745-RTX-PRO-4500/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE9680-H200/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR650-V4-RTX-PRO-6000/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-H200-NVL/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-RTX-PRO-6000/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-System-Board-RTX-PRO-6000/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR680a-V3-H200/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/UCSC-885A-M8-H22-H200/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/download.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/presets.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/synthesize.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/validate.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/context.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/json_output.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/progress.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/ui.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/.travis.yml is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/LICENSE is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/LICENSE.libyaml is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/NOTICE is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/README.md is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/apic.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/decode.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/emitterc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/encode.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/parserc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/readerc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/resolve.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/scannerc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/sorter.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/writerc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/yaml.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/yamlh.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/yamlprivateh.go is excluded by !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
  • vendor/sigs.k8s.io/controller-runtime/pkg/scheme/scheme.go is excluded by !vendor/**
📒 Files selected for processing (17)
  • docs/integrator/measurement-api.md
  • go.mod
  • pkg/cli/snapshot.go
  • pkg/collector/factory.go
  • pkg/collector/network/network.go
  • pkg/collector/network/network_test.go
  • pkg/collector/network/translate.go
  • pkg/collector/network/translate_test.go
  • pkg/defaults/timeouts.go
  • pkg/k8s/agent/deployer_test.go
  • pkg/k8s/agent/job.go
  • pkg/k8s/agent/rbac.go
  • pkg/k8s/agent/types.go
  • pkg/k8s/client/client.go
  • pkg/snapshotter/agent.go
  • pkg/snapshotter/snapshot.go
  • pkg/snapshotter/snapshot_test.go

Comment thread pkg/cli/snapshot.go Outdated
Comment thread pkg/collector/network/network_test.go
Comment thread pkg/collector/network/network.go Outdated
Comment thread pkg/collector/network/network.go Outdated
Comment thread pkg/collector/network/network.go
Comment thread pkg/collector/network/translate_test.go
Comment thread pkg/snapshotter/agent.go
Comment thread pkg/snapshotter/snapshot_test.go
@almaslennikov almaslennikov force-pushed the network-collector-v2 branch from 3ba60d8 to a24c845 Compare June 25, 2026 13:42

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pkg/k8s/agent/job.go (1)

109-139: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Passing AICR_CLUSTER_CONFIG_PATH into the Job isn't enough.

buildPodSpec only mounts /tmp, so this env var points at a path from the submitting process, not a file that exists inside the pod. Because the network collector checks ClusterConfigPath before DiscoverNetwork, agent-mode jobs will take the file branch and fail to open the path instead of discovering from the cluster. Either stage the file into the pod (for example via ConfigMap/volume) or reject ClusterConfigPath for agent deployments.

Also applies to: 322-334

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/k8s/agent/job.go` around lines 109 - 139, The Job pod spec in
buildPodSpec does not make AICR_CLUSTER_CONFIG_PATH available inside the pod, so
agent-mode jobs will try to open a host path that does not exist in the
container. Update the Deployer/buildPodSpec flow to either mount or stage the
cluster config file into the pod (for example through a volume or ConfigMap) so
the path is valid, or explicitly reject ClusterConfigPath for agent deployments
before constructing the PodSpec. Ensure the fix is applied where buildEnvVars is
consumed by buildPodSpec so the NetworkCollector can safely decide between
ClusterConfigPath and DiscoverNetwork.
pkg/k8s/agent/rbac.go (1)

186-205: 🎯 Functional Correctness | 🟠 Major

pkg/k8s/agent/rbac.go:186-205 — Preserve resourceVersion before updating the existing ClusterRole.

The AlreadyExists path calls Update on a newly constructed ClusterRole with no metadata.resourceVersion, so the apiserver will reject the write and --discover-network won’t upgrade an existing role. Fetch the current object first (or use patch/server-side apply) and carry its resourceVersion into the update.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/k8s/agent/rbac.go` around lines 186 - 205, The ClusterRole reconciliation
in createClusterRole currently retries an Update after AlreadyExists using a
freshly constructed rbacv1.ClusterRole, but that object has no resourceVersion
so the apiserver will reject it. Fix the AlreadyExists path by first fetching
the existing ClusterRole and copying its resourceVersion into the object before
calling d.clientset.RbacV1().ClusterRoles().Update, or switch this flow to
patch/server-side apply; keep the change localized around createClusterRole and
the ClusterRoles().Create/Update sequence.
♻️ Duplicate comments (4)
pkg/k8s/client/client.go (1)

62-84: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't pass raw KUBECONFIG through BuildConfigFromFlags.

ResolveKubeconfigPath returns the trimmed KUBECONFIG value as-is, but BuildConfigFromFlags("", kubeconfig) turns that into ClientConfigLoadingRules{ExplicitPath: kubeconfigPath}. In client-go, ExplicitPath means “load only this file”, while KUBECONFIG is a list of files to merge, so values like /base:/overlay will break here instead of using the merged config. Use default loading rules when the source is KUBECONFIG, or split/merge the list before building the config. (kubernetes.io)

Also applies to: 138-149

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/k8s/client/client.go` around lines 62 - 84, BuildKubeClient is treating
the resolved KUBECONFIG value as a single explicit file, which breaks merged
kubeconfig lists. Update the config निर्माण path so BuildConfigFromFlags is not
used with a raw KUBECONFIG list from ResolveKubeconfigPath; instead, use
client-go default loading rules when the source is KUBECONFIG or split/merge the
list before creating the config. Keep the existing ResolveKubeconfigPath
behavior for explicit paths, and adjust the BuildKubeClient flow accordingly
wherever it consumes the kubeconfig string.
pkg/collector/network/network.go (1)

113-122: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Still unresolved: oversized cluster-config files can be accepted as truncated YAML.

io.LimitReader(..., MaxClusterConfigBytes+1) only limits what ParseClusterConfig can read; it never proves the file was within the cap. If the first MaxClusterConfigBytes+1 bytes are still valid YAML, this path can parse a truncated prefix and emit a partial NetworkTopology snapshot. Reject the file once you know it exceeded MaxClusterConfigBytes instead of relying on parse failure.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/collector/network/network.go` around lines 113 - 122, The cluster-config
read path in network.go still accepts oversized files if the first bytes form
valid YAML, because ParseClusterConfig is fed through io.LimitReader without
proving the file stayed within MaxClusterConfigBytes. Update the network config
loading flow around ParseClusterConfig so it explicitly detects when the
underlying file exceeds defaults.MaxClusterConfigBytes and returns an
invalid-request error before parsing or accepting the result, rather than
relying on a parse failure from a truncated prefix.
pkg/snapshotter/agent.go (1)

147-163: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Reject --cluster-config in Job mode until the file is mounted into the pod.

deployAndWaitForResult forwards the caller’s host path straight into agent.Config, and the agent Job only republishes it as AICR_CLUSTER_CONFIG_PATH. Nothing in this path mounts the file into the pod, so a normal aicr snapshot --cluster-config /host/path points the in-pod collector at a file that does not exist and the network measurement fails.

Minimal safe guard
 func deployAndWaitForResult(ctx context.Context, clientset k8sclient.Interface, config *AgentConfig, agentOutput string) ([]byte, error) {
+	if config.ClusterConfigPath != "" {
+		return nil, errors.New(
+			errors.ErrCodeInvalidRequest,
+			"--cluster-config is not supported in agent Job mode until the file is mounted into the pod; use --discover-network or AICR_AGENT_MODE=true",
+		)
+	}
+
 	// Auto-inject GPU node selector when no placement constraints are set.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/snapshotter/agent.go` around lines 147 - 163, `deployAndWaitForResult` is
passing a host-side `--cluster-config` path into
`agent.Config.ClusterConfigPath` without ensuring it exists in the Job pod, so
reject or ignore `ClusterConfigPath` in Job mode until the file is actually
mounted. Update the job setup in `deployAndWaitForResult` and any related
`agent.Config` handling to validate this case early and return a clear error
before creating the agent Job, using the existing `agent.Config` and
`AICR_CLUSTER_CONFIG_PATH` flow as the place to enforce the safeguard.
pkg/cli/snapshot.go (1)

543-557: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Let explicit --kubeconfig override inherited KUBECONFIG.

In the AICR_AGENT_MODE=true path you clear ns.Factory and rebuild collector options from env. If the shell already exports KUBECONFIG for cluster A but the user passes --kubeconfig for cluster B, this guard preserves cluster A for l8k while the rest of the CLI is configured for cluster B. With --discover-network, that can label or patch the wrong cluster.

Use the explicit kubeconfig whenever one is set
-				if opts.kubeconfig != "" && os.Getenv("KUBECONFIG") == "" {
-					_ = os.Setenv("KUBECONFIG", opts.kubeconfig)
+				if opts.kubeconfig != "" {
+					if setErr := os.Setenv("KUBECONFIG", opts.kubeconfig); setErr != nil {
+						return errors.Wrap(errors.ErrCodeInvalidRequest, "failed to set KUBECONFIG", setErr)
+					}
 				}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/cli/snapshot.go` around lines 543 - 557, The AICR_AGENT_MODE kubeconfig
seeding logic in snapshot.go should let the explicit opts.kubeconfig override
any inherited KUBECONFIG so the collector and CLI use the same cluster context.
Update the env setup around the AICR_AGENT_MODE block to always set KUBECONFIG
from opts.kubeconfig when it is non-empty, instead of skipping when KUBECONFIG
is already present; keep the existing handling for opts.clusterConfigPath and
opts.discoverNetwork intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/collector/network/translate.go`:
- Around line 142-144: The NetworkTopology translation is skipping
identity.context.linkType when g.LinkType is empty, which changes the expected
API shape. Update the logic in the translate.go path around the WithContext call
so the field is written unconditionally in the NetworkTopology builder, using
the empty string as the unknown sentinel instead of omitting the key entirely.

In `@pkg/k8s/agent/deployer_test.go`:
- Around line 168-224: The current test only verifies first-time creation of the
ClusterRole, but it misses the upgrade/mutation path for existing resources. In
deployer_test.go, extend the ensureClusterRole coverage around NewDeployer and
ensureClusterRole by adding a case that pre-creates the base aicr-node-reader
ClusterRole, then reruns ensureClusterRole with DiscoverNetwork enabled and
asserts the existing role gains the discovery rules. Keep the existing
assertions for the added rules, but make sure the new subtest specifically
exercises the update path on an already-present ClusterRole.

In `@pkg/k8s/agent/rbac.go`:
- Around line 290-337: The discoverNetworkClusterRules cluster-role set is
over-broad because the bootstrap namespace resources are still granted via
cluster-wide RBAC. Update discoverNetworkClusterRules to keep only truly
cluster-scoped permissions, and move the bootstrap namespace resources handled
in the discovery flow (daemonsets, serviceaccounts, configmaps, roles,
rolebindings) into a namespace-scoped Role and RoleBinding for
nvidia-k8s-launch-kit so the permissions are limited to that namespace.

---

Outside diff comments:
In `@pkg/k8s/agent/job.go`:
- Around line 109-139: The Job pod spec in buildPodSpec does not make
AICR_CLUSTER_CONFIG_PATH available inside the pod, so agent-mode jobs will try
to open a host path that does not exist in the container. Update the
Deployer/buildPodSpec flow to either mount or stage the cluster config file into
the pod (for example through a volume or ConfigMap) so the path is valid, or
explicitly reject ClusterConfigPath for agent deployments before constructing
the PodSpec. Ensure the fix is applied where buildEnvVars is consumed by
buildPodSpec so the NetworkCollector can safely decide between ClusterConfigPath
and DiscoverNetwork.

In `@pkg/k8s/agent/rbac.go`:
- Around line 186-205: The ClusterRole reconciliation in createClusterRole
currently retries an Update after AlreadyExists using a freshly constructed
rbacv1.ClusterRole, but that object has no resourceVersion so the apiserver will
reject it. Fix the AlreadyExists path by first fetching the existing ClusterRole
and copying its resourceVersion into the object before calling
d.clientset.RbacV1().ClusterRoles().Update, or switch this flow to
patch/server-side apply; keep the change localized around createClusterRole and
the ClusterRoles().Create/Update sequence.

---

Duplicate comments:
In `@pkg/cli/snapshot.go`:
- Around line 543-557: The AICR_AGENT_MODE kubeconfig seeding logic in
snapshot.go should let the explicit opts.kubeconfig override any inherited
KUBECONFIG so the collector and CLI use the same cluster context. Update the env
setup around the AICR_AGENT_MODE block to always set KUBECONFIG from
opts.kubeconfig when it is non-empty, instead of skipping when KUBECONFIG is
already present; keep the existing handling for opts.clusterConfigPath and
opts.discoverNetwork intact.

In `@pkg/collector/network/network.go`:
- Around line 113-122: The cluster-config read path in network.go still accepts
oversized files if the first bytes form valid YAML, because ParseClusterConfig
is fed through io.LimitReader without proving the file stayed within
MaxClusterConfigBytes. Update the network config loading flow around
ParseClusterConfig so it explicitly detects when the underlying file exceeds
defaults.MaxClusterConfigBytes and returns an invalid-request error before
parsing or accepting the result, rather than relying on a parse failure from a
truncated prefix.

In `@pkg/k8s/client/client.go`:
- Around line 62-84: BuildKubeClient is treating the resolved KUBECONFIG value
as a single explicit file, which breaks merged kubeconfig lists. Update the
config निर्माण path so BuildConfigFromFlags is not used with a raw KUBECONFIG
list from ResolveKubeconfigPath; instead, use client-go default loading rules
when the source is KUBECONFIG or split/merge the list before creating the
config. Keep the existing ResolveKubeconfigPath behavior for explicit paths, and
adjust the BuildKubeClient flow accordingly wherever it consumes the kubeconfig
string.

In `@pkg/snapshotter/agent.go`:
- Around line 147-163: `deployAndWaitForResult` is passing a host-side
`--cluster-config` path into `agent.Config.ClusterConfigPath` without ensuring
it exists in the Job pod, so reject or ignore `ClusterConfigPath` in Job mode
until the file is actually mounted. Update the job setup in
`deployAndWaitForResult` and any related `agent.Config` handling to validate
this case early and return a clear error before creating the agent Job, using
the existing `agent.Config` and `AICR_CLUSTER_CONFIG_PATH` flow as the place to
enforce the safeguard.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: c58932a3-6167-450a-a930-69d6b1f06bb3

📥 Commits

Reviewing files that changed from the base of the PR and between 3ba60d8 and a24c845.

⛔ Files ignored due to path filters (89)
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/Mellanox/doca-driver-build/entrypoint/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/doca-driver-build/entrypoint/pkg/mofedmodules/mofedmodules.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/doc.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/groupversion_info.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/hostdevicenetwork_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/ipoibnetwork_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/macvlannetwork_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/nic_policy_cr.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicclusterpolicy_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicnodepolicy_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/utils.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/pkg/consts/consts.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/doc.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/groupversion_info.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicconfigurationtemplate_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicdevice_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaresource_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaretemplate_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicinterfacenametemplate.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-operator-libs/LICENSE is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/upgrade_spec.go is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/LICENSE is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/comments.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/config.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/default-config.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/exec.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/kubeclient.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/discover.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/gputopology.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/library.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/ns-product-ids is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/nvidia.ids is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/pciids.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pfutil/pfutil.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicconfigurationtemplates.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicdevices.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaresources.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaretemplates.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicinterfacenametemplates.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/daemon.yaml.tmpl is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/embed.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/bootstrap.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/teardown.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presetmatch/presetmatch.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-R760xa-H100-NVL/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE7745-RTX-PRO-4500/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE9680-H200/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR650-V4-RTX-PRO-6000/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-H200-NVL/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-RTX-PRO-6000/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-System-Board-RTX-PRO-6000/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR680a-V3-H200/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/UCSC-885A-M8-H22-H200/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/download.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/presets.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/synthesize.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/validate.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/context.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/json_output.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/progress.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/ui.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/.travis.yml is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/LICENSE is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/LICENSE.libyaml is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/NOTICE is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/README.md is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/apic.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/decode.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/emitterc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/encode.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/parserc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/readerc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/resolve.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/scannerc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/sorter.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/writerc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/yaml.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/yamlh.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/yamlprivateh.go is excluded by !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
  • vendor/sigs.k8s.io/controller-runtime/pkg/scheme/scheme.go is excluded by !vendor/**
📒 Files selected for processing (18)
  • docs/integrator/measurement-api.md
  • go.mod
  • pkg/cli/snapshot.go
  • pkg/collector/factory.go
  • pkg/collector/network/network.go
  • pkg/collector/network/network_test.go
  • pkg/collector/network/translate.go
  • pkg/collector/network/translate_test.go
  • pkg/defaults/timeouts.go
  • pkg/k8s/agent/consts.go
  • pkg/k8s/agent/deployer_test.go
  • pkg/k8s/agent/job.go
  • pkg/k8s/agent/rbac.go
  • pkg/k8s/agent/types.go
  • pkg/k8s/client/client.go
  • pkg/snapshotter/agent.go
  • pkg/snapshotter/snapshot.go
  • pkg/snapshotter/snapshot_test.go

Comment thread pkg/collector/network/translate.go Outdated
Comment thread pkg/k8s/agent/deployer_test.go
Comment thread pkg/k8s/agent/rbac.go
@almaslennikov almaslennikov force-pushed the network-collector-v2 branch from a24c845 to df3d81d Compare June 25, 2026 14:13
@almaslennikov

Copy link
Copy Markdown
Contributor Author

Addressing CodeRabbit feedback in df3d81d6 (amend of a24c8453, force-pushed with --force-with-lease).

Fixed (this revision):

# Path Status
1 pkg/cli/snapshot.go:557 ✅ Explicit --kubeconfig now wins over inherited KUBECONFIG — dropped the == "" guard. Cluster-mutating discovery should never silently target a stale env-pointed cluster when the user explicitly specified --kubeconfig.
3 pkg/collector/network/network.go:109 ✅ Discriminate fs.ErrNotExist (→ ErrCodeNotFound) from fs.ErrPermission (→ ErrCodeUnauthorized) from everything else (→ ErrCodeInternal). Applied to both os.Stat and os.Open.
4 pkg/collector/network/network.go:122 os.Stat now runs first and rejects oversize input with ErrCodeInvalidRequest + size/limit context. The io.LimitReader stays as a defense-in-depth bound on parser memory.
5 pkg/collector/network/network.go:163 context.DeadlineExceeded / context.Canceled from Discover are now surfaced as ErrCodeTimeout. Only non-context errors get ErrCodeUnavailable. The 10-min CollectorNetworkTimeout now reads as a timeout, not a cluster-availability problem.
7 pkg/k8s/client/client.go:85 ResolveKubeconfigPath returns "" when KUBECONFIG env contains a path separator (: / ;) rather than handing a multi-file value to BuildConfigFromFlags("", v) (which treats v as one explicit path and breaks the merge). Empty path lets clientcmd's default loading rules honor KUBECONFIG natively.
8 pkg/snapshotter/agent.go:163 deployAndWaitForResult now rejects ClusterConfigPath != "" up front with a clear ErrCodeInvalidRequest message pointing users at --discover-network (Job mode) or AICR_AGENT_MODE=true (local mode). ConfigMap-backed mounting is tracked as a follow-up — the previous behavior silently forwarded an unreadable host path.
10 pkg/collector/network/translate.go:144 identity.context.linkType is now emitted unconditionally — empty string is the documented "unknown fabric" sentinel per docs/integrator/measurement-api.md.

Deferred to follow-up (not blocking this PR):

# Path Reason
2 pkg/collector/network/network_test.go:163 Table-driven refactor is stylistic. Current per-case tests use distinct error-code assertions and errors.As matchers per case; collapsing them into one table risks losing the per-case context strings that make failures easy to triage. Open to doing it in a follow-up if the team prefers.
6 pkg/collector/network/translate_test.go:73 Optional PF context coverage (model / connectedGPU / gpuProximity). Worth adding but a test gap, not a correctness regression. Tracked.
9 pkg/snapshotter/snapshot_test.go:131 Mock shape needs an "inactive" override to exercise collectSafe's nil-skip branch. Same — test gap, not correctness. Tracked.
11 pkg/k8s/agent/deployer_test.go:224 ClusterRole upgrade-path coverage. ensureClusterRole already handles the AlreadyExistsUpdate path; the production code uses it on every redeploy. Will add the explicit test case in a follow-up.
12 pkg/k8s/agent/rbac.go:337 Scoping the bootstrap-namespace resources (daemonsets, serviceaccounts, configmaps, roles, rolebindings) into a namespace-scoped Role/RoleBinding for nvidia-k8s-launch-kit is the right long-term shape. It's a non-trivial restructure: aicr would need to pre-create the namespace + namespaced Role before the snapshot Job runs (today l8k creates the namespace during discovery). Worth a dedicated PR. The current ClusterRole still beats cluster-admin and is gated behind --discover-network.

CI: build / lint / test green locally. Re-pushed, awaiting fresh CI cycle.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

♻️ Duplicate comments (1)
pkg/collector/network/network.go (1)

112-157: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

The oversize guard is still racy after Stat().

Stat() only validates the file state you looked at before opening it. If the path is replaced or grows between Stat() and ParseClusterConfig, this code still parses a truncated LimitReader, and a valid prefix can be accepted as a smaller config. That can emit a partial NetworkTopology snapshot from input that should have been rejected.

A safer pattern here is to read at most MaxClusterConfigBytes+1 bytes into a buffer first, reject len(buf) > MaxClusterConfigBytes, and then parse from that same stable buffer.

One localized fix
+import (
+	"bytes"
+	...
+)
...
-	limited := io.LimitReader(f, defaults.MaxClusterConfigBytes+1)
-
-	cfg, err := l8kdisc.ParseClusterConfig(limited)
+	buf, err := io.ReadAll(io.LimitReader(f, defaults.MaxClusterConfigBytes+1))
+	if err != nil {
+		return nil, errors.WrapWithContext(errors.ErrCodeInternal,
+			"failed to read cluster-config file", err,
+			map[string]any{"path": c.ClusterConfigPath})
+	}
+	if int64(len(buf)) > defaults.MaxClusterConfigBytes {
+		return nil, errors.NewWithContext(errors.ErrCodeInvalidRequest,
+			"cluster-config file exceeds size limit",
+			map[string]any{
+				"path":  c.ClusterConfigPath,
+				"size":  len(buf),
+				"limit": defaults.MaxClusterConfigBytes,
+			})
+	}
+
+	cfg, err := l8kdisc.ParseClusterConfig(bytes.NewReader(buf))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/collector/network/network.go` around lines 112 - 157, The size check in
the cluster-config loading path is still race-prone because `Stat()` and
`ParseClusterConfig` read from different file states. Update the
`NetworkTopology`/cluster-config read flow to use one stable read source: open
the file once in the loader, read up to `defaults.MaxClusterConfigBytes+1` into
a buffer, reject when the buffer exceeds the limit, and then pass that buffer to
`l8kdisc.ParseClusterConfig` instead of parsing directly from `io.LimitReader`.
Keep the existing error mapping and context around `stdos.Open`,
`errors.WrapWithContext`, and `defaults.MaxClusterConfigBytes`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/cli/snapshot.go`:
- Around line 398-403: The `cluster-config` flag help text currently implies it
works in the normal `aicr snapshot` Job flow, but `pkg/snapshotter/agent.go`
rejects it outside the local-agent path. Update the `&cli.StringFlag` definition
for `cluster-config` in `snapshot.go` so the usage clearly says this flag is
only supported for local-agent mode in this iteration, while keeping the
existing network-topology and precedence wording intact.
- Around line 543-559: The env seeding in snapshot.go only writes the
true/non-empty cases, so inherited AICR_CLUSTER_CONFIG_PATH and
AICR_DISCOVER_NETWORK values can leak through after ns.Measure() reruns with
ns.Factory = nil. Update the AICR_AGENT_MODE block in snapshot setup to mirror
the resolved CLI state back into the environment unconditionally by setting or
clearing those vars based on opts.clusterConfigPath and opts.discoverNetwork,
alongside the existing KUBECONFIG handling.

In `@pkg/collector/network/network_test.go`:
- Around line 83-163: Add a file-mode test that exercises the new size limit in
Collector.Collect by creating a temp config larger than
defaults.MaxClusterConfigBytes and calling Collect with ClusterConfigPath set.
Assert it returns a structured error with errors.ErrCodeInvalidRequest, using
the existing file-mode test pattern in TestCollect_FileMode_* to keep the
behavior locked down.

In `@pkg/collector/network/network.go`:
- Around line 104-106: The repeated "path" key used in the network collector
logging is triggering goconst, so deduplicate it by hoisting the key into a
shared constant and reuse it in the slog.String calls across the affected
logging sites in the network collector. Update the log statements in the
collector flow so the constant is referenced consistently from the relevant
functions in this package, then run golangci-lint on the affected package paths
to verify the lint passes.

In `@pkg/collector/network/translate_test.go`:
- Around line 36-40: Add a test case in translate_test.go for the
unknown-linkType path so identity.context.linkType is asserted as present with
an empty string, not omitted. Extend the existing table-driven tests around the
translate/selector generation logic to include a case where LinkType is empty,
and verify the resulting NodeSelector (or equivalent output from the relevant
translate helper) still contains the identity.context.linkType key with ""
alongside the existing symbols such as Identifier, MachineType, and GPUType.

In `@pkg/k8s/agent/deployer_test.go`:
- Around line 181-223: The RBAC checks in the test are using three separate
boolean scans; refactor this block in deployer_test.go into a table-driven test
that iterates over expected {apiGroup, resource, verb} cases. Update the
existing rule validation around cr.Rules to compare each case against the
discovered rules and report missing permissions per entry, while keeping the
same coverage for mellanox.com/nicclusterpolicies patch, pods/exec create, and
nodes patch.

In `@pkg/k8s/client/client.go`:
- Around line 89-99: ResolveKubeconfigPath currently returns an empty path for
multi-file KUBECONFIG values, which causes BuildKubeClient to fall back to
in-cluster config instead of honoring clientcmd merging. Update BuildKubeClient
to use clientcmd.NewDefaultClientConfigLoadingRules() for the multi-file case,
or otherwise defer loading so the raw KUBECONFIG precedence is preserved; keep
ResolveKubeconfigPath behavior aligned with that flow.

---

Duplicate comments:
In `@pkg/collector/network/network.go`:
- Around line 112-157: The size check in the cluster-config loading path is
still race-prone because `Stat()` and `ParseClusterConfig` read from different
file states. Update the `NetworkTopology`/cluster-config read flow to use one
stable read source: open the file once in the loader, read up to
`defaults.MaxClusterConfigBytes+1` into a buffer, reject when the buffer exceeds
the limit, and then pass that buffer to `l8kdisc.ParseClusterConfig` instead of
parsing directly from `io.LimitReader`. Keep the existing error mapping and
context around `stdos.Open`, `errors.WrapWithContext`, and
`defaults.MaxClusterConfigBytes`.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 11cc1d5b-7d97-4600-ac6e-d98939ef84ad

📥 Commits

Reviewing files that changed from the base of the PR and between a24c845 and df3d81d.

⛔ Files ignored due to path filters (89)
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/Mellanox/doca-driver-build/entrypoint/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/doca-driver-build/entrypoint/pkg/mofedmodules/mofedmodules.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/doc.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/groupversion_info.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/hostdevicenetwork_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/ipoibnetwork_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/macvlannetwork_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/nic_policy_cr.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicclusterpolicy_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicnodepolicy_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/utils.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/pkg/consts/consts.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/doc.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/groupversion_info.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicconfigurationtemplate_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicdevice_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaresource_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaretemplate_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicinterfacenametemplate.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-operator-libs/LICENSE is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/upgrade_spec.go is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/LICENSE is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/comments.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/config.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/default-config.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/exec.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/kubeclient.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/discover.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/gputopology.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/library.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/ns-product-ids is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/nvidia.ids is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/pciids.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pfutil/pfutil.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicconfigurationtemplates.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicdevices.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaresources.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaretemplates.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicinterfacenametemplates.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/daemon.yaml.tmpl is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/embed.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/bootstrap.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/teardown.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presetmatch/presetmatch.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-R760xa-H100-NVL/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE7745-RTX-PRO-4500/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE9680-H200/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR650-V4-RTX-PRO-6000/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-H200-NVL/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-RTX-PRO-6000/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-System-Board-RTX-PRO-6000/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR680a-V3-H200/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/UCSC-885A-M8-H22-H200/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/download.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/presets.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/synthesize.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/validate.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/context.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/json_output.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/progress.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/ui.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/.travis.yml is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/LICENSE is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/LICENSE.libyaml is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/NOTICE is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/README.md is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/apic.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/decode.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/emitterc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/encode.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/parserc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/readerc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/resolve.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/scannerc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/sorter.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/writerc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/yaml.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/yamlh.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/yamlprivateh.go is excluded by !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
  • vendor/sigs.k8s.io/controller-runtime/pkg/scheme/scheme.go is excluded by !vendor/**
📒 Files selected for processing (18)
  • docs/integrator/measurement-api.md
  • go.mod
  • pkg/cli/snapshot.go
  • pkg/collector/factory.go
  • pkg/collector/network/network.go
  • pkg/collector/network/network_test.go
  • pkg/collector/network/translate.go
  • pkg/collector/network/translate_test.go
  • pkg/defaults/timeouts.go
  • pkg/k8s/agent/consts.go
  • pkg/k8s/agent/deployer_test.go
  • pkg/k8s/agent/job.go
  • pkg/k8s/agent/rbac.go
  • pkg/k8s/agent/types.go
  • pkg/k8s/client/client.go
  • pkg/snapshotter/agent.go
  • pkg/snapshotter/snapshot.go
  • pkg/snapshotter/snapshot_test.go

Comment thread pkg/cli/snapshot.go
Comment thread pkg/cli/snapshot.go Outdated
Comment thread pkg/collector/network/network_test.go
Comment thread pkg/collector/network/network.go
Comment thread pkg/collector/network/translate_test.go
Comment on lines +181 to +223
hasNCPPatch := false
hasPodsExec := false
hasNodesPatch := false
for _, r := range cr.Rules {
for _, g := range r.APIGroups {
if g == "mellanox.com" {
for _, res := range r.Resources {
if res == "nicclusterpolicies" {
for _, v := range r.Verbs {
if v == "patch" {
hasNCPPatch = true
}
}
}
}
}
}
for _, res := range r.Resources {
if res == "pods/exec" {
for _, v := range r.Verbs {
if v == "create" {
hasPodsExec = true
}
}
}
if res == "nodes" {
for _, v := range r.Verbs {
if v == "patch" {
hasNodesPatch = true
}
}
}
}
}
if !hasNCPPatch {
t.Error("expected mellanox.com/nicclusterpolicies patch rule")
}
if !hasPodsExec {
t.Error("expected pods/exec create rule")
}
if !hasNodesPatch {
t.Error("expected nodes/patch rule")
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Make the RBAC expectations table-driven.

The three expected discovery rules are independent cases; a table of {apiGroup, resource, verb} entries would match the project’s Go test guidance and make missing-rule additions easier to review. As per coding guidelines, “Write table-driven tests for multiple test cases.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/k8s/agent/deployer_test.go` around lines 181 - 223, The RBAC checks in
the test are using three separate boolean scans; refactor this block in
deployer_test.go into a table-driven test that iterates over expected {apiGroup,
resource, verb} cases. Update the existing rule validation around cr.Rules to
compare each case against the discovered rules and report missing permissions
per entry, while keeping the same coverage for mellanox.com/nicclusterpolicies
patch, pods/exec create, and nodes patch.

Source: Coding guidelines

Comment thread pkg/k8s/client/client.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
pkg/cli/snapshot.go (1)

398-403: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Help text still implies --cluster-config works in the default Job flow.

pkg/snapshotter/agent.go rejects --cluster-config in Job mode (returns ErrCodeInvalidRequest); it is supported only in local agent mode (AICR_AGENT_MODE=true) this iteration. Call that limitation out in the usage string so the CLI contract matches runtime behavior.

Suggested wording
 		&cli.StringFlag{
 			Name:     "cluster-config",
-			Usage:    "Path to a pre-existing k8s-launch-kit (l8k) cluster-config.yaml. Ingests the file's network topology into the snapshot as a NetworkTopology Measurement. Mutually exclusive with --discover-network at the collector level — file path wins when both are set.",
+			Usage:    "Path to a pre-existing k8s-launch-kit (l8k) cluster-config.yaml. Ingests the file's network topology into the snapshot as a NetworkTopology Measurement. Local agent mode only for now (AICR_AGENT_MODE=true); the default Job flow rejects it because host files are not mounted into the pod. Mutually exclusive with --discover-network at the collector level — file path wins when both are set.",
 			Sources:  cli.EnvVars("AICR_CLUSTER_CONFIG_PATH"),
 			Category: catAgentDeployment,
 		},
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/cli/snapshot.go` around lines 398 - 403, The `cluster-config` flag help
text in `snapshotFlags` still suggests it is generally available, but `agent.go`
only accepts it in local agent mode and rejects it in Job mode via
`ErrCodeInvalidRequest`. Update the `Usage` string for the
`&cli.StringFlag{Name: "cluster-config"}` entry to explicitly say it is
supported only when `AICR_AGENT_MODE=true` and not in the default Job flow, so
the CLI description matches the behavior enforced by the agent validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/k8s/agent/types.go`:
- Around line 51-56: Update the ClusterConfigPath comment in types.go to remove
the incorrect “local-mode” reference tied to AICR_AGENT_MODE=true; the Job-based
in-pod agent uses that env var, not a local-mode file-input workflow. Reword the
note so it points users to the correct usage for file-based input with
ClusterConfigPath and the relevant Job/agent path, keeping the guidance aligned
with the ClusterConfigPath field and AICR_AGENT_MODE mention.

---

Duplicate comments:
In `@pkg/cli/snapshot.go`:
- Around line 398-403: The `cluster-config` flag help text in `snapshotFlags`
still suggests it is generally available, but `agent.go` only accepts it in
local agent mode and rejects it in Job mode via `ErrCodeInvalidRequest`. Update
the `Usage` string for the `&cli.StringFlag{Name: "cluster-config"}` entry to
explicitly say it is supported only when `AICR_AGENT_MODE=true` and not in the
default Job flow, so the CLI description matches the behavior enforced by the
agent validation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 1c711efd-0fbb-4356-81f1-7151d26a0ec5

📥 Commits

Reviewing files that changed from the base of the PR and between df3d81d and 475e337.

⛔ Files ignored due to path filters (89)
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/Mellanox/doca-driver-build/entrypoint/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/doca-driver-build/entrypoint/pkg/mofedmodules/mofedmodules.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/doc.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/groupversion_info.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/hostdevicenetwork_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/ipoibnetwork_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/macvlannetwork_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/nic_policy_cr.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicclusterpolicy_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicnodepolicy_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/utils.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/pkg/consts/consts.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/doc.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/groupversion_info.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicconfigurationtemplate_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicdevice_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaresource_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaretemplate_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicinterfacenametemplate.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-operator-libs/LICENSE is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/upgrade_spec.go is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/LICENSE is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/comments.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/config.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/default-config.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/exec.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/kubeclient.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/discover.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/gputopology.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/library.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/ns-product-ids is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/nvidia.ids is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/pciids.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pfutil/pfutil.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicconfigurationtemplates.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicdevices.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaresources.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaretemplates.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicinterfacenametemplates.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/daemon.yaml.tmpl is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/embed.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/bootstrap.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/teardown.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presetmatch/presetmatch.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-R760xa-H100-NVL/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE7745-RTX-PRO-4500/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE9680-H200/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR650-V4-RTX-PRO-6000/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-H200-NVL/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-RTX-PRO-6000/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-System-Board-RTX-PRO-6000/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR680a-V3-H200/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/UCSC-885A-M8-H22-H200/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/download.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/presets.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/synthesize.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/validate.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/context.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/json_output.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/progress.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/ui.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/.travis.yml is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/LICENSE is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/LICENSE.libyaml is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/NOTICE is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/README.md is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/apic.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/decode.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/emitterc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/encode.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/parserc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/readerc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/resolve.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/scannerc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/sorter.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/writerc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/yaml.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/yamlh.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/yamlprivateh.go is excluded by !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
  • vendor/sigs.k8s.io/controller-runtime/pkg/scheme/scheme.go is excluded by !vendor/**
📒 Files selected for processing (18)
  • docs/integrator/measurement-api.md
  • go.mod
  • pkg/cli/snapshot.go
  • pkg/collector/factory.go
  • pkg/collector/network/network.go
  • pkg/collector/network/network_test.go
  • pkg/collector/network/translate.go
  • pkg/collector/network/translate_test.go
  • pkg/defaults/timeouts.go
  • pkg/k8s/agent/consts.go
  • pkg/k8s/agent/deployer_test.go
  • pkg/k8s/agent/job.go
  • pkg/k8s/agent/rbac.go
  • pkg/k8s/agent/types.go
  • pkg/k8s/client/client.go
  • pkg/snapshotter/agent.go
  • pkg/snapshotter/snapshot.go
  • pkg/snapshotter/snapshot_test.go

Comment thread pkg/k8s/agent/types.go

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid, well-engineered PR — the new network collector follows existing collector patterns closely, uses pkg/errors codes + context timeouts throughout, bounds the file read (os.Stat + LimitReader), fails closed on parse/garbage-env, and has a clean inactive-collector (nil, nil) contract that the snapshotter now honors. Test coverage is good (inactive/file-success/missing/invalid/multi-group/file-wins + the RBAC discovery subtest), and the MPL-2.0 blocker from #1472 is genuinely resolved via the Helm-free l8k discovery sub-package. CI is green (Lint, Test, Security Scan, verify-licenses, CLI E2E, GPU Smoke).

A few non-blocking notes inline:

  • Docs (medium): the two new CLI flags aren't added to the aicr snapshot table in docs/user/cli-reference.md.
  • Comment accuracy (low): the multi-file KUBECONFIG rationale in ResolveKubeconfigPath overstates the outcome for the BuildKubeClient caller (it goes to in-cluster config, not clientcmd loading rules).
  • Operator awareness (nit): the discovery ClusterRole's pods/exec + roles/rolebindings grants are the broadest in the role — opt-in and well-documented, just flagging.

Nothing blocks merge. The doc table update is the one I'd ask for before merge; the other two are optional.

Comment thread pkg/cli/snapshot.go
Comment thread pkg/k8s/client/client.go Outdated
Comment thread pkg/k8s/agent/rbac.go
@almaslennikov almaslennikov force-pushed the network-collector-v2 branch from 475e337 to 53c103e Compare June 25, 2026 15:17
@almaslennikov

Copy link
Copy Markdown
Contributor Author

Addressing the new review (mchmarny + CodeRabbit re-review of 475e337d) in 53c103ee.

@mchmarny:

  1. docs/user/cli-reference.md rows for --cluster-config / --discover-network — ✅ added in this revision. Both flags now appear in the aicr snapshot flag table with their full Usage text, mutual-exclusion behavior, and the local-mode-vs-Job-mode constraints called out.
  2. pkg/k8s/client/client.go doc-comment scoping — ✅ tightened. The comment now explicitly scopes the "clientcmd loading rules merge multi-file KUBECONFIG" claim to clientcmd-loading-rules callers (l8k's kubeclient.New(""), anything using NewNonInteractiveDeferredLoadingClientConfig), and explicitly notes that this package's own BuildKubeClient falls through to InClusterConfig() on empty path — so a multi-file KUBECONFIG run locally through BuildKubeClient is not supported (and wasn't before this change either).
  3. pkg/k8s/agent/rbac.go operator-awareness on pods/exec + namespaces+roles+rolebindings — agreed, non-blocking. Will tighten the scope when l8k confirms its bootstrap namespace is stable (it has been nvidia-k8s-launch-kit for several releases now); right now the rules are intentionally cluster-scoped because aicr can't pre-create the bootstrap namespace before l8k discovery runs. Tracking as a follow-up.

CodeRabbit new findings on 475e337d:

Path Status
pkg/cli/snapshot.go:403 (Minor) — Usage text didn't mention --cluster-config is local-only ✅ Usage now reads "Local agent mode only (AICR_AGENT_MODE=true) — Job mode rejects this flag with INVALID_REQUEST..."
pkg/cli/snapshot.go:559 (Major) — env not unset on negative case → could inherit stale AICR_DISCOVER_NETWORK ✅ Both AICR_CLUSTER_CONFIG_PATH and AICR_DISCOVER_NETWORK are now os.Setenv on positive, os.Unsetenv on negative, so the Measure rebuild sees exactly the resolved CLI state
pkg/k8s/agent/types.go:56 (Minor) — AgentConfig.ClusterConfigPath doc-comment confused local-mode with Job-mode ✅ Comment rewritten — now explains Job-mode rejects this with ErrCodeInvalidRequest, file ingestion is local-mode-only
pkg/collector/network/network_test.go:163 (Trivial) — no test for the MaxClusterConfigBytes rejection ✅ Added TestCollect_FileMode_OversizeRejected — writes a defaults.MaxClusterConfigBytes+1 file and asserts ErrCodeInvalidRequest
pkg/collector/network/translate_test.go:40 (Trivial) — no test for the empty-linkType sentinel ✅ Added TestToMeasurement_IdentitySubtype_UnknownLinkType — sets LinkType = "" on the fixture and asserts the key is present with empty value
pkg/k8s/client/client.go:99 (Major, same as mchmarny #2) — multi-file KUBECONFIG scope claim ✅ Same doc-comment fix above

CodeRabbit also re-flagged several items I already addressed in df3d81d6 (it appears to be re-running against the latest commit and reposting). For the record:

Path Status (already fixed)
pkg/cli/snapshot.go:559 #1 kubeconfig override ✅ fixed in df3d81d
pkg/collector/network/network.go:216 #5 ctx-cancel ✅ fixed in df3d81d
pkg/snapshotter/agent.go:178 #8 Job-mode reject ✅ fixed in df3d81d
pkg/collector/network/network.go:111 (goconst) ✅ fixed in 475e337 (ctxKeyPath const)

Still deferred to follow-up:

CI: build / lint / test green locally. Re-pushed, awaiting fresh CI cycle.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
pkg/collector/network/network.go (1)

112-167: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Make the read-side size check authoritative.

os.Stat only proves the file size at one instant. If the path is replaced or grows between Stat and Open, this branch can still parse a truncated io.LimitReader prefix and emit a partial NetworkTopology instead of rejecting oversize input.

Suggested fix
-	// Read +1 over the cap as a defense-in-depth guard: the os.Stat
-	// check above is the primary gate (it catches files that grew on
-	// disk between Stat and Open just as well), but pairing it with a
-	// LimitReader keeps parser memory bounded if a future change drops
-	// the Stat path.
-	limited := io.LimitReader(f, defaults.MaxClusterConfigBytes+1)
-
-	cfg, err := l8kdisc.ParseClusterConfig(limited)
+	data, err := io.ReadAll(io.LimitReader(f, defaults.MaxClusterConfigBytes+1))
+	if err != nil {
+		return nil, errors.WrapWithContext(errors.ErrCodeInternal,
+			"failed to read cluster-config file", err,
+			map[string]any{ctxKeyPath: c.ClusterConfigPath})
+	}
+	if int64(len(data)) > defaults.MaxClusterConfigBytes {
+		return nil, errors.NewWithContext(errors.ErrCodeInvalidRequest,
+			"cluster-config file exceeds size limit",
+			map[string]any{
+				ctxKeyPath: c.ClusterConfigPath,
+				"size":     len(data),
+				"limit":    defaults.MaxClusterConfigBytes,
+			})
+	}
+
+	cfg, err := l8kdisc.ParseClusterConfig(bytes.NewReader(data))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/collector/network/network.go` around lines 112 - 167, The read-side
validation in the network config loader is still not authoritative because
`os.Stat` can be stale between `Stat` and `Open`, allowing `ParseClusterConfig`
to consume a truncated prefix via `io.LimitReader`. Update the `cluster-config`
read path in `LoadNetworkTopology` so the actual read enforces the size limit
and rejects any oversize input discovered after opening, rather than relying on
the earlier `Stat` check alone.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/cli/snapshot.go`:
- Around line 543-572: Do not clear ns.Factory in AICR_AGENT_MODE unless you
also mirror every factory-driving option into env first. NodeSnapshotter.measure
is rebuilding from env, but only the network vars and KUBECONFIG are being
copied, so CLI-only settings like --os and --max-nodes-per-entry get lost. Fix
this by either preserving the pre-parsed factory on ns.Factory or setting the
missing env inputs before the rebuild path runs, using the same option sources
that snapshotter/snapshot.go expects.

In `@pkg/collector/network/network.go`:
- Around line 117-153: Reject non-regular cluster-config paths in the collector
before calling stdos.Open in CollectorNetwork.collectFromFile. After the
existing stat and size checks, verify that the file mode is a regular file and
return an invalid-request error with the path context if it is a FIFO, device,
or other special file. Keep the existing not-found/permission handling for Stat
and Open, but add the regular-file guard so Open is only reached for valid
files.

In `@pkg/k8s/client/client.go`:
- Around line 97-111: In the kubeconfig lookup helper in client.go, the
multi-file KUBECONFIG case is still falling through to the default
~/.kube/config path, which breaks clientcmd merge semantics. Update the
kubeconfig resolution logic around the existing env check and defaultPath
fallback so that when KUBECONFIG contains multiple paths (the containsAny case),
the helper returns immediately or otherwise signals that loading-rules should
handle it, and only fall back to ~/.kube/config when no kubeconfig env is
present. Keep the behavior aligned with callers like the network collector that
rely on this helper.

---

Duplicate comments:
In `@pkg/collector/network/network.go`:
- Around line 112-167: The read-side validation in the network config loader is
still not authoritative because `os.Stat` can be stale between `Stat` and
`Open`, allowing `ParseClusterConfig` to consume a truncated prefix via
`io.LimitReader`. Update the `cluster-config` read path in `LoadNetworkTopology`
so the actual read enforces the size limit and rejects any oversize input
discovered after opening, rather than relying on the earlier `Stat` check alone.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b3694a61-ec95-42cc-a4ca-c048761600bb

📥 Commits

Reviewing files that changed from the base of the PR and between 475e337 and 53c103e.

⛔ Files ignored due to path filters (89)
  • go.sum is excluded by !**/*.sum
  • vendor/github.com/Mellanox/doca-driver-build/entrypoint/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/doca-driver-build/entrypoint/pkg/mofedmodules/mofedmodules.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/doc.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/groupversion_info.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/hostdevicenetwork_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/ipoibnetwork_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/macvlannetwork_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/nic_policy_cr.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicclusterpolicy_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicnodepolicy_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/utils.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/api/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/network-operator/pkg/consts/consts.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/LICENSE is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/doc.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/groupversion_info.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicconfigurationtemplate_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicdevice_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaresource_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaretemplate_types.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicinterfacenametemplate.go is excluded by !vendor/**
  • vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-operator-libs/LICENSE is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/upgrade_spec.go is excluded by !vendor/**
  • vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/zz_generated.deepcopy.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/LICENSE is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/comments.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/config.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/config/default-config.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/exec.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/kubeclient.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/discover.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/gputopology.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/library.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/ns-product-ids is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/nvidia.ids is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/pciids.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pfutil/pfutil.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicconfigurationtemplates.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicdevices.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaresources.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaretemplates.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicinterfacenametemplates.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/daemon.yaml.tmpl is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/embed.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/bootstrap.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/teardown.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presetmatch/presetmatch.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-R760xa-H100-NVL/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE7745-RTX-PRO-4500/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE9680-H200/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR650-V4-RTX-PRO-6000/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-H200-NVL/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-RTX-PRO-6000/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-System-Board-RTX-PRO-6000/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR680a-V3-H200/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/UCSC-885A-M8-H22-H200/topology.yaml is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/download.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/presets.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/synthesize.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/validate.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/context.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/json_output.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/progress.go is excluded by !vendor/**
  • vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/ui.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/.travis.yml is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/LICENSE is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/LICENSE.libyaml is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/NOTICE is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/README.md is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/apic.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/decode.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/emitterc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/encode.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/parserc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/readerc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/resolve.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/scannerc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/sorter.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/writerc.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/yaml.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/yamlh.go is excluded by !vendor/**
  • vendor/gopkg.in/yaml.v2/yamlprivateh.go is excluded by !vendor/**
  • vendor/modules.txt is excluded by !vendor/**
  • vendor/sigs.k8s.io/controller-runtime/pkg/scheme/scheme.go is excluded by !vendor/**
📒 Files selected for processing (19)
  • docs/integrator/measurement-api.md
  • docs/user/cli-reference.md
  • go.mod
  • pkg/cli/snapshot.go
  • pkg/collector/factory.go
  • pkg/collector/network/network.go
  • pkg/collector/network/network_test.go
  • pkg/collector/network/translate.go
  • pkg/collector/network/translate_test.go
  • pkg/defaults/timeouts.go
  • pkg/k8s/agent/consts.go
  • pkg/k8s/agent/deployer_test.go
  • pkg/k8s/agent/job.go
  • pkg/k8s/agent/rbac.go
  • pkg/k8s/agent/types.go
  • pkg/k8s/client/client.go
  • pkg/snapshotter/agent.go
  • pkg/snapshotter/snapshot.go
  • pkg/snapshotter/snapshot_test.go

Comment thread pkg/cli/snapshot.go Outdated
Comment thread pkg/collector/network/network.go
Comment thread pkg/k8s/client/client.go
@almaslennikov almaslennikov force-pushed the network-collector-v2 branch from 53c103e to d091f52 Compare June 25, 2026 15:48
@almaslennikov

Copy link
Copy Markdown
Contributor Author

Addressing the 3 new CodeRabbit findings on 53c103ee in d091f52a (signed amend, force-pushed).

  • pkg/cli/snapshot.go:572 [Major] — agent-mode bypass dropped the pre-built factory and lost CLI-only options (--os, --max-nodes-per-entry) that have no env source. Fixed by removing ns.Factory = nil and the redundant env-mirroring: factory at line 504 already carries every CLI-resolved option (cluster-config-path, discover-network, kubeconfig, os, max-nodes-per-entry) via the With* options, and the flags' cli.EnvVars Sources already populated opts from Job-set env before this point, so reusing ns.Factory directly is correct for both the in-pod Job path and the dev-bypass path.
  • pkg/collector/network/network.go:153 [Major] — FIFO / device paths passed the size check and blocked on Open. Added an st.Mode().IsRegular() gate right after the Stat check, rejecting non-regular files with ErrCodeInvalidRequest + the mode in the error context.
  • pkg/k8s/client/client.go:111 [Major] — multi-file KUBECONFIG was falling through to ~/.kube/config. Fixed: multi-path env now returns "" immediately so loading-rules-aware callers (l8k's kubeclient.New(""), anything using NewNonInteractiveDeferredLoadingClientConfig) honor the env natively. A user who explicitly set KUBECONFIG=/a:/b no longer gets their home kubeconfig silently substituted for the merge they asked for.

Build / lint / test green locally. Re-pushed, awaiting fresh CI cycle.

# Why

aicr's snapshot model gained a NetworkTopology Measurement type in Phase 1
but nothing populated it. Phase 3 wires the producer: a new collector that
ingests k8s-launch-kit (l8k) cluster-config data — either from a file or
via live discovery — and translates it into the canonical NetworkTopology
shape documented in docs/integrator/measurement-api.md.

This is the consumer side of the cross-repo integration. l8k's library
API landed in two upstream PRs: NVIDIA#102 added the initial Discover/
ParseClusterConfig entry points; NVIDIA#104 split them into a helm-free
sub-package at pkg/networkoperatorplugin/discovery, dropping the
MPL-2.0 hashicorp/go-multierror dep (and ~950 transitive vendor files)
that aicr's verify-licenses gate was rejecting. aicr pins the merge
of NVIDIA#104 and imports the new sub-package.

# What changes

## New package pkg/collector/network

  - Collector type with two source modes, selected by struct fields:
    ClusterConfigPath (file source — no cluster contact) and
    DiscoverNetwork (live l8k discovery — writes node labels, patches
    NicClusterPolicy). When both are set, file path wins so callers can
    default --discover-network from a flag without inadvertent cluster
    contact. When neither is set, the collector is inactive — Collect
    returns (nil, nil) and the snapshotter's collectSafe wrapper
    (hardened in this commit) treats nil Measurement as a no-op.

  - Translator (translate.go) converts l8k.LaunchKitConfig into the
    contracted NetworkTopology Measurement: one Measurement with
    identity / capabilities / pfs (Items) / kernel-modules subtypes.
    Multi-group input is rejected with ErrCodeInvalidRequest and a
    groupCount field on the error context. Multi-group support is a
    planned future iteration; rejection guards the snapshot from
    silently dropping groups.

  - Per-PF Context carries model, connectedGPU, gpuProximity in
    addition to pciAddress, deviceID, psid, partNumber, rdmaDevice,
    networkInterface — every descriptive identifier l8k populates.
    Per-PF Data carries rail, numaNode, traffic (scalars).

  - Routes controller-runtime's internal logger through aicr's slog
    via l8kplugin.WithLogger(logr.FromSlogHandler(...)). Without this,
    callers hit a one-time "log.SetLogger(...) was never called"
    warning from controller-runtime and lose l8k's structured
    discovery output.

## CLI: aicr snapshot --cluster-config / --discover-network

  - --cluster-config <path> (env: AICR_CLUSTER_CONFIG_PATH) ingests
    a pre-existing l8k cluster-config.yaml.
  - --discover-network (env: AICR_DISCOVER_NETWORK) opts into live
    discovery. Discovery is NOT read-only.
  - Both flags off-by-default; an explicit opt-in is required because
    discovery is cluster-mutating and the file-source path needs a
    user-supplied YAML.
  - In AICR_AGENT_MODE=true (developer-bypass that runs collectors
    in-process), the CLI seeds KUBECONFIG from --kubeconfig before
    calling Measure so the network collector's l8k client resolves
    the same kubeconfig the rest of aicr uses.

## snapshotter.AgentConfig + agent.Config + Job env

  - ClusterConfigPath and DiscoverNetwork on snapshotter.AgentConfig
    forward into agent.Config, which renders AICR_CLUSTER_CONFIG_PATH
    and AICR_DISCOVER_NETWORK env vars on the agent Job.
  - In Job-mode, the in-pod CLI consumes those env vars via the
    flag's Sources binding (cli.EnvVars), so the same factory
    construction path serves both local and Job execution.
  - collectSafe in pkg/snapshotter/snapshot.go now skips nil
    Measurements — required for the network collector's
    inactive-mode contract; harmless to existing collectors that
    always return a Measurement.

## Job-mode RBAC for --discover-network

  - pkg/k8s/agent/rbac.go ensureClusterRole conditionally appends a
    discoverNetworkClusterRules() rule set when
    deployer.config.DiscoverNetwork is true. Non-network snapshots
    stay minimal-priv; discover-network snapshots gain the cluster-
    scoped grants l8k actually exercises:
      * apiextensions.k8s.io customresourcedefinitions (get/list/
        create/update/patch) — l8k installs the nic-configuration-
        operator CRDs if missing.
      * namespaces (get/create/delete) — the nvidia-k8s-launch-kit
        bootstrap namespace, created and deleted each run.
      * apps daemonsets (get/list/watch/create/delete) — the nic-
        configuration-daemon bootstrap DaemonSet.
      * serviceaccounts/configmaps (get/create/delete) and
        rbac.authorization.k8s.io roles/rolebindings (get/create/
        delete) — the daemon's supporting resources.
      * pods/exec create — probing daemon pods for VPD / link
        metadata.
      * nodes patch — writing the nvidia.kubernetes-launch-kit.
        machine and .gpu labels onto matched nodes.
      * configuration.net.nvidia.com nicdevices (get/list) — the
        per-PF CRs the daemon publishes.
      * mellanox.com nicclusterpolicies (get/patch) — server-side
        apply of the NicConfigurationOperator section onto the
        user's NicClusterPolicy.
  - Job-mode --discover-network is now usable without escalating the
    Job ServiceAccount to cluster-admin.

## Shared helper pkg/k8s/client.ResolveKubeconfigPath

  - Lifted the explicit-path → KUBECONFIG env → ~/.kube/config →
    in-cluster resolution chain out of BuildKubeClient into an
    exported helper. The network collector uses it before handing
    a path to l8k's kubeclient.New, so aicr's K8s client and the
    l8k discovery client resolve kubeconfigs identically.

## Factory wiring + defaults

  - pkg/collector/factory.go gains CreateNetworkCollector() and three
    Option functions (WithClusterConfigPath, WithDiscoverNetwork,
    WithKubeconfigPath). Existing patterns matched (struct-literal
    construction, no New on the collector itself).
  - pkg/defaults adds CollectorNetworkTimeout (10m) — capped above
    l8k's internal DaemonSet wait — and MaxClusterConfigBytes (1
    MiB) for the file-source io.LimitReader.

## Cross-repo dependency

  - go.mod requires github.com/nvidia/k8s-launch-kit at the merged
    PR NVIDIA#104 commit (pseudo-version v0.0.0-20260625130457-a99c9684627e).
    No local replace — pure upstream pin.
  - Import path migrated:
        l8kplugin "github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin"
    →   l8kdisc   "github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery"
    Plus l8kconfig "github.com/nvidia/k8s-launch-kit/pkg/config" for
    DefaultLaunchKitConfig (used as the cfg arg to Discover).
  - Discover signature change: the cfg is now a required positional
    argument (l8k NVIDIA#104 dropped WithBaseConfig). aicr passes
    DefaultLaunchKitConfig() — l8k's recommended Network Operator
    release line is baked into the embedded default (currently 26.4);
    aicr's snapshot path doesn't pin a release here, that decision
    lives in the recipe/deploy stage.
  - vendor/ tree shrinks by ~950 files: helm.sh/helm/v3 and its
    transitive baggage (containerd, lib/pq, jmoiron/sqlx, lann/builder,
    chai2010/gettext-go, gosuri, peterbourgon/diskv, mitchellh/*,
    moby/term, Masterminds/squirrel, sigs.k8s.io/kustomize, oras-go,
    Azure/go-ansiterm, plus k8s.io/{cli-runtime,kubectl,apiserver,
    component-base,streaming}) are all gone — none of them are
    reachable from the discovery sub-package.
  - hashicorp/go-multierror (MPL-2.0) is gone too — unblocking the
    verify-licenses CI gate that rejected the prior version.

# Testing

  - pkg/collector/network/translate_test.go: 9 cases covering single-
    group happy path, every subtype's shape, kernel modules, multi-
    group rejection, nil input, empty ClusterConfig, no-PFs (pfs
    subtype skipped).
  - pkg/collector/network/network_test.go: 6 cases covering inactive
    contract, file mode success, missing path, invalid YAML, multi-
    group surfaced from collector, file-path-wins-over-discover.
    Tests use real YAML round-tripped through l8k's parser — no test
    seams.
  - pkg/k8s/agent/deployer_test.go: a new TestDeployer_EnsureRBAC
    subtest asserts the discover-network rule set lands when
    Config.DiscoverNetwork is set (mellanox.com/nicclusterpolicies
    patch, pods/exec create, nodes patch all present); the existing
    default subtest still asserts exactly 3 rules so non-discovery
    snapshots stay minimal-priv.
  - go test -race ./... passes. Lint clean (golangci-lint v2.12.2,
    pinned in .settings.yaml).

# End-to-end discovery against a real cluster

Running --discover-network in local mode against a 4-node ThinkSystem
SR675-V3 RTX-PRO-6000 cluster produced a NetworkTopology Measurement
with the expected identity / capabilities / pfs shape, including the
gpuProximity / connectedGPU per-PF Context that l8k overlays from
preset topology:

    apiVersion: aicr.nvidia.com/v1alpha1
    fingerprint:
      accelerator:
        source: nodeTopology.label.nvidia.com/gpu.product
        value: rtx-pro-6000
      gpuNodeCount:
        source: nodeTopology.label.nvidia.com/gpu.product
        value: 3
      k8sVersion:
        source: k8s.server.version
        value: 1.34.3
      nodeCount:
        source: nodeTopology.summary.node-count
        value: 4
      os:
        value: ""
      service:
        value: ""
    kind: Snapshot
    measurements:
      # ...
      - subtypes:
          - context:
              gpuType: NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition
              identifier: thinksystem-sr675-v3-nvidia-rtx-pro-6000-blackwell-ser-20a0a715
              linkType: Ethernet
              machineType: ThinkSystem-SR675-V3
              nodeSelector: nvidia.kubernetes-launch-kit.machine=ThinkSystem-SR675-V3-NVIDIA-RTX-PRO-6000-Blackwell-Ser-20a0a715
            data:
              pf-count: 6
              rail-count: 4
            subtype: identity
          - data:
              ib: true
              rdma: true
              sriov: true
            subtype: capabilities
          - items:
              - context:
                  connectedGPU: GPU0
                  deviceID: a2dc
                  gpuProximity: PIX
                  model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112
                  networkInterface: ens15f0np0
                  partNumber: 900-9D3D4-00EN-HA0
                  pciAddress: "0000:05:00.0"
                  psid: mt_0000001069
                  rdmaDevice: mlx5_1
                data:
                  numaNode: 0
                  rail: 0
                  traffic: east-west
              - context:
                  deviceID: a2dc
                  model: BlueField-3 P-Series DPU 200GbE/NDR200 dual-port QSFP112
                  networkInterface: ens1f0np0
                  partNumber: 900-9D3B6-00CV-AA0
                  pciAddress: 0000:61:00.0
                  psid: mt_0000000884
                  rdmaDevice: mlx5_bond_0
                data:
                  numaNode: 0
                  traffic: north-south
              - context:
                  deviceID: a2dc
                  model: BlueField-3 P-Series DPU 200GbE/NDR200 dual-port QSFP112
                  networkInterface: ens1f1np1
                  partNumber: 900-9D3B6-00CV-AA0
                  pciAddress: 0000:61:00.1
                  psid: mt_0000000884
                data:
                  numaNode: 0
                  traffic: north-south
              - context:
                  connectedGPU: GPU2
                  deviceID: a2dc
                  gpuProximity: PIX
                  model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112
                  networkInterface: ens16f0np0
                  partNumber: 900-9D3D4-00EN-HA0
                  pciAddress: 0000:75:00.0
                  psid: mt_0000001069
                  rdmaDevice: mlx5_0
                data:
                  numaNode: 0
                  rail: 1
                  traffic: east-west
              - context:
                  connectedGPU: GPU4
                  deviceID: a2dc
                  gpuProximity: PIX
                  model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112
                  networkInterface: ens20f0np0
                  partNumber: 900-9D3D4-00EN-HA0
                  pciAddress: 0000:85:00.0
                  psid: mt_0000001069
                  rdmaDevice: mlx5_3
                data:
                  numaNode: 1
                  rail: 2
                  traffic: east-west
              - context:
                  connectedGPU: GPU6
                  deviceID: a2dc
                  gpuProximity: PIX
                  model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112
                  networkInterface: ens21f0np0
                  partNumber: 900-9D3D4-00EN-HA0
                  pciAddress: 0000:f5:00.0
                  psid: mt_0000001069
                  rdmaDevice: mlx5_2
                data:
                  numaNode: 1
                  rail: 3
                  traffic: east-west
            subtype: pfs
        type: NetworkTopology
    metadata:
      source-node: ""
      timestamp: "2026-06-25T10:40:47Z"
      version: dev

CLI invocation and l8k discovery log (truncated to network-relevant
lines; the proc/sys collectors warn-and-skip because the local-mode
host isn't a cluster node):

    AICR_AGENT_MODE=true aicr snapshot \
      --discover-network \
      --kubeconfig ~/workspace/k8s-launch-kit/rtx-cluster/kubeconfig \
      -o /Users/amaslennikov/workspace/tmp/aicr-discovery/discovery.yaml
    [cli] collecting SystemD service configurations
    [cli] collecting GPU information
    [cli] collecting node topology information
    [cli] collecting OS configuration
    [cli] failed to collect grub - skipping: collector=grub error=[INTERNAL] failed to read GRUB params from /proc/cmdline: [INTERNAL] failed to open file "/proc/cmdline": open /proc/cmdline: no such file or directory
    [cli] collecting network topology via l8k discovery
    [cli] failed to collect sysctl - skipping: collector=sysctl error=[INTERNAL] failed to collect sysctl parameters: [INTERNAL] failed to walk directory /proc/sys: lstat /proc/sys: no such file or directory
    [cli] failed to collect kmod - skipping: collector=kmod error=[INTERNAL] failed to read kernel modules from /proc/modules: [INTERNAL] failed to open file "/proc/modules": open /proc/modules: no such file or directory
    [cli] GPU hardware detection failed: error=[INTERNAL] NFD PCI discovery failed: failed to detect PCI devices: open /sys/bus/pci/devices: no such file or directory
    [cli] failed to collect release - skipping: collector=release error=[INTERNAL] failed to read os release from /usr/lib/os-release: [INTERNAL] failed to open file "/usr/lib/os-release": open /usr/lib/os-release: no such file or directory
    [cli] collecting Kubernetes cluster information
    [cli] D-Bus not available - no systemd data will be collected: error=dial unix /run/systemd/private: connect: no such file or directory hint=systemd/D-Bus is required for service status collection
    [cli] failed to collect node - skipping: collector=node error=[INVALID_REQUEST] node name not set in environment
    [cli] node topology collection complete: nodes=4 taints=1 labels=232
    [cli] Bootstrapping NIC configuration daemon for discovery: namespace=nvidia-k8s-launch-kit image=nvcr.io/nvidia/mellanox/nic-configuration-operator-daemon:network-operator-v26.4.0
    [cli] Filtered nodes by NVIDIA NIC presence: candidates=4 withNICs=4
    [cli] Skipping discovered group with no east-west NICs (north-south only); not added to cluster-config: nodes=[mdf-prod-2852-k8m-02] pfCount=2
    [cli] Discovery summary: groupCount=1 eastWestPFs=4 northSouthPFsFiltered=2 presetMatches=1 presetDeviationGroups=0 machineLabelledGroups=1
    [cli] Deleted bootstrap namespace: namespace=nvidia-k8s-launch-kit
    [cli] network topology collection complete: identifier=thinksystem-sr675-v3-nvidia-rtx-pro-6000-blackwell-ser-20a0a715 machineType=ThinkSystem-SR675-V3 gpuType=NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition pfs=6 rails=4

# Not in this commit

  - ConfigMap mounting for Job-mode --cluster-config — file source
    works in local mode only today.
  - Multi-group NetworkTopology emission. The single-group
    enforcement is in toMeasurement; lifting it requires the
    multi-Measurement-per-Type consumer rewrites in constraints
    extractor, recipe validation, diff indexing, and fingerprint
    that are tracked as a follow-up phase.
@almaslennikov almaslennikov force-pushed the network-collector-v2 branch from d091f52 to d1d12aa Compare June 25, 2026 16:18

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All three of my prior comments are addressed in d1d12aa6 — verified against the current head:

  1. docs/user/cli-reference.md — both --cluster-config and --discover-network rows added to the aicr snapshot table with full usage, mutual-exclusion, and local-mode-vs-Job-mode constraints. ✅
  2. pkg/k8s/client/client.go — doc comment now correctly scopes the "clientcmd merges" claim to loading-rules-aware callers and explicitly distinguishes BuildKubeClient (falls through to InClusterConfig() on empty path). Behavior matches: multi-file KUBECONFIG returns "" instead of silently substituting ~/.kube/config. ✅
  3. pkg/k8s/agent/rbac.go — non-blocking operator-awareness nit; the broad cluster-scoped grant (pods/exec, roles/rolebindings, namespaces) is now documented per-rule in discoverNetworkClusterRules() and at the call site, gated behind --discover-network. Namespace-scoping deferred to an l8k-side follow-up, as expected. ✅

Original verify-licenses blocker is resolved (MPL-2.0 go-multierror gone via the l8k discovery split). LGTM.

Note for merge: branch is behind main — rebase before merge, and let the in-flight GPU/E2E/analyze checks finish.

@mchmarny mchmarny merged commit dc23abf into NVIDIA:main Jun 25, 2026
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants