feat(network): NetworkTopology collector + l8k library integration#1474
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a network-topology collection path that can read an Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 9
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/cli/snapshot.go`:
- Around line 550-557: The kubeconfig seeding logic in snapshot.go preserves an
inherited KUBECONFIG even when opts.kubeconfig is explicitly set, which can
cause l8k and the rest of the CLI to target different clusters. Update the
kubeconfig setup in the branch around the KUBECONFIG assignment so that an
explicit opts.kubeconfig always wins and is written into the environment before
the network collector runs, ensuring kubeclient.New and the rest of the
agent-mode flow use the same resolved kubeconfig.
In `@pkg/collector/network/network_test.go`:
- Around line 66-163: Refactor the existing Collector.Collect tests in
network_test.go into a single table-driven test with subtests, since they all
share the same setup/assertion pattern across different inputs and outcomes.
Keep the coverage for inactive, file success, missing path, invalid YAML,
multi-group rejection, and file-path-vs-discovery behavior, but move the
per-case expectations into a test table and use t.Run for each scenario. Use the
existing TestCollect_* cases, Collector.Collect, and writeTempConfig as the key
anchors while consolidating repeated error/type checks into shared helper
assertions if needed.
In `@pkg/collector/network/network.go`:
- Around line 159-163: The Discover call in network collector error handling is
wrapping every failure as ErrCodeUnavailable, which hides timeout/cancellation
signals. In the error path after l8kdisc.Discover in the network collector flow,
check for context.DeadlineExceeded and context.Canceled first and return those
directly so the 10-minute deadline still surfaces as a timeout/cancel. Only wrap
non-context errors with errors.Wrap(errors.ErrCodeUnavailable, ...) to keep real
availability failures distinct.
- Around line 113-122: Reject oversized cluster-config input before parsing by
checking the actual file size or otherwise confirming the reader hit the byte
cap in the network collector flow. In the code around l8kdisc.ParseClusterConfig
and the limited reader in network.go, add an explicit oversize guard that
returns an invalid-request error when the config exceeds
defaults.MaxClusterConfigBytes, rather than relying on parsing to fail. Keep the
fix localized to the cluster-config loading path so oversized files cannot
produce a partial NetworkTopology snapshot.
- Around line 105-109: The ClusterConfigPath open path currently maps every
stdos.Open failure to ErrCodeNotFound, so adjust the error handling around that
open call to only wrap true missing-file cases with errors.ErrCodeNotFound. In
the logic that opens c.ClusterConfigPath, inspect the returned error and
preserve non-not-found failures (such as permission or ENOTDIR) with their real
error code/context, while keeping the not-found translation only for the actual
missing-file case.
In `@pkg/collector/network/translate_test.go`:
- Around line 29-73: The PF translation test fixture currently only validates
always-present fields, so it can miss regressions in the optional PF context
keys. Update fixtureSingleGroup and the translate test assertions to cover a
populated PF case for model, connectedGPU, and gpuProximity, and also add an
explicit check that these keys are omitted when unset. Use the existing
PFConfig/LaunchKitConfig setup and the translate-related assertions in
translate_test.go to keep the coverage tied to the actual output shape.
In `@pkg/k8s/client/client.go`:
- Around line 83-85: ResolveKubeconfigPath is currently returning the raw
KUBECONFIG env value, which breaks multi-file kubeconfig setups when
clientcmd.BuildConfigFromFlags treats it as a single path. Update the kubeconfig
loading flow around ResolveKubeconfigPath and the BuildConfigFromFlags call to
either rely on client-go’s default loading rules for KUBECONFIG or explicitly
split/merge the colon-separated list before creating the config. Ensure the
client creation path in client.go handles KUBECONFIG=/base:/overlay correctly.
In `@pkg/snapshotter/agent.go`:
- Around line 162-163: The Job-mode snapshot flow in agent.go is passing
ClusterConfigPath through unchanged, which leaves the pod pointing at an
unreadable host path. Update the code that builds the Job spec or agent config
to either mount the cluster config file and rewrite ClusterConfigPath to the
in-pod location, or explicitly reject ClusterConfigPath when launching a Job
until mounting support exists. Use the Snapshotter/Job construction path and the
ClusterConfigPath/DiscoverNetwork handling in agent.go to locate the fix.
In `@pkg/snapshotter/snapshot_test.go`:
- Around line 126-131: The snapshot tests are still unable to model an inactive
network collector, so the new nil-skip path in collectSafe remains untested.
Update the mockCollector/CreateNetworkCollector test setup to support an
explicit “inactive” state or a distinct override that can return a real nil
Measurement with no error, then use that in the snapshot tests and assertions
that currently rely on networkMeasurement so the nil branch is exercised.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 3457da35-5c2b-4e0b-b083-78f3d6dcf176
⛔ Files ignored due to path filters (89)
go.sumis excluded by!**/*.sumvendor/github.com/Mellanox/doca-driver-build/entrypoint/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/doca-driver-build/entrypoint/pkg/mofedmodules/mofedmodules.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/doc.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/groupversion_info.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/hostdevicenetwork_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/ipoibnetwork_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/macvlannetwork_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/nic_policy_cr.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicclusterpolicy_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicnodepolicy_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/utils.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/zz_generated.deepcopy.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/pkg/consts/consts.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/doc.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/groupversion_info.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicconfigurationtemplate_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicdevice_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaresource_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaretemplate_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicinterfacenametemplate.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/zz_generated.deepcopy.gois excluded by!vendor/**vendor/github.com/NVIDIA/k8s-operator-libs/LICENSEis excluded by!vendor/**vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/upgrade_spec.gois excluded by!vendor/**vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/zz_generated.deepcopy.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/LICENSEis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/comments.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/config.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/default-config.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/exec.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/kubeclient.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/discover.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/gputopology.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/library.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/ns-product-idsis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/nvidia.idsis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/pciids.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pfutil/pfutil.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicconfigurationtemplates.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicdevices.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaresources.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaretemplates.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicinterfacenametemplates.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/daemon.yaml.tmplis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/embed.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/bootstrap.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/teardown.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presetmatch/presetmatch.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-R760xa-H100-NVL/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE7745-RTX-PRO-4500/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE9680-H200/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR650-V4-RTX-PRO-6000/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-H200-NVL/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-RTX-PRO-6000/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-System-Board-RTX-PRO-6000/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR680a-V3-H200/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/UCSC-885A-M8-H22-H200/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/download.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/presets.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/synthesize.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/validate.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/context.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/json_output.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/progress.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/ui.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/.travis.ymlis excluded by!vendor/**vendor/gopkg.in/yaml.v2/LICENSEis excluded by!vendor/**vendor/gopkg.in/yaml.v2/LICENSE.libyamlis excluded by!vendor/**vendor/gopkg.in/yaml.v2/NOTICEis excluded by!vendor/**vendor/gopkg.in/yaml.v2/README.mdis excluded by!vendor/**vendor/gopkg.in/yaml.v2/apic.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/decode.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/emitterc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/encode.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/parserc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/readerc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/resolve.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/scannerc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/sorter.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/writerc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/yaml.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/yamlh.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/yamlprivateh.gois excluded by!vendor/**vendor/modules.txtis excluded by!vendor/**vendor/sigs.k8s.io/controller-runtime/pkg/scheme/scheme.gois excluded by!vendor/**
📒 Files selected for processing (17)
docs/integrator/measurement-api.mdgo.modpkg/cli/snapshot.gopkg/collector/factory.gopkg/collector/network/network.gopkg/collector/network/network_test.gopkg/collector/network/translate.gopkg/collector/network/translate_test.gopkg/defaults/timeouts.gopkg/k8s/agent/deployer_test.gopkg/k8s/agent/job.gopkg/k8s/agent/rbac.gopkg/k8s/agent/types.gopkg/k8s/client/client.gopkg/snapshotter/agent.gopkg/snapshotter/snapshot.gopkg/snapshotter/snapshot_test.go
3ba60d8 to
a24c845
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pkg/k8s/agent/job.go (1)
109-139: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winPassing
AICR_CLUSTER_CONFIG_PATHinto the Job isn't enough.
buildPodSpeconly mounts/tmp, so this env var points at a path from the submitting process, not a file that exists inside the pod. Because the network collector checksClusterConfigPathbeforeDiscoverNetwork, agent-mode jobs will take the file branch and fail to open the path instead of discovering from the cluster. Either stage the file into the pod (for example via ConfigMap/volume) or rejectClusterConfigPathfor agent deployments.Also applies to: 322-334
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/k8s/agent/job.go` around lines 109 - 139, The Job pod spec in buildPodSpec does not make AICR_CLUSTER_CONFIG_PATH available inside the pod, so agent-mode jobs will try to open a host path that does not exist in the container. Update the Deployer/buildPodSpec flow to either mount or stage the cluster config file into the pod (for example through a volume or ConfigMap) so the path is valid, or explicitly reject ClusterConfigPath for agent deployments before constructing the PodSpec. Ensure the fix is applied where buildEnvVars is consumed by buildPodSpec so the NetworkCollector can safely decide between ClusterConfigPath and DiscoverNetwork.pkg/k8s/agent/rbac.go (1)
186-205: 🎯 Functional Correctness | 🟠 Majorpkg/k8s/agent/rbac.go:186-205 — Preserve
resourceVersionbefore updating the existing ClusterRole.The
AlreadyExistspath callsUpdateon a newly constructedClusterRolewith nometadata.resourceVersion, so the apiserver will reject the write and--discover-networkwon’t upgrade an existing role. Fetch the current object first (or use patch/server-side apply) and carry itsresourceVersioninto the update.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/k8s/agent/rbac.go` around lines 186 - 205, The ClusterRole reconciliation in createClusterRole currently retries an Update after AlreadyExists using a freshly constructed rbacv1.ClusterRole, but that object has no resourceVersion so the apiserver will reject it. Fix the AlreadyExists path by first fetching the existing ClusterRole and copying its resourceVersion into the object before calling d.clientset.RbacV1().ClusterRoles().Update, or switch this flow to patch/server-side apply; keep the change localized around createClusterRole and the ClusterRoles().Create/Update sequence.
♻️ Duplicate comments (4)
pkg/k8s/client/client.go (1)
62-84:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDon't pass raw
KUBECONFIGthroughBuildConfigFromFlags.
ResolveKubeconfigPathreturns the trimmedKUBECONFIGvalue as-is, butBuildConfigFromFlags("", kubeconfig)turns that intoClientConfigLoadingRules{ExplicitPath: kubeconfigPath}. In client-go,ExplicitPathmeans “load only this file”, whileKUBECONFIGis a list of files to merge, so values like/base:/overlaywill break here instead of using the merged config. Use default loading rules when the source isKUBECONFIG, or split/merge the list before building the config. (kubernetes.io)Also applies to: 138-149
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/k8s/client/client.go` around lines 62 - 84, BuildKubeClient is treating the resolved KUBECONFIG value as a single explicit file, which breaks merged kubeconfig lists. Update the config निर्माण path so BuildConfigFromFlags is not used with a raw KUBECONFIG list from ResolveKubeconfigPath; instead, use client-go default loading rules when the source is KUBECONFIG or split/merge the list before creating the config. Keep the existing ResolveKubeconfigPath behavior for explicit paths, and adjust the BuildKubeClient flow accordingly wherever it consumes the kubeconfig string.pkg/collector/network/network.go (1)
113-122:⚠️ Potential issue | 🟠 Major | ⚡ Quick winStill unresolved: oversized cluster-config files can be accepted as truncated YAML.
io.LimitReader(..., MaxClusterConfigBytes+1)only limits whatParseClusterConfigcan read; it never proves the file was within the cap. If the firstMaxClusterConfigBytes+1bytes are still valid YAML, this path can parse a truncated prefix and emit a partialNetworkTopologysnapshot. Reject the file once you know it exceededMaxClusterConfigBytesinstead of relying on parse failure.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/collector/network/network.go` around lines 113 - 122, The cluster-config read path in network.go still accepts oversized files if the first bytes form valid YAML, because ParseClusterConfig is fed through io.LimitReader without proving the file stayed within MaxClusterConfigBytes. Update the network config loading flow around ParseClusterConfig so it explicitly detects when the underlying file exceeds defaults.MaxClusterConfigBytes and returns an invalid-request error before parsing or accepting the result, rather than relying on a parse failure from a truncated prefix.pkg/snapshotter/agent.go (1)
147-163: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winReject
--cluster-configin Job mode until the file is mounted into the pod.
deployAndWaitForResultforwards the caller’s host path straight intoagent.Config, and the agent Job only republishes it asAICR_CLUSTER_CONFIG_PATH. Nothing in this path mounts the file into the pod, so a normalaicr snapshot --cluster-config /host/pathpoints the in-pod collector at a file that does not exist and the network measurement fails.Minimal safe guard
func deployAndWaitForResult(ctx context.Context, clientset k8sclient.Interface, config *AgentConfig, agentOutput string) ([]byte, error) { + if config.ClusterConfigPath != "" { + return nil, errors.New( + errors.ErrCodeInvalidRequest, + "--cluster-config is not supported in agent Job mode until the file is mounted into the pod; use --discover-network or AICR_AGENT_MODE=true", + ) + } + // Auto-inject GPU node selector when no placement constraints are set.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/snapshotter/agent.go` around lines 147 - 163, `deployAndWaitForResult` is passing a host-side `--cluster-config` path into `agent.Config.ClusterConfigPath` without ensuring it exists in the Job pod, so reject or ignore `ClusterConfigPath` in Job mode until the file is actually mounted. Update the job setup in `deployAndWaitForResult` and any related `agent.Config` handling to validate this case early and return a clear error before creating the agent Job, using the existing `agent.Config` and `AICR_CLUSTER_CONFIG_PATH` flow as the place to enforce the safeguard.pkg/cli/snapshot.go (1)
543-557: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winLet explicit
--kubeconfigoverride inheritedKUBECONFIG.In the
AICR_AGENT_MODE=truepath you clearns.Factoryand rebuild collector options from env. If the shell already exportsKUBECONFIGfor cluster A but the user passes--kubeconfigfor cluster B, this guard preserves cluster A for l8k while the rest of the CLI is configured for cluster B. With--discover-network, that can label or patch the wrong cluster.Use the explicit kubeconfig whenever one is set
- if opts.kubeconfig != "" && os.Getenv("KUBECONFIG") == "" { - _ = os.Setenv("KUBECONFIG", opts.kubeconfig) + if opts.kubeconfig != "" { + if setErr := os.Setenv("KUBECONFIG", opts.kubeconfig); setErr != nil { + return errors.Wrap(errors.ErrCodeInvalidRequest, "failed to set KUBECONFIG", setErr) + } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cli/snapshot.go` around lines 543 - 557, The AICR_AGENT_MODE kubeconfig seeding logic in snapshot.go should let the explicit opts.kubeconfig override any inherited KUBECONFIG so the collector and CLI use the same cluster context. Update the env setup around the AICR_AGENT_MODE block to always set KUBECONFIG from opts.kubeconfig when it is non-empty, instead of skipping when KUBECONFIG is already present; keep the existing handling for opts.clusterConfigPath and opts.discoverNetwork intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/collector/network/translate.go`:
- Around line 142-144: The NetworkTopology translation is skipping
identity.context.linkType when g.LinkType is empty, which changes the expected
API shape. Update the logic in the translate.go path around the WithContext call
so the field is written unconditionally in the NetworkTopology builder, using
the empty string as the unknown sentinel instead of omitting the key entirely.
In `@pkg/k8s/agent/deployer_test.go`:
- Around line 168-224: The current test only verifies first-time creation of the
ClusterRole, but it misses the upgrade/mutation path for existing resources. In
deployer_test.go, extend the ensureClusterRole coverage around NewDeployer and
ensureClusterRole by adding a case that pre-creates the base aicr-node-reader
ClusterRole, then reruns ensureClusterRole with DiscoverNetwork enabled and
asserts the existing role gains the discovery rules. Keep the existing
assertions for the added rules, but make sure the new subtest specifically
exercises the update path on an already-present ClusterRole.
In `@pkg/k8s/agent/rbac.go`:
- Around line 290-337: The discoverNetworkClusterRules cluster-role set is
over-broad because the bootstrap namespace resources are still granted via
cluster-wide RBAC. Update discoverNetworkClusterRules to keep only truly
cluster-scoped permissions, and move the bootstrap namespace resources handled
in the discovery flow (daemonsets, serviceaccounts, configmaps, roles,
rolebindings) into a namespace-scoped Role and RoleBinding for
nvidia-k8s-launch-kit so the permissions are limited to that namespace.
---
Outside diff comments:
In `@pkg/k8s/agent/job.go`:
- Around line 109-139: The Job pod spec in buildPodSpec does not make
AICR_CLUSTER_CONFIG_PATH available inside the pod, so agent-mode jobs will try
to open a host path that does not exist in the container. Update the
Deployer/buildPodSpec flow to either mount or stage the cluster config file into
the pod (for example through a volume or ConfigMap) so the path is valid, or
explicitly reject ClusterConfigPath for agent deployments before constructing
the PodSpec. Ensure the fix is applied where buildEnvVars is consumed by
buildPodSpec so the NetworkCollector can safely decide between ClusterConfigPath
and DiscoverNetwork.
In `@pkg/k8s/agent/rbac.go`:
- Around line 186-205: The ClusterRole reconciliation in createClusterRole
currently retries an Update after AlreadyExists using a freshly constructed
rbacv1.ClusterRole, but that object has no resourceVersion so the apiserver will
reject it. Fix the AlreadyExists path by first fetching the existing ClusterRole
and copying its resourceVersion into the object before calling
d.clientset.RbacV1().ClusterRoles().Update, or switch this flow to
patch/server-side apply; keep the change localized around createClusterRole and
the ClusterRoles().Create/Update sequence.
---
Duplicate comments:
In `@pkg/cli/snapshot.go`:
- Around line 543-557: The AICR_AGENT_MODE kubeconfig seeding logic in
snapshot.go should let the explicit opts.kubeconfig override any inherited
KUBECONFIG so the collector and CLI use the same cluster context. Update the env
setup around the AICR_AGENT_MODE block to always set KUBECONFIG from
opts.kubeconfig when it is non-empty, instead of skipping when KUBECONFIG is
already present; keep the existing handling for opts.clusterConfigPath and
opts.discoverNetwork intact.
In `@pkg/collector/network/network.go`:
- Around line 113-122: The cluster-config read path in network.go still accepts
oversized files if the first bytes form valid YAML, because ParseClusterConfig
is fed through io.LimitReader without proving the file stayed within
MaxClusterConfigBytes. Update the network config loading flow around
ParseClusterConfig so it explicitly detects when the underlying file exceeds
defaults.MaxClusterConfigBytes and returns an invalid-request error before
parsing or accepting the result, rather than relying on a parse failure from a
truncated prefix.
In `@pkg/k8s/client/client.go`:
- Around line 62-84: BuildKubeClient is treating the resolved KUBECONFIG value
as a single explicit file, which breaks merged kubeconfig lists. Update the
config निर्माण path so BuildConfigFromFlags is not used with a raw KUBECONFIG
list from ResolveKubeconfigPath; instead, use client-go default loading rules
when the source is KUBECONFIG or split/merge the list before creating the
config. Keep the existing ResolveKubeconfigPath behavior for explicit paths, and
adjust the BuildKubeClient flow accordingly wherever it consumes the kubeconfig
string.
In `@pkg/snapshotter/agent.go`:
- Around line 147-163: `deployAndWaitForResult` is passing a host-side
`--cluster-config` path into `agent.Config.ClusterConfigPath` without ensuring
it exists in the Job pod, so reject or ignore `ClusterConfigPath` in Job mode
until the file is actually mounted. Update the job setup in
`deployAndWaitForResult` and any related `agent.Config` handling to validate
this case early and return a clear error before creating the agent Job, using
the existing `agent.Config` and `AICR_CLUSTER_CONFIG_PATH` flow as the place to
enforce the safeguard.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: c58932a3-6167-450a-a930-69d6b1f06bb3
⛔ Files ignored due to path filters (89)
go.sumis excluded by!**/*.sumvendor/github.com/Mellanox/doca-driver-build/entrypoint/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/doca-driver-build/entrypoint/pkg/mofedmodules/mofedmodules.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/doc.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/groupversion_info.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/hostdevicenetwork_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/ipoibnetwork_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/macvlannetwork_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/nic_policy_cr.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicclusterpolicy_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicnodepolicy_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/utils.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/zz_generated.deepcopy.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/pkg/consts/consts.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/doc.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/groupversion_info.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicconfigurationtemplate_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicdevice_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaresource_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaretemplate_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicinterfacenametemplate.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/zz_generated.deepcopy.gois excluded by!vendor/**vendor/github.com/NVIDIA/k8s-operator-libs/LICENSEis excluded by!vendor/**vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/upgrade_spec.gois excluded by!vendor/**vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/zz_generated.deepcopy.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/LICENSEis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/comments.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/config.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/default-config.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/exec.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/kubeclient.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/discover.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/gputopology.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/library.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/ns-product-idsis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/nvidia.idsis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/pciids.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pfutil/pfutil.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicconfigurationtemplates.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicdevices.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaresources.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaretemplates.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicinterfacenametemplates.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/daemon.yaml.tmplis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/embed.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/bootstrap.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/teardown.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presetmatch/presetmatch.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-R760xa-H100-NVL/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE7745-RTX-PRO-4500/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE9680-H200/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR650-V4-RTX-PRO-6000/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-H200-NVL/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-RTX-PRO-6000/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-System-Board-RTX-PRO-6000/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR680a-V3-H200/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/UCSC-885A-M8-H22-H200/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/download.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/presets.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/synthesize.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/validate.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/context.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/json_output.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/progress.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/ui.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/.travis.ymlis excluded by!vendor/**vendor/gopkg.in/yaml.v2/LICENSEis excluded by!vendor/**vendor/gopkg.in/yaml.v2/LICENSE.libyamlis excluded by!vendor/**vendor/gopkg.in/yaml.v2/NOTICEis excluded by!vendor/**vendor/gopkg.in/yaml.v2/README.mdis excluded by!vendor/**vendor/gopkg.in/yaml.v2/apic.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/decode.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/emitterc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/encode.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/parserc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/readerc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/resolve.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/scannerc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/sorter.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/writerc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/yaml.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/yamlh.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/yamlprivateh.gois excluded by!vendor/**vendor/modules.txtis excluded by!vendor/**vendor/sigs.k8s.io/controller-runtime/pkg/scheme/scheme.gois excluded by!vendor/**
📒 Files selected for processing (18)
docs/integrator/measurement-api.mdgo.modpkg/cli/snapshot.gopkg/collector/factory.gopkg/collector/network/network.gopkg/collector/network/network_test.gopkg/collector/network/translate.gopkg/collector/network/translate_test.gopkg/defaults/timeouts.gopkg/k8s/agent/consts.gopkg/k8s/agent/deployer_test.gopkg/k8s/agent/job.gopkg/k8s/agent/rbac.gopkg/k8s/agent/types.gopkg/k8s/client/client.gopkg/snapshotter/agent.gopkg/snapshotter/snapshot.gopkg/snapshotter/snapshot_test.go
a24c845 to
df3d81d
Compare
|
Addressing CodeRabbit feedback in Fixed (this revision):
Deferred to follow-up (not blocking this PR):
CI: build / lint / test green locally. Re-pushed, awaiting fresh CI cycle. |
There was a problem hiding this comment.
Actionable comments posted: 7
♻️ Duplicate comments (1)
pkg/collector/network/network.go (1)
112-157: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winThe oversize guard is still racy after
Stat().
Stat()only validates the file state you looked at before opening it. If the path is replaced or grows betweenStat()andParseClusterConfig, this code still parses a truncatedLimitReader, and a valid prefix can be accepted as a smaller config. That can emit a partialNetworkTopologysnapshot from input that should have been rejected.A safer pattern here is to read at most
MaxClusterConfigBytes+1bytes into a buffer first, rejectlen(buf) > MaxClusterConfigBytes, and then parse from that same stable buffer.One localized fix
+import ( + "bytes" + ... +) ... - limited := io.LimitReader(f, defaults.MaxClusterConfigBytes+1) - - cfg, err := l8kdisc.ParseClusterConfig(limited) + buf, err := io.ReadAll(io.LimitReader(f, defaults.MaxClusterConfigBytes+1)) + if err != nil { + return nil, errors.WrapWithContext(errors.ErrCodeInternal, + "failed to read cluster-config file", err, + map[string]any{"path": c.ClusterConfigPath}) + } + if int64(len(buf)) > defaults.MaxClusterConfigBytes { + return nil, errors.NewWithContext(errors.ErrCodeInvalidRequest, + "cluster-config file exceeds size limit", + map[string]any{ + "path": c.ClusterConfigPath, + "size": len(buf), + "limit": defaults.MaxClusterConfigBytes, + }) + } + + cfg, err := l8kdisc.ParseClusterConfig(bytes.NewReader(buf))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/collector/network/network.go` around lines 112 - 157, The size check in the cluster-config loading path is still race-prone because `Stat()` and `ParseClusterConfig` read from different file states. Update the `NetworkTopology`/cluster-config read flow to use one stable read source: open the file once in the loader, read up to `defaults.MaxClusterConfigBytes+1` into a buffer, reject when the buffer exceeds the limit, and then pass that buffer to `l8kdisc.ParseClusterConfig` instead of parsing directly from `io.LimitReader`. Keep the existing error mapping and context around `stdos.Open`, `errors.WrapWithContext`, and `defaults.MaxClusterConfigBytes`.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/cli/snapshot.go`:
- Around line 398-403: The `cluster-config` flag help text currently implies it
works in the normal `aicr snapshot` Job flow, but `pkg/snapshotter/agent.go`
rejects it outside the local-agent path. Update the `&cli.StringFlag` definition
for `cluster-config` in `snapshot.go` so the usage clearly says this flag is
only supported for local-agent mode in this iteration, while keeping the
existing network-topology and precedence wording intact.
- Around line 543-559: The env seeding in snapshot.go only writes the
true/non-empty cases, so inherited AICR_CLUSTER_CONFIG_PATH and
AICR_DISCOVER_NETWORK values can leak through after ns.Measure() reruns with
ns.Factory = nil. Update the AICR_AGENT_MODE block in snapshot setup to mirror
the resolved CLI state back into the environment unconditionally by setting or
clearing those vars based on opts.clusterConfigPath and opts.discoverNetwork,
alongside the existing KUBECONFIG handling.
In `@pkg/collector/network/network_test.go`:
- Around line 83-163: Add a file-mode test that exercises the new size limit in
Collector.Collect by creating a temp config larger than
defaults.MaxClusterConfigBytes and calling Collect with ClusterConfigPath set.
Assert it returns a structured error with errors.ErrCodeInvalidRequest, using
the existing file-mode test pattern in TestCollect_FileMode_* to keep the
behavior locked down.
In `@pkg/collector/network/network.go`:
- Around line 104-106: The repeated "path" key used in the network collector
logging is triggering goconst, so deduplicate it by hoisting the key into a
shared constant and reuse it in the slog.String calls across the affected
logging sites in the network collector. Update the log statements in the
collector flow so the constant is referenced consistently from the relevant
functions in this package, then run golangci-lint on the affected package paths
to verify the lint passes.
In `@pkg/collector/network/translate_test.go`:
- Around line 36-40: Add a test case in translate_test.go for the
unknown-linkType path so identity.context.linkType is asserted as present with
an empty string, not omitted. Extend the existing table-driven tests around the
translate/selector generation logic to include a case where LinkType is empty,
and verify the resulting NodeSelector (or equivalent output from the relevant
translate helper) still contains the identity.context.linkType key with ""
alongside the existing symbols such as Identifier, MachineType, and GPUType.
In `@pkg/k8s/agent/deployer_test.go`:
- Around line 181-223: The RBAC checks in the test are using three separate
boolean scans; refactor this block in deployer_test.go into a table-driven test
that iterates over expected {apiGroup, resource, verb} cases. Update the
existing rule validation around cr.Rules to compare each case against the
discovered rules and report missing permissions per entry, while keeping the
same coverage for mellanox.com/nicclusterpolicies patch, pods/exec create, and
nodes patch.
In `@pkg/k8s/client/client.go`:
- Around line 89-99: ResolveKubeconfigPath currently returns an empty path for
multi-file KUBECONFIG values, which causes BuildKubeClient to fall back to
in-cluster config instead of honoring clientcmd merging. Update BuildKubeClient
to use clientcmd.NewDefaultClientConfigLoadingRules() for the multi-file case,
or otherwise defer loading so the raw KUBECONFIG precedence is preserved; keep
ResolveKubeconfigPath behavior aligned with that flow.
---
Duplicate comments:
In `@pkg/collector/network/network.go`:
- Around line 112-157: The size check in the cluster-config loading path is
still race-prone because `Stat()` and `ParseClusterConfig` read from different
file states. Update the `NetworkTopology`/cluster-config read flow to use one
stable read source: open the file once in the loader, read up to
`defaults.MaxClusterConfigBytes+1` into a buffer, reject when the buffer exceeds
the limit, and then pass that buffer to `l8kdisc.ParseClusterConfig` instead of
parsing directly from `io.LimitReader`. Keep the existing error mapping and
context around `stdos.Open`, `errors.WrapWithContext`, and
`defaults.MaxClusterConfigBytes`.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 11cc1d5b-7d97-4600-ac6e-d98939ef84ad
⛔ Files ignored due to path filters (89)
go.sumis excluded by!**/*.sumvendor/github.com/Mellanox/doca-driver-build/entrypoint/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/doca-driver-build/entrypoint/pkg/mofedmodules/mofedmodules.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/doc.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/groupversion_info.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/hostdevicenetwork_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/ipoibnetwork_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/macvlannetwork_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/nic_policy_cr.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicclusterpolicy_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicnodepolicy_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/utils.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/zz_generated.deepcopy.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/pkg/consts/consts.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/doc.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/groupversion_info.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicconfigurationtemplate_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicdevice_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaresource_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaretemplate_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicinterfacenametemplate.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/zz_generated.deepcopy.gois excluded by!vendor/**vendor/github.com/NVIDIA/k8s-operator-libs/LICENSEis excluded by!vendor/**vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/upgrade_spec.gois excluded by!vendor/**vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/zz_generated.deepcopy.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/LICENSEis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/comments.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/config.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/default-config.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/exec.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/kubeclient.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/discover.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/gputopology.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/library.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/ns-product-idsis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/nvidia.idsis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/pciids.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pfutil/pfutil.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicconfigurationtemplates.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicdevices.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaresources.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaretemplates.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicinterfacenametemplates.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/daemon.yaml.tmplis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/embed.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/bootstrap.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/teardown.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presetmatch/presetmatch.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-R760xa-H100-NVL/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE7745-RTX-PRO-4500/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE9680-H200/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR650-V4-RTX-PRO-6000/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-H200-NVL/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-RTX-PRO-6000/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-System-Board-RTX-PRO-6000/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR680a-V3-H200/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/UCSC-885A-M8-H22-H200/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/download.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/presets.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/synthesize.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/validate.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/context.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/json_output.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/progress.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/ui.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/.travis.ymlis excluded by!vendor/**vendor/gopkg.in/yaml.v2/LICENSEis excluded by!vendor/**vendor/gopkg.in/yaml.v2/LICENSE.libyamlis excluded by!vendor/**vendor/gopkg.in/yaml.v2/NOTICEis excluded by!vendor/**vendor/gopkg.in/yaml.v2/README.mdis excluded by!vendor/**vendor/gopkg.in/yaml.v2/apic.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/decode.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/emitterc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/encode.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/parserc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/readerc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/resolve.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/scannerc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/sorter.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/writerc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/yaml.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/yamlh.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/yamlprivateh.gois excluded by!vendor/**vendor/modules.txtis excluded by!vendor/**vendor/sigs.k8s.io/controller-runtime/pkg/scheme/scheme.gois excluded by!vendor/**
📒 Files selected for processing (18)
docs/integrator/measurement-api.mdgo.modpkg/cli/snapshot.gopkg/collector/factory.gopkg/collector/network/network.gopkg/collector/network/network_test.gopkg/collector/network/translate.gopkg/collector/network/translate_test.gopkg/defaults/timeouts.gopkg/k8s/agent/consts.gopkg/k8s/agent/deployer_test.gopkg/k8s/agent/job.gopkg/k8s/agent/rbac.gopkg/k8s/agent/types.gopkg/k8s/client/client.gopkg/snapshotter/agent.gopkg/snapshotter/snapshot.gopkg/snapshotter/snapshot_test.go
| hasNCPPatch := false | ||
| hasPodsExec := false | ||
| hasNodesPatch := false | ||
| for _, r := range cr.Rules { | ||
| for _, g := range r.APIGroups { | ||
| if g == "mellanox.com" { | ||
| for _, res := range r.Resources { | ||
| if res == "nicclusterpolicies" { | ||
| for _, v := range r.Verbs { | ||
| if v == "patch" { | ||
| hasNCPPatch = true | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| for _, res := range r.Resources { | ||
| if res == "pods/exec" { | ||
| for _, v := range r.Verbs { | ||
| if v == "create" { | ||
| hasPodsExec = true | ||
| } | ||
| } | ||
| } | ||
| if res == "nodes" { | ||
| for _, v := range r.Verbs { | ||
| if v == "patch" { | ||
| hasNodesPatch = true | ||
| } | ||
| } | ||
| } | ||
| } | ||
| } | ||
| if !hasNCPPatch { | ||
| t.Error("expected mellanox.com/nicclusterpolicies patch rule") | ||
| } | ||
| if !hasPodsExec { | ||
| t.Error("expected pods/exec create rule") | ||
| } | ||
| if !hasNodesPatch { | ||
| t.Error("expected nodes/patch rule") | ||
| } |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win
Make the RBAC expectations table-driven.
The three expected discovery rules are independent cases; a table of {apiGroup, resource, verb} entries would match the project’s Go test guidance and make missing-rule additions easier to review. As per coding guidelines, “Write table-driven tests for multiple test cases.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/k8s/agent/deployer_test.go` around lines 181 - 223, The RBAC checks in
the test are using three separate boolean scans; refactor this block in
deployer_test.go into a table-driven test that iterates over expected {apiGroup,
resource, verb} cases. Update the existing rule validation around cr.Rules to
compare each case against the discovered rules and report missing permissions
per entry, while keeping the same coverage for mellanox.com/nicclusterpolicies
patch, pods/exec create, and nodes patch.
Source: Coding guidelines
df3d81d to
475e337
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/cli/snapshot.go (1)
398-403: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick winHelp text still implies
--cluster-configworks in the default Job flow.
pkg/snapshotter/agent.gorejects--cluster-configin Job mode (returnsErrCodeInvalidRequest); it is supported only in local agent mode (AICR_AGENT_MODE=true) this iteration. Call that limitation out in the usage string so the CLI contract matches runtime behavior.Suggested wording
&cli.StringFlag{ Name: "cluster-config", - Usage: "Path to a pre-existing k8s-launch-kit (l8k) cluster-config.yaml. Ingests the file's network topology into the snapshot as a NetworkTopology Measurement. Mutually exclusive with --discover-network at the collector level — file path wins when both are set.", + Usage: "Path to a pre-existing k8s-launch-kit (l8k) cluster-config.yaml. Ingests the file's network topology into the snapshot as a NetworkTopology Measurement. Local agent mode only for now (AICR_AGENT_MODE=true); the default Job flow rejects it because host files are not mounted into the pod. Mutually exclusive with --discover-network at the collector level — file path wins when both are set.", Sources: cli.EnvVars("AICR_CLUSTER_CONFIG_PATH"), Category: catAgentDeployment, },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cli/snapshot.go` around lines 398 - 403, The `cluster-config` flag help text in `snapshotFlags` still suggests it is generally available, but `agent.go` only accepts it in local agent mode and rejects it in Job mode via `ErrCodeInvalidRequest`. Update the `Usage` string for the `&cli.StringFlag{Name: "cluster-config"}` entry to explicitly say it is supported only when `AICR_AGENT_MODE=true` and not in the default Job flow, so the CLI description matches the behavior enforced by the agent validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/k8s/agent/types.go`:
- Around line 51-56: Update the ClusterConfigPath comment in types.go to remove
the incorrect “local-mode” reference tied to AICR_AGENT_MODE=true; the Job-based
in-pod agent uses that env var, not a local-mode file-input workflow. Reword the
note so it points users to the correct usage for file-based input with
ClusterConfigPath and the relevant Job/agent path, keeping the guidance aligned
with the ClusterConfigPath field and AICR_AGENT_MODE mention.
---
Duplicate comments:
In `@pkg/cli/snapshot.go`:
- Around line 398-403: The `cluster-config` flag help text in `snapshotFlags`
still suggests it is generally available, but `agent.go` only accepts it in
local agent mode and rejects it in Job mode via `ErrCodeInvalidRequest`. Update
the `Usage` string for the `&cli.StringFlag{Name: "cluster-config"}` entry to
explicitly say it is supported only when `AICR_AGENT_MODE=true` and not in the
default Job flow, so the CLI description matches the behavior enforced by the
agent validation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 1c711efd-0fbb-4356-81f1-7151d26a0ec5
⛔ Files ignored due to path filters (89)
go.sumis excluded by!**/*.sumvendor/github.com/Mellanox/doca-driver-build/entrypoint/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/doca-driver-build/entrypoint/pkg/mofedmodules/mofedmodules.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/doc.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/groupversion_info.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/hostdevicenetwork_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/ipoibnetwork_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/macvlannetwork_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/nic_policy_cr.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicclusterpolicy_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicnodepolicy_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/utils.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/zz_generated.deepcopy.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/pkg/consts/consts.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/doc.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/groupversion_info.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicconfigurationtemplate_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicdevice_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaresource_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaretemplate_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicinterfacenametemplate.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/zz_generated.deepcopy.gois excluded by!vendor/**vendor/github.com/NVIDIA/k8s-operator-libs/LICENSEis excluded by!vendor/**vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/upgrade_spec.gois excluded by!vendor/**vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/zz_generated.deepcopy.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/LICENSEis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/comments.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/config.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/default-config.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/exec.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/kubeclient.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/discover.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/gputopology.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/library.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/ns-product-idsis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/nvidia.idsis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/pciids.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pfutil/pfutil.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicconfigurationtemplates.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicdevices.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaresources.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaretemplates.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicinterfacenametemplates.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/daemon.yaml.tmplis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/embed.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/bootstrap.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/teardown.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presetmatch/presetmatch.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-R760xa-H100-NVL/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE7745-RTX-PRO-4500/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE9680-H200/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR650-V4-RTX-PRO-6000/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-H200-NVL/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-RTX-PRO-6000/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-System-Board-RTX-PRO-6000/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR680a-V3-H200/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/UCSC-885A-M8-H22-H200/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/download.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/presets.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/synthesize.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/validate.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/context.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/json_output.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/progress.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/ui.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/.travis.ymlis excluded by!vendor/**vendor/gopkg.in/yaml.v2/LICENSEis excluded by!vendor/**vendor/gopkg.in/yaml.v2/LICENSE.libyamlis excluded by!vendor/**vendor/gopkg.in/yaml.v2/NOTICEis excluded by!vendor/**vendor/gopkg.in/yaml.v2/README.mdis excluded by!vendor/**vendor/gopkg.in/yaml.v2/apic.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/decode.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/emitterc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/encode.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/parserc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/readerc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/resolve.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/scannerc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/sorter.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/writerc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/yaml.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/yamlh.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/yamlprivateh.gois excluded by!vendor/**vendor/modules.txtis excluded by!vendor/**vendor/sigs.k8s.io/controller-runtime/pkg/scheme/scheme.gois excluded by!vendor/**
📒 Files selected for processing (18)
docs/integrator/measurement-api.mdgo.modpkg/cli/snapshot.gopkg/collector/factory.gopkg/collector/network/network.gopkg/collector/network/network_test.gopkg/collector/network/translate.gopkg/collector/network/translate_test.gopkg/defaults/timeouts.gopkg/k8s/agent/consts.gopkg/k8s/agent/deployer_test.gopkg/k8s/agent/job.gopkg/k8s/agent/rbac.gopkg/k8s/agent/types.gopkg/k8s/client/client.gopkg/snapshotter/agent.gopkg/snapshotter/snapshot.gopkg/snapshotter/snapshot_test.go
mchmarny
left a comment
There was a problem hiding this comment.
Solid, well-engineered PR — the new network collector follows existing collector patterns closely, uses pkg/errors codes + context timeouts throughout, bounds the file read (os.Stat + LimitReader), fails closed on parse/garbage-env, and has a clean inactive-collector (nil, nil) contract that the snapshotter now honors. Test coverage is good (inactive/file-success/missing/invalid/multi-group/file-wins + the RBAC discovery subtest), and the MPL-2.0 blocker from #1472 is genuinely resolved via the Helm-free l8k discovery sub-package. CI is green (Lint, Test, Security Scan, verify-licenses, CLI E2E, GPU Smoke).
A few non-blocking notes inline:
- Docs (medium): the two new CLI flags aren't added to the
aicr snapshottable indocs/user/cli-reference.md. - Comment accuracy (low): the multi-file
KUBECONFIGrationale inResolveKubeconfigPathoverstates the outcome for theBuildKubeClientcaller (it goes to in-cluster config, not clientcmd loading rules). - Operator awareness (nit): the discovery ClusterRole's
pods/exec+roles/rolebindingsgrants are the broadest in the role — opt-in and well-documented, just flagging.
Nothing blocks merge. The doc table update is the one I'd ask for before merge; the other two are optional.
475e337 to
53c103e
Compare
|
Addressing the new review (mchmarny + CodeRabbit re-review of
CodeRabbit new findings on
CodeRabbit also re-flagged several items I already addressed in
Still deferred to follow-up:
CI: build / lint / test green locally. Re-pushed, awaiting fresh CI cycle. |
There was a problem hiding this comment.
Actionable comments posted: 3
♻️ Duplicate comments (1)
pkg/collector/network/network.go (1)
112-167: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick winMake the read-side size check authoritative.
os.Statonly proves the file size at one instant. If the path is replaced or grows betweenStatandOpen, this branch can still parse a truncatedio.LimitReaderprefix and emit a partialNetworkTopologyinstead of rejecting oversize input.Suggested fix
- // Read +1 over the cap as a defense-in-depth guard: the os.Stat - // check above is the primary gate (it catches files that grew on - // disk between Stat and Open just as well), but pairing it with a - // LimitReader keeps parser memory bounded if a future change drops - // the Stat path. - limited := io.LimitReader(f, defaults.MaxClusterConfigBytes+1) - - cfg, err := l8kdisc.ParseClusterConfig(limited) + data, err := io.ReadAll(io.LimitReader(f, defaults.MaxClusterConfigBytes+1)) + if err != nil { + return nil, errors.WrapWithContext(errors.ErrCodeInternal, + "failed to read cluster-config file", err, + map[string]any{ctxKeyPath: c.ClusterConfigPath}) + } + if int64(len(data)) > defaults.MaxClusterConfigBytes { + return nil, errors.NewWithContext(errors.ErrCodeInvalidRequest, + "cluster-config file exceeds size limit", + map[string]any{ + ctxKeyPath: c.ClusterConfigPath, + "size": len(data), + "limit": defaults.MaxClusterConfigBytes, + }) + } + + cfg, err := l8kdisc.ParseClusterConfig(bytes.NewReader(data))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/collector/network/network.go` around lines 112 - 167, The read-side validation in the network config loader is still not authoritative because `os.Stat` can be stale between `Stat` and `Open`, allowing `ParseClusterConfig` to consume a truncated prefix via `io.LimitReader`. Update the `cluster-config` read path in `LoadNetworkTopology` so the actual read enforces the size limit and rejects any oversize input discovered after opening, rather than relying on the earlier `Stat` check alone.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/cli/snapshot.go`:
- Around line 543-572: Do not clear ns.Factory in AICR_AGENT_MODE unless you
also mirror every factory-driving option into env first. NodeSnapshotter.measure
is rebuilding from env, but only the network vars and KUBECONFIG are being
copied, so CLI-only settings like --os and --max-nodes-per-entry get lost. Fix
this by either preserving the pre-parsed factory on ns.Factory or setting the
missing env inputs before the rebuild path runs, using the same option sources
that snapshotter/snapshot.go expects.
In `@pkg/collector/network/network.go`:
- Around line 117-153: Reject non-regular cluster-config paths in the collector
before calling stdos.Open in CollectorNetwork.collectFromFile. After the
existing stat and size checks, verify that the file mode is a regular file and
return an invalid-request error with the path context if it is a FIFO, device,
or other special file. Keep the existing not-found/permission handling for Stat
and Open, but add the regular-file guard so Open is only reached for valid
files.
In `@pkg/k8s/client/client.go`:
- Around line 97-111: In the kubeconfig lookup helper in client.go, the
multi-file KUBECONFIG case is still falling through to the default
~/.kube/config path, which breaks clientcmd merge semantics. Update the
kubeconfig resolution logic around the existing env check and defaultPath
fallback so that when KUBECONFIG contains multiple paths (the containsAny case),
the helper returns immediately or otherwise signals that loading-rules should
handle it, and only fall back to ~/.kube/config when no kubeconfig env is
present. Keep the behavior aligned with callers like the network collector that
rely on this helper.
---
Duplicate comments:
In `@pkg/collector/network/network.go`:
- Around line 112-167: The read-side validation in the network config loader is
still not authoritative because `os.Stat` can be stale between `Stat` and
`Open`, allowing `ParseClusterConfig` to consume a truncated prefix via
`io.LimitReader`. Update the `cluster-config` read path in `LoadNetworkTopology`
so the actual read enforces the size limit and rejects any oversize input
discovered after opening, rather than relying on the earlier `Stat` check alone.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: b3694a61-ec95-42cc-a4ca-c048761600bb
⛔ Files ignored due to path filters (89)
go.sumis excluded by!**/*.sumvendor/github.com/Mellanox/doca-driver-build/entrypoint/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/doca-driver-build/entrypoint/pkg/mofedmodules/mofedmodules.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/doc.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/groupversion_info.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/hostdevicenetwork_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/ipoibnetwork_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/macvlannetwork_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/nic_policy_cr.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicclusterpolicy_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/nicnodepolicy_types.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/utils.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/api/v1alpha1/zz_generated.deepcopy.gois excluded by!vendor/**vendor/github.com/Mellanox/network-operator/pkg/consts/consts.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/LICENSEis excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/doc.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/groupversion_info.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicconfigurationtemplate_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicdevice_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaresource_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicfirmwaretemplate_types.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/nicinterfacenametemplate.gois excluded by!vendor/**vendor/github.com/Mellanox/nic-configuration-operator/api/v1alpha1/zz_generated.deepcopy.gois excluded by!vendor/**vendor/github.com/NVIDIA/k8s-operator-libs/LICENSEis excluded by!vendor/**vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/upgrade_spec.gois excluded by!vendor/**vendor/github.com/NVIDIA/k8s-operator-libs/api/upgrade/v1alpha1/zz_generated.deepcopy.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/LICENSEis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/comments.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/config.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/config/default-config.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/exec.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/kubeclient/kubeclient.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/discover.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/gputopology.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/library.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery/ns-product-idsis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/nvidia.idsis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pciids/pciids.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/internal/pfutil/pfutil.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/releases/releases.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicconfigurationtemplates.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicdevices.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaresources.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicfirmwaretemplates.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/crds/configuration.net.nvidia.com_nicinterfacenametemplates.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/daemon.yaml.tmplis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/assets/embed.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/bootstrap.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/nicconfigdaemon/teardown.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presetmatch/presetmatch.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-R760xa-H100-NVL/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE7745-RTX-PRO-4500/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/PowerEdge-XE9680-H200/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR650-V4-RTX-PRO-6000/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-H200-NVL/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-RTX-PRO-6000/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR675-V3-System-Board-RTX-PRO-6000/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/ThinkSystem-SR680a-V3-H200/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/data/UCSC-885A-M8-H22-H200/topology.yamlis excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/download.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/presets.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/synthesize.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/presets/validate.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/context.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/json_output.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/progress.gois excluded by!vendor/**vendor/github.com/nvidia/k8s-launch-kit/pkg/ui/ui.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/.travis.ymlis excluded by!vendor/**vendor/gopkg.in/yaml.v2/LICENSEis excluded by!vendor/**vendor/gopkg.in/yaml.v2/LICENSE.libyamlis excluded by!vendor/**vendor/gopkg.in/yaml.v2/NOTICEis excluded by!vendor/**vendor/gopkg.in/yaml.v2/README.mdis excluded by!vendor/**vendor/gopkg.in/yaml.v2/apic.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/decode.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/emitterc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/encode.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/parserc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/readerc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/resolve.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/scannerc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/sorter.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/writerc.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/yaml.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/yamlh.gois excluded by!vendor/**vendor/gopkg.in/yaml.v2/yamlprivateh.gois excluded by!vendor/**vendor/modules.txtis excluded by!vendor/**vendor/sigs.k8s.io/controller-runtime/pkg/scheme/scheme.gois excluded by!vendor/**
📒 Files selected for processing (19)
docs/integrator/measurement-api.mddocs/user/cli-reference.mdgo.modpkg/cli/snapshot.gopkg/collector/factory.gopkg/collector/network/network.gopkg/collector/network/network_test.gopkg/collector/network/translate.gopkg/collector/network/translate_test.gopkg/defaults/timeouts.gopkg/k8s/agent/consts.gopkg/k8s/agent/deployer_test.gopkg/k8s/agent/job.gopkg/k8s/agent/rbac.gopkg/k8s/agent/types.gopkg/k8s/client/client.gopkg/snapshotter/agent.gopkg/snapshotter/snapshot.gopkg/snapshotter/snapshot_test.go
53c103e to
d091f52
Compare
|
Addressing the 3 new CodeRabbit findings on
Build / lint / test green locally. Re-pushed, awaiting fresh CI cycle. |
# Why aicr's snapshot model gained a NetworkTopology Measurement type in Phase 1 but nothing populated it. Phase 3 wires the producer: a new collector that ingests k8s-launch-kit (l8k) cluster-config data — either from a file or via live discovery — and translates it into the canonical NetworkTopology shape documented in docs/integrator/measurement-api.md. This is the consumer side of the cross-repo integration. l8k's library API landed in two upstream PRs: NVIDIA#102 added the initial Discover/ ParseClusterConfig entry points; NVIDIA#104 split them into a helm-free sub-package at pkg/networkoperatorplugin/discovery, dropping the MPL-2.0 hashicorp/go-multierror dep (and ~950 transitive vendor files) that aicr's verify-licenses gate was rejecting. aicr pins the merge of NVIDIA#104 and imports the new sub-package. # What changes ## New package pkg/collector/network - Collector type with two source modes, selected by struct fields: ClusterConfigPath (file source — no cluster contact) and DiscoverNetwork (live l8k discovery — writes node labels, patches NicClusterPolicy). When both are set, file path wins so callers can default --discover-network from a flag without inadvertent cluster contact. When neither is set, the collector is inactive — Collect returns (nil, nil) and the snapshotter's collectSafe wrapper (hardened in this commit) treats nil Measurement as a no-op. - Translator (translate.go) converts l8k.LaunchKitConfig into the contracted NetworkTopology Measurement: one Measurement with identity / capabilities / pfs (Items) / kernel-modules subtypes. Multi-group input is rejected with ErrCodeInvalidRequest and a groupCount field on the error context. Multi-group support is a planned future iteration; rejection guards the snapshot from silently dropping groups. - Per-PF Context carries model, connectedGPU, gpuProximity in addition to pciAddress, deviceID, psid, partNumber, rdmaDevice, networkInterface — every descriptive identifier l8k populates. Per-PF Data carries rail, numaNode, traffic (scalars). - Routes controller-runtime's internal logger through aicr's slog via l8kplugin.WithLogger(logr.FromSlogHandler(...)). Without this, callers hit a one-time "log.SetLogger(...) was never called" warning from controller-runtime and lose l8k's structured discovery output. ## CLI: aicr snapshot --cluster-config / --discover-network - --cluster-config <path> (env: AICR_CLUSTER_CONFIG_PATH) ingests a pre-existing l8k cluster-config.yaml. - --discover-network (env: AICR_DISCOVER_NETWORK) opts into live discovery. Discovery is NOT read-only. - Both flags off-by-default; an explicit opt-in is required because discovery is cluster-mutating and the file-source path needs a user-supplied YAML. - In AICR_AGENT_MODE=true (developer-bypass that runs collectors in-process), the CLI seeds KUBECONFIG from --kubeconfig before calling Measure so the network collector's l8k client resolves the same kubeconfig the rest of aicr uses. ## snapshotter.AgentConfig + agent.Config + Job env - ClusterConfigPath and DiscoverNetwork on snapshotter.AgentConfig forward into agent.Config, which renders AICR_CLUSTER_CONFIG_PATH and AICR_DISCOVER_NETWORK env vars on the agent Job. - In Job-mode, the in-pod CLI consumes those env vars via the flag's Sources binding (cli.EnvVars), so the same factory construction path serves both local and Job execution. - collectSafe in pkg/snapshotter/snapshot.go now skips nil Measurements — required for the network collector's inactive-mode contract; harmless to existing collectors that always return a Measurement. ## Job-mode RBAC for --discover-network - pkg/k8s/agent/rbac.go ensureClusterRole conditionally appends a discoverNetworkClusterRules() rule set when deployer.config.DiscoverNetwork is true. Non-network snapshots stay minimal-priv; discover-network snapshots gain the cluster- scoped grants l8k actually exercises: * apiextensions.k8s.io customresourcedefinitions (get/list/ create/update/patch) — l8k installs the nic-configuration- operator CRDs if missing. * namespaces (get/create/delete) — the nvidia-k8s-launch-kit bootstrap namespace, created and deleted each run. * apps daemonsets (get/list/watch/create/delete) — the nic- configuration-daemon bootstrap DaemonSet. * serviceaccounts/configmaps (get/create/delete) and rbac.authorization.k8s.io roles/rolebindings (get/create/ delete) — the daemon's supporting resources. * pods/exec create — probing daemon pods for VPD / link metadata. * nodes patch — writing the nvidia.kubernetes-launch-kit. machine and .gpu labels onto matched nodes. * configuration.net.nvidia.com nicdevices (get/list) — the per-PF CRs the daemon publishes. * mellanox.com nicclusterpolicies (get/patch) — server-side apply of the NicConfigurationOperator section onto the user's NicClusterPolicy. - Job-mode --discover-network is now usable without escalating the Job ServiceAccount to cluster-admin. ## Shared helper pkg/k8s/client.ResolveKubeconfigPath - Lifted the explicit-path → KUBECONFIG env → ~/.kube/config → in-cluster resolution chain out of BuildKubeClient into an exported helper. The network collector uses it before handing a path to l8k's kubeclient.New, so aicr's K8s client and the l8k discovery client resolve kubeconfigs identically. ## Factory wiring + defaults - pkg/collector/factory.go gains CreateNetworkCollector() and three Option functions (WithClusterConfigPath, WithDiscoverNetwork, WithKubeconfigPath). Existing patterns matched (struct-literal construction, no New on the collector itself). - pkg/defaults adds CollectorNetworkTimeout (10m) — capped above l8k's internal DaemonSet wait — and MaxClusterConfigBytes (1 MiB) for the file-source io.LimitReader. ## Cross-repo dependency - go.mod requires github.com/nvidia/k8s-launch-kit at the merged PR NVIDIA#104 commit (pseudo-version v0.0.0-20260625130457-a99c9684627e). No local replace — pure upstream pin. - Import path migrated: l8kplugin "github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin" → l8kdisc "github.com/nvidia/k8s-launch-kit/pkg/networkoperatorplugin/discovery" Plus l8kconfig "github.com/nvidia/k8s-launch-kit/pkg/config" for DefaultLaunchKitConfig (used as the cfg arg to Discover). - Discover signature change: the cfg is now a required positional argument (l8k NVIDIA#104 dropped WithBaseConfig). aicr passes DefaultLaunchKitConfig() — l8k's recommended Network Operator release line is baked into the embedded default (currently 26.4); aicr's snapshot path doesn't pin a release here, that decision lives in the recipe/deploy stage. - vendor/ tree shrinks by ~950 files: helm.sh/helm/v3 and its transitive baggage (containerd, lib/pq, jmoiron/sqlx, lann/builder, chai2010/gettext-go, gosuri, peterbourgon/diskv, mitchellh/*, moby/term, Masterminds/squirrel, sigs.k8s.io/kustomize, oras-go, Azure/go-ansiterm, plus k8s.io/{cli-runtime,kubectl,apiserver, component-base,streaming}) are all gone — none of them are reachable from the discovery sub-package. - hashicorp/go-multierror (MPL-2.0) is gone too — unblocking the verify-licenses CI gate that rejected the prior version. # Testing - pkg/collector/network/translate_test.go: 9 cases covering single- group happy path, every subtype's shape, kernel modules, multi- group rejection, nil input, empty ClusterConfig, no-PFs (pfs subtype skipped). - pkg/collector/network/network_test.go: 6 cases covering inactive contract, file mode success, missing path, invalid YAML, multi- group surfaced from collector, file-path-wins-over-discover. Tests use real YAML round-tripped through l8k's parser — no test seams. - pkg/k8s/agent/deployer_test.go: a new TestDeployer_EnsureRBAC subtest asserts the discover-network rule set lands when Config.DiscoverNetwork is set (mellanox.com/nicclusterpolicies patch, pods/exec create, nodes patch all present); the existing default subtest still asserts exactly 3 rules so non-discovery snapshots stay minimal-priv. - go test -race ./... passes. Lint clean (golangci-lint v2.12.2, pinned in .settings.yaml). # End-to-end discovery against a real cluster Running --discover-network in local mode against a 4-node ThinkSystem SR675-V3 RTX-PRO-6000 cluster produced a NetworkTopology Measurement with the expected identity / capabilities / pfs shape, including the gpuProximity / connectedGPU per-PF Context that l8k overlays from preset topology: apiVersion: aicr.nvidia.com/v1alpha1 fingerprint: accelerator: source: nodeTopology.label.nvidia.com/gpu.product value: rtx-pro-6000 gpuNodeCount: source: nodeTopology.label.nvidia.com/gpu.product value: 3 k8sVersion: source: k8s.server.version value: 1.34.3 nodeCount: source: nodeTopology.summary.node-count value: 4 os: value: "" service: value: "" kind: Snapshot measurements: # ... - subtypes: - context: gpuType: NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition identifier: thinksystem-sr675-v3-nvidia-rtx-pro-6000-blackwell-ser-20a0a715 linkType: Ethernet machineType: ThinkSystem-SR675-V3 nodeSelector: nvidia.kubernetes-launch-kit.machine=ThinkSystem-SR675-V3-NVIDIA-RTX-PRO-6000-Blackwell-Ser-20a0a715 data: pf-count: 6 rail-count: 4 subtype: identity - data: ib: true rdma: true sriov: true subtype: capabilities - items: - context: connectedGPU: GPU0 deviceID: a2dc gpuProximity: PIX model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112 networkInterface: ens15f0np0 partNumber: 900-9D3D4-00EN-HA0 pciAddress: "0000:05:00.0" psid: mt_0000001069 rdmaDevice: mlx5_1 data: numaNode: 0 rail: 0 traffic: east-west - context: deviceID: a2dc model: BlueField-3 P-Series DPU 200GbE/NDR200 dual-port QSFP112 networkInterface: ens1f0np0 partNumber: 900-9D3B6-00CV-AA0 pciAddress: 0000:61:00.0 psid: mt_0000000884 rdmaDevice: mlx5_bond_0 data: numaNode: 0 traffic: north-south - context: deviceID: a2dc model: BlueField-3 P-Series DPU 200GbE/NDR200 dual-port QSFP112 networkInterface: ens1f1np1 partNumber: 900-9D3B6-00CV-AA0 pciAddress: 0000:61:00.1 psid: mt_0000000884 data: numaNode: 0 traffic: north-south - context: connectedGPU: GPU2 deviceID: a2dc gpuProximity: PIX model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112 networkInterface: ens16f0np0 partNumber: 900-9D3D4-00EN-HA0 pciAddress: 0000:75:00.0 psid: mt_0000001069 rdmaDevice: mlx5_0 data: numaNode: 0 rail: 1 traffic: east-west - context: connectedGPU: GPU4 deviceID: a2dc gpuProximity: PIX model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112 networkInterface: ens20f0np0 partNumber: 900-9D3D4-00EN-HA0 pciAddress: 0000:85:00.0 psid: mt_0000001069 rdmaDevice: mlx5_3 data: numaNode: 1 rail: 2 traffic: east-west - context: connectedGPU: GPU6 deviceID: a2dc gpuProximity: PIX model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112 networkInterface: ens21f0np0 partNumber: 900-9D3D4-00EN-HA0 pciAddress: 0000:f5:00.0 psid: mt_0000001069 rdmaDevice: mlx5_2 data: numaNode: 1 rail: 3 traffic: east-west subtype: pfs type: NetworkTopology metadata: source-node: "" timestamp: "2026-06-25T10:40:47Z" version: dev CLI invocation and l8k discovery log (truncated to network-relevant lines; the proc/sys collectors warn-and-skip because the local-mode host isn't a cluster node): AICR_AGENT_MODE=true aicr snapshot \ --discover-network \ --kubeconfig ~/workspace/k8s-launch-kit/rtx-cluster/kubeconfig \ -o /Users/amaslennikov/workspace/tmp/aicr-discovery/discovery.yaml [cli] collecting SystemD service configurations [cli] collecting GPU information [cli] collecting node topology information [cli] collecting OS configuration [cli] failed to collect grub - skipping: collector=grub error=[INTERNAL] failed to read GRUB params from /proc/cmdline: [INTERNAL] failed to open file "/proc/cmdline": open /proc/cmdline: no such file or directory [cli] collecting network topology via l8k discovery [cli] failed to collect sysctl - skipping: collector=sysctl error=[INTERNAL] failed to collect sysctl parameters: [INTERNAL] failed to walk directory /proc/sys: lstat /proc/sys: no such file or directory [cli] failed to collect kmod - skipping: collector=kmod error=[INTERNAL] failed to read kernel modules from /proc/modules: [INTERNAL] failed to open file "/proc/modules": open /proc/modules: no such file or directory [cli] GPU hardware detection failed: error=[INTERNAL] NFD PCI discovery failed: failed to detect PCI devices: open /sys/bus/pci/devices: no such file or directory [cli] failed to collect release - skipping: collector=release error=[INTERNAL] failed to read os release from /usr/lib/os-release: [INTERNAL] failed to open file "/usr/lib/os-release": open /usr/lib/os-release: no such file or directory [cli] collecting Kubernetes cluster information [cli] D-Bus not available - no systemd data will be collected: error=dial unix /run/systemd/private: connect: no such file or directory hint=systemd/D-Bus is required for service status collection [cli] failed to collect node - skipping: collector=node error=[INVALID_REQUEST] node name not set in environment [cli] node topology collection complete: nodes=4 taints=1 labels=232 [cli] Bootstrapping NIC configuration daemon for discovery: namespace=nvidia-k8s-launch-kit image=nvcr.io/nvidia/mellanox/nic-configuration-operator-daemon:network-operator-v26.4.0 [cli] Filtered nodes by NVIDIA NIC presence: candidates=4 withNICs=4 [cli] Skipping discovered group with no east-west NICs (north-south only); not added to cluster-config: nodes=[mdf-prod-2852-k8m-02] pfCount=2 [cli] Discovery summary: groupCount=1 eastWestPFs=4 northSouthPFsFiltered=2 presetMatches=1 presetDeviationGroups=0 machineLabelledGroups=1 [cli] Deleted bootstrap namespace: namespace=nvidia-k8s-launch-kit [cli] network topology collection complete: identifier=thinksystem-sr675-v3-nvidia-rtx-pro-6000-blackwell-ser-20a0a715 machineType=ThinkSystem-SR675-V3 gpuType=NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition pfs=6 rails=4 # Not in this commit - ConfigMap mounting for Job-mode --cluster-config — file source works in local mode only today. - Multi-group NetworkTopology emission. The single-group enforcement is in toMeasurement; lifting it requires the multi-Measurement-per-Type consumer rewrites in constraints extractor, recipe validation, diff indexing, and fingerprint that are tracked as a follow-up phase.
d091f52 to
d1d12aa
Compare
mchmarny
left a comment
There was a problem hiding this comment.
All three of my prior comments are addressed in d1d12aa6 — verified against the current head:
docs/user/cli-reference.md— both--cluster-configand--discover-networkrows added to theaicr snapshottable with full usage, mutual-exclusion, and local-mode-vs-Job-mode constraints. ✅pkg/k8s/client/client.go— doc comment now correctly scopes the "clientcmd merges" claim to loading-rules-aware callers and explicitly distinguishesBuildKubeClient(falls through toInClusterConfig()on empty path). Behavior matches: multi-fileKUBECONFIGreturns""instead of silently substituting~/.kube/config. ✅pkg/k8s/agent/rbac.go— non-blocking operator-awareness nit; the broad cluster-scoped grant (pods/exec, roles/rolebindings, namespaces) is now documented per-rule indiscoverNetworkClusterRules()and at the call site, gated behind--discover-network. Namespace-scoping deferred to an l8k-side follow-up, as expected. ✅
Original verify-licenses blocker is resolved (MPL-2.0 go-multierror gone via the l8k discovery split). LGTM.
Note for merge: branch is behind main — rebase before merge, and let the in-flight GPU/E2E/analyze checks finish.
Summary
Adds a
NetworkTopologyMeasurement collector that ingests k8s-launch-kit (l8k) cluster-config — either from a file (--cluster-config <path>) or via live discovery (--discover-network) — and emits the canonicalNetworkTopologyshape documented indocs/integrator/measurement-api.md.Re-submission of #1472 (closed). The blocker on the prior PR was
verify-licensesrejecting MPL-2.0hashicorp/go-multierror, transitively pulled in via Helm SDK frompkg/networkoperatorplugin. l8k PR #104 (merged asa99c9684) split discovery into a Helm-free sub-packagepkg/networkoperatorplugin/discovery; this PR pins that merge and migrates the import. The MPL-2.0 dep is gone and the aicr vendor tree shrinks by ~950 files.Motivation / Context
aicr's snapshot model gained a
NetworkTopologyMeasurement type in Phase 1 (shape contract indocs/integrator/measurement-api.md) but nothing populated it. This PR wires the producer side: per-hardware-group PFs (PCI/PSID/partNumber/RDMA-device/rail/NUMA), per-group capabilities (sriov/rdma/ib),linkType, kernel-module lists, machine/GPU type, recommended firmware — all captured by l8k discovery and translated into one Measurement.Fixes: N/A
Related: closes #1472 (this PR re-submits with the verify-licenses gate unblocked); l8k PR #104 (the upstream split that made this possible)
Type of Change
Component(s) Affected
pkg/collector,pkg/snapshotter)cmd/aicr,pkg/cli) — new--cluster-config/--discover-networkflagspkg/errors,pkg/k8s) — addedclient.ResolveKubeconfigPathhelperImplementation Notes
New package
pkg/collector/networkwith two source modes:ClusterConfigPath— parsecluster-config.yamlfrom disk; no cluster contact.DiscoverNetwork— call l8k's liveDiscoveragainst the cluster. NOT read-only: writesnvidia.kubernetes-launch-kit.{machine,gpu}labels and patchesNicClusterPolicyvia server-side-apply.When both are set, file path wins. When neither is set, the collector is "inactive" —
Collectreturns(nil, nil). Multi-group input is rejected withErrCodeInvalidRequest+groupCountcontext; multi-group support is a future iteration.Translator converts
l8kconfig.LaunchKitConfig→NetworkTopologyMeasurement withidentity / capabilities / pfs / kernel-modulessubtypes. Per-PF Context carriesmodel,connectedGPU,gpuProximityalongsidepciAddress,deviceID,psid,partNumber,rdmaDevice,networkInterface. Per-PF Data carriesrail,numaNode,traffic.Cross-repo dependency.
go.modpinsgithub.com/nvidia/k8s-launch-kit v0.0.0-20260625130457-a99c9684627e— the merge of l8k PR #104. No localreplace. The discovery sub-package is Helm-free by construction;go list -deps ./pkg/collector/network | grep -E 'helm\.sh|hashicorp/go-multierror'is empty.CLI:
aicr snapshot --cluster-config <path> | --discover-networkwith env-var bindings (AICR_CLUSTER_CONFIG_PATH,AICR_DISCOVER_NETWORK) for in-pod Job mode. In local mode (AICR_AGENT_MODE=true), the CLI seedsKUBECONFIGfrom--kubeconfigbefore calling Measure so the l8k client and aicr's K8s client resolve the same path.Job-mode RBAC for
--discover-networklives inpkg/k8s/agent/rbac.go:ensureClusterRoleconditionally appendsdiscoverNetworkClusterRules()(CRDs, namespaces, daemonsets, serviceaccounts/configmaps, roles/rolebindings, pods/exec, nodes/patch,configuration.net.nvidia.com/nicdevices,mellanox.com/nicclusterpolicies) whenConfig.DiscoverNetworkis true. Non-network snapshots keep the existing minimal-priv 3-rule ClusterRole.Shared helper
pkg/k8s/client.ResolveKubeconfigPath— lifted the explicit-path →KUBECONFIGenv →~/.kube/config→ in-cluster chain out ofBuildKubeClientso aicr's K8s client and the l8k discovery client resolve kubeconfigs identically.Snapshotter hardening:
collectSafenow skips nil Measurements (required for the network collector's inactive-mode contract; harmless to existing collectors).Testing
make qualify # full gatego test -race ./pkg/collector/network/...— all tests pass; 9 cases intranslate_test.go, 6 cases innetwork_test.gocovering inactive contract / file-mode success / missing path / invalid YAML / multi-group rejection / file-path-wins-over-discover.go test -race ./pkg/k8s/agent/...— the new discover-network RBAC subtest pins that the extra rules (mellanox.com/nicclusterpolicies patch, pods/exec create, nodes/patch) land whenConfig.DiscoverNetworkis set.golangci-lint run ./pkg/...— 0 issues.go list -deps ./pkg/collector/network | grep -E 'helm\.sh|hashicorp/go-multierror|containerd|lib/pq|jmoiron/sqlx'— empty.End-to-end discovery smoke-tested against a 4-node ThinkSystem SR675-V3 RTX-PRO-6000 cluster —
AICR_AGENT_MODE=true aicr snapshot --discover-networkwrites a NetworkTopology Measurement with the contracted shape includinggpuProximity/connectedGPUper-PF Context overlaid from l8k preset topology. The output was captured in the closed PR #1472's commit message; behaviour is unchanged from there modulo the API migration.Risk Assessment
--cluster-config/--discover-network); off-by-default for existing snapshot users. Discovery is cluster-mutating (writes node labels + patchesNicClusterPolicy) but only when the user opts in.Rollout notes: Backwards compatible for existing snapshot users — no new flags required. Job-mode
--discover-networkrequires the new RBAC rules inensureClusterRole; the ClusterRole is recreated on each snapshot Job deploy so existing deployments pick it up automatically. The l8k pseudo-version is the canonical merged-main commit, not a branch tip.Not in this PR
--cluster-config— file source works in local mode (AICR_AGENT_MODE=true) today.toMeasurement; lifting it requires multi-Measurement-per-Type consumer rewrites in constraints extractor, recipe validation, diff indexing, and fingerprint — tracked as a follow-up phase.Checklist
make testwith-race)golangci-lint run ./pkg/...)docs/integrator/measurement-api.mdupdated in the schema-foundations work; this PR's new flags are CLI-only)git commit -S)