feat(network): NetworkTopology collector + l8k library integration#1472
feat(network): NetworkTopology collector + l8k library integration#1472almaslennikov wants to merge 1 commit into
Conversation
# Why aicr's snapshot model gained a NetworkTopology Measurement type in Phase 1 but nothing populated it. Phase 3 wires the producer: a new collector that ingests k8s-launch-kit (l8k) cluster-config data — either from a file or via live discovery — and translates it into the canonical NetworkTopology shape documented in docs/integrator/measurement-api.md. This is the consumer side of the cross-repo integration. l8k's library API landed upstream as PR NVIDIA#102 (merged commit f00c3cdde8c9); aicr pins the corresponding pseudo-version and uses two l8k entry points: ParseClusterConfig for the file-source path and Discover for the live path. # What changes ## New package pkg/collector/network - Collector type with two source modes, selected by struct fields: ClusterConfigPath (file source — no cluster contact) and DiscoverNetwork (live l8k discovery — writes node labels, patches NicClusterPolicy). When both are set, file path wins so callers can default --discover-network from a flag without inadvertent cluster contact. When neither is set, the collector is inactive — Collect returns (nil, nil) and the snapshotter's collectSafe wrapper (hardened in this commit) treats nil Measurement as a no-op. - Translator (translate.go) converts l8k.LaunchKitConfig into the contracted NetworkTopology Measurement: one Measurement with identity / capabilities / pfs (Items) / kernel-modules subtypes. Multi-group input is rejected with ErrCodeInvalidRequest and a groupCount field on the error context. Multi-group support is a planned future iteration; rejection guards the snapshot from silently dropping groups. - Per-PF Context carries model, connectedGPU, gpuProximity in addition to pciAddress, deviceID, psid, partNumber, rdmaDevice, networkInterface — every descriptive identifier l8k populates. Per-PF Data carries rail, numaNode, traffic (scalars). - Routes controller-runtime's internal logger through aicr's slog via l8kplugin.WithLogger(logr.FromSlogHandler(...)). Without this, callers hit a one-time "log.SetLogger(...) was never called" warning from controller-runtime and lose l8k's structured discovery output. ## CLI: aicr snapshot --cluster-config / --discover-network - --cluster-config <path> (env: AICR_CLUSTER_CONFIG_PATH) ingests a pre-existing l8k cluster-config.yaml. - --discover-network (env: AICR_DISCOVER_NETWORK) opts into live discovery. Discovery is NOT read-only. - Both flags off-by-default; an explicit opt-in is required because discovery is cluster-mutating and the file-source path needs a user-supplied YAML. - In AICR_AGENT_MODE=true (developer-bypass that runs collectors in-process), the CLI seeds KUBECONFIG from --kubeconfig before calling Measure so the network collector's l8k client resolves the same kubeconfig the rest of aicr uses. ## snapshotter.AgentConfig + agent.Config + Job env - ClusterConfigPath and DiscoverNetwork on snapshotter.AgentConfig forward into agent.Config, which renders AICR_CLUSTER_CONFIG_PATH and AICR_DISCOVER_NETWORK env vars on the agent Job. - In Job-mode, the in-pod CLI consumes those env vars via the flag's Sources binding (cli.EnvVars), so the same factory construction path serves both local and Job execution. - collectSafe in pkg/snapshotter/snapshot.go now skips nil Measurements — required for the network collector's inactive-mode contract; harmless to existing collectors that always return a Measurement. ## Job-mode RBAC for --discover-network - pkg/k8s/agent/rbac.go ensureClusterRole conditionally appends a discoverNetworkClusterRules() rule set when deployer.config.DiscoverNetwork is true. Non-network snapshots stay minimal-priv; discover-network snapshots gain the cluster- scoped grants l8k actually exercises: * apiextensions.k8s.io customresourcedefinitions (get/list/ create/update/patch) — l8k installs the nic-configuration- operator CRDs if missing. * namespaces (get/create/delete) — the nvidia-k8s-launch-kit bootstrap namespace, created and deleted each run. * apps daemonsets (get/list/watch/create/delete) — the nic- configuration-daemon bootstrap DaemonSet. * serviceaccounts/configmaps (get/create/delete) and rbac.authorization.k8s.io roles/rolebindings (get/create/ delete) — the daemon's supporting resources. * pods/exec create — probing daemon pods for VPD / link metadata. * nodes patch — writing the nvidia.kubernetes-launch-kit. machine and .gpu labels onto matched nodes. * configuration.net.nvidia.com nicdevices (get/list) — the per-PF CRs the daemon publishes. * mellanox.com nicclusterpolicies (get/patch) — server-side apply of the NicConfigurationOperator section onto the user's NicClusterPolicy. - Job-mode --discover-network is now usable without escalating the Job ServiceAccount to cluster-admin. ## Shared helper pkg/k8s/client.ResolveKubeconfigPath - Lifted the explicit-path → KUBECONFIG env → ~/.kube/config → in-cluster resolution chain out of BuildKubeClient into an exported helper. The network collector uses it before handing a path to l8k's kubeclient.New, so aicr's K8s client and the l8k discovery client resolve kubeconfigs identically. ## Factory wiring + defaults - pkg/collector/factory.go gains CreateNetworkCollector() and three Option functions (WithClusterConfigPath, WithDiscoverNetwork, WithKubeconfigPath). Existing patterns matched (struct-literal construction, no New on the collector itself). - pkg/defaults adds CollectorNetworkTimeout (10m) — capped above l8k's internal DaemonSet wait — and MaxClusterConfigBytes (1 MiB) for the file-source io.LimitReader. ## Cross-repo dependency - go.mod requires github.com/nvidia/k8s-launch-kit at the merged PR NVIDIA#102 SHA (pseudo-version v0.0.0-20260625104911-f00c3cdde8c9). No local replace — pure upstream pin. - vendor/ tree includes l8k and its transitive deps (controller- runtime client/scheme, Mellanox network-operator API types, nic-configuration-operator API types). aicr does not register any of those schemes itself — scheme ownership stays with l8k. # Testing - pkg/collector/network/translate_test.go: 9 cases covering single- group happy path, every subtype's shape, kernel modules, multi- group rejection, nil input, empty ClusterConfig, no-PFs (pfs subtype skipped). - pkg/collector/network/network_test.go: 6 cases covering inactive contract, file mode success, missing path, invalid YAML, multi- group surfaced from collector, file-path-wins-over-discover. Tests use real YAML round-tripped through l8k's parser — no test seams. - pkg/k8s/agent/deployer_test.go: a new TestDeployer_EnsureRBAC subtest asserts the discover-network rule set lands when Config.DiscoverNetwork is set (mellanox.com/nicclusterpolicies patch, pods/exec create, nodes patch all present); the existing default subtest still asserts exactly 3 rules so non-discovery snapshots stay minimal-priv. - go test -race ./... passes. Lint clean (golangci-lint v2.12.2, pinned in .settings.yaml). # End-to-end discovery against a real cluster Running --discover-network in local mode against a 4-node ThinkSystem SR675-V3 RTX-PRO-6000 cluster produced a NetworkTopology Measurement with the expected identity / capabilities / pfs shape, including the gpuProximity / connectedGPU per-PF Context that l8k overlays from preset topology: apiVersion: aicr.nvidia.com/v1alpha1 fingerprint: accelerator: source: nodeTopology.label.nvidia.com/gpu.product value: rtx-pro-6000 gpuNodeCount: source: nodeTopology.label.nvidia.com/gpu.product value: 3 k8sVersion: source: k8s.server.version value: 1.34.3 nodeCount: source: nodeTopology.summary.node-count value: 4 os: value: "" service: value: "" kind: Snapshot measurements: # ... - subtypes: - context: gpuType: NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition identifier: thinksystem-sr675-v3-nvidia-rtx-pro-6000-blackwell-ser-20a0a715 linkType: Ethernet machineType: ThinkSystem-SR675-V3 nodeSelector: nvidia.kubernetes-launch-kit.machine=ThinkSystem-SR675-V3-NVIDIA-RTX-PRO-6000-Blackwell-Ser-20a0a715 data: pf-count: 6 rail-count: 4 subtype: identity - data: ib: true rdma: true sriov: true subtype: capabilities - items: - context: connectedGPU: GPU0 deviceID: a2dc gpuProximity: PIX model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112 networkInterface: ens15f0np0 partNumber: 900-9D3D4-00EN-HA0 pciAddress: "0000:05:00.0" psid: mt_0000001069 rdmaDevice: mlx5_1 data: numaNode: 0 rail: 0 traffic: east-west - context: deviceID: a2dc model: BlueField-3 P-Series DPU 200GbE/NDR200 dual-port QSFP112 networkInterface: ens1f0np0 partNumber: 900-9D3B6-00CV-AA0 pciAddress: 0000:61:00.0 psid: mt_0000000884 rdmaDevice: mlx5_bond_0 data: numaNode: 0 traffic: north-south - context: deviceID: a2dc model: BlueField-3 P-Series DPU 200GbE/NDR200 dual-port QSFP112 networkInterface: ens1f1np1 partNumber: 900-9D3B6-00CV-AA0 pciAddress: 0000:61:00.1 psid: mt_0000000884 data: numaNode: 0 traffic: north-south - context: connectedGPU: GPU2 deviceID: a2dc gpuProximity: PIX model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112 networkInterface: ens16f0np0 partNumber: 900-9D3D4-00EN-HA0 pciAddress: 0000:75:00.0 psid: mt_0000001069 rdmaDevice: mlx5_0 data: numaNode: 0 rail: 1 traffic: east-west - context: connectedGPU: GPU4 deviceID: a2dc gpuProximity: PIX model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112 networkInterface: ens20f0np0 partNumber: 900-9D3D4-00EN-HA0 pciAddress: 0000:85:00.0 psid: mt_0000001069 rdmaDevice: mlx5_3 data: numaNode: 1 rail: 2 traffic: east-west - context: connectedGPU: GPU6 deviceID: a2dc gpuProximity: PIX model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112 networkInterface: ens21f0np0 partNumber: 900-9D3D4-00EN-HA0 pciAddress: 0000:f5:00.0 psid: mt_0000001069 rdmaDevice: mlx5_2 data: numaNode: 1 rail: 3 traffic: east-west subtype: pfs type: NetworkTopology metadata: source-node: "" timestamp: "2026-06-25T10:40:47Z" version: dev CLI invocation and l8k discovery log (truncated to network-relevant lines; the proc/sys collectors warn-and-skip because the local-mode host isn't a cluster node): AICR_AGENT_MODE=true aicr snapshot \ --discover-network \ --kubeconfig ~/workspace/k8s-launch-kit/rtx-cluster/kubeconfig \ -o /Users/amaslennikov/workspace/tmp/aicr-discovery/discovery.yaml [cli] collecting SystemD service configurations [cli] collecting GPU information [cli] collecting node topology information [cli] collecting OS configuration [cli] failed to collect grub - skipping: collector=grub error=[INTERNAL] failed to read GRUB params from /proc/cmdline: [INTERNAL] failed to open file "/proc/cmdline": open /proc/cmdline: no such file or directory [cli] collecting network topology via l8k discovery [cli] failed to collect sysctl - skipping: collector=sysctl error=[INTERNAL] failed to collect sysctl parameters: [INTERNAL] failed to walk directory /proc/sys: lstat /proc/sys: no such file or directory [cli] failed to collect kmod - skipping: collector=kmod error=[INTERNAL] failed to read kernel modules from /proc/modules: [INTERNAL] failed to open file "/proc/modules": open /proc/modules: no such file or directory [cli] GPU hardware detection failed: error=[INTERNAL] NFD PCI discovery failed: failed to detect PCI devices: open /sys/bus/pci/devices: no such file or directory [cli] failed to collect release - skipping: collector=release error=[INTERNAL] failed to read os release from /usr/lib/os-release: [INTERNAL] failed to open file "/usr/lib/os-release": open /usr/lib/os-release: no such file or directory [cli] collecting Kubernetes cluster information [cli] D-Bus not available - no systemd data will be collected: error=dial unix /run/systemd/private: connect: no such file or directory hint=systemd/D-Bus is required for service status collection [cli] failed to collect node - skipping: collector=node error=[INVALID_REQUEST] node name not set in environment [cli] node topology collection complete: nodes=4 taints=1 labels=232 [cli] Bootstrapping NIC configuration daemon for discovery: namespace=nvidia-k8s-launch-kit image=nvcr.io/nvidia/mellanox/nic-configuration-operator-daemon:network-operator-v26.4.0 [cli] Filtered nodes by NVIDIA NIC presence: candidates=4 withNICs=4 [cli] Skipping discovered group with no east-west NICs (north-south only); not added to cluster-config: nodes=[mdf-prod-2852-k8m-02] pfCount=2 [cli] Discovery summary: groupCount=1 eastWestPFs=4 northSouthPFsFiltered=2 presetMatches=1 presetDeviationGroups=0 machineLabelledGroups=1 [cli] Deleted bootstrap namespace: namespace=nvidia-k8s-launch-kit [cli] network topology collection complete: identifier=thinksystem-sr675-v3-nvidia-rtx-pro-6000-blackwell-ser-20a0a715 machineType=ThinkSystem-SR675-V3 gpuType=NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition pfs=6 rails=4 # Not in this commit - ConfigMap mounting for Job-mode --cluster-config — file source works in local mode only today. - Multi-group NetworkTopology emission. The single-group enforcement is in toMeasurement; lifting it requires the multi-Measurement-per-Type consumer rewrites in constraints extractor, recipe validation, diff indexing, and fingerprint that are tracked as a follow-up phase.
aee0c34 to
9ea0f96
Compare
|
Caution Review failedPull request was closed or merged during review 📝 Walkthrough📝 WalkthroughWalkthroughAdds a network topology collector that can read a cluster-config file or run live discovery, translates LaunchKit config into Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/collector/network/network_test.go`:
- Around line 99-144: Consolidate the repeated file-mode error-path coverage in
Collector.Collect into a single table-driven test that exercises the missing
path, invalid YAML, and multi-group rejection cases using shared setup/assertion
logic. Keep the existing checks for errors.StructuredError codes and groupCount
context, but move case-specific inputs and expectations into a case list keyed
by the relevant symbols TestCollect_FileMode_MissingPath,
TestCollect_FileMode_InvalidYAML, and TestCollect_FileMode_MultiGroupRejected.
In `@pkg/collector/network/network.go`:
- Around line 112-121: The cluster-config size check in network.go is being
applied with io.LimitReader before l8kplugin.ParseClusterConfig, which can hide
oversized inputs as parse errors; instead, read or inspect the file size in the
caller around the ParseClusterConfig flow, enforce
defaults.MaxClusterConfigBytes before parsing, and return an explicit
errors.ErrCodeInvalidRequest with the cluster-config path when the limit is
exceeded.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| func TestCollect_FileMode_MissingPath(t *testing.T) { | ||
| _, err := (&Collector{ClusterConfigPath: "/definitely/not/a/real/file.yaml"}).Collect(context.Background()) | ||
| if err == nil { | ||
| t.Fatal("expected error for missing file, got nil") | ||
| } | ||
| se := (*errors.StructuredError)(nil) | ||
| if !stderrors.As(err, &se) || se.Code != errors.ErrCodeNotFound { | ||
| t.Errorf("expected ErrCodeNotFound, got %v", err) | ||
| } | ||
| } | ||
|
|
||
| func TestCollect_FileMode_InvalidYAML(t *testing.T) { | ||
| path := writeTempConfig(t, "not: valid: yaml: at: all: : :") | ||
| _, err := (&Collector{ClusterConfigPath: path}).Collect(context.Background()) | ||
| if err == nil { | ||
| t.Fatal("expected error for invalid YAML, got nil") | ||
| } | ||
| se := (*errors.StructuredError)(nil) | ||
| if !stderrors.As(err, &se) || se.Code != errors.ErrCodeInvalidRequest { | ||
| t.Errorf("expected ErrCodeInvalidRequest, got %v", err) | ||
| } | ||
| } | ||
|
|
||
| // TestCollect_FileMode_MultiGroupRejected anchors that the collector | ||
| // surfaces the translator's single-group enforcement faithfully — a | ||
| // multi-group source returns ErrCodeInvalidRequest with groupCount set. | ||
| func TestCollect_FileMode_MultiGroupRejected(t *testing.T) { | ||
| multi := singleGroupYAML + ` | ||
| - identifier: test-group-2 | ||
| machineType: TestMachine2 | ||
| gpuType: TestGPU | ||
| pfs: [] | ||
| ` | ||
| path := writeTempConfig(t, multi) | ||
| _, err := (&Collector{ClusterConfigPath: path}).Collect(context.Background()) | ||
| if err == nil { | ||
| t.Fatal("expected ErrCodeInvalidRequest for multi-group input, got nil") | ||
| } | ||
| se := (*errors.StructuredError)(nil) | ||
| if !stderrors.As(err, &se) || se.Code != errors.ErrCodeInvalidRequest { | ||
| t.Fatalf("expected ErrCodeInvalidRequest, got %v", err) | ||
| } | ||
| if got := se.Context["groupCount"]; got != 2 { | ||
| t.Errorf("Context[groupCount] = %v, want 2", got) | ||
| } | ||
| } |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Convert the repeated file-mode error-path tests into a table-driven test.
TestCollect_FileMode_MissingPath, TestCollect_FileMode_InvalidYAML, and TestCollect_FileMode_MultiGroupRejected follow the same structure and should be consolidated into a table-driven case set for consistency and easier extension. As per coding guidelines, tests with multiple cases should be table-driven.
Proposed refactor sketch
-func TestCollect_FileMode_MissingPath(t *testing.T) { ... }
-func TestCollect_FileMode_InvalidYAML(t *testing.T) { ... }
-func TestCollect_FileMode_MultiGroupRejected(t *testing.T) { ... }
+func TestCollect_FileMode_Errors(t *testing.T) {
+ cases := []struct {
+ name string
+ path func(t *testing.T) string
+ wantCode errors.ErrorCode
+ checkMore func(t *testing.T, se *errors.StructuredError)
+ }{
+ // missing path / invalid yaml / multi-group...
+ }
+ for _, tc := range cases {
+ t.Run(tc.name, func(t *testing.T) {
+ _, err := (&Collector{ClusterConfigPath: tc.path(t)}).Collect(context.Background())
+ if err == nil {
+ t.Fatal("expected error, got nil")
+ }
+ se := (*errors.StructuredError)(nil)
+ if !stderrors.As(err, &se) || se.Code != tc.wantCode {
+ t.Fatalf("expected %v, got %v", tc.wantCode, err)
+ }
+ if tc.checkMore != nil {
+ tc.checkMore(t, se)
+ }
+ })
+ }
+}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func TestCollect_FileMode_MissingPath(t *testing.T) { | |
| _, err := (&Collector{ClusterConfigPath: "/definitely/not/a/real/file.yaml"}).Collect(context.Background()) | |
| if err == nil { | |
| t.Fatal("expected error for missing file, got nil") | |
| } | |
| se := (*errors.StructuredError)(nil) | |
| if !stderrors.As(err, &se) || se.Code != errors.ErrCodeNotFound { | |
| t.Errorf("expected ErrCodeNotFound, got %v", err) | |
| } | |
| } | |
| func TestCollect_FileMode_InvalidYAML(t *testing.T) { | |
| path := writeTempConfig(t, "not: valid: yaml: at: all: : :") | |
| _, err := (&Collector{ClusterConfigPath: path}).Collect(context.Background()) | |
| if err == nil { | |
| t.Fatal("expected error for invalid YAML, got nil") | |
| } | |
| se := (*errors.StructuredError)(nil) | |
| if !stderrors.As(err, &se) || se.Code != errors.ErrCodeInvalidRequest { | |
| t.Errorf("expected ErrCodeInvalidRequest, got %v", err) | |
| } | |
| } | |
| // TestCollect_FileMode_MultiGroupRejected anchors that the collector | |
| // surfaces the translator's single-group enforcement faithfully — a | |
| // multi-group source returns ErrCodeInvalidRequest with groupCount set. | |
| func TestCollect_FileMode_MultiGroupRejected(t *testing.T) { | |
| multi := singleGroupYAML + ` | |
| - identifier: test-group-2 | |
| machineType: TestMachine2 | |
| gpuType: TestGPU | |
| pfs: [] | |
| ` | |
| path := writeTempConfig(t, multi) | |
| _, err := (&Collector{ClusterConfigPath: path}).Collect(context.Background()) | |
| if err == nil { | |
| t.Fatal("expected ErrCodeInvalidRequest for multi-group input, got nil") | |
| } | |
| se := (*errors.StructuredError)(nil) | |
| if !stderrors.As(err, &se) || se.Code != errors.ErrCodeInvalidRequest { | |
| t.Fatalf("expected ErrCodeInvalidRequest, got %v", err) | |
| } | |
| if got := se.Context["groupCount"]; got != 2 { | |
| t.Errorf("Context[groupCount] = %v, want 2", got) | |
| } | |
| } | |
| func TestCollect_FileMode_Errors(t *testing.T) { | |
| cases := []struct { | |
| name string | |
| path func(t *testing.T) string | |
| wantCode errors.ErrorCode | |
| checkMore func(t *testing.T, se *errors.StructuredError) | |
| }{ | |
| { | |
| name: "missing path", | |
| path: func(t *testing.T) string { | |
| return "/definitely/not/a/real/file.yaml" | |
| }, | |
| wantCode: errors.ErrCodeNotFound, | |
| }, | |
| { | |
| name: "invalid yaml", | |
| path: func(t *testing.T) string { | |
| return writeTempConfig(t, "not: valid: yaml: at: all: : :") | |
| }, | |
| wantCode: errors.ErrCodeInvalidRequest, | |
| }, | |
| { | |
| name: "multi group rejected", | |
| path: func(t *testing.T) string { | |
| multi := singleGroupYAML + ` | |
| - identifier: test-group-2 | |
| machineType: TestMachine2 | |
| gpuType: TestGPU | |
| pfs: [] | |
| ` | |
| return writeTempConfig(t, multi) | |
| }, | |
| wantCode: errors.ErrCodeInvalidRequest, | |
| checkMore: func(t *testing.T, se *errors.StructuredError) { | |
| if got := se.Context["groupCount"]; got != 2 { | |
| t.Errorf("Context[groupCount] = %v, want 2", got) | |
| } | |
| }, | |
| }, | |
| } | |
| for _, tc := range cases { | |
| t.Run(tc.name, func(t *testing.T) { | |
| _, err := (&Collector{ClusterConfigPath: tc.path(t)}).Collect(context.Background()) | |
| if err == nil { | |
| t.Fatal("expected error, got nil") | |
| } | |
| se := (*errors.StructuredError)(nil) | |
| if !stderrors.As(err, &se) || se.Code != tc.wantCode { | |
| t.Fatalf("expected %v, got %v", tc.wantCode, err) | |
| } | |
| if tc.checkMore != nil { | |
| tc.checkMore(t, se) | |
| } | |
| }) | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/collector/network/network_test.go` around lines 99 - 144, Consolidate the
repeated file-mode error-path coverage in Collector.Collect into a single
table-driven test that exercises the missing path, invalid YAML, and multi-group
rejection cases using shared setup/assertion logic. Keep the existing checks for
errors.StructuredError codes and groupCount context, but move case-specific
inputs and expectations into a case list keyed by the relevant symbols
TestCollect_FileMode_MissingPath, TestCollect_FileMode_InvalidYAML, and
TestCollect_FileMode_MultiGroupRejected.
Source: Coding guidelines
| // Read +1 over the cap so an overshoot triggers the size error | ||
| // rather than silently truncating into a parse failure later. | ||
| limited := io.LimitReader(f, defaults.MaxClusterConfigBytes+1) | ||
|
|
||
| cfg, err := l8kplugin.ParseClusterConfig(limited) | ||
| if err != nil { | ||
| return nil, errors.WrapWithContext(errors.ErrCodeInvalidRequest, | ||
| "failed to parse cluster-config file", err, | ||
| map[string]any{"path": c.ClusterConfigPath}) | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
git ls-files 'pkg/collector/network/*' 'pkg/**' | sed -n '1,200p'
echo '---'
rg -n "ParseClusterConfig|MaxClusterConfigBytes|LimitReader|cluster-config" pkg vendor -g '*.go'Repository: NVIDIA/aicr
Length of output: 29085
🏁 Script executed:
#!/bin/bash
set -euo pipefail
fd -a 'network.go' pkg
echo '---'
fd -a 'cluster_config' .
echo '---'
rg -n "func .*ParseClusterConfig|ParseClusterConfig\(" -g '*.go' .Repository: NVIDIA/aicr
Length of output: 495
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "ParseClusterConfig|MaxClusterConfigBytes|cluster-config" . -g '*.go' -g '*.mod' -g '*.sum'Repository: NVIDIA/aicr
Length of output: 6566
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '180,220p' vendor/github.com/NVIDIA/k8s-launch-kit/pkg/networkoperatorplugin/library.go
echo '---'
sed -n '100,130p' pkg/collector/network/network.go
echo '---'
sed -n '1,120p' pkg/bundler/bundler.goRepository: NVIDIA/aicr
Length of output: 7031
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '180,220p' vendor/github.com/NVIDIA/k8s-launch-kit/pkg/networkoperatorplugin/library.go
echo '---'
sed -n '100,130p' pkg/collector/network/network.go
echo '---'
sed -n '1,120p' pkg/bundler/bundler.goRepository: NVIDIA/aicr
Length of output: 7031
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '180,220p' vendor/github.com/NVIDIA/k8s-launch-kit/pkg/networkoperatorplugin/library.go
echo '---'
sed -n '100,130p' pkg/collector/network/network.go
echo '---'
sed -n '1,120p' pkg/bundler/bundler.goRepository: NVIDIA/aicr
Length of output: 7031
Enforce the cluster-config size cap before parsing
ParseClusterConfig reads the entire io.Reader, so io.LimitReader(..., MaxClusterConfigBytes+1) here only truncates oversized files and can turn them into a misleading parse error. Check the byte count before calling ParseClusterConfig, and return an explicit ErrCodeInvalidRequest when the file exceeds the cap.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/collector/network/network.go` around lines 112 - 121, The cluster-config
size check in network.go is being applied with io.LimitReader before
l8kplugin.ParseClusterConfig, which can hide oversized inputs as parse errors;
instead, read or inspect the file size in the caller around the
ParseClusterConfig flow, enforce defaults.MaxClusterConfigBytes before parsing,
and return an explicit errors.ErrCodeInvalidRequest with the cluster-config path
when the limit is exceeded.
Why
aicr's snapshot model gained a NetworkTopology Measurement type in Phase 1 but nothing populated it. Phase 3 wires the producer: a new collector that ingests k8s-launch-kit (l8k) cluster-config data — either from a file or via live discovery — and translates it into the canonical NetworkTopology shape documented in docs/integrator/measurement-api.md.
This is the consumer side of the cross-repo integration. l8k's library API landed upstream as PR #102 (merged commit f00c3cdde8c9); aicr pins the corresponding pseudo-version and uses two l8k entry points: ParseClusterConfig for the file-source path and Discover for the live path.
What changes
New package pkg/collector/network
Collector type with two source modes, selected by struct fields: ClusterConfigPath (file source — no cluster contact) and DiscoverNetwork (live l8k discovery — writes node labels, patches NicClusterPolicy). When both are set, file path wins so callers can default --discover-network from a flag without inadvertent cluster contact. When neither is set, the collector is inactive — Collect returns (nil, nil) and the snapshotter's collectSafe wrapper (hardened in this commit) treats nil Measurement as a no-op.
Translator (translate.go) converts l8k.LaunchKitConfig into the contracted NetworkTopology Measurement: one Measurement with identity / capabilities / pfs (Items) / kernel-modules subtypes. Multi-group input is rejected with ErrCodeInvalidRequest and a groupCount field on the error context. Multi-group support is a planned future iteration; rejection guards the snapshot from silently dropping groups.
Per-PF Context carries model, connectedGPU, gpuProximity in addition to pciAddress, deviceID, psid, partNumber, rdmaDevice, networkInterface — every descriptive identifier l8k populates. Per-PF Data carries rail, numaNode, traffic (scalars).
Routes controller-runtime's internal logger through aicr's slog via l8kplugin.WithLogger(logr.FromSlogHandler(...)). Without this, callers hit a one-time "log.SetLogger(...) was never called" warning from controller-runtime and lose l8k's structured discovery output.
CLI: aicr snapshot --cluster-config / --discover-network
snapshotter.AgentConfig + agent.Config + Job env
Job-mode RBAC for --discover-network
Shared helper pkg/k8s/client.ResolveKubeconfigPath
Factory wiring + defaults
Cross-repo dependency
Testing
End-to-end discovery against a real cluster
Running --discover-network in local mode against a 4-node ThinkSystem SR675-V3 RTX-PRO-6000 cluster produced a NetworkTopology Measurement with the expected identity / capabilities / pfs shape, including the gpuProximity / connectedGPU per-PF Context that l8k overlays from preset topology:
CLI invocation and l8k discovery log (truncated to network-relevant lines; the proc/sys collectors warn-and-skip because the local-mode host isn't a cluster node):
Not in this commit
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)