Skip to content

feat(network): NetworkTopology collector + l8k library integration#1472

Closed
almaslennikov wants to merge 1 commit into
NVIDIA:mainfrom
almaslennikov:network-collector
Closed

feat(network): NetworkTopology collector + l8k library integration#1472
almaslennikov wants to merge 1 commit into
NVIDIA:mainfrom
almaslennikov:network-collector

Conversation

@almaslennikov

Copy link
Copy Markdown
Contributor

Why

aicr's snapshot model gained a NetworkTopology Measurement type in Phase 1 but nothing populated it. Phase 3 wires the producer: a new collector that ingests k8s-launch-kit (l8k) cluster-config data — either from a file or via live discovery — and translates it into the canonical NetworkTopology shape documented in docs/integrator/measurement-api.md.

This is the consumer side of the cross-repo integration. l8k's library API landed upstream as PR #102 (merged commit f00c3cdde8c9); aicr pins the corresponding pseudo-version and uses two l8k entry points: ParseClusterConfig for the file-source path and Discover for the live path.

What changes

New package pkg/collector/network

  • Collector type with two source modes, selected by struct fields: ClusterConfigPath (file source — no cluster contact) and DiscoverNetwork (live l8k discovery — writes node labels, patches NicClusterPolicy). When both are set, file path wins so callers can default --discover-network from a flag without inadvertent cluster contact. When neither is set, the collector is inactive — Collect returns (nil, nil) and the snapshotter's collectSafe wrapper (hardened in this commit) treats nil Measurement as a no-op.

  • Translator (translate.go) converts l8k.LaunchKitConfig into the contracted NetworkTopology Measurement: one Measurement with identity / capabilities / pfs (Items) / kernel-modules subtypes. Multi-group input is rejected with ErrCodeInvalidRequest and a groupCount field on the error context. Multi-group support is a planned future iteration; rejection guards the snapshot from silently dropping groups.

  • Per-PF Context carries model, connectedGPU, gpuProximity in addition to pciAddress, deviceID, psid, partNumber, rdmaDevice, networkInterface — every descriptive identifier l8k populates. Per-PF Data carries rail, numaNode, traffic (scalars).

  • Routes controller-runtime's internal logger through aicr's slog via l8kplugin.WithLogger(logr.FromSlogHandler(...)). Without this, callers hit a one-time "log.SetLogger(...) was never called" warning from controller-runtime and lose l8k's structured discovery output.

CLI: aicr snapshot --cluster-config / --discover-network

  • --cluster-config (env: AICR_CLUSTER_CONFIG_PATH) ingests a pre-existing l8k cluster-config.yaml.
  • --discover-network (env: AICR_DISCOVER_NETWORK) opts into live discovery. Discovery is NOT read-only.
  • Both flags off-by-default; an explicit opt-in is required because discovery is cluster-mutating and the file-source path needs a user-supplied YAML.
  • In AICR_AGENT_MODE=true (developer-bypass that runs collectors in-process), the CLI seeds KUBECONFIG from --kubeconfig before calling Measure so the network collector's l8k client resolves the same kubeconfig the rest of aicr uses.

snapshotter.AgentConfig + agent.Config + Job env

  • ClusterConfigPath and DiscoverNetwork on snapshotter.AgentConfig forward into agent.Config, which renders AICR_CLUSTER_CONFIG_PATH and AICR_DISCOVER_NETWORK env vars on the agent Job.
  • In Job-mode, the in-pod CLI consumes those env vars via the flag's Sources binding (cli.EnvVars), so the same factory construction path serves both local and Job execution.
  • collectSafe in pkg/snapshotter/snapshot.go now skips nil Measurements — required for the network collector's inactive-mode contract; harmless to existing collectors that always return a Measurement.

Job-mode RBAC for --discover-network

  • pkg/k8s/agent/rbac.go ensureClusterRole conditionally appends a discoverNetworkClusterRules() rule set when deployer.config.DiscoverNetwork is true. Non-network snapshots stay minimal-priv; discover-network snapshots gain the cluster- scoped grants l8k actually exercises: * apiextensions.k8s.io customresourcedefinitions (get/list/ create/update/patch) — l8k installs the nic-configuration- operator CRDs if missing. * namespaces (get/create/delete) — the nvidia-k8s-launch-kit bootstrap namespace, created and deleted each run. * apps daemonsets (get/list/watch/create/delete) — the nic- configuration-daemon bootstrap DaemonSet. * serviceaccounts/configmaps (get/create/delete) and rbac.authorization.k8s.io roles/rolebindings (get/create/ delete) — the daemon's supporting resources. * pods/exec create — probing daemon pods for VPD / link metadata. * nodes patch — writing the nvidia.kubernetes-launch-kit. machine and .gpu labels onto matched nodes. * configuration.net.nvidia.com nicdevices (get/list) — the per-PF CRs the daemon publishes. * mellanox.com nicclusterpolicies (get/patch) — server-side apply of the NicConfigurationOperator section onto the user's NicClusterPolicy.
  • Job-mode --discover-network is now usable without escalating the Job ServiceAccount to cluster-admin.

Shared helper pkg/k8s/client.ResolveKubeconfigPath

  • Lifted the explicit-path → KUBECONFIG env → ~/.kube/config → in-cluster resolution chain out of BuildKubeClient into an exported helper. The network collector uses it before handing a path to l8k's kubeclient.New, so aicr's K8s client and the l8k discovery client resolve kubeconfigs identically.

Factory wiring + defaults

  • pkg/collector/factory.go gains CreateNetworkCollector() and three Option functions (WithClusterConfigPath, WithDiscoverNetwork, WithKubeconfigPath). Existing patterns matched (struct-literal construction, no New on the collector itself).
  • pkg/defaults adds CollectorNetworkTimeout (10m) — capped above l8k's internal DaemonSet wait — and MaxClusterConfigBytes (1 MiB) for the file-source io.LimitReader.

Cross-repo dependency

  • go.mod requires github.com/nvidia/k8s-launch-kit at the merged PR test: improve unit test coverage to 70% #102 SHA (pseudo-version v0.0.0-20260625104911-f00c3cdde8c9). No local replace — pure upstream pin.
  • vendor/ tree includes l8k and its transitive deps (controller- runtime client/scheme, Mellanox network-operator API types, nic-configuration-operator API types). aicr does not register any of those schemes itself — scheme ownership stays with l8k.

Testing

  • pkg/collector/network/translate_test.go: 9 cases covering single- group happy path, every subtype's shape, kernel modules, multi- group rejection, nil input, empty ClusterConfig, no-PFs (pfs subtype skipped).
  • pkg/collector/network/network_test.go: 6 cases covering inactive contract, file mode success, missing path, invalid YAML, multi- group surfaced from collector, file-path-wins-over-discover. Tests use real YAML round-tripped through l8k's parser — no test seams.
  • pkg/k8s/agent/deployer_test.go: a new TestDeployer_EnsureRBAC subtest asserts the discover-network rule set lands when Config.DiscoverNetwork is set (mellanox.com/nicclusterpolicies patch, pods/exec create, nodes patch all present); the existing default subtest still asserts exactly 3 rules so non-discovery snapshots stay minimal-priv.
  • go test -race ./... passes. Lint clean (golangci-lint v2.12.2, pinned in .settings.yaml).

End-to-end discovery against a real cluster

Running --discover-network in local mode against a 4-node ThinkSystem SR675-V3 RTX-PRO-6000 cluster produced a NetworkTopology Measurement with the expected identity / capabilities / pfs shape, including the gpuProximity / connectedGPU per-PF Context that l8k overlays from preset topology:

apiVersion: aicr.nvidia.com/v1alpha1
fingerprint:
  accelerator:
    source: nodeTopology.label.nvidia.com/gpu.product
    value: rtx-pro-6000
  gpuNodeCount:
    source: nodeTopology.label.nvidia.com/gpu.product
    value: 3
  k8sVersion:
    source: k8s.server.version
    value: 1.34.3
  nodeCount:
    source: nodeTopology.summary.node-count
    value: 4
  os:
    value: ""
  service:
    value: ""
kind: Snapshot
measurements:
  # ...
  - subtypes:
      - context: gpuType: NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition identifier: thinksystem-sr675-v3-nvidia-rtx-pro-6000-blackwell-ser-20a0a715 linkType: Ethernet machineType: ThinkSystem-SR675-V3 nodeSelector: nvidia.kubernetes-launch-kit.machine=ThinkSystem-SR675-V3-NVIDIA-RTX-PRO-6000-Blackwell-Ser-20a0a715 data: pf-count: 6 rail-count: 4 subtype: identity
      - data: ib: true rdma: true sriov: true subtype: capabilities - items:
          - context: connectedGPU: GPU0 deviceID: a2dc gpuProximity: PIX model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112 networkInterface: ens15f0np0 partNumber: 900-9D3D4-00EN-HA0 pciAddress: "0000:05:00.0" psid: mt_0000001069 rdmaDevice: mlx5_1 data: numaNode: 0 rail: 0 traffic: east-west
          - context: deviceID: a2dc model: BlueField-3 P-Series DPU 200GbE/NDR200 dual-port QSFP112 networkInterface: ens1f0np0 partNumber: 900-9D3B6-00CV-AA0 pciAddress: 0000:61:00.0 psid: mt_0000000884 rdmaDevice: mlx5_bond_0 data: numaNode: 0 traffic: north-south - context: deviceID: a2dc model: BlueField-3 P-Series DPU 200GbE/NDR200 dual-port QSFP112 networkInterface: ens1f1np1 partNumber: 900-9D3B6-00CV-AA0 pciAddress: 0000:61:00.1 psid: mt_0000000884 data: numaNode: 0 traffic: north-south - context: connectedGPU: GPU2 deviceID: a2dc gpuProximity: PIX model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112 networkInterface: ens16f0np0 partNumber: 900-9D3D4-00EN-HA0 pciAddress: 0000:75:00.0 psid: mt_0000001069 rdmaDevice: mlx5_0 data: numaNode: 0 rail: 1 traffic: east-west - context: connectedGPU: GPU4 deviceID: a2dc gpuProximity: PIX model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112 networkInterface: ens20f0np0 partNumber: 900-9D3D4-00EN-HA0 pciAddress: 0000:85:00.0 psid: mt_0000001069 rdmaDevice: mlx5_3 data: numaNode: 1 rail: 2 traffic: east-west - context: connectedGPU: GPU6 deviceID: a2dc gpuProximity: PIX model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112 networkInterface: ens21f0np0 partNumber: 900-9D3D4-00EN-HA0 pciAddress: 0000:f5:00.0 psid: mt_0000001069 rdmaDevice: mlx5_2 data: numaNode: 1 rail: 3 traffic: east-west subtype: pfs type: NetworkTopology metadata: source-node: "" timestamp: "2026-06-25T10:40:47Z" version: dev

CLI invocation and l8k discovery log (truncated to network-relevant lines; the proc/sys collectors warn-and-skip because the local-mode host isn't a cluster node):

AICR_AGENT_MODE=true aicr snapshot \
  --discover-network \
  --kubeconfig ~/workspace/k8s-launch-kit/rtx-cluster/kubeconfig \
  -o /Users/amaslennikov/workspace/tmp/aicr-discovery/discovery.yaml
[cli] collecting SystemD service configurations
[cli] collecting GPU information
[cli] collecting node topology information
[cli] collecting OS configuration
[cli] failed to collect grub - skipping: collector=grub error=[INTERNAL] failed to read GRUB params from /proc/cmdline: [INTERNAL] failed to open file "/proc/cmdline": open /proc/cmdline: no such file or directory
[cli] collecting network topology via l8k discovery
[cli] failed to collect sysctl - skipping: collector=sysctl error=[INTERNAL] failed to collect sysctl parameters: [INTERNAL] failed to walk directory /proc/sys: lstat /proc/sys: no such file or directory
[cli] failed to collect kmod - skipping: collector=kmod error=[INTERNAL] failed to read kernel modules from /proc/modules: [INTERNAL] failed to open file "/proc/modules": open /proc/modules: no such file or directory
[cli] GPU hardware detection failed: error=[INTERNAL] NFD PCI discovery failed: failed to detect PCI devices: open /sys/bus/pci/devices: no such file or directory
[cli] failed to collect release - skipping: collector=release error=[INTERNAL] failed to read os release from /usr/lib/os-release: [INTERNAL] failed to open file "/usr/lib/os-release": open /usr/lib/os-release: no such file or directory
[cli] collecting Kubernetes cluster information
[cli] D-Bus not available - no systemd data will be collected: error=dial unix /run/systemd/private: connect: no such file or directory hint=systemd/D-Bus is required for service status collection
[cli] failed to collect node - skipping: collector=node error=[INVALID_REQUEST] node name not set in environment
[cli] node topology collection complete: nodes=4 taints=1 labels=232
[cli] Bootstrapping NIC configuration daemon for discovery: namespace=nvidia-k8s-launch-kit image=nvcr.io/nvidia/mellanox/nic-configuration-operator-daemon:network-operator-v26.4.0
[cli] Filtered nodes by NVIDIA NIC presence: candidates=4 withNICs=4
[cli] Skipping discovered group with no east-west NICs (north-south only); not added to cluster-config: nodes=[mdf-prod-2852-k8m-02] pfCount=2
[cli] Discovery summary: groupCount=1 eastWestPFs=4 northSouthPFsFiltered=2 presetMatches=1 presetDeviationGroups=0 machineLabelledGroups=1
[cli] Deleted bootstrap namespace: namespace=nvidia-k8s-launch-kit
[cli] network topology collection complete: identifier=thinksystem-sr675-v3-nvidia-rtx-pro-6000-blackwell-ser-20a0a715 machineType=ThinkSystem-SR675-V3 gpuType=NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition pfs=6 rails=4

Not in this commit

  • ConfigMap mounting for Job-mode --cluster-config — file source works in local mode only today.
  • Multi-group NetworkTopology emission. The single-group enforcement is in toMeasurement; lifting it requires the multi-Measurement-per-Type consumer rewrites in constraints extractor, recipe validation, diff indexing, and fingerprint that are tracked as a follow-up phase.

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)

@github-actions

Copy link
Copy Markdown
Contributor

# Why

aicr's snapshot model gained a NetworkTopology Measurement type in Phase 1
but nothing populated it. Phase 3 wires the producer: a new collector that
ingests k8s-launch-kit (l8k) cluster-config data — either from a file or
via live discovery — and translates it into the canonical NetworkTopology
shape documented in docs/integrator/measurement-api.md.

This is the consumer side of the cross-repo integration. l8k's library
API landed upstream as PR NVIDIA#102 (merged commit f00c3cdde8c9); aicr pins
the corresponding pseudo-version and uses two l8k entry points:
ParseClusterConfig for the file-source path and Discover for the live
path.

# What changes

## New package pkg/collector/network

  - Collector type with two source modes, selected by struct fields:
    ClusterConfigPath (file source — no cluster contact) and
    DiscoverNetwork (live l8k discovery — writes node labels, patches
    NicClusterPolicy). When both are set, file path wins so callers can
    default --discover-network from a flag without inadvertent cluster
    contact. When neither is set, the collector is inactive — Collect
    returns (nil, nil) and the snapshotter's collectSafe wrapper
    (hardened in this commit) treats nil Measurement as a no-op.

  - Translator (translate.go) converts l8k.LaunchKitConfig into the
    contracted NetworkTopology Measurement: one Measurement with
    identity / capabilities / pfs (Items) / kernel-modules subtypes.
    Multi-group input is rejected with ErrCodeInvalidRequest and a
    groupCount field on the error context. Multi-group support is a
    planned future iteration; rejection guards the snapshot from
    silently dropping groups.

  - Per-PF Context carries model, connectedGPU, gpuProximity in
    addition to pciAddress, deviceID, psid, partNumber, rdmaDevice,
    networkInterface — every descriptive identifier l8k populates.
    Per-PF Data carries rail, numaNode, traffic (scalars).

  - Routes controller-runtime's internal logger through aicr's slog
    via l8kplugin.WithLogger(logr.FromSlogHandler(...)). Without this,
    callers hit a one-time "log.SetLogger(...) was never called"
    warning from controller-runtime and lose l8k's structured
    discovery output.

## CLI: aicr snapshot --cluster-config / --discover-network

  - --cluster-config <path> (env: AICR_CLUSTER_CONFIG_PATH) ingests
    a pre-existing l8k cluster-config.yaml.
  - --discover-network (env: AICR_DISCOVER_NETWORK) opts into live
    discovery. Discovery is NOT read-only.
  - Both flags off-by-default; an explicit opt-in is required because
    discovery is cluster-mutating and the file-source path needs a
    user-supplied YAML.
  - In AICR_AGENT_MODE=true (developer-bypass that runs collectors
    in-process), the CLI seeds KUBECONFIG from --kubeconfig before
    calling Measure so the network collector's l8k client resolves
    the same kubeconfig the rest of aicr uses.

## snapshotter.AgentConfig + agent.Config + Job env

  - ClusterConfigPath and DiscoverNetwork on snapshotter.AgentConfig
    forward into agent.Config, which renders AICR_CLUSTER_CONFIG_PATH
    and AICR_DISCOVER_NETWORK env vars on the agent Job.
  - In Job-mode, the in-pod CLI consumes those env vars via the
    flag's Sources binding (cli.EnvVars), so the same factory
    construction path serves both local and Job execution.
  - collectSafe in pkg/snapshotter/snapshot.go now skips nil
    Measurements — required for the network collector's
    inactive-mode contract; harmless to existing collectors that
    always return a Measurement.

## Job-mode RBAC for --discover-network

  - pkg/k8s/agent/rbac.go ensureClusterRole conditionally appends a
    discoverNetworkClusterRules() rule set when
    deployer.config.DiscoverNetwork is true. Non-network snapshots
    stay minimal-priv; discover-network snapshots gain the cluster-
    scoped grants l8k actually exercises:
      * apiextensions.k8s.io customresourcedefinitions (get/list/
        create/update/patch) — l8k installs the nic-configuration-
        operator CRDs if missing.
      * namespaces (get/create/delete) — the nvidia-k8s-launch-kit
        bootstrap namespace, created and deleted each run.
      * apps daemonsets (get/list/watch/create/delete) — the nic-
        configuration-daemon bootstrap DaemonSet.
      * serviceaccounts/configmaps (get/create/delete) and
        rbac.authorization.k8s.io roles/rolebindings (get/create/
        delete) — the daemon's supporting resources.
      * pods/exec create — probing daemon pods for VPD / link
        metadata.
      * nodes patch — writing the nvidia.kubernetes-launch-kit.
        machine and .gpu labels onto matched nodes.
      * configuration.net.nvidia.com nicdevices (get/list) — the
        per-PF CRs the daemon publishes.
      * mellanox.com nicclusterpolicies (get/patch) — server-side
        apply of the NicConfigurationOperator section onto the
        user's NicClusterPolicy.
  - Job-mode --discover-network is now usable without escalating the
    Job ServiceAccount to cluster-admin.

## Shared helper pkg/k8s/client.ResolveKubeconfigPath

  - Lifted the explicit-path → KUBECONFIG env → ~/.kube/config →
    in-cluster resolution chain out of BuildKubeClient into an
    exported helper. The network collector uses it before handing
    a path to l8k's kubeclient.New, so aicr's K8s client and the
    l8k discovery client resolve kubeconfigs identically.

## Factory wiring + defaults

  - pkg/collector/factory.go gains CreateNetworkCollector() and three
    Option functions (WithClusterConfigPath, WithDiscoverNetwork,
    WithKubeconfigPath). Existing patterns matched (struct-literal
    construction, no New on the collector itself).
  - pkg/defaults adds CollectorNetworkTimeout (10m) — capped above
    l8k's internal DaemonSet wait — and MaxClusterConfigBytes (1
    MiB) for the file-source io.LimitReader.

## Cross-repo dependency

  - go.mod requires github.com/nvidia/k8s-launch-kit at the merged
    PR NVIDIA#102 SHA (pseudo-version v0.0.0-20260625104911-f00c3cdde8c9).
    No local replace — pure upstream pin.
  - vendor/ tree includes l8k and its transitive deps (controller-
    runtime client/scheme, Mellanox network-operator API types,
    nic-configuration-operator API types). aicr does not register
    any of those schemes itself — scheme ownership stays with l8k.

# Testing

  - pkg/collector/network/translate_test.go: 9 cases covering single-
    group happy path, every subtype's shape, kernel modules, multi-
    group rejection, nil input, empty ClusterConfig, no-PFs (pfs
    subtype skipped).
  - pkg/collector/network/network_test.go: 6 cases covering inactive
    contract, file mode success, missing path, invalid YAML, multi-
    group surfaced from collector, file-path-wins-over-discover.
    Tests use real YAML round-tripped through l8k's parser — no test
    seams.
  - pkg/k8s/agent/deployer_test.go: a new TestDeployer_EnsureRBAC
    subtest asserts the discover-network rule set lands when
    Config.DiscoverNetwork is set (mellanox.com/nicclusterpolicies
    patch, pods/exec create, nodes patch all present); the existing
    default subtest still asserts exactly 3 rules so non-discovery
    snapshots stay minimal-priv.
  - go test -race ./... passes. Lint clean (golangci-lint v2.12.2,
    pinned in .settings.yaml).

# End-to-end discovery against a real cluster

Running --discover-network in local mode against a 4-node ThinkSystem
SR675-V3 RTX-PRO-6000 cluster produced a NetworkTopology Measurement
with the expected identity / capabilities / pfs shape, including the
gpuProximity / connectedGPU per-PF Context that l8k overlays from
preset topology:

    apiVersion: aicr.nvidia.com/v1alpha1
    fingerprint:
      accelerator:
        source: nodeTopology.label.nvidia.com/gpu.product
        value: rtx-pro-6000
      gpuNodeCount:
        source: nodeTopology.label.nvidia.com/gpu.product
        value: 3
      k8sVersion:
        source: k8s.server.version
        value: 1.34.3
      nodeCount:
        source: nodeTopology.summary.node-count
        value: 4
      os:
        value: ""
      service:
        value: ""
    kind: Snapshot
    measurements:
      # ...
      - subtypes:
          - context:
              gpuType: NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition
              identifier: thinksystem-sr675-v3-nvidia-rtx-pro-6000-blackwell-ser-20a0a715
              linkType: Ethernet
              machineType: ThinkSystem-SR675-V3
              nodeSelector: nvidia.kubernetes-launch-kit.machine=ThinkSystem-SR675-V3-NVIDIA-RTX-PRO-6000-Blackwell-Ser-20a0a715
            data:
              pf-count: 6
              rail-count: 4
            subtype: identity
          - data:
              ib: true
              rdma: true
              sriov: true
            subtype: capabilities
          - items:
              - context:
                  connectedGPU: GPU0
                  deviceID: a2dc
                  gpuProximity: PIX
                  model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112
                  networkInterface: ens15f0np0
                  partNumber: 900-9D3D4-00EN-HA0
                  pciAddress: "0000:05:00.0"
                  psid: mt_0000001069
                  rdmaDevice: mlx5_1
                data:
                  numaNode: 0
                  rail: 0
                  traffic: east-west
              - context:
                  deviceID: a2dc
                  model: BlueField-3 P-Series DPU 200GbE/NDR200 dual-port QSFP112
                  networkInterface: ens1f0np0
                  partNumber: 900-9D3B6-00CV-AA0
                  pciAddress: 0000:61:00.0
                  psid: mt_0000000884
                  rdmaDevice: mlx5_bond_0
                data:
                  numaNode: 0
                  traffic: north-south
              - context:
                  deviceID: a2dc
                  model: BlueField-3 P-Series DPU 200GbE/NDR200 dual-port QSFP112
                  networkInterface: ens1f1np1
                  partNumber: 900-9D3B6-00CV-AA0
                  pciAddress: 0000:61:00.1
                  psid: mt_0000000884
                data:
                  numaNode: 0
                  traffic: north-south
              - context:
                  connectedGPU: GPU2
                  deviceID: a2dc
                  gpuProximity: PIX
                  model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112
                  networkInterface: ens16f0np0
                  partNumber: 900-9D3D4-00EN-HA0
                  pciAddress: 0000:75:00.0
                  psid: mt_0000001069
                  rdmaDevice: mlx5_0
                data:
                  numaNode: 0
                  rail: 1
                  traffic: east-west
              - context:
                  connectedGPU: GPU4
                  deviceID: a2dc
                  gpuProximity: PIX
                  model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112
                  networkInterface: ens20f0np0
                  partNumber: 900-9D3D4-00EN-HA0
                  pciAddress: 0000:85:00.0
                  psid: mt_0000001069
                  rdmaDevice: mlx5_3
                data:
                  numaNode: 1
                  rail: 2
                  traffic: east-west
              - context:
                  connectedGPU: GPU6
                  deviceID: a2dc
                  gpuProximity: PIX
                  model: BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112
                  networkInterface: ens21f0np0
                  partNumber: 900-9D3D4-00EN-HA0
                  pciAddress: 0000:f5:00.0
                  psid: mt_0000001069
                  rdmaDevice: mlx5_2
                data:
                  numaNode: 1
                  rail: 3
                  traffic: east-west
            subtype: pfs
        type: NetworkTopology
    metadata:
      source-node: ""
      timestamp: "2026-06-25T10:40:47Z"
      version: dev

CLI invocation and l8k discovery log (truncated to network-relevant
lines; the proc/sys collectors warn-and-skip because the local-mode
host isn't a cluster node):

    AICR_AGENT_MODE=true aicr snapshot \
      --discover-network \
      --kubeconfig ~/workspace/k8s-launch-kit/rtx-cluster/kubeconfig \
      -o /Users/amaslennikov/workspace/tmp/aicr-discovery/discovery.yaml
    [cli] collecting SystemD service configurations
    [cli] collecting GPU information
    [cli] collecting node topology information
    [cli] collecting OS configuration
    [cli] failed to collect grub - skipping: collector=grub error=[INTERNAL] failed to read GRUB params from /proc/cmdline: [INTERNAL] failed to open file "/proc/cmdline": open /proc/cmdline: no such file or directory
    [cli] collecting network topology via l8k discovery
    [cli] failed to collect sysctl - skipping: collector=sysctl error=[INTERNAL] failed to collect sysctl parameters: [INTERNAL] failed to walk directory /proc/sys: lstat /proc/sys: no such file or directory
    [cli] failed to collect kmod - skipping: collector=kmod error=[INTERNAL] failed to read kernel modules from /proc/modules: [INTERNAL] failed to open file "/proc/modules": open /proc/modules: no such file or directory
    [cli] GPU hardware detection failed: error=[INTERNAL] NFD PCI discovery failed: failed to detect PCI devices: open /sys/bus/pci/devices: no such file or directory
    [cli] failed to collect release - skipping: collector=release error=[INTERNAL] failed to read os release from /usr/lib/os-release: [INTERNAL] failed to open file "/usr/lib/os-release": open /usr/lib/os-release: no such file or directory
    [cli] collecting Kubernetes cluster information
    [cli] D-Bus not available - no systemd data will be collected: error=dial unix /run/systemd/private: connect: no such file or directory hint=systemd/D-Bus is required for service status collection
    [cli] failed to collect node - skipping: collector=node error=[INVALID_REQUEST] node name not set in environment
    [cli] node topology collection complete: nodes=4 taints=1 labels=232
    [cli] Bootstrapping NIC configuration daemon for discovery: namespace=nvidia-k8s-launch-kit image=nvcr.io/nvidia/mellanox/nic-configuration-operator-daemon:network-operator-v26.4.0
    [cli] Filtered nodes by NVIDIA NIC presence: candidates=4 withNICs=4
    [cli] Skipping discovered group with no east-west NICs (north-south only); not added to cluster-config: nodes=[mdf-prod-2852-k8m-02] pfCount=2
    [cli] Discovery summary: groupCount=1 eastWestPFs=4 northSouthPFsFiltered=2 presetMatches=1 presetDeviationGroups=0 machineLabelledGroups=1
    [cli] Deleted bootstrap namespace: namespace=nvidia-k8s-launch-kit
    [cli] network topology collection complete: identifier=thinksystem-sr675-v3-nvidia-rtx-pro-6000-blackwell-ser-20a0a715 machineType=ThinkSystem-SR675-V3 gpuType=NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition pfs=6 rails=4

# Not in this commit

  - ConfigMap mounting for Job-mode --cluster-config — file source
    works in local mode only today.
  - Multi-group NetworkTopology emission. The single-group
    enforcement is in toMeasurement; lifting it requires the
    multi-Measurement-per-Type consumer rewrites in constraints
    extractor, recipe validation, diff indexing, and fingerprint
    that are tracked as a follow-up phase.
@NVIDIA NVIDIA deleted a comment from coderabbitai Bot Jun 25, 2026
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

Review Change Stack

Caution

Review failed

Pull request was closed or merged during review

📝 Walkthrough
📝 Walkthrough

Walkthrough

Adds a network topology collector that can read a cluster-config file or run live discovery, translates LaunchKit config into NetworkTopology measurements, and threads the new options through the CLI, snapshotter, agent config, and RBAC. It also updates kubeconfig resolution, collector timeouts, documentation, tests, and module dependencies.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

area/api

Suggested reviewers

  • mchmarny
  • lalitadithya
  • ArangoGutierrez
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding a NetworkTopology collector and integrating l8k.
Description check ✅ Passed The description matches the implemented network collector, CLI, RBAC, helper, and test changes.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/collector/network/network_test.go`:
- Around line 99-144: Consolidate the repeated file-mode error-path coverage in
Collector.Collect into a single table-driven test that exercises the missing
path, invalid YAML, and multi-group rejection cases using shared setup/assertion
logic. Keep the existing checks for errors.StructuredError codes and groupCount
context, but move case-specific inputs and expectations into a case list keyed
by the relevant symbols TestCollect_FileMode_MissingPath,
TestCollect_FileMode_InvalidYAML, and TestCollect_FileMode_MultiGroupRejected.

In `@pkg/collector/network/network.go`:
- Around line 112-121: The cluster-config size check in network.go is being
applied with io.LimitReader before l8kplugin.ParseClusterConfig, which can hide
oversized inputs as parse errors; instead, read or inspect the file size in the
caller around the ParseClusterConfig flow, enforce
defaults.MaxClusterConfigBytes before parsing, and return an explicit
errors.ErrCodeInvalidRequest with the cluster-config path when the limit is
exceeded.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

Comment on lines +99 to +144
func TestCollect_FileMode_MissingPath(t *testing.T) {
_, err := (&Collector{ClusterConfigPath: "/definitely/not/a/real/file.yaml"}).Collect(context.Background())
if err == nil {
t.Fatal("expected error for missing file, got nil")
}
se := (*errors.StructuredError)(nil)
if !stderrors.As(err, &se) || se.Code != errors.ErrCodeNotFound {
t.Errorf("expected ErrCodeNotFound, got %v", err)
}
}

func TestCollect_FileMode_InvalidYAML(t *testing.T) {
path := writeTempConfig(t, "not: valid: yaml: at: all: : :")
_, err := (&Collector{ClusterConfigPath: path}).Collect(context.Background())
if err == nil {
t.Fatal("expected error for invalid YAML, got nil")
}
se := (*errors.StructuredError)(nil)
if !stderrors.As(err, &se) || se.Code != errors.ErrCodeInvalidRequest {
t.Errorf("expected ErrCodeInvalidRequest, got %v", err)
}
}

// TestCollect_FileMode_MultiGroupRejected anchors that the collector
// surfaces the translator's single-group enforcement faithfully — a
// multi-group source returns ErrCodeInvalidRequest with groupCount set.
func TestCollect_FileMode_MultiGroupRejected(t *testing.T) {
multi := singleGroupYAML + `
- identifier: test-group-2
machineType: TestMachine2
gpuType: TestGPU
pfs: []
`
path := writeTempConfig(t, multi)
_, err := (&Collector{ClusterConfigPath: path}).Collect(context.Background())
if err == nil {
t.Fatal("expected ErrCodeInvalidRequest for multi-group input, got nil")
}
se := (*errors.StructuredError)(nil)
if !stderrors.As(err, &se) || se.Code != errors.ErrCodeInvalidRequest {
t.Fatalf("expected ErrCodeInvalidRequest, got %v", err)
}
if got := se.Context["groupCount"]; got != 2 {
t.Errorf("Context[groupCount] = %v, want 2", got)
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Convert the repeated file-mode error-path tests into a table-driven test.

TestCollect_FileMode_MissingPath, TestCollect_FileMode_InvalidYAML, and TestCollect_FileMode_MultiGroupRejected follow the same structure and should be consolidated into a table-driven case set for consistency and easier extension. As per coding guidelines, tests with multiple cases should be table-driven.

Proposed refactor sketch
-func TestCollect_FileMode_MissingPath(t *testing.T) { ... }
-func TestCollect_FileMode_InvalidYAML(t *testing.T) { ... }
-func TestCollect_FileMode_MultiGroupRejected(t *testing.T) { ... }
+func TestCollect_FileMode_Errors(t *testing.T) {
+	cases := []struct {
+		name      string
+		path      func(t *testing.T) string
+		wantCode  errors.ErrorCode
+		checkMore func(t *testing.T, se *errors.StructuredError)
+	}{
+		// missing path / invalid yaml / multi-group...
+	}
+	for _, tc := range cases {
+		t.Run(tc.name, func(t *testing.T) {
+			_, err := (&Collector{ClusterConfigPath: tc.path(t)}).Collect(context.Background())
+			if err == nil {
+				t.Fatal("expected error, got nil")
+			}
+			se := (*errors.StructuredError)(nil)
+			if !stderrors.As(err, &se) || se.Code != tc.wantCode {
+				t.Fatalf("expected %v, got %v", tc.wantCode, err)
+			}
+			if tc.checkMore != nil {
+				tc.checkMore(t, se)
+			}
+		})
+	}
+}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func TestCollect_FileMode_MissingPath(t *testing.T) {
_, err := (&Collector{ClusterConfigPath: "/definitely/not/a/real/file.yaml"}).Collect(context.Background())
if err == nil {
t.Fatal("expected error for missing file, got nil")
}
se := (*errors.StructuredError)(nil)
if !stderrors.As(err, &se) || se.Code != errors.ErrCodeNotFound {
t.Errorf("expected ErrCodeNotFound, got %v", err)
}
}
func TestCollect_FileMode_InvalidYAML(t *testing.T) {
path := writeTempConfig(t, "not: valid: yaml: at: all: : :")
_, err := (&Collector{ClusterConfigPath: path}).Collect(context.Background())
if err == nil {
t.Fatal("expected error for invalid YAML, got nil")
}
se := (*errors.StructuredError)(nil)
if !stderrors.As(err, &se) || se.Code != errors.ErrCodeInvalidRequest {
t.Errorf("expected ErrCodeInvalidRequest, got %v", err)
}
}
// TestCollect_FileMode_MultiGroupRejected anchors that the collector
// surfaces the translator's single-group enforcement faithfully — a
// multi-group source returns ErrCodeInvalidRequest with groupCount set.
func TestCollect_FileMode_MultiGroupRejected(t *testing.T) {
multi := singleGroupYAML + `
- identifier: test-group-2
machineType: TestMachine2
gpuType: TestGPU
pfs: []
`
path := writeTempConfig(t, multi)
_, err := (&Collector{ClusterConfigPath: path}).Collect(context.Background())
if err == nil {
t.Fatal("expected ErrCodeInvalidRequest for multi-group input, got nil")
}
se := (*errors.StructuredError)(nil)
if !stderrors.As(err, &se) || se.Code != errors.ErrCodeInvalidRequest {
t.Fatalf("expected ErrCodeInvalidRequest, got %v", err)
}
if got := se.Context["groupCount"]; got != 2 {
t.Errorf("Context[groupCount] = %v, want 2", got)
}
}
func TestCollect_FileMode_Errors(t *testing.T) {
cases := []struct {
name string
path func(t *testing.T) string
wantCode errors.ErrorCode
checkMore func(t *testing.T, se *errors.StructuredError)
}{
{
name: "missing path",
path: func(t *testing.T) string {
return "/definitely/not/a/real/file.yaml"
},
wantCode: errors.ErrCodeNotFound,
},
{
name: "invalid yaml",
path: func(t *testing.T) string {
return writeTempConfig(t, "not: valid: yaml: at: all: : :")
},
wantCode: errors.ErrCodeInvalidRequest,
},
{
name: "multi group rejected",
path: func(t *testing.T) string {
multi := singleGroupYAML + `
- identifier: test-group-2
machineType: TestMachine2
gpuType: TestGPU
pfs: []
`
return writeTempConfig(t, multi)
},
wantCode: errors.ErrCodeInvalidRequest,
checkMore: func(t *testing.T, se *errors.StructuredError) {
if got := se.Context["groupCount"]; got != 2 {
t.Errorf("Context[groupCount] = %v, want 2", got)
}
},
},
}
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
_, err := (&Collector{ClusterConfigPath: tc.path(t)}).Collect(context.Background())
if err == nil {
t.Fatal("expected error, got nil")
}
se := (*errors.StructuredError)(nil)
if !stderrors.As(err, &se) || se.Code != tc.wantCode {
t.Fatalf("expected %v, got %v", tc.wantCode, err)
}
if tc.checkMore != nil {
tc.checkMore(t, se)
}
})
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/collector/network/network_test.go` around lines 99 - 144, Consolidate the
repeated file-mode error-path coverage in Collector.Collect into a single
table-driven test that exercises the missing path, invalid YAML, and multi-group
rejection cases using shared setup/assertion logic. Keep the existing checks for
errors.StructuredError codes and groupCount context, but move case-specific
inputs and expectations into a case list keyed by the relevant symbols
TestCollect_FileMode_MissingPath, TestCollect_FileMode_InvalidYAML, and
TestCollect_FileMode_MultiGroupRejected.

Source: Coding guidelines

Comment on lines +112 to +121
// Read +1 over the cap so an overshoot triggers the size error
// rather than silently truncating into a parse failure later.
limited := io.LimitReader(f, defaults.MaxClusterConfigBytes+1)

cfg, err := l8kplugin.ParseClusterConfig(limited)
if err != nil {
return nil, errors.WrapWithContext(errors.ErrCodeInvalidRequest,
"failed to parse cluster-config file", err,
map[string]any{"path": c.ClusterConfigPath})
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

git ls-files 'pkg/collector/network/*' 'pkg/**' | sed -n '1,200p'
echo '---'
rg -n "ParseClusterConfig|MaxClusterConfigBytes|LimitReader|cluster-config" pkg vendor -g '*.go'

Repository: NVIDIA/aicr

Length of output: 29085


🏁 Script executed:

#!/bin/bash
set -euo pipefail

fd -a 'network.go' pkg
echo '---'
fd -a 'cluster_config' .
echo '---'
rg -n "func .*ParseClusterConfig|ParseClusterConfig\(" -g '*.go' .

Repository: NVIDIA/aicr

Length of output: 495


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "ParseClusterConfig|MaxClusterConfigBytes|cluster-config" . -g '*.go' -g '*.mod' -g '*.sum'

Repository: NVIDIA/aicr

Length of output: 6566


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '180,220p' vendor/github.com/NVIDIA/k8s-launch-kit/pkg/networkoperatorplugin/library.go
echo '---'
sed -n '100,130p' pkg/collector/network/network.go
echo '---'
sed -n '1,120p' pkg/bundler/bundler.go

Repository: NVIDIA/aicr

Length of output: 7031


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '180,220p' vendor/github.com/NVIDIA/k8s-launch-kit/pkg/networkoperatorplugin/library.go
echo '---'
sed -n '100,130p' pkg/collector/network/network.go
echo '---'
sed -n '1,120p' pkg/bundler/bundler.go

Repository: NVIDIA/aicr

Length of output: 7031


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '180,220p' vendor/github.com/NVIDIA/k8s-launch-kit/pkg/networkoperatorplugin/library.go
echo '---'
sed -n '100,130p' pkg/collector/network/network.go
echo '---'
sed -n '1,120p' pkg/bundler/bundler.go

Repository: NVIDIA/aicr

Length of output: 7031


Enforce the cluster-config size cap before parsing
ParseClusterConfig reads the entire io.Reader, so io.LimitReader(..., MaxClusterConfigBytes+1) here only truncates oversized files and can turn them into a misleading parse error. Check the byte count before calling ParseClusterConfig, and return an explicit ErrCodeInvalidRequest when the file exceeds the cap.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/collector/network/network.go` around lines 112 - 121, The cluster-config
size check in network.go is being applied with io.LimitReader before
l8kplugin.ParseClusterConfig, which can hide oversized inputs as parse errors;
instead, read or inspect the file size in the caller around the
ParseClusterConfig flow, enforce defaults.MaxClusterConfigBytes before parsing,
and return an explicit errors.ErrCodeInvalidRequest with the cluster-config path
when the limit is exceeded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant