Problem
The NCCL performance validator (nccl-all-reduce-bw) only supports EKS via Kubeflow TrainJob. GKE requires a different execution model because GPUDirect TCPXO needs a tcpxo-daemon sidecar per pod and hostNetwork: true, which doesn't fit the TrainJob abstraction.
GKE+H100 is currently in pendingNCCLCombinations and skips with an informative message. Testdata exists at validators/performance/testdata/h100/gke/ but cannot be used by the current automation.
Root Cause
The validator's apply flow is hardcoded to single-GVR resources:
- Apply runtime.yaml as a TrainingRuntime
- Apply trainjob.yaml as a TrainJob
- Wait for TrainJob completion
- Extract logs from launcher pod
GKE's runtime.yaml contains multiple resource types (Services + Pods), causing: "the API version in the data (v1) does not match the expected API version (trainer.kubeflow.org/v1alpha1)"
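To illustrate the mismatch, here is a minimal sketch of per-document kind detection over a multi-document manifest. It is not the validator's code: `docKind` and `splitKinds` are hypothetical names, the sample manifest is invented, and the line-based scan stands in for a real YAML decoder plus a GVR lookup, which an actual implementation would presumably use.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// docKind holds the apiVersion and kind found in one YAML document.
// (Hypothetical type, for illustration only.)
type docKind struct {
	APIVersion string
	Kind       string
}

// splitKinds splits a multi-document manifest on "---" separator lines and
// reads the top-level apiVersion and kind of each non-empty document.
// Simplification: only unindented "key: value" lines are inspected.
func splitKinds(manifest string) []docKind {
	var out []docKind
	var cur docKind
	flush := func() {
		if cur.APIVersion != "" || cur.Kind != "" {
			out = append(out, cur)
		}
		cur = docKind{}
	}
	sc := bufio.NewScanner(strings.NewReader(manifest))
	for sc.Scan() {
		line := sc.Text()
		switch {
		case strings.TrimSpace(line) == "---":
			flush()
		case strings.HasPrefix(line, "apiVersion:"):
			cur.APIVersion = strings.TrimSpace(strings.TrimPrefix(line, "apiVersion:"))
		case strings.HasPrefix(line, "kind:"):
			cur.Kind = strings.TrimSpace(strings.TrimPrefix(line, "kind:"))
		}
	}
	flush()
	return out
}

func main() {
	// Invented sample: a Service and a Pod in one file, as in the GKE testdata layout.
	manifest := `apiVersion: v1
kind: Service
metadata:
  name: example-svc
---
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  hostNetwork: true
`
	for i, d := range splitKinds(manifest) {
		fmt.Printf("doc %d: %s %s\n", i, d.APIVersion, d.Kind)
	}
}
```

Each document reports apiVersion v1, so applying the file against a single expected trainer.kubeflow.org/v1alpha1 resource cannot succeed. Each document needs its own resource mapping.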
Proposed Fix
Add a GKE execution strategy in validateNcclAllReduceBw:
- Multi-resource apply: split YAML by ---, detect each resource's GVR from apiVersion/kind, apply independently
- Pod readiness wait: wait for NCCL test pods to be 2/2 Ready (not TrainJob completion)
- Exec-based trigger: kubectl exec into host-1 to run /scripts/allreduce.sh
- Parse output: reuse existing ncclBandwidthRe regex from exec stdout
Branch on service == GKE to use this flow; EKS continues to use the TrainJob path.
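The output-parsing step could look roughly like the sketch below. It assumes the exec stdout contains the nccl-tests summary line of the form "# Avg bus bandwidth : <value>". The names `avgBusBwRe` and `parseAvgBusBw` are hypothetical, and the pattern is an illustration only; the validator's existing ncclBandwidthRe may match something different.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// avgBusBwRe is a hypothetical pattern for the nccl-tests summary line,
// e.g. "# Avg bus bandwidth    : 87.2". The real ncclBandwidthRe may differ.
var avgBusBwRe = regexp.MustCompile(`#\s*Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)`)

// parseAvgBusBw returns the average bus bandwidth (GB/s) found in stdout,
// or an error if no summary line is present.
func parseAvgBusBw(stdout string) (float64, error) {
	m := avgBusBwRe.FindStringSubmatch(stdout)
	if m == nil {
		return 0, fmt.Errorf("no average bus bandwidth line found in output")
	}
	return strconv.ParseFloat(m[1], 64)
}

func main() {
	// Invented sample of the tail of an all-reduce run's stdout.
	stdout := "# Out of bounds values : 0 OK\n# Avg bus bandwidth    : 87.2\n"
	bw, err := parseAvgBusBw(stdout)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("avg busBW: %.1f GB/s\n", bw)
}
```

Because the parser only needs a string, the same function would work on launcher pod logs (EKS path) and on exec stdout (GKE path), which is what makes reusing the regex attractive.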
Files
validators/performance/nccl_all_reduce_bw_constraint.go — add GKE execution branch
validators/performance/testdata/h100/gke/runtime.yaml — already exists (raw Pods + Services)
validators/performance/testdata/h100/gke/trainjob.yaml — may be replaced by exec trigger logic
Validation
Manually validated on GKE a3-megagpu-8g (2x H100, COS, K8s 1.35):
- NCCL AllReduce: 335 GB/s peak busBW, 87.2 GB/s avg
- Using hostNetwork: true + privileged: true (fallback profile)
Related: #381 (TCPXO hostNetwork requirement), upstream container-engine-accelerators#580