Skip to content

fix: use workflow_run for PR coverage comments on fork PRs#5

Merged
dims merged 2 commits into
NVIDIA:mainfrom
dims:fix/fork-pr-coverage-comments
Jan 31, 2026
Merged

fix: use workflow_run for PR coverage comments on fork PRs#5
dims merged 2 commits into
NVIDIA:mainfrom
dims:fix/fork-pr-coverage-comments

Conversation

@dims

@dims dims commented Jan 31, 2026

Copy link
Copy Markdown
Collaborator

Summary

Use workflow_run pattern to post coverage comments on fork PRs, fixing the 403 permission error that was causing CI failures.

Motivation / Context

Fork PRs have restricted GITHUB_TOKEN permissions that prevent posting comments directly. The previous implementation tried to post PR comments using actions/github-script during the pull_request event, which fails with "Resource not accessible by integration" (HTTP 403) for fork PRs.

This change uses the workflow_run pattern recommended by GitHub Security Lab:

  1. Main workflow uploads coverage data as artifact (read-only, safe)
  2. Separate workflow_run triggered workflow posts comment (has write permissions)

Reference: https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/

related to comments in #3
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/eidos, pkg/cli)
  • API server (cmd/eidosd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: GitHub Actions workflows (.github/)

Implementation Notes

  • The go-coverage action now saves coverage data (coverage %, threshold, pass/fail, PR number) as a JSON artifact instead of posting directly
  • New on-push-comment.yaml workflow triggers via workflow_run after "On Push Qualification" completes
  • The workflow_run event runs in the context of the base repository, giving it write permissions even for fork PRs
  • Artifact retention is set to 1 day since it's only needed briefly for the comment workflow

Note: The workflow_run workflow won't trigger until this PR is merged (it must exist on the default branch). For this PR specifically, coverage will appear in the Step Summary but not as a PR comment.

Testing

# YAML validation
yamllint .github/workflows/on-push-comment.yaml .github/actions/go-coverage/action.yml
  • YAML syntax validated locally
  • CI will validate the workflow runs without the 403 error
  • Full coverage comment functionality will be testable after merge

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: No migration needed. The change is backwards compatible - coverage is still calculated and shown in Step Summary. PR comments will start appearing for future PRs after this merges.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — N/A (CI workflow change)
  • I updated docs if user-facing behavior changed — N/A (internal CI)
  • Changes follow existing patterns in the codebase
  • Commits are signed off (git commit -s) — DCO info

Copilot AI review requested due to automatic review settings January 31, 2026 19:43

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements the workflow_run pattern to fix coverage comment posting on fork PRs. Fork PRs have restricted GITHUB_TOKEN permissions that prevent direct comment posting, so this change splits the workflow into two parts: the main workflow uploads coverage data as an artifact (read-only operation), and a separate workflow_run-triggered workflow posts the comment with write permissions. This follows the secure pattern recommended by GitHub Security Lab.

Changes:

  • Added new workflow_run workflow (on-push-comment.yaml) that downloads coverage artifacts and posts PR comments with appropriate write permissions
  • Modified go-coverage action to save coverage data as an artifact instead of posting comments directly
  • Removed direct PR comment posting logic from go-coverage action (previously used github-script to post comments inline)

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
.github/workflows/on-push-comment.yaml New workflow_run workflow that downloads coverage artifacts and posts comments on PRs with write permissions
.github/actions/go-coverage/action.yml Removed direct PR comment posting, now saves coverage data as artifact for workflow_run to consume

Comment thread .github/workflows/on-push-comment.yaml
Fork PRs have restricted GITHUB_TOKEN permissions that prevent posting
comments directly. This change uses the workflow_run pattern:

1. Main workflow uploads coverage data as artifact (read-only safe)
2. Separate workflow_run triggered workflow posts comment (write perms)

This is the recommended secure pattern per GitHub Security Lab:
https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/

Fixes: NVIDIA/aicr#3

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Signed-off-by: Davanum Srinivas <[email protected]>
@dims dims force-pushed the fix/fork-pr-coverage-comments branch from 195a121 to 6697c70 Compare January 31, 2026 19:47
Copilot AI review requested due to automatic review settings January 31, 2026 20:08

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@mchmarny mchmarny linked an issue Jan 31, 2026 that may be closed by this pull request
1 task

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment thread .github/workflows/on-push-comment.yaml
@dims dims merged commit f91a5b3 into NVIDIA:main Jan 31, 2026
9 checks passed
dims referenced this pull request in dims/aicr Feb 20, 2026
Add three new validation steps to the H100 inference test:

- Inference Gateway (#6): verify GatewayClass accepted and Gateway
  programmed with inference extension CRDs present
- Accelerator & AI Service Metrics (#4/#5): verify DCGM Exporter
  metrics, Prometheus scraping, and custom metrics API availability
- Secure Accelerator Access (#3): verify GPU access is DRA-mediated
  (no hostPath, no device plugin), with proper container security

Also adds diagnostics for gateway, metrics, and DRA state on failure.

Signed-off-by: Davanum Srinivas <[email protected]>
dims referenced this pull request in dims/aicr Feb 20, 2026
Add three new validation steps to the H100 inference test:

- Inference Gateway (#6): verify GatewayClass accepted and Gateway
  programmed with inference extension CRDs present
- Accelerator & AI Service Metrics (#4/#5): verify DCGM Exporter
  metrics, Prometheus scraping, and custom metrics API availability
- Secure Accelerator Access (#3): verify GPU access is DRA-mediated
  (no hostPath, no device plugin), with proper container security

Also adds diagnostics for gateway, metrics, and DRA state on failure.

Signed-off-by: Davanum Srinivas <[email protected]>
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are
small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:

  kai-scheduler           v0.13.0 -> v0.14.1
  kubeflow-trainer        2.1.0   -> 2.2.0

kai-scheduler — chart bump and OCI registry namespace migration
(NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own
`kai-scheduler` org and chart publishing moved with it. The old
namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0;
the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries
the full release stream. v0.14.1 verified clean: 41/41 templates and
identical kinds/counts vs v0.13.0; only values.yaml addition is an
opt-in `vpa:` block (`enabled: false` default). Our customizations
(`global.tolerations`, `admission.gpuPodRuntimeClassName`,
`postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer — chart bump and validator fallback URL update
(NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback
archive URL in `validators/performance/trainer_lifecycle.go` are
coupled: the validator's no-CRD install path downloads
`https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz`
and applies the `manifests/overlays/manager` kustomize. If the chart
pin moves but the validator URL doesn't, the fallback installs the
old release while the chart deploys the new one. v2.2.0 archive
layout is unchanged from v2.1.0 (same `manifests/overlays/manager`
kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the
only difference is the controller-manager image tag.

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  Pulled.
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  Pulled.
  $ aicr recipe --service eks --accelerator h100 --intent training \
      --os ubuntu --platform kubeflow -o recipe.yaml
  $ aicr bundle -r recipe.yaml -o /tmp/bundle
  ... succeeds for both kai-scheduler and kubeflow-trainer components.

Refs: NVIDIA#698
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are
small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:

  kai-scheduler           v0.13.0 -> v0.14.1
  kubeflow-trainer        2.1.0   -> 2.2.0

kai-scheduler — chart bump and OCI registry namespace migration
(NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own
`kai-scheduler` org and chart publishing moved with it. The old
namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0;
the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries
the full release stream. v0.14.1 verified clean: 41/41 templates and
identical kinds/counts vs v0.13.0; only values.yaml addition is an
opt-in `vpa:` block (`enabled: false` default). Our customizations
(`global.tolerations`, `admission.gpuPodRuntimeClassName`,
`postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer — chart bump and validator fallback URL update
(NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback
archive URL in `validators/performance/trainer_lifecycle.go` are
coupled: the validator's no-CRD install path downloads
`https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz`
and applies the `manifests/overlays/manager` kustomize. If the chart
pin moves but the validator URL doesn't, the fallback installs the
old release while the chart deploys the new one. v2.2.0 archive
layout is unchanged from v2.1.0 (same `manifests/overlays/manager`
kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the
only difference is the controller-manager image tag.

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  Pulled.
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  Pulled.
  $ aicr recipe --service eks --accelerator h100 --intent training \
      --os ubuntu --platform kubeflow -o recipe.yaml
  $ aicr bundle -r recipe.yaml -o /tmp/bundle
  ... succeeds for both kai-scheduler and kubeflow-trainer components.

Refs: NVIDIA#698
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are
small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:

  kai-scheduler           v0.13.0 -> v0.14.1
  kubeflow-trainer        2.1.0   -> 2.2.0

kai-scheduler — chart bump and OCI registry namespace migration
(NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own
`kai-scheduler` org and chart publishing moved with it. The old
namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0;
the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries
the full release stream. v0.14.1 verified clean: 41/41 templates and
identical kinds/counts vs v0.13.0; only values.yaml addition is an
opt-in `vpa:` block (`enabled: false` default). Our customizations
(`global.tolerations`, `admission.gpuPodRuntimeClassName`,
`postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer — chart bump and validator fallback URL update
(NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback
archive URL in `validators/performance/trainer_lifecycle.go` are
coupled: the validator's no-CRD install path downloads
`https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz`
and applies the `manifests/overlays/manager` kustomize. If the chart
pin moves but the validator URL doesn't, the fallback installs the
old release while the chart deploys the new one. v2.2.0 archive
layout is unchanged from v2.1.0 (same `manifests/overlays/manager`
kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the
only difference is the controller-manager image tag.

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  Pulled.
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  Pulled.
  $ aicr recipe --service eks --accelerator h100 --intent training \
      --os ubuntu --platform kubeflow -o recipe.yaml
  $ aicr bundle -r recipe.yaml -o /tmp/bundle
  ... succeeds for both kai-scheduler and kubeflow-trainer components.

Refs: NVIDIA#698
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are
small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:

  kai-scheduler           v0.13.0 -> v0.14.1
  kubeflow-trainer        2.1.0   -> 2.2.0

kai-scheduler — chart bump and OCI registry namespace migration
(NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own
`kai-scheduler` org and chart publishing moved with it. The old
namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0;
the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries
the full release stream. v0.14.1 verified clean: 41/41 templates and
identical kinds/counts vs v0.13.0; only values.yaml addition is an
opt-in `vpa:` block (`enabled: false` default). Our customizations
(`global.tolerations`, `admission.gpuPodRuntimeClassName`,
`postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer — chart bump and validator fallback URL update
(NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback
archive URL in `validators/performance/trainer_lifecycle.go` are
coupled: the validator's no-CRD install path downloads
`https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz`
and applies the `manifests/overlays/manager` kustomize. If the chart
pin moves but the validator URL doesn't, the fallback installs the
old release while the chart deploys the new one. v2.2.0 archive
layout is unchanged from v2.1.0 (same `manifests/overlays/manager`
kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the
only difference is the controller-manager image tag.

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  Pulled.
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  Pulled.
  $ aicr recipe --service eks --accelerator h100 --intent training \
      --os ubuntu --platform kubeflow -o recipe.yaml
  $ aicr bundle -r recipe.yaml -o /tmp/bundle
  ... succeeds for both kai-scheduler and kubeflow-trainer components.

Refs: NVIDIA#698
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are
small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:

  kai-scheduler           v0.13.0 -> v0.14.1
  kubeflow-trainer        2.1.0   -> 2.2.0

kai-scheduler — chart bump and OCI registry namespace migration
(NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own
`kai-scheduler` org and chart publishing moved with it. The old
namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0;
the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries
the full release stream. v0.14.1 verified clean: 41/41 templates and
identical kinds/counts vs v0.13.0; only values.yaml addition is an
opt-in `vpa:` block (`enabled: false` default). Our customizations
(`global.tolerations`, `admission.gpuPodRuntimeClassName`,
`postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer — chart bump, validator fallback URL update, and
demo migration to the new RuntimePatches API
(NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback
archive URL in `validators/performance/trainer_lifecycle.go` are
coupled: the validator's no-CRD install path downloads
`https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz`
and applies the `manifests/overlays/manager` kustomize. If the chart
pin moves but the validator URL doesn't, the fallback installs the
old release while the chart deploys the new one. v2.2.0 archive
layout is unchanged from v2.1.0 (same `manifests/overlays/manager`
kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the
only difference is the controller-manager image tag.

v2.2.0 ships a breaking API change to TrainJob: `podTemplateOverrides`
is replaced by `runtimePatches` (kubeflow/trainer#3309). The CRD still
admits the old field name for compat, but the controller no longer
applies it — pods come out with no override fields, and on AICR's
tainted GPU nodes the `tolerations: [{operator: Exists}]` shorthand
the demo previously used silently no-ops, leaving pods Pending.

The `pytorch-mnist` demo TrainJob in `demos/cuj1-eks.md` and
`demos/cuj1-gke.md` is migrated to the new shape:

  spec:
    runtimePatches:
      - manager: aicr.nvidia.com/demo
        trainingRuntimeSpec:
          template:
            spec:
              replicatedJobs:
                - name: node
                  template:
                    spec:
                      template:
                        spec:
                          nodeSelector: {nodeGroup: gpu-worker}
                          tolerations:
                            - {key: dedicated, operator: Equal,
                               value: worker-workload, effect: NoSchedule}
                            - {key: dedicated, operator: Equal,
                               value: worker-workload, effect: NoExecute}

Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade:
TrainJob admitted, pod scheduled to the GPU node with the expected
tolerations + nodeSelector, training completed in 2m39s with
accuracy=0.7413 (matches pre-upgrade baseline).

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/...
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are
small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:

  kai-scheduler           v0.13.0 -> v0.14.1
  kubeflow-trainer        2.1.0   -> 2.2.0

kai-scheduler — chart bump and OCI registry namespace migration
(NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own
`kai-scheduler` org and chart publishing moved with it. The old
namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0;
the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries
the full release stream. v0.14.1 verified clean: 41/41 templates and
identical kinds/counts vs v0.13.0; only values.yaml addition is an
opt-in `vpa:` block (`enabled: false` default). Our customizations
(`global.tolerations`, `admission.gpuPodRuntimeClassName`,
`postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer — chart bump, validator fallback URL update, and
demo migration to the new RuntimePatches API
(NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback
archive URL in `validators/performance/trainer_lifecycle.go` are
coupled: the validator's no-CRD install path downloads
`https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz`
and applies the `manifests/overlays/manager` kustomize. If the chart
pin moves but the validator URL doesn't, the fallback installs the
old release while the chart deploys the new one. v2.2.0 archive
layout is unchanged from v2.1.0 (same `manifests/overlays/manager`
kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the
only difference is the controller-manager image tag.

v2.2.0 ships a breaking API change to TrainJob: `podTemplateOverrides`
is replaced by `runtimePatches` (kubeflow/trainer#3309). The CRD still
admits the old field name for compat, but the controller no longer
applies it — pods come out with no override fields, and on AICR's
tainted GPU nodes the `tolerations: [{operator: Exists}]` shorthand
the demo previously used silently no-ops, leaving pods Pending.

The `pytorch-mnist` demo TrainJob in `demos/cuj1-eks.md` and
`demos/cuj1-gke.md` is migrated to the new shape:

  spec:
    runtimePatches:
      - manager: aicr.nvidia.com/demo
        trainingRuntimeSpec:
          template:
            spec:
              replicatedJobs:
                - name: node
                  template:
                    spec:
                      template:
                        spec:
                          nodeSelector: {nodeGroup: gpu-worker}
                          tolerations:
                            - {key: dedicated, operator: Equal,
                               value: worker-workload, effect: NoSchedule}
                            - {key: dedicated, operator: Equal,
                               value: worker-workload, effect: NoExecute}

Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade:
TrainJob admitted, pod scheduled to the GPU node with the expected
tolerations + nodeSelector, training completed in 2m39s with
accuracy=0.7413 (matches pre-upgrade baseline).

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/...
yuanchen8911 added a commit to yuanchen8911/aicr that referenced this pull request Apr 30, 2026
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are
small chart-pin changes coupled to a single non-pin tweak each.

Components bumped:

  kai-scheduler           v0.13.0 -> v0.14.1
  kubeflow-trainer        2.1.0   -> 2.2.0

kai-scheduler — chart bump and OCI registry namespace migration
(NVIDIA#698 follow-up NVIDIA#3):

KAI-Scheduler was transferred from the NVIDIA org to its own
`kai-scheduler` org and chart publishing moved with it. The old
namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0;
the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries
the full release stream. v0.14.1 verified clean: 41/41 templates and
identical kinds/counts vs v0.13.0; only values.yaml addition is an
opt-in `vpa:` block (`enabled: false` default). Our customizations
(`global.tolerations`, `admission.gpuPodRuntimeClassName`,
`postCleanup.enabled`) all still apply unchanged.

kubeflow-trainer — chart bump, validator fallback URL update, demo
migration to RuntimePatches, and ClusterTrainingRuntime alignment
(NVIDIA#698 follow-up NVIDIA#5):

The chart pin in `recipes/registry.yaml` and the hardcoded fallback
archive URL in `validators/performance/trainer_lifecycle.go` are
coupled: the validator's no-CRD install path downloads
`https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz`
and applies the `manifests/overlays/manager` kustomize. If the chart
pin moves but the validator URL doesn't, the fallback installs the
old release while the chart deploys the new one. v2.2.0 archive
layout is unchanged from v2.1.0 (same `manifests/overlays/manager`
kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the
only difference is the controller-manager image tag.

v2.2.0 ships two breaking API changes that touch AICR:

  1. PodTemplateOverrides → RuntimePatches (kubeflow/trainer#3309).
     The CRD still admits the old field for compat but the v2.2
     controller no longer applies it. The pytorch-mnist demo TrainJob
     in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to
     the `runtimePatches` shape with `manager: aicr.nvidia.com/demo`
     and explicit per-cluster scheduling (the EKS demo carries the
     AICR-standard `dedicated=worker-workload` tolerations + NoExecute
     effect; the GKE demo carries `dedicated=gpu-workload:NoSchedule`
     and `nvidia.com/gpu=present:NoSchedule` to match the rest of the
     GKE flow).

  2. mlPolicy.torch.numProcPerNode removal (kubeflow/trainer#3239).
     Upstream removed the field from the Torch policy because it now
     infers parallelism from the container's `nvidia.com/gpu` limit.
     `mlPolicy.mpi.numProcPerNode` is unaffected, so the existing MPI
     test fixtures stay as-is. AICR's `torch-distributed`
     ClusterTrainingRuntime is updated from
     `mlPolicy.torch: { numProcPerNode: auto }` to
     `mlPolicy.torch: {}`, matching the v2.2.0 reference runtime.

Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade:
demo TrainJob admitted, pod scheduled with the migrated runtimePatches,
training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade
baseline). 2-replica Deployment with `schedulerName: kai-scheduler` +
DRA `ResourceClaimTemplate` referencing `gpu.nvidia.com` also
scheduled cleanly with `priorityClassName: train` (each replica got
its own H100 via DRA).

Verified locally:

  $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1
  $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0
  $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/... ./pkg/bundler/deployer/helm/...
@github-actions

github-actions Bot commented May 2, 2026

Copy link
Copy Markdown
Contributor

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

@github-actions github-actions Bot locked as resolved and limited conversation to collaborators May 2, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Configure copy-pr-bot

3 participants