fix: use workflow_run for PR coverage comments on fork PRs#5
Conversation
There was a problem hiding this comment.
Pull request overview
This PR implements the workflow_run pattern to fix coverage comment posting on fork PRs. Fork PRs have restricted GITHUB_TOKEN permissions that prevent direct comment posting, so this change splits the workflow into two parts: the main workflow uploads coverage data as an artifact (read-only operation), and a separate workflow_run-triggered workflow posts the comment with write permissions. This follows the secure pattern recommended by GitHub Security Lab.
Changes:
- Added new workflow_run workflow (on-push-comment.yaml) that downloads coverage artifacts and posts PR comments with appropriate write permissions
- Modified go-coverage action to save coverage data as an artifact instead of posting comments directly
- Removed direct PR comment posting logic from go-coverage action (previously used github-script to post comments inline)
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| .github/workflows/on-push-comment.yaml | New workflow_run workflow that downloads coverage artifacts and posts comments on PRs with write permissions |
| .github/actions/go-coverage/action.yml | Removed direct PR comment posting, now saves coverage data as artifact for workflow_run to consume |
Fork PRs have restricted GITHUB_TOKEN permissions that prevent posting comments directly. This change uses the workflow_run pattern: 1. Main workflow uploads coverage data as artifact (read-only safe) 2. Separate workflow_run triggered workflow posts comment (write perms) This is the recommended secure pattern per GitHub Security Lab: https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/ Fixes: NVIDIA/aicr#3 Co-Authored-By: Claude Opus 4.5 <[email protected]> Signed-off-by: Davanum Srinivas <[email protected]>
195a121 to
6697c70
Compare
Add three new validation steps to the H100 inference test: - Inference Gateway (#6): verify GatewayClass accepted and Gateway programmed with inference extension CRDs present - Accelerator & AI Service Metrics (#4/#5): verify DCGM Exporter metrics, Prometheus scraping, and custom metrics API availability - Secure Accelerator Access (#3): verify GPU access is DRA-mediated (no hostPath, no device plugin), with proper container security Also adds diagnostics for gateway, metrics, and DRA state on failure. Signed-off-by: Davanum Srinivas <[email protected]>
Add three new validation steps to the H100 inference test: - Inference Gateway (#6): verify GatewayClass accepted and Gateway programmed with inference extension CRDs present - Accelerator & AI Service Metrics (#4/#5): verify DCGM Exporter metrics, Prometheus scraping, and custom metrics API availability - Secure Accelerator Access (#3): verify GPU access is DRA-mediated (no hostPath, no device plugin), with proper container security Also adds diagnostics for gateway, metrics, and DRA state on failure. Signed-off-by: Davanum Srinivas <[email protected]>
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each. Components bumped: kai-scheduler v0.13.0 -> v0.14.1 kubeflow-trainer 2.1.0 -> 2.2.0 kai-scheduler — chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3): KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream. v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged. kubeflow-trainer — chart bump and validator fallback URL update (NVIDIA#698 follow-up NVIDIA#5): The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one. v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag. Verified locally: $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1 Pulled. $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 Pulled. $ aicr recipe --service eks --accelerator h100 --intent training \ --os ubuntu --platform kubeflow -o recipe.yaml $ aicr bundle -r recipe.yaml -o /tmp/bundle ... succeeds for both kai-scheduler and kubeflow-trainer components. Refs: NVIDIA#698
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each. Components bumped: kai-scheduler v0.13.0 -> v0.14.1 kubeflow-trainer 2.1.0 -> 2.2.0 kai-scheduler — chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3): KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream. v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged. kubeflow-trainer — chart bump and validator fallback URL update (NVIDIA#698 follow-up NVIDIA#5): The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one. v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag. Verified locally: $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1 Pulled. $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 Pulled. $ aicr recipe --service eks --accelerator h100 --intent training \ --os ubuntu --platform kubeflow -o recipe.yaml $ aicr bundle -r recipe.yaml -o /tmp/bundle ... succeeds for both kai-scheduler and kubeflow-trainer components. Refs: NVIDIA#698
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each. Components bumped: kai-scheduler v0.13.0 -> v0.14.1 kubeflow-trainer 2.1.0 -> 2.2.0 kai-scheduler — chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3): KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream. v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged. kubeflow-trainer — chart bump and validator fallback URL update (NVIDIA#698 follow-up NVIDIA#5): The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one. v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag. Verified locally: $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1 Pulled. $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 Pulled. $ aicr recipe --service eks --accelerator h100 --intent training \ --os ubuntu --platform kubeflow -o recipe.yaml $ aicr bundle -r recipe.yaml -o /tmp/bundle ... succeeds for both kai-scheduler and kubeflow-trainer components. Refs: NVIDIA#698
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each. Components bumped: kai-scheduler v0.13.0 -> v0.14.1 kubeflow-trainer 2.1.0 -> 2.2.0 kai-scheduler — chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3): KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream. v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged. kubeflow-trainer — chart bump and validator fallback URL update (NVIDIA#698 follow-up NVIDIA#5): The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one. v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag. Verified locally: $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1 Pulled. $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 Pulled. $ aicr recipe --service eks --accelerator h100 --intent training \ --os ubuntu --platform kubeflow -o recipe.yaml $ aicr bundle -r recipe.yaml -o /tmp/bundle ... succeeds for both kai-scheduler and kubeflow-trainer components. Refs: NVIDIA#698
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each. Components bumped: kai-scheduler v0.13.0 -> v0.14.1 kubeflow-trainer 2.1.0 -> 2.2.0 kai-scheduler — chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3): KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream. v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged. kubeflow-trainer — chart bump, validator fallback URL update, and demo migration to the new RuntimePatches API (NVIDIA#698 follow-up NVIDIA#5): The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one. v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag. v2.2.0 ships a breaking API change to TrainJob: `podTemplateOverrides` is replaced by `runtimePatches` (kubeflow/trainer#3309). The CRD still admits the old field name for compat, but the controller no longer applies it — pods come out with no override fields, and on AICR's tainted GPU nodes the `tolerations: [{operator: Exists}]` shorthand the demo previously used silently no-ops, leaving pods Pending. The `pytorch-mnist` demo TrainJob in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to the new shape: spec: runtimePatches: - manager: aicr.nvidia.com/demo trainingRuntimeSpec: template: spec: replicatedJobs: - name: node template: spec: template: spec: nodeSelector: {nodeGroup: gpu-worker} tolerations: - {key: dedicated, operator: Equal, value: worker-workload, effect: NoSchedule} - {key: dedicated, operator: Equal, value: worker-workload, effect: NoExecute} Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade: TrainJob admitted, pod scheduled to the GPU node with the expected tolerations + nodeSelector, training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade baseline). Verified locally: $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1 $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/...
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each. Components bumped: kai-scheduler v0.13.0 -> v0.14.1 kubeflow-trainer 2.1.0 -> 2.2.0 kai-scheduler — chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3): KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream. v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged. kubeflow-trainer — chart bump, validator fallback URL update, and demo migration to the new RuntimePatches API (NVIDIA#698 follow-up NVIDIA#5): The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one. v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag. v2.2.0 ships a breaking API change to TrainJob: `podTemplateOverrides` is replaced by `runtimePatches` (kubeflow/trainer#3309). The CRD still admits the old field name for compat, but the controller no longer applies it — pods come out with no override fields, and on AICR's tainted GPU nodes the `tolerations: [{operator: Exists}]` shorthand the demo previously used silently no-ops, leaving pods Pending. The `pytorch-mnist` demo TrainJob in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to the new shape: spec: runtimePatches: - manager: aicr.nvidia.com/demo trainingRuntimeSpec: template: spec: replicatedJobs: - name: node template: spec: template: spec: nodeSelector: {nodeGroup: gpu-worker} tolerations: - {key: dedicated, operator: Equal, value: worker-workload, effect: NoSchedule} - {key: dedicated, operator: Equal, value: worker-workload, effect: NoExecute} Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade: TrainJob admitted, pod scheduled to the GPU node with the expected tolerations + nodeSelector, training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade baseline). Verified locally: $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1 $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/...
Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each. Components bumped: kai-scheduler v0.13.0 -> v0.14.1 kubeflow-trainer 2.1.0 -> 2.2.0 kai-scheduler — chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3): KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream. v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged. kubeflow-trainer — chart bump, validator fallback URL update, demo migration to RuntimePatches, and ClusterTrainingRuntime alignment (NVIDIA#698 follow-up NVIDIA#5): The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one. v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag. v2.2.0 ships two breaking API changes that touch AICR: 1. PodTemplateOverrides → RuntimePatches (kubeflow/trainer#3309). The CRD still admits the old field for compat but the v2.2 controller no longer applies it. The pytorch-mnist demo TrainJob in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to the `runtimePatches` shape with `manager: aicr.nvidia.com/demo` and explicit per-cluster scheduling (the EKS demo carries the AICR-standard `dedicated=worker-workload` tolerations + NoExecute effect; the GKE demo carries `dedicated=gpu-workload:NoSchedule` and `nvidia.com/gpu=present:NoSchedule` to match the rest of the GKE flow). 2. mlPolicy.torch.numProcPerNode removal (kubeflow/trainer#3239). Upstream removed the field from the Torch policy because it now infers parallelism from the container's `nvidia.com/gpu` limit. `mlPolicy.mpi.numProcPerNode` is unaffected, so the existing MPI test fixtures stay as-is. AICR's `torch-distributed` ClusterTrainingRuntime is updated from `mlPolicy.torch: { numProcPerNode: auto }` to `mlPolicy.torch: {}`, matching the v2.2.0 reference runtime. Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade: demo TrainJob admitted, pod scheduled with the migrated runtimePatches, training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade baseline). 2-replica Deployment with `schedulerName: kai-scheduler` + DRA `ResourceClaimTemplate` referencing `gpu.nvidia.com` also scheduled cleanly with `priorityClassName: train` (each replica got its own H100 via DRA). Verified locally: $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1 $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/... ./pkg/bundler/deployer/helm/...
|
This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes. |
Summary
Use
workflow_runpattern to post coverage comments on fork PRs, fixing the 403 permission error that was causing CI failures.Motivation / Context
Fork PRs have restricted
GITHUB_TOKENpermissions that prevent posting comments directly. The previous implementation tried to post PR comments usingactions/github-scriptduring thepull_requestevent, which fails with "Resource not accessible by integration" (HTTP 403) for fork PRs.This change uses the
workflow_runpattern recommended by GitHub Security Lab:workflow_runtriggered workflow posts comment (has write permissions)Reference: https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/
related to comments in #3
Related: N/A
Type of Change
Component(s) Affected
cmd/eidos,pkg/cli)cmd/eidosd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/).github/)Implementation Notes
go-coverageaction now saves coverage data (coverage %, threshold, pass/fail, PR number) as a JSON artifact instead of posting directlyon-push-comment.yamlworkflow triggers viaworkflow_runafter "On Push Qualification" completesworkflow_runevent runs in the context of the base repository, giving it write permissions even for fork PRsNote: The
workflow_runworkflow won't trigger until this PR is merged (it must exist on the default branch). For this PR specifically, coverage will appear in the Step Summary but not as a PR comment.Testing
# YAML validation yamllint .github/workflows/on-push-comment.yaml .github/actions/go-coverage/action.ymlRisk Assessment
Rollout notes: No migration needed. The change is backwards compatible - coverage is still calculated and shown in Step Summary. PR comments will start appearing for future PRs after this merges.
Checklist
make testwith-race)make lint)git commit -s) — DCO info