fix: use workflow_run for PR coverage comments on fork PRs by dims · Pull Request #5 · NVIDIA/aicr

dims · 2026-01-31T19:43:39Z

Summary

Use workflow_run pattern to post coverage comments on fork PRs, fixing the 403 permission error that was causing CI failures.

Motivation / Context

Fork PRs have restricted GITHUB_TOKEN permissions that prevent posting comments directly. The previous implementation tried to post PR comments using actions/github-script during the pull_request event, which fails with "Resource not accessible by integration" (HTTP 403) for fork PRs.

This change uses the workflow_run pattern recommended by GitHub Security Lab:

Main workflow uploads coverage data as artifact (read-only, safe)
Separate workflow_run triggered workflow posts comment (has write permissions)

Reference: https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/

related to comments in #3
Related: N/A

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation update
Refactoring (no functional changes)
Build/CI/tooling

Component(s) Affected

CLI (cmd/eidos, pkg/cli)
API server (cmd/eidosd, pkg/api, pkg/server)
Recipe engine / data (pkg/recipe)
Bundlers (pkg/bundler, pkg/component/*)
Collectors / snapshotter (pkg/collector, pkg/snapshotter)
Validator (pkg/validator)
Core libraries (pkg/errors, pkg/k8s)
Docs/examples (docs/, examples/)
Other: GitHub Actions workflows (.github/)

Implementation Notes

The go-coverage action now saves coverage data (coverage %, threshold, pass/fail, PR number) as a JSON artifact instead of posting directly
New on-push-comment.yaml workflow triggers via workflow_run after "On Push Qualification" completes
The workflow_run event runs in the context of the base repository, giving it write permissions even for fork PRs
Artifact retention is set to 1 day since it's only needed briefly for the comment workflow

Note: The workflow_run workflow won't trigger until this PR is merged (it must exist on the default branch). For this PR specifically, coverage will appear in the Step Summary but not as a PR comment.

Testing

# YAML validation
yamllint .github/workflows/on-push-comment.yaml .github/actions/go-coverage/action.yml

YAML syntax validated locally
CI will validate the workflow runs without the 403 error
Full coverage comment functionality will be testable after merge

Risk Assessment

Low — Isolated change, well-tested, easy to revert
Medium — Touches multiple components or has broader impact
High — Breaking change, affects critical paths, or complex rollout

Rollout notes: No migration needed. The change is backwards compatible - coverage is still calculated and shown in Step Summary. PR comments will start appearing for future PRs after this merges.

Checklist

Tests pass locally (make test with -race)
Linter passes (make lint)
I did not skip/disable tests to make CI green
I added/updated tests for new functionality — N/A (CI workflow change)
I updated docs if user-facing behavior changed — N/A (internal CI)
Changes follow existing patterns in the codebase
Commits are signed off (git commit -s) — DCO info

Copilot

Pull request overview

This PR implements the workflow_run pattern to fix coverage comment posting on fork PRs. Fork PRs have restricted GITHUB_TOKEN permissions that prevent direct comment posting, so this change splits the workflow into two parts: the main workflow uploads coverage data as an artifact (read-only operation), and a separate workflow_run-triggered workflow posts the comment with write permissions. This follows the secure pattern recommended by GitHub Security Lab.

Changes:

Added new workflow_run workflow (on-push-comment.yaml) that downloads coverage artifacts and posts PR comments with appropriate write permissions
Modified go-coverage action to save coverage data as an artifact instead of posting comments directly
Removed direct PR comment posting logic from go-coverage action (previously used github-script to post comments inline)

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
.github/workflows/on-push-comment.yaml	New workflow_run workflow that downloads coverage artifacts and posts comments on PRs with write permissions
.github/actions/go-coverage/action.yml	Removed direct PR comment posting, now saves coverage data as artifact for workflow_run to consume

Fork PRs have restricted GITHUB_TOKEN permissions that prevent posting comments directly. This change uses the workflow_run pattern: 1. Main workflow uploads coverage data as artifact (read-only safe) 2. Separate workflow_run triggered workflow posts comment (write perms) This is the recommended secure pattern per GitHub Security Lab: https://securitylab.github.com/resources/github-actions-preventing-pwn-requests/ Fixes: NVIDIA/aicr#3 Co-Authored-By: Claude Opus 4.5 <[email protected]> Signed-off-by: Davanum Srinivas <[email protected]>

mchmarny

/lgtm

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Add three new validation steps to the H100 inference test: - Inference Gateway (#6): verify GatewayClass accepted and Gateway programmed with inference extension CRDs present - Accelerator & AI Service Metrics (#4/#5): verify DCGM Exporter metrics, Prometheus scraping, and custom metrics API availability - Secure Accelerator Access (#3): verify GPU access is DRA-mediated (no hostPath, no device plugin), with proper container security Also adds diagnostics for gateway, metrics, and DRA state on failure. Signed-off-by: Davanum Srinivas <[email protected]>

Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each. Components bumped: kai-scheduler v0.13.0 -> v0.14.1 kubeflow-trainer 2.1.0 -> 2.2.0 kai-scheduler — chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3): KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream. v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged. kubeflow-trainer — chart bump and validator fallback URL update (NVIDIA#698 follow-up NVIDIA#5): The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one. v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag. Verified locally: $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1 Pulled. $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 Pulled. $ aicr recipe --service eks --accelerator h100 --intent training \ --os ubuntu --platform kubeflow -o recipe.yaml $ aicr bundle -r recipe.yaml -o /tmp/bundle ... succeeds for both kai-scheduler and kubeflow-trainer components. Refs: NVIDIA#698

Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each. Components bumped: kai-scheduler v0.13.0 -> v0.14.1 kubeflow-trainer 2.1.0 -> 2.2.0 kai-scheduler — chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3): KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream. v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged. kubeflow-trainer — chart bump, validator fallback URL update, and demo migration to the new RuntimePatches API (NVIDIA#698 follow-up NVIDIA#5): The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one. v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag. v2.2.0 ships a breaking API change to TrainJob: `podTemplateOverrides` is replaced by `runtimePatches` (kubeflow/trainer#3309). The CRD still admits the old field name for compat, but the controller no longer applies it — pods come out with no override fields, and on AICR's tainted GPU nodes the `tolerations: [{operator: Exists}]` shorthand the demo previously used silently no-ops, leaving pods Pending. The `pytorch-mnist` demo TrainJob in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to the new shape: spec: runtimePatches: - manager: aicr.nvidia.com/demo trainingRuntimeSpec: template: spec: replicatedJobs: - name: node template: spec: template: spec: nodeSelector: {nodeGroup: gpu-worker} tolerations: - {key: dedicated, operator: Equal, value: worker-workload, effect: NoSchedule} - {key: dedicated, operator: Equal, value: worker-workload, effect: NoExecute} Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade: TrainJob admitted, pod scheduled to the GPU node with the expected tolerations + nodeSelector, training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade baseline). Verified locally: $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1 $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/...

Two Phase-2 follow-ups from NVIDIA#698, batched together because both are small chart-pin changes coupled to a single non-pin tweak each. Components bumped: kai-scheduler v0.13.0 -> v0.14.1 kubeflow-trainer 2.1.0 -> 2.2.0 kai-scheduler — chart bump and OCI registry namespace migration (NVIDIA#698 follow-up NVIDIA#3): KAI-Scheduler was transferred from the NVIDIA org to its own `kai-scheduler` org and chart publishing moved with it. The old namespace `oci://ghcr.io/nvidia/kai-scheduler` is frozen at v0.13.0; the new namespace `oci://ghcr.io/kai-scheduler/kai-scheduler` carries the full release stream. v0.14.1 verified clean: 41/41 templates and identical kinds/counts vs v0.13.0; only values.yaml addition is an opt-in `vpa:` block (`enabled: false` default). Our customizations (`global.tolerations`, `admission.gpuPodRuntimeClassName`, `postCleanup.enabled`) all still apply unchanged. kubeflow-trainer — chart bump, validator fallback URL update, demo migration to RuntimePatches, and ClusterTrainingRuntime alignment (NVIDIA#698 follow-up NVIDIA#5): The chart pin in `recipes/registry.yaml` and the hardcoded fallback archive URL in `validators/performance/trainer_lifecycle.go` are coupled: the validator's no-CRD install path downloads `https://github.com/kubeflow/trainer/archive/refs/tags/<version>.tar.gz` and applies the `manifests/overlays/manager` kustomize. If the chart pin moves but the validator URL doesn't, the fallback installs the old release while the chart deploys the new one. v2.2.0 archive layout is unchanged from v2.1.0 (same `manifests/overlays/manager` kustomize, same `trainjobs.trainer.kubeflow.org/v1alpha1` CRD); the only difference is the controller-manager image tag. v2.2.0 ships two breaking API changes that touch AICR: 1. PodTemplateOverrides → RuntimePatches (kubeflow/trainer#3309). The CRD still admits the old field for compat but the v2.2 controller no longer applies it. The pytorch-mnist demo TrainJob in `demos/cuj1-eks.md` and `demos/cuj1-gke.md` is migrated to the `runtimePatches` shape with `manager: aicr.nvidia.com/demo` and explicit per-cluster scheduling (the EKS demo carries the AICR-standard `dedicated=worker-workload` tolerations + NoExecute effect; the GKE demo carries `dedicated=gpu-workload:NoSchedule` and `nvidia.com/gpu=present:NoSchedule` to match the rest of the GKE flow). 2. mlPolicy.torch.numProcPerNode removal (kubeflow/trainer#3239). Upstream removed the field from the Torch policy because it now infers parallelism from the container's `nvidia.com/gpu` limit. `mlPolicy.mpi.numProcPerNode` is unaffected, so the existing MPI test fixtures stay as-is. AICR's `torch-distributed` ClusterTrainingRuntime is updated from `mlPolicy.torch: { numProcPerNode: auto }` to `mlPolicy.torch: {}`, matching the v2.2.0 reference runtime. Validated end-to-end on a real EKS H100 cluster (aicr1) post-upgrade: demo TrainJob admitted, pod scheduled with the migrated runtimePatches, training completed in 2m39s with accuracy=0.7413 (matches pre-upgrade baseline). 2-replica Deployment with `schedulerName: kai-scheduler` + DRA `ResourceClaimTemplate` referencing `gpu.nvidia.com` also scheduled cleanly with `priorityClassName: train` (each replica got its own H100 via DRA). Verified locally: $ helm pull oci://ghcr.io/kai-scheduler/kai-scheduler/kai-scheduler --version v0.14.1 $ helm pull oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 2.2.0 $ make tidy && make lint && go test -count=1 ./pkg/recipe/... ./validators/performance/... ./pkg/bundler/deployer/helm/...

github-actions · 2026-05-02T06:37:25Z

This pull request has been automatically locked since it has been closed for 90 days with no further activity. Please open a new pull request for related changes.

Copilot AI review requested due to automatic review settings January 31, 2026 19:43

Copilot started reviewing on behalf of dims January 31, 2026 19:44 View session

Copilot AI reviewed Jan 31, 2026

View reviewed changes

Comment thread .github/workflows/on-push-comment.yaml

dims force-pushed the fix/fork-pr-coverage-comments branch from 195a121 to 6697c70 Compare January 31, 2026 19:47

dims assigned mchmarny Jan 31, 2026

Merge branch 'main' into fix/fork-pr-coverage-comments

4acd178

Copilot AI review requested due to automatic review settings January 31, 2026 20:08

mchmarny approved these changes Jan 31, 2026

View reviewed changes

Copilot started reviewing on behalf of mchmarny January 31, 2026 20:09 View session

mchmarny added the enhancement label Jan 31, 2026

mchmarny linked an issue Jan 31, 2026 that may be closed by this pull request

[Feature]: Configure copy-pr-bot #4

Closed

1 task

Copilot AI reviewed Jan 31, 2026

View reviewed changes

Comment thread .github/workflows/on-push-comment.yaml

dims merged commit f91a5b3 into NVIDIA:main Jan 31, 2026
9 checks passed

dims mentioned this pull request Jan 31, 2026

fix: add actions:read permission for artifact download #6

Merged

25 tasks

dims mentioned this pull request Feb 20, 2026

chore: improve consistency across GPU CI workflows #160

Merged

6 tasks

This was referenced Feb 20, 2026

feat(ci): add CNCF AI conformance validations to inference workflow #162

Merged

feat(validator): add Go-based CNCF AI conformance checks #180

Merged

mchmarny mentioned this pull request Mar 9, 2026

feat(validation): container-per-validator execution engine #290

Merged

9 tasks

lockwobr mentioned this pull request Apr 24, 2026

undeploy.sh post-flight leaks "Please enter Username:" into terminating-namespaces WARNING #684

Closed

github-actions Bot locked as resolved and limited conversation to collaborators May 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: use workflow_run for PR coverage comments on fork PRs#5

fix: use workflow_run for PR coverage comments on fork PRs#5
dims merged 2 commits into
NVIDIA:mainfrom
dims:fix/fork-pr-coverage-comments

dims commented Jan 31, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

mchmarny left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

dims commented Jan 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Motivation / Context

Type of Change

Component(s) Affected

Implementation Notes

Testing

Risk Assessment

Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

mchmarny left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented May 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dims commented Jan 31, 2026 •

edited

Loading