fix(ci): shard Tier 3 KWOK matrix to stay under 256-config cap #1173

Merged
njhensley merged 7 commits into NVIDIA:main from njhensley:ci/shard-kwok-tier3-matrix
Jun 3, 2026

@njhensley

Member

Summary

Shards the Tier 3 KWOK recipe × deployer matrix into batches fanned out to a reusable workflow, so the job stays under GitHub Actions' hard cap of 256 matrix configurations per job.

Motivation / Context

The test-tier3 job crossed every testable recipe with all four deployers (72 × 4 = 288), exceeding GitHub's 256-config limit. GitHub rejected the matrix before any leg ran, so the post-merge/nightly full-matrix backstop silently stopped executing.
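
The arithmetic behind the failure can be checked in a few lines. This is only a worked calculation using the numbers quoted in this PR (72 recipes, 4 deployers, the 256-config cap, and the batch size of 200); the variable names are illustrative and do not come from the workflow.

```python
import math

GITHUB_MATRIX_CAP = 256  # GitHub Actions cap on matrix configurations per job
recipes = 72             # testable recipes at the time of this PR
deployers = 4            # deployers crossed with every recipe
batch_size = 200         # TIER3_BATCH_SIZE chosen in this PR

total = recipes * deployers
print(total, total > GITHUB_MATRIX_CAP)  # 288 True
print(math.ceil(total / batch_size))     # 2 shards needed today
```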

Fixes: #1171
Related: #1172 (follow-up: deployer-list / step-block de-duplication across tiers)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • Docs/examples (docs/, examples/)
  • Other: CI workflows (.github/workflows/kwok-recipes.yaml, .github/workflows/kwok-tier3-shard.yaml)

Implementation Notes

  • discover now builds the {recipe, deployer} cross-product and chunks it into batches of TIER3_BATCH_SIZE (200, with headroom under 256), emitting a new tier3_batches output of {id, pairs} objects. A fail-closed guard errors loudly if the batch size is ever raised past 256, rather than resurfacing GitHub's opaque rejection mid-fan-out.
  • test-tier3 is now a thin matrix over those batches that fans each one out to the new kwok-tier3-shard.yaml reusable workflow (matching the repo's existing workflow_call pattern), which expands its batch as its own ≤ 256 matrix. Batch count grows automatically as overlays are added — no manual job duplication.
  • Concurrency: the group gains a -${{ matrix.batch.id }} suffix so every shard of a single run lands in its own group and runs in parallel; the SHA key still keeps successive merges independent (ADR-003's no-cancel-on-successive-merges guarantee preserved). The inner pair-matrix in the shard workflow has no concurrency block, so it is plainly parallel.
  • Edge cases: workflow_dispatch emits tier3_batches=[] so the output contract is total; test-tier3's if: guard is keyed on tier3_batches (the output it actually consumes).
  • ADR-003 amended: updated the concurrency snippet, added a "Tier 3 matrix sharding (GitHub 256-config cap)" subsection, and refreshed the structure diagram.
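
The chunking described in the notes above can be sketched in Python. This is a minimal illustration, not the workflow's actual script: the function name `build_batches`, the `recipe`/`deployer` key names, and the placeholder recipe and deployer names are assumptions; only the `{id, pairs}` shape, the batch size of 200, and the 256 cap come from this PR.

```python
import json
from itertools import product

GITHUB_MATRIX_CAP = 256  # GitHub Actions cap on matrix configurations per job


def build_batches(recipes, deployers, batch_size=200):
    """Chunk the recipe x deployer cross-product into {id, pairs} batches."""
    # Fail closed: reject a batch size that GitHub would refuse later.
    if not 1 <= batch_size <= GITHUB_MATRIX_CAP:
        raise ValueError(
            f"batch size {batch_size} must be between 1 and {GITHUB_MATRIX_CAP}"
        )
    pairs = [{"recipe": r, "deployer": d} for r, d in product(recipes, deployers)]
    return [
        {"id": start // batch_size, "pairs": pairs[start:start + batch_size]}
        for start in range(0, len(pairs), batch_size)
    ]


# Placeholder inputs (hypothetical names, sized like the real matrix).
recipes = [f"recipe-{n}" for n in range(72)]
deployers = ["deployer-a", "deployer-b", "deployer-c", "deployer-d"]

batches = build_batches(recipes, deployers)
print([(b["id"], len(b["pairs"])) for b in batches])  # [(0, 200), (1, 88)]
print(json.dumps(build_batches([], deployers)))       # []
```

With no recipes the function returns an empty list, which mirrors the `tier3_batches=[]` contract described for workflow_dispatch.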

Testing

This is a CI-workflow + docs change (no Go); the merge gate for workflows is actionlint + yamllint, run directly:

# actionlint pinned to the CI version (v1.7.11), exact CI flags — clean repo-wide
actionlint -color -shellcheck=

# yamllint with repo config — clean
yamllint -c .yamllint.yaml .github/workflows/kwok-recipes.yaml .github/workflows/kwok-tier3-shard.yaml

Batching logic was validated locally with synthetic inputs (0, 1, 50, 72, 200, 201, 400 recipes × 4 deployers): no dropped or duplicated pairs, contiguous integer batch ids, every batch ≤ 200. A three-persona review (CI/security, distributed-systems correctness, maintainability) found no blockers, and all feedback was applied.
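
A check along these lines could reproduce the stated invariants. It is a sketch under assumptions: `chunk_pairs` and `check_invariants` are hypothetical helpers standing in for the real batching step, and the recipe and deployer names are synthetic.

```python
from itertools import product


def chunk_pairs(recipes, deployers, batch_size):
    """Hypothetical stand-in for the batching step: contiguous fixed-size chunks."""
    pairs = list(product(recipes, deployers))
    return [
        {"id": start // batch_size, "pairs": pairs[start:start + batch_size]}
        for start in range(0, len(pairs), batch_size)
    ]


def check_invariants(n_recipes, batch_size=200):
    recipes = [f"r{n}" for n in range(n_recipes)]
    deployers = ["d0", "d1", "d2", "d3"]
    batches = chunk_pairs(recipes, deployers, batch_size)

    # No dropped or duplicated pairs: the flattened batches equal the cross-product.
    flattened = [pair for batch in batches for pair in batch["pairs"]]
    assert flattened == list(product(recipes, deployers))
    # Batch ids are contiguous integers starting at 0.
    assert [batch["id"] for batch in batches] == list(range(len(batches)))
    # Every batch is non-empty and within the batch size.
    assert all(1 <= len(batch["pairs"]) <= batch_size for batch in batches)
    return len(batches)


counts = {n: check_invariants(n) for n in (0, 1, 50, 72, 200, 201, 400)}
print(counts)  # {0: 0, 1: 1, 50: 1, 72: 2, 200: 4, 201: 5, 400: 8}
```

The 200 and 201 cases sit on either side of a batch boundary (800 versus 804 pairs), which is where an off-by-one in the slicing would show up.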

Note: Tier 3 only runs on push-to-main and the nightly schedule (never on PRs, by design), so the sharded matrix is first exercised end-to-end on the post-merge run.

Risk Assessment

  • Low — Isolated CI/docs change, no product code; easy to revert (single commit).

Rollout notes: No migration. KWOK Test Summary remains the required check (per ADR-003), so the renamed/nested Tier 3 matrix job names do not affect branch protection.

Checklist

  • Tests pass locally (make test with -race) — N/A (no Go changes); ran actionlint + yamllint instead
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality — N/A (CI workflow change)
  • I updated docs if user-facing behavior changed (ADR-003)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

The test-tier3 job crossed every testable recipe with all four deployers
(72 × 4 = 288), exceeding GitHub Actions' hard limit of 256 matrix
configurations per job, so the job was rejected before any leg ran.

discover now builds the {recipe, deployer} cross-product and chunks it
into batches of TIER3_BATCH_SIZE (200) via a new tier3_batches output.
test-tier3 becomes a thin matrix over those batches, fanning each out to
a new kwok-tier3-shard.yaml reusable workflow that expands its batch as
its own (<= 256) matrix. Batch count grows automatically as overlays are
added; a fail-closed guard errors if the batch size is ever raised past
256 rather than resurfacing GitHub's opaque rejection.

The concurrency group gains a -${{ matrix.batch.id }} suffix so every
shard of one run lands in its own group and runs in parallel, while the
SHA key still keeps successive merges independent (ADR-003 no-cancel
guarantee preserved).
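
Why the suffix matters can be modeled with plain strings. GitHub Actions allows only one running job per concurrency group, so shards that shared a group would queue behind or displace each other. The key format below (workflow name, SHA, batch id) and the sample values are assumptions for illustration; the real group expression lives in kwok-recipes.yaml.

```python
def concurrency_group(workflow, sha, batch_id):
    """Hypothetical group key: workflow name, commit SHA, then shard id."""
    return f"{workflow}-{sha}-{batch_id}"


# Two shards each for two successive merges (made-up SHAs).
run_a = {concurrency_group("kwok-tier3", "aaa111", i) for i in range(2)}
run_b = {concurrency_group("kwok-tier3", "bbb222", i) for i in range(2)}

print(len(run_a))               # 2: each shard of one run has its own group
print(run_a.isdisjoint(run_b))  # True: successive merges never share a group
```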

ADR-003 amended to record the sharding decision and updated concurrency
snippet. Deferred deployer-list/step-block de-duplication tracked in
NVIDIA#1172.

Fixes NVIDIA#1171

@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Walkthrough

This PR shards the Tier 3 recipe×deployer matrix to stay under GitHub’s 256-configuration limit. It adds a reusable workflow (.github/workflows/kwok-tier3-shard.yaml) that accepts a JSON list of {recipe,deployer} pairs and runs kwok-test per pair. The discover job now emits chunked tier3_batches (with batch-size validation and initialization for workflow_dispatch). test-tier3 is refactored to call the shard workflow once per batch via a batch matrix, and the ADR is updated to document concurrency and sharding.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'fix(ci): shard Tier 3 KWOK matrix to stay under 256-config cap' directly describes the main change: implementing matrix sharding to address the GitHub Actions 256-configuration limit. |
| Description check | ✅ Passed | The description comprehensively explains the fix, motivation, implementation details, testing approach, and risk assessment, all of which relate to the changeset. |
| Linked Issues check | ✅ Passed | All objectives from issue #1171 are met: the PR shards the matrix into batches [#1171], enforces a 256-config guard [#1171], implements explicit {recipe, deployer} pair emission [#1171], and preserves coverage as overlays grow [#1171]. |
| Out of Scope Changes check | ✅ Passed | All changes scope to the stated objectives: workflow sharding logic, reusable shard workflow, ADR documentation update, and CI configuration, with no unrelated code modifications. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/003-scaling-recipe-tests.md`:
- Line 159: The fenced workflow diagram block in
docs/design/003-scaling-recipe-tests.md (the triple-backtick block at the
workflow starting on line 159) is missing a language tag and triggers MD040;
update the opening fence from ``` to ```text so the block becomes a text-coded
fenced block (leave the content and the closing ``` unchanged) to satisfy the
linter and keep CI green.
📥 Commits

Reviewing files that changed from the base of the PR and between f0490fb and e627e97.

📒 Files selected for processing (3)
  • .github/workflows/kwok-recipes.yaml
  • .github/workflows/kwok-tier3-shard.yaml
  • docs/design/003-scaling-recipe-tests.md

Comment thread docs/design/003-scaling-recipe-tests.md
mchmarny previously approved these changes Jun 3, 2026

@mchmarny mchmarny left a comment

Member

Clean fix to a real problem. The reusable-workflow + batched-matrix pattern is the right idiom for the 256-config cap, and the implementation gets the details right: fail-closed batch-size guard, tier3_batches=[] on workflow_dispatch keeps the output contract total, if: condition keyed on the output actually consumed, concurrency group's matrix.batch.id suffix preserves ADR-003's no-cancel-on-successive-merges contract while letting shards within a single run parallelize, and path triggers correctly add the new shard workflow.

ADR-003 is updated in the same PR (diagram + new sharding subsection) and the deployer-list duplication is already tracked in #1172. CI green on the pre-merge commit.

Two inline notes — both positive observations, no asks. The first true end-to-end exercise of the sharded matrix is on the post-merge run since Tier 3 only fires on push-to-main + nightly; worth eyeballing the GH UI / summary aggregation once it lands.

LGTM.

Comment thread .github/workflows/kwok-recipes.yaml
Comment thread .github/workflows/kwok-recipes.yaml
@github-actions

github-actions Bot commented Jun 3, 2026

Contributor

@njhensley this PR now has merge conflicts with main. Please rebase to resolve them.

@njhensley njhensley enabled auto-merge (squash) June 3, 2026 22:23
@njhensley njhensley merged commit 7068779 into NVIDIA:main Jun 3, 2026
116 checks passed
@njhensley njhensley deleted the ci/shard-kwok-tier3-matrix branch June 23, 2026 16:21
Development

Successfully merging this pull request may close these issues.

[Bug]: Tier 3 KWOK matrix exceeds GitHub's 256-configuration limit (288 configs)