fix(ci): shard Tier 3 KWOK matrix to stay under 256-config cap#1173
Conversation
The test-tier3 job crossed every testable recipe with all four deployers
(72 × 4 = 288), exceeding GitHub Actions' hard limit of 256 matrix
configurations per job, so the job was rejected before any leg ran.
discover now builds the {recipe, deployer} cross-product and chunks it
into batches of TIER3_BATCH_SIZE (200) via a new tier3_batches output.
test-tier3 becomes a thin matrix over those batches, fanning each out to
a new kwok-tier3-shard.yaml reusable workflow that expands its batch as
its own (<= 256) matrix. Batch count grows automatically as overlays are
added; a fail-closed guard errors if the batch size is ever raised past
256 rather than resurfacing GitHub's opaque rejection.
The concurrency group gains a -${{ matrix.batch.id }} suffix so every
shard of one run lands in its own group and runs in parallel, while the
SHA key still keeps successive merges independent (ADR-003 no-cancel
guarantee preserved).
ADR-003 amended to record the sharding decision and updated concurrency
snippet. Deferred deployer-list/step-block de-duplication tracked in
NVIDIA#1172.
Fixes NVIDIA#1171
📝 WalkthroughWalkthroughThis PR shards the Tier 3 recipe×deployer matrix to stay under GitHub’s 256-configuration limit. It adds a reusable workflow (.github/workflows/kwok-tier3-shard.yaml) that accepts a JSON list of {recipe,deployer} pairs and runs kwok-test per pair. The discover job now emits chunked tier3_batches (with batch-size validation and initialization for workflow_dispatch). test-tier3 is refactored to call the shard workflow once per batch via a batch matrix, and the ADR is updated to document concurrency and sharding. Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design/003-scaling-recipe-tests.md`:
- Line 159: The fenced workflow diagram block in
docs/design/003-scaling-recipe-tests.md (the triple-backtick block at the
workflow starting on line 159) is missing a language tag and triggers MD040;
update the opening fence from ``` to ```text so the block becomes a text-coded
fenced block (leave the content and the closing ``` unchanged) to satisfy the
linter and keep CI green.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 859c99a1-b199-4870-80b4-eb4107115e5c
📒 Files selected for processing (3)
.github/workflows/kwok-recipes.yaml.github/workflows/kwok-tier3-shard.yamldocs/design/003-scaling-recipe-tests.md
mchmarny
left a comment
There was a problem hiding this comment.
Clean fix to a real problem. The reusable-workflow + batched-matrix pattern is the right idiom for the 256-config cap, and the implementation gets the details right: fail-closed batch-size guard, tier3_batches=[] on workflow_dispatch keeps the output contract total, if: condition keyed on the output actually consumed, concurrency group's matrix.batch.id suffix preserves ADR-003's no-cancel-on-successive-merges contract while letting shards within a single run parallelize, and path triggers correctly add the new shard workflow.
ADR-003 is updated in the same PR (diagram + new sharding subsection) and the deployer-list duplication is already tracked in #1172. CI green on the pre-merge commit.
Two inline notes — both positive observations, no asks. The first true end-to-end exercise of the sharded matrix is on the post-merge run since Tier 3 only fires on push-to-main + nightly; worth eyeballing the GH UI / summary aggregation once it lands.
LGTM.
|
@njhensley this PR now has merge conflicts with |
Summary
Shards the Tier 3 KWOK
recipe × deployermatrix into batches fanned out to a reusable workflow, so the job stays under GitHub Actions' hard cap of 256 matrix configurations per job.Motivation / Context
The
test-tier3job crossed every testable recipe with all four deployers (72 × 4 = 288), exceeding GitHub's 256-config limit. GitHub rejected the matrix before any leg ran, so the post-merge/nightly full-matrix backstop silently stopped executing.Fixes: #1171
Related: #1172 (follow-up: deployer-list / step-block de-duplication across tiers)
Type of Change
Component(s) Affected
docs/,examples/).github/workflows/kwok-recipes.yaml,.github/workflows/kwok-tier3-shard.yaml)Implementation Notes
discovernow builds the{recipe, deployer}cross-product and chunks it into batches ofTIER3_BATCH_SIZE(200, with headroom under 256), emitting a newtier3_batchesoutput of{id, pairs}objects. A fail-closed guard errors loudly if the batch size is ever raised past 256, rather than resurfacing GitHub's opaque rejection mid-fan-out.test-tier3is now a thin matrix over those batches that fans each one out to the newkwok-tier3-shard.yamlreusable workflow (matching the repo's existingworkflow_callpattern), which expands its batch as its own ≤ 256 matrix. Batch count grows automatically as overlays are added — no manual job duplication.-${{ matrix.batch.id }}suffix so every shard of a single run lands in its own group and runs in parallel; the SHA key still keeps successive merges independent (ADR-003's no-cancel-on-successive-merges guarantee preserved). The inner pair-matrix in the shard workflow has no concurrency block, so it is plainly parallel.workflow_dispatchemitstier3_batches=[]so the output contract is total;test-tier3'sif:guard is keyed ontier3_batches(the output it actually consumes).Testing
This is a CI-workflow + docs change (no Go); the merge gate for workflows is actionlint + yamllint, run directly:
Batching logic validated locally with synthetic inputs (0, 1, 50, 72, 200, 201, 400 recipes × 4 deployers): no dropped or duplicated pairs, contiguous integer batch ids, every batch ≤ 200. Reviewed via a three-persona review (CI/security, distributed-systems correctness, maintainability) — no blockers; all feedback applied.
Note: Tier 3 only runs on push-to-
mainand the nightly schedule (never on PRs, by design), so the sharded matrix is first exercised end-to-end on the post-merge run.Risk Assessment
Rollout notes: No migration.
KWOK Test Summaryremains the required check (per ADR-003), so the renamed/nested Tier 3 matrix job names do not affect branch protection.Checklist
make testwith-race) — N/A (no Go changes); ran actionlint + yamllint insteadmake lint)git commit -S)