Skip to content

fix(kwok): remove broken --set array-index lines from validate-scheduling.sh#643

Merged
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/bundler-set-array-indexing
Apr 22, 2026
Merged

fix(kwok): remove broken --set array-index lines from validate-scheduling.sh#643
yuanchen8911 merged 1 commit into
NVIDIA:mainfrom
yuanchen8911:fix/bundler-set-array-indexing

Conversation

@yuanchen8911

@yuanchen8911 yuanchen8911 commented Apr 22, 2026

Copy link
Copy Markdown
Contributor

Summary

Remove four dead --set networkoperator:operator.tolerations[N]... lines from kwok/scripts/validate-scheduling.sh. These lines tried to append a KWOK-test-specific toleration using Helm's list-index syntax, but aicr bundle has never supported that syntax and never will as part of this PR.

Motivation / Context

KWOK Tier 1 / Tier 2 CI jobs fail at the bundle step on any PR whose path-trigger fires:

invalid path segment "tolerations[2]" in "networkoperator:operator.tolerations[2].key=...":
must contain only alphanumeric, dot, hyphen, or underscore characters

The surface symptom is loud rejection at parse time (added by #603), but the underlying cause goes deeper: even if the parser accepted [N], the downstream path walker in pkg/component/overrides.go (getOrCreateNestedMap, called from setMapValueByPath and thus from ApplyMapOverrides) splits on . only. tolerations[2] becomes a literal map key — the generated values file emits {operator: {"tolerations[2]": {key: ...}}}, which Helm silently ignores while keeping the upstream tolerations list's defaults.

So the 4 --set lines have never done what the KWOK script thought they were doing. Before #603 they produced silently-wrong bundles; after #603 they fail loudly. Either way, nothing depends on them — the adjacent --accelerated-node-toleration kwok.x-k8s.io/node=fake:NoSchedule already propagates a tolerate-the-fake-node rule to every accelerated workload (including network-operator) via each component's nodeScheduling overlay.

The fix is to drop the dead lines. No functional regression.

First revision of this PR attempted to extend safePathPattern to permit Helm-style [N] suffixes. CodeRabbit flagged the error-message and test-preservation gaps in that approach (r3126044835, r3126044843), and Codex's subsequent review (top-level comment) correctly pointed out that the downstream walker in ApplyMapOverrides / writeClusterValuesFile would have treated tolerations[2] as a literal map key regardless — turning a loud parse-time failure into a silent bundle-generation bug. Reverted the regex change, instead fixing the actual source of the broken invocation (the KWOK script).

Fixes: N/A
Related: #641 (inference-perf validator — blocked on this CI unblock)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)

Component(s) Affected

  • Other: KWOK test harness (kwok/scripts/validate-scheduling.sh)
  • Tests: pkg/bundler/config/config_test.go (negative test documenting the parse rejection)

Implementation Notes

Script change (the real fix)

 --set "skyhook-customizations:enabled=false" \
-    --set "networkoperator:operator.tolerations[2].key=aicr.nvidia.com/kwok-test" \
-    --set "networkoperator:operator.tolerations[2].operator=Equal" \
-    --set "networkoperator:operator.tolerations[2].value=true" \
-    --set "networkoperator:operator.tolerations[2].effect=NoSchedule" \
 --set "dynamoplatform:etcd.persistence.enabled=false" \

Negative test (defensive documentation)

Adds helm-style array index rejected to TestParseValueOverrides. It explicitly asserts that --set foo:bar.tolerations[N].key=x syntax remains rejected at parse time, with a comment explaining that the downstream path walker can't interpret [N] as a list index. The test exists so a future contributor doesn't re-loosen the regex without first teaching getOrCreateNestedMap (and the dynamic-values extractor) to handle indexed paths — a substantive feature intentionally out of scope here.

safePathPattern unchanged

pkg/bundler/config/component_path.go matches origin/main exactly — the regex from #603 is preserved.

Testing

make qualify

All green. Targeted:

go test -run TestParseValueOverrides -v ./pkg/bundler/config/...
# 12 subtests PASS, including the new helm-style_array_index_rejected

Risk Assessment

Rollout notes: None. The KWOK script's scheduling semantics are preserved via the existing --accelerated-node-toleration lines.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed — N/A (CI-only script + internal test)
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: f167504c-85f2-4f01-a98c-b0be0ae67233

📥 Commits

Reviewing files that changed from the base of the PR and between 4412b3e and 9fde739.

📒 Files selected for processing (2)
  • kwok/scripts/validate-scheduling.sh
  • pkg/bundler/config/config_test.go
💤 Files with no reviewable changes (1)
  • kwok/scripts/validate-scheduling.sh

📝 Walkthrough

Walkthrough

A test in pkg/bundler/config/config_test.go is extended with a subtest that verifies the parser rejects Helm-style bracketed array index syntax in override paths (e.g., tolerations[2], env[3]). Separately, kwok/scripts/validate-scheduling.sh had four Helm --set overrides using networkoperator.operator.tolerations[2] removed from bundle generation. No exported or public API signatures were changed; all edits are test and script-level.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and clearly describes the main change: removing broken array-index syntax lines from a shell script that attempted to use unsupported Helm syntax.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, explaining the motivation, implementation, and testing for removing dead --set lines and adding a negative test.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/config/component_path.go`:
- Around line 27-31: The validation rejects paths using safePathPattern but its
error text still claims only alphanumeric/dot/hyphen/underscore are allowed;
update the validation error message produced by the component-path validator
(the code that checks safePathPattern) to explicitly mention that a single
trailing numeric bracket index like “[N]” is permitted (e.g., "allowed
characters: letters, digits, dot, hyphen, underscore, and an optional trailing
numeric index in brackets like [N]"). Ensure the message references
safePathPattern's accepted form so users debugging --set paths see the permitted
“[N]” syntax.

In `@pkg/bundler/config/config_test.go`:
- Around line 554-571: The test currently only checks
ParseValueOverrides([]string{in}) succeeds; update it to also assert the parsed
override preserves the exact indexed component path and value used downstream:
after calling ParseValueOverrides, find the returned override (from the result
slice) and assert override.ComponentPath.Path equals the original indexed path
segment (e.g., "operator.tolerations[2]" or "driver.env[3]" or "items[10]") and
that override.Value (or the field holding the parsed RHS) equals the expected
value (e.g., "aicr.nvidia.com/kwok-test", "Equal", "CUDA_VISIBLE_DEVICES", "x");
use the ParseValueOverrides return structures and their ComponentPath.Path and
Value fields to perform these equality checks so the test fails if indexing was
normalized/dropped.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 04069684-d8dc-4b39-b7de-ca9753987341

📥 Commits

Reviewing files that changed from the base of the PR and between 16d670d and 8cdc822.

📒 Files selected for processing (2)
  • pkg/bundler/config/component_path.go
  • pkg/bundler/config/config_test.go

Comment thread pkg/bundler/config/component_path.go Outdated
Comment thread pkg/bundler/config/config_test.go Outdated
@yuanchen8911 yuanchen8911 force-pushed the fix/bundler-set-array-indexing branch from 8cdc822 to ba2a182 Compare April 22, 2026 18:34
@yuanchen8911

Copy link
Copy Markdown
Contributor Author

Addressed both CodeRabbit findings in force-pushed ba2a182d:

  1. Error message updated (component_path.go:80) — now reads:

    invalid path segment "tolerations[2]" in "...": must start with alphanumeric and contain only alphanumeric, dot, hyphen, or underscore characters, with an optional trailing numeric array index [N]

  2. Test strengthened (config_test.go) — helm-style array index accepted now asserts exact component / path / value preservation via lookup(result, component, path), so any future normalization that silently drops [N] would fail the test instead of slipping through.

@github-actions github-actions Bot added size/M and removed size/S labels Apr 22, 2026
mchmarny
mchmarny previously approved these changes Apr 22, 2026
@yuanchen8911 yuanchen8911 force-pushed the fix/bundler-set-array-indexing branch from ba2a182 to b59f42f Compare April 22, 2026 18:42
@github-actions github-actions Bot added size/S and removed size/M labels Apr 22, 2026
@yuanchen8911 yuanchen8911 requested a review from mchmarny April 22, 2026 18:43
@yuanchen8911 yuanchen8911 changed the title fix(bundler): allow Helm-style array indexing in --set paths fix(kwok): remove broken --set array-index lines from validate-scheduling.sh Apr 22, 2026
@yuanchen8911 yuanchen8911 enabled auto-merge (squash) April 22, 2026 18:43

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/bundler/config/config_test.go`:
- Around line 562-570: The test currently iterates a simple []string and only
asserts an error for Helm array-index syntax; refactor it into a table-driven
test (slice of structs) where each case has explicit fields like component,
path, value and expectErr, call ParseValueOverrides for each case, assert that
when expectErr==true an error is returned, and when expectErr==false the
returned structure preserves component, path and value exactly (no
normalization) — update the loop over cases to use these struct fields and
explicit assertions against the parsed result from ParseValueOverrides.
- Around line 554-571: The test asserts Helm-style array-index syntax is
rejected but the PR intends to restore it; update the path validation and test
accordingly: change the safePathPattern in component_path.go to allow an
optional array index suffix (e.g., permit a (\[[0-9]+\])? after each path
segment) so ParseValueOverrides accepts tokens like "tolerations[2]" or
"env[3]", and convert the config_test.go case to a table-driven test that
expects no error for those inputs (use ParseValueOverrides to assert (err !=
nil) matches wantErr). Reference safePathPattern and ParseValueOverrides when
making the changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 3806edc9-0be1-4478-b899-b03bdfb0abd4

📥 Commits

Reviewing files that changed from the base of the PR and between ba2a182 and b59f42f.

📒 Files selected for processing (2)
  • kwok/scripts/validate-scheduling.sh
  • pkg/bundler/config/config_test.go
💤 Files with no reviewable changes (1)
  • kwok/scripts/validate-scheduling.sh

Comment thread pkg/bundler/config/config_test.go
Comment thread pkg/bundler/config/config_test.go Outdated
mchmarny
mchmarny previously approved these changes Apr 22, 2026
…ling.sh

The `validate-scheduling.sh` script attempts to append a KWOK-test-specific
toleration to `network-operator` via Helm-style `--set` array-index syntax:

  --set networkoperator:operator.tolerations[2].key=aicr.nvidia.com/kwok-test
  --set networkoperator:operator.tolerations[2].operator=Equal
  --set networkoperator:operator.tolerations[2].value=true
  --set networkoperator:operator.tolerations[2].effect=NoSchedule

These lines have never worked. `aicr bundle --set` delegates path resolution
to `pkg/component/overrides.go:getOrCreateNestedMap`, which splits the path
on `.` only — `tolerations[2]` becomes a literal map key, not a list index,
so the generated values emit `{tolerations[2]: {key: ..., ...}}` — a string
key Helm silently ignores while keeping the upstream `tolerations` list's
defaults. PR NVIDIA#603 later tightened `safePathPattern` to reject the `[N]`
segment outright, so the failure turned from silent-wrong-bundle into
loud-parse-error and KWOK Tier 1/2 CI jobs now fail at the bundle step.

Drop the 4 dead `--set` lines. The adjacent `--accelerated-node-toleration
kwok.x-k8s.io/node=fake:NoSchedule` already propagates a tolerate-the-fake-
node rule to every accelerated workload (network-operator included) via
the recipe's per-component nodeScheduling overlay, so nothing about the
scheduling semantics of the KWOK test actually regresses — we only remove
lines that were producing misleading bundle output to begin with.

An additional negative test `helm-style array index rejected` documents
why `foo:bar.tolerations[N].key=x` must stay rejected at parse time until
someone teaches the downstream path walker to interpret `[N]` as a list
index. That's a substantive feature (affects both `ApplyMapOverrides` and
the dynamic-values extractor) and is intentionally out of scope here.
@yuanchen8911 yuanchen8911 force-pushed the fix/bundler-set-array-indexing branch from 4412b3e to 9fde739 Compare April 22, 2026 20:21
@yuanchen8911 yuanchen8911 merged commit 64b8759 into NVIDIA:main Apr 22, 2026
43 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants