Skip to content

feat(validator): co-locate ai-service-metrics with Prometheus#1066

Merged
mchmarny merged 7 commits into
NVIDIA:mainfrom
njhensley:feat/deterministic-validator-scheduling
May 27, 2026
Merged

feat(validator): co-locate ai-service-metrics with Prometheus#1066
mchmarny merged 7 commits into
NVIDIA:mainfrom
njhensley:feat/deterministic-validator-scheduling

Conversation

@njhensley

Copy link
Copy Markdown
Member

Summary

Add a dependencyAffinity field to the validator catalog and use it to pin the ai-service-metrics orchestrator pod to the node running Prometheus. Eliminates intermittent failures on multi-Security-Group clusters where the dial would otherwise traverse cross-SG ingress.

Motivation / Context

On multi-SG EKS clusters (e.g., DGXC GB200 with separate customer/system ENI subnets), ai-service-metrics was nondeterministically landing on nodes that could not dial the Prometheus service. Image-locality scoring decided placement once the validator image was cached, masking the underlying network-reachability constraint. The fix declares the dependency in the catalog and the deployer materializes it as a podAffinity term whose weight (100) dominates image-locality on first scheduling, then reinforces itself via locality on subsequent runs.

Fixes: #933
Related: N/A

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change
  • Documentation update
  • Refactoring
  • Build/CI/tooling

Component(s) Affected

  • CLI
  • API server
  • Recipe engine / data
  • Bundlers
  • Collectors / snapshotter
  • Validator (pkg/validator, pkg/api/validator/v1)
  • Core libraries
  • Docs/examples (docs/integrator/validator-extension.md)
  • Other: validator catalog (recipes/validators/catalog.yaml)

Implementation Notes

  • New DependencyAffinity type on catalog entries: componentRef, podLabelSelector, topologyKey (defaults to kubernetes.io/hostname), and requirement (preferred default, or required).
  • BuildOrchestratorAffinity resolves each dependency against the recipe's componentRefs:
    • required + missing/disabled/no-namespace → ErrCodeInvalidRequest (caller must abort before deploying).
    • preferred + missing → slog.Warn and skip the term (preserves behavior on flat networks).
    • Resolvable refs produce a PodAffinityTerm with the component's resolved namespace.
  • Pre-flight gate added to pkg/validator/validator.go: walks selected validators and validates dependencyAffinity against the recipe before any Job is deployed.
  • Deployer threads componentRefs into the JobPlan and renders the affinity into the orchestrator pod's applyconfigurations spec.
  • ai-service-metrics catalog entry uses requirement: preferred so flat-cluster recipes that omit kube-prometheus-stack still run (validator schedules wherever, warning logged).
  • Outer wrap in pkg/validator/validator.go:290 returns ErrCodeInternal over an inner ErrCodeInvalidRequest; minor cosmetic deviation from the no-double-wrap rule, no behavior impact, tracked as follow-up.

Testing

make qualify
go test ./pkg/api/validator/v1/... ./pkg/validator/...

Manually verified end-to-end on a GB200/EKS multi-SG cluster:

  1. Prometheus label sanitymonitoring/prometheus-kube-prometheus-prometheus-0 carries app.kubernetes.io/name=prometheus; selector matches.
  2. Deterministic placement — 4/4 consecutive validate --phase conformance runs landed the ai-service-metrics orchestrator on the Prometheus node. Injected Job spec confirmed: preferredDuringSchedulingIgnoredDuringExecution with weight: 100, matchLabels: app.kubernetes.io/name: prometheus, namespaces: [monitoring], topologyKey: kubernetes.io/hostname.
  3. Negative path — with a --data overlay flipping ai-service-metrics to requirement: required and a recipe missing kube-prometheus-stack from componentRefs, validate aborts at pre-flight: [INVALID_REQUEST] required dependencyAffinity component "kube-prometheus-stack" is not in the recipe's componentRefs. Zero Jobs deployed.

Coverage delta:

  • pkg/api/validator/v1: 67.4% → 73.6% (+6.2%)
  • pkg/validator/catalog: 97.4% → 97.5% (+0.1%)
  • pkg/validator/job: 12.9% → 15.7% (+2.8%)
  • pkg/validator: 25.1% → 24.8% (-0.3%, within noise)

Risk Assessment

  • Low — Isolated to validator orchestration. Default requirement: preferred preserves existing behavior on recipes without kube-prometheus-stack; the only validator using the new field today is ai-service-metrics. Easy to revert by removing the catalog entry's dependencyAffinity block.

Rollout notes: No migration required. Existing recipes work unchanged. Recipes wanting to enforce co-location for custom validators can opt-in via requirement: required.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S)

njhensley added 4 commits May 27, 2026 09:17
Validators can declare component-scoped pod affinity so the orchestrator
co-locates with the component it queries. Required deps fail closed when
absent from the recipe (ErrCodeInvalidRequest); preferred deps warn and
skip the term. Motivated by multi-Security-Group EKS clusters where
Prometheus dial reachability depends on which node the orchestrator
lands on (NVIDIA#933).
BuildJobPlan now returns (JobPlan, error) and renders affinity from
BuildOrchestratorAffinity; the renderer falls back to prefer-CPU NodeAffinity
when Affinity is nil so external callers stay compatible. Deployer threads
the resolved recipe's componentRefs through from ValidationInput so missing
required deps surface as ErrCodeInvalidRequest before any Job is applied.
Adds dependencyAffinity (preferred, kube-prometheus-stack) to the embedded
catalog so the orchestrator dial of Prometheus stays loopback/same-node on
clusters with asymmetric pod-to-pod reachability (DGXC EKS, multi-SG). Docs
the new field under integrator/validator-extension with the motivating case.
BuildOrchestratorAffinity now validates each dependency inline and treats
disabled or unresolved (ref.IsEnabled / empty namespace) componentRefs as
"missing" — with a specific reason in the error so required deps don't
silently strand a partial phase deploy. runPhase pre-flights the same check
across all entries before deploying any Job. The corev1.Affinity →
ApplyConfiguration walker now covers NodeAffinity.Required, MatchFields,
PodAntiAffinity, NamespaceSelector, MatchLabelKeys, and MismatchLabelKeys so
externally-constructed JobPlans round-trip through server-side apply without
silently dropping fields. The embedded-catalog test uses "v0.0.0-next" to
bypass the release-tag rewrite path.
@github-actions

Copy link
Copy Markdown
Contributor

@njhensley njhensley self-assigned this May 27, 2026
@coderabbitai

coderabbitai Bot commented May 27, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: e682a928-bf60-47b2-b2b8-1aa9fac2f52a

📥 Commits

Reviewing files that changed from the base of the PR and between 369695c and f946c66.

📒 Files selected for processing (1)
  • pkg/api/validator/v1/README.md

📝 Walkthrough

Walkthrough

Adds a DependencyAffinity API and validation; implements BuildOrchestratorAffinity to resolve dependency entries into podAffinity terms; threads componentRefs through Plan/BuildJobPlan and Deployer, propagating affinity into JobPlan rendering and server-side-apply conversion; performs pre-flight dependency resolution before deploying jobs; updates embedded catalog (ai-service-metrics) and adds unit tests covering validation, affinity building, job-plan propagation, rendering, and deployer error paths.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • NVIDIA/aicr#1028: Both PRs modify deployer.Plan/BuildJobPlan callsites to thread additional parameters (image overrides in #1028; componentRefs and affinity handling in this PR).

Suggested labels

area/tests, size/L

Suggested reviewers

  • mchmarny
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(validator): co-locate ai-service-metrics with Prometheus' is concise and directly summarizes the main change: adding Prometheus co-location for the ai-service-metrics validator.
Description check ✅ Passed The description is well-written, providing summary, motivation, implementation details, testing results, and references to issue #933.
Linked Issues check ✅ Passed The PR comprehensively implements all objectives from issue #933: adds dependencyAffinity field to catalog, implements BuildOrchestratorAffinity with required/preferred modes, adds pre-flight validation, threads componentRefs through deployer, and applies the feature to ai-service-metrics with requirement: preferred.
Out of Scope Changes check ✅ Passed All changes are directly related to implementing dependencyAffinity support and applying it to resolve issue #933. Documentation updates, API changes, job plan modifications, and validator orchestration changes are all in-scope for the stated objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/api/validator/v1/README.md (1)

322-336: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Document all new Plan(...) parameters in the API reference.

The signature includes imageRegistryOverride, imageTagOverride, and componentRefs, but the parameter list currently omits them. Please add brief parameter descriptions to keep the API contract unambiguous.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/api/validator/v1/README.md` around lines 322 - 336, The README's
Plan(...) parameter list is missing descriptions for imageRegistryOverride,
imageTagOverride, and componentRefs; update the API reference for the Plan(cat,
validationInput, runID, namespace, version, commit, serviceAccount,
imagePullSecrets, tolerations, nodeSelector, imageRegistryOverride,
imageTagOverride, componentRefs) signature by adding three brief bullet entries:
explain imageRegistryOverride as an optional registry host to override container
image pull locations, imageTagOverride as an optional tag to override validator
image tags, and componentRefs as a list/map of component identifiers (or image
references) to pass to validators for targeted behavior; ensure wording matches
the existing style and placement with the other parameters.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/integrator/validator-extension.md`:
- Around line 109-112: Replace the deterministic language in the sentence
discussing the `preferred` vs `required` scheduling behavior: change claims like
"dominates the scheduler's image-locality scoring" and "after which image
locality reinforces (rather than counteracts) the affinity" to softer,
observation-based phrasing such as "has a strong influence on the scheduler's
image-locality scoring on the first run" and "afterward image locality can
support (rather than oppose) the affinity" so the doc describes likely behavior
instead of a guaranteed internal precedence; keep the guidance to prefer
`preferred` and reserve `required` for cases that cannot succeed without exact
co-location.

In `@pkg/api/validator/v1/affinity.go`:
- Around line 69-72: The current block that calls dep.Validate() (inside the
validation for dependency affinity) re-wraps the error using errors.Wrap which
double-wraps an already-coded pkg/errors error; replace the errors.Wrap usage
with errors.PropagateOrWrap(err, errors.ErrCodeInvalidRequest, "invalid
dependencyAffinity") so that DependencyAffinity.Validate() structured errors are
propagated unchanged while non-coded errors get wrapped with
ErrCodeInvalidRequest and the "invalid dependencyAffinity" message.

In `@pkg/validator/catalog/catalog.go`:
- Around line 279-282: The current code double-wraps structured errors from
DependencyAffinity.Validate(); change the return logic in the loop where
dep.Validate() is called (referencing dep.Validate(), v.Name and
dependencyAffinity[j]) to use errors.PropagateOrWrap(err,
errors.ErrCodeInvalidRequest, fmt.Sprintf("validator %q:
dependencyAffinity[%d]", v.Name, j)) instead of errors.Wrap so that
already-coded *errors.StructuredError values from DependencyAffinity.Validate()
are propagated unchanged and non-structured errors get wrapped with the
contextual message.

In `@pkg/validator/validator.go`:
- Around line 287-292: Add a context cancellation check inside the pre-flight
loop that iterates over entries: before calling
v1.BuildOrchestratorAffinity(entry.DependencyAffinity,
validationInput.GetComponentRefs()), select on ctx.Done() and return immediately
with the context error (wrapped appropriately) so the validator stops promptly
when canceled; update the loop that uses variables entries/entry and the
surrounding function to return early on ctx cancellation rather than continuing
work.
- Around line 288-291: The current dependency-affinity pre-flight double-wraps
structured errors returned from v1.BuildOrchestratorAffinity; instead check the
returned err and use errors.PropagateOrWrap(err, errors.ErrCodeInvalidRequest,
fmt.Sprintf("dependencyAffinity pre-flight failed for validator %q",
entry.Name)) so already-coded errors are preserved and uncoded ones get wrapped
with ErrCodeInvalidRequest and the same message; update the error handling
around v1.BuildOrchestratorAffinity accordingly.

In `@recipes/validators/catalog.yaml`:
- Around line 110-118: The comment incorrectly states this is “Required for
determinism” while the configuration uses dependencyAffinity with requirement:
preferred; update the comment text near dependencyAffinity / componentRef:
kube-prometheus-stack / podLabelSelector: app.kubernetes.io/name: prometheus to
indicate this is a best-effort/preferred co-location (not guaranteed), e.g.,
replace “Required for determinism” with wording that reflects
preferred/best-effort behavior so the comment matches the configured
requirement: preferred.

---

Outside diff comments:
In `@pkg/api/validator/v1/README.md`:
- Around line 322-336: The README's Plan(...) parameter list is missing
descriptions for imageRegistryOverride, imageTagOverride, and componentRefs;
update the API reference for the Plan(cat, validationInput, runID, namespace,
version, commit, serviceAccount, imagePullSecrets, tolerations, nodeSelector,
imageRegistryOverride, imageTagOverride, componentRefs) signature by adding
three brief bullet entries: explain imageRegistryOverride as an optional
registry host to override container image pull locations, imageTagOverride as an
optional tag to override validator image tags, and componentRefs as a list/map
of component identifiers (or image references) to pass to validators for
targeted behavior; ensure wording matches the existing style and placement with
the other parameters.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 8955eb3b-2c9d-4b52-8c0c-73f9743c5437

📥 Commits

Reviewing files that changed from the base of the PR and between 7adc586 and 98daa48.

📒 Files selected for processing (18)
  • docs/integrator/validator-extension.md
  • pkg/api/validator/v1/README.md
  • pkg/api/validator/v1/affinity.go
  • pkg/api/validator/v1/affinity_test.go
  • pkg/api/validator/v1/catalog.go
  • pkg/api/validator/v1/dependency_affinity.go
  • pkg/api/validator/v1/dependency_affinity_test.go
  • pkg/api/validator/v1/job_plan.go
  • pkg/api/validator/v1/job_plan_internal.go
  • pkg/api/validator/v1/job_plan_test.go
  • pkg/api/validator/v1/validation_input.go
  • pkg/validator/catalog/catalog.go
  • pkg/validator/catalog/catalog_test.go
  • pkg/validator/job/deployer.go
  • pkg/validator/job/deployer_test.go
  • pkg/validator/job/result_test.go
  • pkg/validator/validator.go
  • recipes/validators/catalog.yaml

Comment thread docs/integrator/validator-extension.md Outdated
Comment thread pkg/api/validator/v1/affinity.go
Comment thread pkg/validator/catalog/catalog.go
Comment thread pkg/validator/validator.go
Comment thread pkg/validator/validator.go
Comment thread recipes/validators/catalog.yaml Outdated
Switch the three Wrap call sites that re-wrapped already-coded errors
(BuildOrchestratorAffinity inner Validate, catalog static validation,
runPhase pre-flight) to PropagateOrWrap so a callee's
ErrCodeInvalidRequest isn't masked by the outer message. Add a
ctx.Done() check inside the pre-flight loop. Reword the
validator-extension doc and catalog comment to describe preferred
as best-effort rather than guaranteed scheduler dominance.
@njhensley

Copy link
Copy Markdown
Member Author

Re: the outside-diff finding on pkg/api/validator/v1/README.md:322-336 (Plan(...) parameter list missing imageRegistryOverride / imageTagOverride / componentRefs): added three bullets in 2c22a086, matching the existing parameter-description style and placement.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/api/validator/v1/README.md`:
- Line 337: Update the README description for imageTagOverride (aka
AICR_VALIDATOR_IMAGE_TAG) to clarify it only replaces tags on tag-based image
references and does NOT modify digest-pinned references (the image rewrite logic
preserves digest-pinned refs); keep the note that an empty string disables the
override and ensure the wording explicitly states "tag-based references only"
rather than suggesting digest-pinned refs are affected.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 1c90b9c8-c20e-47ff-ba5c-1df741bc46bb

📥 Commits

Reviewing files that changed from the base of the PR and between 98daa48 and 369695c.

📒 Files selected for processing (7)
  • docs/integrator/validator-extension.md
  • pkg/api/validator/v1/README.md
  • pkg/api/validator/v1/affinity.go
  • pkg/validator/catalog/catalog.go
  • pkg/validator/catalog/catalog_test.go
  • pkg/validator/validator.go
  • recipes/validators/catalog.yaml

Comment thread pkg/api/validator/v1/README.md Outdated
@njhensley njhensley marked this pull request as ready for review May 27, 2026 21:44
@njhensley njhensley requested review from a team as code owners May 27, 2026 21:44

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mechanism is sound: pinning to kubernetes.io/hostname with weight=100 is the right shape for the multi-SG reachability problem, and preferred as the catalog default keeps flat-cluster recipes working. Test coverage is strong — TestAffinityToApplyConfig_RoundTripsAllFields is especially welcome.

Five non-blocking comments inline: one on the apply-config converter's drift risk on future k8s bumps (most important), one on the pre-flight gate doing more work than it needs to, one on required not actually guaranteeing runtime presence (only recipe membership), a long-parameter-list nit, and an observation about Prometheus label fragility across chart versions.

CI is green. Nothing here blocks merge.

return nil, errors.Wrap(errors.ErrCodeTimeout, "context canceled during dependencyAffinity pre-flight", ctx.Err())
default:
}
if _, err := v1.BuildOrchestratorAffinity(entry.DependencyAffinity, validationInput.GetComponentRefs()); err != nil {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pre-flight calls BuildOrchestratorAffinity purely for its validation side-effect — the returned *corev1.Affinity is discarded, but each invocation allocates the full affinity tree, builds the refByName map, and runs the converter logic.

Extract a ValidateDependencyAffinity(deps, componentRefs) error that performs only the resolution checks, then have BuildOrchestratorAffinity call it internally. Pre-flight uses the validate path; deploy path keeps building. Cheaper and the intent is clearer at the call site (you're validating, not building).

Not blocking — the entry count per phase is small.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Refactored in the staged changes: extracted resolveDeps as an unexported helper, added a public ValidateDependencyAffinity(deps, componentRefs) error that calls resolveDeps with warning emission suppressed, and switched runPhase's pre-flight to use it. BuildOrchestratorAffinity now also delegates to resolveDeps, so the validate and build paths share one source of truth and pre-flight no longer double-emits the missing-preferred warning.

// walk because client-go does not expose a generated converter for this pair.
// A nil or empty Affinity falls back to preferCPUNodeAffinity() to preserve
// pre-PR behavior for external callers that construct a JobPlan manually.
func affinityToApplyConfig(a *corev1.Affinity) *applycorev1.AffinityApplyConfiguration {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hand-written corev1.Affinityapplycorev1.AffinityApplyConfiguration walker. The docstring acknowledges client-go doesn't ship a generated converter; the risk is that when k8s API adds new affinity fields in future bumps (e.g. recent additions like MatchLabelKeys / MismatchLabelKeys were added in v1.29-v1.30), this walker will silently drop them — server-side apply won't error, the field just won't materialize.

Two suggestions, either acceptable:

  1. Add an explicit allowlist test that fails when reflect.TypeOf(corev1.PodAffinityTerm{}).NumField() changes, so the next k8s bump forces a re-audit of this file.
  2. Drop the walker entirely and round-trip via JSON marshal → unmarshal into the apply config type. Slower, but immune to API drift.

Round-tripping TestAffinityToApplyConfig_RoundTripsAllFields covers today's surface but won't catch tomorrow's additions.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took option (1). Added TestAffinityTypeFieldCountInvariant in affinity_test.go that asserts the field counts of corev1.Affinity / NodeAffinity / NodeSelectorTerm / PodAffinity / PodAntiAffinity / PodAffinityTerm against current expected values. Future k8s vendor bumps that add or remove a field will fail the test with a message instructing the bumper to audit affinityToApplyConfig before updating the count. Cheaper than JSON round-trip, gives the next k8s bumper a forcing function.

var reason string
switch {
case !found:
reason = "is not in the recipe's componentRefs"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

required resolution checks recipe membership (ref.IsEnabled(), ref.Namespace != ""), not runtime presence. If kube-prometheus-stack is in componentRefs but its pods haven't started yet (or crashlooped), a required PodAffinity term on the orchestrator will leave the Job stuck in Pending until the deploy timeout fires — with no signal in the validator output beyond "ran out of time."

The docstring above (lines 36-39) implies the recipe-membership check is the guard, but the failure mode is actually different from what it suggests. Worth either:

  • explicitly noting in the docstring that required assumes the dependency's pods will eventually be Running, OR
  • having the deployer surface the orchestrator's Pending reason (PodScheduled=False) into the validator output when wait-for-Job times out.

For ai-service-metrics this is moot (you use preferred). It matters once someone declares the first required dependency.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Real gap. Clarified BuildOrchestratorAffinity's docstring (and ValidateDependencyAffinity's) to explicitly call out that resolution checks recipe membership only, not runtime readiness, and to point operators at PodScheduled + the dependency component's replica status when triaging a hung run. Deferring the deployer's scheduling-diagnostics-on-timeout change to a separate follow-up issue since it's structural (a Job timeout would need to read pod conditions before returning), out of scope for this PR, and doesn't affect ai-service-metrics which uses preferred.

componentRefs []recipe.ComponentRef,
) (JobPlan, error) {

affinity, err := BuildOrchestratorAffinity(entry.DependencyAffinity, componentRefs)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Plan and BuildJobPlan are now at 13 positional parameters and the test files are a sea of nil, nil, nil, "", "", nil placeholders that have to be re-aligned every time anything is added. The README also has to enumerate them in prose.

Worth a follow-up to convert to a config struct (JobPlanConfig / PlanConfig) or functional options — same shape this repo already uses for recipe.NewBuilder and server.New. Not blocking, but the friction will compound the next time someone needs to thread another field through.

- componentRef: kube-prometheus-stack
podLabelSelector:
app.kubernetes.io/name: prometheus
requirement: preferred

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description confirms monitoring/prometheus-kube-prometheus-prometheus-0 carries app.kubernetes.io/name=prometheus on the test cluster — good. One caveat to lock in: when kube-prometheus-stack chart is upgraded (or when overrides set a custom release name), the Prometheus pods sometimes only carry prometheus=kube-prometheus-stack-prometheus or app=kube-prometheus-stack-prometheus and not the generic app.kubernetes.io/name. That isn't a bug in this PR, but it does mean a downstream chart bump can silently turn the affinity into a no-op (since preferred, no error).

Optional defense: render a slog.Warn from the deployer when the LabelSelector matches zero pods at deploy time (informer is right there).

@mchmarny mchmarny merged commit e565ce0 into NVIDIA:main May 27, 2026
117 checks passed
@njhensley njhensley deleted the feat/deterministic-validator-scheduling branch June 23, 2026 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

2 participants