feat(validator): co-locate ai-service-metrics with Prometheus#1066
Conversation
Validators can declare component-scoped pod affinity so the orchestrator co-locates with the component it queries. Required deps fail closed when absent from the recipe (ErrCodeInvalidRequest); preferred deps warn and skip the term. Motivated by multi-Security-Group EKS clusters where Prometheus dial reachability depends on which node the orchestrator lands on (NVIDIA#933).
BuildJobPlan now returns (JobPlan, error) and renders affinity from BuildOrchestratorAffinity; the renderer falls back to prefer-CPU NodeAffinity when Affinity is nil so external callers stay compatible. Deployer threads the resolved recipe's componentRefs through from ValidationInput so missing required deps surface as ErrCodeInvalidRequest before any Job is applied.
Adds dependencyAffinity (preferred, kube-prometheus-stack) to the embedded catalog so the orchestrator dial of Prometheus stays loopback/same-node on clusters with asymmetric pod-to-pod reachability (DGXC EKS, multi-SG). Docs the new field under integrator/validator-extension with the motivating case.
BuildOrchestratorAffinity now validates each dependency inline and treats disabled or unresolved (ref.IsEnabled / empty namespace) componentRefs as "missing" — with a specific reason in the error so required deps don't silently strand a partial phase deploy. runPhase pre-flights the same check across all entries before deploying any Job. The corev1.Affinity → ApplyConfiguration walker now covers NodeAffinity.Required, MatchFields, PodAntiAffinity, NamespaceSelector, MatchLabelKeys, and MismatchLabelKeys so externally-constructed JobPlans round-trip through server-side apply without silently dropping fields. The embedded-catalog test uses "v0.0.0-next" to bypass the release-tag rewrite path.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds a DependencyAffinity API and validation; implements BuildOrchestratorAffinity to resolve dependency entries into podAffinity terms; threads componentRefs through Plan/BuildJobPlan and Deployer, propagating affinity into JobPlan rendering and server-side-apply conversion; performs pre-flight dependency resolution before deploying jobs; updates embedded catalog (ai-service-metrics) and adds unit tests covering validation, affinity building, job-plan propagation, rendering, and deployer error paths. Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/api/validator/v1/README.md (1)
322-336:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winDocument all new
Plan(...)parameters in the API reference.The signature includes
imageRegistryOverride,imageTagOverride, andcomponentRefs, but the parameter list currently omits them. Please add brief parameter descriptions to keep the API contract unambiguous.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/api/validator/v1/README.md` around lines 322 - 336, The README's Plan(...) parameter list is missing descriptions for imageRegistryOverride, imageTagOverride, and componentRefs; update the API reference for the Plan(cat, validationInput, runID, namespace, version, commit, serviceAccount, imagePullSecrets, tolerations, nodeSelector, imageRegistryOverride, imageTagOverride, componentRefs) signature by adding three brief bullet entries: explain imageRegistryOverride as an optional registry host to override container image pull locations, imageTagOverride as an optional tag to override validator image tags, and componentRefs as a list/map of component identifiers (or image references) to pass to validators for targeted behavior; ensure wording matches the existing style and placement with the other parameters.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/integrator/validator-extension.md`:
- Around line 109-112: Replace the deterministic language in the sentence
discussing the `preferred` vs `required` scheduling behavior: change claims like
"dominates the scheduler's image-locality scoring" and "after which image
locality reinforces (rather than counteracts) the affinity" to softer,
observation-based phrasing such as "has a strong influence on the scheduler's
image-locality scoring on the first run" and "afterward image locality can
support (rather than oppose) the affinity" so the doc describes likely behavior
instead of a guaranteed internal precedence; keep the guidance to prefer
`preferred` and reserve `required` for cases that cannot succeed without exact
co-location.
In `@pkg/api/validator/v1/affinity.go`:
- Around line 69-72: The current block that calls dep.Validate() (inside the
validation for dependency affinity) re-wraps the error using errors.Wrap which
double-wraps an already-coded pkg/errors error; replace the errors.Wrap usage
with errors.PropagateOrWrap(err, errors.ErrCodeInvalidRequest, "invalid
dependencyAffinity") so that DependencyAffinity.Validate() structured errors are
propagated unchanged while non-coded errors get wrapped with
ErrCodeInvalidRequest and the "invalid dependencyAffinity" message.
In `@pkg/validator/catalog/catalog.go`:
- Around line 279-282: The current code double-wraps structured errors from
DependencyAffinity.Validate(); change the return logic in the loop where
dep.Validate() is called (referencing dep.Validate(), v.Name and
dependencyAffinity[j]) to use errors.PropagateOrWrap(err,
errors.ErrCodeInvalidRequest, fmt.Sprintf("validator %q:
dependencyAffinity[%d]", v.Name, j)) instead of errors.Wrap so that
already-coded *errors.StructuredError values from DependencyAffinity.Validate()
are propagated unchanged and non-structured errors get wrapped with the
contextual message.
In `@pkg/validator/validator.go`:
- Around line 287-292: Add a context cancellation check inside the pre-flight
loop that iterates over entries: before calling
v1.BuildOrchestratorAffinity(entry.DependencyAffinity,
validationInput.GetComponentRefs()), select on ctx.Done() and return immediately
with the context error (wrapped appropriately) so the validator stops promptly
when canceled; update the loop that uses variables entries/entry and the
surrounding function to return early on ctx cancellation rather than continuing
work.
- Around line 288-291: The current dependency-affinity pre-flight double-wraps
structured errors returned from v1.BuildOrchestratorAffinity; instead check the
returned err and use errors.PropagateOrWrap(err, errors.ErrCodeInvalidRequest,
fmt.Sprintf("dependencyAffinity pre-flight failed for validator %q",
entry.Name)) so already-coded errors are preserved and uncoded ones get wrapped
with ErrCodeInvalidRequest and the same message; update the error handling
around v1.BuildOrchestratorAffinity accordingly.
In `@recipes/validators/catalog.yaml`:
- Around line 110-118: The comment incorrectly states this is “Required for
determinism” while the configuration uses dependencyAffinity with requirement:
preferred; update the comment text near dependencyAffinity / componentRef:
kube-prometheus-stack / podLabelSelector: app.kubernetes.io/name: prometheus to
indicate this is a best-effort/preferred co-location (not guaranteed), e.g.,
replace “Required for determinism” with wording that reflects
preferred/best-effort behavior so the comment matches the configured
requirement: preferred.
---
Outside diff comments:
In `@pkg/api/validator/v1/README.md`:
- Around line 322-336: The README's Plan(...) parameter list is missing
descriptions for imageRegistryOverride, imageTagOverride, and componentRefs;
update the API reference for the Plan(cat, validationInput, runID, namespace,
version, commit, serviceAccount, imagePullSecrets, tolerations, nodeSelector,
imageRegistryOverride, imageTagOverride, componentRefs) signature by adding
three brief bullet entries: explain imageRegistryOverride as an optional
registry host to override container image pull locations, imageTagOverride as an
optional tag to override validator image tags, and componentRefs as a list/map
of component identifiers (or image references) to pass to validators for
targeted behavior; ensure wording matches the existing style and placement with
the other parameters.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 8955eb3b-2c9d-4b52-8c0c-73f9743c5437
📒 Files selected for processing (18)
docs/integrator/validator-extension.mdpkg/api/validator/v1/README.mdpkg/api/validator/v1/affinity.gopkg/api/validator/v1/affinity_test.gopkg/api/validator/v1/catalog.gopkg/api/validator/v1/dependency_affinity.gopkg/api/validator/v1/dependency_affinity_test.gopkg/api/validator/v1/job_plan.gopkg/api/validator/v1/job_plan_internal.gopkg/api/validator/v1/job_plan_test.gopkg/api/validator/v1/validation_input.gopkg/validator/catalog/catalog.gopkg/validator/catalog/catalog_test.gopkg/validator/job/deployer.gopkg/validator/job/deployer_test.gopkg/validator/job/result_test.gopkg/validator/validator.gorecipes/validators/catalog.yaml
Switch the three Wrap call sites that re-wrapped already-coded errors (BuildOrchestratorAffinity inner Validate, catalog static validation, runPhase pre-flight) to PropagateOrWrap so a callee's ErrCodeInvalidRequest isn't masked by the outer message. Add a ctx.Done() check inside the pre-flight loop. Reword the validator-extension doc and catalog comment to describe preferred as best-effort rather than guaranteed scheduler dominance.
|
Re: the outside-diff finding on |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/api/validator/v1/README.md`:
- Line 337: Update the README description for imageTagOverride (aka
AICR_VALIDATOR_IMAGE_TAG) to clarify it only replaces tags on tag-based image
references and does NOT modify digest-pinned references (the image rewrite logic
preserves digest-pinned refs); keep the note that an empty string disables the
override and ensure the wording explicitly states "tag-based references only"
rather than suggesting digest-pinned refs are affected.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 1c90b9c8-c20e-47ff-ba5c-1df741bc46bb
📒 Files selected for processing (7)
docs/integrator/validator-extension.mdpkg/api/validator/v1/README.mdpkg/api/validator/v1/affinity.gopkg/validator/catalog/catalog.gopkg/validator/catalog/catalog_test.gopkg/validator/validator.gorecipes/validators/catalog.yaml
mchmarny
left a comment
There was a problem hiding this comment.
Mechanism is sound: pinning to kubernetes.io/hostname with weight=100 is the right shape for the multi-SG reachability problem, and preferred as the catalog default keeps flat-cluster recipes working. Test coverage is strong — TestAffinityToApplyConfig_RoundTripsAllFields is especially welcome.
Five non-blocking comments inline: one on the apply-config converter's drift risk on future k8s bumps (most important), one on the pre-flight gate doing more work than it needs to, one on required not actually guaranteeing runtime presence (only recipe membership), a long-parameter-list nit, and an observation about Prometheus label fragility across chart versions.
CI is green. Nothing here blocks merge.
| return nil, errors.Wrap(errors.ErrCodeTimeout, "context canceled during dependencyAffinity pre-flight", ctx.Err()) | ||
| default: | ||
| } | ||
| if _, err := v1.BuildOrchestratorAffinity(entry.DependencyAffinity, validationInput.GetComponentRefs()); err != nil { |
There was a problem hiding this comment.
Pre-flight calls BuildOrchestratorAffinity purely for its validation side-effect — the returned *corev1.Affinity is discarded, but each invocation allocates the full affinity tree, builds the refByName map, and runs the converter logic.
Extract a ValidateDependencyAffinity(deps, componentRefs) error that performs only the resolution checks, then have BuildOrchestratorAffinity call it internally. Pre-flight uses the validate path; deploy path keeps building. Cheaper and the intent is clearer at the call site (you're validating, not building).
Not blocking — the entry count per phase is small.
There was a problem hiding this comment.
Good catch. Refactored in the staged changes: extracted resolveDeps as an unexported helper, added a public ValidateDependencyAffinity(deps, componentRefs) error that calls resolveDeps with warning emission suppressed, and switched runPhase's pre-flight to use it. BuildOrchestratorAffinity now also delegates to resolveDeps, so the validate and build paths share one source of truth and pre-flight no longer double-emits the missing-preferred warning.
| // walk because client-go does not expose a generated converter for this pair. | ||
| // A nil or empty Affinity falls back to preferCPUNodeAffinity() to preserve | ||
| // pre-PR behavior for external callers that construct a JobPlan manually. | ||
| func affinityToApplyConfig(a *corev1.Affinity) *applycorev1.AffinityApplyConfiguration { |
There was a problem hiding this comment.
Hand-written corev1.Affinity → applycorev1.AffinityApplyConfiguration walker. The docstring acknowledges client-go doesn't ship a generated converter; the risk is that when k8s API adds new affinity fields in future bumps (e.g. recent additions like MatchLabelKeys / MismatchLabelKeys were added in v1.29-v1.30), this walker will silently drop them — server-side apply won't error, the field just won't materialize.
Two suggestions, either acceptable:
- Add an explicit allowlist test that fails when
reflect.TypeOf(corev1.PodAffinityTerm{}).NumField()changes, so the next k8s bump forces a re-audit of this file. - Drop the walker entirely and round-trip via JSON marshal → unmarshal into the apply config type. Slower, but immune to API drift.
Round-tripping TestAffinityToApplyConfig_RoundTripsAllFields covers today's surface but won't catch tomorrow's additions.
There was a problem hiding this comment.
Took option (1). Added TestAffinityTypeFieldCountInvariant in affinity_test.go that asserts the field counts of corev1.Affinity / NodeAffinity / NodeSelectorTerm / PodAffinity / PodAntiAffinity / PodAffinityTerm against current expected values. Future k8s vendor bumps that add or remove a field will fail the test with a message instructing the bumper to audit affinityToApplyConfig before updating the count. Cheaper than JSON round-trip, gives the next k8s bumper a forcing function.
| var reason string | ||
| switch { | ||
| case !found: | ||
| reason = "is not in the recipe's componentRefs" |
There was a problem hiding this comment.
required resolution checks recipe membership (ref.IsEnabled(), ref.Namespace != ""), not runtime presence. If kube-prometheus-stack is in componentRefs but its pods haven't started yet (or crashlooped), a required PodAffinity term on the orchestrator will leave the Job stuck in Pending until the deploy timeout fires — with no signal in the validator output beyond "ran out of time."
The docstring above (lines 36-39) implies the recipe-membership check is the guard, but the failure mode is actually different from what it suggests. Worth either:
- explicitly noting in the docstring that
requiredassumes the dependency's pods will eventually be Running, OR - having the deployer surface the orchestrator's Pending reason (
PodScheduled=False) into the validator output when wait-for-Job times out.
For ai-service-metrics this is moot (you use preferred). It matters once someone declares the first required dependency.
There was a problem hiding this comment.
Real gap. Clarified BuildOrchestratorAffinity's docstring (and ValidateDependencyAffinity's) to explicitly call out that resolution checks recipe membership only, not runtime readiness, and to point operators at PodScheduled + the dependency component's replica status when triaging a hung run. Deferring the deployer's scheduling-diagnostics-on-timeout change to a separate follow-up issue since it's structural (a Job timeout would need to read pod conditions before returning), out of scope for this PR, and doesn't affect ai-service-metrics which uses preferred.
| componentRefs []recipe.ComponentRef, | ||
| ) (JobPlan, error) { | ||
|
|
||
| affinity, err := BuildOrchestratorAffinity(entry.DependencyAffinity, componentRefs) |
There was a problem hiding this comment.
nit: Plan and BuildJobPlan are now at 13 positional parameters and the test files are a sea of nil, nil, nil, "", "", nil placeholders that have to be re-aligned every time anything is added. The README also has to enumerate them in prose.
Worth a follow-up to convert to a config struct (JobPlanConfig / PlanConfig) or functional options — same shape this repo already uses for recipe.NewBuilder and server.New. Not blocking, but the friction will compound the next time someone needs to thread another field through.
| - componentRef: kube-prometheus-stack | ||
| podLabelSelector: | ||
| app.kubernetes.io/name: prometheus | ||
| requirement: preferred |
There was a problem hiding this comment.
The PR description confirms monitoring/prometheus-kube-prometheus-prometheus-0 carries app.kubernetes.io/name=prometheus on the test cluster — good. One caveat to lock in: when kube-prometheus-stack chart is upgraded (or when overrides set a custom release name), the Prometheus pods sometimes only carry prometheus=kube-prometheus-stack-prometheus or app=kube-prometheus-stack-prometheus and not the generic app.kubernetes.io/name. That isn't a bug in this PR, but it does mean a downstream chart bump can silently turn the affinity into a no-op (since preferred, no error).
Optional defense: render a slog.Warn from the deployer when the LabelSelector matches zero pods at deploy time (informer is right there).
Summary
Add a
dependencyAffinityfield to the validator catalog and use it to pin theai-service-metricsorchestrator pod to the node running Prometheus. Eliminates intermittent failures on multi-Security-Group clusters where the dial would otherwise traverse cross-SG ingress.Motivation / Context
On multi-SG EKS clusters (e.g., DGXC GB200 with separate customer/system ENI subnets),
ai-service-metricswas nondeterministically landing on nodes that could not dial the Prometheus service. Image-locality scoring decided placement once the validator image was cached, masking the underlying network-reachability constraint. The fix declares the dependency in the catalog and the deployer materializes it as apodAffinityterm whose weight (100) dominates image-locality on first scheduling, then reinforces itself via locality on subsequent runs.Fixes: #933
Related: N/A
Type of Change
Component(s) Affected
pkg/validator,pkg/api/validator/v1)docs/integrator/validator-extension.md)recipes/validators/catalog.yaml)Implementation Notes
DependencyAffinitytype on catalog entries:componentRef,podLabelSelector,topologyKey(defaults tokubernetes.io/hostname), andrequirement(preferreddefault, orrequired).BuildOrchestratorAffinityresolves each dependency against the recipe'scomponentRefs:required+ missing/disabled/no-namespace →ErrCodeInvalidRequest(caller must abort before deploying).preferred+ missing →slog.Warnand skip the term (preserves behavior on flat networks).PodAffinityTermwith the component's resolved namespace.pkg/validator/validator.go: walks selected validators and validatesdependencyAffinityagainst the recipe before any Job is deployed.componentRefsinto the JobPlan and renders the affinity into the orchestrator pod'sapplyconfigurationsspec.ai-service-metricscatalog entry usesrequirement: preferredso flat-cluster recipes that omitkube-prometheus-stackstill run (validator schedules wherever, warning logged).pkg/validator/validator.go:290returnsErrCodeInternalover an innerErrCodeInvalidRequest; minor cosmetic deviation from the no-double-wrap rule, no behavior impact, tracked as follow-up.Testing
make qualify go test ./pkg/api/validator/v1/... ./pkg/validator/...Manually verified end-to-end on a GB200/EKS multi-SG cluster:
monitoring/prometheus-kube-prometheus-prometheus-0carriesapp.kubernetes.io/name=prometheus; selector matches.validate --phase conformanceruns landed theai-service-metricsorchestrator on the Prometheus node. Injected Job spec confirmed:preferredDuringSchedulingIgnoredDuringExecutionwithweight: 100,matchLabels: app.kubernetes.io/name: prometheus,namespaces: [monitoring],topologyKey: kubernetes.io/hostname.--dataoverlay flippingai-service-metricstorequirement: requiredand a recipe missingkube-prometheus-stackfrom componentRefs, validate aborts at pre-flight:[INVALID_REQUEST] required dependencyAffinity component "kube-prometheus-stack" is not in the recipe's componentRefs. Zero Jobs deployed.Coverage delta:
pkg/api/validator/v1: 67.4% → 73.6% (+6.2%)pkg/validator/catalog: 97.4% → 97.5% (+0.1%)pkg/validator/job: 12.9% → 15.7% (+2.8%)pkg/validator: 25.1% → 24.8% (-0.3%, within noise)Risk Assessment
requirement: preferredpreserves existing behavior on recipes withoutkube-prometheus-stack; the only validator using the new field today isai-service-metrics. Easy to revert by removing the catalog entry'sdependencyAffinityblock.Rollout notes: No migration required. Existing recipes work unchanged. Recipes wanting to enforce co-location for custom validators can opt-in via
requirement: required.Checklist
make testwith-race)make lint)git commit -S)