Problem
When generating bundles that include Skyhook-based components (EFA driver, kernel updates, RAID setup, etc.), AICR currently has no way to reason about the existing state of cluster nodes. This leads to several issues:
- Unnecessary reinstalls — Users with pre-configured nodes (EFA already installed, kernel already updated) get a full bundle that may re-run setup steps, including node reboots
- Missing guidance — Users don't have enough context to know which components they can safely skip vs. which are required for the "optimized" configuration value proposition
- No detection mechanism: nvidia-setup doesn't detect existing installations to skip redundant steps
Desired Behavior
Short-term
- Improve bundle README to clearly document what each Skyhook component does (e.g., "this will reboot your nodes")
- Users can opt out of individual components, but the bundle should communicate the trade-offs (partial install = weaker optimization guarantees)
Long-term
- Snapshot-based pre-flight checks: Extend the existing snapshot + recipe validation flow to detect pre-installed components on nodes and provide actionable guidance:
  - "EFA driver already installed: you may remove nvidia-setup from the bundle"
  - "Kernel version mismatch: nvidia-setup kernel install recommended"
  - "Component X already present but outdated: uninstall first or let Skyhook upgrade"
- Use the recently added component health-check infrastructure to run these detection checks
- Ensure pkg/collector/os captures sufficient signal to detect EFA driver presence, kernel version, and other relevant node-level state
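To make the intended pre-flight guidance concrete, here is a minimal Go sketch of how detected node state could be mapped to the messages listed above. Everything in it is an assumption for illustration: the NodeState and Desired types, their fields, and the Guidance function are hypothetical and do not reflect AICR's actual snapshot or recipe types.

```go
package main

import "fmt"

// NodeState is a hypothetical, simplified view of what a snapshot
// might record for one node. It is not AICR's real snapshot type.
type NodeState struct {
	KernelVersion string
	EFAInstalled  bool
	EFAVersion    string
}

// Desired is a hypothetical view of what the recipe expects.
type Desired struct {
	KernelVersion string
	EFAVersion    string
}

// Guidance compares detected state against the recipe and returns
// one human-readable message per finding.
func Guidance(n NodeState, d Desired) []string {
	var out []string

	switch {
	case !n.EFAInstalled:
		out = append(out, "EFA driver not found: keep nvidia-setup in the bundle")
	case n.EFAVersion == d.EFAVersion:
		out = append(out, "EFA driver already installed: you may remove nvidia-setup from the bundle")
	default:
		out = append(out, fmt.Sprintf(
			"EFA driver %s present but recipe expects %s: uninstall first or let Skyhook upgrade",
			n.EFAVersion, d.EFAVersion))
	}

	if n.KernelVersion != d.KernelVersion {
		out = append(out, "Kernel version mismatch: nvidia-setup kernel install recommended")
	}
	return out
}

func main() {
	// Example values are made up for illustration.
	node := NodeState{KernelVersion: "6.1.0", EFAInstalled: true, EFAVersion: "2.0"}
	want := Desired{KernelVersion: "6.8.0", EFAVersion: "2.0"}
	for _, msg := range Guidance(node, want) {
		fmt.Println(msg)
	}
}
```

The per-component messages are evaluated independently here, so a real check would still need to reconcile them (for example, not suggesting removal of nvidia-setup while also recommending its kernel install).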
Implementation Notes
- The existing validate command already checks recipe constraints against a snapshot, which makes it the natural extension point
- Component-level checks (recently added) provide the framework for per-component pre-flight detection
- May need to extend pkg/collector/os to capture EFA driver status and other Skyhook-relevant signals
- Core principle: AICR provides optimized configuration — users can opt out, but should understand the implications