Pre-flight node state detection for Skyhook components (EFA, kernel, etc.) #295

Description

@mchmarny

Problem

When generating bundles that include Skyhook-based components (EFA driver, kernel updates, RAID setup, etc.), AICR currently has no way to reason about the existing state of cluster nodes. This leads to several issues:

  1. Unnecessary reinstalls — Users with pre-configured nodes (EFA already installed, kernel already updated) get a full bundle that may re-run setup steps, including node reboots
  2. Missing guidance — Users don't have enough context to know which components they can safely skip vs. which are required for the "optimized" configuration value proposition
  3. No detection mechanism: nvidia-setup doesn't detect existing installations, so it cannot skip redundant steps

Desired Behavior

Short-term

  • Improve the bundle README to clearly document what each Skyhook component does and its side effects (e.g., "this will reboot your nodes")
  • Users can opt out of individual components, but the bundle should communicate the trade-offs (partial install = weaker optimization guarantees)

Long-term

  • Snapshot-based pre-flight checks: Extend the existing snapshot + recipe validation flow to detect pre-installed components on nodes and provide actionable guidance:
    • "EFA driver already installed — you may remove nvidia-setup from the bundle"
    • "Kernel version mismatch — nvidia-setup kernel install recommended"
    • "Component X already present but outdated — uninstall first or let Skyhook upgrade"
  • Leverage the recently added component health-check infrastructure to run these detection checks
  • Ensure pkg/collector/os captures sufficient signal to detect EFA driver presence, kernel version, and other relevant node-level state
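The guidance messages above could come from a small classification step that compares what a snapshot observed with what the recipe expects. The following is a minimal sketch under stated assumptions: `Status`, `Classify`, and `Guidance` are hypothetical names, not existing AICR APIs; the version strings are placeholders; and "mismatch" here means plain string inequality rather than a real version comparison.

```go
package main

import "fmt"

// Status is a hypothetical classification of one component on one node.
type Status int

const (
	Missing  Status = iota // nothing detected on the node
	Matching               // detected value equals what the recipe expects
	Mismatch               // detected, but differs from what the recipe expects
)

// Classify compares an observed value from a snapshot with the value a recipe
// expects. An empty observed value means the collector found nothing.
// Assumption: string equality stands in for a real version comparison.
func Classify(observed, expected string) Status {
	switch {
	case observed == "":
		return Missing
	case observed == expected:
		return Matching
	default:
		return Mismatch
	}
}

// Guidance turns a status into an actionable message for the user.
func Guidance(component string, s Status) string {
	switch s {
	case Matching:
		return fmt.Sprintf("%s already present: this setup step may be skipped", component)
	case Mismatch:
		return fmt.Sprintf("%s present but different from the recipe: uninstall first or let Skyhook upgrade", component)
	default:
		return fmt.Sprintf("%s not detected: install recommended", component)
	}
}

func main() {
	// Placeholder values, not real driver or kernel versions.
	fmt.Println(Guidance("EFA driver", Classify("1.2.3", "1.2.3")))
	fmt.Println(Guidance("kernel", Classify("6.1.0", "6.8.0")))
}
```

Keeping classification separate from message rendering would let the same result feed both the validate output and the bundle README.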

Implementation Notes

  • The existing validate command already checks recipe constraints against a snapshot — this is the natural extension point
  • Component-level checks (recently added) provide the framework for per-component pre-flight detection
  • May need to extend pkg/collector/os to capture EFA driver status and other Skyhook-relevant signals
  • Core principle: AICR provides optimized configuration — users can opt out, but should understand the implications
