feat(collector): add Talos OS support via Kubernetes Node info#714
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThis pull request adds support for Talos as a valid GPU node operating system across the entire system. It introduces a new Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes 🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Review rate limit: 9/10 reviews remaining, refill in 6 minutes. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/collector/talos/os_test.go`:
- Around line 158-164: The test TestOSCollector_GracefulDegradation currently
passes WithNodeName("missing-node") which bypasses the environment-variable
resolution path; change the setup to construct the collector without the
WithNodeName option (i.e., call
NewOSCollector(WithClientSet(fake.NewSimpleClientset())) so the collector reads
NODE_NAME/KUBERNETES_NODE_NAME/HOSTNAME (already set to empty via t.Setenv) and
then assert Collect() handles the missing env-var case; update any
assertions/messages if they specifically expected the explicit node-name
behavior.
In `@pkg/collector/talos/service.go`:
- Around line 100-104: The code unconditionally sets the containerd subtype to
active whenever info.ContainerRuntimeVersion is non-empty; update the logic
around splitRuntimeID(info.ContainerRuntimeVersion) so you only mark
keyActiveState as "active" when the parsed runtime name equals "containerd"
(from splitRuntimeID), otherwise set keyActiveState to "unknown" or skip setting
the subtype fields; modify the block that calls b.SetString(keyActiveState,
...).SetString(keyRuntimeName, ...).SetString(keyVersion, ...) to branch on
runtimeName == "containerd" and handle non-containerd runtimes accordingly
(referencing splitRuntimeID, info.ContainerRuntimeVersion, keyActiveState,
keyRuntimeName, keyVersion).
In `@pkg/collector/talos/talos.go`:
- Around line 101-107: The fetchNode call uses the incoming ctx directly for
cs.CoreV1().Nodes().Get which can block without a deadline; wrap the API call in
a context.WithTimeout of 10s (collector timeout), e.g., create ctx, cancel :=
context.WithTimeout(ctx, 10*time.Second) and defer cancel(), then pass that
child ctx into cs.CoreV1().Nodes().Get (referencing fetchNode and
cs.CoreV1().Nodes().Get) and handle/propagate the timeout error appropriately
instead of using the original ctx.
In `@pkg/k8s/agent/job_test.go`:
- Around line 119-128: The test currently only compares the value of gotOSEnv
and can’t tell if the AICR_OS env var is absent vs present-but-empty; modify the
loop over spec.Containers[0].Env to track presence (e.g., a found boolean
alongside gotOSEnv) when e.Name == "AICR_OS", then after the loop assert the env
var was found (fail the test if not) and then assert that gotOSEnv ==
tt.wantAICROSEnv; update the assertion logic in pkg/k8s/agent/job_test.go so
presence and value of AICR_OS are both validated (referencing gotOSEnv, found,
spec.Containers[0].Env and tt.wantAICROSEnv).
In `@pkg/recipe/doc.go`:
- Line 31: Update the GET query-parameter documentation in the same godoc where
Criteria is described so the OS examples match the Criteria struct comment;
replace the stale "os" examples ("ubuntu, cos, rhel, any") with the full,
consistent list used in the Criteria comment ("ubuntu, rhel, cos, amazonlinux,
talos, any") (ensure ordering/formatting follows the Criteria comment) and
confirm the GET parameter list text referencing "os" uses the same identifiers.
In `@pkg/snapshotter/snapshot.go`:
- Around line 85-89: The code reads AICR_OS raw via parseOSEnv(), which can miss
matches due to casing/whitespace; update parseOSEnv to normalize the value by
trimming surrounding whitespace and converting to a consistent case (e.g.,
strings.ToLower(strings.TrimSpace(os.Getenv("AICR_OS")))) so collector backend
selection (the code paths that inspect parseOSEnv() and the related selection
logic) reliably matches values like "Talos" regardless of casing/extra spaces;
ensure callers of parseOSEnv() continue to work with the normalized string.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: b8c3db3e-be88-4fa4-bc9b-e5ef933325cd
📒 Files selected for processing (27)
api/aicr/v1/server.yamldocs/README.mddocs/contributor/api-server.mddocs/contributor/cli.mddocs/contributor/validations.mddocs/user/api-reference.mddocs/user/cli-reference.mdpkg/cli/recipe.gopkg/cli/snapshot.gopkg/collector/factory.gopkg/collector/factory_test.gopkg/collector/talos/doc.gopkg/collector/talos/os.gopkg/collector/talos/os_test.gopkg/collector/talos/service.gopkg/collector/talos/service_test.gopkg/collector/talos/talos.gopkg/k8s/agent/job.gopkg/k8s/agent/job_test.gopkg/k8s/agent/types.gopkg/recipe/criteria.gopkg/recipe/criteria_test.gopkg/recipe/doc.gopkg/snapshotter/agent.gopkg/snapshotter/snapshot.gopkg/snapshotter/snapshot_test.gosite/docs/getting-started/index.md
| node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{}) | ||
| if err != nil { | ||
| slog.Warn("failed to fetch node from Kubernetes API", | ||
| slog.String("node", nodeName), | ||
| slog.String("error", err.Error())) | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Add explicit collector timeout around Node GET.
fetchNode performs Kubernetes API I/O using the incoming context directly. If upstream provides no deadline, this call can block longer than collector SLAs.
⏱️ Suggested fix pattern
import (
"context"
"log/slog"
"github.com/NVIDIA/aicr/pkg/collector/k8s"
+ "github.com/NVIDIA/aicr/pkg/defaults"
"github.com/NVIDIA/aicr/pkg/errors"
k8sclient "github.com/NVIDIA/aicr/pkg/k8s/client"
@@
- node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
+ getCtx, cancel := context.WithTimeout(ctx, defaults.CollectorTimeout)
+ defer cancel()
+ node, err := cs.CoreV1().Nodes().Get(getCtx, nodeName, metav1.GetOptions{})As per coding guidelines, "All I/O operations and HTTP handlers must use context with explicit timeouts via context.WithTimeout(). Collectors: 10s timeout."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/collector/talos/talos.go` around lines 101 - 107, The fetchNode call uses
the incoming ctx directly for cs.CoreV1().Nodes().Get which can block without a
deadline; wrap the API call in a context.WithTimeout of 10s (collector timeout),
e.g., create ctx, cancel := context.WithTimeout(ctx, 10*time.Second) and defer
cancel(), then pass that child ctx into cs.CoreV1().Nodes().Get (referencing
fetchNode and cs.CoreV1().Nodes().Get) and handle/propagate the timeout error
appropriately instead of using the original ctx.
Coverage Report ✅
Coverage BadgeMerging this branch changes the coverage (1 decrease, 6 increase)
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. |
mchmarny
left a comment
There was a problem hiding this comment.
Solid first phase of Talos enablement. Scope is appropriately tight (Kubernetes-API stub, no MPL deps, OS gating via existing criterion), the package layout is clearly templated for future OS-specific backends, and test coverage is real (91.7% on the new package, table-driven, graceful-degradation cases included). The three-phase escalation path documented in pkg/collector/talos/doc.go is exactly the right framing.
Six findings, none blocking:
- Doc gap (medium): the new
aicr snapshot --osflag isn't indocs/user/cli-reference.md's snapshot flags table — AGENTS.md requires this in the same PR. - Duplicate Node Get (medium):
OSCollectorandServiceCollectoreach fetch the same Node; cheap today, but scales linearly as more Talos collectors are added under this template. - Inconsistent ActiveState semantics (medium): malformed
ContainerRuntimeVersionstill stampsActiveState=activewith emptyVersion— locked in by tests; suggests either reportingunknownor omitting the empty Version. - Constant drift risk (low/medium):
OSTalosis defined in three packages with comments saying "must stay in sync"; trivial guard test would prevent silent drift. parseOSImagenon-Talos misbehavior (low): tests document("Red Hat Enterprise Linux 9.4 (Plow)", "red", "Plow"). Production never hits this code path (the collector only runs foros=talos), but the assertions read like locked-in bugs.parseOSEnvsilent fallback (low): invalidAICR_OSvalues silently degrade to systemd default with no log; mirrorparseMaxNodesPerEntryEnv'sslog.Warn.
Three GPU CI checks (nvkind H100/L40G) and tests/E2E are still in progress at review time — none of the failures so far are related. CI status looks healthy.
Talos has no systemd and no host /etc/os-release accessible to unprivileged pods, which blocks AICR's snapshot agent from deploying on Talos clusters (the documented blocker in #565). Adds an OS-criteria- selected backend that derives the same measurement schema from the Kubernetes Node object so existing recipe constraints continue to work unchanged across operating systems. Changes: - New CriteriaOSTalos value in pkg/recipe/criteria.go plus full enum audit (api/aicr/v1/server.yaml, docs/, site/docs/, doc.go). - New pkg/collector/talos package, organized so each new OS-specific collector follows the same template: talos.go package entry: Option, config, fetchNode helper service.go ServiceCollector emitting TypeSystemD with containerd.service + kubelet.service from Node.Status.NodeInfo (containerRuntimeVersion, kubeletVersion) os.go OSCollector emitting TypeOS with - release subtype from NodeInfo.{OSImage, KernelVersion, OperatingSystem} - extensions subtype from extensions.talos.dev/* Node labels (e.g., nvidia-container-toolkit, nvidia-open-gpu-kernel-modules versions) - pkg/collector/factory.go branches CreateSystemDCollector and CreateOSCollector on OS=talos via WithOS option. OSTalos string literal kept in sync with recipe.CriteriaOSTalos to avoid the recipe -> snapshotter -> collector import cycle. - pkg/k8s/agent/job.go gates the /run/systemd and /etc/os-release hostPath mounts on OS!=talos and propagates AICR_OS to the in-pod factory so it picks the same backend the controller-side picked. - pkg/snapshotter plumbs OS through AgentConfig and reads AICR_OS in measure() to rebuild the factory in-pod. New --os flag on aicr snapshot, validated through recipe.ParseCriteriaOSType. Stub backend uses only the Kubernetes API surface that pkg/collector/k8s already touches; no MPL-2.0 dependencies are introduced. The package docs and code comments lay out the future expansion path: K8s-API stub (this PR) -> gRPC reflection against machined -> vendor siderolabs/talos/pkg/machinery/client, which would introduce the first MPL-2.0 dependency into this codebase. Step up only when a constraint genuinely needs data the previous phase cannot satisfy. GPU collection on Talos requires no AICR changes: the GPU collector relies on NFD sysfs/PCI enumeration and nvidia-smi via the NVIDIA Container Runtime hook, neither of which involves nsenter, chroot, or host-fs mounts. Talos clusters with the nvidia-container-toolkit and nvidia-open-gpu-kernel-modules extensions installed (now visible via the extensions subtype) work as-is with --runtime-class nvidia. Tests: pkg/collector/talos coverage 91.7%; full test suite 55/55 green in -short mode; golangci-lint -c .golangci.yaml ./... 0 issues. Refs: #565
edd1961 to
3065a31
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/snapshotter/snapshot_test.go (1)
17-25:⚠️ Potential issue | 🟡 MinorMake the “unset” test case truly unset
AICR_OS.Right now the
"unset"case setsAICR_OSto""viat.Setenv, which is “set but empty”, not “absent”. IfparseOSEnv()usesos.LookupEnv(or otherwise branches on presence), this test may not cover the intended branch.✅ Suggested test adjustment
import ( "context" "fmt" + "os" "testing" "github.com/NVIDIA/aicr/pkg/collector" "github.com/NVIDIA/aicr/pkg/header" "github.com/NVIDIA/aicr/pkg/measurement" ) func TestParseOSEnv(t *testing.T) { tests := []struct { name string - env string + env string + unset bool want string }{ - {"unset", "", ""}, - {"talos", "talos", "talos"}, - {"ubuntu passthrough", "ubuntu", "ubuntu"}, + {"unset", "", true, ""}, + {"talos", "talos", false, "talos"}, + {"ubuntu passthrough", "ubuntu", false, "ubuntu"}, } for _, tt := range tests { t.Run(tt.name, func(t *testing.T) { - t.Setenv("AICR_OS", tt.env) + if tt.unset { + prev, ok := os.LookupEnv("AICR_OS") + _ = os.Unsetenv("AICR_OS") + t.Cleanup(func() { + if ok { + _ = os.Setenv("AICR_OS", prev) + } else { + _ = os.Unsetenv("AICR_OS") + } + }) + } else { + t.Setenv("AICR_OS", tt.env) + } if got := parseOSEnv(); got != tt.want { t.Errorf("parseOSEnv() = %q, want %q", got, tt.want) } }) } }Also applies to: 320-338
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/snapshotter/snapshot_test.go` around lines 17 - 25, The "unset" test case currently calls t.Setenv("AICR_OS", "") which leaves the variable present but empty; change the test to remove the environment variable so it is truly absent (use os.Unsetenv("AICR_OS") or call os.Unsetenv inside the test setup and arrange to restore it via t.Cleanup) so parseOSEnv() exercises the "not present" branch; update the same pattern in the other occurrence around the block that covers lines 320-338 to ensure both tests really unset AICR_OS.
♻️ Duplicate comments (9)
pkg/recipe/doc.go (1)
31-141:⚠️ Potential issue | 🟡 MinorFix stale godoc: GET query-parameter
oslist must includeamazonlinuxandtalos.
Criteria.OScomment includestalos, but the GET query parameter section still documentsosasubuntu, cos, rhel, any—missingamazonlinuxandtalos.✅ Suggested godoc diff
-// - os: ubuntu, cos, rhel, any (default: any) +// - os: ubuntu, rhel, cos, amazonlinux, talos, any (default: any)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/recipe/doc.go` around lines 31 - 141, The GET query-parameter documentation in pkg/recipe/doc.go is stale: the "os" list omits amazonlinux and talos even though Criteria.OS (e.g., constants like CriteriaOSAmazonLinux, CriteriaOSTalos or related CriteriaOS* types) supports them; update the "os" line under "Query Parameters (HTTP API - GET)" to include amazonlinux and talos (so it reads e.g. "ubuntu, cos, rhel, amazonlinux, talos, any") and ensure any alias/description for the "os" parameter matches the Criteria.OS constants and examples elsewhere in the file.docs/user/cli-reference.md (1)
74-94:⚠️ Potential issue | 🟠 MajorAdd missing
aicr snapshot --osflag row to the CLI reference.The docs’
aicr snapshotflags table (lines ~74–94) does not list the new--osflag, even though recipe query mode does. This creates a user-visible mismatch between CLI behavior and documentation.✅ Suggested doc update (table row insertion)
| `--toleration` | | string[] | all taints | Tolerations for agent scheduling (key=value:effect, repeatable). **Default: all taints tolerated** (uses `operator: Exists`). Only specify to restrict which taints are tolerated. | +| `--os` | | string | | Node OS family to use for collectors (e.g. ubuntu, rhel, cos, amazonlinux, talos) | | `--timeout` | | duration | 5m | Timeout for agent Job completion | | `--no-cleanup` | | bool | false | Skip removal of Job and RBAC resources on completion. **Warning:** leaves a cluster-admin ClusterRoleBinding active. |Please confirm the exact
--osflag description/default from the Go CLI flag definition so the docs mirror the usage text precisely.#!/bin/bash # Description: Find the Go definition of the `aicr snapshot --os` flag so we can mirror its usage/default in docs. set -euo pipefail # 1) Locate snapshot command source file(s) SNAPSHOT_FILES=$(fd -a --type f "snapshot.go" pkg/cli pkg || true) echo "Found snapshot.go files:" echo "$SNAPSHOT_FILES" # 2) Search for `os` flag definition within those files for f in $SNAPSHOT_FILES; do echo "---- Searching $f for --os flag definition ----" rg -n --fixed-string 'Name: "os"' "$f" || true rg -n --fixed-string 'Name: "os"' "$f" || true rg -n -- '--os' "$f" || true rg -n -- "AICR_OS" "$f" || true rg -n -- "StringFlag\\{" "$f" | rg -n "os" || true done🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/user/cli-reference.md` around lines 74 - 94, The docs are missing the new aicr snapshot --os flag; locate the Go flag definition in the snapshot command (search for snapshot.go and the StringFlag/Flag with Name: "os" in the pkg/cli package or related files), copy the exact usage text and default value for the --os flag from that Go definition, and add a new row to the Flags table describing `--os` for the `aicr snapshot` command (include short/Type/Default/Description exactly as in the Go CLI). Ensure the table row references the `--os` flag name and mirrors the Go usage/default precisely so docs and CLI output match.pkg/snapshotter/snapshot.go (1)
85-100:⚠️ Potential issue | 🟠 MajorNormalize and validate
AICR_OSbefore factory selection.
parseOSEnv()currently forwards raw env input, so casing/whitespace/typos can silently fall back to non-Talos collectors and make failures hard to diagnose.♻️ Proposed hardening diff
import ( "context" "log/slog" "os" "strconv" + "strings" "sync" "time" @@ "github.com/NVIDIA/aicr/pkg/measurement" + "github.com/NVIDIA/aicr/pkg/recipe" "github.com/NVIDIA/aicr/pkg/serializer" ) @@ func parseOSEnv() string { - return os.Getenv("AICR_OS") + val := strings.ToLower(strings.TrimSpace(os.Getenv("AICR_OS"))) + if val == "" { + return "" + } + if _, err := recipe.ParseCriteriaOSType(val); err != nil { + slog.Warn("invalid AICR_OS value, ignoring", slog.String("value", val)) + return "" + } + return val }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/snapshotter/snapshot.go` around lines 85 - 100, Normalize and validate AICR_OS before using it to select collectors: change parseOSEnv() to return a trimmed, lowercased value (use strings.TrimSpace + strings.ToLower) and in NodeSnapshotter.measure() validate that the normalized osVal is one of the supported OS identifiers (e.g. "talos" or any values accepted by collector.WithOS); if osVal is empty continue with default behavior, but if it's present and not in the allowed set return an error (or log a clear warning and fail fast) instead of silently passing the raw value to collector.WithOS, referencing parseOSEnv, NodeSnapshotter.measure, and collector.WithOS to locate the change points.pkg/collector/talos/service.go (1)
100-104:⚠️ Potential issue | 🟠 MajorDo not mark
containerd.serviceactive for non-containerd or unparseable runtime IDs.At Line 101,
containerd.servicebecomesactivefor any non-empty runtime string (e.g.,cri-o://...or malformed values), which is an incorrect service-state signal.🔧 Suggested fix
runtimeName, runtimeVersion := splitRuntimeID(info.ContainerRuntimeVersion) - b.SetString(keyActiveState, "active"). - SetString(keyRuntimeName, runtimeName). - SetString(keyVersion, runtimeVersion) + b.SetString(keyRuntimeName, runtimeName) + if runtimeName != "containerd" || runtimeVersion == "" { + b.SetString(keyActiveState, "unknown") + return b.Build() + } + b.SetString(keyActiveState, "active"). + SetString(keyVersion, runtimeVersion)Please also update
pkg/collector/talos/service_test.goexpectations for the malformed runtime case accordingly.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/collector/talos/service.go` around lines 100 - 104, The code currently sets keyActiveState to "active" for any non-empty runtime string; change the logic around splitRuntimeID(info.ContainerRuntimeVersion) so you only SetString(keyActiveState, "active") when runtimeName == "containerd" and runtimeVersion is non-empty/parseable, otherwise do not mark containerd active (set inactive or omit the active state consistently). Update uses of keyRuntimeName and keyVersion to still record parsed values only when parse succeeded, and adjust pkg/collector/talos/service_test.go expectations for the malformed runtime case to assert containerd is not active (and that runtime/version are empty or reflect the parsing outcome).pkg/k8s/agent/job_test.go (1)
119-128:⚠️ Potential issue | 🟡 MinorAssert
AICR_OSpresence/absence, not only value.At Line 120,
gotOSEnv == ""passes both when the variable is absent and when it is present-but-empty, so the empty-OS case is not strictly validated.🔧 Suggested fix
- gotOSEnv := "" + gotOSEnv := "" + foundOSEnv := false for _, e := range spec.Containers[0].Env { if e.Name == "AICR_OS" { gotOSEnv = e.Value + foundOSEnv = true + break } } + if tt.wantAICROSEnv == "" && foundOSEnv { + t.Errorf("AICR_OS env should be absent for OS=%q", tt.os) + } + if tt.wantAICROSEnv != "" && !foundOSEnv { + t.Errorf("AICR_OS env should be present for OS=%q", tt.os) + } if gotOSEnv != tt.wantAICROSEnv { t.Errorf("AICR_OS env = %q, want %q", gotOSEnv, tt.wantAICROSEnv) }As per coding guidelines, "Explicitly test error conditions and edge cases in test files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/k8s/agent/job_test.go` around lines 119 - 128, The test currently only checks the value of AICR_OS via gotOSEnv which conflates "absent" and "present-but-empty"; change the loop over spec.Containers[0].Env to record both a boolean foundOSEnv and the value (e.g., foundOSEnv := false; gotOSEnv := "" and set foundOSEnv = true when e.Name == "AICR_OS"), then replace the single value check with two assertions: assert presence/absence (foundOSEnv should equal (tt.wantAICROSEnv != "")) and, only if foundOSEnv is true, assert gotOSEnv == tt.wantAICROSEnv so you explicitly test existence and value separately.pkg/collector/factory.go (1)
26-30: 🧹 Nitpick | 🔵 TrivialAdd a guard test to detect Talos OS constant drift across packages.
The "must stay in sync" comment is good, but enforcement is manual. Add a small invariant test in a package that can import all three constants (
recipe,collector,agent) so drift fails fast in CI.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/collector/factory.go` around lines 26 - 30, Add a unit test that asserts the Talos OS constant is identical across packages to catch drift: create a new test (e.g., invariants_test) that imports the three symbols OSTalos from pkg/collector (const OSTalos), CriteriaOSTalos from pkg/recipe, and whatever the Talos constant is in pkg/agent, and assert they are equal (using testing.Fatal or require/xtest equivalent) so CI fails on mismatch; name the test clearly (e.g., TestTalosOSConstantInvariant) and keep it lightweight with no external dependencies.pkg/collector/talos/os_test.go (1)
163-164:⚠️ Potential issue | 🟡 Minor
TestOSCollector_GracefulDegradationbypasses the env-resolution path it prepares.At Line 163,
WithNodeName("missing-node")short-circuits NODE_NAME/KUBERNETES_NODE_NAME/HOSTNAME fallback, so this test does not cover the empty-env path.Proposed fix
- c := NewOSCollector(WithClientSet(fake.NewSimpleClientset()), WithNodeName("missing-node")) + c := NewOSCollector(WithClientSet(fake.NewSimpleClientset()))As per coding guidelines, "Explicitly test error conditions and edge cases in test files."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/collector/talos/os_test.go` around lines 163 - 164, The test is bypassing the environment-resolution path by passing WithNodeName("missing-node") so it never exercises NODE_NAME/KUBERNETES_NODE_NAME/HOSTNAME fallbacks; remove the WithNodeName(...) option when constructing NewOSCollector (leave NewOSCollector(WithClientSet(fake.NewSimpleClientset()))), ensure the environment variables NODE_NAME, KUBERNETES_NODE_NAME and HOSTNAME are unset or cleared in the test setup, then call c.Collect(ctx) to verify graceful degradation; reference NewOSCollector, WithClientSet, WithNodeName and Collect when making the change.pkg/collector/talos/talos.go (1)
101-101:⚠️ Potential issue | 🟠 MajorEnforce a 10s timeout at the
fetchNodeI/O boundary.
fetchNodeissues Kubernetes API I/O directly on the incomingctxat Line 101; this helper should guarantee its own timeout so it cannot block indefinitely when reused without a deadline.Proposed fix
import ( "context" "log/slog" "github.com/NVIDIA/aicr/pkg/collector/k8s" + "github.com/NVIDIA/aicr/pkg/defaults" "github.com/NVIDIA/aicr/pkg/errors" k8sclient "github.com/NVIDIA/aicr/pkg/k8s/client" corev1 "k8s.io/api/core/v1" @@ - node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{}) + getCtx, cancel := context.WithTimeout(ctx, defaults.CollectorTimeout) + defer cancel() + node, err := cs.CoreV1().Nodes().Get(getCtx, nodeName, metav1.GetOptions{})As per coding guidelines, "All I/O operations and HTTP handlers must use context with explicit timeouts via
context.WithTimeout(). Collectors: 10s timeout."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/collector/talos/talos.go` at line 101, The fetchNode helper currently calls cs.CoreV1().Nodes().Get using the incoming ctx and must enforce a 10s I/O timeout; wrap the call in a context.WithTimeout(ctx, 10*time.Second), defer cancel(), and use that timed context when calling cs.CoreV1().Nodes().Get (the node retrieval in fetchNode) so the function cannot block indefinitely when invoked without a deadline; ensure you import time if not present and always call cancel to avoid leaks.pkg/collector/talos/os.go (1)
134-156:⚠️ Potential issue | 🟡 Minor
parseOSImageemits misleadingID/VERSION_IDfor non‑Talos OSImage formats.At Line 146+ this parser converts multi-word distro names into invalid structured fields (e.g., RHEL-like strings become
ID=red,VERSION_ID=Plow). Prefer Talos-only normalization and leaveID/VERSION_IDunset when format is not confidently mapped.Proposed fix
func parseOSImage(image string) (id, version string) { if image == "" { return "", "" } parts := strings.SplitN(image, " (", 2) - id = strings.ToLower(strings.TrimSpace(parts[0])) - // Strip secondary words from the ID for distros that omit the - // parenthesis form (e.g., "Ubuntu 22.04.5 LTS" → "ubuntu"). - if spaceIdx := strings.Index(id, " "); spaceIdx >= 0 { - id = id[:spaceIdx] - } - if len(parts) == 2 { + base := strings.TrimSpace(parts[0]) + if !strings.EqualFold(base, "talos") { + return "", "" + } + id = "talos" + if len(parts) == 2 { v := strings.TrimSuffix(strings.TrimSpace(parts[1]), ")") v = strings.TrimPrefix(v, "v") version = v } return id, version }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/collector/talos/os.go` around lines 134 - 156, parseOSImage currently force-parses multi-word distro strings into invalid ID/VERSION_ID; change it to only yield values for confident Talos-style patterns: in parseOSImage check for a parenthesized version suffix (parts := strings.SplitN(image, " (", 2)) and only proceed if len(parts) == 2 AND the left-hand part contains no spaces OR the left-hand part equals "Talos" (case-insensitive) — otherwise return empty strings; when proceeding, keep the existing lowercasing/trim and version stripping logic (v prefix trim) so only true Talos-style inputs produce ID and VERSION_ID.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/k8s/agent/types.go`:
- Around line 43-47: Add a unit test that asserts the OSTalos constant in
pkg/k8s/agent (symbol OSTalos) matches the equivalent constants in the other
packages (e.g., collector.OSTalos and recipe.CriteriaOSTalos) to prevent drift;
implement a small cross-package test (in the appropriate _test.go) that imports
those packages and uses simple equality assertions (t.Fatalf or t.Errorf) if any
values differ so CI fails when they get out of sync.
---
Outside diff comments:
In `@pkg/snapshotter/snapshot_test.go`:
- Around line 17-25: The "unset" test case currently calls t.Setenv("AICR_OS",
"") which leaves the variable present but empty; change the test to remove the
environment variable so it is truly absent (use os.Unsetenv("AICR_OS") or call
os.Unsetenv inside the test setup and arrange to restore it via t.Cleanup) so
parseOSEnv() exercises the "not present" branch; update the same pattern in the
other occurrence around the block that covers lines 320-338 to ensure both tests
really unset AICR_OS.
---
Duplicate comments:
In `@docs/user/cli-reference.md`:
- Around line 74-94: The docs are missing the new aicr snapshot --os flag;
locate the Go flag definition in the snapshot command (search for snapshot.go
and the StringFlag/Flag with Name: "os" in the pkg/cli package or related
files), copy the exact usage text and default value for the --os flag from that
Go definition, and add a new row to the Flags table describing `--os` for the
`aicr snapshot` command (include short/Type/Default/Description exactly as in
the Go CLI). Ensure the table row references the `--os` flag name and mirrors
the Go usage/default precisely so docs and CLI output match.
In `@pkg/collector/factory.go`:
- Around line 26-30: Add a unit test that asserts the Talos OS constant is
identical across packages to catch drift: create a new test (e.g.,
invariants_test) that imports the three symbols OSTalos from pkg/collector
(const OSTalos), CriteriaOSTalos from pkg/recipe, and whatever the Talos
constant is in pkg/agent, and assert they are equal (using testing.Fatal or
require/xtest equivalent) so CI fails on mismatch; name the test clearly (e.g.,
TestTalosOSConstantInvariant) and keep it lightweight with no external
dependencies.
In `@pkg/collector/talos/os_test.go`:
- Around line 163-164: The test is bypassing the environment-resolution path by
passing WithNodeName("missing-node") so it never exercises
NODE_NAME/KUBERNETES_NODE_NAME/HOSTNAME fallbacks; remove the WithNodeName(...)
option when constructing NewOSCollector (leave
NewOSCollector(WithClientSet(fake.NewSimpleClientset()))), ensure the
environment variables NODE_NAME, KUBERNETES_NODE_NAME and HOSTNAME are unset or
cleared in the test setup, then call c.Collect(ctx) to verify graceful
degradation; reference NewOSCollector, WithClientSet, WithNodeName and Collect
when making the change.
In `@pkg/collector/talos/os.go`:
- Around line 134-156: parseOSImage currently force-parses multi-word distro
strings into invalid ID/VERSION_ID; change it to only yield values for confident
Talos-style patterns: in parseOSImage check for a parenthesized version suffix
(parts := strings.SplitN(image, " (", 2)) and only proceed if len(parts) == 2
AND the left-hand part contains no spaces OR the left-hand part equals "Talos"
(case-insensitive) — otherwise return empty strings; when proceeding, keep the
existing lowercasing/trim and version stripping logic (v prefix trim) so only
true Talos-style inputs produce ID and VERSION_ID.
In `@pkg/collector/talos/service.go`:
- Around line 100-104: The code currently sets keyActiveState to "active" for
any non-empty runtime string; change the logic around
splitRuntimeID(info.ContainerRuntimeVersion) so you only
SetString(keyActiveState, "active") when runtimeName == "containerd" and
runtimeVersion is non-empty/parseable, otherwise do not mark containerd active
(set inactive or omit the active state consistently). Update uses of
keyRuntimeName and keyVersion to still record parsed values only when parse
succeeded, and adjust pkg/collector/talos/service_test.go expectations for the
malformed runtime case to assert containerd is not active (and that
runtime/version are empty or reflect the parsing outcome).
In `@pkg/collector/talos/talos.go`:
- Line 101: The fetchNode helper currently calls cs.CoreV1().Nodes().Get using
the incoming ctx and must enforce a 10s I/O timeout; wrap the call in a
context.WithTimeout(ctx, 10*time.Second), defer cancel(), and use that timed
context when calling cs.CoreV1().Nodes().Get (the node retrieval in fetchNode)
so the function cannot block indefinitely when invoked without a deadline;
ensure you import time if not present and always call cancel to avoid leaks.
In `@pkg/k8s/agent/job_test.go`:
- Around line 119-128: The test currently only checks the value of AICR_OS via
gotOSEnv which conflates "absent" and "present-but-empty"; change the loop over
spec.Containers[0].Env to record both a boolean foundOSEnv and the value (e.g.,
foundOSEnv := false; gotOSEnv := "" and set foundOSEnv = true when e.Name ==
"AICR_OS"), then replace the single value check with two assertions: assert
presence/absence (foundOSEnv should equal (tt.wantAICROSEnv != "")) and, only if
foundOSEnv is true, assert gotOSEnv == tt.wantAICROSEnv so you explicitly test
existence and value separately.
In `@pkg/recipe/doc.go`:
- Around line 31-141: The GET query-parameter documentation in pkg/recipe/doc.go
is stale: the "os" list omits amazonlinux and talos even though Criteria.OS
(e.g., constants like CriteriaOSAmazonLinux, CriteriaOSTalos or related
CriteriaOS* types) supports them; update the "os" line under "Query Parameters
(HTTP API - GET)" to include amazonlinux and talos (so it reads e.g. "ubuntu,
cos, rhel, amazonlinux, talos, any") and ensure any alias/description for the
"os" parameter matches the Criteria.OS constants and examples elsewhere in the
file.
In `@pkg/snapshotter/snapshot.go`:
- Around line 85-100: Normalize and validate AICR_OS before using it to select
collectors: change parseOSEnv() to return a trimmed, lowercased value (use
strings.TrimSpace + strings.ToLower) and in NodeSnapshotter.measure() validate
that the normalized osVal is one of the supported OS identifiers (e.g. "talos"
or any values accepted by collector.WithOS); if osVal is empty continue with
default behavior, but if it's present and not in the allowed set return an error
(or log a clear warning and fail fast) instead of silently passing the raw value
to collector.WithOS, referencing parseOSEnv, NodeSnapshotter.measure, and
collector.WithOS to locate the change points.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 1f7b28e3-dac2-4dc1-bc72-5d0041e4d439
📒 Files selected for processing (27)
api/aicr/v1/server.yamldocs/README.mddocs/contributor/api-server.mddocs/contributor/cli.mddocs/contributor/validations.mddocs/user/api-reference.mddocs/user/cli-reference.mdpkg/cli/recipe.gopkg/cli/snapshot.gopkg/collector/factory.gopkg/collector/factory_test.gopkg/collector/talos/doc.gopkg/collector/talos/os.gopkg/collector/talos/os_test.gopkg/collector/talos/service.gopkg/collector/talos/service_test.gopkg/collector/talos/talos.gopkg/k8s/agent/job.gopkg/k8s/agent/job_test.gopkg/k8s/agent/types.gopkg/recipe/criteria.gopkg/recipe/criteria_test.gopkg/recipe/doc.gopkg/snapshotter/agent.gopkg/snapshotter/snapshot.gopkg/snapshotter/snapshot_test.gosite/docs/getting-started/index.md
| // OSTalos is the agent.Config.OS value that selects the Talos-aware pod | ||
| // configuration (no systemd hostPath mounts) and the Kubernetes-API service | ||
| // collector backend. Must stay in sync with collector.OSTalos and | ||
| // recipe.CriteriaOSTalos. | ||
| const OSTalos = "talos" |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Add a small cross-package guard test for Talos constant sync.
You already document that this must stay aligned with other packages; a unit test assertion would prevent silent drift.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/k8s/agent/types.go` around lines 43 - 47, Add a unit test that asserts
the OSTalos constant in pkg/k8s/agent (symbol OSTalos) matches the equivalent
constants in the other packages (e.g., collector.OSTalos and
recipe.CriteriaOSTalos) to prevent drift; implement a small cross-package test
(in the appropriate _test.go) that imports those packages and uses simple
equality assertions (t.Fatalf or t.Errorf) if any values differ so CI fails when
they get out of sync.
Extracts the OS criterion string values into a leaf package pkg/recipe/oskind so pkg/recipe, pkg/collector, and pkg/k8s/agent all reference the same constant rather than each duplicating an "OSTalos" const with a "must stay in sync" comment that nothing enforces. The leaf package was the lowest-impact way to break the existing pkg/recipe -> pkg/snapshotter -> pkg/collector import cycle that originally drove the duplication. - New pkg/recipe/oskind package with the OS string constants (Any, Ubuntu, RHEL, COS, AmazonLinux, Talos), an All() helper used by recipe.GetCriteriaOSTypes, and an IsKnown() validator that downstream callers (e.g., AICR_OS env validation in the snapshotter) can use without taking on pkg/recipe's full surface. - pkg/recipe/criteria.go: CriteriaOS* typed constants now alias oskind values; GetCriteriaOSTypes delegates to oskind.All. - pkg/collector/factory.go: removes the local OSTalos string literal and its sync-warning comment; branches on oskind.Talos. - pkg/k8s/agent/types.go: removes the local OSTalos const and its sync-warning comment; pkg/k8s/agent/job.go branches on oskind.Talos. - Tests updated to reference oskind.Talos. Addresses PR #714 review feedback (constant drift) by eliminating the drift surface entirely rather than adding a guard test for it.
The --os flag was added to pkg/cli/snapshot.go in the prior commit but only the recipe-command flag table was updated. Adds the row to the snapshot-command flag table mirroring the Go usage text. Addresses PR #714 review feedback that AGENTS.md requires user- visible CLI flags to be documented in the same PR.
ServiceCollector and OSCollector ran in parallel from the snapshotter errgroup, each calling cfg.fetchNode independently — two Nodes().Get() round-trips for the same node per snapshot. Adds a sync.Once cache to config.fetchNode and a NewCollectors constructor that pairs the two collectors against a single config; the factory now uses NewCollectors when OS=talos so a snapshot does exactly one Node API fetch regardless of how many Talos collectors are added under this template in future. NewServiceCollector / NewOSCollector remain for tests that exercise a single collector in isolation; only the factory needs the shared-config form. Adds a regression test using a fake clientset reactor that asserts exactly one Get is observed across both collectors. Addresses PR #714 review feedback.
buildContainerdSubtype previously stamped ActiveState=active for any non-empty ContainerRuntimeVersion, including cri-o:// URIs and malformed strings without a "://" separator. A constraint comparing SystemD.containerd.service.Version >= "1.7" against a cri-o node would silently never match with no signal that the data was unparsed, and tests locked that behavior in. Tightens the active-state semantics to: empty -> unknown (no signal from kubelet) parseable as "containerd://X" -> active (containerd is the runtime) any other non-empty value -> unknown (CRI-O, malformed, etc.) RuntimeName is still recorded so downstream tooling can see what was observed; only ActiveState/Version are gated. Updates the malformed-id test to assert unknown and adds a new cri-o:// case for symmetry. Addresses PR #714 review feedback.
parseOSImage previously force-parsed any input — multi-word distro
names became invalid structured fields like ID=red, VERSION_ID=Plow
for "Red Hat Enterprise Linux 9.4 (Plow)". The Talos OS collector
only runs when the OS criterion is talos, so non-Talos inputs are
off the production path; declining to parse them keeps the data
shape honest if anyone ever exercises this code with another
OSImage.
Now only Talos-format strings (e.g., "Talos (vX.Y.Z)") yield ID
and VERSION_ID. Everything else returns empty strings; the caller
in buildReleaseSubtype already skips emitting those keys when
empty, so PRETTY_NAME stays populated but ID/VERSION_ID stay
unset rather than misleading.
Updates TestParseOSImage table to reflect the new contract: drops
the locked-in ("red", "Plow") and ("ubuntu", "") cases, asserts
empty for non-Talos inputs. Adds an absence assertion in
TestOSCollector_PopulatesReleaseAndExtensions so a regression that
re-introduces invented IDs would fail loudly.
Addresses PR #714 review feedback.
parseOSEnv used to forward whatever was in AICR_OS verbatim. A typo like AICR_OS=talsoo would silently bypass the in-pod factory's oskind.Talos check, fall back to the systemd collector — which then fails because the systemd hostPath mounts were already skipped on the controller side based on the (validated) --os flag. Hard to diagnose. Now mirrors parseMaxNodesPerEntryEnv's pattern: trim+lowercase, run through oskind.IsKnown, log slog.Warn on invalid input, and return "" so the factory falls back cleanly to the default backend. The controller-side CLI still validates first via recipe.ParseCriteriaOSType — this is defense-in-depth at the agent's process boundary in case AICR_OS is ever set outside the documented path. Test cases extended for normalization (case, whitespace) and invalid values. Addresses PR #714 review feedback.
The previous assertion `gotOSEnv == ""` passed for both "AICR_OS absent from container env" and "AICR_OS present with empty value". The agent intentionally omits AICR_OS entirely when OS is unset, so the test should fail loudly if a regression starts emitting an empty-valued env var. Tracks both presence (foundOSEnv bool) and value, asserts each separately. No production change. Addresses PR #714 review feedback.
The Criteria.OS comment was updated for amazonlinux/talos in the prior commit, but the GET query-parameter section in the same file still listed only ubuntu/cos/rhel/any. Bring it in line with the actual OS values declared in pkg/recipe/oskind. Addresses PR #714 review feedback.
TestOSCollector_GracefulDegradation passed WithNodeName("missing-node"),
which short-circuited the NODE_NAME/KUBERNETES_NODE_NAME/HOSTNAME
fallback chain — the test only covered "explicit node name not found"
and never reached resolveNodeName's env path.
Drops the WithNodeName option so the cleared env vars actually flow
through resolveNodeName -> k8s.GetNodeName -> empty -> graceful empty
measurement. No production change.
Addresses PR #714 review feedback.
t.Setenv("AICR_OS", "") leaves the variable present-but-empty, not
absent. parseOSEnv reads via os.Getenv which collapses both to the
same empty result today, so the assertion passed — but the test
description "unset" did not match the test reality, and a future
switch to os.LookupEnv (to distinguish absent from empty) would
silently skip the absent branch.
Splits the case into "unset" (uses os.Unsetenv with t.Cleanup
restoration) and "set but empty" (the existing t.Setenv path) so
both branches are exercised explicitly. No production change.
Addresses PR #714 review feedback.
mchmarny
left a comment
There was a problem hiding this comment.
/LGTM
All 6 of my previous review comments have been resolved
Adds an opt-in, kind-equivalent harness for spinning up a local Talos Linux cluster on a developer laptop and exercising the AICR snapshot agent against it. Closes the integration-test gap left by #714 (Talos collector) — the OS=talos pod-shape branch in pkg/k8s/agent/job.go was unit-tested only. The harness has two independent pieces stitched together by make: * tools/talos-test/{up.sh,down.sh,README.md}: cluster lifecycle. Wraps 'talosctl cluster create docker' (newer talosctl moved provisioners to subcommands), threads a containerd registry-mirror config patch so 'localhost:5001/aicr:local' resolves from inside Talos nodes, and on macOS Docker Desktop / Podman Machine loads br_netfilter via a one-shot privileged sidecar so the default Flannel CNI works. Streams 'talosctl health' progress, derives the apid endpoint and K8s API server URL from the docker-NAT'd host ports, rewrites the kubeconfig to the host-routable form, and relaxes Pod Security Standards on the default namespace so the privileged agent pod can be scheduled. Documented prerequisites and install commands for talosctl in the README. * tests/chainsaw/snapshot/deploy-agent-talos/: chainsaw test. Invokes 'aicr snapshot --os talos -o cm://default/aicr-e2e-snapshot' from a script step; the CLI exercises pkg/k8s/agent.Deployer end-to-end (RBAC creation, OS=talos pod shape, log streaming, Job completion, ConfigMap write, cleanup). Asserts only the surviving ConfigMap content. Uses chainsaw JMESPath filters ((sort([*].type)), (@[?type == 'X'] | [0].subtypes | sort(...))) so the assertion is order-independent at both the measurement-type and per-type subtype levels. Make targets (Makefile): talos-dev-env spin up cluster talos-dev-env-clean destroy cluster talos-snapshot-test build + run chainsaw against the live cluster None are wired into 'make qualify'; Talos is opt-in for now. CLI: The agent container's resource requests and limits were hardcoded (Privileged: 4Gi req / 8Gi lim memory; Restricted: 256Mi / 512Mi). That sizing fits production GPU nodes but blocks scheduling on small dev clusters such as the talosctl Docker provisioner default (2Gi worker), where there is no node-level option to bypass. Adds: --requests 'cpu=500m,memory=1Gi,ephemeral-storage=1Gi' --limits 'cpu=1,memory=2Gi,ephemeral-storage=2Gi' Each accepts a comma-separated 'name=quantity' list. Unspecified keys fall back to the existing per-mode defaults so the production path is unchanged. The nvidia.com/gpu limit injected by --require-gpu is preserved on top of any override. Plumbed through pkg/snapshotter.AgentConfig and pkg/k8s/agent.Config as corev1.ResourceList; applyPrivilegedSettings and applyRestrictedSettings merge overrides over the defaults via a new mergeResourceList helper. Tests: pkg/k8s/agent: TestApplyPrivilegedSettings_ResourceOverrides (no override, partial override, RequireGPU + override), TestApplyRestrictedSettings_ResourceOverrides, TestMergeResourceList (input-immutability invariant). Docs: docs/user/cli-reference.md --requests / --limits flag table entries tools/talos-test/README.md prerequisites, quickstart, customization, troubleshooting Components Affected: CLI (pkg/cli) Agent (pkg/k8s/agent, pkg/snapshotter) Tests/CI (tests/chainsaw/snapshot/deploy-agent-talos, tools/talos-test) Build (Makefile) Docs (docs/user/cli-reference.md) Tested locally end-to-end on macOS Podman Machine (Talos v1.9.0 + talosctl 1.13.0) — bring-up, agent deployment, snapshot assertion, teardown all pass.
Summary
Adds Talos Linux support to the snapshot agent and collector framework, unblocking AICR deployment on clusters that have no systemd and no host
/etc/os-releaseaccessible to unprivileged pods. Selection is driven by the existingosrecipe criterion (newos: talosvalue); no new criteria fields are introduced.Motivation / Context
Talos has no systemd D-Bus and exposes no
/etc/os-releaseat the host filesystem level, so the snapshot agent's privileged pod fails to start: the/run/systemdand/etc/os-releasehostPathmounts inpkg/k8s/agent/job.gocannot resolve. This is the documented blocker raised in the issue thread.This PR is the first phase of Talos enablement, scoped to the agent / collector / validator update path. Future phases (declarative cross-OS namespace and security-context configuration; optional Nodewright tuning package for Talos) are deferred to follow-on PRs.
Fixes: N/A
Related: #565
Type of Change
Component(s) Affected
cmd/aicr,pkg/cli)cmd/aicrd,pkg/api,pkg/server)pkg/recipe)pkg/bundler,pkg/component/*)pkg/collector,pkg/snapshotter)pkg/validator)pkg/errors,pkg/k8s)docs/,examples/)Implementation Notes
Selection via existing OS criterion
recipe.CriteriaOSTalos = "talos"added topkg/recipe/criteria.gowith the standard enum audit (api/aicr/v1/server.yaml× 3,docs/,site/docs/,pkg/recipe/doc.go,pkg/cli/recipe.go). The collector factory branches on this value via a newWithOSoption:collector.OSTalosis a string literal kept in sync withrecipe.CriteriaOSTalosto avoid thepkg/recipe→pkg/snapshotter→pkg/collectorimport cycle.New
pkg/collector/talospackageOrganized so future OS-specific packages can follow the same template:
talos.goOption,config,fetchNodehelper, shared reading keys.service.goServiceCollectoremitsTypeSystemDwithcontainerd.service+kubelet.servicesubtypes derived fromNode.Status.NodeInfo.{ContainerRuntimeVersion, KubeletVersion}. Subtype names match the systemd backend so existing constraints work unchanged.os.goOSCollectoremitsTypeOSwith areleasesubtype (parsed fromOSImage/KernelVersion) and anextensionssubtype derived fromextensions.talos.dev/<name>: <version>Node labels (e.g.,nvidia-container-toolkit,nvidia-open-gpu-kernel-modules). Theextensionssubtype is omitted when no matching labels are present.doc.goPod construction gating
pkg/k8s/agent/job.goskips the/run/systemdand/etc/os-releasehostPathmounts whenOS == "talos"and propagatesAICR_OSto the agent container. The in-pod factory rebuilds itself fromAICR_OSso it picks the same backend the controller-side picked.Deliberate stub: Kubernetes API only
The Talos backend uses only the Kubernetes API surface that
pkg/collector/k8salready touches. No MPL-2.0 dependencies are introduced. The package docs and code comments lay out a future expansion path:Node.Status.NodeInfo+Node.Labels.machined—google.golang.org/grpcwithdynamicpbmessages. No vendored MPL surface, but version-fragile and untyped.siderolabs/talos/pkg/machinery/client— MPL-2.0 which is different the the rest of dependenciesStep up only when a constraint genuinely needs data the previous phase cannot satisfy.
CLI
New
--osflag onaicr snapshot, validated throughrecipe.ParseCriteriaOSType. Plumbs throughsnapshotter.AgentConfig.OS→agent.Config.OS→ pod env + factory.Testing
Coverage delta:
pkg/collector/talosis a new package, so this is a baseline (91.7%, well above the 70% project floor). No other package has its coverage decreased by this PR.New tests added:
pkg/collector/talos/service_test.go— table-driven NodeInfo population, graceful degradation on missing client/node,splitRuntimeIDedge cases.pkg/collector/talos/os_test.go— table-driven release + extensions assembly, OSImage parsing across distros (Talos, Ubuntu LTS, RHEL), label-with-empty-suffix filtering, graceful degradation.pkg/collector/factory_test.go— verifiesOS=talosreturns*talos.ServiceCollector/*talos.OSCollector; default OS preserves systemd / os backends.pkg/k8s/agent/job_test.go—TestBuildPodSpec_TalosSkipsSystemDHostPathcovers Talos / Ubuntu / empty-OS cases for hostPath mounts andAICR_OSenv.pkg/snapshotter/snapshot_test.go—TestParseOSEnvfor the in-pod env-var pickup.pkg/recipe/criteria_test.go—talosin alias and sorted-types tests.Risk Assessment
Rollout notes: Backwards-compatible. All Talos behavior is gated on
OS == "talos"; existing recipes and agent invocations on Ubuntu / RHEL / COS / AmazonLinux are unchanged. EmptyOSpreserves the systemd default. The new--osflag is optional. To revert, drop the feature branch — no migration is required because the wire format adds new constants without changing existing values, and the default factory branches are unchanged.Checklist
make testwith-race)make lint)git commit -S) — GPG signing info