Skip to content

feat(collector): add Talos OS support via Kubernetes Node info#714

Merged
mchmarny merged 13 commits into
mainfrom
feat/talos_collector
Apr 30, 2026
Merged

feat(collector): add Talos OS support via Kubernetes Node info#714
mchmarny merged 13 commits into
mainfrom
feat/talos_collector

Conversation

@ayuskauskas

@ayuskauskas ayuskauskas commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds Talos Linux support to the snapshot agent and collector framework, unblocking AICR deployment on clusters that have no systemd and no host /etc/os-release accessible to unprivileged pods. Selection is driven by the existing os recipe criterion (new os: talos value); no new criteria fields are introduced.

Motivation / Context

Talos has no systemd D-Bus and exposes no /etc/os-release at the host filesystem level, so the snapshot agent's privileged pod fails to start: the /run/systemd and /etc/os-release hostPath mounts in pkg/k8s/agent/job.go cannot resolve. This is the documented blocker raised in the issue thread.

This PR is the first phase of Talos enablement, scoped to the agent / collector / validator update path. Future phases (declarative cross-OS namespace and security-context configuration; optional Nodewright tuning package for Talos) are deferred to follow-on PRs.

Fixes: N/A
Related: #565

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Refactoring (no functional changes)
  • Build/CI/tooling

Component(s) Affected

  • CLI (cmd/aicr, pkg/cli)
  • API server (cmd/aicrd, pkg/api, pkg/server)
  • Recipe engine / data (pkg/recipe)
  • Bundlers (pkg/bundler, pkg/component/*)
  • Collectors / snapshotter (pkg/collector, pkg/snapshotter)
  • Validator (pkg/validator)
  • Core libraries (pkg/errors, pkg/k8s)
  • Docs/examples (docs/, examples/)
  • Other: ____________

Implementation Notes

Selection via existing OS criterion

recipe.CriteriaOSTalos = "talos" added to pkg/recipe/criteria.go with the standard enum audit (api/aicr/v1/server.yaml × 3, docs/, site/docs/, pkg/recipe/doc.go, pkg/cli/recipe.go). The collector factory branches on this value via a new WithOS option:

func (f *DefaultFactory) CreateSystemDCollector() Collector {
    if f.OS == OSTalos { return talos.NewServiceCollector() }
    return &systemd.Collector{Services: f.SystemDServices}
}

func (f *DefaultFactory) CreateOSCollector() Collector {
    if f.OS == OSTalos { return talos.NewOSCollector() }
    return &os.Collector{}
}

collector.OSTalos is a string literal kept in sync with recipe.CriteriaOSTalos to avoid the pkg/recipepkg/snapshotterpkg/collector import cycle.

New pkg/collector/talos package

Organized so future OS-specific packages can follow the same template:

File Role
talos.go Package entry: Option, config, fetchNode helper, shared reading keys.
service.go ServiceCollector emits TypeSystemD with containerd.service + kubelet.service subtypes derived from Node.Status.NodeInfo.{ContainerRuntimeVersion, KubeletVersion}. Subtype names match the systemd backend so existing constraints work unchanged.
os.go OSCollector emits TypeOS with a release subtype (parsed from OSImage / KernelVersion) and an extensions subtype derived from extensions.talos.dev/<name>: <version> Node labels (e.g., nvidia-container-toolkit, nvidia-open-gpu-kernel-modules). The extensions subtype is omitted when no matching labels are present.
doc.go Package documentation: layout, selection, escalation path.

Pod construction gating

pkg/k8s/agent/job.go skips the /run/systemd and /etc/os-release hostPath mounts when OS == "talos" and propagates AICR_OS to the agent container. The in-pod factory rebuilds itself from AICR_OS so it picks the same backend the controller-side picked.

Deliberate stub: Kubernetes API only

The Talos backend uses only the Kubernetes API surface that pkg/collector/k8s already touches. No MPL-2.0 dependencies are introduced. The package docs and code comments lay out a future expansion path:

  1. (this PR) Kubernetes API stubNode.Status.NodeInfo + Node.Labels.
  2. gRPC reflection against machinedgoogle.golang.org/grpc with dynamicpb messages. No vendored MPL surface, but version-fragile and untyped.
  3. Vendor siderolabs/talos/pkg/machinery/client — MPL-2.0 which is different the the rest of dependencies

Step up only when a constraint genuinely needs data the previous phase cannot satisfy.

CLI

New --os flag on aicr snapshot, validated through recipe.ParseCriteriaOSType. Plumbs through snapshotter.AgentConfig.OSagent.Config.OS → pod env + factory.

Testing

unset GITLAB_TOKEN

go test -race -short ./...
# 55/55 packages pass

GOFLAGS="-mod=vendor" go test -coverprofile=cover.out ./pkg/collector/talos/...
go tool cover -func=cover.out | tail -1
# total: (statements) 91.7%

golangci-lint run -c .golangci.yaml ./...
# 0 issues.

./tools/check-agents-sync
# OK: AGENTS.md is in sync with .claude/CLAUDE.md

Coverage delta: pkg/collector/talos is a new package, so this is a baseline (91.7%, well above the 70% project floor). No other package has its coverage decreased by this PR.

New tests added:

  • pkg/collector/talos/service_test.go — table-driven NodeInfo population, graceful degradation on missing client/node, splitRuntimeID edge cases.
  • pkg/collector/talos/os_test.go — table-driven release + extensions assembly, OSImage parsing across distros (Talos, Ubuntu LTS, RHEL), label-with-empty-suffix filtering, graceful degradation.
  • pkg/collector/factory_test.go — verifies OS=talos returns *talos.ServiceCollector / *talos.OSCollector; default OS preserves systemd / os backends.
  • pkg/k8s/agent/job_test.goTestBuildPodSpec_TalosSkipsSystemDHostPath covers Talos / Ubuntu / empty-OS cases for hostPath mounts and AICR_OS env.
  • pkg/snapshotter/snapshot_test.goTestParseOSEnv for the in-pod env-var pickup.
  • pkg/recipe/criteria_test.gotalos in alias and sorted-types tests.

Risk Assessment

  • Low — Isolated change, well-tested, easy to revert
  • Medium — Touches multiple components or has broader impact
  • High — Breaking change, affects critical paths, or complex rollout

Rollout notes: Backwards-compatible. All Talos behavior is gated on OS == "talos"; existing recipes and agent invocations on Ubuntu / RHEL / COS / AmazonLinux are unchanged. Empty OS preserves the systemd default. The new --os flag is optional. To revert, drop the feature branch — no migration is required because the wire format adds new constants without changing existing values, and the default factory branches are unchanged.

Checklist

  • Tests pass locally (make test with -race)
  • Linter passes (make lint)
  • I did not skip/disable tests to make CI green
  • I added/updated tests for new functionality
  • I updated docs if user-facing behavior changed
  • Changes follow existing patterns in the codebase
  • Commits are cryptographically signed (git commit -S) — GPG signing info

@coderabbitai

coderabbitai Bot commented Apr 29, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 13aa9b1d-7209-44d9-96b7-47b8a0c2a50c

📥 Commits

Reviewing files that changed from the base of the PR and between b312ad4 and dd908e5.

📒 Files selected for processing (2)
  • docs/user/api-reference.md
  • docs/user/cli-reference.md

📝 Walkthrough

Walkthrough

This pull request adds support for Talos as a valid GPU node operating system across the entire system. It introduces a new pkg/recipe/oskind package as a single source of truth for OS values, refactors OS criteria to use these canonical constants, adds Talos-specific collector implementations (pkg/collector/talos/) that fetch node data from Kubernetes, updates the collector factory to conditionally route to Talos collectors when appropriate, integrates OS selection into the CLI (--os flag) and snapshotter (AICR_OS environment variable), modifies agent pod configuration to skip systemd mounts for Talos, and updates API schemas and documentation throughout to reflect the new supported OS option.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(collector): add Talos OS support via Kubernetes Node info' clearly and specifically describes the main feature addition of Talos OS support via Kubernetes Node information, directly matching the primary change in this changeset.
Description check ✅ Passed The description provides comprehensive context about adding Talos Linux support, motivation, implementation details, testing, and risk assessment, all of which align with and explain the changeset modifications.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/talos_collector

Review rate limit: 9/10 reviews remaining, refill in 6 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/collector/talos/os_test.go`:
- Around line 158-164: The test TestOSCollector_GracefulDegradation currently
passes WithNodeName("missing-node") which bypasses the environment-variable
resolution path; change the setup to construct the collector without the
WithNodeName option (i.e., call
NewOSCollector(WithClientSet(fake.NewSimpleClientset())) so the collector reads
NODE_NAME/KUBERNETES_NODE_NAME/HOSTNAME (already set to empty via t.Setenv) and
then assert Collect() handles the missing env-var case; update any
assertions/messages if they specifically expected the explicit node-name
behavior.

In `@pkg/collector/talos/service.go`:
- Around line 100-104: The code unconditionally sets the containerd subtype to
active whenever info.ContainerRuntimeVersion is non-empty; update the logic
around splitRuntimeID(info.ContainerRuntimeVersion) so you only mark
keyActiveState as "active" when the parsed runtime name equals "containerd"
(from splitRuntimeID), otherwise set keyActiveState to "unknown" or skip setting
the subtype fields; modify the block that calls b.SetString(keyActiveState,
...).SetString(keyRuntimeName, ...).SetString(keyVersion, ...) to branch on
runtimeName == "containerd" and handle non-containerd runtimes accordingly
(referencing splitRuntimeID, info.ContainerRuntimeVersion, keyActiveState,
keyRuntimeName, keyVersion).

In `@pkg/collector/talos/talos.go`:
- Around line 101-107: The fetchNode call uses the incoming ctx directly for
cs.CoreV1().Nodes().Get which can block without a deadline; wrap the API call in
a context.WithTimeout of 10s (collector timeout), e.g., create ctx, cancel :=
context.WithTimeout(ctx, 10*time.Second) and defer cancel(), then pass that
child ctx into cs.CoreV1().Nodes().Get (referencing fetchNode and
cs.CoreV1().Nodes().Get) and handle/propagate the timeout error appropriately
instead of using the original ctx.

In `@pkg/k8s/agent/job_test.go`:
- Around line 119-128: The test currently only compares the value of gotOSEnv
and can’t tell if the AICR_OS env var is absent vs present-but-empty; modify the
loop over spec.Containers[0].Env to track presence (e.g., a found boolean
alongside gotOSEnv) when e.Name == "AICR_OS", then after the loop assert the env
var was found (fail the test if not) and then assert that gotOSEnv ==
tt.wantAICROSEnv; update the assertion logic in pkg/k8s/agent/job_test.go so
presence and value of AICR_OS are both validated (referencing gotOSEnv, found,
spec.Containers[0].Env and tt.wantAICROSEnv).

In `@pkg/recipe/doc.go`:
- Line 31: Update the GET query-parameter documentation in the same godoc where
Criteria is described so the OS examples match the Criteria struct comment;
replace the stale "os" examples ("ubuntu, cos, rhel, any") with the full,
consistent list used in the Criteria comment ("ubuntu, rhel, cos, amazonlinux,
talos, any") (ensure ordering/formatting follows the Criteria comment) and
confirm the GET parameter list text referencing "os" uses the same identifiers.

In `@pkg/snapshotter/snapshot.go`:
- Around line 85-89: The code reads AICR_OS raw via parseOSEnv(), which can miss
matches due to casing/whitespace; update parseOSEnv to normalize the value by
trimming surrounding whitespace and converting to a consistent case (e.g.,
strings.ToLower(strings.TrimSpace(os.Getenv("AICR_OS")))) so collector backend
selection (the code paths that inspect parseOSEnv() and the related selection
logic) reliably matches values like "Talos" regardless of casing/extra spaces;
ensure callers of parseOSEnv() continue to work with the normalized string.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: b8c3db3e-be88-4fa4-bc9b-e5ef933325cd

📥 Commits

Reviewing files that changed from the base of the PR and between c56f142 and edd1961.

📒 Files selected for processing (27)
  • api/aicr/v1/server.yaml
  • docs/README.md
  • docs/contributor/api-server.md
  • docs/contributor/cli.md
  • docs/contributor/validations.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/cli/recipe.go
  • pkg/cli/snapshot.go
  • pkg/collector/factory.go
  • pkg/collector/factory_test.go
  • pkg/collector/talos/doc.go
  • pkg/collector/talos/os.go
  • pkg/collector/talos/os_test.go
  • pkg/collector/talos/service.go
  • pkg/collector/talos/service_test.go
  • pkg/collector/talos/talos.go
  • pkg/k8s/agent/job.go
  • pkg/k8s/agent/job_test.go
  • pkg/k8s/agent/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • pkg/snapshotter/agent.go
  • pkg/snapshotter/snapshot.go
  • pkg/snapshotter/snapshot_test.go
  • site/docs/getting-started/index.md

Comment thread pkg/collector/talos/os_test.go
Comment thread pkg/collector/talos/os.go
Comment thread pkg/collector/talos/service.go
Comment on lines +101 to +107
node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
if err != nil {
slog.Warn("failed to fetch node from Kubernetes API",
slog.String("node", nodeName),
slog.String("error", err.Error()))
return nil
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add explicit collector timeout around Node GET.

fetchNode performs Kubernetes API I/O using the incoming context directly. If upstream provides no deadline, this call can block longer than collector SLAs.

⏱️ Suggested fix pattern
 import (
 	"context"
 	"log/slog"

 	"github.com/NVIDIA/aicr/pkg/collector/k8s"
+	"github.com/NVIDIA/aicr/pkg/defaults"
 	"github.com/NVIDIA/aicr/pkg/errors"
 	k8sclient "github.com/NVIDIA/aicr/pkg/k8s/client"
@@
-	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
+	getCtx, cancel := context.WithTimeout(ctx, defaults.CollectorTimeout)
+	defer cancel()
+	node, err := cs.CoreV1().Nodes().Get(getCtx, nodeName, metav1.GetOptions{})

As per coding guidelines, "All I/O operations and HTTP handlers must use context with explicit timeouts via context.WithTimeout(). Collectors: 10s timeout."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/collector/talos/talos.go` around lines 101 - 107, The fetchNode call uses
the incoming ctx directly for cs.CoreV1().Nodes().Get which can block without a
deadline; wrap the API call in a context.WithTimeout of 10s (collector timeout),
e.g., create ctx, cancel := context.WithTimeout(ctx, 10*time.Second) and defer
cancel(), then pass that child ctx into cs.CoreV1().Nodes().Get (referencing
fetchNode and cs.CoreV1().Nodes().Get) and handle/propagate the timeout error
appropriately instead of using the original ctx.

Comment thread pkg/k8s/agent/job_test.go Outdated
Comment thread pkg/snapshotter/snapshot.go Outdated
@github-actions

github-actions Bot commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

Coverage Report ✅

Metric Value
Coverage 75.1%
Threshold 70%
Status Pass
Coverage Badge
![Coverage](https://img.shields.io/badge/coverage-75.1%25-green)

Merging this branch changes the coverage (1 decrease, 6 increase)

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/aicr/pkg/cli 51.92% (-0.32%) 👎
github.com/NVIDIA/aicr/pkg/collector 91.30% (+6.69%) 👍
github.com/NVIDIA/aicr/pkg/collector/talos 92.37% (+92.37%) 🌟
github.com/NVIDIA/aicr/pkg/k8s/agent 77.82% (+0.34%) 👍
github.com/NVIDIA/aicr/pkg/recipe 90.51% (+0.01%) 👍
github.com/NVIDIA/aicr/pkg/recipe/oskind 100.00% (+100.00%) 🌟
github.com/NVIDIA/aicr/pkg/snapshotter 52.16% (+1.03%) 👍

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/aicr/pkg/cli/recipe.go 76.54% (ø) 81 62 19
github.com/NVIDIA/aicr/pkg/cli/snapshot.go 4.00% (-0.55%) 50 (+6) 2 48 (+6) 👎
github.com/NVIDIA/aicr/pkg/collector/factory.go 91.30% (+6.69%) 23 (+10) 21 (+10) 2 👍
github.com/NVIDIA/aicr/pkg/collector/talos/doc.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/pkg/collector/talos/os.go 100.00% (+100.00%) 51 (+51) 51 (+51) 0 🌟
github.com/NVIDIA/aicr/pkg/collector/talos/service.go 93.55% (+93.55%) 31 (+31) 29 (+29) 2 (+2) 🌟
github.com/NVIDIA/aicr/pkg/collector/talos/talos.go 80.56% (+80.56%) 36 (+36) 29 (+29) 7 (+7) 🌟
github.com/NVIDIA/aicr/pkg/k8s/agent/job.go 88.41% (+0.71%) 69 (+4) 61 (+4) 8 👍
github.com/NVIDIA/aicr/pkg/k8s/agent/types.go 100.00% (ø) 1 1 0
github.com/NVIDIA/aicr/pkg/recipe/criteria.go 90.48% (+0.03%) 294 (+1) 266 (+1) 28 👍
github.com/NVIDIA/aicr/pkg/recipe/doc.go 0.00% (ø) 0 0 0
github.com/NVIDIA/aicr/pkg/recipe/oskind/oskind.go 100.00% (+100.00%) 4 (+4) 4 (+4) 0 🌟
github.com/NVIDIA/aicr/pkg/snapshotter/agent.go 39.22% (ø) 153 60 93
github.com/NVIDIA/aicr/pkg/snapshotter/snapshot.go 76.92% (+0.11%) 78 (+9) 60 (+7) 18 (+2) 👍

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid first phase of Talos enablement. Scope is appropriately tight (Kubernetes-API stub, no MPL deps, OS gating via existing criterion), the package layout is clearly templated for future OS-specific backends, and test coverage is real (91.7% on the new package, table-driven, graceful-degradation cases included). The three-phase escalation path documented in pkg/collector/talos/doc.go is exactly the right framing.

Six findings, none blocking:

  • Doc gap (medium): the new aicr snapshot --os flag isn't in docs/user/cli-reference.md's snapshot flags table — AGENTS.md requires this in the same PR.
  • Duplicate Node Get (medium): OSCollector and ServiceCollector each fetch the same Node; cheap today, but scales linearly as more Talos collectors are added under this template.
  • Inconsistent ActiveState semantics (medium): malformed ContainerRuntimeVersion still stamps ActiveState=active with empty Version — locked in by tests; suggests either reporting unknown or omitting the empty Version.
  • Constant drift risk (low/medium): OSTalos is defined in three packages with comments saying "must stay in sync"; trivial guard test would prevent silent drift.
  • parseOSImage non-Talos misbehavior (low): tests document ("Red Hat Enterprise Linux 9.4 (Plow)", "red", "Plow"). Production never hits this code path (the collector only runs for os=talos), but the assertions read like locked-in bugs.
  • parseOSEnv silent fallback (low): invalid AICR_OS values silently degrade to systemd default with no log; mirror parseMaxNodesPerEntryEnv's slog.Warn.

Three GPU CI checks (nvkind H100/L40G) and tests/E2E are still in progress at review time — none of the failures so far are related. CI status looks healthy.

Comment thread pkg/collector/talos/talos.go
Comment thread pkg/collector/talos/service.go
Comment thread pkg/collector/factory.go Outdated
Comment thread pkg/collector/talos/os.go
Comment thread pkg/snapshotter/snapshot.go Outdated
Comment thread docs/user/cli-reference.md
Talos has no systemd and no host /etc/os-release accessible to
unprivileged pods, which blocks AICR's snapshot agent from deploying on
Talos clusters (the documented blocker in #565). Adds an OS-criteria-
selected backend that derives the same measurement schema from the
Kubernetes Node object so existing recipe constraints continue to work
unchanged across operating systems.

Changes:

- New CriteriaOSTalos value in pkg/recipe/criteria.go plus full enum
  audit (api/aicr/v1/server.yaml, docs/, site/docs/, doc.go).
- New pkg/collector/talos package, organized so each new OS-specific
  collector follows the same template:
    talos.go    package entry: Option, config, fetchNode helper
    service.go  ServiceCollector emitting TypeSystemD with
                containerd.service + kubelet.service from
                Node.Status.NodeInfo (containerRuntimeVersion,
                kubeletVersion)
    os.go       OSCollector emitting TypeOS with
                  - release subtype from NodeInfo.{OSImage,
                    KernelVersion, OperatingSystem}
                  - extensions subtype from extensions.talos.dev/*
                    Node labels (e.g., nvidia-container-toolkit,
                    nvidia-open-gpu-kernel-modules versions)
- pkg/collector/factory.go branches CreateSystemDCollector and
  CreateOSCollector on OS=talos via WithOS option. OSTalos string
  literal kept in sync with recipe.CriteriaOSTalos to avoid the
  recipe -> snapshotter -> collector import cycle.
- pkg/k8s/agent/job.go gates the /run/systemd and /etc/os-release
  hostPath mounts on OS!=talos and propagates AICR_OS to the in-pod
  factory so it picks the same backend the controller-side picked.
- pkg/snapshotter plumbs OS through AgentConfig and reads AICR_OS in
  measure() to rebuild the factory in-pod. New --os flag on
  aicr snapshot, validated through recipe.ParseCriteriaOSType.

Stub backend uses only the Kubernetes API surface that pkg/collector/k8s
already touches; no MPL-2.0 dependencies are introduced. The package
docs and code comments lay out the future expansion path: K8s-API stub
(this PR) -> gRPC reflection against machined -> vendor
siderolabs/talos/pkg/machinery/client, which would introduce the first
MPL-2.0 dependency into this codebase. Step up only when a constraint
genuinely needs data the previous phase cannot satisfy.

GPU collection on Talos requires no AICR changes: the GPU collector
relies on NFD sysfs/PCI enumeration and nvidia-smi via the NVIDIA
Container Runtime hook, neither of which involves nsenter, chroot, or
host-fs mounts. Talos clusters with the nvidia-container-toolkit and
nvidia-open-gpu-kernel-modules extensions installed (now visible via the
extensions subtype) work as-is with --runtime-class nvidia.

Tests: pkg/collector/talos coverage 91.7%; full test suite 55/55 green
in -short mode; golangci-lint -c .golangci.yaml ./... 0 issues.

Refs: #565
@ayuskauskas ayuskauskas force-pushed the feat/talos_collector branch from edd1961 to 3065a31 Compare April 29, 2026 18:53

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pkg/snapshotter/snapshot_test.go (1)

17-25: ⚠️ Potential issue | 🟡 Minor

Make the “unset” test case truly unset AICR_OS.

Right now the "unset" case sets AICR_OS to "" via t.Setenv, which is “set but empty”, not “absent”. If parseOSEnv() uses os.LookupEnv (or otherwise branches on presence), this test may not cover the intended branch.

✅ Suggested test adjustment
 import (
 	"context"
 	"fmt"
+	"os"
 	"testing"

 	"github.com/NVIDIA/aicr/pkg/collector"
 	"github.com/NVIDIA/aicr/pkg/header"
 	"github.com/NVIDIA/aicr/pkg/measurement"
 )

 func TestParseOSEnv(t *testing.T) {
 	tests := []struct {
 		name string
-		env  string
+		env  string
+		unset bool
 		want string
 	}{
-		{"unset", "", ""},
-		{"talos", "talos", "talos"},
-		{"ubuntu passthrough", "ubuntu", "ubuntu"},
+		{"unset", "", true, ""},
+		{"talos", "talos", false, "talos"},
+		{"ubuntu passthrough", "ubuntu", false, "ubuntu"},
 	}

 	for _, tt := range tests {
 		t.Run(tt.name, func(t *testing.T) {
-			t.Setenv("AICR_OS", tt.env)
+			if tt.unset {
+				prev, ok := os.LookupEnv("AICR_OS")
+				_ = os.Unsetenv("AICR_OS")
+				t.Cleanup(func() {
+					if ok {
+						_ = os.Setenv("AICR_OS", prev)
+					} else {
+						_ = os.Unsetenv("AICR_OS")
+					}
+				})
+			} else {
+				t.Setenv("AICR_OS", tt.env)
+			}

 			if got := parseOSEnv(); got != tt.want {
 				t.Errorf("parseOSEnv() = %q, want %q", got, tt.want)
 			}
 		})
 	}
 }

Also applies to: 320-338

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/snapshotter/snapshot_test.go` around lines 17 - 25, The "unset" test case
currently calls t.Setenv("AICR_OS", "") which leaves the variable present but
empty; change the test to remove the environment variable so it is truly absent
(use os.Unsetenv("AICR_OS") or call os.Unsetenv inside the test setup and
arrange to restore it via t.Cleanup) so parseOSEnv() exercises the "not present"
branch; update the same pattern in the other occurrence around the block that
covers lines 320-338 to ensure both tests really unset AICR_OS.
♻️ Duplicate comments (9)
pkg/recipe/doc.go (1)

31-141: ⚠️ Potential issue | 🟡 Minor

Fix stale godoc: GET query-parameter os list must include amazonlinux and talos.

Criteria.OS comment includes talos, but the GET query parameter section still documents os as ubuntu, cos, rhel, any—missing amazonlinux and talos.

✅ Suggested godoc diff
-//   - os: ubuntu, cos, rhel, any (default: any)
+//   - os: ubuntu, rhel, cos, amazonlinux, talos, any (default: any)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/recipe/doc.go` around lines 31 - 141, The GET query-parameter
documentation in pkg/recipe/doc.go is stale: the "os" list omits amazonlinux and
talos even though Criteria.OS (e.g., constants like CriteriaOSAmazonLinux,
CriteriaOSTalos or related CriteriaOS* types) supports them; update the "os"
line under "Query Parameters (HTTP API - GET)" to include amazonlinux and talos
(so it reads e.g. "ubuntu, cos, rhel, amazonlinux, talos, any") and ensure any
alias/description for the "os" parameter matches the Criteria.OS constants and
examples elsewhere in the file.
docs/user/cli-reference.md (1)

74-94: ⚠️ Potential issue | 🟠 Major

Add missing aicr snapshot --os flag row to the CLI reference.

The docs’ aicr snapshot flags table (lines ~74–94) does not list the new --os flag, even though recipe query mode does. This creates a user-visible mismatch between CLI behavior and documentation.

✅ Suggested doc update (table row insertion)
 | `--toleration` | | string[] | all taints | Tolerations for agent scheduling (key=value:effect, repeatable). **Default: all taints tolerated** (uses `operator: Exists`). Only specify to restrict which taints are tolerated. |
+| `--os` | | string | | Node OS family to use for collectors (e.g. ubuntu, rhel, cos, amazonlinux, talos) |
 | `--timeout` | | duration | 5m | Timeout for agent Job completion |
 | `--no-cleanup` | | bool | false | Skip removal of Job and RBAC resources on completion. **Warning:** leaves a cluster-admin ClusterRoleBinding active. |

Please confirm the exact --os flag description/default from the Go CLI flag definition so the docs mirror the usage text precisely.

#!/bin/bash
# Description: Find the Go definition of the `aicr snapshot --os` flag so we can mirror its usage/default in docs.

set -euo pipefail

# 1) Locate snapshot command source file(s)
SNAPSHOT_FILES=$(fd -a --type f "snapshot.go" pkg/cli pkg || true)
echo "Found snapshot.go files:"
echo "$SNAPSHOT_FILES"

# 2) Search for `os` flag definition within those files
for f in $SNAPSHOT_FILES; do
  echo "---- Searching $f for --os flag definition ----"
  rg -n --fixed-string 'Name:     "os"' "$f" || true
  rg -n --fixed-string 'Name:    "os"' "$f" || true
  rg -n -- '--os' "$f" || true
  rg -n -- "AICR_OS" "$f" || true
  rg -n -- "StringFlag\\{" "$f" | rg -n "os" || true
done
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/user/cli-reference.md` around lines 74 - 94, The docs are missing the
new aicr snapshot --os flag; locate the Go flag definition in the snapshot
command (search for snapshot.go and the StringFlag/Flag with Name: "os" in the
pkg/cli package or related files), copy the exact usage text and default value
for the --os flag from that Go definition, and add a new row to the Flags table
describing `--os` for the `aicr snapshot` command (include
short/Type/Default/Description exactly as in the Go CLI). Ensure the table row
references the `--os` flag name and mirrors the Go usage/default precisely so
docs and CLI output match.
pkg/snapshotter/snapshot.go (1)

85-100: ⚠️ Potential issue | 🟠 Major

Normalize and validate AICR_OS before factory selection.

parseOSEnv() currently forwards raw env input, so casing/whitespace/typos can silently fall back to non-Talos collectors and make failures hard to diagnose.

♻️ Proposed hardening diff
 import (
 	"context"
 	"log/slog"
 	"os"
 	"strconv"
+	"strings"
 	"sync"
 	"time"
@@
 	"github.com/NVIDIA/aicr/pkg/measurement"
+	"github.com/NVIDIA/aicr/pkg/recipe"
 	"github.com/NVIDIA/aicr/pkg/serializer"
 )
@@
 func parseOSEnv() string {
-	return os.Getenv("AICR_OS")
+	val := strings.ToLower(strings.TrimSpace(os.Getenv("AICR_OS")))
+	if val == "" {
+		return ""
+	}
+	if _, err := recipe.ParseCriteriaOSType(val); err != nil {
+		slog.Warn("invalid AICR_OS value, ignoring", slog.String("value", val))
+		return ""
+	}
+	return val
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/snapshotter/snapshot.go` around lines 85 - 100, Normalize and validate
AICR_OS before using it to select collectors: change parseOSEnv() to return a
trimmed, lowercased value (use strings.TrimSpace + strings.ToLower) and in
NodeSnapshotter.measure() validate that the normalized osVal is one of the
supported OS identifiers (e.g. "talos" or any values accepted by
collector.WithOS); if osVal is empty continue with default behavior, but if it's
present and not in the allowed set return an error (or log a clear warning and
fail fast) instead of silently passing the raw value to collector.WithOS,
referencing parseOSEnv, NodeSnapshotter.measure, and collector.WithOS to locate
the change points.
pkg/collector/talos/service.go (1)

100-104: ⚠️ Potential issue | 🟠 Major

Do not mark containerd.service active for non-containerd or unparseable runtime IDs.

At Line 101, containerd.service becomes active for any non-empty runtime string (e.g., cri-o://... or malformed values), which is an incorrect service-state signal.

🔧 Suggested fix
 	runtimeName, runtimeVersion := splitRuntimeID(info.ContainerRuntimeVersion)
-	b.SetString(keyActiveState, "active").
-		SetString(keyRuntimeName, runtimeName).
-		SetString(keyVersion, runtimeVersion)
+	b.SetString(keyRuntimeName, runtimeName)
+	if runtimeName != "containerd" || runtimeVersion == "" {
+		b.SetString(keyActiveState, "unknown")
+		return b.Build()
+	}
+	b.SetString(keyActiveState, "active").
+		SetString(keyVersion, runtimeVersion)

Please also update pkg/collector/talos/service_test.go expectations for the malformed runtime case accordingly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/collector/talos/service.go` around lines 100 - 104, The code currently
sets keyActiveState to "active" for any non-empty runtime string; change the
logic around splitRuntimeID(info.ContainerRuntimeVersion) so you only
SetString(keyActiveState, "active") when runtimeName == "containerd" and
runtimeVersion is non-empty/parseable, otherwise do not mark containerd active
(set inactive or omit the active state consistently). Update uses of
keyRuntimeName and keyVersion to still record parsed values only when parse
succeeded, and adjust pkg/collector/talos/service_test.go expectations for the
malformed runtime case to assert containerd is not active (and that
runtime/version are empty or reflect the parsing outcome).
pkg/k8s/agent/job_test.go (1)

119-128: ⚠️ Potential issue | 🟡 Minor

Assert AICR_OS presence/absence, not only value.

At Line 120, gotOSEnv == "" passes both when the variable is absent and when it is present-but-empty, so the empty-OS case is not strictly validated.

🔧 Suggested fix
-			gotOSEnv := ""
+			gotOSEnv := ""
+			foundOSEnv := false
 			for _, e := range spec.Containers[0].Env {
 				if e.Name == "AICR_OS" {
 					gotOSEnv = e.Value
+					foundOSEnv = true
+					break
 				}
 			}
+			if tt.wantAICROSEnv == "" && foundOSEnv {
+				t.Errorf("AICR_OS env should be absent for OS=%q", tt.os)
+			}
+			if tt.wantAICROSEnv != "" && !foundOSEnv {
+				t.Errorf("AICR_OS env should be present for OS=%q", tt.os)
+			}
 			if gotOSEnv != tt.wantAICROSEnv {
 				t.Errorf("AICR_OS env = %q, want %q", gotOSEnv, tt.wantAICROSEnv)
 			}

As per coding guidelines, "Explicitly test error conditions and edge cases in test files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/k8s/agent/job_test.go` around lines 119 - 128, The test currently only
checks the value of AICR_OS via gotOSEnv which conflates "absent" and
"present-but-empty"; change the loop over spec.Containers[0].Env to record both
a boolean foundOSEnv and the value (e.g., foundOSEnv := false; gotOSEnv := ""
and set foundOSEnv = true when e.Name == "AICR_OS"), then replace the single
value check with two assertions: assert presence/absence (foundOSEnv should
equal (tt.wantAICROSEnv != "")) and, only if foundOSEnv is true, assert gotOSEnv
== tt.wantAICROSEnv so you explicitly test existence and value separately.
pkg/collector/factory.go (1)

26-30: 🧹 Nitpick | 🔵 Trivial

Add a guard test to detect Talos OS constant drift across packages.

The "must stay in sync" comment is good, but enforcement is manual. Add a small invariant test in a package that can import all three constants (recipe, collector, agent) so drift fails fast in CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/collector/factory.go` around lines 26 - 30, Add a unit test that asserts
the Talos OS constant is identical across packages to catch drift: create a new
test (e.g., invariants_test) that imports the three symbols OSTalos from
pkg/collector (const OSTalos), CriteriaOSTalos from pkg/recipe, and whatever the
Talos constant is in pkg/agent, and assert they are equal (using testing.Fatal
or require/xtest equivalent) so CI fails on mismatch; name the test clearly
(e.g., TestTalosOSConstantInvariant) and keep it lightweight with no external
dependencies.
pkg/collector/talos/os_test.go (1)

163-164: ⚠️ Potential issue | 🟡 Minor

TestOSCollector_GracefulDegradation bypasses the env-resolution path it prepares.

At Line 163, WithNodeName("missing-node") short-circuits NODE_NAME/KUBERNETES_NODE_NAME/HOSTNAME fallback, so this test does not cover the empty-env path.

Proposed fix
-	c := NewOSCollector(WithClientSet(fake.NewSimpleClientset()), WithNodeName("missing-node"))
+	c := NewOSCollector(WithClientSet(fake.NewSimpleClientset()))

As per coding guidelines, "Explicitly test error conditions and edge cases in test files."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/collector/talos/os_test.go` around lines 163 - 164, The test is bypassing
the environment-resolution path by passing WithNodeName("missing-node") so it
never exercises NODE_NAME/KUBERNETES_NODE_NAME/HOSTNAME fallbacks; remove the
WithNodeName(...) option when constructing NewOSCollector (leave
NewOSCollector(WithClientSet(fake.NewSimpleClientset()))), ensure the
environment variables NODE_NAME, KUBERNETES_NODE_NAME and HOSTNAME are unset or
cleared in the test setup, then call c.Collect(ctx) to verify graceful
degradation; reference NewOSCollector, WithClientSet, WithNodeName and Collect
when making the change.
pkg/collector/talos/talos.go (1)

101-101: ⚠️ Potential issue | 🟠 Major

Enforce a 10s timeout at the fetchNode I/O boundary.

fetchNode issues Kubernetes API I/O directly on the incoming ctx at Line 101; this helper should guarantee its own timeout so it cannot block indefinitely when reused without a deadline.

Proposed fix
 import (
 	"context"
 	"log/slog"
 
 	"github.com/NVIDIA/aicr/pkg/collector/k8s"
+	"github.com/NVIDIA/aicr/pkg/defaults"
 	"github.com/NVIDIA/aicr/pkg/errors"
 	k8sclient "github.com/NVIDIA/aicr/pkg/k8s/client"
 	corev1 "k8s.io/api/core/v1"
@@
-	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
+	getCtx, cancel := context.WithTimeout(ctx, defaults.CollectorTimeout)
+	defer cancel()
+	node, err := cs.CoreV1().Nodes().Get(getCtx, nodeName, metav1.GetOptions{})

As per coding guidelines, "All I/O operations and HTTP handlers must use context with explicit timeouts via context.WithTimeout(). Collectors: 10s timeout."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/collector/talos/talos.go` at line 101, The fetchNode helper currently
calls cs.CoreV1().Nodes().Get using the incoming ctx and must enforce a 10s I/O
timeout; wrap the call in a context.WithTimeout(ctx, 10*time.Second), defer
cancel(), and use that timed context when calling cs.CoreV1().Nodes().Get (the
node retrieval in fetchNode) so the function cannot block indefinitely when
invoked without a deadline; ensure you import time if not present and always
call cancel to avoid leaks.
pkg/collector/talos/os.go (1)

134-156: ⚠️ Potential issue | 🟡 Minor

parseOSImage emits misleading ID/VERSION_ID for non‑Talos OSImage formats.

At Line 146+ this parser converts multi-word distro names into invalid structured fields (e.g., RHEL-like strings become ID=red, VERSION_ID=Plow). Prefer Talos-only normalization and leave ID/VERSION_ID unset when format is not confidently mapped.

Proposed fix
 func parseOSImage(image string) (id, version string) {
 	if image == "" {
 		return "", ""
 	}
 	parts := strings.SplitN(image, " (", 2)
-	id = strings.ToLower(strings.TrimSpace(parts[0]))
-	// Strip secondary words from the ID for distros that omit the
-	// parenthesis form (e.g., "Ubuntu 22.04.5 LTS" → "ubuntu").
-	if spaceIdx := strings.Index(id, " "); spaceIdx >= 0 {
-		id = id[:spaceIdx]
-	}
-	if len(parts) == 2 {
+	base := strings.TrimSpace(parts[0])
+	if !strings.EqualFold(base, "talos") {
+		return "", ""
+	}
+	id = "talos"
+	if len(parts) == 2 {
 		v := strings.TrimSuffix(strings.TrimSpace(parts[1]), ")")
 		v = strings.TrimPrefix(v, "v")
 		version = v
 	}
 	return id, version
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/collector/talos/os.go` around lines 134 - 156, parseOSImage currently
force-parses multi-word distro strings into invalid ID/VERSION_ID; change it to
only yield values for confident Talos-style patterns: in parseOSImage check for
a parenthesized version suffix (parts := strings.SplitN(image, " (", 2)) and
only proceed if len(parts) == 2 AND the left-hand part contains no spaces OR the
left-hand part equals "Talos" (case-insensitive) — otherwise return empty
strings; when proceeding, keep the existing lowercasing/trim and version
stripping logic (v prefix trim) so only true Talos-style inputs produce ID and
VERSION_ID.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/k8s/agent/types.go`:
- Around line 43-47: Add a unit test that asserts the OSTalos constant in
pkg/k8s/agent (symbol OSTalos) matches the equivalent constants in the other
packages (e.g., collector.OSTalos and recipe.CriteriaOSTalos) to prevent drift;
implement a small cross-package test (in the appropriate _test.go) that imports
those packages and uses simple equality assertions (t.Fatalf or t.Errorf) if any
values differ so CI fails when they get out of sync.

---

Outside diff comments:
In `@pkg/snapshotter/snapshot_test.go`:
- Around line 17-25: The "unset" test case currently calls t.Setenv("AICR_OS",
"") which leaves the variable present but empty; change the test to remove the
environment variable so it is truly absent (use os.Unsetenv("AICR_OS") or call
os.Unsetenv inside the test setup and arrange to restore it via t.Cleanup) so
parseOSEnv() exercises the "not present" branch; update the same pattern in the
other occurrence around the block that covers lines 320-338 to ensure both tests
really unset AICR_OS.

---

Duplicate comments:
In `@docs/user/cli-reference.md`:
- Around line 74-94: The docs are missing the new aicr snapshot --os flag;
locate the Go flag definition in the snapshot command (search for snapshot.go
and the StringFlag/Flag with Name: "os" in the pkg/cli package or related
files), copy the exact usage text and default value for the --os flag from that
Go definition, and add a new row to the Flags table describing `--os` for the
`aicr snapshot` command (include short/Type/Default/Description exactly as in
the Go CLI). Ensure the table row references the `--os` flag name and mirrors
the Go usage/default precisely so docs and CLI output match.

In `@pkg/collector/factory.go`:
- Around line 26-30: Add a unit test that asserts the Talos OS constant is
identical across packages to catch drift: create a new test (e.g.,
invariants_test) that imports the three symbols OSTalos from pkg/collector
(const OSTalos), CriteriaOSTalos from pkg/recipe, and whatever the Talos
constant is in pkg/agent, and assert they are equal (using testing.Fatal or
require/xtest equivalent) so CI fails on mismatch; name the test clearly (e.g.,
TestTalosOSConstantInvariant) and keep it lightweight with no external
dependencies.

In `@pkg/collector/talos/os_test.go`:
- Around line 163-164: The test is bypassing the environment-resolution path by
passing WithNodeName("missing-node") so it never exercises
NODE_NAME/KUBERNETES_NODE_NAME/HOSTNAME fallbacks; remove the WithNodeName(...)
option when constructing NewOSCollector (leave
NewOSCollector(WithClientSet(fake.NewSimpleClientset()))), ensure the
environment variables NODE_NAME, KUBERNETES_NODE_NAME and HOSTNAME are unset or
cleared in the test setup, then call c.Collect(ctx) to verify graceful
degradation; reference NewOSCollector, WithClientSet, WithNodeName and Collect
when making the change.

In `@pkg/collector/talos/os.go`:
- Around line 134-156: parseOSImage currently force-parses multi-word distro
strings into invalid ID/VERSION_ID; change it to only yield values for confident
Talos-style patterns: in parseOSImage check for a parenthesized version suffix
(parts := strings.SplitN(image, " (", 2)) and only proceed if len(parts) == 2
AND the left-hand part contains no spaces OR the left-hand part equals "Talos"
(case-insensitive) — otherwise return empty strings; when proceeding, keep the
existing lowercasing/trim and version stripping logic (v prefix trim) so only
true Talos-style inputs produce ID and VERSION_ID.

In `@pkg/collector/talos/service.go`:
- Around line 100-104: The code currently sets keyActiveState to "active" for
any non-empty runtime string; change the logic around
splitRuntimeID(info.ContainerRuntimeVersion) so you only
SetString(keyActiveState, "active") when runtimeName == "containerd" and
runtimeVersion is non-empty/parseable, otherwise do not mark containerd active
(set inactive or omit the active state consistently). Update uses of
keyRuntimeName and keyVersion to still record parsed values only when parse
succeeded, and adjust pkg/collector/talos/service_test.go expectations for the
malformed runtime case to assert containerd is not active (and that
runtime/version are empty or reflect the parsing outcome).

In `@pkg/collector/talos/talos.go`:
- Line 101: The fetchNode helper currently calls cs.CoreV1().Nodes().Get using
the incoming ctx and must enforce a 10s I/O timeout; wrap the call in a
context.WithTimeout(ctx, 10*time.Second), defer cancel(), and use that timed
context when calling cs.CoreV1().Nodes().Get (the node retrieval in fetchNode)
so the function cannot block indefinitely when invoked without a deadline;
ensure you import time if not present and always call cancel to avoid leaks.

In `@pkg/k8s/agent/job_test.go`:
- Around line 119-128: The test currently only checks the value of AICR_OS via
gotOSEnv which conflates "absent" and "present-but-empty"; change the loop over
spec.Containers[0].Env to record both a boolean foundOSEnv and the value (e.g.,
foundOSEnv := false; gotOSEnv := "" and set foundOSEnv = true when e.Name ==
"AICR_OS"), then replace the single value check with two assertions: assert
presence/absence (foundOSEnv should equal (tt.wantAICROSEnv != "")) and, only if
foundOSEnv is true, assert gotOSEnv == tt.wantAICROSEnv so you explicitly test
existence and value separately.

In `@pkg/recipe/doc.go`:
- Around line 31-141: The GET query-parameter documentation in pkg/recipe/doc.go
is stale: the "os" list omits amazonlinux and talos even though Criteria.OS
(e.g., constants like CriteriaOSAmazonLinux, CriteriaOSTalos or related
CriteriaOS* types) supports them; update the "os" line under "Query Parameters
(HTTP API - GET)" to include amazonlinux and talos (so it reads e.g. "ubuntu,
cos, rhel, amazonlinux, talos, any") and ensure any alias/description for the
"os" parameter matches the Criteria.OS constants and examples elsewhere in the
file.

In `@pkg/snapshotter/snapshot.go`:
- Around line 85-100: Normalize and validate AICR_OS before using it to select
collectors: change parseOSEnv() to return a trimmed, lowercased value (use
strings.TrimSpace + strings.ToLower) and in NodeSnapshotter.measure() validate
that the normalized osVal is one of the supported OS identifiers (e.g. "talos"
or any values accepted by collector.WithOS); if osVal is empty continue with
default behavior, but if it's present and not in the allowed set return an error
(or log a clear warning and fail fast) instead of silently passing the raw value
to collector.WithOS, referencing parseOSEnv, NodeSnapshotter.measure, and
collector.WithOS to locate the change points.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 1f7b28e3-dac2-4dc1-bc72-5d0041e4d439

📥 Commits

Reviewing files that changed from the base of the PR and between edd1961 and 3065a31.

📒 Files selected for processing (27)
  • api/aicr/v1/server.yaml
  • docs/README.md
  • docs/contributor/api-server.md
  • docs/contributor/cli.md
  • docs/contributor/validations.md
  • docs/user/api-reference.md
  • docs/user/cli-reference.md
  • pkg/cli/recipe.go
  • pkg/cli/snapshot.go
  • pkg/collector/factory.go
  • pkg/collector/factory_test.go
  • pkg/collector/talos/doc.go
  • pkg/collector/talos/os.go
  • pkg/collector/talos/os_test.go
  • pkg/collector/talos/service.go
  • pkg/collector/talos/service_test.go
  • pkg/collector/talos/talos.go
  • pkg/k8s/agent/job.go
  • pkg/k8s/agent/job_test.go
  • pkg/k8s/agent/types.go
  • pkg/recipe/criteria.go
  • pkg/recipe/criteria_test.go
  • pkg/recipe/doc.go
  • pkg/snapshotter/agent.go
  • pkg/snapshotter/snapshot.go
  • pkg/snapshotter/snapshot_test.go
  • site/docs/getting-started/index.md

Comment thread pkg/k8s/agent/types.go Outdated
Comment on lines +43 to +47
// OSTalos is the agent.Config.OS value that selects the Talos-aware pod
// configuration (no systemd hostPath mounts) and the Kubernetes-API service
// collector backend. Must stay in sync with collector.OSTalos and
// recipe.CriteriaOSTalos.
const OSTalos = "talos"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Add a small cross-package guard test for Talos constant sync.

You already document that this must stay aligned with other packages; a unit test assertion would prevent silent drift.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/k8s/agent/types.go` around lines 43 - 47, Add a unit test that asserts
the OSTalos constant in pkg/k8s/agent (symbol OSTalos) matches the equivalent
constants in the other packages (e.g., collector.OSTalos and
recipe.CriteriaOSTalos) to prevent drift; implement a small cross-package test
(in the appropriate _test.go) that imports those packages and uses simple
equality assertions (t.Fatalf or t.Errorf) if any values differ so CI fails when
they get out of sync.

Extracts the OS criterion string values into a leaf package
pkg/recipe/oskind so pkg/recipe, pkg/collector, and pkg/k8s/agent
all reference the same constant rather than each duplicating an
"OSTalos" const with a "must stay in sync" comment that nothing
enforces. The leaf package was the lowest-impact way to break the
existing pkg/recipe -> pkg/snapshotter -> pkg/collector import cycle
that originally drove the duplication.

- New pkg/recipe/oskind package with the OS string constants
  (Any, Ubuntu, RHEL, COS, AmazonLinux, Talos), an All() helper
  used by recipe.GetCriteriaOSTypes, and an IsKnown() validator
  that downstream callers (e.g., AICR_OS env validation in the
  snapshotter) can use without taking on pkg/recipe's full surface.
- pkg/recipe/criteria.go: CriteriaOS* typed constants now alias
  oskind values; GetCriteriaOSTypes delegates to oskind.All.
- pkg/collector/factory.go: removes the local OSTalos string
  literal and its sync-warning comment; branches on oskind.Talos.
- pkg/k8s/agent/types.go: removes the local OSTalos const and its
  sync-warning comment; pkg/k8s/agent/job.go branches on
  oskind.Talos.
- Tests updated to reference oskind.Talos.

Addresses PR #714 review feedback (constant drift) by eliminating
the drift surface entirely rather than adding a guard test for it.
The --os flag was added to pkg/cli/snapshot.go in the prior commit
but only the recipe-command flag table was updated. Adds the row to
the snapshot-command flag table mirroring the Go usage text.

Addresses PR #714 review feedback that AGENTS.md requires user-
visible CLI flags to be documented in the same PR.
ServiceCollector and OSCollector ran in parallel from the snapshotter
errgroup, each calling cfg.fetchNode independently — two Nodes().Get()
round-trips for the same node per snapshot. Adds a sync.Once cache to
config.fetchNode and a NewCollectors constructor that pairs the two
collectors against a single config; the factory now uses NewCollectors
when OS=talos so a snapshot does exactly one Node API fetch regardless
of how many Talos collectors are added under this template in future.

NewServiceCollector / NewOSCollector remain for tests that exercise a
single collector in isolation; only the factory needs the shared-config
form. Adds a regression test using a fake clientset reactor that asserts
exactly one Get is observed across both collectors.

Addresses PR #714 review feedback.
buildContainerdSubtype previously stamped ActiveState=active for any
non-empty ContainerRuntimeVersion, including cri-o:// URIs and
malformed strings without a "://" separator. A constraint comparing
SystemD.containerd.service.Version >= "1.7" against a cri-o node
would silently never match with no signal that the data was unparsed,
and tests locked that behavior in.

Tightens the active-state semantics to:
  empty                          -> unknown (no signal from kubelet)
  parseable as "containerd://X"  -> active  (containerd is the runtime)
  any other non-empty value      -> unknown (CRI-O, malformed, etc.)

RuntimeName is still recorded so downstream tooling can see what was
observed; only ActiveState/Version are gated. Updates the malformed-id
test to assert unknown and adds a new cri-o:// case for symmetry.

Addresses PR #714 review feedback.
parseOSImage previously force-parsed any input — multi-word distro
names became invalid structured fields like ID=red, VERSION_ID=Plow
for "Red Hat Enterprise Linux 9.4 (Plow)". The Talos OS collector
only runs when the OS criterion is talos, so non-Talos inputs are
off the production path; declining to parse them keeps the data
shape honest if anyone ever exercises this code with another
OSImage.

Now only Talos-format strings (e.g., "Talos (vX.Y.Z)") yield ID
and VERSION_ID. Everything else returns empty strings; the caller
in buildReleaseSubtype already skips emitting those keys when
empty, so PRETTY_NAME stays populated but ID/VERSION_ID stay
unset rather than misleading.

Updates TestParseOSImage table to reflect the new contract: drops
the locked-in ("red", "Plow") and ("ubuntu", "") cases, asserts
empty for non-Talos inputs. Adds an absence assertion in
TestOSCollector_PopulatesReleaseAndExtensions so a regression that
re-introduces invented IDs would fail loudly.

Addresses PR #714 review feedback.
parseOSEnv used to forward whatever was in AICR_OS verbatim. A typo
like AICR_OS=talsoo would silently bypass the in-pod factory's
oskind.Talos check, fall back to the systemd collector — which then
fails because the systemd hostPath mounts were already skipped on
the controller side based on the (validated) --os flag. Hard to
diagnose.

Now mirrors parseMaxNodesPerEntryEnv's pattern: trim+lowercase, run
through oskind.IsKnown, log slog.Warn on invalid input, and return
"" so the factory falls back cleanly to the default backend. The
controller-side CLI still validates first via
recipe.ParseCriteriaOSType — this is defense-in-depth at the
agent's process boundary in case AICR_OS is ever set outside the
documented path.

Test cases extended for normalization (case, whitespace) and
invalid values.

Addresses PR #714 review feedback.
The previous assertion `gotOSEnv == ""` passed for both "AICR_OS
absent from container env" and "AICR_OS present with empty value".
The agent intentionally omits AICR_OS entirely when OS is unset, so
the test should fail loudly if a regression starts emitting an
empty-valued env var.

Tracks both presence (foundOSEnv bool) and value, asserts each
separately. No production change.

Addresses PR #714 review feedback.
The Criteria.OS comment was updated for amazonlinux/talos in the
prior commit, but the GET query-parameter section in the same file
still listed only ubuntu/cos/rhel/any. Bring it in line with the
actual OS values declared in pkg/recipe/oskind.

Addresses PR #714 review feedback.
TestOSCollector_GracefulDegradation passed WithNodeName("missing-node"),
which short-circuited the NODE_NAME/KUBERNETES_NODE_NAME/HOSTNAME
fallback chain — the test only covered "explicit node name not found"
and never reached resolveNodeName's env path.

Drops the WithNodeName option so the cleared env vars actually flow
through resolveNodeName -> k8s.GetNodeName -> empty -> graceful empty
measurement. No production change.

Addresses PR #714 review feedback.
t.Setenv("AICR_OS", "") leaves the variable present-but-empty, not
absent. parseOSEnv reads via os.Getenv which collapses both to the
same empty result today, so the assertion passed — but the test
description "unset" did not match the test reality, and a future
switch to os.LookupEnv (to distinguish absent from empty) would
silently skip the absent branch.

Splits the case into "unset" (uses os.Unsetenv with t.Cleanup
restoration) and "set but empty" (the existing t.Setenv path) so
both branches are exercised explicitly. No production change.

Addresses PR #714 review feedback.
coderabbitai[bot]

This comment was marked as resolved.

mchmarny
mchmarny previously approved these changes Apr 29, 2026

@mchmarny mchmarny left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/LGTM

All 6 of my previous review comments have been resolved

@mchmarny mchmarny enabled auto-merge (squash) April 30, 2026 13:43
@mchmarny mchmarny merged commit af8def3 into main Apr 30, 2026
34 checks passed
@mchmarny mchmarny deleted the feat/talos_collector branch April 30, 2026 13:53
ayuskauskas added a commit that referenced this pull request May 8, 2026
Adds an opt-in, kind-equivalent harness for spinning up a local Talos
Linux cluster on a developer laptop and exercising the AICR snapshot
agent against it. Closes the integration-test gap left by #714 (Talos
collector) — the OS=talos pod-shape branch in pkg/k8s/agent/job.go was
unit-tested only.

The harness has two independent pieces stitched together by make:

* tools/talos-test/{up.sh,down.sh,README.md}: cluster lifecycle.
  Wraps 'talosctl cluster create docker' (newer talosctl moved
  provisioners to subcommands), threads a containerd registry-mirror
  config patch so 'localhost:5001/aicr:local' resolves from inside
  Talos nodes, and on macOS Docker Desktop / Podman Machine loads
  br_netfilter via a one-shot privileged sidecar so the default
  Flannel CNI works. Streams 'talosctl health' progress, derives the
  apid endpoint and K8s API server URL from the docker-NAT'd host
  ports, rewrites the kubeconfig to the host-routable form, and
  relaxes Pod Security Standards on the default namespace so the
  privileged agent pod can be scheduled. Documented prerequisites and
  install commands for talosctl in the README.

* tests/chainsaw/snapshot/deploy-agent-talos/: chainsaw test.
  Invokes 'aicr snapshot --os talos -o cm://default/aicr-e2e-snapshot'
  from a script step; the CLI exercises pkg/k8s/agent.Deployer
  end-to-end (RBAC creation, OS=talos pod shape, log streaming, Job
  completion, ConfigMap write, cleanup). Asserts only the surviving
  ConfigMap content. Uses chainsaw JMESPath filters
  ((sort([*].type)), (@[?type == 'X'] | [0].subtypes | sort(...)))
  so the assertion is order-independent at both the measurement-type
  and per-type subtype levels.

Make targets (Makefile):
  talos-dev-env        spin up cluster
  talos-dev-env-clean  destroy cluster
  talos-snapshot-test  build + run chainsaw against the live cluster
None are wired into 'make qualify'; Talos is opt-in for now.

CLI:
The agent container's resource requests and limits were hardcoded
(Privileged: 4Gi req / 8Gi lim memory; Restricted: 256Mi / 512Mi).
That sizing fits production GPU nodes but blocks scheduling on small
dev clusters such as the talosctl Docker provisioner default (2Gi
worker), where there is no node-level option to bypass. Adds:

  --requests 'cpu=500m,memory=1Gi,ephemeral-storage=1Gi'
  --limits   'cpu=1,memory=2Gi,ephemeral-storage=2Gi'

Each accepts a comma-separated 'name=quantity' list. Unspecified keys
fall back to the existing per-mode defaults so the production path is
unchanged. The nvidia.com/gpu limit injected by --require-gpu is
preserved on top of any override. Plumbed through
pkg/snapshotter.AgentConfig and pkg/k8s/agent.Config as
corev1.ResourceList; applyPrivilegedSettings and
applyRestrictedSettings merge overrides over the defaults via a new
mergeResourceList helper.

Tests:
  pkg/k8s/agent: TestApplyPrivilegedSettings_ResourceOverrides
                 (no override, partial override, RequireGPU + override),
                 TestApplyRestrictedSettings_ResourceOverrides,
                 TestMergeResourceList (input-immutability invariant).

Docs:
  docs/user/cli-reference.md  --requests / --limits flag table entries
  tools/talos-test/README.md  prerequisites, quickstart, customization,
                              troubleshooting

Components Affected:
  CLI (pkg/cli)
  Agent (pkg/k8s/agent, pkg/snapshotter)
  Tests/CI (tests/chainsaw/snapshot/deploy-agent-talos, tools/talos-test)
  Build (Makefile)
  Docs (docs/user/cli-reference.md)

Tested locally end-to-end on macOS Podman Machine (Talos v1.9.0 +
talosctl 1.13.0) — bring-up, agent deployment, snapshot assertion,
teardown all pass.
@ayuskauskas ayuskauskas mentioned this pull request May 13, 2026
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants