Snapshot Collectors
A collector captures one dimension of system state — Kubernetes API,
GPU hardware, OS release, systemd services, node topology — and emits a
single *measurement.Measurement. Collectors run during aicr snapshot
on a workstation, or inside the in-cluster snapshot agent Job.
The orchestrator (pkg/snapshotter) fans collectors out in parallel
under errgroup.WithContext; the result is a flat []*Measurement
inside the resolved snapshot artifact.
The boundary is hard: collectors are read-only. They observe state;
they never Create, Update, Delete, Apply, Patch, exec into
pods, or mutate the host. Anything that mutates is a validator (see
/aicr/contributor-guide/validators), not a collector.
This page is for contributors adding a new collector. End-user snapshot semantics live in docs/user/cli-reference.md.
Where Collectors Live
All collectors live under
pkg/collector/<kind>/.
Each subdirectory is one collector; one collector emits one
measurement.Type.
The mapping from collector to measurement.Type is one-to-one for
all collectors except Talos, which substitutes for systemd and os in
the factory when the OS criterion is talos.
Collector Interface
The interface is in
pkg/collector/types.go:
Two rules:
- Context-cancellable. Every Collect must honor ctx. Long loops check ctx.Done(). Outbound API calls take ctx directly.
- One Measurement out. Return *measurement.Measurement with Type set and Subtypes populated. Returning nil plus an error is fine on hard failure; returning a partial measurement with a logged warning is fine on graceful degradation. The GPU collector models this: when sysfs/PCI enumeration is unavailable, it emits a GPU measurement with no subtypes rather than failing.
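The two rules can be illustrated with a minimal sketch. It assumes the interface has a single Collect(ctx) method returning a measurement and an error, which is what the rules imply; Measurement, Subtype, deviceCollector, and runOnce are simplified stand-ins invented for this example, not the real pkg/collector or pkg/measurement definitions.

```go
package main

import (
	"context"
	"fmt"
	"log/slog"
)

// Subtype and Measurement are simplified stand-ins for the real
// pkg/measurement types; only the fields used here are modeled.
type Subtype struct {
	Name string
	Data map[string]string
}

type Measurement struct {
	Type     string
	Subtypes []Subtype
}

// Collector is the assumed shape of the interface: one context-aware
// method that returns a single measurement.
type Collector interface {
	Collect(ctx context.Context) (*Measurement, error)
}

// deviceCollector is a hypothetical collector used only for illustration.
type deviceCollector struct {
	devices     []string // what host enumeration would return
	unavailable bool     // simulates a missing enumeration source
}

func (c deviceCollector) Collect(ctx context.Context) (*Measurement, error) {
	// Hard failure: nil measurement plus an error.
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	m := &Measurement{Type: "Device"}
	if c.unavailable {
		// Graceful degradation: log a warning and return the typed
		// measurement with no subtypes.
		slog.Warn("device enumeration unavailable; emitting empty measurement")
		return m, nil
	}
	for i, d := range c.devices {
		// Long loops check ctx.Done() between iterations.
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		default:
		}
		m.Subtypes = append(m.Subtypes, Subtype{
			Name: fmt.Sprintf("device-%d", i),
			Data: map[string]string{"model": d},
		})
	}
	return m, nil
}

// runOnce is a small driver for the sketch: it calls Collect with
// either a live or an already-cancelled context.
func runOnce(c Collector, cancelled bool) (*Measurement, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelled {
		cancel()
	}
	return c.Collect(ctx)
}
```

Calling runOnce(deviceCollector{unavailable: true}, false) returns a measurement with Type set and no subtypes, which is the degradation shape described for the GPU collector.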
Registration via the Factory
Collectors are wired in
pkg/collector/factory.go.
Factory exposes one Create... method per collector kind; the
DefaultFactory constructs the production collector for each:
pkg/snapshotter calls these methods inside errgroup.WithContext —
it does not import collector subpackages directly. To add a new
collector kind, extend the Factory interface, add a constructor on
DefaultFactory, and add a g.Go(collectSafe(..., factory.CreateXxx()))
line in the snapshotter’s measure function.
There is no init()-based self-registration. Adding a collector is
explicit — both factory and snapshotter must reference it, which is the
trade-off for making the parallel fan-out static and trivially testable.
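A rough sketch of the explicit wiring follows. The method names CreateOSCollector and CreateGPUCollector, the staticCollector type, and the snapshotTypes driver are illustrative assumptions, and sync.WaitGroup is swapped in for errgroup.WithContext so the example stays within the standard library.

```go
package main

import (
	"context"
	"errors"
	"log/slog"
	"sync"
)

// Measurement is a stand-in for *measurement.Measurement.
type Measurement struct{ Type string }

type Collector interface {
	Collect(ctx context.Context) (*Measurement, error)
}

// Factory exposes one Create method per collector kind. The two kinds
// shown are illustrative, not the real method set.
type Factory interface {
	CreateOSCollector() Collector
	CreateGPUCollector() Collector
}

// staticCollector is a hypothetical collector returning a fixed result.
type staticCollector struct {
	typ string
	err error
}

func (s staticCollector) Collect(context.Context) (*Measurement, error) {
	if s.err != nil {
		return nil, s.err
	}
	return &Measurement{Type: s.typ}, nil
}

// DefaultFactory constructs the "production" collector for each kind.
type DefaultFactory struct{}

func (DefaultFactory) CreateOSCollector() Collector { return staticCollector{typ: "OS"} }

func (DefaultFactory) CreateGPUCollector() Collector {
	return staticCollector{err: errors.New("no GPU present")}
}

// measure fans the collectors out in parallel and gathers the results.
func measure(ctx context.Context, f Factory) []*Measurement {
	var (
		mu      sync.Mutex
		results []*Measurement
		wg      sync.WaitGroup
	)
	// collectSafe logs and swallows a collector's error so one failure
	// does not abort the snapshot.
	collectSafe := func(kind string, c Collector) func() error {
		return func() error {
			m, err := c.Collect(ctx)
			if err != nil {
				slog.Warn("collector failed", "kind", kind, "error", err)
				return nil
			}
			mu.Lock()
			defer mu.Unlock()
			results = append(results, m)
			return nil
		}
	}
	// Adding a collector kind means adding one line here.
	tasks := []func() error{
		collectSafe("os", f.CreateOSCollector()),
		collectSafe("gpu", f.CreateGPUCollector()),
	}
	for _, task := range tasks {
		wg.Add(1)
		go func(run func() error) {
			defer wg.Done()
			_ = run()
		}(task)
	}
	wg.Wait()
	return results
}

// snapshotTypes is a small driver returning the types that were collected.
func snapshotTypes(f Factory) []string {
	var types []string
	for _, m := range measure(context.Background(), f) {
		types = append(types, m.Type)
	}
	return types
}
```

Because the task list is static, a test can substitute a fake Factory and assert on exactly which collectors ran, which is the testability benefit the explicit wiring buys.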
Context and Timeouts
Every collector must bound its own execution. The pattern at the top
of Collect:
defaults.CollectorTimeout is 10s — the default for any
host-local collector. Two collectors override:
Use the parent deadline if it is sooner — the GPU collector shows the
pattern (time.Until(deadline) < timeout). Long-lived watches do not
belong in a collector: collectors are one-shot. If you need a
watch, you are writing a validator or a controller, not a collector.
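The deadline comparison can be sketched as below. The collectorTimeout constant mirrors the 10s default named above; effectiveTimeout, boundedCollect, and budgetWithParent are hypothetical helper names chosen for this example, not functions from the repository.

```go
package main

import (
	"context"
	"time"
)

// collectorTimeout mirrors the 10s default described above; the real
// constant lives in pkg/defaults.
const collectorTimeout = 10 * time.Second

// effectiveTimeout returns the collector's own timeout unless the
// parent context's deadline would expire sooner.
func effectiveTimeout(ctx context.Context, timeout time.Duration) time.Duration {
	if deadline, ok := ctx.Deadline(); ok {
		if remaining := time.Until(deadline); remaining < timeout {
			return remaining
		}
	}
	return timeout
}

// boundedCollect shows the shape of the first lines of Collect: derive
// a bounded context, defer cancel, then do the work with that context.
func boundedCollect(ctx context.Context, work func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, effectiveTimeout(ctx, collectorTimeout))
	defer cancel()
	return work(ctx)
}

// budgetWithParent reports the effective timeout when the parent
// context has parentBudget left (zero means no parent deadline).
func budgetWithParent(parentBudget time.Duration) time.Duration {
	ctx := context.Background()
	if parentBudget > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, parentBudget)
		defer cancel()
	}
	return effectiveTimeout(ctx, collectorTimeout)
}
```

context.WithTimeout never extends an earlier parent deadline, so the explicit comparison is mostly useful when the collector needs to know its real budget, for example to log it or to size a retry loop.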
Adding a New Collector — Walkthrough
End-to-end, the smallest viable patch:
- Create the package. pkg/collector/<kind>/<kind>.go with a Collector struct and any options as pkg/defaults-backed fields. Constructor returns the interface type, not the concrete struct.
- Implement Collect. First line: ctx, cancel := context.WithTimeout(ctx, defaults.CollectorTimeout); defer cancel(). Then read state and build subtypes. Use measurement.NewSubtypeBuilder(name) and measurement.NewMeasurement(type).WithSubtypes(...).Build() from pkg/measurement/builder.go.
- Add a measurement.Type if the dimension is new. Append the constant in pkg/measurement/types.go (TypeXxx) and to the Types slice. Recipe constraints address measurements by type; leave this out and your data is unreachable.
- Extend the factory. Add a CreateXxxCollector() Collector method on Factory and DefaultFactory in pkg/collector/factory.go.
- Wire into snapshotter. Add one g.Go(collectSafe(gctx, "<kind>", n.Factory.CreateXxxCollector())) line in pkg/snapshotter/snapshot.go.
- Test. <kind>_test.go with table-driven tests. Use k8s.io/client-go/kubernetes/fake for K8s collectors. Cover the happy path, the missing-dependency degradation path, and a context-cancellation case.
- Update docs. Add the row to docs/user/cli-reference.md if the snapshot output schema gains a new top-level entry, and to this page’s Where Collectors Live table.
Measurement Schema
Reading is a typed-scalar interface implemented by
Scalar[T] (Int, Int64, Uint, Uint64, Float64, Bool,
Str). Use the helpers in
pkg/measurement/types.go
— never store raw any.
The reading.Any() JSON gotcha. When a snapshot is read from
disk, JSON decoders deliver integer values as float64. Any
type-switch on reading.Any() must handle int, int64, and
float64. Forgetting case float64 is a CLAUDE.md anti-pattern —
constraints break the moment the snapshot round-trips through
JSON.
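A small sketch of the round-trip problem, using only encoding/json. The helpers asInt64 and roundTrip are hypothetical and are not part of pkg/measurement; they only show why the float64 case is needed.

```go
package main

import (
	"encoding/json"
	"math"
)

// asInt64 normalizes a value of the kind reading.Any() might return.
// It handles the in-memory integer types and the float64 that a JSON
// decoder produces for any number decoded into an untyped value.
func asInt64(v any) (int64, bool) {
	switch n := v.(type) {
	case int:
		return int64(n), true
	case int64:
		return n, true
	case float64:
		// Reject fractional, NaN, and infinite values rather than
		// silently truncating them.
		if math.Trunc(n) != n || math.IsInf(n, 0) {
			return 0, false
		}
		return int64(n), true
	default:
		return 0, false
	}
}

// roundTrip encodes v as JSON and decodes it back into an untyped
// value, as happens when a snapshot is written to disk and reloaded.
func roundTrip(v any) (any, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	var out any
	if err := json.Unmarshal(raw, &out); err != nil {
		return nil, err
	}
	return out, nil
}
```

An int64 that goes through roundTrip comes back as a float64, so a type-switch with only the int and int64 cases would fall through to default on a reloaded snapshot.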
Boundary: Collectors Don’t Mutate
Allowed K8s verbs from a collector: Get, List, Watch (one-shot
only — drain and return). Anything in this column is a review block:
If your check requires mutation to know the answer, the answer
belongs in pkg/validator, not pkg/collector.
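One way to make the boundary checkable in a unit test is to compare the verbs a collector issued against the read-only set. The sketch below works on plain verb strings; readOnlyVerbs and mutatingVerbs are hypothetical names, and where the recorded verbs come from (for example, the actions recorded by a fake clientset) is left as an assumption.

```go
package main

import "strings"

// readOnlyVerbs lists the Kubernetes verbs a collector may issue.
var readOnlyVerbs = map[string]bool{"get": true, "list": true, "watch": true}

// mutatingVerbs returns every recorded verb that falls outside the
// read-only set, in the order it was recorded.
func mutatingVerbs(recorded []string) []string {
	var bad []string
	for _, v := range recorded {
		if !readOnlyVerbs[strings.ToLower(v)] {
			bad = append(bad, v)
		}
	}
	return bad
}
```

A collector test could then fail whenever mutatingVerbs returns a non-empty slice.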
Concurrency Rules
- Collectors run in parallel under errgroup.WithContext. The order in the snapshot is the order results are appended under the snapshotter’s mutex; do not rely on it.
- Collectors do not share state with each other. The Talos pair is the one exception, and it shares lazily-initialized config via the factory, not via globals.
- Do not block on another collector’s output. If a dimension depends on another, fold both into the same collector or compose them at validation time.
- The snapshotter’s errgroup cancels siblings on a hard error, but today that path is only structural: collectSafe swallows errors and logs them. Returning a real error from Collect is reserved for future fail-closed cases; raise a discussion before flipping a collector to that mode.
Error Wrapping
Use pkg/errors with codes — never fmt.Errorf:
Pick codes by intent: ErrCodeUnavailable for upstream/dependency
unreachable, ErrCodeTimeout for ctx deadline, ErrCodeInternal for
parse or invariant failures. Never swallow a non-context error
silently in a spawned goroutine — emit at least
slog.Warn("...", "error", err) (CLAUDE.md anti-pattern).
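The code-selection rule can be sketched as follows. Only the three code names come from the text above; the Code type, CodedError, wrap, classify, codeOf, and the two sentinel errors are stand-ins invented for this example, and the real pkg/errors API may look different.

```go
package main

import (
	"context"
	"errors"
)

// Code and its constants mirror the code names used above; the type
// itself is a stand-in for whatever pkg/errors defines.
type Code string

const (
	ErrCodeUnavailable Code = "UNAVAILABLE"
	ErrCodeTimeout     Code = "TIMEOUT"
	ErrCodeInternal    Code = "INTERNAL"
)

// CodedError is a hypothetical structured error carrying a code, a
// message, and the underlying cause.
type CodedError struct {
	Code    Code
	Message string
	Cause   error
}

func (e *CodedError) Error() string {
	if e.Cause == nil {
		return string(e.Code) + ": " + e.Message
	}
	return string(e.Code) + ": " + e.Message + ": " + e.Cause.Error()
}

// Unwrap keeps the cause reachable through errors.Is and errors.As.
func (e *CodedError) Unwrap() error { return e.Cause }

// wrap is a hypothetical constructor standing in for the real helper.
func wrap(code Code, msg string, cause error) error {
	return &CodedError{Code: code, Message: msg, Cause: cause}
}

// Hypothetical sentinel errors used to drive the example.
var (
	errParse   = errors.New("malformed input")
	errRefused = errors.New("connection refused")
)

// classify picks a code by intent: deadline errors become timeouts,
// parse failures are internal, and anything else is treated as an
// unreachable dependency (an assumption made for this sketch).
func classify(msg string, err error) error {
	switch {
	case err == nil:
		return nil
	case errors.Is(err, context.DeadlineExceeded):
		return wrap(ErrCodeTimeout, msg, err)
	case errors.Is(err, errParse):
		return wrap(ErrCodeInternal, msg, err)
	default:
		return wrap(ErrCodeUnavailable, msg, err)
	}
}

// codeOf extracts the code from a wrapped error, or "" if there is none.
func codeOf(err error) Code {
	var ce *CodedError
	if errors.As(err, &ce) {
		return ce.Code
	}
	return ""
}
```

Keeping Unwrap on the wrapper means callers can still test for the original cause with errors.Is after the code has been attached.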
Cross-Cutting Topology Collector
pkg/collector/topology
is the only collector that reads cluster-wide state rather than
the local node. It paginates nodes.List, aggregates taints and
labels into taintID → []node and labelID → []node maps, and emits
them as a single TypeNodeTopology measurement. Bound by
CollectorTopologyTimeout (90s) and the MaxNodesPerEntry cap from
the factory (caps the per-entry node list to keep snapshot size
sane).
Treat it as the template for any future cluster-scoped collector — not for per-node ones.
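The aggregation step might look roughly like the sketch below. Node, pageFunc, and aggregatePages are stand-ins invented for this example, the "key=value" ID format is an assumption, and context handling is omitted for brevity even though a real collector would check ctx between pages.

```go
package main

// Node is a stand-in for the fields of a Kubernetes node that matter
// to the aggregation.
type Node struct {
	Name   string
	Labels map[string]string
	Taints []string // pre-rendered taint IDs, e.g. "gpu=true:NoSchedule"
}

// pageFunc stands in for one paginated nodes.List call: it returns a
// page of nodes plus the token for the next page ("" when done).
type pageFunc func(token string) (page []Node, next string)

// aggregatePages walks every page and builds labelID -> nodes and
// taintID -> nodes maps, capping each node list at maxNodesPerEntry.
func aggregatePages(list pageFunc, maxNodesPerEntry int) (labels, taints map[string][]string) {
	labels = map[string][]string{}
	taints = map[string][]string{}
	add := func(m map[string][]string, id, node string) {
		// The cap bounds snapshot size on large clusters.
		if len(m[id]) < maxNodesPerEntry {
			m[id] = append(m[id], node)
		}
	}
	token := ""
	for {
		page, next := list(token)
		for _, n := range page {
			for k, v := range n.Labels {
				add(labels, k+"="+v, n.Name)
			}
			for _, t := range n.Taints {
				add(taints, t, n.Name)
			}
		}
		if next == "" {
			return labels, taints
		}
		token = next
	}
}
```

With a cap of 2, a label shared by three nodes keeps only the first two node names it encounters, which is the size-limiting behavior described for MaxNodesPerEntry.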
Testing
Never write a test that hits a live cluster. CI runs without one.
Common Pitfalls
See Also
- /aicr/contributor-guide/architecture-overview: overall architecture and package map
- /aicr/contributor-guide/recipes-overlays-and-mixins: recipe constraints address measurement values by Type/Subtype/key
- /aicr/contributor-guide/validators: validators consume the snapshot measurements collectors produce, and are where mutation belongs
- CLAUDE.md: error wrapping, context, K8s patterns, the reading.Any() anti-pattern entry