Measurement Schema


The Measurement type in github.com/NVIDIA/aicr/pkg/measurement is the on-wire shape used throughout aicr’s Snapshot → Recipe → Validate → Bundle pipeline. Snapshots serialize a []*Measurement to YAML/JSON; recipes and validators consume the same shape. This page is the schema contract — any external producer (cross-repo Go library, CI tool, custom collector) emitting Measurements should follow it exactly.

The Go types are documented in pkg/measurement/types.go. This page documents the conventions on top of the types (which Type appears how often, which Subtype names mean what, which fields live in Context vs Data).

Top-level structure

measurements:
  - type: K8s
    subtypes: [...]
  - type: GPU
    subtypes: [...]
  - type: OS
    subtypes: [...]
  - type: SystemD
    subtypes: [...]
  - type: NodeTopology
    subtypes: [...]
  - type: NetworkTopology # 0 or 1 today; future: 0..N (one per group)
    subtypes: [...]

Type cardinality

| Type | Cardinality today | Notes |
| --- | --- | --- |
| K8s | 0 or 1 | Cluster-scoped Kubernetes state. |
| GPU | 0 or 1 | GPU inventory + driver state. |
| OS | 0 or 1 | Host OS metadata. |
| SystemD | 0 or 1 | systemd unit states. |
| NodeTopology | 0 or 1 | Cluster-wide node taints + labels (aggregate). |
| NetworkTopology | 0 or 1 | Per-hardware-group network topology. Planned multi-instance: future versions will emit one per discovered group. |

Find-first-by-Type consumers (constraint extractor, recipe validation, diff indexing) are sound today because every Type appears at most once. When NetworkTopology becomes multi-instance, the relevant consumer rewrites are tracked alongside the multi-group enablement work.
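To illustrate why the at-most-once rule matters for find-first consumers, here is a minimal Go sketch. The `Measurement` struct is a trimmed, hypothetical stand-in rather than the real type from pkg/measurement, and `findFirstByType` and `duplicateTypes` are illustrative helper names, not functions of the library.

```go
package main

import "fmt"

// Measurement is a hypothetical, trimmed-down stand-in for the real type in
// pkg/measurement; only the Type field matters for this sketch.
type Measurement struct {
	Type string
}

// findFirstByType returns the first measurement with the given Type.
// This is only sound while every Type appears at most once.
func findFirstByType(ms []*Measurement, typ string) (*Measurement, bool) {
	for _, m := range ms {
		if m != nil && m.Type == typ {
			return m, true
		}
	}
	return nil, false
}

// duplicateTypes lists every Type that appears more than once, in the order
// the second occurrence is seen. A producer could use it as a guard.
func duplicateTypes(ms []*Measurement) []string {
	counts := make(map[string]int)
	var dups []string
	for _, m := range ms {
		if m == nil {
			continue
		}
		counts[m.Type]++
		if counts[m.Type] == 2 {
			dups = append(dups, m.Type)
		}
	}
	return dups
}

func main() {
	ms := []*Measurement{{Type: "K8s"}, {Type: "GPU"}, {Type: "NetworkTopology"}}
	if m, ok := findFirstByType(ms, "GPU"); ok {
		fmt.Println("found:", m.Type)
	}
	fmt.Println("duplicates:", duplicateTypes(ms))
}
```

Once NetworkTopology becomes multi-instance, a lookup like `findFirstByType` would silently pick one group, which is why such consumers need rewriting rather than a guard alone.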

Subtype

A Subtype has a name plus up to three payload fields:

| Field | Type | Purpose |
| --- | --- | --- |
| data | map[string]Reading (scalar values) | Numeric / boolean / string measurements addressable by key. |
| context | map[string]string | Descriptive metadata (provenance, identity, free-form labels). |
| items | []ItemEntry | Ordered list of structured records. Used when the payload is naturally an array. |

A subtype must carry at least one entry in data or items. data and items are independent and may both be populated.

ItemEntry

- context:
    pciAddress: "0000:03:00.0"
    deviceID: "1023"
  data:
    rail: 0
    numaNode: 0
    traffic: east-west

Each ItemEntry mirrors a Subtype’s scalar contract: data holds Reading scalars; context holds string-typed descriptive fields. ItemEntry does NOT support nested items — the scalar Reading model is preserved.
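The Subtype and ItemEntry rules above can be sketched as a small producer-side check. The types below are hypothetical mirrors of the documented shape, not the definitions from pkg/measurement/types.go; in particular, `Reading` is modeled as a plain `any`, and the set of scalar Go kinds accepted by `isScalar` is an assumption.

```go
package main

import "fmt"

// Reading is a hypothetical stand-in for the scalar value type; the real
// definition lives in pkg/measurement/types.go.
type Reading any

// ItemEntry mirrors the documented shape: scalar data plus string context.
// It deliberately has no Items field, so nesting is impossible.
type ItemEntry struct {
	Context map[string]string
	Data    map[string]Reading
}

// Subtype mirrors the documented name plus data/context/items payload.
type Subtype struct {
	Name    string
	Data    map[string]Reading
	Context map[string]string
	Items   []ItemEntry
}

// isScalar reports whether v is one of the scalar kinds assumed here.
func isScalar(v Reading) bool {
	switch v.(type) {
	case int, int64, float64, bool, string:
		return true
	}
	return false
}

// validateSubtype checks two documented rules: a subtype carries at least
// one entry in data or items, and every data value is a scalar.
func validateSubtype(s Subtype) error {
	if len(s.Data) == 0 && len(s.Items) == 0 {
		return fmt.Errorf("subtype %q: needs at least one entry in data or items", s.Name)
	}
	for k, v := range s.Data {
		if !isScalar(v) {
			return fmt.Errorf("subtype %q: data[%q] is not a scalar", s.Name, k)
		}
	}
	for i, it := range s.Items {
		for k, v := range it.Data {
			if !isScalar(v) {
				return fmt.Errorf("subtype %q: items[%d].data[%q] is not a scalar", s.Name, i, k)
			}
		}
	}
	return nil
}

func main() {
	pfs := Subtype{
		Name: "pfs",
		Items: []ItemEntry{{
			Context: map[string]string{"pciAddress": "0000:03:00.0", "deviceID": "1023"},
			Data:    map[string]Reading{"rail": 0, "numaNode": 0, "traffic": "east-west"},
		}},
	}
	fmt.Println(validateSubtype(pfs))
	// A subtype with only context violates the "data or items" rule.
	fmt.Println(validateSubtype(Subtype{Name: "only-context", Context: map[string]string{"a": "b"}}))
}
```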

NetworkTopology shape

TypeNetworkTopology describes one hardware group’s network layout (PFs, rails, RDMA capabilities, kernel modules, identity). When emitted, the Measurement MUST follow this layout:

type: NetworkTopology
subtypes:
  - subtype: identity
    context:
      identifier: <stable group identifier, lowercase, RFC-1123>
      machineType: <e.g. GB300-NVL>
      gpuType: <e.g. NVIDIA-GB300>
      linkType: <Ethernet | InfiniBand | ""> # empty if unknown
      nodeSelector: <label=value selector that targets the group's nodes>
    data:
      pf-count: <int>
      rail-count: <int>
  - subtype: capabilities
    data:
      sriov: <bool>
      rdma: <bool>
      ib: <bool>
  - subtype: pfs
    items:
      - context:
          pciAddress: <e.g. 0000:03:00.0>
          deviceID: <hex PCI device ID, e.g. 1023>
          psid: <PSID string>
          partNumber: <NVIDIA SKU / part number>
          rdmaDevice: <e.g. mlx5_0>
          networkInterface: <e.g. enp3s0f0np0>
          model: <human-readable NIC model from VPD, when set>
          connectedGPU: <GPU identifier from preset topology, e.g. GPU0>
          gpuProximity: <PCIe-topology class to connectedGPU, e.g. PIX>
        data:
          rail: <int>
          numaNode: <int>
          traffic: <east-west | north-south>
      - context: {...}
        data: {...}
  - subtype: kernel-modules
    data:
      storage.0: <module name>
      storage.1: <module name>
      thirdParty.0: <module name>
      thirdParty.1: <module name>

Subtypes

  • identity — group identity and high-level facts. Strings (machineType, gpuType, linkType, identifier, nodeSelector) live in context. Numeric facts (pf-count, rail-count) live in data.
  • capabilities — boolean cluster capabilities (sriov, rdma, ib) as scalar Reading values in data.
  • pfs — per-PF records as items. Per-PF descriptive identifiers (PCI address, device ID, PSID, part number, RDMA device name, netdev name, VPD model string, connectedGPU + gpuProximity from preset topology) live in context; per-PF scalar facts (rail index, NUMA node, traffic class) live in data. Optional fields (model, connectedGPU, gpuProximity) are omitted when unset by l8k.
  • kernel-modules — flat ordered lists of storage and third-party RDMA modules. Keys are dotted with a numeric suffix (storage.0, storage.1, thirdParty.0, …) to preserve order and stay within the scalar Reading model. (This is a deliberate exception to the array-via-items pattern: the lists are short, lookup is rare, and the dotted-key form is cheap.)
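A consumer that needs the kernel-modules lists back in order has to sort by the numeric suffix rather than lexically, since `storage.10` sorts before `storage.2` as a string. The following Go sketch shows one way to do that; `orderedList` is a hypothetical helper, the data map is assumed to hold already-stringified values, and the module names are placeholders.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// orderedList collects values whose keys look like "<prefix>.<n>" and returns
// them sorted by the numeric suffix n. Keys with another prefix or with a
// non-numeric or negative suffix are ignored.
func orderedList(data map[string]string, prefix string) []string {
	type entry struct {
		idx int
		val string
	}
	var entries []entry
	p := prefix + "."
	for k, v := range data {
		if !strings.HasPrefix(k, p) {
			continue
		}
		n, err := strconv.Atoi(strings.TrimPrefix(k, p))
		if err != nil || n < 0 {
			continue
		}
		entries = append(entries, entry{idx: n, val: v})
	}
	// Numeric sort, so index 10 comes after index 2.
	sort.Slice(entries, func(i, j int) bool { return entries[i].idx < entries[j].idx })
	out := make([]string, 0, len(entries))
	for _, e := range entries {
		out = append(out, e.val)
	}
	return out
}

func main() {
	// Placeholder module names, for illustration only.
	data := map[string]string{
		"storage.1":    "module-b",
		"storage.0":    "module-a",
		"thirdParty.0": "module-x",
	}
	fmt.Println(orderedList(data, "storage"))
	fmt.Println(orderedList(data, "thirdParty"))
}
```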

Field-placement convention

  • context — values that describe or identify a record: textual, low-cardinality, used for grouping or display. Not constrained to be scalar Readings.
  • data — values that are measured or counted: int / float / bool / short string, addressable by key, comparable in validator constraints.

Constraint paths

The constraints package addresses a single value within a Measurement using:

{Type}.{Subtype}.{Key} # legacy form, looks in Subtype.Data
{Type}.{Subtype}[<selector>].{Key} # item form, looks in ItemEntry

Selector forms:

| Form | Example | Meaning |
| --- | --- | --- |
| Index | NetworkTopology.pfs[0].rail | Items entry at index 0. |
| Predicate | NetworkTopology.pfs[rail=3].pciAddress | The unique Items entry whose data["rail"].String() == "3" (or context["rail"] == "3" if not in data). |
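The two path forms and two selector forms can be told apart with a small parser. This is a sketch under stated assumptions, not the constraints package's implementation: `Path`, `parsePath`, and `errBadPath` are hypothetical names, and two behaviors are guesses. A legacy key keeps any further dots (so `storage.0` stays one key), and the selector is read up to the first `]` so predicate values containing dots, such as a PCI address, survive.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// Path is a hypothetical parsed form of a constraint path.
type Path struct {
	Type    string
	Subtype string
	Key     string
	Kind    string // "legacy", "index", or "predicate"
	Index   int    // set when Kind == "index"
	Field   string // set when Kind == "predicate"
	Value   string // set when Kind == "predicate"
}

var errBadPath = errors.New("malformed constraint path")

// parsePath splits {Type}.{Subtype}.{Key} or {Type}.{Subtype}[<selector>].{Key}.
func parsePath(s string) (Path, error) {
	typ, rest, ok := strings.Cut(s, ".")
	if !ok || typ == "" {
		return Path{}, errBadPath
	}
	lb := strings.IndexByte(rest, '[')
	dot := strings.IndexByte(rest, '.')
	if lb < 0 || (dot >= 0 && dot < lb) {
		// Legacy form: everything after the subtype is the key (assumption).
		sub, key, found := strings.Cut(rest, ".")
		if !found || sub == "" || key == "" {
			return Path{}, errBadPath
		}
		return Path{Type: typ, Subtype: sub, Key: key, Kind: "legacy"}, nil
	}
	rb := strings.IndexByte(rest, ']')
	if lb == 0 || rb < lb || !strings.HasPrefix(rest[rb+1:], ".") || len(rest) == rb+2 {
		return Path{}, errBadPath
	}
	p := Path{Type: typ, Subtype: rest[:lb], Key: rest[rb+2:]}
	sel := rest[lb+1 : rb]
	if field, value, isPred := strings.Cut(sel, "="); isPred {
		if field == "" {
			return Path{}, errBadPath
		}
		p.Kind, p.Field, p.Value = "predicate", field, value
		return p, nil
	}
	n, err := strconv.Atoi(sel)
	if err != nil || n < 0 {
		return Path{}, errBadPath
	}
	p.Kind, p.Index = "index", n
	return p, nil
}

func main() {
	for _, s := range []string{
		"NetworkTopology.capabilities.rdma",
		"NetworkTopology.pfs[0].rail",
		"NetworkTopology.pfs[rail=3].pciAddress",
	} {
		p, err := parsePath(s)
		fmt.Printf("%s -> %+v (err: %v)\n", s, p, err)
	}
}
```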

Predicate behavior — deterministic single-match resolution:

  • LHS is looked up in ItemEntry.Data first (stringified via Reading.String()); falls back to ItemEntry.Context if not found in Data.
  • Exactly one matching entry is required.
  • Zero matches returns ErrCodeNotFound.
  • Two or more matches returns ErrCodeConflict. Predicates that can match more than one entry are a recipe authoring error; pick a more specific field to disambiguate.

Key resolution inside the chosen ItemEntry:

  • Data is consulted first (returns Reading.String()).
  • Context is consulted next (returns the string directly).
  • Missing key returns ErrCodeNotFound.
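The predicate and key-resolution rules above can be sketched in Go as follows. This is an illustration, not the constraints package's code: `errNotFound` and `errConflict` stand in for ErrCodeNotFound and ErrCodeConflict, the `ItemEntry` here is a hypothetical mirror whose data values are already stringified (standing in for Reading.String()), and all function names are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-ins for the library's ErrCodeNotFound / ErrCodeConflict codes.
var (
	errNotFound = errors.New("not found")
	errConflict = errors.New("conflict")
)

func isNotFound(err error) bool { return errors.Is(err, errNotFound) }
func isConflict(err error) bool { return errors.Is(err, errConflict) }

// ItemEntry is a hypothetical mirror; Data holds pre-stringified readings.
type ItemEntry struct {
	Data    map[string]string
	Context map[string]string
}

// lookup consults Data first and falls back to Context only when the key is
// absent from Data.
func lookup(it ItemEntry, key string) (string, bool) {
	if v, ok := it.Data[key]; ok {
		return v, true
	}
	v, ok := it.Context[key]
	return v, ok
}

// selectByPredicate returns the single item whose field equals want.
// Zero matches is a not-found error; two or more is a conflict.
func selectByPredicate(items []ItemEntry, field, want string) (ItemEntry, error) {
	match := -1
	for i, it := range items {
		got, ok := lookup(it, field)
		if !ok || got != want {
			continue
		}
		if match >= 0 {
			return ItemEntry{}, fmt.Errorf("%w: %s=%s matches more than one item", errConflict, field, want)
		}
		match = i
	}
	if match < 0 {
		return ItemEntry{}, fmt.Errorf("%w: no item with %s=%s", errNotFound, field, want)
	}
	return items[match], nil
}

// resolve applies the predicate, then resolves key inside the chosen item.
func resolve(items []ItemEntry, field, want, key string) (string, error) {
	it, err := selectByPredicate(items, field, want)
	if err != nil {
		return "", err
	}
	v, ok := lookup(it, key)
	if !ok {
		return "", fmt.Errorf("%w: key %q", errNotFound, key)
	}
	return v, nil
}

func main() {
	pfs := []ItemEntry{
		{Data: map[string]string{"rail": "0"}, Context: map[string]string{"pciAddress": "0000:03:00.0"}},
		{Data: map[string]string{"rail": "1"}, Context: map[string]string{"pciAddress": "0000:04:00.0"}},
	}
	v, err := resolve(pfs, "rail", "1", "pciAddress")
	fmt.Println(v, err)
	_, err = resolve(pfs, "rail", "7", "pciAddress")
	fmt.Println(isNotFound(err))
}
```

Because a predicate such as `traffic=east-west` can easily match several PFs, the conflict case is the one recipe authors are most likely to hit; selecting on a per-PF identifier avoids it.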

Stability contract

pkg/measurement is part of aicr’s public API surface (see /aicr/integrator-guide/public-api-surface). The Go types AND the schema conventions documented above are part of the contract. Field-level changes (renames, type changes, semantic shifts in which fields go in data vs context) are breaking and require a pseudo-version bump that downstream consumers (k8s-launch-kit, external CI tools) pin against.

See also