Maintaining AICR

View as Markdown

Runbook for AICR maintainers. Two surfaces:

  • Releases — cadence, tag flow, supply-chain verification.
  • Recipe contributions — reviewing PRs against recipes/ paths, including the forthcoming evidence-backed flow from ADR-007.

For end-user release verification, see RELEASING.md. For contribution mechanics (DCO, CI, signing), see CONTRIBUTING.md.

Cutting a Release

The full release procedure lives in RELEASING.md. The short form:

StepCommandNotes
1. Pre-flightmake qualify on mainMust pass. Tests + lint + e2e + scan.
2. Bump./tools/release patch (or minor/rc/promote)Creates the signed tag locally.
3. Pushgit push origin <tag>Triggers release.yaml workflow.
4. Verifygh release view <tag> + cosign verify-attestation ...See RELEASING.md §Verification.
5. DemoCloud Run deploy auto-triggers on tag pushInspect aicrd.demo health.

Bi-weekly cadence; hotfix between cycles when a fix is critical.

Common Release Breakages

goreleaser fails with auth conflict. goreleaser panics if both GITLAB_TOKEN and GITHUB_TOKEN are set. Always unset GITLAB_TOKEN before make build, make qualify, make e2e, or any release tooling that wraps goreleaser. Local-shell hazard; CI is unaffected.

Tag exists but workflow did not trigger. Delete the local tag and re-push from a fresh shell. If the workflow ran but failed, fix on main and re-tag — never amend a published tag.

Attestation verification fails for users. Confirm the GitHub attestation predicate type matches https://slsa.dev/provenance/v1 and that the user’s gh is recent enough (gh attestation verify is v2.49+). RELEASING.md §Container Attestations has both gh and cosign flows.

Cloud Run demo deploy fails after tag push. Check deploy-demo.yaml workflow; the most common cause is GitHub Container Registry (GHCR) pull failure during the first 60s after tag publish. Re-run the workflow.

Reviewing Recipe Contributions

A recipe PR touches recipes/overlays/, recipes/mixins/, recipes/components/, or recipes/registry.yaml. Three concerns:

  1. The recipe parses and resolves. Covered by make qualify and the recipe unit tests; trust CI here.
  2. The BOM stays in sync. make bom-docs must have been run; the docs/user/container-images.md change must be present in the PR when a chart pin or values file changed. See /aicr/contributor-guide/recipes-overlays-and-mixins.
  3. The configuration is correct on the target hardware. This is the hard one — maintainers cannot run a contributor’s GB200 recipe on an H100. ADR-007 closes that gap with bundled evidence.

The forthcoming evidence flow is documented below as future state. Until ADR-007 PR-D lands, recipe acceptance still relies on author attestation + maintainer judgement.

Evidence-Backed Review (Future State per ADR-007)

Status: ADR-007 PR-D has not landed. recipes/evidence/ does not exist yet; aicr evidence verify ships but the CI gate is warning-only. The runbook below describes the target state after PR-D. Use it as the design contract, not an operational guide.

The motivating constraint: maintainers cannot independently re-run a contributor’s validator on hardware they don’t have. The evidence bundle is the trust artifact that lets a maintainer accept a recipe they cannot reproduce.

Reviewing a Recipe PR You Can’t Run

Use this checklist on any PR that touches recipes/overlays/**, recipes/mixins/**, recipes/components/**, or recipes/registry.yaml. Items 1–5 are validated automatically by the recipe-evidence check; items 6–8 are maintainer judgement calls.

  1. Pointer file present. recipes/evidence/<recipe>.yaml exists for every touched overlay. The CI gate fails closed when a recipe change has no matching pointer.
  2. recipe-evidence check is green. Exit 0 means the bundle signature, schema, inventory, fingerprint match, constraint replay, and BOM cross-reference all passed. Exit 1 requires explicit disposition (see Exit-1 Review Process). Exit 2 is a hard fail.
  3. Signer identity is acceptable. Open the sticky comment, find the recipe’s <details> section, and review the signer block. See Signer Identity Trust Patterns.
  4. Bundle Open Container Initiative (OCI) ref matches PR description. The recipe PR template asks the contributor to paste the bundle.oci field; confirm the sticky comment shows the same ref.
  5. Material slice digest matches. Verifier step 6a recomputes sha256(JCS(material-slice(post-resolution recipe))) and confirms it matches the attestation’s subject digest.
  6. Test environment is plausible. The PR template captures cloud, accelerator, OS, Kubernetes version, and cluster size. A GB200 recipe attested from a single-node Minikube is a red flag.
  7. BOM reflects the recipe’s image set. Spot-check the CycloneDX BOM in the bundle against docs/user/container-images.md for the touched components. Drift indicates the contributor’s aicr validate ran against a different recipe than the one in the PR.
  8. Recipe changes are scoped. A new accelerator overlay should not touch unrelated overlays or component values.

Signer Identity Trust Patterns

aicr evidence verify records the OIDC issuer and identity from the cosign keyless certificate but does not classify it. Three patterns cover most contributions in V1.

PatternIssuerIdentityTreatment
NVIDIA employeetoken.actions.githubusercontent.com or accounts.google.comGitHub user in NVIDIA org, or @nvidia.com GoogleAccept on identity
Unknown forkGitHub Actions or public OIDCNew GitHub userConfirm cosign identity == PR author; mismatch warrants a comment
Corporate tenantlogin.microsoftonline.com/<tenant>/v2.0 or workspace GoogleTenant userNote issuer; the tenant is the trust anchor

V1 deliberately ships without a formal trust-tier policy (see ADR-007 §“What V1 does not ship”). When a pattern recurs often enough to warrant filtering, the tier-policy work pulls in.

Exit-1 Review Process

Exit 1 means the bundle verified cleanly (signature, schema, inventory, fingerprint) but one or more validator phases reported failures. Common causes: a conformance check failed on the contributor’s hardware, a performance threshold was not met, an optional check requires a feature the contributor’s cluster does not have.

Exit 1 is not the same as evidence/exempt. Exit 1 means “evidence was produced and shows a partial failure”; exempt means “no evidence was produced.”

Workflow:

  1. Contributor declares exit-1 intent in the PR template’s “Evidence disposition” section, with a reason.
  2. If acceptable, apply evidence/known-failure label and merge.
  3. If not, request changes. Typical resolutions: narrow the recipe criteria so the failing check is not selected, fix the underlying constraint, or attest against a different cluster where the check passes.

Acceptable reasons cluster into: optional check not applicable to this hardware; performance ceiling is hardware-limited; validator under active rework. Unacceptable: “test was flaky, please merge” or any reason that asks the maintainer to extend trust beyond what the evidence shows.

evidence/exempt Bypass Policy

The evidence/exempt label bypasses the recipe-evidence check entirely. It exists for PRs that modify files under recipes/ for non-recipe reasons.

Appropriate uses:

  • Mechanical refactors (file renames, comment-only changes, license header sweeps).
  • Self-bootstrapping changes that wire up the evidence pipeline itself.
  • Documentation edits that touch recipes/ paths but no recipe semantics.

Inappropriate uses:

  • “I don’t have the hardware right now, please merge.” Maintainers MUST NOT apply the label to skip an inconvenient evidence check.
  • Recipe value changes (image versions, constraint thresholds, overlay merge behavior).

A PR carrying evidence/exempt must include a sentence in the description explaining why the bypass is appropriate. The label is queryable via is:pr label:evidence/exempt for audit.

6-Month Audit Runbook

Quarterly or semi-annually, walk the merged-recipe history to confirm that what merged is still verifiable:

$# Enumerate recently-touched pointers
$git log --since='6 months ago' --diff-filter=AM \
> --name-only --pretty=format: \
> -- recipes/evidence/ | sort -u
$
$# For each, re-verify against the current OCI artifact
$aicr evidence verify recipes/evidence/<recipe>.yaml

Exit 0 confirms the bundle is still fetchable and the signature still chains. If the OCI registry has been deleted, fall back to Rekor:

$cosign verify-attestation --type=recipe-evidence/v1 <bundle-oci-ref>

A passing Rekor verify confirms the bundle existed and was signed by the recorded identity, even if the bytes are no longer fetchable.

Pointers older than 24 months are past the V1 re-cert age cutoff (see ADR-007 §“What V1 does not ship”). File an issue asking the contributor (or a replacement) to re-attest.

maintainers: Block Routing (Post PR-D)

ADR-007 PR-D adds an optional maintainers: block to recipe metadata. It is a routing surface, not a merge-authority surface: it provides a durable contact for re-cert prompts and lets the audit runbook file re-cert issues. It does not confer merge authority and does not replace the signer identity on the bundle.