Appendix E. Failure Modes and Acceptance Tests
The runbook covers node reboot, pod restart, pod reschedule, GPU reset, firmware drift, runtime-class mismatch, guest measurement mismatch, runtime-policy mismatch, attestation collateral expiry or revocation, KBS or KMS/HSM outage, model artifact rotation, certificate rotation, and emergency disablement of key release.
Table 14: Failure Modes
| Failure mode | Component expected to raise it | Operator-facing signal |
|---|---|---|
| Pod missing RuntimeClass, GPU request, node selector, or approved image setting | Kubernetes admission, scheduler, runtime, policy controller | Admission, scheduling, or sandbox launch denial naming missing or unsupported setting |
| CPU, GPU, guest, image, runtime-policy, or policy evidence mismatch | Guest Attestation Agent, Attestation Service, Trustee | Attestation denial with measurement, collateral, or policy reason code |
| GPU not in required confidential mode | GPU Operator, host driver, guest driver, GPU management tooling, attestation verifier | Node, pod, device-health, or attestation denial naming GPU CC state |
| Reference value, collateral, or revocation data missing or expired | Attestation Service, Trustee, reference-value service | Verification failure naming missing collateral, expiry, or unsupported evidence |
| KBS, KMS/HSM, or key-release policy denies request | Trustee KBS, KMS/HSM integration | Key-release denial or dependency error with key ID, policy version, request identity |
| exec, debug, privileged pod, or node-level attach path blocked | Kubernetes policy, runtime policy, guest hardening, break-glass workflow | Explicit denial identifying the administrative path and policy that blocked it |
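The operator-facing signals in Table 14 share a small set of fields (raising component, reason code, key ID, policy version, request identity). As a minimal sketch of how such a denial could be carried as one structured log line, the example below uses a hypothetical `DenialSignal` record. The class name, field names, and sample values are assumptions for illustration only and do not reflect the log format of Trustee, the KBS, or any Kubernetes component.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class DenialSignal:
    """Hypothetical operator-facing denial record; all field names are illustrative."""
    component: str                          # component that raised the denial
    failure_mode: str                       # short label for the Table 14 row
    reason_code: str                        # measurement, collateral, or policy reason
    key_id: Optional[str] = None            # set only for key-release denials
    policy_version: Optional[str] = None
    request_identity: Optional[str] = None

    def to_log_line(self) -> str:
        # Drop unset fields so the line contains only what the component knows.
        fields = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(fields, sort_keys=True)


# Sample values are placeholders, not real identifiers.
signal = DenialSignal(
    component="kbs",
    failure_mode="key-release-denied",
    reason_code="policy-denied",
    key_id="sample-key",
    policy_version="v1",
    request_identity="pod/example",
)
print(signal.to_log_line())
```

Keeping the reason code and identifiers as separate fields, rather than embedding them in free text, makes it easier to retain the same record as acceptance-test evidence.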
Table 15: Acceptance Tests
| Test | Expected result | Evidence to retain |
|---|---|---|
| Approved confidential pod attests and receives a sample key | Pod reaches healthy state; attestation succeeds; KBS releases the non-sensitive test key only after attestation succeeds | Pod events, runtime logs, verifier decision, KBS key-release audit, service health |
| Unapproved workload image, guest image, or runtime policy is denied | Pod does not receive the key | Attestation denial with image, guest, runtime-policy, measurement, or reference-value ID |
| Tampered pod spec or launch parameters | Pod does not receive the key when RuntimeClass, image digest, GPU assignment, guest, or runtime policy differs from the approved build | Admission, runtime, or attestation denial with changed measurement or build ID |
| GPU is not in the required confidential-computing mode | Pod launch, attestation, or key release fails closed | GPU/node condition or attestation reason |
| Expired or missing attestation collateral | Attestation fails before key release | Verifier error and collateral identifier |
| Trustee/KBS/KMS outage or policy denial | Workload fails closed with an actionable error | Trustee/KMS error class, key ID, request ID, policy version |
| KBS secret, app, or model key disabled by model provider | Model decryption fails during boot or startup; service does not run with stale access | KBS/KMS denial, app/key ID, policy version, guest startup error |
| exec, debug, privileged pod, node attach, or memory dump attempted | Administrative bypass fails or is controlled as break-glass without model/key exposure | Kubernetes, node, guest, runtime, firewall, or policy denial; forensics record without model data or sensitive payloads |
| Artifact or key rotation | New artifact/key approved; retired key unavailable per policy | New measurements/digests and audit record |
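Several tests in Table 15 expect the same fail-closed behavior: the key is released only when every precondition holds, and any failed or unavailable check produces a denial with a reason. The sketch below models that decision in plain Python. `ReleaseRequest`, `release_decision`, and the reason strings are hypothetical names chosen for this example; they are not the Trustee or KBS API, and a real deployment would derive these inputs from verifier and KMS responses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class ReleaseRequest:
    """Hypothetical inputs to a key-release decision; not a Trustee data model."""
    attestation_ok: bool
    collateral_expires_at: Optional[datetime]  # None means collateral is missing
    gpu_cc_mode_on: bool
    key_enabled: bool                          # False if the provider disabled the key
    kbs_reachable: bool


def release_decision(req: ReleaseRequest, now: datetime) -> Tuple[bool, str]:
    """Return (released, reason). Any failed check denies release."""
    if not req.kbs_reachable:
        return False, "kbs-unavailable"
    if not req.gpu_cc_mode_on:
        return False, "gpu-cc-mode-off"
    if req.collateral_expires_at is None:
        return False, "collateral-missing"
    if req.collateral_expires_at <= now:
        return False, "collateral-expired"
    if not req.attestation_ok:
        return False, "attestation-denied"
    if not req.key_enabled:
        return False, "key-disabled"
    # Only reached when every precondition above holds.
    return True, "released"


# Illustrative timestamps; the approved case is the only one that releases.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
approved = ReleaseRequest(
    attestation_ok=True,
    collateral_expires_at=datetime(2025, 6, 1, tzinfo=timezone.utc),
    gpu_cc_mode_on=True,
    key_enabled=True,
    kbs_reachable=True,
)
print(release_decision(approved, now))
```

An acceptance harness could flip one field at a time from the approved request and assert that release is denied with the matching reason, which mirrors the one-fault-per-test structure of the table.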