Operations and Failure Handling

Operators still need to run the service. They need health, capacity, failure, scheduling, runtime, GPU, key-release, and attestation signals. They don’t need prompts, responses, model weights, keys, or decrypted intermediate data.

Enterprises can gather telemetry through instrumentation such as OpenTelemetry. Payload logging and tracing should be off by default to protect enterprise data. The model provider controls which events the workload emits and is responsible for keeping secrets or confidential material out of debug traces. The platform operator controls whether those events are gathered.
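One way to picture "payloads off by default" is an allowlist filter applied to event attributes before they leave the workload. The sketch below uses only the Python standard library and does not show any OpenTelemetry API. The attribute names, the `scrub_attributes` helper, and the `capture_payloads` flag are all hypothetical, chosen for illustration.

```python
from __future__ import annotations

from typing import Any, Mapping

# Hypothetical allowlist: attribute names the model provider lets the workload emit.
ALLOWED_ATTRIBUTES = frozenset({
    "service.name",
    "http.status_code",
    "latency_ms",
    "gpu.utilization",
    "error.type",
})

# Hypothetical payload-bearing attributes; emitted only on explicit opt-in.
PAYLOAD_ATTRIBUTES = frozenset({"prompt", "response"})


def scrub_attributes(
    attrs: Mapping[str, Any], capture_payloads: bool = False
) -> dict[str, Any]:
    """Return only the attributes that are safe to export.

    Unknown attributes are always dropped. Payload attributes are dropped
    unless capture_payloads is explicitly set to True.
    """
    allowed = set(ALLOWED_ATTRIBUTES)
    if capture_payloads:
        allowed |= PAYLOAD_ATTRIBUTES
    return {key: value for key, value in attrs.items() if key in allowed}


if __name__ == "__main__":
    raw = {"latency_ms": 12, "prompt": "confidential text", "debug_blob": "..."}
    print(scrub_attributes(raw))  # payload and unknown keys are dropped
```

An allowlist fails closed: an attribute added later by a debug change is dropped until someone deliberately permits it, which fits the provider's responsibility for keeping confidential material out of traces.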

Platform metrics cover node health, firmware state, GPU mode, GPU device health, GPU Operator status, runtime class selection, pod sandbox launch, confidential runtime health, key-release service health, network reachability, and service readiness. Guest metrics cover service health, readiness, throughput, latency, GPU utilization, and application errors without payloads.

Attestation and key-release events go to the SIEM with workload identity, namespace, image digest, runtime-policy ID, policy version, attestation and key-release decisions, timestamp, verifier identity, and failure reason. Payloads, keys, confidential model names, and customer data stay out of these events.
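The field list above can be sketched as a fixed-schema record, so that an event cannot carry anything beyond the listed fields. This is a minimal illustration, not a prescribed SIEM format: the class name `AttestationEvent`, the field spellings, the decision values, and the JSON serialization are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class AttestationEvent:
    """Hypothetical SIEM record holding only the fields listed above."""

    workload_identity: str
    namespace: str
    image_digest: str
    runtime_policy_id: str
    policy_version: str
    attestation_decision: str   # assumed values: "allow" or "deny"
    key_release_decision: str   # assumed values: "allow" or "deny"
    timestamp: str              # assumed format: ISO 8601, UTC
    verifier_identity: str
    failure_reason: Optional[str] = None


def to_siem_record(event: AttestationEvent) -> str:
    """Serialize the event as JSON with a stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


if __name__ == "__main__":
    event = AttestationEvent(
        workload_identity="example-workload",      # placeholder values
        namespace="example-namespace",
        image_digest="sha256:0000",
        runtime_policy_id="policy-example",
        policy_version="1",
        attestation_decision="deny",
        key_release_decision="deny",
        timestamp="2025-01-01T00:00:00Z",
        verifier_identity="example-verifier",
        failure_reason="measurement mismatch",
    )
    print(to_siem_record(event))
```

Because the dataclass has no free-form field, passing an unlisted keyword such as a prompt or key raises a `TypeError` at construction rather than leaking into the SIEM.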

Break-glass is explicit. The platform operator can restart, delete, reschedule, or replace the pod to recover availability, but break-glass doesn’t grant unencrypted keys or interactive access that bypasses the model-provider boundary. If break-glass changes a measured component, the old policy will no longer release production keys. Key release resumes only after the new measurement is reviewed, approved, and registered.
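The break-glass rule can be sketched as a registry of approved measurements that gates key release. This is a simplified model under stated assumptions: `MeasurementRegistry`, its method names, and the string measurements are hypothetical, and a real verifier would evaluate signed attestation evidence rather than compare strings.

```python
class MeasurementRegistry:
    """Hypothetical registry of measurements approved for key release."""

    def __init__(self):
        self._approved = set()

    def register(self, measurement: str, reviewed: bool, approved: bool) -> None:
        """Register a measurement only after review and approval."""
        if not (reviewed and approved):
            raise PermissionError("measurement must be reviewed and approved")
        self._approved.add(measurement)

    def may_release_key(self, measurement: str) -> bool:
        """Allow key release only for a registered measurement."""
        return measurement in self._approved


if __name__ == "__main__":
    registry = MeasurementRegistry()
    registry.register("measurement-before", reviewed=True, approved=True)

    # Break-glass replaced a measured component, so the workload now
    # presents a measurement the registry has never seen.
    print(registry.may_release_key("measurement-after"))  # denied

    # Key release resumes only after review, approval, and registration.
    registry.register("measurement-after", reviewed=True, approved=True)
    print(registry.may_release_key("measurement-after"))  # allowed
```

In this model the operator's restart, delete, or replace actions never touch the registry, so recovering availability and authorizing key release remain separate decisions.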

Detailed failure modes and acceptance tests are in Appendix E.