Operations and Failure Handling

Operators still need to run the service. They need health, capacity, failure, and attestation signals. They don’t need prompts, responses, model weights, keys, or decrypted intermediate data.

The operational requirement is simple: enough signal to debug availability, not enough to reconstruct confidential payloads.
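One way to read this requirement is as an allowlist: an event attribute reaches the operator only if it has been explicitly approved, and everything else is dropped by default. The sketch below illustrates that idea in plain Python. The attribute names and the `scrub_event` helper are hypothetical and not part of any telemetry library.

```python
# Hypothetical allowlist of attribute names an operator may see.
# Anything not listed here is dropped, so new fields are private by default.
ALLOWED_ATTRIBUTES = frozenset({
    "service.name",
    "status",
    "latency_ms",
    "gpu_utilization",
    "error_code",
})


def scrub_event(attributes: dict) -> dict:
    """Return a copy of the event that keeps only allowlisted attributes."""
    return {k: v for k, v in attributes.items() if k in ALLOWED_ATTRIBUTES}


# Example: a raw event that accidentally carries payload fields.
raw = {
    "service.name": "inference",
    "latency_ms": 182,
    "error_code": "UPSTREAM_TIMEOUT",
    "prompt": "confidential text",
    "response": "confidential text",
}
safe = scrub_event(raw)
```

An allowlist fails closed: a field added later stays out of telemetry until someone approves it, whereas a denylist would leak it until someone notices.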

  • Enterprises can gather telemetry through instrumentation such as OpenTelemetry. Payload logging and tracing should be off by default to protect enterprise data. The model provider controls which events the workload emits and is responsible for keeping secrets or confidential material out of debug traces. The platform operator controls whether those events are gathered.

  • Host metrics cover CPU, memory, NVMe, NIC, CVM launch process health, GPU device health, GPU mode, firmware state, and network reachability. Guest metrics cover service health, readiness, throughput, latency, GPU utilization, and application errors, all without payloads.

  • Attestation and key-release events go to SIEM with VM identity, image version, measurement ID, policy version, attestation and key-release decisions, timestamp, verifier identity, and failure reason. Payloads, keys, confidential model names, and customer data stay out.

  • Break-glass is explicit. The platform operator can relaunch or replace the CVM to recover availability, but break-glass doesn’t grant unencrypted keys or interactive access that bypasses the model-provider boundary. If break-glass changes a measured component, the old policy will no longer release production keys. Key release resumes only after the new measurement is reviewed, approved, and registered.
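The break-glass rule above can be sketched as a key-release gate keyed on registered measurements. This is a minimal illustration under stated assumptions, not a real verifier: the registry layout, the `decide_key_release` function, the field names, and the sample identifiers are all hypothetical. The event fields follow the SIEM list given earlier.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass(frozen=True)
class KeyReleaseEvent:
    """Hypothetical SIEM record: identifiers and decisions only, no payloads or keys."""
    vm_identity: str
    image_version: str
    measurement_id: str
    policy_version: Optional[str]
    decision: str
    failure_reason: Optional[str]
    verifier_identity: str
    timestamp: str


def decide_key_release(
    registry: Dict[str, str],
    vm_identity: str,
    image_version: str,
    measurement_id: str,
    verifier_identity: str,
) -> KeyReleaseEvent:
    """Release keys only when the measurement is registered in the policy registry.

    `registry` maps an approved measurement ID to the policy version that
    approved it (an assumed layout, for illustration).
    """
    policy_version = registry.get(measurement_id)
    approved = policy_version is not None
    return KeyReleaseEvent(
        vm_identity=vm_identity,
        image_version=image_version,
        measurement_id=measurement_id,
        policy_version=policy_version,
        decision="release" if approved else "deny",
        failure_reason=None if approved else "measurement_not_registered",
        verifier_identity=verifier_identity,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


# Made-up identifiers for illustration.
registry = {"sha384:aaaa": "policy-v7"}

# Normal operation: the measurement is registered, so keys are released.
before = decide_key_release(registry, "cvm-01", "img-1.4.2", "sha384:aaaa", "verifier-a")

# Break-glass changed a measured component: the old policy no longer matches.
after_break_glass = decide_key_release(registry, "cvm-01", "img-1.4.3", "sha384:bbbb", "verifier-a")

# This assignment stands in for the review, approval, and registration step.
registry["sha384:bbbb"] = "policy-v8"
after_approval = decide_key_release(registry, "cvm-01", "img-1.4.3", "sha384:bbbb", "verifier-a")
```

In this sketch the relaunched CVM is denied keys until the new measurement is registered, and each decision produces a record that carries a failure reason but no key material or customer data.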

Detailed failure modes and acceptance tests are in Appendix E.