Introduction

This Reference Architecture gives frontier and proprietary model vendors a proven way to securely deploy their latest models in enterprise-controlled environments. In those environments, customers can protect sensitive data, reduce dependency on metered API usage, and improve cost predictability for high-volume AI inference workloads.

Most inference today is a compromise between competing security interests. Model providers protect model IP by serving through cloud APIs. Enterprise customers, however, want to process data inside environments they control: cloud accounts, on-premises AI factories, or compute resources behind a VPC they configure.

This Reference Architecture proposes a better path. With NVIDIA Confidential Computing, the model provider retains exclusive control of model weights and source code even when the model is deployed on-premises. The enterprise data owner controls which inputs enter, where outputs are delivered, and what operational data may be logged or retained. Both parties can verify that inference runs, free from tampering, in the approved confidential environment.

_images/coco-secure-deployment-workflow.png

Figure 1: Secure model deployment workflow

The core idea of this Reference Architecture is that models are delivered to the compute environment fully encrypted. With confidential containers, Kubernetes still schedules and operates the workload, but each confidential pod runs inside a lightweight confidential VM. An attestation process verifies that the hardware, guest, runtime policy, and GPU state have not been tampered with, so that no unauthorized process is present to read data or change inputs or outputs. The key-release service verifies the attestation evidence and only then releases the decryption key, allowing the confidential workload to decrypt and read the model weights inside its protected memory. Inference then runs inside the customer-controlled environment, where the model provider sees neither the inputs nor the outputs.
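The attestation-gated key release described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used by either reference stack: every name here (`KeyReleaseService`, `Evidence`, `attest`, the component strings) is hypothetical, an HMAC with a shared demo key stands in for the hardware-rooted signatures that real CPU and GPU attestation reports carry, and no actual model decryption is performed.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass

# Assumption: a shared demo key stands in for hardware-rooted signing keys.
# Real attestation evidence is signed by keys rooted in the CPU and GPU vendors.
HARDWARE_ROOT_KEY = b"demo-hardware-root-key"

# The four claims named in the text: hardware, guest, runtime policy, GPU state.
CLAIMS = ("hardware", "guest", "runtime_policy", "gpu_state")


def measure(component: str) -> str:
    """Toy measurement: SHA-256 digest of a component description."""
    return hashlib.sha256(component.encode()).hexdigest()


def sign(measurements: dict, nonce: str) -> str:
    """Illustrative signature: HMAC over the nonce and the ordered measurements."""
    payload = nonce + "".join(measurements.get(c, "") for c in CLAIMS)
    return hmac.new(HARDWARE_ROOT_KEY, payload.encode(), hashlib.sha256).hexdigest()


@dataclass(frozen=True)
class Evidence:
    measurements: dict  # claim name -> hex digest
    nonce: str          # freshness challenge issued by the key-release service
    signature: str      # stand-in for a hardware-rooted signature


def attest(components: dict, nonce: str) -> Evidence:
    """Hypothetical attester: measure each component and sign the result."""
    measurements = {c: measure(components[c]) for c in CLAIMS}
    return Evidence(measurements, nonce, sign(measurements, nonce))


class KeyReleaseService:
    """Releases the model decryption key only after the evidence verifies."""

    def __init__(self, reference: dict, model_key: bytes):
        self._reference = reference  # approved measurements per claim
        self._model_key = model_key
        self._pending = set()        # nonces issued but not yet consumed

    def challenge(self) -> str:
        nonce = secrets.token_hex(16)
        self._pending.add(nonce)
        return nonce

    def release_key(self, evidence: Evidence) -> bytes:
        # Each nonce is single-use, so old evidence cannot be replayed.
        if evidence.nonce not in self._pending:
            raise PermissionError("unknown or replayed nonce")
        self._pending.discard(evidence.nonce)
        expected = sign(evidence.measurements, evidence.nonce)
        if not hmac.compare_digest(expected, evidence.signature):
            raise PermissionError("evidence signature invalid")
        for claim in CLAIMS:
            if evidence.measurements.get(claim) != self._reference[claim]:
                raise PermissionError(f"measurement mismatch: {claim}")
        return self._model_key


# Hypothetical approved configuration and a service that trusts it.
APPROVED = {
    "hardware": "cpu-tee-firmware-1.0",
    "guest": "guest-image-1.0",
    "runtime_policy": "policy-1.0",
    "gpu_state": "gpu-cc-mode-on",
}
service = KeyReleaseService({c: measure(v) for c, v in APPROVED.items()}, b"model-key")

# An untampered workload obtains the key.
good_evidence = attest(APPROVED, service.challenge())
key = service.release_key(good_evidence)

# A workload with a modified guest image is refused.
tampered = dict(APPROVED, guest="guest-image-modified")
try:
    service.release_key(attest(tampered, service.challenge()))
    tampered_released = True
except PermissionError:
    tampered_released = False
```

The point of the sketch is the ordering: the key leaves the service only after freshness, signature, and every measurement have been checked, so a workload that fails any check never holds the means to decrypt the weights.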

This Reference Architecture is described using vendor-neutral components. Two reference implementations are detailed in Reference Implementations: an upstream open-source stack and a Red Hat-based stack.

Comparison of Integration Methods

Table 1: Comparison of Integration Methods

| Deployment method | SaaS API | Open weights in a customer-controlled environment | Confidential Computing |
| --- | --- | --- | --- |
| Optimized to protect | Model provider IP | Enterprise data privacy | Both |

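Confidential computing gives both parties the protection they want most: a proprietary model can process sensitive enterprise data under a zero-trust operating model, with the model provider's IP and the enterprise's data each protected from the other party.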