The NCP Software Reference Guide introduces a layered approach to
implementing AI Services on top of the NCP Hardware Reference
Architecture (RA). This abstracted, layered architecture is presented in
two views: the Tenant Compute view and the Operator view.

Tenant Compute View

The tenant-consumed compute resources can be broken into the following
abstracted layers, as shown in the Tenant Compute View diagram.

- Infrastructure-as-a-Service (IaaS): This layer provides both Bare
Metal (BM) servers and Virtual Machines (VMs) as consumable
infrastructure. To support dynamic resource allocation, this
service responds to UI or API calls by creating isolated and sanitized
infrastructure for a tenant.
- Container-as-a-Service (CaaS): This is the managed K8s layer
built on top of the IaaS layer. It gives the end user all the
advantages of K8s (such as extensibility, modularity, an API-driven
interface, auto-scaling, and simplified scheduling) while providing the
operational abstraction and automation of a managed service. The CaaS
layer can be disaggregated and provided independently, or offered by the
NCP as part of an integrated platform solution.
- AI Platform-as-a-Service (PaaS): This is the primary application
layer for enabling GPU-based AI workloads. Slurm is widely used for
training and HPC use cases today, but there is increasing migration to
cloud-native AI platforms that are well suited to model development,
inference, and training (for example, Run:ai and other industry
platforms).
- Slurm: Although not a cloud-native AI PaaS, Slurm is a well-known
single-tenant AI platform that is especially useful for HPC and training
jobs. When running Slurm, the NCP can use either the open-source version
or NVIDIA® BCM Slurm, which is tailored to work well with NVIDIA GPUs.
- Ancillary compute/Native workloads: These run on the general-purpose
compute servers available for core/ancillary services in the data
center. These workloads (such as business logic, load balancers, and
database services) must be treated as first-class citizens by the NCP
software stack.
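
The IaaS behavior described in the list above (an API call that hands a tenant isolated, sanitized infrastructure) can be illustrated with a minimal in-memory sketch. Everything here is hypothetical and is not part of any NCP or NVIDIA API: the names `NodeKind`, `Node`, `IaaSPool`, `allocate`, and `release` are assumptions made for illustration, and the sanitization step is a placeholder because this guide does not specify what sanitization involves.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class NodeKind(Enum):
    """The two infrastructure types the IaaS layer exposes."""
    BARE_METAL = "bm"
    VIRTUAL_MACHINE = "vm"


@dataclass
class Node:
    name: str
    kind: NodeKind
    tenant: Optional[str] = None  # None means the node is in the free pool
    sanitized: bool = True        # set to False once a tenant has used the node


@dataclass
class IaaSPool:
    """Hypothetical stand-in for an IaaS allocation service."""
    nodes: List[Node] = field(default_factory=list)

    def allocate(self, tenant: str, kind: NodeKind) -> Node:
        """Assign the first free node of the requested kind to one tenant."""
        for node in self.nodes:
            if node.tenant is None and node.kind is kind:
                if not node.sanitized:
                    self._sanitize(node)
                node.tenant = tenant  # isolation: exactly one owner per node
                return node
        raise RuntimeError(f"no free {kind.value} node available")

    def release(self, node: Node) -> None:
        """Return a node to the pool and mark it as needing sanitization."""
        node.tenant = None
        node.sanitized = False

    def _sanitize(self, node: Node) -> None:
        # Placeholder only: the real sanitization steps are not described
        # in this guide, so this sketch just flips the flag.
        node.sanitized = True
```

In this sketch a released node is never handed to the next tenant until `_sanitize` has run, which mirrors the "isolated and sanitized" requirement. A real service would sit behind the UI or API mentioned above and would persist its state rather than keep it in memory.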

Operator View

The Operator View diagram gives a generalized view of the services and
features expected to run the AI workload execution stack and the
Control & Management stack.

A few of these services, such as the Software-Defined Networking (SDN)
controller and the AI Platform control planes, are key technologies for
the NCP Software Reference Guide and are discussed later in this
document.
This operator view presents the core capabilities that each layer in the software reference architecture provides.