NCP Software Reference Guide
The NCP Software Reference Guide introduces a layered approach to implementing AI services on top of the NCP Hardware Reference Architecture (RA). This abstracted, layered architecture can be seen from two views: the Tenant Compute view and the Operator view.
Tenant Compute View
The tenant-consumed compute resources can be broken into the following abstracted layers, as shown in the Tenant Compute View diagram.
Figure: Tenant View of the Software Reference Architecture
Infrastructure-as-a-Service (IaaS): This layer provides both Bare Metal (BM) servers and Virtual Machines (VMs) as consumable infrastructure. To support dynamic resource allocation, the service responds to UI or API calls by creating isolated, sanitized infrastructure for a tenant.
Container-as-a-Service (CaaS): This is the managed Kubernetes (K8s) layer built on top of the IaaS layer. It gives the end user the advantages of K8s (such as extensibility, modularity, an API-driven model, auto-scaling, and simplified scheduling) together with the operational abstraction and automation of a managed service. The NCP can offer the CaaS layer on its own (disaggregated) or as part of an integrated platform solution.
AI Platform-as-a-Service (PaaS): This is the primary application layer for enabling GPU-based AI workloads. Slurm is widely used for training and HPC use cases today, but there is an increasing migration to cloud-native AI platforms suited to model development, inference, and training (for example, Run.AI, among other industry platforms).
Slurm: Although not a cloud-native AI PaaS, Slurm is a well-known single-tenant AI platform that is especially useful for HPC and training jobs. When running Slurm, the NCP can use the open-source version or NVIDIA® BCM Slurm, which is tailored to work well with NVIDIA GPUs.
Ancillary compute/Native workloads: These are workloads that run on the general-purpose compute servers available for core and ancillary services in the data center. The NCP software stack must treat these workloads (such as business logic, load balancers, and database services) as first-class citizens.
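The IaaS behavior described in the list above (an API call that returns isolated, sanitized infrastructure for a tenant) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: `IaasPool`, `Node`, `Kind`, and the node and tenant names are hypothetical and do not correspond to any NCP or NVIDIA API. Sanitization is reduced to a flag that stands in for a real wipe or reprovisioning step.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    BARE_METAL = "bm"
    VIRTUAL_MACHINE = "vm"


@dataclass
class Node:
    name: str
    kind: Kind
    tenant: Optional[str] = None  # None means the node is in the free pool
    sanitized: bool = True        # True once wiped and safe to hand out


class IaasPool:
    """Hypothetical in-memory stand-in for an IaaS allocation service."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, tenant: str, kind: Kind) -> Node:
        # Hand out only free, sanitized nodes, so tenants never share a
        # node or inherit another tenant's state.
        for node in self.nodes:
            if node.kind is kind and node.tenant is None and node.sanitized:
                node.tenant = tenant
                node.sanitized = False  # the node now carries tenant state
                return node
        raise RuntimeError(f"no free {kind.value} node available")

    def release(self, node: Node) -> None:
        # Sanitize before the node returns to the free pool.
        node.tenant = None
        node.sanitized = True  # placeholder for a real wipe/reprovision


pool = IaasPool([
    Node("bm-0", Kind.BARE_METAL),
    Node("vm-0", Kind.VIRTUAL_MACHINE),
])
a = pool.allocate("tenant-a", Kind.BARE_METAL)
print(a.name, a.tenant)
```

In this sketch a second request for a bare-metal node fails until `pool.release(a)` is called, which mirrors the isolation requirement: a node belongs to at most one tenant at a time and is reset before reuse.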
Operator View
The following diagram shows a generalized view of the services and features expected to run the AI workload execution stack and the Control & Management stack:
Figure: Operator View of the Software Reference Architecture
A few of these services, such as the Software Defined Networking (SDN) controller and the AI Platform control planes, are key technologies for the NCP Software Reference Guide and are discussed later in this document.
This operator view presents the core capabilities that each layer in the software reference architecture provides.