Compute and Network Provisioning
This section outlines the requirements for provisioning compute and network resources. Compute instances may be provided as either bare-metal instances (via BMaaS) or virtual machines (via VMaaS) to support the NVIDIA DGX Cloud engagement. All operations must be controlled via a fully documented, secure API (gRPC or REST preferred), and all systems are expected to perform reliably at scale.
General, Compute and Lifecycle Management
| Req ID | Test Details (Legend) | Requirement Area | Description |
|---|---|---|---|
| CNP01 | BM: #49, #52, #53, #57, #106, #108, #105; VM: #42, #45, #46, #47, #50, #110, #111, TBD-topo | API/CLI Access | DGXC must have API or CLI access to the NCP provisioning system for: (1) node lifecycle management (create, update, delete, list, and power-state management such as reboot and on/off power cycling); (2) network configuration; (3) inventory and topology discovery; (4) security configuration (users, service accounts, groups, roles); (5) maintenance and operations (see later section). Storage is discussed in the storage section. |
| CNP02 | INFO | Declarative Resource Interfaces | For resources that require multiple steps and a workflow, provide an appropriate declarative mechanism; a Terraform provider is preferred (e.g., for automating filesystem provisioning). |
| CNP03 | TBD-topo | NVLink-Aware Allocation | For NVL72, the API must support NVLink domain-aware allocation. |
| CNP04 | BM: #51 VM: #45 | Resource States | Must support clear states for resources (e.g., instances, networks) where applicable; for example: provisioning, running, degraded, maintenance required, stopping, stopped, terminating, terminated. |
| CNP05 | VM: #112 TBD-tag | Tagging | Support for user-defined tags/labels and cloud-init metadata on instances. |
| CNP06 | TBD-console | Console Access | Serial console access is required (read-only is sufficient; interactive is preferred). Serial console output shall be logged and available for historical queries (at least 1 month of retention). |
| CNP07 | INFO | If VMaaS present: # VMs/Node | GPU nodes: no more than one VM per node. General-purpose CPU nodes: more than one VM per node, with the ability to select via memory/core-count shapes. |
| CNP08 | add | Stable Identifiers | All resources (e.g., nodes, switches) must have a stable, persistent ID that does not change during the resource's lifespan, even when the resource goes offline for a service event. VMs must also have stable identifiers. |
| CNP09 | TBD | Firmware | Between tenants, all firmware must be brought to a known-good state; all firmware must be cryptographically signed and attested during boot. |
| CNP10 | add | Remote Management | Platform management solutions (e.g., BMC) must support Redfish over TLS (IPMI disabled). |
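To illustrate the resource states called out in CNP04, a minimal sketch of one plausible lifecycle state machine follows. The state names come directly from the table; the transition set itself is an assumption for illustration, and the actual transitions would be defined by the NCP's provisioning system.

```python
from enum import Enum

class NodeState(Enum):
    PROVISIONING = "provisioning"
    RUNNING = "running"
    DEGRADED = "degraded"
    MAINTENANCE_REQUIRED = "maintenance_required"
    STOPPING = "stopping"
    STOPPED = "stopped"
    TERMINATING = "terminating"
    TERMINATED = "terminated"

# Hypothetical allowed transitions -- an assumption, not a normative mapping.
TRANSITIONS = {
    NodeState.PROVISIONING: {NodeState.RUNNING, NodeState.TERMINATING},
    NodeState.RUNNING: {NodeState.DEGRADED, NodeState.MAINTENANCE_REQUIRED,
                        NodeState.STOPPING, NodeState.TERMINATING},
    NodeState.DEGRADED: {NodeState.RUNNING, NodeState.MAINTENANCE_REQUIRED,
                         NodeState.TERMINATING},
    NodeState.MAINTENANCE_REQUIRED: {NodeState.RUNNING, NodeState.TERMINATING},
    NodeState.STOPPING: {NodeState.STOPPED},
    NodeState.STOPPED: {NodeState.RUNNING, NodeState.TERMINATING},
    NodeState.TERMINATING: {NodeState.TERMINATED},
    NodeState.TERMINATED: set(),  # terminal state: no further transitions
}

def can_transition(src: NodeState, dst: NodeState) -> bool:
    """Return True if the state change is permitted in this sketch."""
    return dst in TRANSITIONS[src]
```

An explicit transition table like this makes "clear resource states" testable: any API response reporting a transition outside the table can be flagged as a conformance failure.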
Boot Process and Disks
| Req ID | Test Details (Legend) | Requirement Area | Description |
|---|---|---|---|
| BOOT01 | #38, #39, #4, #43, #44, #58, add-k8s | Image Deployment & Updates | API-driven workflow allowing DGXC to deploy, update, and manage vendor-provided or custom disk images via bare-metal, VM, or k8s node pool provisioning. |
| BOOT02 | BM: #107 VM: #113 | Access to Instance Metadata from Guest OS | Support for cloud-init and instance metadata discovery via link-local addresses or virtual devices. |
| BOOT03 | #104, #40 | Custom Disk Images | Support for tenant-created custom OS images (e.g., raw, qcow2). API calls: get, list, create, delete. Images should be accessible across all tenant projects/clusters/environments. |
| BOOT04 | TBD | Node Local Storage | GPU and CPU nodes must support access to node-local storage (NVMe/SSD) for use as scratch storage or for caching services. |
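BOOT02 and CNP05 together imply that instances carry cloud-init user data and user-defined tags visible from the guest OS. As a hedged sketch, the snippet below renders a minimal `#cloud-config` user-data document that sets a hostname and writes the tags to a file on the guest; the file path `/etc/dgxc/tags.json` is a hypothetical choice, not part of the requirements.

```python
import json

def build_user_data(hostname: str, tags: dict) -> str:
    """Render a minimal #cloud-config user-data document that sets the
    hostname and records user-defined tags (CNP05) in a file on the guest.
    The target path is illustrative only."""
    lines = [
        "#cloud-config",
        f"hostname: {hostname}",
        "write_files:",
        "  - path: /etc/dgxc/tags.json",
        "    content: |",
    ]
    # Indent the JSON payload so it nests under the YAML block scalar.
    for line in json.dumps(tags, indent=2).splitlines():
        lines.append("      " + line)
    return "\n".join(lines) + "\n"
```

In practice this document would be passed to the provisioning API at instance creation and retrieved by cloud-init inside the guest via the link-local metadata endpoint or virtual device mentioned in BOOT02.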
SDN and Virtual Networking
This section covers the virtual networking requirements. Physical transport and network are discussed later in the document.
| Req ID | Test Details (Legend) | Requirement Area | Description |
|---|---|---|---|
| SDN01 | #11, #12, #13, #14, #115, #116 | Virtual Networking | Full API/CLI lifecycle management (Create, Read, Update, Delete, List) for software-defined private networks. Must support non-conflicting BYOIP (including 7.0.0.0/8) and stable private IP allocations. Applicable to all types of resource nodes (CPU, GPU, storage, etc.). |
| SDN02 | TBD | Security Groups | Support for VPC-style security groups (or equivalent), including IP/CIDR-based allow and deny rules. Must define scope/application at workload, node, service (e.g., K8s API Service), and subnet/tenant levels. |
| SDN03 | add | Security Operations | Full API/CLI capabilities to create, read, update, and delete security groups, including defined audit processes. |
| SDN04 | #16, #17, #18 | Tenant Isolation | Hard logical or physical network segmentation for out-of-band management (BMC), user traffic, and storage-specific operations. |
| SDN05 | #117 | Floating/Movable IP | Ability to move a floating private IP between nodes, either automatically or API-driven, in under 10 seconds and without requiring an instance reboot. |
| SDN06 | #118 | Localized DNS | Support for tenant-defined localized DNS configuration to enable internal domain resolution to private endpoints (e.g. storage endpoints) |
| SDN07 | #119 | VPC Peering | Support for cross-virtual-network connectivity with full bandwidth and no “hairpin” routing. |
| SDN08 | INFO | Storage Mesh Connectivity | The virtual network (from SDN01) must provide unrestricted L3 routing between all storage hosts, enabling full-mesh, all-to-all communication across different subnets without traversing a gateway. |
| SDN09 | INFO | Observability | The platform shall provide comprehensive logging for network infrastructure, including hardware faults, latency/performance fluctuations, and a detailed audit trail of all configuration changes to network filtering rules. |
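SDN02 requires IP/CIDR-based allow and deny rules but leaves evaluation order unspecified. The sketch below assumes one plausible semantics, first-match-wins with a default deny, so that a narrow deny rule can be placed ahead of a broader allow; this ordering model is an assumption for illustration, not a stated requirement.

```python
import ipaddress
from dataclasses import dataclass

@dataclass
class Rule:
    action: str   # "allow" or "deny"
    cidr: str     # source CIDR the rule matches
    port: int     # destination port the rule matches

def evaluate(rules: list, src_ip: str, port: int) -> bool:
    """First-match evaluation with a default deny: return True only if the
    first rule matching (source address, destination port) is an allow."""
    addr = ipaddress.ip_address(src_ip)
    for rule in rules:
        if addr in ipaddress.ip_network(rule.cidr) and rule.port == port:
            return rule.action == "allow"
    return False  # no rule matched: default deny
```

Under this model, `[deny 10.0.0.0/24:22, allow 10.0.0.0/8:22]` admits SSH from the wider /8 while carving out the /24, which is the kind of scoped rule composition SDN02's workload/node/subnet scoping implies.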