Compute and Network Provisioning

This section outlines the requirements for provisioning compute and network resources. Compute instances can be provided as either Bare Metal instances (via BMaaS) or Virtual Machines (via VMaaS) to support the NVIDIA DGX Cloud engagement. All operations must be controllable through a fully documented and secure API; gRPC or REST is preferred. All systems are expected to scale and to maintain performance at scale.

General, Compute and Lifecycle Management

| Req ID | Test Details (Legend) | Requirement Area | Description |
| --- | --- | --- | --- |
| CNP01 | BM: #49, #52, #53, #57, #106, #108, #105, VM: #42, #45, #46, #47, #50, #110, #111, TBD-topo | API/CLI Access | DGXC must have API or CLI access to the NCP provisioning system for: (1) node lifecycle management (create, update, delete, list, and power-state control: reboot, power on/off, power cycle); (2) network configuration; (3) inventory and topology discovery; (4) security configuration (users, service accounts, groups, roles); (5) maintenance and operations (see later section). Storage is discussed in the storage section. |
| CNP02 | INFO | Declarative Resource Interfaces | For resources whose provisioning requires multiple steps or a workflow (e.g. automating filesystem provisioning), provide an appropriate declarative mechanism. A Terraform provider is preferred. |
| CNP03 | TBD-topo | NVLink-Aware Allocation | For NVL72, the API must support NVLink domain-aware allocation. |
| CNP04 | BM: #51 VM: #45 | Resource States | Must support clear resource states (e.g. for instances and networks) where applicable. For example: provisioning, running, degraded, maintenance required, stopping, stopped, terminating, terminated. |
| CNP05 | VM: #112 TBD-tag | Tagging | Support for user-defined tags/labels and cloud-init metadata on instances. |
| CNP06 | TBD-console | Console Access | Serial console access is required (read-only is sufficient, interactive is preferred). Serial console output shall be logged and be available for historical queries (at least 1 month retention). |
| CNP07 | INFO | If VMaaS present: # VMs/Node | GPU nodes: no more than one VM per node. General-purpose CPU nodes: more than one VM per node, with the ability to select a shape by memory/core count. |
| CNP08 | add | Stable Identifiers | All resources (e.g. nodes, switches) must have a stable and persistent ID that does not change during the resource's lifespan, even when the resource goes offline for a service event. VMs must also have a stable identifier. |
| CNP09 | TBD | Firmware | Between tenants, all firmware must be returned to a known-good state. All firmware must be cryptographically signed and attested during boot. |
| CNP10 | add | Remote Management | Platform management solutions (e.g. BMC) must support Redfish over TLS; IPMI must be disabled. |
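
The resource states named in CNP04 can be modeled as an explicit state machine. The following is a minimal sketch, not a prescribed implementation: the state names come from the requirement, while the transition table (`ALLOWED_TRANSITIONS`) and the helper `can_transition` are hypothetical and only illustrate what "clear states" could mean to an API consumer.

```python
from enum import Enum


class ResourceState(Enum):
    """Resource states listed as examples in CNP04."""
    PROVISIONING = "provisioning"
    RUNNING = "running"
    DEGRADED = "degraded"
    MAINTENANCE_REQUIRED = "maintenance required"
    STOPPING = "stopping"
    STOPPED = "stopped"
    TERMINATING = "terminating"
    TERMINATED = "terminated"


S = ResourceState

# Hypothetical transition table. CNP04 lists the states only;
# the edges below are an assumption made for illustration.
ALLOWED_TRANSITIONS = {
    S.PROVISIONING: {S.RUNNING, S.TERMINATING},
    S.RUNNING: {S.DEGRADED, S.MAINTENANCE_REQUIRED, S.STOPPING, S.TERMINATING},
    S.DEGRADED: {S.RUNNING, S.MAINTENANCE_REQUIRED, S.STOPPING, S.TERMINATING},
    S.MAINTENANCE_REQUIRED: {S.RUNNING, S.STOPPING, S.TERMINATING},
    S.STOPPING: {S.STOPPED},
    S.STOPPED: {S.PROVISIONING, S.TERMINATING},
    S.TERMINATING: {S.TERMINATED},
    S.TERMINATED: set(),  # terminal state: no way back
}


def can_transition(current: ResourceState, target: ResourceState) -> bool:
    """Return True if the (assumed) table allows moving from current to target."""
    return target in ALLOWED_TRANSITIONS[current]
```

A client could call `can_transition(ResourceState.STOPPED, ResourceState.RUNNING)` before issuing a request; under the assumed table this is rejected because a stopped resource must pass through `PROVISIONING` first.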

Boot Process and Disks

| Req ID | Test Details (Legend) | Requirement Area | Description |
| --- | --- | --- | --- |
| BOOT01 | #38, #39, #4, #43, #44, #58, add-k8s | Image Deployment & Updates | API-driven workflow allowing DGXC to deploy, update, and manage vendor-provided or custom disk images via bare-metal, VM, or k8s node pool provisioning. |
| BOOT02 | BM: #107 VM: #113 | Access to Instance Metadata from Guest OS | Support for cloud-init and instance metadata discovery via link-local addresses or virtual devices. |
| BOOT03 | #104, #40 | Custom Disk Images | Support for tenant-created custom OS images (e.g. raw, qcow2). API calls: get, list, create, delete. Images should be accessible across all tenant projects/clusters/environments. |
| BOOT04 | TBD | Node Local Storage | GPU and CPU nodes must support access to node-local storage (NVMe/SSD) for use as scratch storage or for caching services. |
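
The four image operations in BOOT03 (get, list, create, delete) can be sketched as a small in-memory catalog. This is an illustrative model under stated assumptions, not the NCP API: the class `ImageCatalog`, the `Image` fields, and the tenant-level scoping are all hypothetical. Images are keyed by tenant rather than by project, which is one way to read "accessible across all tenant projects/clusters/environments".

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List

# Formats named in BOOT03; the requirement leaves the full list open.
SUPPORTED_FORMATS = {"raw", "qcow2"}


@dataclass(frozen=True)
class Image:
    image_id: str
    tenant: str
    name: str
    disk_format: str


class ImageCatalog:
    """Hypothetical tenant-scoped image catalog (in memory, for illustration)."""

    def __init__(self) -> None:
        self._images: Dict[str, Image] = {}

    def create(self, tenant: str, name: str, disk_format: str) -> Image:
        if disk_format not in SUPPORTED_FORMATS:
            raise ValueError(f"unsupported disk format: {disk_format}")
        image = Image(str(uuid.uuid4()), tenant, name, disk_format)
        self._images[image.image_id] = image
        return image

    def get(self, tenant: str, image_id: str) -> Image:
        image = self._images.get(image_id)
        # Images owned by another tenant are reported as not found.
        if image is None or image.tenant != tenant:
            raise KeyError(image_id)
        return image

    def list(self, tenant: str) -> List[Image]:
        return [i for i in self._images.values() if i.tenant == tenant]

    def delete(self, tenant: str, image_id: str) -> None:
        self.get(tenant, image_id)  # raises KeyError if missing or foreign
        del self._images[image_id]
```

Because lookups take only a tenant and an image ID, any project or cluster belonging to that tenant resolves the same image, while other tenants cannot see it.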

SDN and Virtual Networking

This section covers the virtual networking requirements. Physical transport and network are discussed later in the document.

| Req ID | Test Details (Legend) | Requirement Area | Description |
| --- | --- | --- | --- |
| SDN01 | #11, #12, #13, #14: #115 #116 | Virtual Networking | Full API/CLI lifecycle management (Create, Read, Update, Delete, List) for software-defined private networks. Must support non-conflicting BYOIP (including 7.0.0.0/8) and stable private IP allocations. Applicable to all types of resource nodes (CPU, GPU, Storage, etc.). |
| SDN02 | TBD | Security Groups | Support for VPC-style security groups (or equivalent), including IP/CIDR-based allow and deny rules. Must define scope/application at the workload, node, service (e.g. K8s API Service), and subnet/tenant levels. |
| SDN03 | add | Security Operations | Full API/CLI capabilities to Create, Read, Update, and Delete security groups, including defined audit processes. |
| SDN04 | #16, #17, #18 | Tenant Isolation | Hard logical or physical network segmentation for out-of-band management (BMC), user traffic, and storage-specific operations. |
| SDN05 | #117 | Floating/Movable IP | Ability to move a floating private IP between nodes, either automatically or via API, in under 10 seconds and without requiring an instance reboot. |
| SDN06 | #118 | Localized DNS | Support for tenant-defined localized DNS configuration to enable internal domain resolution to private endpoints (e.g. storage endpoints). |
| SDN07 | #119 | VPC Peering | Support for cross-virtual-network connectivity with full bandwidth and no “hairpin” routing. |
| SDN08 | INFO | Storage Mesh Connectivity | The virtual network (from SDN01) must provide unrestricted L3 routing between all storage hosts, enabling full-mesh, all-to-all communication across different subnets (without going through a gateway). |
| SDN09 | INFO | Observability | The platform shall provide comprehensive logging for network infrastructure, including hardware faults, latency/performance fluctuations, and a detailed audit trail of all configuration changes to network filtering rules. |
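
One way to read the "non-conflicting BYOIP" clause in SDN01 is that a tenant-supplied CIDR must not overlap any range already allocated in the same virtual network. The sketch below checks that with Python's standard `ipaddress` module. The function name `find_conflicts` and the example ranges are hypothetical, and the overlap interpretation itself is an assumption rather than something the requirement spells out.

```python
import ipaddress
from typing import Iterable, List


def find_conflicts(requested_cidr: str, existing_cidrs: Iterable[str]) -> List[str]:
    """Return the existing CIDRs that overlap the requested BYOIP range.

    An empty list means the requested range is non-conflicting under the
    assumed definition (no overlap with any allocated range).
    """
    requested = ipaddress.ip_network(requested_cidr)
    return [
        cidr
        for cidr in existing_cidrs
        if requested.overlaps(ipaddress.ip_network(cidr))
    ]


# Example ranges are illustrative only.
allocated = ["10.0.0.0/16", "7.1.0.0/16", "192.168.0.0/24"]
conflicts = find_conflicts("7.0.0.0/8", allocated)
```

With the example ranges, only `7.1.0.0/16` falls inside `7.0.0.0/8`, so a provisioning API applying this check would reject the request or ask the tenant to pick a different range.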