Storage Requirements

NCP must provide shared storage solutions (where applicable) that are manageable via standard APIs and UIs, and must include auditing rights for NVIDIA access.

Home Directory Storage

| Req ID | Test Details (Legend) | Requirement Area | Description |
| --- | --- | --- | --- |
| DIR01 | INFO | File Service uid/gid Quota feature | Configurable filesystem-wide limit, default user/gid quota settings, and per-uid/gid overrides available. Usage accounting for uid/gids when the feature is enabled (see the sketch following this table). |
| DIR02 | INFO | Must be NFS storage | NVIDIA requires the NFSv4 protocol for shared storage. Access control based on DLs requires POSIX. |
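
To make the DIR01 quota model concrete, here is a minimal Python sketch of the expected configuration surface: a filesystem-wide limit, default user/group quotas, and per-uid/gid overrides with accounting enabled. The field names and structure are illustrative assumptions, not any particular vendor's API.

```python
from dataclasses import dataclass, field

# Hypothetical illustration of the DIR01 quota model: a filesystem-wide
# limit, default per-user/group quotas, and per-uid/gid overrides.
# Field names are assumptions, not a specific vendor's API.

@dataclass
class QuotaPolicy:
    filesystem_limit_gib: int          # configurable filesystem-wide cap
    default_user_quota_gib: int        # applied to any uid without an override
    default_group_quota_gib: int       # applied to any gid without an override
    uid_overrides: dict[int, int] = field(default_factory=dict)
    gid_overrides: dict[int, int] = field(default_factory=dict)
    accounting_enabled: bool = True    # per-uid/gid usage accounting (DIR01)

    def effective_user_quota(self, uid: int) -> int:
        """Resolve the limit that enforcement would apply to one uid."""
        return self.uid_overrides.get(uid, self.default_user_quota_gib)

policy = QuotaPolicy(
    filesystem_limit_gib=500_000,
    default_user_quota_gib=1_000,
    default_group_quota_gib=10_000,
    uid_overrides={1234: 5_000},       # one uid granted extra space
)
assert policy.effective_user_quota(1234) == 5_000
assert policy.effective_user_quota(9999) == 1_000
```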

High-Speed Storage Service Requirements

| Req ID | Test Details (Legend) | Requirement Area | Description |
| --- | --- | --- | --- |
| HSS01 | add | Provisioning APIs | Storage provisioning may be via vendor portal/API or NCP portal/API (see the sketch following this table). |
| HSS02 | INFO | Performance | Must provision the throughput requested, in terms of both minimum bandwidth and IOPS. |
| HSS03 | INFO | Integration | K8s: CSI support. Break/fix API required to report storage issues. |
| HSS04 | INFO | Quota Support | Ability to set quota limits on specific user workloads / volumes. |
| HSS05 | INFO | Upgrade, Maintenance | Provider / NCP initiates desired maintenance. NVIDIA can schedule the actual maintenance window and can defer maintenance by up to 2 weeks. Upgrades should be non-disruptive. |
| HSS06 | INFO | RDMA Memory Protection | Storage systems using RDMA must enforce memory protection via authorization keys for both local and remote access. |
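
As a rough illustration of HSS01 and HSS02, the following Python sketch builds a provisioning request that carries explicit minimum bandwidth and IOPS targets. The payload schema and every field name are hypothetical assumptions; an actual vendor or NCP portal/API defines its own shape.

```python
import json

# A minimal sketch of an HSS01-style provisioning request. The schema and
# field names are hypothetical; it illustrates only that a request must
# carry the minimum bandwidth and IOPS targets called out in HSS02.

def build_provisioning_request(capacity_tib: int,
                               min_read_gbps: float,
                               min_write_gbps: float,
                               min_iops: int) -> str:
    """Serialize a storage provisioning request as JSON."""
    payload = {
        "filesystem": {
            "capacity_tib": capacity_tib,
            "expandable": True,            # HSS10: live expansion
        },
        "performance": {                   # HSS02: provision what is requested
            "min_read_bandwidth_gbps": min_read_gbps,
            "min_write_bandwidth_gbps": min_write_gbps,
            "min_iops": min_iops,
        },
        "integration": {"kubernetes_csi": True},  # HSS03
    }
    return json.dumps(payload, indent=2)

print(build_provisioning_request(1024, 500.0, 300.0, 1_000_000))
```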

High-Speed Storage Filesystem Requirements

| Req ID | Test Details (Legend) | Requirement Area | Description |
| --- | --- | --- | --- |
| HSS07 | INFO | Parallel High-Speed Filesystem | Parallel or multi-path high-speed filesystem that supports scaling to thousands of simultaneous clients while sustaining requested performance. |
| HSS08 | INFO | Single File System Size | It must be possible to allocate a file system of at least 1 PiB even if the initial request is smaller, growing to > 10 PiB as cluster size increases. This hard requirement may be higher for a specific site; if so, it will be communicated via the ancillary services document. |
| HSS09 | INFO | Multiple Filesystems (Fungible Total Capacity) | More than one filesystem may exist within the total capacity. Minimum file system size <= 50 TiB. |
| HSS10 | INFO | Filesystem Expansion | Live file system expansion is supported, in terms of capacity, inodes, IO performance, and metadata operations performance. Performance should scale linearly with capacity. |
| HSS11 | INFO | Client | Ability to describe the client: in-kernel, userspace, or bare-metal client installation requirements. Support integration with client kernels / OSes used by NVIDIA, as needed. DKMS-enabled packages available for Ubuntu 20.04-, 22.04-, and 24.04-based operating systems. ARM64 versions compatible with GB200-ready kernels (e.g., Linux 6.8.x) are mandatory. The Managed Storage Service Provider will provide client configuration best practices and configuration guidelines for filesystem options and kernel module configuration to reliably achieve optimal performance on ARM and x86_64-based clients. |
| HSS12 | INFO | Quota (User, Project & Group) | Must support soft and hard quotas: uid / gid / project (directory)-id quotas with enforcement. |
| HSS13 | INFO | Root-squash | NVIDIA must be able to enable, disable, and manage root-squash at any time. |
| HSS14 | INFO | flock | It must be possible to mount the file system with flock (see the probe sketch following this table). |
| HSS15 | INFO | Ability to Audit Changes | Enable NVIDIA to access changelog data for filesystem auditing and detailed user operations tracking. Tracking by uid/gid: create files, create dirs, rename files, rename dirs, delete files, delete dirs. |
| HSS16 | INFO | HA | All services are required to tolerate any critical component failure in the backend and provide continued client access to all storage services in such cases. |
| HSS17 | INFO | Multi-Node Coherency | One second or less for client attribute and dentry cache updates/invalidations. |
| HSS18 | INFO | Client Multipathing | Clients must have multipathing to all storage servers. |
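
A simple acceptance probe for HSS14 can be written with Python's standard fcntl module: if the filesystem is mounted with flock support, acquiring and releasing an exclusive advisory lock on a scratch file should succeed. The mount point below is a placeholder for the filesystem under test.

```python
import fcntl
import sys

# HSS14 probe: verify that flock works on the mounted high-speed filesystem.
# The mount point path is an assumption; substitute the filesystem under test.

MOUNT_POINT = "/mnt/hss"  # hypothetical mount point

def flock_works(path: str) -> bool:
    """Return True if an exclusive flock can be acquired on a scratch file."""
    try:
        with open(f"{path}/.flock_probe", "w") as f:
            fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)  # non-blocking acquire
            fcntl.flock(f, fcntl.LOCK_UN)                  # release
        return True
    except OSError as err:
        print(f"flock probe failed: {err}", file=sys.stderr)
        return False

if __name__ == "__main__":
    sys.exit(0 if flock_works(MOUNT_POINT) else 1)
```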

Data Movement Systems Requirements

The Data Movement system is used to copy data from an external data source (NVIDIA, another cloud, etc.) to the NCP data center.
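
As a minimal sketch of what such a transfer might look like on the Data Mover nodes described in the table below, the following Python program copies a staged dataset onto the shared filesystem in parallel. The paths and worker count are assumptions; a production mover would add checksumming, retries, and restartability.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Illustrative only: a trivially parallel data mover that copies a staged
# external dataset onto the shared filesystem that is also mounted on the
# GPU nodes (DMS02/DMS03). Paths and worker count are placeholders.

SOURCE = Path("/staging/external_dataset")   # hypothetical ingress location
DEST = Path("/mnt/hss/datasets/external")    # same filesystem as GPU nodes

def copy_one(src: Path) -> Path:
    target = DEST / src.relative_to(SOURCE)
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, target)   # preserves mtimes for idempotent re-runs
    return target

files = [p for p in SOURCE.rglob("*") if p.is_file()]
with ThreadPoolExecutor(max_workers=16) as pool:
    for done in pool.map(copy_one, files):
        print(f"copied {done}")
```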

| Req ID | Test Details (Legend) | Requirement Area | Description |
| --- | --- | --- | --- |
| DMS01 | add | Dedicated K8s Cluster | Provider-managed k8s cluster (or the ability to stand up our own) for the Data Mover stack, available ahead of the GPU cluster bring-up to pre-stage data. |
| DMS02 | INFO | Data Mover Nodes (CPU) | Dedicated CPU nodes for running the data mover; needs high-performance networking (exact quantity will be communicated via the ancillary services doc). |
| DMS03 | INFO | Access to Same GPU Storage | Same filesystem as mounted on the GPU nodes mounted on the Data Mover nodes (or the ability to mount the same filesystem via CSI). |
| DMS04 | INFO | Access to NVIDIA Corp Net | Dedicated link (as described in network transport) to the NVIDIA corp net, preferably with VPN, but otherwise with a stable IP for allowlisting. |
| DMS05 | add | Stable Egress IP | Stable IP for IP-allowlist access to NVIDIA services (e.g., similar to a NAT gateway; see the sketch following this table). |
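
The sketch below is one way to spot-check DMS05 from a data mover node: it asks the kernel which local source IP it would use to reach a remote service. The expected address and probe endpoint are placeholders, and behind a NAT gateway the translated public address must instead be verified on the provider side.

```python
import socket

# DMS05 sanity check sketch: confirm that traffic toward an allowlisted
# service would leave via the expected source IP. The expected address and
# probe endpoint are placeholders, not real NVIDIA endpoints. Behind a NAT
# gateway this shows only the local interface IP, not the translated one.

EXPECTED_EGRESS_IP = "203.0.113.10"          # documentation-range placeholder
PROBE_HOST, PROBE_PORT = "example.com", 443  # stand-in for an allowlisted service

def local_source_ip(host: str, port: int) -> str:
    """Return the local source IP the kernel picks for a route to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect((host, port))  # UDP connect selects the route; nothing is sent
        return s.getsockname()[0]

if __name__ == "__main__":
    ip = local_source_ip(PROBE_HOST, PROBE_PORT)
    status = "OK" if ip == EXPECTED_EGRESS_IP else "MISMATCH"
    print(f"{status}: egress source IP is {ip}")
```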

DGXC-Managed Storage System Deployment

For scenarios where the storage system software will be deployed and managed by DGXC rather than the NCP, the following requirements apply. These requirements enable DGXC to operate storage systems (such as high-speed parallel filesystems, capacity object storage, or block storage) using NCP-provided infrastructure while maintaining operational control.

Host Provisioning and Lifecycle

| Req ID | Test Details (Legend) | Requirement Area | Description |
| --- | --- | --- | --- |
| STG01 | INFO | Operating System Support | NCP must support a workflow that allows DGXC storage operators to integrate vendor-provided or storage-specific operating system images via bare-metal or VM provisioning for storage servers. The workflow must: (a) allow DGXC to deploy custom OS images (e.g., vendor-enhanced kernels for Lustre, Rocky Linux, Ubuntu 20.04/22.04/24.04). |
| STG02 | add | Drive Sanitization Policy | Cryptographically erase data drive contents between storage system tenants, with full attestation of host firmware. Must support an optional flag to skip drive sanitization during break/fix flows (e.g., power supply replacement) where tenancy does not change. Critical hardware component replacements may require sanitization without override; this is inclusive of GPU / CPU node local storage (see the sketch following this table). |
| STG03 | INFO | Stable IP Assignment | Storage nodes must support static IP addressing that remains stable during host lifecycle operations and does not reset between maintenance events. |
| STG04 | INFO | Out-of-Band Failure Detection | NCP must provide the ability to detect system failures out-of-band, including device, network, memory, and drive failures, enabling DGXC to respond proactively to hardware issues. |
| STG05 | INFO | Topology Observability | NCP must provide visibility into failure domains to enable DGXC to provision storage nodes with physical diversity. Storage systems must be able to provision nodes that purposefully span failure domains for resilience. |
| STG06 | INFO | BlueField/DPU Support | For storage systems utilizing BlueField-based architectures, the host provisioning system must support lifecycle management and specific configuration requirements for BlueField "JBOF" systems that export NVMe-oF to hosts. |
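
To illustrate the STG02 flow, here is a hedged Python sketch that drives nvme-cli's `nvme format` with Secure Erase Setting 2 (cryptographic erase) and honors the optional skip flag only when tenancy is unchanged. The device list and decision inputs are assumptions for illustration; attestation of host firmware is out of scope here.

```python
import subprocess

# STG02 sketch: cryptographic erase of NVMe data drives between tenants,
# with an optional skip for break/fix flows where tenancy does not change.
# Uses nvme-cli's `nvme format` with Secure Erase Setting 2 (crypto erase).
# The device list and the decision inputs are placeholders.

DATA_DRIVES = ["/dev/nvme0n1", "/dev/nvme1n1"]  # hypothetical inventory

def sanitize_drives(tenancy_changed: bool, skip_requested: bool) -> None:
    # The skip flag is honored only when the tenant is unchanged (e.g. a
    # power supply swap); a tenancy change always forces sanitization.
    if skip_requested and not tenancy_changed:
        print("break/fix flow, tenant unchanged: skipping sanitization")
        return
    for dev in DATA_DRIVES:
        # --ses=2 selects cryptographic erase (discards the media key);
        # --force skips nvme-cli's interactive confirmation.
        subprocess.run(["nvme", "format", dev, "--ses=2", "--force"],
                       check=True)
        print(f"crypto-erased {dev}")

sanitize_drives(tenancy_changed=True, skip_requested=False)
```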