NVIDIA GPUDirect Storage Design Guide

The purpose of this Design Guide is to show OEMs, CSPs, and ODMs how to design their servers to take advantage of GPUDirect Storage, and to help application developers understand where GPUDirect Storage can add value to application performance.

1. Introduction

This section provides an introduction to NVIDIA® GPUDirect® Storage (GDS).

GDS is the newest addition to the GPUDirect family. Like GPUDirect peer to peer (https://developer.nvidia.com/gpudirect), which enables a direct memory access (DMA) path between the memories of two graphics processing units (GPUs), and GPUDirect RDMA, which enables a direct DMA path to a network interface card (NIC), GDS enables a direct DMA data path between GPU memory and storage, avoiding a bounce buffer through the CPU. This direct path can increase system bandwidth while decreasing latency and utilization load on the CPU and GPU (see Figure 1). Some people define a supercomputer as a machine that turns a compute-bound problem into an IO-bound problem. GDS helps relieve the IO bottleneck to create more balanced systems.

The GDS feature is exposed via new cuFile APIs that are being added to NVIDIA® CUDA®. It is delivered via a separate package, consisting of a user-level library, libcufile.so, and a kernel driver, nvidia-fs.ko. The user-level library will eventually be integrated into the CUDA user-level runtime. The kernel driver, which is initially delivered and installed separately, will later be installed with the NVIDIA driver.

2. Data Transfer Issues for GPU and Storage

This section describes the issues that arise when data is transferred between GPU memory and storage.

The movement of data between GPU memory and storage is set up and managed using system software drivers that execute on the CPU. We refer to this as the control path. Data movement may be managed by any of the three agents listed below.

  • The GPU and its DMA engine. The GPU’s DMA engine is programmed by the CPU. Third party devices do not generally expose their memory to be directly addressed by another DMA engine. Therefore, the GPU’s DMA engine can only copy to and from CPU memory, implying the use of a bounce buffer in CPU memory.
  • The CPU doing loads and stores. CPUs generally cannot copy directly between two other devices, so the CPU must use an intermediate bounce buffer in CPU memory.
  • A DMA engine near storage, for example, in an NVMe drive, NIC, or storage controller such as a RAID card. The GPU PCIe Base Address Register (BAR) addresses can be exposed to other DMA engines. GPUDirect RDMA, for example, exposes these to the DMA engine in the NIC, via the NIC’s driver. NIC drivers from Mellanox and others support this. However, when the endpoint is in file system storage, the operating system gets involved. Unfortunately, today’s OSes do not support passing a GPU PCIe BAR address down through the file system.

3. GPUDirect Storage Benefits

This section provides the benefits of using GDS.

Using the GDS functionality avoids the use of a “bounce buffer” in CPU system memory, where the bounce buffer is defined as a temporary buffer in system memory to facilitate data transfers between two devices such as a GPU and storage.

The following performance benefits can be realized by using GPUDirect Storage:

  • Bandwidth: The PCIe bandwidth into and out of a CPU may be lower than the bandwidth capabilities of the GPUs. This difference can be due to fewer PCIe paths to the CPU, depending on the PCIe topology of the server. GPUs, NICs, and storage devices sitting under a common PCIe switch will typically have higher PCIe bandwidth between them. PCIe traffic to the CPU may also incur snooping overhead that impacts bandwidth. Utilizing GPUDirect Storage should alleviate those CPU bandwidth concerns, especially when the GPU and storage device sit under the same PCIe switch. As shown in Figure 1, GDS enables a direct data path (green) rather than an indirect path (red) through a bounce buffer in the CPU. This boosts bandwidth, lowers latency, and reduces CPU and GPU throughput load. It enables the DMA engine near storage to move data directly into GPU memory.

    Figure 1. Comparing GPUDirect Storage Paths



  • Latency: The use of a bounce buffer results in two copy operations:
    • Copying data from the source into the bounce buffer.
    • Copying again from the bounce buffer to target device.

    A direct data path has only one copy, from source to target. If the CPU performs the data movement, latencies may be impacted by conflicts over CPU availability, which can lead to jitter. GDS mitigates those latency concerns.
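    The effect of the extra copy can be sketched with a simple back-of-the-envelope model. The bandwidths below are illustrative assumptions, not measurements:

    ```python
    def bounce_latency(nbytes, bw_storage_to_cpu, bw_cpu_to_gpu):
        # Two serialized copies: storage -> CPU bounce buffer -> GPU memory.
        return nbytes / bw_storage_to_cpu + nbytes / bw_cpu_to_gpu

    def direct_latency(nbytes, bw_storage_to_gpu):
        # One copy: storage -> GPU memory via the direct DMA path.
        return nbytes / bw_storage_to_gpu

    GB = 1e9
    # Assume both paths run at 12 GB/s for a 1 GB transfer.
    t_bounce = bounce_latency(1 * GB, 12 * GB, 12 * GB)
    t_direct = direct_latency(1 * GB, 12 * GB)
    print(t_bounce / t_direct)  # 2.0: two serialized copies take twice as long
    ```

    In practice the two legs of the bounce path rarely run at identical speeds, and buffer management adds further cost, but the copy-count argument already accounts for much of the latency difference.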

  • CPU Utilization: If the CPU is used to move data, overall CPU utilization increases and interferes with the rest of the work on the CPU. Using GDS reduces the CPU workload, allowing the application code to run in less time; both the CPU's compute and memory-bandwidth bottlenecks are relieved.

    Once data no longer needs to follow a path through CPU memory, new possibilities are opened.

  • New PCIe Paths: Consider systems where there are two levels of PCIe switches. Two to four NVMe drives hang off the first level of switches in each PCIe tree. If fast enough drives are used, they can nearly saturate the PCIe bandwidth through the first-level PCIe switch. The NVIDIA GPUDirect Storage engineering team measured 13.3 GB/s from a set of 4 drives in a 2x2 RAID 0 configuration. Use of RAID 0 on the control path via the CPU does not impede a direct data path. In an NVIDIA DGX™-2, eight PCIe slots hang off the second-level switches, which may be populated with either NICs or RAID cards. In this configuration, NICs have been measured at 11 GB/s and RAID cards at 14 GB/s. These two paths, from local storage and remote storage, can be used simultaneously and, importantly, bandwidth is additive across the system.

  • PCIe ATS: As PCIe Address Translation Service (ATS) support is added to devices, they may no longer need to use the CPU’s input output memory management unit (IOMMU) for the address translation that’s required for virtualization. Since the CPU’s IOMMU is not needed, the direct path can be taken.

  • Capacity and Cost: When data is copied through the CPU’s memory, space must be allocated in CPU memory. CPU memory capacity is limited, typically on the order of 1 TB, and higher-density memory is the most expensive. Local storage can have a capacity on the order of tens of TB, and remote storage capacity can be in petabytes; disk storage is also much cheaper per byte than CPU memory. To GDS, it does not matter where the storage is: it may be in the node, in the same rack, or far away.

  • Memory Allocation: CPU bounce buffers must be managed: allocated and deallocated. This takes time and energy. In some scenarios, that buffer management can get on the critical path for performance. If there is no CPU bounce buffer, this management cost is avoided. When no bounce buffer is needed on the CPU, system memory is freed for other purposes.

  • Asynchrony: While the initial set of cuFile APIs is not asynchronous, forthcoming enhancements will add a stream parameter, enabling asynchronous execution.

In Figure 2, an NVIDIA DGX-2 system has two CPU sockets, and each has two PCIe trees. Each of the four PCIe trees (just one is shown) has two levels of switches. Up to four NVMe drives hang off the first level of switches. Each second-level switch has a connection to the first-level switch, a PCIe slot that can be populated with a NIC or RAID card, and two GPUs.

Figure 2. Sample Topology for Half a System



4. Application Suitability

This section provides information about application suitability for GDS.

Several conditions must hold for an application to enjoy the benefits provided by GDS:
  • Data transfers or IO transfers are directly to and from the GPU, not through the CPU.
  • IO must be a significant performance bottleneck.
  • Data transfers or IO transfers must be explicit.
  • Buffers must be pinned in the GPU memory.
  • CUDA and the cuFile APIs must be used along with GPUDirect capable NVIDIA® GPUs (Quadro® or Tesla® only).

4.1. Transfers To and From the GPU

Here is some information about data transfers to and from a GPU.

GPUDirect Storage enables direct data transfers between GPU memory and storage. If an application uses the CPU to parse or process the data before or after GPU computation, then GPUDirect Storage doesn’t help. To benefit, the GPU must be the first and/or last agent that touches data transferred to or from storage.

4.2. IO Bottleneck

Here is some information to help you understand IO bottlenecks.

For IO to be a bottleneck, it must be on the critical path. If computation time is far greater than the IO time, then GPUDirect Storage provides little benefit. If IO time can be fully overlapped with computation, e.g. with asynchronous IO, then it need not be a bottleneck. Workloads that stream large quantities of data and perform small amounts of compute on each data element tend to be IO bound.
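Whether IO is actually a bottleneck can be reasoned about with a small model. The timings below are hypothetical, and the model assumes IO is either fully serialized or fully overlapped with compute:

```python
def wall_time(compute_s, io_s, overlapped):
    # With fully overlapped (asynchronous) IO, the slower phase
    # dominates; with serialized IO, the two phases add.
    return max(compute_s, io_s) if overlapped else compute_s + io_s

# An IO-bound workload: 2.0 s of IO for every 0.5 s of compute.
print(wall_time(0.5, 2.0, overlapped=False))  # 2.5 -- IO dominates
print(wall_time(0.5, 2.0, overlapped=True))   # 2.0 -- still IO-bound
```

When IO time exceeds compute time, as in this example, even perfect overlap leaves the workload IO-bound, which is exactly the case where GPUDirect Storage helps.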

4.3. Explicit

Here is some information about the explicit APIs in GDS.

The APIs provided by GDS are explicit, like Linux pread and pwrite, rather than implicit, using a memory-faulting model. This may require changing some application code, for example, switching away from a model that mmaps a file and then faults data in as it is accessed on the GPU. The explicit model delivers higher performance because it is proactive and efficient, rather than a reactive pattern that can induce jitter.
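As a point of comparison, the explicit POSIX model that the cuFile APIs resemble looks like this on the CPU side. This is a minimal Python sketch using a temporary file; O_DIRECT and GPU buffers are omitted:

```python
import os
import tempfile

# Explicit transfer: the caller names the file descriptor, offset,
# and size up front, rather than faulting pages in via mmap.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"0123456789abcdef")
    data = os.pread(fd, 8, 4)  # read 8 bytes starting at offset 4
finally:
    os.close(fd)
    os.remove(path)

print(data)  # b'456789ab'
```

The key property is that the transfer is fully described at the call site, so the runtime can set up the most efficient path before any data moves.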

4.4. Pinned

Here is some information about memory that needs to be pinned for DMA transfers.

The memory on the GPU must be pinned to enable DMA transfers. This requires that memory be allocated with cudaMalloc rather than cudaMallocManaged or malloc. This restriction might be relaxed in the future, with more OS enablement. Naturally, each data transfer must fit into the allocated buffer. The transfer does not need to be aligned to anything other than a byte boundary.

4.5. cuFile APIs

Here is some information about the cuFile APIs.

Application and framework developers enable GPUDirect Storage capabilities by incorporating the cuFile APIs, which will be provided in an upcoming Open Beta program. Applications can use the cuFileRead and cuFileWrite APIs directly, or they can leverage frameworks and higher-level APIs, such as RAPIDS cuDF, that take advantage of cuFileRead and cuFileWrite. These APIs enable reads and writes similar to POSIX pread and pwrite with O_DIRECT, along with driver initialization and finalization, buffer registration, and more. The cuFileRead and cuFileWrite transfers are explicit and direct, thereby enabling maximum performance.

Any application currently using mmap takes an indirect, slower path because data is loaded from storage to CPU memory and then from CPU memory to GPU memory. To use cuFileRead and cuFileWrite, GPU memory must be allocated with cudaMalloc (so that it is pinned) rather than with cudaMallocManaged. When applications know exactly what data to transfer and where, using these APIs is intended to be as simple and straightforward as possible.

5. System Requirements

This section provides the software and hardware requirements for GDS.

GDS is currently available as a limited distribution Alpha release unbundled from CUDA and the associated NVIDIA drivers. It has been made available to select customers and partners for functional evaluation, early performance characterization and developer usability feedback.

Here are the hardware and software requirements for GDS:

Software Requirements

Here are the software requirements for GDS:

  • OS: GPUDirect Storage is only supported on Linux, currently Ubuntu 20.04.
  • File or block system: A GDS-enabled distributed file system or block system must be used.

    This requires installing a kernel-level driver which is a privileged operation.

  • No CPU for IO: File operations that directly involve the CPU, such as RAID 5 or 6, checksums, or compression (ZFS, BTRFS) cannot be used with GPUDirect Storage.
  • Virtualization: Not supported.
  • SBIOS: On some desktop and workstation motherboards, the SBIOS limits the size of the GPU PCIe BAR1 resource, that is, the size of the window of addresses that can be exposed to other DMA engines at any given time.

    A full 64-bit window must be enabled. The name of the variable to change varies across vendors. For a list of Tesla-certified servers, see QUALIFIED SERVER CATALOG.

MOFED and Filesystem Requirements

Here are the requirements:
  • Ubuntu 18.04 and 20.04
  • MOFED 5.1-0.6.6.0 and later, which supports NVMe, NVMe-oF, and NFSoRDMA (VAST) on Linux kernels 4.15.x and 5.4.x
  • The following distributed filesystems:
    • WekaFS 3.8.0
    • DDN Exascaler 5.2
    • VAST

Hardware Requirements

Here are the hardware requirements for GDS:

  • CPU: We currently support Intel and AMD CPUs, but we do not support Arm or IBM POWER platforms.

    Some CPUs offer limited bandwidth when transfers originate with PCIe devices rather than initiating from the CPU.

  • GPUs: NVIDIA Volta™ V100 GPUs and NVIDIA Ampere Architecture GPUs are supported and offer the best performance available today.

    In general, NVIDIA SKUs that support GDS today, including Tesla and Quadro SKUs, will be supported as testing and QA schedules allow.

    • NVIDIA® GeForce®, Tegra®, and Jetson™ platforms are not supported.
    • Many Quadro products and some of the smaller profile Tesla SKUs have smaller BAR1 sizes and will perform differently than NVIDIA V100 GPUs.
  • PCIe Peer-to-Peer (P2P): PCIe P2P is a prerequisite for GPUDirect Storage.
    • PCIe P2P is relevant between the CPU root complex and its endpoints within one PCIe tree, and between CPUs when a CPU-CPU connection is a segment in the path between endpoints in two PCIe trees.
    • When PCIe P2P support is lacking in the CPU, all relevant endpoints must communicate through a PCIe switch instead, without using the connections to the CPU.

    • Even when they support PCIe P2P, currently shipping mainstream CPUs may limit the performance of P2P traffic, for example, delivering 80% or less of the expected read bandwidth.
    • If the endpoints of interest, such as a GPU and a NIC, are in different PCIe trees, PCIe switches can connect them without involving the CPU only by using a special “fabric mode.” Systems with such support are not common.
    • See the GPUDirect RDMA documentation for more information.

      You can use lspci to check the PCI topology:
      $ lspci -t

6. Platform Performance Suitability

GPUDirect Storage benefits can be maximized under the following conditions:

6.1. Bandwidth from Storage

Here is some information about bandwidth usage from storage.

For remote storage, there is benefit to a higher ratio of NICs or RAID cards to GPUs, up to the limits of IO demand.

For local storage, a larger number of drives is needed to approach PCIe saturation, and the number of drives is of first-order importance. It takes at least 4 x4 PCIe drives to saturate a x16 PCIe link, and the IO storage bandwidth of a system is proportional to the number of drives. Many systems, such as an NVIDIA DGX-2, can take at most 16 drives, which are attached via the level-1 PCIe switches. The peak bandwidth per drive is of secondary importance. NVMe drives tend to offer higher bandwidth and lower latency than SAS drives, and some file system and block system vendors support only NVMe drives, not SAS drives.
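The "at least 4 drives" figure follows directly from lane counts. This is a rough model that ignores protocol overhead and assumes each drive can saturate its own lanes:

```python
import math

def drives_to_saturate(link_lanes=16, drive_lanes=4):
    # Each x4 NVMe drive supplies at most a quarter of the lanes
    # (and roughly a quarter of the bandwidth) of a x16 link, so
    # four such drives are the minimum needed to saturate it.
    return math.ceil(link_lanes / drive_lanes)

print(drives_to_saturate())       # 4 x4 drives per x16 link
print(drives_to_saturate(16, 2))  # 8 slower x2 drives would be needed
```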

6.2. Paths from Storage to GPUs

Here is some information about the paths from storage to the GPUs.

PCIe switches aren’t required to achieve some of the performance benefits, since a direct path between PCIe endpoints may pass through the CPU without using a bounce buffer.

The use of PCIe switches can increase the peak bandwidth between NICs or RAID cards or local drives and GPUs. One level of switches on each PCIe tree can double potential bandwidth; with a second level, the potential bandwidth can quadruple to approach the peak input bandwidth. For example:
  • The first level of switches in an NVIDIA DGX-2 enables a GPU to receive input simultaneously from the CPU (12-12.5 GB/s) and local storage (13.3 GB/s).
  • The second level of switches in an NVIDIA DGX-2 enables that >25 GB/s to be combined with IO from RAID cards (14 GB/s) or NICs (11-12 GB/s). This can provide 11-14 GB/s for each of 16 GPUs, which is on the order of 200 GB/s for the system.

These comparisons are displayed in Figure 3:

Figure 3. Comparing the Paths from Storage to the GPUs
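The additive-bandwidth argument can be checked against the numbers quoted above. The figures are taken from this guide's DGX-2 measurements; treat the totals as order-of-magnitude estimates:

```python
# Per-tree bandwidth sources in a DGX-2 (GB/s, from this guide).
cpu_path  = 12.5   # CPU memory through the level-1 switch
local_nvme = 13.3  # NVMe drives under the level-1 switches
raid_slot = 14.0   # RAID card in a level-2 switch slot

per_tree = cpu_path + local_nvme + raid_slot
print(per_tree)  # ~39.8 GB/s available to one PCIe tree

# Spread across 16 GPUs at 11-14 GB/s each, the system total is
# on the order of 200 GB/s.
print(16 * 12.5)  # 200.0
```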



6.3. GPU BAR1 Size

Here is some information about the GPU BAR1 size.

GPUDirect Storage enables DMA engines to move data through the GPU BAR1 aperture into or out of GPU memory. The transfer size might exceed the GPU BAR1 size. In such cases, the GPUDirect Storage software recognizes this and uses an intermediate buffer in GPU memory: the DMA engine copies into that buffer, and the GPU then copies from it into the target buffer. This is handled transparently but adds some overhead.

Increasing the GPU BAR1 size can reduce or eliminate such copy overheads.
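A hypothetical sketch of the staging logic: when a request exceeds the BAR1 aperture, it is split into aperture-sized chunks that are staged through an intermediate GPU buffer. The function and parameter names are illustrative, not the actual GDS internals:

```python
def plan_chunks(total_bytes, bar1_bytes):
    """Split an oversized transfer into BAR1-sized pieces.

    Each (offset, size) chunk would be DMA'd through the BAR1
    window into a staging buffer, then copied by the GPU into
    its final location -- the transparent overhead mentioned above.
    """
    chunks, offset = [], 0
    while offset < total_bytes:
        size = min(bar1_bytes, total_bytes - offset)
        chunks.append((offset, size))
        offset += size
    return chunks

MiB = 1 << 20
print(plan_chunks(10 * MiB, 4 * MiB))
# Three chunks: two of 4 MiB and a final one of 2 MiB. A larger
# BAR1 would cover the whole transfer in one pass, avoiding staging.
```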

7. Call to Action


The following list suggests things that can be done today or as part of a GPUDirect Storage implementation.
  • Choose to be part of the GPU storage platform of the future.
  • Enable your app by fully porting it to the GPU, so that the IO is directly between GPU memory and storage.
  • Use interfaces that make explicit transfers: use cuFile APIs directly or via a framework layer that is already enabled to use cuFile APIs.
  • Choose and use distributed file systems or distributed block systems that are enabled with GPUDirect Storage.
  • Send feedback and questions to gpudirect-storage@nvidia.com and a team member will respond with the next steps.

Notices

Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.

Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.

NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.

NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.

Notices

No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.

Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.

THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

VESA DisplayPort

DisplayPort and DisplayPort Compliance Logo, DisplayPort Compliance Logo for Dual-mode Sources, and DisplayPort Compliance Logo for Active Cables are trademarks owned by the Video Electronics Standards Association in the United States and other countries.

HDMI

HDMI, the HDMI logo, and High-Definition Multimedia Interface are trademarks or registered trademarks of HDMI Licensing LLC.

OpenCL

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

Notices

Trademarks

NVIDIA, the NVIDIA logo, DGX, DGX-1, DGX-2, Tesla, and Quadro are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.