NVIDIA GPUDirect Storage Overview Guide

The NVIDIA® GPUDirect® Storage Overview Guide provides a high-level overview of GDS, guidance to help you enable filesystems for GDS, and some insights about the features of a filesystem and how it relates to GDS.

1. Introduction

GDS enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. This direct path increases system bandwidth and decreases the latency and utilization load on the CPU.

This guide provides a high-level overview of GDS, guidance to help you enable filesystems for GDS, and some insights about the features of a filesystem and how they relate to GDS. The guide also outlines the functionalities, considerations, and software architecture of GDS. This high-level introduction sets the stage for deeper technical information in the cuFile API Reference Guide for GDS users who need to modify the kernel.

1.2. Benefits for a Developer

Here is some information about the benefits that GDS provides for application developers.

Here are the benefits that are provided by GDS:
  • Enables a direct path between GPU memory and storage.
  • Increases the bandwidth, reduces the latency, and reduces the load on CPUs and GPUs for data transfers.
  • Reduces the performance impact and dependence on CPUs to process storage data transfer.
  • Acts as a performance force multiplier on top of the compute advantage for computational pipelines that are fully migrated to the GPU, so that the GPU, rather than the CPU, has the first and last touch of data that moves between storage and the GPU.
  • Supports interoperability with other OS-based file access, which enables data to be transferred to and from the device by using traditional file IO, which is then accessed by a program that uses the cuFile APIs.
Here are the benefits that are provided by the cuFile APIs and their implementations:
  • A family of APIs that provide CUDA applications with the best-performing access to local or distributed file and block storage.

    Block storage validation might be added in the future.

  • These APIs are consistent with the long-term direction of the Linux community, for example, with respect to peer-to-peer RDMA.
  • When transferring to and from the GPU, increased performance relative to existing standard Linux file IO.
  • Greater ease of use by removing the need for the careful expert management of memory allocation and data movement.
  • A simpler API sequence relative to existing implicit file-GPU data movement methods, which require more complex management of memory and data movement on and between the CPU and GPU.
  • Broader support for unaligned transfers than POSIX pread and pwrite APIs with O_DIRECT.

    With the POSIX APIs, the application code must either use buffered IO or handle unaligned transfers itself.

  • Generality across a variety of storage types that span various local and distributed filesystems, block interfaces, and namespace systems, including standard Linux and third-party solutions.
Here are the benefits that are provided by the Stream subset of the cuFile APIs:
  • Asynchronous offloaded operations are ordered with respect to a CUDA stream.
    • IO after compute: The GPU kernel produces the data before the IO operation transfers it.
    • Compute after IO: After the data transfer is complete, the GPU kernel can proceed.
  • Available concurrency across streams.
    • Using different CUDA streams allows the possibility of concurrent execution and the concurrent use of multiple DMA engines.

1.3. Intended Uses

Here is some information that explains how to use the cuFile features.

Here is a list of how you can use the cuFile features:
  • cuFile implementations boost throughput when IO between storage and GPU memory is a performance bottleneck.

    This condition arises in cases where the compute pipeline has been migrated to the GPU from the CPU, so that the first and last agents to touch data, before or after transfers with storage, execute on the GPU.

  • cuFile APIs are currently explicit: they read or write between storage and buffers that fit completely into the available GPU physical memory.
  • Rather than fine-grained random access, the cuFile APIs are a suitable match for coarse-grained streaming transfers.
  • For fine-grained accesses, the underlying software overheads for making a kernel transition and going through the operating system can be amortized.

1.4. Versioning History

Here is some information about the numbering scheme that is used for the documentation.

A common versioning scheme is used for documents and utilities with a -v switch that corresponds to the following major releases:
  • 0.4 Pre-Alpha
  • 0.5 Alpha
  • 0.7 Beta, April 2020
    • CPU-staged fallback path to POSIX-compliant filesystems when driver is absent
    • Added support for DDN EXAScaler® parallel filesystem solutions (based on the Lustre filesystem) and WekaFS™.
    • Deployment: tarball with installer
  • 0.7.1 Beta Update 1, June 2020
    • Bug fixes.
    • Documentation improvements.
  • 0.8 Release in October 2020

The overall schema is <major release>.<minor release>.<patch number>. The minor release number is incremented for each validated minor release, and this value returns to 0 with each major release. Patch numbers might be used for bug fixes in unofficial releases. Until version 1.0, API definitions may continue to change.
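
As an illustration only, the numbering scheme above can be modeled in a few lines of Python. The names GdsVersion, parse_version, and api_may_change are hypothetical helpers introduced for this sketch and are not part of GDS. The sketch assumes that a missing patch number, as in 0.8, is equivalent to a patch number of 0.

```python
from typing import NamedTuple


class GdsVersion(NamedTuple):
    """Hypothetical container for <major release>.<minor release>.<patch number>."""
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> GdsVersion:
    """Parse '<major>.<minor>[.<patch>]'; a missing patch number is assumed to be 0."""
    parts = [int(p) for p in text.strip().split(".")]
    if not 2 <= len(parts) <= 3:
        raise ValueError(f"expected <major>.<minor>[.<patch>], got {text!r}")
    while len(parts) < 3:
        parts.append(0)
    return GdsVersion(*parts)


def api_may_change(version: GdsVersion) -> bool:
    """API definitions may continue to change until version 1.0."""
    return version < GdsVersion(1, 0, 0)
```

Because the tuple fields compare in order, sorting a list of version strings with key=parse_version places 0.7.1 between 0.7 and 0.8.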

See GPUDirect Storage Release Notes for details of functional and performance changes since previous releases. The following subsections pertain to documentation changes.

Note: These are API name changes and API argument changes that have occurred since the Alpha release and that developers must accommodate to compile.

1.5. Update History

This section provides information about the updates to this guide.

Updates Since Version 0.7

Since version 0.7 of this guide was created, the following sections are new:
  • Functional Overview
  • GPUDirect Storage Requirements, except Software Components and Alignment with Other Linux Initiatives, which also had minor updates.
  • Using GPUDirect Storage in Containers

Sections 1, 2, and a part of section 3 from the cuFile API Reference Guide have been moved to this guide.

Updates Since Version 0.5 (Alpha 2)

The following sections have been updated since version 0.5, relative to the cuFile API Reference Guide:
  • 1.4: Shifted from upcoming features to Beta availability.

    Concretized references to partner support, which will simplify future work.

  • 2.2: Compatibility mode has been added.
  • 2.4: Added monitoring functionalities like Ftrace, logging, profiling.
  • 5.1: Updated deployment specifics and library dependencies.
  • 5.3: Dependencies have been refined.
  • 5.4: Limitations were updated, and specifics of distributed filesystem support were added.

Updates Since Version 0.4 (Alpha 1)

The following updates have been made to this document since version 0.4 of the cuFile API Reference Guide:
  • 1.2: Greater clarity around ease of use and unaligned IO as a benefit.
  • 2.5: Increased clarity around GPLv2.

2. Functional Overview

This section provides a functional overview of GDS. It covers basic usage, generality, performance considerations, and the scope of the solution. This documentation applies to the cuFile APIs, which are issued from the CPU.

2.1. Explicit and Direct

GDS is a performance-centric solution, so the performance of an end-to-end transfer is a function of latency overheads and the maximal achievable bandwidth.

Here are some terms:
Explicit programmatic request
An explicit programmatic request that immediately invokes the transfer between the storage and the GPU memory is proactive.
Implicit request
An implicit request to storage, which is induced by a memory reference that causes a page miss from the GPU back to the CPU, and potentially the CPU to storage, is reactive.
Note: Reactive activity tends to induce more overhead. As a result of being explicit and proactive, GDS maximizes performance with its explicit cuFile APIs.

Latency is lower when extra copies are avoided, and the highest bandwidth paths are taken. Without GDS, an extra copy through a bounce buffer in the CPU is necessary, which introduces latency and lowers effective bandwidth.

Note: The latency improvements from GDS are most apparent with small transfers.

With GDS, although there are exceptions, a zero-copy approach is possible. Additionally, when a copy through the CPU is no longer necessary, the data path does not include the CPU. On some systems, a direct path between local or remote storage that goes through a PCIe switch offers at least twice the peak bandwidth of a data path through the CPU. Using cuFile APIs to access GDS technology enables explicit and direct transfers, which offer lower latency and higher bandwidth.

For direct data transfers between GPU memory and storage, the file must be opened in O_DIRECT mode. If the file is not opened in this mode, contents might be buffered in the CPU system memory, which is incompatible with direct transfers.
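
The following is a minimal sketch of how an application might open a file with O_DIRECT before handing the descriptor to cuFile. It uses only the Python standard library. The helper names direct_open_flags and open_direct are hypothetical, and the cuFile handle registration step that would follow is not shown. O_DIRECT is not defined on every platform, so the sketch falls back to a plain open where the flag is unavailable.

```python
import os


def direct_open_flags(write: bool = False) -> int:
    """Build open(2) flags that request non-buffered IO where the platform supports it."""
    access = os.O_RDWR if write else os.O_RDONLY
    # os.O_DIRECT exists only on platforms that define it (for example, Linux).
    o_direct = getattr(os, "O_DIRECT", 0)
    return access | o_direct


def open_direct(path: str, write: bool = False) -> int:
    """Open path and return a file descriptor; the caller must close it with os.close."""
    return os.open(path, direct_open_flags(write))
```

Some filesystems reject O_DIRECT at open time, so a caller should be prepared for os.open to raise OSError.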

The following graphic compares code sequences of an explicit copy versus using mmap and incurring an implicit page fault where necessary:

Figure 1. Explicit Copy versus Using mmap

In the left pane, pread is used to move data from storage into a CPU bounce buffer, sysmem_buf, and cudaMemcpy is used to move that data to the GPU. In the right pane, mmap makes the managed memory backed by the file. The references to managed memory from the GPU that are not present in GPU memory will induce a fault back to the CPU and then to storage, which causes an implicit transfer.

GDS enables DMA between agents (NICs or NVMe drives) near storage and GPU memory. Traditional POSIX read and write APIs only work with addresses of buffers that reside in CPU system memory. cuFile APIs, in contrast, operate on addresses of buffers that reside in GPU memory. The two API families look very similar but have a few differences, as shown in Figure 2.

The following graphic compares the POSIX APIs and cuFile APIs. POSIX pread and pwrite require buffers in CPU system memory and an extra copy, but cuFile read and write only require file handle registration.

Figure 2. Comparing the POSIX APIs and the cuFile APIs

Here are the essential cuFile functionalities:
  • Explicit data transfers between storage and GPU memory, which closely mimic POSIX pread and pwrite.
  • Non-buffered IO (using O_DIRECT), which avoids the use of the filesystem page cache and creates an opportunity to completely bypass the CPU system memory.
  • Performing IO in a CUDA stream, so that it is both async and ordered relative to the other commands in that same stream.

The direct data path that GDS provides relies on the availability of filesystem drivers that are enabled with GDS. These drivers run on the CPU and implement the control path that sets up the direct data path.

2.2. Performance Optimizations

After there is a viable path to explicitly and directly move data between storage and GPU memory, there are additional opportunities to improve performance.

2.2.1. Implementation Performance Enhancements

GDS provides a user interface that abstracts the implementation details. The performance optimizations in that implementation involve trade-offs that are refined over time and are tuned to each platform and topology.

Here is a list of some of those performance optimizations (Figure 3):

Figure 3. Performance Optimizations

  • Path selection
    There might be multiple paths available between endpoints. In an NVIDIA® DGX-2™ system, for example, GPU A and GPU B that are connected to CPU sockets CPU A and CPU B respectively may be connected via two paths.
    • GPU A --> CPU A PCIe root port --> CPU A to CPU B via the CPU interconnect --> CPU B along another PCIe path to GPU B.
    • GPU A --> GPU B using NVLink.
    Similarly, a NIC that is attached to CPU A and to GPU A via PCIe by using an intervening switch has a choice of data paths to GPU B:
    • The NIC --> CPU A PCIe root port, CPU A --> CPU B via CPU interconnect, and CPU B along another PCIe path --> GPU B.
    • The NIC --> a staging buffer in GPU A and NVLink --> GPU B.
  • Staging in intermediate buffers

    Bulk data transfers are performed with DMA copy engines. Not all paths through a system are possible with a single-stage transfer, and sometimes a transfer is broken into multiple stages with a staging buffer along the way.

    In the NIC-GPU A-GPU B example in the graphic, a staging buffer in GPU A is required, and the DMA engine in GPU A or GPU B is used to transfer data between GPU A’s memory and GPU B’s memory.

    Data might be transferred through the CPUs along PCIe only or directly between GPUs over NVLink. Although DMA engines can reach across PCIe endpoints, paths that involve the NVLink may involve staging through a buffer (GPU A).

  • Dynamic routing

    Dynamic routing chooses among the available paths and staging options. In Figure 3, two paths are available between endpoints on the left half and the right half: the red PCIe path and the green NVLink path.

2.2.2. Concurrency Across Threads

Here is some information about how GDS manages concurrency across threads.

Note: All APIs are expected to be thread safe.

Using GDS is a performance optimization. After the applications are functionally enabled to move data directly between storage and a GPU buffer by passing a pointer to the GPU buffer down through application layers, performance is the next concern. IO performance at the system level comes from concurrent transfers on multiple links and across multiple devices. Concurrent transfers across four x4 NVMe PCIe devices are necessary to get full bandwidth from one x16 PCIe link. Because there are PCIe links to each GPU and to each NIC, many concurrent transfers are necessary to saturate the system. GDS does not boost concurrency, so this level of performance tuning is managed by the application.
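
Because concurrency is left to the application, one common pattern is to split a large read into chunks and issue them from several threads. The sketch below shows that pattern with POSIX pread from the Python standard library as a stand-in. It does not call cuFile, and the function names read_chunk and concurrent_read are hypothetical. In a GDS application, each worker would issue a cuFile read into its own region of a GPU buffer instead.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_chunk(fd: int, offset: int, size: int) -> bytes:
    # os.pread takes an explicit offset, so threads can share one descriptor
    # without racing on a common file position.
    return os.pread(fd, size, offset)


def concurrent_read(path: str, chunk_size: int, workers: int) -> bytes:
    """Read a whole file as fixed-size chunks issued from a pool of threads."""
    total = os.path.getsize(path)
    offsets = range(0, total, chunk_size)
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # map() yields results in submission order, so the chunks
            # can be joined directly.
            chunks = pool.map(lambda off: read_chunk(fd, off, chunk_size), offsets)
            return b"".join(chunks)
    finally:
        os.close(fd)
```

The chunk size and worker count are tuning parameters; suitable values depend on the number of devices and links in the system.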


2.2.3. Asynchrony

Another form of concurrency, between the CPU and one or more GPUs, can be achieved in an application thread through asynchrony.

In this process, work is submitted for deferred execution by the CPU, and the CPU can continue to submit more work to a GPU or complete the work on the CPU. This process adds support in CUDA for async IO, which can enable a graph of interdependent work that includes IO to be submitted for deferred execution.

There is a plan for this to be enabled in a future version of GDS with an asynchronous subset of the cuFile APIs, and this feature will add a CUDA stream as an argument. These APIs will also add a pointer to an integer to hold the number of transferred bytes, which is asynchronously updated. Refer to the cuFile API Reference Guide for more information.

2.2.4. Batching

Here is some information about how batching is used in GDS.

There is some fixed overhead involved with each submission from an application into the cuFile implementation. For usage models where many IO transactions get submitted simultaneously, batching reduces the overhead by amortizing that fixed overhead across the transactions in the batch, which improves performance.

Applications might also submit a batch of IO transactions and start working on a subset of completed transactions without having to wait for the whole set. An automatically updated set of flags that indicate which transactions in a batch have completed allows the application to proceed before the entire set of transactions in the batch has completed.

The cuFile batch APIs require the application developer to allocate and populate a data structure with a set of descriptors for IO transactions in the batch, and an initialized bit vector to indicate the completion status. Batch APIs are also asynchronous and use a CUDA stream argument. Refer to the cuFile API Reference Guide for more information.
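
The amortization argument can be made concrete with a simple cost model. The sketch below is not a measurement and does not use the cuFile batch APIs. The function name total_time_us and all the numbers in the usage note are hypothetical, chosen only to show how a fixed per-submission overhead shrinks relative to the total as the batch size grows.

```python
def total_time_us(n_ios: int, per_io_us: float, submit_overhead_us: float, batch_size: int) -> float:
    """Model: each submission pays a fixed overhead, and each IO pays its own cost."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    submissions = -(-n_ios // batch_size)  # ceiling division
    return submissions * submit_overhead_us + n_ios * per_io_us


def overhead_fraction(n_ios: int, per_io_us: float, submit_overhead_us: float, batch_size: int) -> float:
    """Fraction of the modeled total time that is spent on submission overhead."""
    total = total_time_us(n_ios, per_io_us, submit_overhead_us, batch_size)
    return (total - n_ios * per_io_us) / total
```

With 128 IOs, an assumed 10 microseconds per IO, and an assumed 5 microseconds per submission, the model gives 1920 microseconds when each IO is submitted on its own and 1300 microseconds when the IOs are submitted in batches of 32.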

2.3. Compatibility and Generality

Although the purpose of GDS is to avoid using a bounce buffer in CPU system memory, the ability to fall back to this approach allows the cuFile APIs to be used ubiquitously even under suboptimal circumstances. A compatibility mode is available for unsupported configurations that maps IO operations to a fallback path.

This path stages through CPU system memory for systems where one or more of the following conditions is true:
  • Explicit configuration control by using the user version of the cufile.json file.

    Refer to the cuFile API Reference Guide for more information.

  • The lack of availability of the nvidia-fs.ko kernel driver, for example, because it was not installed on the host machine where a container with an application that uses cuFile is running.
  • The lack of availability of relevant GDS-enabled filesystems on the selected file mounts, for example, because one of several used system mounts does not support GDS.
  • File-system-specific conditions, such as when O_DIRECT cannot be applied.

    Vendors, middleware developers, and users who are doing a low-level analysis of file systems should review the GPUDirect Storage O_DIRECT Requirements Guide for more information.
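
The conditions in the list above amount to a simple decision: if any one of them holds, IO is staged through CPU system memory. The sketch below models that decision. It is an illustration, not the logic inside libcufile.so, and the Environment fields and the select_mode function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    """Hypothetical summary of the conditions that affect mode selection."""
    compat_forced_by_config: bool  # explicit control through cufile.json
    nvidia_fs_loaded: bool         # nvidia-fs.ko is available on the host
    mount_supports_gds: bool       # the file's mount is a GDS-enabled filesystem
    o_direct_usable: bool          # filesystem-specific conditions allow O_DIRECT


def select_mode(env: Environment) -> str:
    """Return 'compat' if any fallback condition holds, otherwise 'gds'."""
    needs_fallback = (
        env.compat_forced_by_config
        or not env.nvidia_fs_loaded
        or not env.mount_supports_gds
        or not env.o_direct_usable
    )
    return "compat" if needs_fallback else "gds"
```

For example, a container that runs on a host without nvidia-fs.ko would be modeled with nvidia_fs_loaded=False and would select the compatibility path.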

Refer to cuFileHandleRegister in the cuFile API Reference Guide for more information. In compatibility mode, the performance of GPU-based applications that transfer data between storage and GPU memory is at least as good as that of current CPU-based APIs used without GDS. Testing for the CPU path is limited to POSIX-based APIs and qualified platforms and filesystems that do not include GDS.

Even when transfers are possible with GDS, a direct transfer is not always possible. Here is a sampling of cases that are handled seamlessly by the cuFile APIs:
  • The buffer is not aligned, such as the following:
    • The offsets of the file are not 4KB-page aligned.
    • The GPU memory buffer address is not 4KB-page aligned.
    • The IO request size is not a multiple of 4KB.
    • The requested IO size is too small, and the filesystem cannot support RDMA.
  • The size of the transfer exceeds the size of the GPU BAR1 aperture.
  • The optimal transfer path between the GPU memory buffer and storage involves an intermediate staging buffer, for example, to use NVLink.

The compatibility mode and the seamless handling of cases that require extra steps broaden the generality of GDS and make it easier to use.
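
The three 4KB alignment conditions listed above can be checked with modular arithmetic. The sketch below shows such a check from the application's point of view, for example, to predict when a request might take a slower internal path. The function names are hypothetical, and the GPU buffer address is represented as a plain integer. The cuFile APIs handle these cases themselves, so a check like this is optional.

```python
from typing import List

PAGE_SIZE = 4096  # the 4KB granularity used in the list above


def unaligned_reasons(file_offset: int, gpu_buffer_addr: int, size: int,
                      page_size: int = PAGE_SIZE) -> List[str]:
    """Return the alignment conditions that a request does not meet."""
    reasons = []
    if file_offset % page_size:
        reasons.append("file offset is not page aligned")
    if gpu_buffer_addr % page_size:
        reasons.append("GPU buffer address is not page aligned")
    if size % page_size:
        reasons.append("IO size is not a multiple of the page size")
    return reasons


def is_fully_aligned(file_offset: int, gpu_buffer_addr: int, size: int,
                     page_size: int = PAGE_SIZE) -> bool:
    """True when the offset, the buffer address, and the size are all page aligned."""
    return not unaligned_reasons(file_offset, gpu_buffer_addr, size, page_size)
```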

2.4. Monitoring

This section provides information about the monitoring facilities that are available to track functional and performance issues in GDS.

GDS supports the following monitoring facilities for tracking functional and performance issues:
  • Ftrace

    Exported symbols for GDS functions can be traced using Ftrace. You can also use static tracepoints in the libcufile.so library, but the tracepoints are not yet supported for nvidia-fs.ko. Refer to the GPUDirect Storage Troubleshooting Guide for more information.

  • Logging

    Error conditions and debugging outputs can be generated in a log file. This information is useful for conditions that affect many of the APIs but need only be reported once, or that affect APIs with no return value to report errors. The cufile.json file is used to select the reporting level, such as ERROR, WARN, INFO, DEBUG, and TRACE.

  • Profiling

    GDS can be configured to collect a variety of statistics.

These facilities, and the limitations of third-party tools support, are described in greater detail in the GPUDirect Storage Troubleshooting Guide.

2.5. Scope of the Solutions in GDS

Here is some information about the solutions that are available in GDS.

GDS has added new APIs with functionality that is not supported by today’s operating systems, including direct transfers to GPU buffers, asynchrony, and batching. These APIs offer a performance boost, with a platform-tuned and topology-tuned selection of paths and staging, which add enduring value.

The implementations under cuFile APIs overcome limitations in current operating systems. Some of those limitations are transient and may be removed in future versions of operating systems. Although these solutions are not currently available and may require time for adoption, other GDS-enabled solutions are needed today. Here are the solutions currently available in GDS:
  • Third-party vendor solutions for distributed filesystems.
  • Long-term support through open source, upstreamed Linux that future GDS implementations will seamlessly use.
  • Local filesystem support by using modified storage drivers (currently for experimentation only).

The overall cuFile architecture involves a combination of components, some from NVIDIA and some from third parties:
  • NVIDIA-originated content:
    • User-level cuFile library, libcufile.so, which implements the following in closed source code:
      • cuFile Driver APIs:
        • cuFileDriver{Open, Close}
        • cuFileDriver{GetProperties, Set*}
      • cuFile IO APIs:
        • cuFileHandle{Register, Deregister}
        • cuFileBuf{Register, Deregister}
        • cuFile{Read, Write}
      • Stream subset of the cuFile APIs (Future):
        • cuFile{Read, Write}Async
      • cuFileBatch APIs (Future):
        • cuFileBatchIO{Submit, GetStatus, Cancel, Destroy}
      • Calls to VFS components in standard Linux, whether the filesystem is standard Linux, NFS, a distributed filesystem, and so on.
    • nvidia-fs.ko, the kernel-level driver:
      • Implements callbacks from modified Linux kernel modules or from proprietary filesystems that enable direct DMA to GPU memory.
      • Licensed under GPLv2.

        Likewise, any third-party kernel components that call the nvidia-fs APIs should expect to be subject to GPLv2.

  • Third-party content:
    • Proprietary code stacks that replace portions of the Linux filesystem and block system, and so on.

3. Software Architecture

This section provides some basic information on how GDS works.

GDS enables a DMA engine near storage (NVMe or NIC) to push (or pull) data directly into (and out of) GPU memory. cuFile APIs are passed parameters for one file, a file offset, a size to transfer, and a GPU virtual address to read into or write from. Although the resulting aggregate transfer covers one contiguous virtual address range, several smaller transfers may occur in the implementation. The filesystem breaks the contiguous virtual address range into what might become multiple transfers that might span multiple devices, for example, in a RAID-0 configuration, and potentially multiple pages with non-contiguous physical address ranges. The resulting set of physical address ranges is called a scatter-gather list.
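
To illustrate how one contiguous request can become several smaller transfers, the sketch below splits a file range across the stripes of a RAID-0 set. It models only the file-to-device mapping, under the assumption of a conventional round-robin stripe layout. It does not model GPU physical pages or any particular filesystem, and the function name raid0_segments is hypothetical.

```python
from typing import List, Tuple


def raid0_segments(offset: int, size: int, stripe_size: int,
                   n_devices: int) -> List[Tuple[int, int, int]]:
    """Split [offset, offset + size) into (device, device_offset, length) pieces."""
    segments = []
    end = offset + size
    while offset < end:
        stripe = offset // stripe_size   # index of the stripe that holds this byte
        within = offset % stripe_size    # position inside that stripe
        length = min(stripe_size - within, end - offset)
        device = stripe % n_devices      # stripes rotate across the devices
        device_offset = (stripe // n_devices) * stripe_size + within
        segments.append((device, device_offset, length))
        offset += length
    return segments
```

A 192KB read at offset 0 with a 64KB stripe on two devices yields three segments: two on device 0 and one on device 1. Each segment would become a separate entry in a list like the scatter-gather list described above.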

Existing operating systems attempting to program DMA engines cannot process GPU virtual addresses without help. The GDS-enabled kernel drivers use callbacks to the GDS kernel module, nvidia-fs.ko. These callbacks provide the GPU virtual addresses needed in the final scatter-gather list used to program the DMA engine.

3.1. Software Components

This section provides information about the software stack in GDS.

The following layers exist in the GDS software stack:
  • The application, which includes cufile.h and which makes cuFile API calls from the CPU.
  • The GDS user-level library, libcufile.so.
  • The Linux virtual filesystem, VFS.
  • Linux or vendor kernel storage drivers.
  • The GDS kernel-level driver, nvidia-fs.ko.

The following graphic illustrates a simple software stack:

Figure 4. A Simple GDS Software Stack

3.2. Primary Components

Here is some information about the primary components in the GDS software architecture.

Here are the primary components:
  • (From NVIDIA) libcufile.so, which is the user-level cuFile library:
    • Implements the cuFile API, which is the application-facing API for GDS.

      cuFileRead is shown in the architecture overview graphic in Software Components.

    • There are two alternatives to implement the cuFile API:
      • Use the nvidia-fs.ko kernel driver.

        All filesystems that use VFS use this path.

      • The cuFile user library provides an alternative implementation that does the following:
        • Uses its own non-page-cache buffering in the CPU system memory.
        • Uses the standard POSIX call implementations.
        • Does not need to use the NVFS kernel driver.

          This is a compatibility mode that does not enjoy the GDS benefits.

  • (Not from NVIDIA) Non-block-based or distributed filesystems:
    • These filesystems might be the standard Linux virtual filesystem (VFS), for example an NFS driver or a third-party proprietary system.

      The selection of control paths is based on how filesystems are mounted:

      <file path> --> <mount point> --> <filesystem selection>

    • In some cases, NVIDIA provides patches to these, or alternate, implementations, for example, to kernel modules for NVMe and NVMe-oF.
  • (From NVIDIA) Kernel-level nvidia-fs driver:
    • Handles IOCTLs from the cuFile user library.
    • Implements DMA callbacks to check and translate GPU virtual addresses to physical addresses. These callbacks are called from storage drivers.
    • Manages the mechanisms and buffering that enable DMA from the device.
    Note: The Linux kernel core is completely unmodified.
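
The selection chain shown earlier in this section, <file path> --> <mount point> --> <filesystem selection>, can be sketched as a longest-prefix match of the file path against the mount table. The sketch below is an illustration only. The function name select_mount is hypothetical, and the mount table is passed in as a plain dictionary rather than read from the system.

```python
import os
from typing import Dict, Tuple


def select_mount(path: str, mounts: Dict[str, str]) -> Tuple[str, str]:
    """Return (mount_point, filesystem_type) for the deepest mount that contains path."""
    path = os.path.normpath(path)
    best = None
    for mount_point, fs_type in mounts.items():
        mp = os.path.normpath(mount_point)
        # A path belongs to a mount if it is the mount point itself
        # or lies below it in the directory tree.
        prefix = mp.rstrip("/") + "/"
        if path == mp or path.startswith(prefix):
            if best is None or len(mp) > len(best[0]):
                best = (mp, fs_type)
    if best is None:
        raise LookupError(f"no mount point contains {path!r}")
    return best
```

With a hypothetical table that maps / to ext4, /mnt to nfs, and /mnt/lustre to lustre, the path /mnt/lustre/data/x.bin resolves to the lustre mount, while /mnt/lustre2/f resolves to the nfs mount.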

3.2.1. Workflows for GDS Functionality

Here is some information about the workflows that are associated with GDS.

The two flows that are associated with GDS functionality are illustrated in the following graphic:

Figure 5. Workflows for GDS Functionality

For more information about these workflows, see Workflow 1 and Workflow 2.

3.2.2. Workflow 1

Here are the steps to complete Workflow 1.

The first workflow pertains to cuFileRead and cuFileWrite usage. The GPU virtual addresses are represented by proxy CPU system memory addresses. The proxy CPU system memory addresses are passed through the Linux IO stack and are converted to device-specific DMA bus addresses.

Note: None of the following steps are used on a standard pread or pwrite POSIX call.
  1. App to libcufile.so.
    1. GPU applications or GPU-enabled frameworks link to the cuFile library.
    2. The applications or frameworks call the cuFile Driver and IO APIs, such as cuFileRead and cuFileWrite.

    Alignment is handled at this level, possibly with some performance impact, so that buffers do not need to be aligned, for example, to 4KB pages or 512KB storage offsets and chunk sizes.

  2. libcufile.so: libcufile makes decisions about which mode to use based on the filesystem, the configuration, and the hardware support. It selects between compatibility mode and GDS and decides whether to use internal GPU buffers for efficiency.
  3. libcufile to nvidia-fs.
    1. The cuFile library, libcufile.so, services those calls and makes appropriate IOCTL calls to the nvidia-fs.ko driver.
    2. The library interacts with the CUDA user-mode driver library, libcuda.so, as necessary for the stream subset of the cuFile APIs.
  4. nvidia-fs to VFS.
    1. The kernel driver iterates through the set of necessary IO operations and passes in the IO completion callback, in kiocb->common.ki_complete with the callback function value nvfs_io_complete that will be used in step 7. Those calls are to the VFS, which calls the appropriate lower layers, such as the standard Linux block system (ext4 and NVMe) or another vendor distributed filesystem such as EXAScaler®.
  5. Storage kernel drivers to nvidia-fs.ko: Callback APIs are registered via the cuFileDriverOpen initialization, as described in Filesystems Interoperability in the GDS External Architecture Spec.
    With this design, drivers only need to handle GPU addresses through the substeps below. GPU memory addresses are available in a separate map, outside the Linux page map, so the nvidia-fs.ko APIs are used to complete the following tasks:
    • Check whether the DMA target address is on the GPU (nvfs_is_gpu_page) and needs to be handled differently.
    • Query the list of GPU DMA target addresses by using nvfs_dma_map_sg*, which are used instead of the CPU system memory address that is passed through the VFS.
  6. Storage kernel drivers to DMA/RDMA engines: After the appropriate GPU memory addresses are obtained, the underlying DMA engines at (for example, NVMe drivers) or near (for example, NIC) storage can be programmed to move data directly between storage (for example, NVMe or storage controller or NIC) and GPU memory. The special proxy addresses in CPU system memory are not accessed by the DMA engines.
  7. DMA/RDMA engines to storage kernel driver: Completion of each block transfer is signaled back to the storage driver layer.
    The completion of each iteration is signaled back to the nvidia-fs driver by using the callback that was registered in step 4.

3.2.3. Workflow 2

This section provides information about the second flow that relates to reads and writes with user-space RDMA using ib_verbs.

  1. App to libcufile.so: GPU applications or GPU-enabled frameworks link to the cuFile library and call the cuFile Driver and IO APIs. The alignment is handled at this level, though perhaps with some performance impact, so that buffers do not need to be aligned, such as 4KB pages or 512KB storage offsets and chunk sizes.
  2. libcufile.so: libcufile obtains the RDMA information (keys, GID, LID, and so on).
  3. libcufile.so to vendor library: libcufile calls the appropriate vendor library callback functions to communicate the Rkeys directly in userspace or through the nvidia-fs kernel callbacks depending on the vendor driver implementation.

3.3. Aligning with Other Linux Initiatives

This section provides information about how to align GDS with other Linux initiatives.

There are efforts in the Linux community to add native support for DMA among peer devices, which can include NICs and GPUs. After this support is upstreamed, it will take time for all users to adopt the new Linux versions via distributions. Until then, NVIDIA will work with third-party vendors to enable GDS.

The cuFile APIs and their implementation are the mechanism by which CUDA adds support for file IO. The cuFile APIs cover explicit transfers between storage and CPU or GPU memory. The APIs also add support for asynchrony and batching, which are not available in POSIX IO. The cuFile APIs will remain relevant after the functionalities mentioned earlier are added to Linux. Only the underlying implementations will change, not the existing cuFile APIs.

NVIDIA's initial implementations for cuFile focus on distributed filesystems and systems where appropriate drivers have been installed to enable a direct transfer between storage and GPU memory without using a bounce buffer in the CPU. For compatibility and broader applicability, later implementations may support extensions for local storage and implicit transfers.

4. Deployment

This section provides information about how GDS is deployed, its dependencies, and its limitations and constraints.

4.1. Software Components for Deployment

Here is some information about the software components that are required to deploy GDS.

cuFile APIs are a supplement to the CUDA® driver and runtime APIs and might eventually be distributed and installed with the CUDA driver.

Applications access cuFile functionality by including cufile.h and linking against the libcufile.so library. For the forthcoming stream subset of the cuFile APIs, a CUDA stream parameter is needed, which takes different forms for runtime and driver APIs. The cudaFile and cuFile prefixes are used for those two cases, respectively. The conversion from runtime to driver APIs can be done in header files.

Beyond libcufile.so, there are no linker dependencies that are required to use the cuFile API, but a runtime dynamic dependency on libcuda.so exists. No link dependency on other CUDA Toolkit libraries, CUDA Runtime libraries, or any other components of the display driver currently exists. However, an eventual runtime dependency on the CUDA Runtime library should be anticipated for applications that use the cudaFile* APIs after they are added to the CUDA Runtime. This step is consistent with the application using any other cuda* API, and use of the CUDA Runtime in deployment is covered in the CUDA deployment documentation at NVIDIA Developer Documentation.

In addition to libcuda.so, cuFile has dependencies on external third-party libraries.

The following table provides information about the third-party libraries and CUDA library levels:

Table 1. Third-Party Libraries and CUDA Library Levels

Level: cuFile user library
  • APIs, Types, and Enum Style: Matches the following CUDA driver conventions:
    • cuFile APIs
    • cuFile_ enum vals and defines
    • CU_FILE_ errors
  • Dependencies:
    • Provides libcufile.so until perhaps it gets merged into libcuda.so.
    • Provides cuFile.h until perhaps it gets merged into cuda.h.
    • External library dependencies: libudev-dev, liburcu-dev, libmount-dev, libnuma-dev, libJSONcpp-dev
  • Packaged Together: Shipped separately from libcufile.so.

Level: CUDA runtime + toolkit
  • APIs, Types, and Enum Style: Compatibility in cufile.h for streams APIs’ usage of cudaStream_t.
  • Dependencies: None
  • Packaged Together: cufile.h remains distinct from cuda.h and cuda_runtime.h.

Level: nvidia-fs kernel driver
  • APIs, Types, and Enum Style: nvfs_ prefix
  • Dependencies: Provides nvidia-fs.ko (GPL)
  • Packaged Together: Separately shippable with respect to the NVIDIA driver (until perhaps it gets merged), but it might become co-installed.

Until there is a complete integration with CUDA and its installer, a separate installer is used to deploy libcufile.so, cuFile.h, and nvidia-fs.ko. Depending on how the filesystem is enabled, the installer utilities and scripts may install the software as a stand-alone deployment, or the software may be integrated into a third-party vendor’s installation framework.

4.2. Using GPUDirect Storage in Containers

This section provides information about using GDS in containers.

GDS has user-level and kernel-level components. Containers include only user-level code and rely on kernel-level components having been installed on the host machine. Applications can be developed with GDS’s header files and user-level library and distributed in containers. When the appropriate drivers and vendor-enabled kernel software are not installed or properly configured, GDS’s compatibility mode enables the cuFile APIs to continue to maintain functional operation with minimized performance impact.
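Because a container relies on the host for kernel-level components, a containerized application may want to find out whether the nvidia-fs driver is loaded before it expects the direct path. The following is a minimal sketch, not part of GDS. The helper names are hypothetical, and it assumes that /proc/modules is readable inside the container and reflects the host kernel. The kernel lists module names with underscores, so nvidia-fs.ko appears as nvidia_fs.

```python
def kernel_module_loaded(module_name, proc_modules_text):
    """Return True if module_name appears in /proc/modules-style text."""
    # The kernel reports module names with underscores, so nvidia-fs
    # shows up as nvidia_fs.
    wanted = module_name.replace("-", "_")
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the name.
        if fields and fields[0] == wanted:
            return True
    return False


def read_proc_modules(path="/proc/modules"):
    """Read the loaded-module list; return '' if it is unavailable."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError:
        return ""


if __name__ == "__main__":
    if kernel_module_loaded("nvidia-fs", read_proc_modules()):
        print("nvidia-fs is loaded on the host")
    else:
        print("nvidia-fs not found; cuFile may run in compatibility mode")
```

Matching on the first field only avoids a false positive when nvidia_fs is merely listed as a dependent of another module later on the same line.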

4.3. Internal and External Dependencies

This section provides information about the dependencies for GDS.

GDS has the following internal and external dependencies:
  • Internal dependencies: none
    • cuFile libraries and drivers do not modify CUDA.
    • The streams subset of the cuFile APIs uses the CUDA user driver (libcuda.so) and CUDA runtime (libcudart.so).

      The only APIs used by those drivers are public APIs.

  • External dependencies
    • cuFile uses kernel facilities that are in Linux kernel version 4.15.0.x and later.
    • cuFile has a dependency on MOFED versions (4.6 and later) for support for RDMA-based filesystems.
    • GPUDirect partners may have dependencies on host-channel adapters that are Mellanox ConnectX-5 or later.
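The version minimums in the list above can be checked with a small script. The following is a minimal sketch, not part of GDS. The helper names are hypothetical, and the MOFED string format used in the usage note (for example, MLNX_OFED_LINUX-4.6-1.0.1.1) is an assumption about what the installed tooling reports; adjust the parsing if your system prints something different.

```python
import platform
import re

MIN_KERNEL = (4, 15)  # from the text: Linux kernel 4.15.0.x and later
MIN_MOFED = (4, 6)    # from the text: MOFED 4.6 and later


def major_minor(version_text):
    """Extract the first 'major.minor' pair from a version string."""
    match = re.search(r"(\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError(f"no version number in {version_text!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(version_text, minimum):
    """Return True if version_text is at least the (major, minor) minimum."""
    # Tuples compare numerically, so 4.9 is correctly older than 4.15.
    return major_minor(version_text) >= minimum


if __name__ == "__main__":
    release = platform.release()  # kernel release string on Linux
    try:
        ok = meets_minimum(release, MIN_KERNEL)
        print(f"kernel {release}: {'ok' if ok else 'too old for cuFile'}")
    except ValueError as error:
        print(f"could not parse kernel release: {error}")
```

The same helper can be applied to a MOFED version string, for example meets_minimum("MLNX_OFED_LINUX-4.6-1.0.1.1", MIN_MOFED). Comparing integer tuples rather than raw strings matters here because a plain string comparison would rank "4.9" above "4.15".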


