GPUDirect Storage Parameters

This section describes the JSON configuration parameters used by GDS.

When GDS is installed, the /etc/cufile.json parameter file is installed with default values. The implementation allows for generic GDS settings and parameters specific to a file system or storage partner.

Note:

Consider compat_mode for systems or mounts that are not yet set up with GDS support.


Table 1. GPUDirect Storage cufile.json Variables
Parameter Default Value Description
logging:dir CWD Location of the GDS log file.
logging:level ERROR Verbosity of logging.
profile:nvtx false Boolean that, if set to true, generates NVTX traces for profiling.
profile:cufile_stats 0 Enable cuFile IO stats. Level 0 means no cuFile statistics.
profile:io_batchsize 128 Maximum size of the batch allowed.
properties:max_direct_io_size_kb 16384 Maximum IO chunk size (4K aligned) used by cuFile for each IO request (in KB).
properties:max_device_cache_size_kb 131072 Maximum device memory size (4K aligned) for reserving bounce buffers for the entire GPU (in KB).
properties:max_device_pinned_mem_size_kb 33554432 Maximum per-GPU memory size in KB, including the memory for the internal bounce buffers, that can be pinned.
properties:use_poll_mode false Boolean that indicates whether the cuFile library uses polling or synchronous wait for the storage to complete IO. Polling might be useful for small IO transactions. Refer to Poll Mode below.
properties:poll_mode_max_size_kb 4 Maximum IO request size (4K aligned), in KB, at or below which the library polls for IO completion.
properties:force_compat_mode false If true, forces all IO to use compatibility mode. Alternatively, the administrator can unload nvidia_fs.ko or not expose the character devices in the docker container environment.
properties:allow_compat_mode false If true, enables the compatibility mode, which allows cuFile to issue POSIX read/write. To switch to GDS-enabled I/O, set this to false. Refer to Compatibility Mode below.
properties:rdma_dev_addr_list [] Provides the list of relevant client IPv4 addresses for all the interfaces that can be used for RDMA.
properties:rdma_load_balancing_policy RoundRobin Specifies the load balancing policy for RDMA memory registration. The default is RoundRobin. Valid values for this property:

FirstFit - Suitable for cases where numGpus matches numPeers and the GPU PCIe lane width is greater than or equal to the peer PCIe lane width.

MaxMinFit - Assigns peers so that there is the least sharing. Suitable for cases where all GPUs are loaded uniformly.

RoundRobin - Uses only the NICs closest to the GPU for memory registration, in a round-robin fashion.

RoundRobinMaxMin - Similar to RoundRobin, but uses the peers with the least sharing.

Randomized - Uses only the NICs closest to the GPU for memory registration, in a randomized fashion.

properties:rdma_dynamic_routing false Boolean parameter applicable only to network-based file systems. It can be enabled on platforms where GPUs and NICs do not share a common PCIe root port.
properties:rdma_dynamic_routing_order [ "GPU_MEM_NVLINKS", "GPU_MEM", "SYS_MEM", "P2P" ] Applicable only if rdma_dynamic_routing is enabled. Specifies an ordered list of routing policies; the policy used for an IO is selected on a first-fit basis.
properties:io_batchsize 128 The maximum number of IO operations per batch.
properties:gds_rdma_write_support true Enable GDS write support for RDMA based storage.
properties:io_priority default Enables IO priority with respect to compute streams. Valid options are "default", "low", "med", and "high". Tuning this might help when cudaMemcpy is not performing as expected because the GPU is consumed by compute work.
fs:generic:posix_unaligned_writes false Setting this to true forces the use of a POSIX write instead of cuFileWrite for unaligned writes.
fs:lustre:posix_gds_min_kb 4KB Applicable only to the EXAScaler filesystem, for both reads and writes. IO size threshold (4K aligned), in KB; for IO at or below this threshold, cuFile uses the POSIX read/write path.
fs:lustre:rdma_dev_addr_list [] Provides the list of relevant client IPv4 addresses for all the interfaces that can be used by a single lustre mount. This property is used by the cuFile dynamic routing feature to infer preferred RDMA devices.
fs:lustre:mount_table [] Specifies a dictionary of IPv4 mount addresses against a Lustre mount point. This property is used by the cuFile dynamic routing feature. Refer to the default cufile.json for sample usage.
fs:nfs:rdma_dev_addr_list [] Provides the list of IPv4 addresses for all the interfaces a single NFS mount can use. This property is used by the cuFile dynamic routing feature to infer preferred RDMA devices.
fs:nfs:mount_table [] Specifies a dictionary of IPv4 mount addresses against an NFS mount point. This property is used by the cuFile dynamic routing feature. Refer to the default cufile.json for sample usage.
fs:weka:rdma_write_support false If set to true, cuFileWrite will use RDMA writes instead of falling back to posix writes for a WekaFs mount.
fs:weka:rdma_dev_addr_list [] Provides the list of relevant client IPv4 addresses for all the interfaces a single WekaFS mount can use. This property is also used by the cuFile dynamic routing feature to infer preferred RDMA devices.
fs:weka:mount_table [] Specifies a dictionary of IPv4 mount addresses against a WekaFS mount point. This property is used by the cuFile dynamic routing feature. Refer to the default cufile.json for sample usage.
denylist:drivers [] Administrative setting that disables supported storage drivers on the node.
denylist:devices [] Administrative setting that disables specific supported block devices on the node. Not applicable for DFS.
denylist:mounts [] Administrative setting that disables specific mounts in the supported GDS-enabled filesystems on the node.
denylist:filesystems [] Administrative setting that disables specific supported GDS-ready filesystems on the node.
miscellaneous:skip_topology_detection false Setting this to true skips topology detection in compat mode, which reduces the high startup latency seen in compat mode on systems with many PCI devices.
execution:max_io_queue_depth 128 The maximum number of pending work items that can be held by the cuFile library's internal threadpool subsystem.
execution:max_io_threads 4 The number of threadpool threads that process work items produced into the work queue corresponding to a single GPU on the system.
execution:parallel_io true Setting this to true allows parallel processing of work items by enqueuing them into the threadpool subsystem provided by the cuFile library.
execution:min_io_threshold_size_kb 8192 The size, in KB, into which an I/O work item submitted by the application is split when it is enqueued into the threadpool subsystem, provided enough parallel buffers are available.
execution:max_request_parallelism 4 The maximum number of parallel buffers into which the original I/O work item buffer can be split when it is enqueued into the threadpool subsystem.
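
For RDMA-capable network filesystems, the interface lists, load-balancing policy, and dynamic-routing parameters above are set in the same cufile.json file. The following fragment is a minimal sketch: the IPv4 addresses are placeholders that must be replaced with the client interfaces on the target system, and the policy and routing-order values shown are simply the defaults from the table.

{
  "properties": {
    // client NICs eligible for RDMA (placeholder addresses)
    "rdma_dev_addr_list": [ "192.168.0.12", "192.168.1.12" ],
    // FirstFit | MaxMinFit | RoundRobin | RoundRobinMaxMin | Randomized
    "rdma_load_balancing_policy": "RoundRobin",
    // enable only on platforms where GPUs and NICs do not share a common PCIe root port
    "rdma_dynamic_routing": false,
    "rdma_dynamic_routing_order": [ "GPU_MEM_NVLINKS", "GPU_MEM", "SYS_MEM", "P2P" ]
  },
  "fs": {
    "lustre": {
      // NICs usable by a single Lustre mount (placeholder address)
      "rdma_dev_addr_list": [ "192.168.0.12" ]
    }
  }
}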

Note:

Workload- or application-specific parameters can be set by using the CUFILE_ENV_PATH_JSON environment variable, which points to an alternate cufile.json file, for example, CUFILE_ENV_PATH_JSON=/home/gds_user/my_cufile.json.
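
The following is a minimal sketch of such an alternate file, using the path from the example above; it raises the log verbosity and enables compatibility mode for a single workload without modifying the system-wide /etc/cufile.json.

# /home/gds_user/my_cufile.json
{
  "logging": {
    // more verbose logging for this workload
    "level": "TRACE"
  },
  "properties": {
    // allow fallback to POSIX read/write where GDS is not functional
    "allow_compat_mode": true
  }
}

The application is then launched with CUFILE_ENV_PATH_JSON pointing at this file, as in the example above.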


There are two mode types that you can set in the cufile.json configuration file:
  • Poll Mode

    The cuFile API set includes an interface to put the driver in polling mode. Refer to cuFileDriverSetPollMode() in the cuFile API Reference Guide for more information. When poll mode is set, a read or write whose size is less than or equal to properties:poll_mode_max_size_kb (4 KB by default) results in the library polling for IO completion rather than blocking (sleeping). For small IO size workloads, enabling poll mode may reduce latency; see the sample settings after this list.

  • Compatibility Mode

    There are several possible scenarios where GDS might not be available or supported, for example, when the GDS software is not installed, the target file system is not supported by GDS, O_DIRECT cannot be enabled on the target file, and so on. When you enable compatibility mode and GDS is not functional for the IO target, code that uses the cuFile APIs falls back to the standard POSIX read/write path. To learn more about compatibility mode, refer to cuFile Compatibility Mode.
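
Both modes are controlled from the properties section of cufile.json. The following fragment is a minimal sketch; the 64 KB poll threshold is an illustrative value, not a recommendation.

{
  "properties": {
    // poll for IO completion instead of blocking (sleeping)...
    "use_poll_mode": true,
    // ...for requests at or below this size in KB (illustrative value)
    "poll_mode_max_size_kb": 64,
    // fall back to POSIX read/write when GDS is not functional for the IO target
    "allow_compat_mode": true
  }
}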

From a benchmarking and performance perspective, the default settings work very well across a variety of IO loads and use cases. We recommend that you use the default values for max_direct_io_size_kb, max_device_cache_size_kb, and max_device_pinned_mem_size_kb unless a storage provider has a specific recommendation, or analysis and testing show better performance after you change one or more of the defaults.

The cufile.json file is designed to be extensible: parameters can be generic and apply to all supported file systems (fs:generic), or they can be file system specific (fs:lustre). The fs:generic:posix_unaligned_writes parameter enables the use of the POSIX write path when unaligned writes are encountered. Unaligned writes are generally sub-optimal because they can require read-modify-write operations.

If the target workload generates unaligned writes, you might want to set posix_unaligned_writes to true, as the POSIX path for handling unaligned writes might be more performant, depending on the target filesystem and underlying storage. Also, in this case, the POSIX path will write to the page cache (system memory).

The fs:lustre:posix_gds_min_kb setting invokes the POSIX read/write path rather than the cuFile path when the IO size is less than or equal to posix_gds_min_kb. On Lustre, the POSIX path can have better (lower) latency for small IO sizes.
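
Both settings sit under the fs section of cufile.json. The following fragment is a minimal sketch; the 1024 KB Lustre threshold is an illustrative value, not a recommendation.

{
  "fs": {
    "generic": {
      // route unaligned writes through POSIX write instead of cuFileWrite
      "posix_unaligned_writes": true
    },
    "lustre": {
      // use the POSIX read/write path for IO at or below this size in KB (illustrative value)
      "posix_gds_min_kb": 1024
    }
  }
}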

The GDS parameters are among several elements that factor into delivered storage IO performance. It is advisable to start with the defaults and only make changes based on recommendations from a storage vendor or based on empirical data obtained during testing and measurements of the target workload.

The following is the JSON schema:

# /etc/cufile.json
{
  "logging": {
    // log directory, if not enabled will create log file
    // under current working directory
    //"dir": "/home/<xxxx>",

    // ERROR|WARN|INFO|DEBUG|TRACE (in decreasing order of priority)
    "level": "ERROR"
  },

  "profile": {
    // nvtx profiling on/off
    "nvtx": false,
    // cufile stats level (0-3)
    "cufile_stats": 0
  },

  "execution": {
    // max number of workitems in the queue
    "max_io_queue_depth": 128,
    // max number of host threads per gpu to spawn for parallel IO
    "max_io_threads": 4,
    // enable support for parallel IO
    "parallel_io": true,
    // minimum IO threshold before splitting the IO
    "min_io_threshold_size_kb": 8192,
    // maximum parallelism for a single request
    "max_request_parallelism": 4
  },

  "properties": {
    // max IO size (4K aligned) issued by cuFile to nvidia-fs driver (in KB)
    "max_direct_io_size_kb": 16384,
    // device memory size (4K aligned) for reserving bounce buffers
    // for the entire GPU (in KB)
    "max_device_cache_size_kb": 131072,
    // limit on maximum memory (4K aligned) that can be pinned
    // for a given process (in KB)
    "max_device_pinned_mem_size_kb": 33554432,
    // true or false (true will enable asynchronous io submission to nvidia-fs driver)
    "use_poll_mode": false,
    // maximum IO request size (4K aligned) within or equal
    // to which library will poll (in KB)
    "poll_mode_max_size_kb": 4,
    // allow compat mode, this will enable use of cufile posix read/writes
    "allow_compat_mode": false,
    // client-side rdma addr list for user-space file-systems
    // (e.g. ["10.0.1.0", "10.0.2.0"])
    "rdma_dev_addr_list": [ ]
  },

  "fs": {
    "generic": {
      // for unaligned writes, setting it to true
      // will use posix write instead of cuFileWrite
      "posix_unaligned_writes": false
    },

    "lustre": {
      // IO threshold for read/write (4K aligned) equal to or below
      // which cufile will use posix reads (KB)
      "posix_gds_min_kb": 0
    }
  },

  "blacklist": {
    // specify list of vendor driver modules to blacklist for nvidia-fs
    "drivers": [ ],
    // specify list of block devices to prevent IO using libcufile
    "devices": [ ],
    // specify list of mount points to prevent IO using libcufile
    // (e.g. ["/mnt/test"])
    "mounts": [ ],
    // specify list of file-systems to prevent IO using libcufile
    // (e.g. ["lustre", "wekafs", "vast"])
    "filesystems": [ ]
  }

  // Application can override custom configuration via
  // export CUFILE_ENV_PATH_JSON=<filepath>
  // e.g.: export CUFILE_ENV_PATH_JSON="/home/<xxx>/cufile.json"
}
