Contents
- 1. NVIDIA GPUDirect Storage Installation and Troubleshooting Guide
- 2. Introduction
- 3. Installing GPUDirect Storage
- 3.1. Before You Install GDS
- 3.2. Installing GDS
- 3.3. Installed GDS Libraries and Tools
- 3.4. Uninstalling GPUDirect Storage
- 3.5. Environment Variables Used by GPUDirect Storage
- 3.6. JSON Config Parameters Used by GPUDirect Storage
- 3.7. GDS Configuration File Changes to Support Dynamic Routing
- 3.8. Determining Which Version of GDS is Installed
- 3.9. Experimental Repos for Network Install of GDS Packages for DGX Systems
- 4. API Errors
- 5. Basic Troubleshooting
- 5.1. Log Files for the GDS Library
- 5.2. Enabling a Different cufile.log File for Each Application
- 5.3. Enabling Tracing GDS Library API Calls
- 5.4. cuFileHandleRegister Error
- 5.5. Troubleshooting Applications that Return cuFile Errors
- 5.6. cuFile-* Errors with No Activity in GPUDirect Storage Statistics
- 5.7. CUDA Runtime and Driver Mismatch with Error Code 35
- 5.8. CUDA API Errors when Running the cuFile-* APIs
- 5.9. Finding GDS Driver Statistics
- 5.10. Tracking IO Activity that Goes Through the GDS Driver
- 5.11. Read/Write Bandwidth and Latency Numbers in GDS Stats
- 5.12. Tracking Registration and Deregistration of GPU Buffers
- 5.13. Enabling RDMA-specific Logging for Userspace File Systems
- 5.14. CUDA_ERROR_SYSTEM_NOT_READY After Installation
- 5.15. Adding udev Rules for RAID Volumes
- 5.16. When You Observe “Incomplete write” on NVMe Drives
- 5.17. cuFile Async I/O Is Failing
- 6. Advanced Troubleshooting
- 6.1. Resolving Hung cuFile* APIs with No Response
- 6.2. Sending Relevant Data to Customer Support
- 6.3. Resolving an IO Failure with EIO and Stack Trace Warning
- 6.4. Controlling GPU BAR Memory Usage
- 6.5. Determining the Amount of Cache to Set Aside
- 6.6. Monitoring BAR Memory Usage
- 6.7. Resolving an ENOMEM Error Code
- 6.8. GDS and Compatibility Mode
- 6.9. Enabling Compatibility Mode
- 6.10. Tracking the IO After Enabling Compatibility Mode
- 6.11. Bypassing GPUDirect Storage
- 6.12. GDS Does Not Work for a Mount
- 6.13. Simultaneously Running the GPUDirect Storage IO and POSIX IO on the Same File
- 6.14. Running Data Verification Tests Using GPUDirect Storage
- 7. Troubleshooting Performance
- 7.1. Running Performance Benchmarks with GDS
- 7.2. Tracking Whether GPUDirect Storage is Using an Internal Cache
- 7.3. Tracking when IO Crosses the PCIe Root Complex and Impacts Performance
- 7.4. Using GPUDirect Statistics to Monitor CPU Activity
- 7.5. Monitoring Performance and Tracing with cuFile-* APIs
- 7.6. Example: Using Linux Tracing Tools
- 7.7. Tracing the cuFile* APIs
- 7.8. Improving Performance using Dynamic Routing
- 8. Troubleshooting IO Activity
- 9. EXAScaler File System LNet Troubleshooting
- 9.1. Determining the EXAScaler File System Client Module Version
- 9.2. Checking the LNet Network Setup on a Client
- 9.3. Checking the Health of the Peers
- 9.4. Checking for Multi-Rail Support
- 9.5. Checking GDS Peer Affinity
- 9.6. Checking for LNet-Level Errors
- 9.7. Resolving LNet NIDs Health Degradation from Timeouts
- 9.8. Configuring LNet Networks with Multiple OSTs for Optimal Peer Selection
- 10. Understanding EXAScaler File System Performance
- 10.1. osc Tuning Performance Parameters
- 10.2. Miscellaneous Commands for osc, mdc, and stripesize
- 10.3. Getting the Number of Configured Object-Based Disks
- 10.4. Getting Additional Statistics related to the EXAScaler File System
- 10.5. Getting Metadata Statistics
- 10.6. Checking for an Existing Mount
- 10.7. Unmounting an EXAScaler File System Cluster
- 10.8. Getting a Summary of EXAScaler File System Statistics
- 10.9. Using GPUDirect Storage in Poll Mode
- 11. Troubleshooting and FAQ for the WekaIO File System
- 11.1. Downloading the WekaIO Client Package
- 11.2. Determining Whether the WekaIO Version is Ready for GDS
- 11.3. Mounting a WekaIO File System Cluster
- 11.4. Resolving a Failing Mount
- 11.5. Resolving 100% Usage for WekaIO for Two Cores
- 11.6. Checking for an Existing Mount in the Weka File System
- 11.7. Checking for a Summary of the WekaIO File System Status
- 11.8. Displaying the Summary of the WekaIO File System Statistics
- 11.9. Why WekaIO Writes Go Through POSIX
- 11.10. Checking for nvidia-fs.ko Support for Memory Peer Direct
- 11.11. Checking Memory Peer Direct Stats
- 11.12. Checking for Relevant nvidia-fs Statistics for the WekaIO File System
- 11.13. Conducting a Basic WekaIO File System Test
- 11.14. Unmounting a WekaIO File System Cluster
- 11.15. Verify the Installed Libraries for the WekaIO File System
- 11.16. GDS Configuration File Changes to Support the WekaIO File System
- 11.17. Check for Relevant User-Space Statistics for the WekaIO File System
- 11.18. Check for WekaFS Support
- 12. Enabling IBM Spectrum Scale Support with GDS
- 12.1. IBM Spectrum Scale Limitations with GDS
- 12.2. Checking nvidia-fs.ko Support for Mellanox PeerDirect
- 12.3. Verifying Installed Libraries for IBM Spectrum Scale
- 12.4. Checking PeerDirect Stats
- 12.5. Checking for Relevant nvidia-fs Stats with IBM Spectrum Scale
- 12.6. GDS User Space Stats for IBM Spectrum Scale for Each Process
- 12.7. GDS Configuration to Support IBM Spectrum Scale
- 12.8. Scenarios for Falling Back to Compatibility Mode
- 12.9. GDS Limitations with IBM Spectrum Scale
- 13. NetApp E-series BeeGFS with GDS Solution Deployment
- 13.1. NetApp BeeGFS/GPUDirect Storage and Package Requirements
- 13.2. BeeGFS Client Configuration for GDS
- 13.3. GPU/HCA Topology on the Client (DGX-A100) and OSS Servers
- 13.4. Verify the Setup
- 13.4.1. List the Management Node
- 13.4.2. List the Metadata Nodes
- 13.4.3. List the Storage Nodes
- 13.4.4. List the Client Nodes
- 13.4.5. Display Client Connections
- 13.4.6. Verify Connectivity to the Different Services
- 13.4.7. List Storage Pools
- 13.4.8. Display the Free Space and inodes on the Storage and Metadata Targets
- 13.5. Testing
- 14. Setting Up and Troubleshooting VAST Data (NFSoRDMA+MultiPath)
- 15. Troubleshooting and FAQ for NVMe Support Using Linux PCI P2PDMA
- 15.1. Linux Kernel Requirements
- 15.2. Supported GPUs
- 15.3. Setting the Driver Registries for Enabling PCI P2PDMA
- 15.4. cufile.json Settings
- 15.5. Verify P2P Mode is Supported by GDS
- 15.6. RAID Support
- 15.7. Mounting a Local File System for GDS
- 15.8. Check for an Existing EXT4 Mount
- 15.9. Check for IO Statistics with Block Device Mount
- 15.10. Conduct a Basic EXT4 File System Test
- 15.11. Unmount an EXT4 File System
- 15.12. Udev Device Naming for a Block Device
- 15.13. BATCH I/O Performance
- 15.14. Statistics
- 16. Troubleshooting and FAQ for NVMe and NVMeOF Support Using nvidia-fs
- 16.1. MLNX_OFED Requirements and Installation
- 16.2. DOCA Requirements and Installation
- 16.3. Determining Whether the NVMe Device is Supported for GDS
- 16.4. RAID Support in GDS
- 16.5. Mounting a Local File System for GDS
- 16.6. Check for an Existing EXT4 Mount
- 16.7. Check for IO Statistics with Block Device Mount
- 16.8. RAID Group Configuration for GPU Affinity
- 16.9. Conduct a Basic EXT4 File System Test
- 16.10. Unmount an EXT4 File System
- 16.11. Udev Device Naming for a Block Device
- 16.12. BATCH I/O Performance
- 17. Displaying GDS NVIDIA FS Driver Statistics
- 17.1. nvidia-fs Statistics
- 17.2. Analyze Statistics for Each GPU
- 17.3. Resetting the nvidia-fs Statistics
- 17.4. Checking Peer Affinity Stats for a Kernel File System and Storage Drivers
- 17.5. Checking the Peer Affinity Usage for a Kernel File System and Storage Drivers
- 17.6. Display the GPU-to-Peer Distance Table
- 17.7. The GDSIO Tool
- 17.8. Tabulated Fields
- 17.9. The gdscheck Tool
- 17.10. NFS Support with GPUDirect Storage
- 17.11. NFS GPUDirect Storage Statistics and Debugging
- 17.12. GPUDirect Storage IO Behavior
- 17.12.1. Read/Write Atomicity Consistency with GPUDirect Storage Direct IO
- 17.12.2. Write with a File Opened in O_APPEND Mode (cuFileWrite)
- 17.12.3. GPU to NIC Peer Affinity
- 17.12.4. Compatible Mode with Unregistered Buffers
- 17.12.5. Unaligned Writes with Non-Registered Buffers
- 17.12.6. Process Hang with NFS
- 17.12.7. Tools Support Limitations for CUDA 9 and Earlier
- 17.13. GDS Statistics for Dynamic Routing
- 18. GDS Library Tracing
- 18.1. Example: Display Tracepoints
- 18.2. Example: Track the IO Activity of a Process that Issues cuFileRead/cuFileWrite
- 18.3. Example: Display the IO Pattern of all the IOs that Go Through GDS
- 18.4. Understand the IO Pattern of a Process
- 18.5. IO Pattern of a Process with the File Descriptor on Different GPUs
- 18.6. Determine the IOPS and Bandwidth for a Process in a GPU
- 18.7. Display the Frequency of Reads by Processes that Issue cuFileRead
- 18.8. Display the Frequency of Reads when cuFileRead Takes More than 0.1 ms
- 18.9. Displaying the Latency of cuFileRead for Each Process
- 18.10. Example: Tracking the Processes that Issue cuFileBufRegister
- 18.11. Example: Tracking Whether the Process is Constant when Invoking cuFileBufRegister
- 18.12. Example: Monitoring IOs that are Going Through the Bounce Buffer
- 18.13. Example: Tracing cuFileRead and cuFileWrite Failures, Printing Error Codes and Time of Failure
- 18.14. Example: User-Space Statistics for Each GDS Process
- 18.15. Example: Viewing GDS User-Level Statistics for a Process
- 18.16. Example: Displaying Sample User-Level Statistics for Each GDS Process
- 19. User-Space Counters in GPUDirect Storage
- 20. User-Space RDMA Counters in GPUDirect Storage
- 21. Cheat Sheet for Diagnosing Problems