Memory Bandwidth Plugin
Overview
The Memory Bandwidth plugin tests and validates the memory bandwidth performance of individual NVIDIA GPUs. It measures how fast each GPU can read from and write to its own memory, which is critical for applications that require high memory throughput.
The plugin targets the following aspects of GPU memory performance:
Memory Bandwidth: Measures the rate at which data can be read from and written to GPU memory within a single GPU
Memory Subsystem Stability: Validates that the memory subsystem can handle sustained high-bandwidth operations without errors
Memory Performance Validation: Ensures the GPU meets expected memory performance thresholds
Test Description
This test measures memory bandwidth using the STREAM benchmark’s TRIAD operation. It allocates large memory arrays on each GPU and runs intensive memory operations to stress the memory subsystem. While the test runs, the plugin monitors each GPU for memory errors and CUDA errors and checks the measured performance against the configured thresholds.
The test runs multiple iterations with different memory access patterns to find the optimal configuration for each GPU and measure the sustainable memory bandwidth. It validates that each GPU can achieve the specified minimum bandwidth threshold, ensuring the memory subsystem is performing as expected for high-bandwidth workloads.
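The select-the-best-then-compare logic described above might look like the following minimal sketch. It assumes hypothetical per-configuration timings keyed by an illustrative block size, the usual STREAM convention of counting three arrays of traffic per TRIAD pass, and MB meaning 10^6 bytes. None of these details are confirmed for the plugin itself, and the function names are invented for illustration.

```python
BYTES_PER_ELEMENT = 4   # assumption: 4-byte elements
ARRAYS_TOUCHED = 3      # TRIAD reads b and c, and writes a


def triad_bandwidth_mb_s(num_elements: int, seconds: float) -> float:
    """Bandwidth of one TRIAD pass in MB/s (MB = 10**6 bytes, an assumption)."""
    bytes_moved = ARRAYS_TOUCHED * num_elements * BYTES_PER_ELEMENT
    return bytes_moved / seconds / 1e6


def best_configuration(num_elements: int, timings: dict):
    """Return (config, bandwidth) for the fastest configuration.

    `timings` maps a hypothetical configuration label to elapsed seconds.
    """
    results = {cfg: triad_bandwidth_mb_s(num_elements, t)
               for cfg, t in timings.items()}
    best = max(results, key=results.get)
    return best, results[best]


# Made-up timings for three hypothetical block sizes (not measured data).
timings = {128: 0.004, 256: 0.002, 512: 0.003}
config, bandwidth = best_configuration(1_000_000, timings)

minimum_bandwidth = 100.0  # documented default threshold, in MB/s
passed = bandwidth >= minimum_bandwidth
print(config, round(bandwidth, 1), passed)
```

Taking the maximum over configurations, rather than an average, matches the stated goal of finding the best sustainable bandwidth before applying the threshold.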
The Memory Bandwidth plugin implements the STREAM benchmark, which consists of four main memory operations:
COPY: a(i) = b(i) (simple memory copy)
SCALE: a(i) = q*b(i) (memory copy with scaling)
SUM: a(i) = b(i) + c(i) (memory copy with addition)
TRIAD: a(i) = b(i) + q*c(i) (memory copy with scaling and addition)
The plugin focuses on the TRIAD operation, which it treats as the most memory-intensive of the four and the best measure of sustainable memory bandwidth. TRIAD reads from two arrays and writes to a third, so it exercises read and write traffic simultaneously, and it mirrors a pattern common in scientific computing and data processing.
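To make the four operations concrete, here is a small CPU-side sketch in plain Python. It only illustrates the arithmetic of each kernel and the number of array accesses per element; it is not the plugin's GPU implementation, and the function names are illustrative.

```python
def stream_copy(b):
    """COPY: a(i) = b(i)"""
    return [bi for bi in b]


def stream_scale(b, q):
    """SCALE: a(i) = q*b(i)"""
    return [q * bi for bi in b]


def stream_sum(b, c):
    """SUM: a(i) = b(i) + c(i)"""
    return [bi + ci for bi, ci in zip(b, c)]


def stream_triad(b, c, q):
    """TRIAD: a(i) = b(i) + q*c(i)"""
    return [bi + q * ci for bi, ci in zip(b, c)]


# Array accesses per element (reads + one write) for each operation.
ACCESSES_PER_ELEMENT = {"COPY": 2, "SCALE": 2, "SUM": 3, "TRIAD": 3}

b = [1.0, 2.0, 3.0]
c = [10.0, 20.0, 30.0]
q = 2.0

print(stream_copy(b))
print(stream_scale(b, q))
print(stream_sum(b, c))
print(stream_triad(b, c, q))
```

TRIAD touches three arrays per element and adds a multiply, which is why it is commonly used as the headline STREAM figure.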
The test allocates memory based on the configured memory size parameter (by default, three arrays of 67,108,864 four-byte elements, with a stated total of 756 MB per GPU). It then runs TRIAD operations with different memory access patterns, measures the bandwidth of each configuration, selects the best result and compares it against the minimum threshold, and reports the achieved bandwidth along with any errors encountered.
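The allocation arithmetic can be checked with a short sketch. It assumes three arrays of 4-byte elements and reads MB as MiB (2^20 bytes), which the text does not confirm; the helper names are illustrative. Under that assumption, 67,108,864 elements per array occupy 768 MiB, while a 756 MiB total corresponds to 66,060,288 elements per array.

```python
BYTES_PER_ELEMENT = 4      # 4-byte elements, as stated in the text
NUM_ARRAYS = 3             # arrays a, b, and c
MIB = 1024 * 1024          # assumption: "MB" is read as MiB


def footprint_bytes(elements_per_array: int) -> int:
    """Total bytes needed for all three arrays."""
    return elements_per_array * BYTES_PER_ELEMENT * NUM_ARRAYS


def elements_per_array(total_mib: float) -> int:
    """Largest per-array element count that fits in a total budget."""
    return int(total_mib * MIB) // (BYTES_PER_ELEMENT * NUM_ARRAYS)


total_bytes = footprint_bytes(67_108_864)
total_mib = total_bytes / MIB
elements_at_756 = elements_per_array(756)
elements_at_6 = elements_per_array(6)

print(total_bytes, total_mib, elements_at_756, elements_at_6)
```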
Preconditions
NVIDIA GPU with CUDA support
CUDA driver and runtime installed
DCGM host engine running
GPU memory must be available for allocation
Parameters
The following table lists the parameters for the Memory Bandwidth plugin.
| Parameter Name | Type | Default | Description |
|---|---|---|---|
| minimum_bandwidth | Double | 100.0 | Minimum bandwidth in MB/s that must be achieved for the test to pass. |
| is_allowed | String | See description | Whether this test is allowed to run. Must be “True” for the test to execute. Note: When unspecified, DCGM searches for configuration files for the specified GPU; if no configuration is found, the parameter defaults to “False”. |
| max_sbe_errors | Double | DCGM_FP64_BLANK | Threshold for single-bit error (SBE) detection. If set, the test will fail if SBE count exceeds this threshold. |
| run_if_gom_enabled | String | “True” | Whether to run the test if GPU Operating Mode (GOM) is enabled. |
| logfile | String | “stats_membw.json” | Output file for test statistics and results. |
| logfile_type | Double | 0.0 | Type of log file output format. |
| memory_size_mb | Double | 756.0 | Total memory size in MB to allocate for the bandwidth test. Must be between 6 MB (minimum) and 756 MB (maximum). When not specified, defaults to 756 MB. This is the total memory size for all arrays a, b, and c. |
| ignore_error_codes | String | “” | Comma-separated list of DCGM field result ( |
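As an illustration of the documented memory_size_mb range and default, the sketch below resolves a user-supplied value before a run. The helper name is hypothetical, and raising an error for out-of-range values is an assumption of this sketch; the table does not say how the plugin itself reacts to such values.

```python
DEFAULT_MEMORY_SIZE_MB = 756.0  # documented default
MIN_MEMORY_SIZE_MB = 6.0        # documented minimum
MAX_MEMORY_SIZE_MB = 756.0      # documented maximum


def resolve_memory_size_mb(value=None) -> float:
    """Return the memory size to use, falling back to the default.

    Rejecting out-of-range values with ValueError is a choice made for
    this sketch, not documented plugin behavior.
    """
    if value is None:
        return DEFAULT_MEMORY_SIZE_MB
    size = float(value)
    if not MIN_MEMORY_SIZE_MB <= size <= MAX_MEMORY_SIZE_MB:
        raise ValueError(
            f"memory_size_mb must be between {MIN_MEMORY_SIZE_MB} and "
            f"{MAX_MEMORY_SIZE_MB}, got {size}"
        )
    return size


print(resolve_memory_size_mb())       # unspecified: default applies
print(resolve_memory_size_mb("128"))  # string input, as passed on a command line
```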
Test Categories
The Memory Bandwidth plugin belongs to the following test categories:
Stress: Performs intensive operations to validate system stability under load
Sample Commands
Run a basic memory bandwidth test:
$ dcgmi diag -r memory_bandwidth -p "memory_bandwidth.is_allowed=True"
Run the test with a custom minimum bandwidth threshold:
$ dcgmi diag -r memory_bandwidth -p "memory_bandwidth.is_allowed=True;memory_bandwidth.minimum_bandwidth=500"
Run the test with a custom log file:
$ dcgmi diag -r memory_bandwidth -p "memory_bandwidth.is_allowed=True;memory_bandwidth.logfile=my_membw_test.json"
Run the test with multiple parameters:
$ dcgmi diag -r memory_bandwidth -p "memory_bandwidth.is_allowed=True;memory_bandwidth.minimum_bandwidth=300;memory_bandwidth.logfile=detailed_membw.json"
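When these commands are generated from a script, the parameter string can be assembled programmatically. The following sketch mirrors the `plugin.key=value` pairs joined by semicolons seen in the commands above; the helper names are hypothetical, and the sketch only builds and prints the command rather than running it.

```python
import shlex

PLUGIN = "memory_bandwidth"


def build_params(**settings) -> str:
    """Join settings as 'memory_bandwidth.key=value' pairs separated by ';'."""
    return ";".join(f"{PLUGIN}.{key}={value}" for key, value in settings.items())


def build_command(**settings) -> list:
    """Return the argument list for a dcgmi diag run of this plugin."""
    return ["dcgmi", "diag", "-r", PLUGIN, "-p", build_params(**settings)]


cmd = build_command(
    is_allowed="True",
    minimum_bandwidth=300,
    logfile="detailed_membw.json",
)

# shlex.join quotes the parameter string so the ';' separators are not
# interpreted by the shell.
print(shlex.join(cmd))
```

Passing the list form to `subprocess.run` would avoid shell quoting altogether, since each argument is delivered to the program unchanged.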
Failure Conditions
The test will fail if the achieved bandwidth is below the minimum bandwidth threshold
The test will fail if unrecoverable memory errors or CUDA errors occur during the test
The test will fail if the SBE error count exceeds the max_sbe_errors threshold (when set)
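The failure conditions above can be summarized as a small decision function. This is a hedged sketch rather than the plugin's logic: the function and argument names are hypothetical, and `None` stands in for an unset max_sbe_errors (the DCGM_FP64_BLANK default).

```python
from typing import List, Optional


def failure_reasons(
    achieved_mb_s: float,
    minimum_bandwidth: float = 100.0,
    unrecoverable_errors: int = 0,
    sbe_count: int = 0,
    max_sbe_errors: Optional[float] = None,
) -> List[str]:
    """Return the list of failed conditions; an empty list means a pass."""
    reasons = []
    if achieved_mb_s < minimum_bandwidth:
        reasons.append("bandwidth below minimum_bandwidth")
    if unrecoverable_errors > 0:
        reasons.append("unrecoverable memory or CUDA errors")
    # The SBE check applies only when a threshold has been set.
    if max_sbe_errors is not None and sbe_count > max_sbe_errors:
        reasons.append("SBE count exceeds max_sbe_errors")
    return reasons


print(failure_reasons(achieved_mb_s=850.0))
print(failure_reasons(achieved_mb_s=50.0, sbe_count=4, max_sbe_errors=2))
```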