Memory Bandwidth Plugin

Overview

The Memory Bandwidth plugin tests and validates the memory bandwidth performance of individual NVIDIA GPUs. It measures how fast each GPU can read from and write to its own memory, which is critical for applications that require high memory throughput.

The plugin targets the following aspects of GPU memory performance:

  • Memory Bandwidth: Measures the rate at which data can be read from and written to GPU memory within a single GPU

  • Memory Subsystem Stability: Validates that the memory subsystem can handle sustained high-bandwidth operations without errors

  • Memory Performance Validation: Ensures the GPU meets expected memory performance thresholds

Test Description

This test performs memory bandwidth measurements using the STREAM benchmark’s TRIAD operation. It allocates large memory arrays on each GPU and runs intensive memory operations to stress the memory subsystem. While the test runs, each GPU is monitored for memory errors and CUDA errors, and its measured bandwidth is checked against the configured threshold.

The test runs multiple iterations with different memory access patterns to find the optimal configuration for each GPU and measure the sustainable memory bandwidth. It validates that each GPU can achieve the specified minimum bandwidth threshold, ensuring the memory subsystem is performing as expected for high-bandwidth workloads.

The Memory Bandwidth plugin implements the STREAM benchmark, which consists of four main memory operations:

  1. COPY: a(i) = b(i) - Simple memory copy operation

  2. SCALE: a(i) = q*b(i) - Memory copy with scaling, where q is a scalar constant

  3. SUM: a(i) = b(i) + c(i) - Memory copy with addition

  4. TRIAD: a(i) = b(i) + q*c(i) - Memory copy with scaling and addition
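
The four operations above can be written out as a minimal CPU reference in pure Python. This is only an illustration of the arithmetic: the plugin runs its kernels on the GPU, and the function names below are hypothetical, not part of DCGM.

```python
# Pure-Python reference versions of the four STREAM operations.
# Illustrative only: function names are hypothetical and the plugin
# itself executes on the GPU.

def stream_copy(b):
    """COPY: a(i) = b(i)"""
    return [x for x in b]

def stream_scale(b, q):
    """SCALE: a(i) = q*b(i)"""
    return [q * x for x in b]

def stream_sum(b, c):
    """SUM: a(i) = b(i) + c(i)"""
    return [x + y for x, y in zip(b, c)]

def stream_triad(b, c, q):
    """TRIAD: a(i) = b(i) + q*c(i)"""
    return [x + q * y for x, y in zip(b, c)]

b = [1.0, 2.0, 3.0]
c = [10.0, 20.0, 30.0]
q = 0.5
print(stream_triad(b, c, q))  # [6.0, 12.0, 18.0]
```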

The plugin focuses on the TRIAD operation, which is the most memory-intensive of the four and provides the best measure of sustainable memory bandwidth. This operation was chosen because it:

  • Requires reading from two memory locations and writing to a third

  • Represents a common pattern in scientific computing and data processing

  • Provides the most comprehensive measure of memory bandwidth

  • Tests both read and write operations simultaneously
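
Because TRIAD reads two arrays and writes one, each element accounts for three memory transfers. The sketch below shows how a bandwidth figure follows from that count. It assumes 1 MB = 10^6 bytes, the convention of the original STREAM benchmark; this document does not state which convention the plugin uses, and the function names are hypothetical.

```python
def triad_bytes_moved(num_elements, element_bytes=4):
    """Bytes transferred by one TRIAD pass: two reads (b, c) plus one write (a)."""
    return 3 * num_elements * element_bytes

def bandwidth_mb_per_s(bytes_moved, seconds):
    """Convert a byte count and elapsed time to MB/s.

    ASSUMPTION: 1 MB = 10**6 bytes, as in the original STREAM benchmark.
    """
    return bytes_moved / seconds / 1e6

moved = triad_bytes_moved(1_000_000)     # 12,000,000 bytes
print(bandwidth_mb_per_s(moved, 0.5))    # 24.0
```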

The test proceeds as follows:

  1. Allocates memory based on the configured memory size parameter (default: 67,108,864 elements × 4 bytes × 3 arrays = 756 MB per GPU)

  2. Runs TRIAD operations with different memory access patterns

  3. Measures the bandwidth for each configuration

  4. Finds the best-performing configuration and compares its bandwidth against the minimum threshold

  5. Reports results, including the achieved bandwidth and any errors encountered
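
The measure-and-compare flow described above can be sketched on the CPU as follows. This is a stand-in under stated assumptions, not the plugin's implementation: the "configurations" are modeled as hypothetical chunk sizes, the arrays are small, and timing is taken with Python's `time.perf_counter`.

```python
import time
from array import array

def triad_chunked(a, b, c, q, chunk):
    """Run a(i) = b(i) + q*c(i) over the arrays, one chunk at a time."""
    n = len(a)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        for i in range(start, stop):
            a[i] = b[i] + q * c[i]

def find_best_bandwidth(n, chunk_sizes, q=3.0):
    """Time each configuration and return (best_chunk, best_mb_per_s)."""
    b = array("f", [1.0] * n)
    c = array("f", [2.0] * n)
    a = array("f", [0.0] * n)
    bytes_moved = 3 * n * a.itemsize  # two reads + one write per element
    best = (None, 0.0)
    for chunk in chunk_sizes:
        start = time.perf_counter()
        triad_chunked(a, b, c, q, chunk)
        elapsed = time.perf_counter() - start
        if elapsed <= 0:
            continue  # timer resolution too coarse to measure this run
        mb_per_s = bytes_moved / elapsed / 1e6  # ASSUMPTION: 1 MB = 10**6 bytes
        if mb_per_s > best[1]:
            best = (chunk, mb_per_s)
    return best

minimum_bandwidth = 100.0  # MB/s, the documented default threshold
chunk_sizes = [256, 4096, 65536]  # hypothetical configurations
chunk, achieved = find_best_bandwidth(50_000, chunk_sizes)
verdict = "PASS" if achieved >= minimum_bandwidth else "FAIL"
print(f"best chunk={chunk}, {achieved:.1f} MB/s, {verdict}")
```

Only the best configuration is compared against the threshold, which mirrors the description above. The absolute numbers from an interpreted CPU loop say nothing about GPU memory bandwidth.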

Preconditions

  • NVIDIA GPU with CUDA support

  • CUDA driver and runtime installed

  • DCGM host engine running

  • Enough free GPU memory for the test allocation (756 MB per GPU by default)

Parameters

The following table lists the parameters for the Memory Bandwidth plugin.

| Parameter Name | Type | Default | Description |
| --- | --- | --- | --- |
| minimum_bandwidth | Double | 100.0 | Minimum bandwidth in MB/s that must be achieved for the test to pass. |
| is_allowed | String | See description | Whether this test is allowed to run. Must be “True” for the test to execute. Note: When unspecified, DCGM searches for configuration files for the specified GPU; if no configuration is found, the parameter defaults to “False”. |
| max_sbe_errors | Double | DCGM_FP64_BLANK | Threshold for single-bit error (SBE) detection. If set, the test will fail if the SBE count exceeds this threshold. |
| run_if_gom_enabled | String | “True” | Whether to run the test if GPU Operating Mode (GOM) is enabled. |
| logfile | String | “stats_membw.json” | Output file for test statistics and results. |
| logfile_type | Double | 0.0 | Type of log file output format. |
| memory_size_mb | Double | 756.0 | Total memory size in MB to allocate for the bandwidth test. Must be between 6 MB (minimum) and 756 MB (maximum). When not specified, defaults to 756 MB. This is the total memory size for all arrays a, b, and c. |
| ignore_error_codes | String | “” | Comma-separated list of DCGM field result (DCGM_FR_*) error codes to ignore during test execution. These are specific error codes defined by the DCGM framework that can be suppressed if they are expected. |
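
To make the memory_size_mb range concrete, the sketch below converts a total size into a per-array element count. It rests on assumptions this document does not confirm: that 1 MB means 1,048,576 bytes, that the total is split evenly across the three arrays of 4-byte elements, and that an out-of-range value is rejected rather than clamped. The function and constant names are hypothetical.

```python
MIN_SIZE_MB = 6      # documented minimum
MAX_SIZE_MB = 756    # documented maximum and default
NUM_ARRAYS = 3       # arrays a, b, and c
ELEMENT_BYTES = 4    # element size given in the test description

def elements_per_array(memory_size_mb=MAX_SIZE_MB, bytes_per_mb=1024 * 1024):
    """Return how many elements each array holds for a total size in MB.

    ASSUMPTIONS: 1 MB = 1,048,576 bytes, an even split across arrays,
    and a ValueError for out-of-range input (the plugin may behave
    differently, for example by clamping).
    """
    if not MIN_SIZE_MB <= memory_size_mb <= MAX_SIZE_MB:
        raise ValueError(
            f"memory_size_mb must be between {MIN_SIZE_MB} and {MAX_SIZE_MB}"
        )
    total_bytes = int(memory_size_mb * bytes_per_mb)
    return total_bytes // (NUM_ARRAYS * ELEMENT_BYTES)

print(elements_per_array(6))     # 524288
print(elements_per_array(756))   # 66060288
```

Under these assumptions the 756 MB default corresponds to 66,060,288 elements per array, whereas three arrays of 67,108,864 four-byte elements would total 805,306,368 bytes (768 MB at 1,048,576 bytes per MB). The element count and the size quoted in this document therefore do not match exactly under this reading; the plugin's own behavior is authoritative.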

Test Categories

The Memory Bandwidth plugin belongs to the following test categories:

  • Stress: Performs intensive operations to validate system stability under load

Sample Commands

Run a basic memory bandwidth test:

$ dcgmi diag -r memory_bandwidth -p "memory_bandwidth.is_allowed=True"

Run the test with a custom minimum bandwidth threshold:

$ dcgmi diag -r memory_bandwidth -p "memory_bandwidth.is_allowed=True;memory_bandwidth.minimum_bandwidth=500"

Run the test with a custom log file:

$ dcgmi diag -r memory_bandwidth -p "memory_bandwidth.is_allowed=True;memory_bandwidth.logfile=my_membw_test.json"

Run the test with multiple parameters:

$ dcgmi diag -r memory_bandwidth -p "memory_bandwidth.is_allowed=True;memory_bandwidth.minimum_bandwidth=300;memory_bandwidth.logfile=detailed_membw.json"
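
When the test is launched from a script, the same `-p` string can be assembled from a dictionary. The helper below is a hypothetical convenience wrapper, not part of DCGM; it only reproduces the `memory_bandwidth.<name>=<value>` syntax, joined by semicolons, that the commands above use.

```python
import shlex

def build_membw_command(params):
    """Compose a `dcgmi diag` argument list for the memory_bandwidth test.

    Hypothetical helper: is_allowed=True is always included because the
    test does not run without it.
    """
    merged = {"is_allowed": "True", **params}
    param_str = ";".join(
        f"memory_bandwidth.{name}={value}" for name, value in merged.items()
    )
    return ["dcgmi", "diag", "-r", "memory_bandwidth", "-p", param_str]

cmd = build_membw_command(
    {"minimum_bandwidth": 300, "logfile": "detailed_membw.json"}
)
print(shlex.join(cmd))  # shell-quoted form of the command
```

The returned list can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting issues with the semicolons in the parameter string.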

Failure Conditions

  • The test will fail if the achieved bandwidth is below the minimum bandwidth threshold

  • The test will fail if unrecoverable memory errors or CUDA errors occur during the test

  • The test will fail if the SBE count exceeds the max_sbe_errors threshold (when set)
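
The three conditions above can be combined into a single pass/fail check, sketched below. The function name and arguments are hypothetical, and `None` stands in for an unset max_sbe_errors (DCGM_FP64_BLANK in the parameter table).

```python
def evaluate_result(achieved_mb_s, minimum_bandwidth=100.0,
                    sbe_count=0, max_sbe_errors=None,
                    had_fatal_error=False):
    """Return a list of failure reasons; an empty list means the test passed.

    Hypothetical model of the documented failure conditions.
    max_sbe_errors=None means the SBE threshold is not set.
    """
    failures = []
    if achieved_mb_s < minimum_bandwidth:
        failures.append("bandwidth below minimum_bandwidth")
    if had_fatal_error:
        failures.append("unrecoverable memory or CUDA error")
    if max_sbe_errors is not None and sbe_count > max_sbe_errors:
        failures.append("SBE count above max_sbe_errors")
    return failures

print(evaluate_result(250.0))                                  # []
print(evaluate_result(80.0, sbe_count=5, max_sbe_errors=2))
```

Collecting every reason, rather than stopping at the first, keeps a report informative when more than one condition is violated in the same run.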