NVBandwidth Plugin

Overview

The NVBandwidth plugin is part of the level 3 and higher tests. NVBandwidth performs bandwidth measurements on NVIDIA GPUs on a single host.

Test Description

NVBandwidth measures bandwidth for various memcpy patterns across different links, using either copy engine or kernel copy methods. It reports the bandwidth currently measured on your system; additional system-specific tuning may be required to achieve peak bandwidth. Tests are performed on GPUs on a single host only. For more information, see https://github.com/NVIDIA/nvbandwidth.

Supported Products

DCGM will run the NVBandwidth test on the following GPU products:

  • NVIDIA A800 (20bd)

  • NVIDIA B100 (197f)

  • NVIDIA B200 (1999, 199b, 20da)

  • NVIDIA B300 (20e6)

  • NVIDIA GH100-88K-A1

  • NVIDIA GH100-888K (2342, 237f)

  • NVIDIA H100 144GB HBM3

  • NVIDIA H100 80GB HBM3

  • NVIDIA H100NVL (2321, 233a)

  • NVIDIA H200 (2335, 233b)

  • NVIDIA H20A

  • NVIDIA H20B

  • NVIDIA H20 HBM3e

  • NVIDIA H20 NVL16

  • NVIDIA L2

  • NVIDIA L20

  • NVIDIA L20A

  • NVIDIA L30

  • NVIDIA L40S

  • NVIDIA P2021

  • NVIDIA PG153 SKU 210

  • NVIDIA RTX 2000 Ada Generation

  • NVIDIA RTX6000D

Supported Parameters

The following table lists the global parameters for this plugin:

Parameter Name    Type      Default    Description

testcases         string               The list of specific testcases to run, separated by commas, e.g. 0,1,2.

is_allowed        Bool      False      Specifies whether or not this test is allowed to run.

Sample Commands

Run the test with default parameters:

$ dcgmi diag -r nvbandwidth

Run the test, specifying only testcase 1:

$ dcgmi diag -r nvbandwidth -p nvbandwidth.testcases=1

Run the test, specifying multiple testcases:

$ dcgmi diag -r nvbandwidth -p nvbandwidth.testcases=1,2,3

Run the level 3 test, indicating the nvbandwidth test should be allowed to run:

$ dcgmi diag -r 3 -p nvbandwidth.is_allowed=true
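
When the set of testcases varies between runs, the comma-separated value for nvbandwidth.testcases can be assembled by a small wrapper. The following is a minimal sketch, not part of DCGM: the function name nvbandwidth_cmd is hypothetical, and it only prints the command line (using the flags shown in the samples above) rather than running it.

```shell
#!/bin/sh
# Hypothetical helper (not provided by DCGM): prints a dcgmi command line
# for the nvbandwidth test, joining the given testcase indices with commas.
nvbandwidth_cmd() {
    if [ "$#" -eq 0 ]; then
        # No testcases given: print the default invocation.
        echo "dcgmi diag -r nvbandwidth"
        return 0
    fi
    # "$*" joins the arguments with the first character of IFS; the
    # command substitution runs in a subshell, so IFS is unchanged outside.
    joined=$(IFS=,; echo "$*")
    echo "dcgmi diag -r nvbandwidth -p nvbandwidth.testcases=$joined"
}
```

For example, nvbandwidth_cmd 1 2 3 prints the same command as the multiple-testcase sample above.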

Failure Conditions

  • The test will fail if the nvbandwidth executable cannot be found.

  • The test will fail if current memory copy utilization (MCUTIL) is over 10% or cannot be retrieved.

  • The test will fail if an error is encountered during nvbandwidth execution.
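
The first two failure conditions can be checked before starting a run. The sketch below illustrates that logic only and is not how DCGM implements it: the function name nvbandwidth_preflight is hypothetical, the path to the executable is supplied by the caller (its install location is an assumption), and the memory copy utilization is expected as a whole-number percentage obtained by whatever monitoring tool is available.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of DCGM).
#   $1: path to the nvbandwidth executable (location is an assumption)
#   $2: current memory copy utilization as an integer percent,
#       or an empty string if it could not be retrieved
nvbandwidth_preflight() {
    exe=$1
    mcutil=$2
    if [ ! -x "$exe" ]; then
        echo "FAIL: nvbandwidth executable not found"
        return 1
    fi
    # Anything that is empty or not a plain non-negative integer is
    # treated as "could not be retrieved".
    case $mcutil in
        ''|*[!0-9]*)
            echo "FAIL: memory copy utilization could not be retrieved"
            return 1
            ;;
    esac
    # The documented threshold is "over 10%", so exactly 10 passes.
    if [ "$mcutil" -gt 10 ]; then
        echo "FAIL: memory copy utilization ${mcutil}% is over 10%"
        return 1
    fi
    echo "OK"
}
```

Errors raised during nvbandwidth execution itself are not covered by this check; they only surface once the test runs.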