NVBandwidth Plugin

Overview

The NVBandwidth plugin is included in diagnostic run levels 3 and higher. It measures bandwidth on NVIDIA GPUs within a single host.

Test Description

NVBandwidth measures bandwidth for various memcpy patterns across different links, using either copy engine or kernel copy methods. It reports the bandwidth currently measured on your system; additional system-specific tuning may be required to reach peak bandwidth. Tests run on the GPUs of a single host only. For more information, see the NVIDIA/nvbandwidth repository.

Supported Products

DCGM runs the NVBandwidth test on the following GPU products (where given, PCI device IDs appear in parentheses):

  • NVIDIA A800 (20bd)

  • NVIDIA B100 (197f)

  • NVIDIA B200 (1999, 199b, 20da)

  • NVIDIA B300 (20e6)

  • NVIDIA GH100-88K-A1

  • NVIDIA GH100-888K (2342, 237f)

  • NVIDIA H100 144GB HBM3

  • NVIDIA H100 80GB HBM3

  • NVIDIA H100NVL (2321, 233a)

  • NVIDIA H200 (2335, 233b)

  • NVIDIA H20A

  • NVIDIA H20B

  • NVIDIA H20 HBM3e

  • NVIDIA H20 NVL16

  • NVIDIA L2

  • NVIDIA L20

  • NVIDIA L20A

  • NVIDIA L30

  • NVIDIA L40S

  • NVIDIA P2021

  • NVIDIA PG153 SKU 210

  • NVIDIA RTX 2000 Ada Generation

  • NVIDIA RTX6000D

Supported Parameters

The following table lists the global parameters for this plugin:

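| Parameter Name | Type   | Default | Description                                                        |
|----------------|--------|---------|--------------------------------------------------------------------|
| testcases      | string |         | Comma-separated list of specific testcases to run, e.g. 0,1,2.     |
| is_allowed     | Bool   | False   | Specifies whether or not this test is allowed to run.              |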
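Because testcases takes a comma-separated list, scripts that hold the testcase IDs as separate arguments need to join them before passing the value to -p. The following is a minimal sketch of that step; the function name nvbandwidth_testcases_param is hypothetical and not part of DCGM.

```shell
#!/bin/sh
# Hypothetical helper (not part of DCGM): join testcase IDs with commas
# and print the resulting value for dcgmi's -p option.
# The body runs in a subshell so the IFS change does not leak to the caller.
nvbandwidth_testcases_param() (
    IFS=,
    # "$*" joins the arguments using the first character of IFS.
    printf 'nvbandwidth.testcases=%s\n' "$*"
)
```

It could then be used as: dcgmi diag -r nvbandwidth -p "$(nvbandwidth_testcases_param 0 1 2)"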

Sample Commands

Run the test with default parameters:

$ dcgmi diag -r nvbandwidth

Run the test, specifying only testcase 1:

$ dcgmi diag -r nvbandwidth -p nvbandwidth.testcases=1

Run the test, specifying multiple testcases:

$ dcgmi diag -r nvbandwidth -p nvbandwidth.testcases=1,2,3

Run the level 3 test, indicating the nvbandwidth test should be allowed to run:

$ dcgmi diag -r 3 -p nvbandwidth.is_allowed=true
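When a multi-testcase run fails, it can help to rerun each testcase in its own invocation to see which ones are responsible. The sketch below does that using only the command form shown above. It assumes that dcgmi exits with a non-zero status when the diagnostic fails; the function name failed_testcases and the DCGMI override variable are hypothetical conveniences, not DCGM features.

```shell
#!/bin/sh
# Hypothetical helper: run each nvbandwidth testcase separately and print
# the IDs whose run failed, separated by spaces.
# Assumption: dcgmi returns a non-zero exit status when the diagnostic fails.
# DCGMI (hypothetical) overrides the dcgmi binary, e.g. for a dry run.
failed_testcases() {
    failed=""
    for tc in "$@"; do
        # Output is discarded here; rerun a failing testcase by hand for details.
        if ! "${DCGMI:-dcgmi}" diag -r nvbandwidth \
                -p "nvbandwidth.testcases=$tc" >/dev/null 2>&1; then
            failed="$failed $tc"
        fi
    done
    # Strip the leading space before printing.
    echo "${failed# }"
}
```

For example, failed_testcases 1 2 3 prints nothing when all three runs succeed.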

Failure Conditions

  • The test will fail if the nvbandwidth executable cannot be found.

  • The test will fail if current memory copy utilization (MCUTIL) is over 10% or cannot be retrieved.

  • The test will fail if an error is encountered during nvbandwidth execution.
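The memory copy utilization condition above can be approximated with a pre-flight check before starting the diagnostic. The sketch below reads one utilization percentage per line and fails if any value is over the limit or is not a number. The function name mcutil_ok is hypothetical, and treating nvidia-smi's utilization.memory query as equivalent to the MCUTIL value DCGM reads is an assumption.

```shell
#!/bin/sh
# Hypothetical pre-flight check: read one utilization percentage per line
# from stdin. Return 1 if any value exceeds the limit (default 10) or is
# not a plain integer; return 0 otherwise.
mcutil_ok() {
    limit="${1:-10}"
    while read -r value; do
        case "$value" in
            # Empty or non-numeric (e.g. "[N/A]"): treat as "cannot be retrieved".
            ''|*[!0-9]*) return 1 ;;
        esac
        [ "$value" -le "$limit" ] || return 1
    done
    return 0
}

# Example use (assumes nvidia-smi is installed and that its utilization.memory
# field corresponds to the MCUTIL value DCGM checks):
#   nvidia-smi --query-gpu=utilization.memory --format=csv,noheader,nounits \
#       | mcutil_ok && dcgmi diag -r nvbandwidth
```

Note that empty input passes the check, so a host where the query returns no GPUs would need separate handling.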