DCGM Release Notes#

4.6.0#

Features#

  • Active health checks (dcgmi diag)

    • Added support for the following systems:

      • NVIDIA RTX PRO 5000 Blackwell (devId 0x2bb3 subsystem 0x227a)

    • Level 3 and level 4 diagnostics can now run on GPUs that do not have SKU-specific diagnostic calibration by passing -p "generic_mode=True". In this mode, affected tests run as stability checks instead of using calibrated performance or power thresholds.

    • The software test now reports GPUs that reach 512 uncorrectable row remaps as failed and recommends running field diagnostics.

  • Background health checks (dcgmi health)

    • Health monitoring now warns when total uncorrectable row remaps reach 512 on Ampere and newer GPUs, allowing administrators to run field diagnostics before GPU memory health degrades further.

  • Profiling metrics (dcgmi profile)

    • Added cumulative GPM profiling counters, allowing applications to read raw cycle and PCIe byte totals instead of ratio-only profiling metrics.

      • DCGM_FI_PROF_SM_CYCLES_ELAPSED_TOTAL

      • DCGM_FI_PROF_SM_CYCLES_ACTIVE_TOTAL

      • DCGM_FI_PROF_MMA_CYCLES_ACTIVE_TOTAL

      • DCGM_FI_PROF_DMMA_CYCLES_ACTIVE_TOTAL

      • DCGM_FI_PROF_HMMA_CYCLES_ACTIVE_TOTAL

      • DCGM_FI_PROF_IMMA_CYCLES_ACTIVE_TOTAL

      • DCGM_FI_PROF_DFMA_CYCLES_ACTIVE_TOTAL

      • DCGM_FI_PROF_PCIE_TX_BYTES_TOTAL

      • DCGM_FI_PROF_PCIE_RX_BYTES_TOTAL

      • DCGM_FI_PROF_INT_CYCLES_ACTIVE_TOTAL

      • DCGM_FI_PROF_FP64_CYCLES_ACTIVE_TOTAL

      • DCGM_FI_PROF_FP32_CYCLES_ACTIVE_TOTAL

      • DCGM_FI_PROF_FP16_CYCLES_ACTIVE_TOTAL

  • System monitoring (dcgmi dmon)

    • Added per-link NVLink5 COUNT and FEC history monitoring for Blackwell and newer GPUs. Use gpu_link:<gpuId>:<linkIndex> to query an individual GPU NVLink. Existing gpu:<id> queries continue to report aggregate values.

    • Added entity selectors for NVSwitch links and raw link IDs, including switch_link:<switchId>:<linkIndex> and hex values such as link:0x103.

    • Added range syntax for GPU and NVSwitch link selectors, such as gpu_link:0:{0-5}, gpu_link:{0-1}:{0-3}, and switch_link:{0-1}:{0-3}.

    • Added per-link NVLink fields that use gpu_link:<gpuId>:<linkIndex> selectors, allowing Rubin-class systems to monitor links beyond link 17 while preserving existing link 0 through link 17 field behavior.

      • DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_PER_LINK_TOTAL

      • DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_PER_LINK_TOTAL

      • DCGM_FI_DEV_NVLINK_REPLAY_ERROR_PER_LINK_TOTAL

      • DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_PER_LINK_TOTAL

      • DCGM_FI_DEV_NVLINK_THROUGHPUT_PER_LINK

      • DCGM_FI_DEV_NVLINK_TX_THROUGHPUT_PER_LINK

      • DCGM_FI_DEV_NVLINK_RX_THROUGHPUT_PER_LINK

      • DCGM_FI_PROF_NVLINK_TX_BYTES_PER_LINK

      • DCGM_FI_PROF_NVLINK_RX_BYTES_PER_LINK

    • Added support for monitoring supported NVSwitch telemetry and topology fields through the NVSDM backend on Blackwell systems.

    • Added GPU fabric health summary reporting.

      • DCGM_FI_DEV_FABRIC_HEALTH_SUMMARY reports the GPU fabric health summary.

  • Core

    • NVLink status and topology APIs now support up to 36 GPU NVLinks on Rubin-class systems. The latest dcgmGetNvLinkLinkStatus and topology APIs can report links beyond link 17, allowing dcgmi topo and other clients of those APIs to show the expanded link set. Legacy API versions continue to work for existing callers.

  • NVLink (dcgmi nvlink)

    • dcgmi nvlink -s --show-entity-ids now prints dcgm_link_t entity IDs for GPU and NVSwitch links. Use those IDs as link:<entityId> selectors in field-monitoring commands.

  • Packaging and platform support

    • Added package installation support for:

      • Ubuntu 26.04

      • Debian 13

      • Fedora 43

      • SLES 16
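
The link selector range syntax described under System monitoring above can be illustrated with a small expansion helper. This is only a sketch for reasoning about which links a selector names. It assumes that ranges are inclusive on both ends and that two ranges combine as a Cartesian product, and the function names are hypothetical, not part of DCGM or dcgmi.

```python
import itertools
import re

# Matches one range component such as "{0-5}".
_RANGE = re.compile(r"^\{(\d+)-(\d+)\}$")


def _expand_part(part):
    """Return the list of integer indices named by one selector component."""
    match = _RANGE.match(part)
    if match:
        low, high = int(match.group(1)), int(match.group(2))
        if low > high:
            raise ValueError(f"empty range: {part}")
        return list(range(low, high + 1))  # assumption: inclusive on both ends
    return [int(part)]


def expand_link_selector(selector):
    """Expand e.g. 'gpu_link:{0-1}:{0-3}' into individual link selectors."""
    kind, *parts = selector.split(":")
    if kind not in ("gpu_link", "switch_link") or len(parts) != 2:
        raise ValueError(f"unsupported selector: {selector}")
    parents, links = (_expand_part(p) for p in parts)
    # Assumption: every parent index is paired with every link index.
    return [f"{kind}:{p}:{l}" for p, l in itertools.product(parents, links)]
```

Under these assumptions, expand_link_selector("gpu_link:{0-1}:{0-3}") names eight links, from gpu_link:0:0 through gpu_link:1:3.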

Improvements#

  • Active health checks (dcgmi diag)

    • -c configuration files and -p parameters can now be used together. Explicit -p values override matching configuration file values, while the rest of the configuration file still applies.

    • P2P diagnostics now report more specific error codes when DCGM can identify the interconnect type, and fall back to a generic P2P error when the interconnect cannot be determined.

    • P2P diagnostics now include buffer comparison details when memory validation fails.

  • Background health checks (dcgmi health)

    • XID 93 corrupt InfoROM health warnings are now reported only on Volta GPUs and ignored on newer GPU architectures where the XID is not applicable.

    • XID 94 now reports a warning with application-restart guidance instead of reporting a GPU failure. This avoids unnecessary node drains for contained GPU application errors.

    • XID 64 row remap failures now report failure and recommend resetting the GPU or rebooting the node immediately.

    • A GPU recovery action of drain and reset now reports failure instead of warning.

  • Multinode diagnostics (dcgmi mndiag)

    • Multinode diagnostics now more reliably detect whether the MPI workload started successfully, reducing false failures when diagnostic output is delayed or formatted unexpectedly.

    • Multinode diagnostics now fail explicitly when expected host information is missing from mnubergemm output.

    • Multinode diagnostics now discover the Open MPI TCP interface at runtime, improving reliability on systems with multiple network interfaces.

    • Large mnubergemm stdout logs are handled more efficiently, reducing diagnostic runtime and log redirection overhead.

  • Core

    • GPU NVLink throughput field identifiers now use THROUGHPUT instead of BANDWIDTH. Deprecated BANDWIDTH aliases remain available for existing integrations.

      • DCGM_FI_DEV_NVLINK_BANDWIDTH_L* to DCGM_FI_DEV_NVLINK_THROUGHPUT_L* for link fields L0 through L17.

      • DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL to DCGM_FI_DEV_NVLINK_THROUGHPUT_TOTAL.

      • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L* to DCGM_FI_DEV_NVLINK_TX_THROUGHPUT_L* for link fields L0 through L17.

      • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_TOTAL to DCGM_FI_DEV_NVLINK_TX_THROUGHPUT_TOTAL.

      • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L* to DCGM_FI_DEV_NVLINK_RX_THROUGHPUT_L* for link fields L0 through L17.

      • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_TOTAL to DCGM_FI_DEV_NVLINK_RX_THROUGHPUT_TOTAL.

    • Canonical DCGM field identifier names were standardized in the C/C++ and Python APIs. Existing names remain available as deprecated aliases, so current integrations continue to work while new integrations can move to the clearer names.

      • System, GPU identity, topology, CUDA, and InfoROM identifiers:

        • DCGM_FI_UNKNOWN to DCGM_FI_SYSTEM_FIELD_UNKNOWN.

        • DCGM_FI_DRIVER_VERSION to DCGM_FI_SYSTEM_DRIVER_VERSION.

        • DCGM_FI_NVML_VERSION to DCGM_FI_SYSTEM_NVML_VERSION.

        • DCGM_FI_PROCESS_NAME to DCGM_FI_SYSTEM_PROCESS_NAME.

        • DCGM_FI_DEV_COUNT to DCGM_FI_SYSTEM_GPU_QUANTITY.

        • DCGM_FI_BIND_UNBIND_EVENT to DCGM_FI_SYSTEM_GPU_BIND_EVENT.

        • DCGM_FI_DEV_NAME to DCGM_FI_DEV_GPU_NAME.

        • DCGM_FI_DEV_BRAND to DCGM_FI_DEV_GPU_BRAND.

        • DCGM_FI_DEV_SERIAL to DCGM_FI_DEV_BOARD_SERIAL.

        • DCGM_FI_DEV_UUID to DCGM_FI_DEV_GPU_UUID.

        • DCGM_FI_DEV_MINOR_NUMBER to DCGM_FI_DEV_GPU_MINOR_NUMBER.

        • DCGM_FI_DEV_OEM_INFOROM_VER to DCGM_FI_DEV_INFOROM_OEM_VERSION.

        • DCGM_FI_DEV_ECC_INFOROM_VER to DCGM_FI_DEV_INFOROM_ECC_VERSION.

        • DCGM_FI_DEV_POWER_INFOROM_VER to DCGM_FI_DEV_INFOROM_POWER_VERSION.

        • DCGM_FI_DEV_INFOROM_IMAGE_VER to DCGM_FI_DEV_INFOROM_IMAGE_VERSION.

        • DCGM_FI_DEV_INFOROM_CONFIG_CHECK to DCGM_FI_DEV_INFOROM_CHECKSUM.

        • DCGM_FI_DEV_INFOROM_CONFIG_VALID to DCGM_FI_DEV_INFOROM_VALID.

        • DCGM_FI_DEV_PCI_BUSID to DCGM_FI_DEV_PCI_BUS_ID.

        • DCGM_FI_GPU_TOPOLOGY_PCI to DCGM_FI_SYSTEM_PCI_TOPOLOGY.

        • DCGM_FI_GPU_TOPOLOGY_NVLINK to DCGM_FI_SYSTEM_NVLINK_TOPOLOGY.

        • DCGM_FI_GPU_TOPOLOGY_AFFINITY to DCGM_FI_SYSTEM_GPU_AFFINITY.

        • DCGM_FI_DEV_CUDA_COMPUTE_CAPABILITY to DCGM_FI_CUDA_GPU_COMPUTE_CAPABILITY.

        • DCGM_FI_DEV_CUDA_VISIBLE_DEVICES_STR to DCGM_FI_CUDA_GPU_VISIBLE_DEVICES.

        • DCGM_FI_DEV_P2P_NVLINK_STATUS to DCGM_FI_DEV_NVLINK_P2P_STATUS.

        • DCGM_FI_DEV_COMPUTE_MODE to DCGM_FI_DEV_GPU_COMPUTE_MODE.

        • DCGM_FI_DEV_PERSISTENCE_MODE to DCGM_FI_DEV_GPU_PERSISTENCE_MODE.

        • DCGM_FI_DEV_MEM_AFFINITY_0 through DCGM_FI_DEV_MEM_AFFINITY_3 to DCGM_FI_DEV_MEMORY_AFFINITY_0 through DCGM_FI_DEV_MEMORY_AFFINITY_3.

        • DCGM_FI_SYNC_BOOST to DCGM_FI_SYSTEM_GPU_SYNC_BOOST.

      • GPU clock, temperature, power, fabric, memory, and health identifiers:

        • DCGM_FI_DEV_AUTOBOOST to DCGM_FI_DEV_CLOCKS_AUTOBOOST_MODE.

        • DCGM_FI_DEV_SUPPORTED_CLOCKS to DCGM_FI_DEV_CLOCKS_SUPPORTED.

        • DCGM_FI_DEV_MEMORY_TEMP to DCGM_FI_DEV_MEMORY_TEMP_CELSIUS.

        • DCGM_FI_DEV_GPU_TEMP to DCGM_FI_DEV_GPU_TEMP_CELSIUS.

        • DCGM_FI_DEV_MEM_MAX_OP_TEMP to DCGM_FI_DEV_MEMORY_MAX_OP_TEMP_CELSIUS.

        • DCGM_FI_DEV_GPU_MAX_OP_TEMP to DCGM_FI_DEV_GPU_MAX_OP_TEMP_CELSIUS.

        • DCGM_FI_DEV_GPU_TEMP_LIMIT to DCGM_FI_DEV_GPU_TEMP_MARGIN_CELSIUS.

        • DCGM_FI_DEV_SLOWDOWN_TEMP to DCGM_FI_DEV_GPU_TEMP_SLOWDOWN_CELSIUS.

        • DCGM_FI_DEV_SHUTDOWN_TEMP to DCGM_FI_DEV_GPU_TEMP_SHUTDOWN_CELSIUS.

        • DCGM_FI_DEV_POWER_USAGE to DCGM_FI_DEV_BOARD_POWER_WATTS.

        • DCGM_FI_DEV_POWER_USAGE_INSTANT to DCGM_FI_DEV_BOARD_POWER_RAW_WATTS.

        • DCGM_FI_DEV_POWER_MGMT_LIMIT to DCGM_FI_DEV_BOARD_POWER_LIMIT_REQUESTED_WATTS.

        • DCGM_FI_DEV_POWER_MGMT_LIMIT_MIN to DCGM_FI_DEV_BOARD_POWER_LIMIT_MIN_WATTS.

        • DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX to DCGM_FI_DEV_BOARD_POWER_LIMIT_MAX_WATTS.

        • DCGM_FI_DEV_POWER_MGMT_LIMIT_DEF to DCGM_FI_DEV_BOARD_POWER_LIMIT_DEFAULT_WATTS.

        • DCGM_FI_DEV_ENFORCED_POWER_LIMIT to DCGM_FI_DEV_BOARD_POWER_LIMIT_ENFORCED_WATTS.

        • DCGM_FI_DEV_REQUESTED_POWER_PROFILE_MASK to DCGM_FI_DEV_BOARD_POWER_PROFILE_REQUESTED_MASK.

        • DCGM_FI_DEV_ENFORCED_POWER_PROFILE_MASK to DCGM_FI_DEV_BOARD_POWER_PROFILE_ENFORCED_MASK.

        • DCGM_FI_DEV_VALID_POWER_PROFILE_MASK to DCGM_FI_DEV_BOARD_POWER_PROFILE_SUPPORTED_MASK.

        • DCGM_FI_DEV_FABRIC_MANAGER_ERROR_CODE to DCGM_FI_DEV_FABRIC_MANAGER_ERROR.

        • DCGM_FI_DEV_PSTATE to DCGM_FI_DEV_GPU_PSTATE.

        • DCGM_FI_DEV_PCIE_REPLAY_COUNTER to DCGM_FI_DEV_PCIE_REPLAY_TOTAL.

        • DCGM_FI_DEV_GPU_UTIL to DCGM_FI_DEV_GPU_UTIL_RATIO.

        • DCGM_FI_DEV_ACCOUNTING_DATA to DCGM_FI_DEV_PROCESS_ACCOUNTING_STATS.

        • DCGM_FI_DEV_XID_ERRORS to DCGM_FI_DEV_XID_ERROR.

        • DCGM_FI_DEV_FB_USED_PERCENT to DCGM_FI_DEV_FB_USED_RATIO.

        • DCGM_FI_DEV_C2C_LINK_COUNT to DCGM_FI_DEV_C2C_LINK_QUANTITY.

        • DCGM_FI_DEV_ECC_CURRENT to DCGM_FI_DEV_ECC_MODE.

        • DCGM_FI_DEV_THRESHOLD_SRM to DCGM_FI_DEV_SRAM_EXCEEDED.

        • DCGM_FI_DEV_BANKS_REMAP_ROWS_AVAIL_MAX to DCGM_FI_DEV_BANK_REMAP_AVAIL_MAX.

        • DCGM_FI_DEV_BANKS_REMAP_ROWS_AVAIL_HIGH to DCGM_FI_DEV_BANK_REMAP_AVAIL_HIGH.

        • DCGM_FI_DEV_BANKS_REMAP_ROWS_AVAIL_PARTIAL to DCGM_FI_DEV_BANK_REMAP_AVAIL_PARTIAL.

        • DCGM_FI_DEV_BANKS_REMAP_ROWS_AVAIL_LOW to DCGM_FI_DEV_BANK_REMAP_AVAIL_LOW.

        • DCGM_FI_DEV_BANKS_REMAP_ROWS_AVAIL_NONE to DCGM_FI_DEV_BANK_REMAP_AVAIL_NONE.

        • DCGM_FI_DEV_RETIRED_SBE to DCGM_FI_DEV_PAGE_RETIRED_SBE_TOTAL.

        • DCGM_FI_DEV_RETIRED_DBE to DCGM_FI_DEV_PAGE_RETIRED_DBE_TOTAL.

        • DCGM_FI_DEV_RETIRED_PENDING to DCGM_FI_DEV_PAGE_RETIRED_PENDING.

        • DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS to DCGM_FI_DEV_ROW_REMAP_UNCORRECTABLE_TOTAL.

        • DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS to DCGM_FI_DEV_ROW_REMAP_CORRECTABLE_TOTAL.

        • DCGM_FI_DEV_ROW_REMAP_FAILURE to DCGM_FI_DEV_ROW_REMAP_FAILED.

        • DCGM_FI_DEV_PCIE_COUNT_CORRECTABLE_ERRORS to DCGM_FI_DEV_PCIE_CORRECTABLE_ERROR_TOTAL.

        • DCGM_FI_DEV_MEMORY_UNREPAIRABLE_FLAG to DCGM_FI_DEV_MEMORY_UNREPAIRABLE.

        • DCGM_FI_DEV_GET_GPU_RECOVERY_ACTION to DCGM_FI_DEV_GPU_RECOVERY_ACTION.

      • CPU and ConnectX identifiers:

        • DCGM_FI_DEV_CPU_TEMP_CURRENT to DCGM_FI_DEV_CPU_TEMP_CELSIUS.

        • DCGM_FI_DEV_CPU_TEMP_WARNING to DCGM_FI_DEV_CPU_TEMP_WARNING_CELSIUS.

        • DCGM_FI_DEV_CPU_TEMP_CRITICAL to DCGM_FI_DEV_CPU_TEMP_CRITICAL_CELSIUS.

        • DCGM_FI_DEV_CPU_POWER_UTIL_CURRENT to DCGM_FI_DEV_CPU_POWER_WATTS.

        • DCGM_FI_DEV_CPU_POWER_LIMIT to DCGM_FI_DEV_CPU_POWER_LIMIT_WATTS.

        • DCGM_FI_DEV_CONNECTX_UNCORRECTABLE_ERR_STATUS to DCGM_FI_DEV_CONNECTX_UNCORRECTABLE_ERROR_STATUS.

        • DCGM_FI_DEV_CONNECTX_UNCORRECTABLE_ERR_MASK to DCGM_FI_DEV_CONNECTX_UNCORRECTABLE_ERROR_MASK.

        • DCGM_FI_DEV_CONNECTX_UNCORRECTABLE_ERR_SEVERITY to DCGM_FI_DEV_CONNECTX_UNCORRECTABLE_ERROR_SEVERITY.

      • NVLink field identifiers:

        • DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_L* to DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_L*_TOTAL for link fields L0 through L17.

        • DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL to DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_TOTAL.

        • DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_L* to DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_L*_TOTAL for link fields L0 through L17.

        • DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL to DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_TOTAL.

        • DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_L* to DCGM_FI_DEV_NVLINK_REPLAY_ERROR_L*_TOTAL for link fields L0 through L17.

        • DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL to DCGM_FI_DEV_NVLINK_REPLAY_ERROR_TOTAL.

        • DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_L* to DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_L*_TOTAL for link fields L0 through L17.

        • DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL to DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_TOTAL.

        • DCGM_FI_DEV_GPU_NVLINK_ERRORS to DCGM_FI_DEV_NVLINK_ERROR.

        • DCGM_FI_DEV_NVLINK_ERROR_DL_CRC to DCGM_FI_DEV_NVLINK_CRC_ERROR_TOTAL.

        • DCGM_FI_DEV_NVLINK_ERROR_DL_RECOVERY to DCGM_FI_DEV_NVLINK_RECOVERY_TOTAL.

        • DCGM_FI_DEV_NVLINK_ERROR_DL_REPLAY to DCGM_FI_DEV_NVLINK_REPLAY_TOTAL.

        • DCGM_FI_DEV_NVLINK_COUNT_TX_PACKETS to DCGM_FI_DEV_NVLINK_TX_PACKET_TOTAL.

        • DCGM_FI_DEV_NVLINK_COUNT_TX_BYTES to DCGM_FI_DEV_NVLINK_TX_BYTES_TOTAL.

        • DCGM_FI_DEV_NVLINK_COUNT_RX_PACKETS to DCGM_FI_DEV_NVLINK_RX_PACKET_TOTAL.

        • DCGM_FI_DEV_NVLINK_COUNT_RX_BYTES to DCGM_FI_DEV_NVLINK_RX_BYTES_TOTAL.

        • DCGM_FI_DEV_NVLINK_COUNT_RX_MALFORMED_PACKET_ERRORS to DCGM_FI_DEV_NVLINK_RX_PACKET_MALFORMED_TOTAL.

        • DCGM_FI_DEV_NVLINK_COUNT_RX_BUFFER_OVERRUN_ERRORS to DCGM_FI_DEV_NVLINK_RX_PACKET_DROPPED_TOTAL.

        • DCGM_FI_DEV_NVLINK_COUNT_RX_ERRORS to DCGM_FI_DEV_NVLINK_RX_ERROR_TOTAL.

        • DCGM_FI_DEV_NVLINK_COUNT_RX_REMOTE_ERRORS to DCGM_FI_DEV_NVLINK_RX_REMOTE_ERROR_TOTAL.

        • DCGM_FI_DEV_NVLINK_COUNT_RX_GENERAL_ERRORS to DCGM_FI_DEV_NVLINK_RX_GENERAL_ERROR_TOTAL.

        • DCGM_FI_DEV_NVLINK_COUNT_LOCAL_LINK_INTEGRITY_ERRORS to DCGM_FI_DEV_NVLINK_INTEGRITY_ERROR_TOTAL.

        • DCGM_FI_DEV_NVLINK_COUNT_LINK_RECOVERY_SUCCESSFUL_EVENTS to DCGM_FI_DEV_NVLINK_RECOVERY_SUCCESSFUL_TOTAL.

        • DCGM_FI_DEV_NVLINK_COUNT_LINK_RECOVERY_FAILED_EVENTS to DCGM_FI_DEV_NVLINK_RECOVERY_FAILED_TOTAL.

        • DCGM_FI_DEV_NVLINK_COUNT_LINK_RECOVERY_EVENTS to DCGM_FI_DEV_NVLINK_RECOVERY_EVENT_TOTAL.

        • DCGM_FI_DEV_NVLINK_COUNT_RX_SYMBOL_ERRORS to DCGM_FI_DEV_NVLINK_RX_SYMBOL_ERROR_TOTAL.

        • DCGM_FI_DEV_NVLINK_COUNT_SYMBOL_BER to DCGM_FI_DEV_NVLINK_SYMBOL_BER_RAW.

        • DCGM_FI_DEV_NVLINK_COUNT_SYMBOL_BER_FLOAT to DCGM_FI_DEV_NVLINK_SYMBOL_BER_RATIO.

        • DCGM_FI_DEV_NVLINK_COUNT_EFFECTIVE_BER to DCGM_FI_DEV_NVLINK_EFFECTIVE_BER_RAW.

        • DCGM_FI_DEV_NVLINK_COUNT_EFFECTIVE_BER_FLOAT to DCGM_FI_DEV_NVLINK_EFFECTIVE_BER_RATIO.

        • DCGM_FI_DEV_NVLINK_COUNT_EFFECTIVE_ERRORS to DCGM_FI_DEV_NVLINK_EFFECTIVE_ERROR_TOTAL.

        • DCGM_FI_DEV_NVLINK_ECC_DATA_ERROR_COUNT_TOTAL to DCGM_FI_DEV_NVLINK_ECC_ERROR_TOTAL.

        • DCGM_FI_DEV_NVLINK_PPCNT_RECOVERY_TOTAL_SUCCESSFUL_EVENTS to DCGM_FI_DEV_NVLINK_PPCNT_RECOVERY_SUCCESSFUL_TOTAL.

        • DCGM_FI_DEV_NVLINK_PPCNT_PHYSICAL_SUCCESSFUL_RECOVERY_EVENTS to DCGM_FI_DEV_NVLINK_PPCNT_PHYSICAL_RECOVERY_SUCCESSFUL_TOTAL.

        • DCGM_FI_DEV_NVLINK_PPCNT_PHYSICAL_LINK_DOWN_COUNTER to DCGM_FI_DEV_NVLINK_PPCNT_PHYSICAL_LINK_DOWN_TOTAL.

        • DCGM_FI_DEV_NVLINK_PPCNT_PLR_RCV_CODES to DCGM_FI_DEV_NVLINK_PPCNT_PLR_RX_CODE_TOTAL.

        • DCGM_FI_DEV_NVLINK_PPCNT_PLR_RCV_CODE_ERR to DCGM_FI_DEV_NVLINK_PPCNT_PLR_RX_CODE_ERROR_TOTAL.

        • DCGM_FI_DEV_NVLINK_PPCNT_PLR_RCV_UNCORRECTABLE_CODE to DCGM_FI_DEV_NVLINK_PPCNT_PLR_RX_CODE_UNCORRECTABLE_TOTAL.

        • DCGM_FI_DEV_NVLINK_PPCNT_PLR_XMIT_CODES to DCGM_FI_DEV_NVLINK_PPCNT_PLR_TX_CODE_TOTAL.

        • DCGM_FI_DEV_NVLINK_PPCNT_PLR_XMIT_RETRY_CODES to DCGM_FI_DEV_NVLINK_PPCNT_PLR_TX_RETRY_CODE_TOTAL.

        • DCGM_FI_DEV_NVLINK_PPCNT_PLR_XMIT_RETRY_EVENTS to DCGM_FI_DEV_NVLINK_PPCNT_PLR_TX_RETRY_EVENT_TOTAL.

        • DCGM_FI_DEV_NVLINK_PPCNT_PLR_SYNC_EVENTS to DCGM_FI_DEV_NVLINK_PPCNT_PLR_SYNC_EVENT_TOTAL.

      • vGPU field identifiers:

        • DCGM_FI_DEV_VIRTUAL_MODE to DCGM_FI_DEV_GPU_VIRTUAL_MODE.

        • DCGM_FI_DEV_SUPPORTED_TYPE_INFO to DCGM_FI_DEV_VGPU_SUPPORTED_INFO.

        • DCGM_FI_DEV_CREATABLE_VGPU_TYPE_IDS to DCGM_FI_DEV_VGPU_CREATABLE_IDS.

        • DCGM_FI_DEV_VGPU_INSTANCE_IDS to DCGM_FI_DEV_VGPU_INSTANCE_INFO.

        • DCGM_FI_DEV_VGPU_UTILIZATIONS to DCGM_FI_DEV_VGPU_UTIL_INFO.

        • DCGM_FI_DEV_VGPU_PER_PROCESS_UTILIZATION to DCGM_FI_DEV_VGPU_PROCESS_UTIL_INFO.

        • DCGM_FI_DEV_SUPPORTED_VGPU_TYPE_IDS to DCGM_FI_DEV_VGPU_SUPPORTED_IDS.

        • DCGM_FI_DEV_VGPU_INSTANCE_LICENSE_STATE to DCGM_FI_DEV_VGPU_INSTANCE_LICENSE_STATUS.

        • DCGM_FI_DEV_VGPU_VM_GPU_INSTANCE_ID to DCGM_FI_DEV_VGPU_GPU_INSTANCE_ID.

      • NVSwitch field identifiers:

        • DCGM_FI_DEV_NVSWITCH_POWER_VDD to DCGM_FI_DEV_NVSWITCH_POWER_VDD_WATTS.

        • DCGM_FI_DEV_NVSWITCH_POWER_DVDD to DCGM_FI_DEV_NVSWITCH_POWER_DVDD_WATTS.

        • DCGM_FI_DEV_NVSWITCH_POWER_HVDD to DCGM_FI_DEV_NVSWITCH_POWER_HVDD_WATTS.

        • DCGM_FI_DEV_NVSWITCH_LINK_REPLAY_ERRORS to DCGM_FI_DEV_NVSWITCH_LINK_REPLAY_ERROR_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_RECOVERY_ERRORS to DCGM_FI_DEV_NVSWITCH_LINK_RECOVERY_ERROR_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_FLIT_ERRORS to DCGM_FI_DEV_NVSWITCH_LINK_FLIT_ERROR_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS to DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERROR_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS to DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERROR_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_COUNT_VC0 to DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_SAMPLE_VC0_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_COUNT_VC1 to DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_SAMPLE_VC1_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_COUNT_VC2 to DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_SAMPLE_VC2_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_COUNT_VC3 to DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_SAMPLE_VC3_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE0 to DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERROR_L0_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE1 to DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERROR_L1_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE2 to DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERROR_L2_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE3 to DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERROR_L3_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE4 to DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERROR_L4_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE5 to DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERROR_L5_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE6 to DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERROR_L6_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE7 to DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERROR_L7_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE0 to DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERROR_L0_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE1 to DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERROR_L1_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE2 to DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERROR_L2_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE3 to DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERROR_L3_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE4 to DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERROR_L4_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE5 to DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERROR_L5_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE6 to DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERROR_L6_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE7 to DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERROR_L7_TOTAL.

        • DCGM_FI_DEV_NVSWITCH_TEMPERATURE_CURRENT to DCGM_FI_DEV_NVSWITCH_TEMP_CELSIUS.

        • DCGM_FI_DEV_NVSWITCH_TEMPERATURE_LIMIT_SLOWDOWN to DCGM_FI_DEV_NVSWITCH_TEMP_SLOWDOWN_CELSIUS.

        • DCGM_FI_DEV_NVSWITCH_TEMPERATURE_LIMIT_SHUTDOWN to DCGM_FI_DEV_NVSWITCH_TEMP_SHUTDOWN_CELSIUS.

        • DCGM_FI_DEV_NVSWITCH_PHYS_ID to DCGM_FI_DEV_NVSWITCH_PHYSICAL_ID.

        • DCGM_FI_DEV_NVSWITCH_FATAL_ERRORS to DCGM_FI_DEV_SXID_FATAL_ERROR.

        • DCGM_FI_DEV_NVSWITCH_NON_FATAL_ERRORS to DCGM_FI_DEV_SXID_NON_FATAL_ERROR.

        • DCGM_FI_DEV_NVSWITCH_LINK_DEVICE_LINK_ID to DCGM_FI_DEV_NVSWITCH_LINK_REMOTE_LINK_ID.

        • DCGM_FI_DEV_NVSWITCH_LINK_DEVICE_LINK_SID to DCGM_FI_DEV_NVSWITCH_LINK_REMOTE_LINK_SID.

        • DCGM_FI_DEV_NVSWITCH_DEVICE_UUID to DCGM_FI_DEV_NVSWITCH_UUID.

      • Profiling utilization identifiers:

        • DCGM_FI_PROF_GR_ENGINE_ACTIVE to DCGM_FI_PROF_GR_ENGINE_UTIL_RATIO.

        • DCGM_FI_PROF_SM_ACTIVE to DCGM_FI_PROF_SM_UTIL_RATIO.

        • DCGM_FI_PROF_SM_OCCUPANCY to DCGM_FI_PROF_SM_OCCUPANCY_RATIO.

        • DCGM_FI_PROF_PIPE_TENSOR_ACTIVE to DCGM_FI_PROF_TENSOR_UTIL_RATIO.

        • DCGM_FI_PROF_DRAM_ACTIVE to DCGM_FI_PROF_DRAM_UTIL_RATIO.

        • DCGM_FI_PROF_PIPE_FP64_ACTIVE to DCGM_FI_PROF_FP64_UTIL_RATIO.

        • DCGM_FI_PROF_PIPE_FP32_ACTIVE to DCGM_FI_PROF_FP32_UTIL_RATIO.

        • DCGM_FI_PROF_PIPE_FP16_ACTIVE to DCGM_FI_PROF_FP16_UTIL_RATIO.

        • DCGM_FI_PROF_PIPE_TENSOR_IMMA_ACTIVE to DCGM_FI_PROF_IMMA_UTIL_RATIO.

        • DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE to DCGM_FI_PROF_HMMA_UTIL_RATIO.

        • DCGM_FI_PROF_PIPE_TENSOR_DFMA_ACTIVE to DCGM_FI_PROF_DFMA_UTIL_RATIO.

        • DCGM_FI_PROF_PIPE_INT_ACTIVE to DCGM_FI_PROF_INT_UTIL_RATIO.

        • DCGM_FI_PROF_NVDEC0_ACTIVE through DCGM_FI_PROF_NVDEC7_ACTIVE to DCGM_FI_PROF_NVDEC_UTIL_0_RATIO through DCGM_FI_PROF_NVDEC_UTIL_7_RATIO.

        • DCGM_FI_PROF_NVJPG0_ACTIVE through DCGM_FI_PROF_NVJPG7_ACTIVE to DCGM_FI_PROF_NVJPG_UTIL_0_RATIO through DCGM_FI_PROF_NVJPG_UTIL_7_RATIO.

        • DCGM_FI_PROF_NVOFA0_ACTIVE and DCGM_FI_PROF_NVOFA1_ACTIVE to DCGM_FI_PROF_NVOFA_UTIL_0_RATIO and DCGM_FI_PROF_NVOFA_UTIL_1_RATIO.

  • System monitoring (dcgmi dmon)

    • NVSwitch throughput metadata tags were updated from bandwidth naming to throughput naming.

      • Aggregate NVSwitch fields now use nvswitch_throughput_tx and nvswitch_throughput_rx.

      • Link fields now use nvswitch_link_throughput_tx and nvswitch_link_throughput_rx.

    • Updated dcgmi dmon short column names for NVSwitch fields.

      • DCGM_FI_DEV_NVSWITCH_RESET_REQUIRED now uses SWRSTRQ instead of SWFRMVER.

      • DCGM_FI_DEV_NVSWITCH_FIRMWARE_VERSION now uses SWFRMVER.

  • Host engine

    • Improved compatibility with R610 and older drivers by using the legacy NVML initialization path by default on those driver branches.

    • IMEX status refreshes no longer block other DCGM requests while waiting on external nvidia-imex-ctl calls.
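
The GPU NVLink BANDWIDTH to THROUGHPUT renames listed under Core above follow a regular pattern, so existing scripts or dashboards that still use the deprecated aliases could be migrated mechanically. The helper below is a minimal sketch of that idea. The function name is hypothetical, and it covers only the six GPU NVLink identifier families named in this section (link fields L0 through L17 and the TOTAL fields), not the other renames.

```python
import re

# Deprecated GPU NVLink BANDWIDTH identifiers: an optional TX_/RX_ direction,
# followed by either a per-link suffix L0..L17 or TOTAL.
_DEPRECATED = re.compile(
    r"\bDCGM_FI_DEV_NVLINK_(TX_|RX_)?BANDWIDTH_(L(?:1[0-7]|[0-9])|TOTAL)\b"
)


def to_throughput_names(text):
    """Rewrite deprecated BANDWIDTH identifiers in text to THROUGHPUT names."""

    def _rename(match):
        direction = match.group(1) or ""  # empty for the bidirectional fields
        return f"DCGM_FI_DEV_NVLINK_{direction}THROUGHPUT_{match.group(2)}"

    return _DEPRECATED.sub(_rename, text)
```

Because the deprecated aliases remain available, a rewrite like this is optional and can be applied incrementally.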

Bug Fixes#

  • Active health checks (dcgmi diag)

    • Fixed a bug in which the software test could continue past a GPU state that already required recovery.

    • Fixed a bug in which diagnostic configuration handling could leave temporary /tmp/tmp-dcgm-* YAML files behind.

    • Reduced noisy debug logging from the memory_bandwidth diagnostic.

  • Background health checks (dcgmi health)

    • Fixed false-positive NVLink health failures caused by valid zero or sentinel BER values.

    • Fixed a bug in which IMEX daemon health could be reported for GPUs that do not use NVLink.

    • Updated NVLink BER thresholds to align symbol BER and effective BER health behavior with hardware guidance.

  • Multinode diagnostics (dcgmi mndiag)

    • Fixed reliability issues when running multinode diagnostics repeatedly on the same hosts.

  • System monitoring (dcgmi dmon)

    • Fixed a bug that caused NVLink throughput fields to remain unchanged because they were not refreshed from NVML.

    • Fixed a bug that caused nvlink_pprm_oper_recovery fields to report N/A when data was available.

    • Fixed a bug that caused power smoothing fields to report N/A on systems running driver 590 or newer when data was available.

    • Fixed a bug that could crash the host engine when querying unsupported NVSwitch topology fields through the NSCQ backend.

    • Fixed a bug that caused incorrect link entity IDs on NVSwitch-only systems.

  • Host engine

    • Fixed a bug in which DCGM could hang or log backend errors when the NVSwitch backend libraries were unavailable.

    • Fixed excessive nv-hostengine memory growth when high-frequency watches collected GPM profiling fields with large sample-retention settings.

    • Fixed a bug that could report PCIe policy violations from cumulative replay counters instead of new replay counter activity.

    • Reduced log noise when optional plugin shutdown entry points are absent.
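
The PCIe policy fix above turns on the difference between a cumulative counter value and new activity since the previous reading. The sketch below illustrates that distinction in general terms. It is not DCGM's implementation: the function names and the reset handling are assumptions made for illustration.

```python
def new_replay_events(samples):
    """Yield per-interval increases from successive cumulative counter readings."""
    previous = None
    for current in samples:
        if previous is not None:
            # Assumption: a lower reading means the counter was reset, so the
            # current value is taken as the new activity for that interval.
            yield current - previous if current >= previous else current
        previous = current


def violates(samples, max_new_replays):
    """Return True if any interval saw more new replays than allowed."""
    return any(delta > max_new_replays for delta in new_replay_events(samples))
```

With readings of 100, 100, 100, the cumulative value is high but no interval shows new replays, so a check based on deltas does not flag a violation.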

4.5.3#

Features#

  • Active health checks (dcgmi diag)

    • Extended utility diagnostic (EUD)

      • The heartbeat timeout can now be configured via the DCGM_EUD_HEARTBEAT_TIMEOUT_SECONDS environment variable.

      • The heartbeat timeout can be disabled via the DCGM_EUD_HEARTBEAT_TIMEOUT_DISABLED environment variable.

Improvements#

  • Active health checks (dcgmi diag)

    • Extended utility diagnostic (EUD)

      • Now runs even if a row-remapping failure is reported in the software active health check.

      • When a configuration error is detected that requires the health check to be rerun, the diagnostic error message now directs the user to an EUD log file that describes the recovery steps.

    • Added support for the following systems:

      • NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition (devId 2bb4)

      • HGX B200 168GB (devId 2909 subsystem 22eb)

Bug Fixes#

  • Active health checks (dcgmi diag)

    • Fixed a bug in which the extended utility diagnostic (EUD) active health check invoked the EUD binary with the default test profile when re-running as root.

  • Python bindings

    • Corrected the value of the DCGM_FR_PCIE_H_REPLAY_VIOLATION constant.

  • System monitoring (dcgmi dmon)

    • Fixed a bug which resulted in the following fields always being reported as blank:

      • DCGM_FI_DEV_C2C_LINK_ERROR_INTR

      • DCGM_FI_DEV_C2C_LINK_ERROR_REPLAY

      • DCGM_FI_DEV_C2C_LINK_ERROR_REPLAY_B2B

      • DCGM_FI_DEV_C2C_LINK_POWER_STATE

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_0

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_1

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_2

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_3

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_4

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_5

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_6

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_7

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_8

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_9

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_10

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_11

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_12

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_13

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_14

      • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_15

    • Fixed a bug that resulted in the following fields reporting values for NVLink 0 rather than aggregate values for all NVLinks:

      • DCGM_FI_DEV_NVLINK_COUNT_TX_PACKETS

      • DCGM_FI_DEV_NVLINK_COUNT_TX_BYTES

      • DCGM_FI_DEV_NVLINK_COUNT_RX_BYTES

      • DCGM_FI_DEV_NVLINK_COUNT_RX_PACKETS

      • DCGM_FI_DEV_NVLINK_COUNT_RX_MALFORMED_PACKET_ERRORS

      • DCGM_FI_DEV_NVLINK_COUNT_RX_BUFFER_OVERRUN_ERRORS

      • DCGM_FI_DEV_NVLINK_COUNT_RX_ERRORS

      • DCGM_FI_DEV_NVLINK_COUNT_RX_REMOTE_ERRORS

      • DCGM_FI_DEV_NVLINK_COUNT_RX_GENERAL_ERRORS

      • DCGM_FI_DEV_NVLINK_COUNT_LOCAL_LINK_INTEGRITY_ERRORS

      • DCGM_FI_DEV_NVLINK_COUNT_TX_DISCARDS

      • DCGM_FI_DEV_NVLINK_COUNT_LINK_RECOVERY_SUCCESSFUL_EVENTS

      • DCGM_FI_DEV_NVLINK_COUNT_LINK_RECOVERY_FAILED_EVENTS

      • DCGM_FI_DEV_NVLINK_COUNT_LINK_RECOVERY_EVENTS

      • DCGM_FI_DEV_NVLINK_COUNT_RX_SYMBOL_ERRORS

      • DCGM_FI_DEV_NVLINK_COUNT_SYMBOL_BER

      • DCGM_FI_DEV_NVLINK_COUNT_SYMBOL_BER_FLOAT

      • DCGM_FI_DEV_NVLINK_COUNT_EFFECTIVE_BER

      • DCGM_FI_DEV_NVLINK_COUNT_EFFECTIVE_BER_FLOAT

      • DCGM_FI_DEV_NVLINK_COUNT_EFFECTIVE_ERRORS

    • Fixed a bug that resulted in a crash when monitoring a system with more than 32 virtual GPUs per GPU.

4.5.2#

Bug Fixes#

  • Python bindings

    • Fixed a bug that caused attempts to import the DcgmGroup and DcgmFieldGroup Python modules to fail.

4.5.1#

Bug Fixes#

  • Active health checks (dcgmi diag)

    • NVBandwidth

      • Fixed a bug in which the NVBandwidth active health check would sometimes report a failure during level 4 diagnostics, citing memory copy utilization.

  • Multinode diagnostics (dcgmi mndiag)

    • Fixed a bug in which the mnubergemm subprocess could deadlock when the diagnostic was terminated prematurely.

  • Python bindings

    • Fixed a bug which caused attempts to import the DcgmDiag Python module to fail.

  • Profiling metrics (dcgmi profile)

    • Corrected a bug that caused the profiling module to be unable to load after a GPU bind/unbind cycle followed by an embedded DCGM service restart on systems with pre‑Hopper GPUs.

  • System monitoring (dcgmi dmon)

    • Fixed a bug in which NVIDIA NVLink PPCNT counters were only properly reported when two or more such metrics were monitored.

4.5.0#

Features#

  • Active health checks (dcgmi diag)

    • Diagnostic

      • Added the always_use_tensor parameter. When specified, the diagnostic test will always use tensor cores if available.

      • Added the clocks_tolerance_pcnt parameter. When specified, GPUs reporting a clock speed outside the specified percentage threshold (relative to the system mean) will be reported.

      • Added the power_tolerance_pcnt parameter. When specified, devices reporting a power usage outside the specified percentage threshold (relative to the system mean) will be reported.

      • Added the tolerance_pcnt parameter, a shorthand for setting both power_tolerance_pcnt and clocks_tolerance_pcnt.

    • Software

      • Added integration for hardware level unrepairable device memory error detection.

    • PCIe

      • Added a new parameter, max_pcie_correctable_errors, allowing users to configure the maximum number of PCIe correctable errors that can happen before the health check reports a failure.

    • Miscellaneous

      • Added support for the following systems:

        • GB203 (devId 2c3a subsystem 21f4)

      • Health checks now provide explicit status messages for non-active (e.g. unbound) GPUs.

      • GPU bind and unbind events will stop diagnostics in flight.

      • GPU serial numbers are included in JSON-format programmatic output.

  • Background health checks (dcgmi health)

    • GPU health checks account for non-active GPUs (e.g. unbound).

    • Added detection and reporting of multiple fabric health conditions, including route health, bandwidth degradation, route recovery, access timeout recovery, and configuration errors.

    • Added support for monitoring multiple NVLink health conditions, including ECC data error counts, local link integrity errors, link recovery events, and symbol bit error rates.

    • Added support for monitoring the NVIDIA Internode Memory Exchange (IMEX) domain and daemon statuses.

    • Added support for monitoring GPU driver health and recovery actions.

  • Configuration (dcgmi config)

    • GPU configuration will be preserved across GPU bind and unbind events.

  • Policy (dcgmi policy)

    • Policy operations account for non-active GPUs (e.g. unbound).

    • Policies registered on the meta group now automatically apply to newly attached GPUs.

  • Profiling metrics (dcgmi profile)

    • Added support for managing profiling watches across GPU attach and detach events.

  • System monitoring (dcgmi dmon)

    • Added the DCGM_FI_BIND_UNBIND_EVENT for notification of GPU bind and unbind events.

  • Topology (dcgmi topo)

    • Reports account for and exclude non-active GPUs (e.g. unbound).

  • Host engine

    • Added support for listening on the VSOCK protocol.

    • Added support for the following fields:

      • DCGM_FI_DEV_GET_GPU_RECOVERY_ACTION

      • DCGM_FI_DEV_GPU_RECOVERY_ACTION

      • DCGM_FI_DEV_MEMORY_UNREPAIRABLE_FLAG

      • DCGM_FI_DEV_NVLINK_ECC_DATA_ERROR_COUNT_TOTAL

      • DCGM_FI_DEV_NVLINK_PPCNT_IBPC_PORT_XMIT_WAIT

Improvements#

  • By default, dcgmi discovery now excludes non-active GPUs. The new -a/--all flag can be used to include non-active GPUs in the report.

  • The end-to-end time of the PCIe active health check was significantly improved.

  • The diagnostic active health check on Blackwell and newer hardware now uses a large matrix size and uses tensor cores by default.

  • Tuned the targeted power active health check for the B300 GPU (devId 3182).

Bug Fixes#

  • Active health checks (dcgmi diag)

    • NVBandwidth

      • Fixed a typo in an error message.

  • NVLink (dcgmi nvlink)

    • Fixed a potential overflow when reporting NVLink5-related fields.

  • Profiling metrics (dcgmi profile)

    • DCGM no longer attempts to load the legacy profiling library for systems with NVIDIA GPU performance monitoring hardware.

  • Test load generator tool (dcgmproftester)

    • Failure messages intended for non-programmatic consumption are now reported via standard error rather than standard output.

4.4.2#

Features#

  • Active health checks (dcgmi diag)

    • Added support for the following systems:

      • H20 NVL16 (devId 230c)

      • GB300 NVL Galaxy (devId 31C220E5), including workstation variants

      • GB300 MaxQ (devId 31a1)

    • The GPU memory plugin (memory) now provides a parameter, max_free_memory, to configure the amount of the memory allocated by the test.

    • The memory bandwidth plugin (memory_bandwidth) now provides a parameter, memory_size_mb, to configure the size of the memory buffer used for the test.

    • The memory stress plugin (memtest) now provides a parameter, minimum_allocation_percentage, to configure a minimum fraction of the GPU memory that must be available in order for the test to be conducted.

    • The PCIe plugin now accounts for hardware PCIe correctable errors counters (DCGM_FI_DEV_PCIE_COUNT_CORRECTABLE_ERRORS) when available.

    • The pulse test active health check is now supported for the following hardware:

      • PG153 SKU 210 (devId 2bb5)

      • RTX6000D (devId 2bb9)

    • Added the --enable-heartbeat flag. When specified, killing the dcgmi process associated with a running active health check will now result in the active health check being canceled.

  • Background health checks (dcgmi health)

    • Added support for DCGM_FI_DEV_PCIE_COUNT_CORRECTABLE_ERRORS.

  • System monitoring (dcgmi dmon)

    • Added support for the following fields:

      • DCGM_FI_DEV_FABRIC_HEALTH_MASK

      • DCGM_FI_DEV_PCIE_COUNT_CORRECTABLE_ERRORS

    • Added support for the following fields related to the NVIDIA Internode Memory Exchange (IMEX) daemon:

      • DCGM_FI_IMEX_DOMAIN_STATUS

      • DCGM_FI_IMEX_DAEMON_STATUS

Improvements#

  • A warning is now issued when the value of the CUDA_VISIBLE_DEVICES environment variable differs between the dcgmi process and the nv-hostengine process it interacts with.

  • The nvswitch module now logs connection status on state change rather than periodically.

  • The system monitor module (dmon) now reports chassis serial numbers in the same format as nvidia-smi.

  • The active health check software prologue test now reports the fabric manager service’s health mask in the event a fabric training error is observed.

  • Active health checks now log and terminate in the event of a hang.

  • The nv-hostengine process managed by the nvidia-dcgm systemd service now logs in the event of a hang.

  • Removed extraneous files from DEB and RPM packages.

Bug Fixes#

  • Corrected the name reported for NVIDIA CPUs in dcgmi discovery.

  • Corrected an error that resulted in only a subset of available cores being reported when querying entities corresponding to NVIDIA CPUs.

  • Corrected an issue that would result in a crash when the NVSwitch module enumerated ports of a system with a faulty link between an NVIDIA ConnectX-7 NIC and an NVIDIA Quantum-3 switch.

  • Active health checks (dcgmi diag)

    • The --help flag is now recognized and is synonymous with the -h flag.

    • The PCIe plugin now accounts for counter differences between NVLink5 devices and earlier generation NVLink hardware.

    • Corrected an error in how the nvbandwidth plugin computed the memory copy utilization for devices under test.

    • Corrected an error in how the memory stress plugin (memtest) determines whether sufficient memory is available to execute.

    • Corrected reference values for the targeted stress plugin for GB300 NVL Bianca hardware (devId 31c2).

  • Background health checks (dcgmi health)

    • Corrected an error in which XID values were misreported in user diagnostics.

  • System monitoring (dcgmi dmon)

    • Corrected an issue that resulted in the following fields being displayed as zero when not available on the system (as opposed to N/A).

      • DCGM_FI_DEV_NVLINK_PPRM_OPER_RECOVERY

      • DCGM_FI_DEV_NVLINK_PPCNT_RECOVERY_TIME_SINCE_LAST

      • DCGM_FI_DEV_NVLINK_PPCNT_RECOVERY_TIME_BETWEEN_LAST_TWO

      • DCGM_FI_DEV_NVLINK_PPCNT_RECOVERY_TOTAL_SUCCESSFUL_EVENTS

      • DCGM_FI_DEV_NVLINK_PPCNT_PHYSICAL_SUCCESSFUL_RECOVERY_EVENTS

      • DCGM_FI_DEV_NVLINK_PPCNT_PHYSICAL_LINK_DOWN_COUNTER

      • DCGM_FI_DEV_NVLINK_PPCNT_PLR_RCV_CODES

      • DCGM_FI_DEV_NVLINK_PPCNT_PLR_RCV_CODE_ERR

      • DCGM_FI_DEV_NVLINK_PPCNT_PLR_RCV_UNCORRECTABLE_CODE

      • DCGM_FI_DEV_NVLINK_PPCNT_PLR_XMIT_CODES

      • DCGM_FI_DEV_NVLINK_PPCNT_PLR_XMIT_RETRY_CODES

      • DCGM_FI_DEV_NVLINK_PPCNT_PLR_XMIT_RETRY_EVENTS

      • DCGM_FI_DEV_NVLINK_PPCNT_PLR_SYNC_EVENTS

      • DCGM_FI_DEV_NVSWITCH_VOLTAGE_MVOLT

      • DCGM_FI_DEV_NVSWITCH_CURRENT_IDDQ

      • DCGM_FI_DEV_NVSWITCH_CURRENT_IDDQ_REV

      • DCGM_FI_DEV_NVSWITCH_CURRENT_IDDQ_DVDD

      • DCGM_FI_DEV_NVSWITCH_POWER_VDD

      • DCGM_FI_DEV_NVSWITCH_POWER_DVDD

      • DCGM_FI_DEV_NVSWITCH_POWER_HVDD

      • DCGM_FI_DEV_NVSWITCH_LINK_THROUGHPUT_TX

      • DCGM_FI_DEV_NVSWITCH_LINK_THROUGHPUT_RX

      • DCGM_FI_DEV_NVSWITCH_LINK_FATAL_ERRORS

      • DCGM_FI_DEV_NVSWITCH_LINK_NON_FATAL_ERRORS

      • DCGM_FI_DEV_NVSWITCH_LINK_REPLAY_ERRORS

      • DCGM_FI_DEV_NVSWITCH_LINK_RECOVERY_ERRORS

      • DCGM_FI_DEV_NVSWITCH_LINK_FLIT_ERRORS

      • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS

      • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_LOW_VC0

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_LOW_VC1

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_LOW_VC2

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_LOW_VC3

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_MEDIUM_VC0

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_MEDIUM_VC1

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_MEDIUM_VC2

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_MEDIUM_VC3

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_HIGH_VC0

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_HIGH_VC1

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_HIGH_VC2

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_HIGH_VC3

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_PANIC_VC0

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_PANIC_VC1

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_PANIC_VC2

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_PANIC_VC3

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_COUNT_VC0

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_COUNT_VC1

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_COUNT_VC2

      • DCGM_FI_DEV_NVSWITCH_LINK_LATENCY_COUNT_VC3

      • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE0

      • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE1

      • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE2

      • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE3

      • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE0

      • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE1

      • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE2

      • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE3

      • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE4

      • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE5

      • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE6

      • DCGM_FI_DEV_NVSWITCH_LINK_CRC_ERRORS_LANE7

      • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE4

      • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE5

      • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE6

      • DCGM_FI_DEV_NVSWITCH_LINK_ECC_ERRORS_LANE7

      • DCGM_FI_DEV_NVSWITCH_FATAL_ERRORS

      • DCGM_FI_DEV_NVSWITCH_NON_FATAL_ERRORS

      • DCGM_FI_DEV_NVSWITCH_TEMPERATURE_CURRENT

      • DCGM_FI_DEV_NVSWITCH_TEMPERATURE_LIMIT_SLOWDOWN

      • DCGM_FI_DEV_NVSWITCH_TEMPERATURE_LIMIT_SHUTDOWN

      • DCGM_FI_DEV_NVSWITCH_THROUGHPUT_TX

      • DCGM_FI_DEV_NVSWITCH_THROUGHPUT_RX

      • DCGM_FI_DEV_NVSWITCH_PHYS_ID

      • DCGM_FI_DEV_NVSWITCH_RESET_REQUIRED

      • DCGM_FI_DEV_NVSWITCH_LINK_ID

      • DCGM_FI_DEV_NVSWITCH_PCIE_DOMAIN

      • DCGM_FI_DEV_NVSWITCH_PCIE_BUS

      • DCGM_FI_DEV_NVSWITCH_PCIE_DEVICE

      • DCGM_FI_DEV_NVSWITCH_PCIE_FUNCTION

      • DCGM_FI_DEV_NVSWITCH_LINK_STATUS

      • DCGM_FI_DEV_NVSWITCH_LINK_TYPE

      • DCGM_FI_DEV_NVSWITCH_LINK_REMOTE_PCIE_DOMAIN

      • DCGM_FI_DEV_NVSWITCH_LINK_REMOTE_PCIE_BUS

      • DCGM_FI_DEV_NVSWITCH_LINK_REMOTE_PCIE_DEVICE

      • DCGM_FI_DEV_NVSWITCH_LINK_REMOTE_PCIE_FUNCTION

      • DCGM_FI_DEV_NVSWITCH_LINK_DEVICE_LINK_ID

      • DCGM_FI_DEV_NVSWITCH_LINK_DEVICE_LINK_SID

      • DCGM_FI_DEV_CPU_POWER_LIMIT

      • DCGM_FI_DEV_CONNECTX_HEALTH

      • DCGM_FI_DEV_CONNECTX_ACTIVE_PCIE_LINK_WIDTH

      • DCGM_FI_DEV_CONNECTX_ACTIVE_PCIE_LINK_SPEED

      • DCGM_FI_DEV_CONNECTX_EXPECT_PCIE_LINK_WIDTH

      • DCGM_FI_DEV_CONNECTX_EXPECT_PCIE_LINK_SPEED

      • DCGM_FI_DEV_CONNECTX_CORRECTABLE_ERR_STATUS

      • DCGM_FI_DEV_CONNECTX_CORRECTABLE_ERR_MASK

      • DCGM_FI_DEV_CONNECTX_UNCORRECTABLE_ERR_STATUS

      • DCGM_FI_DEV_CONNECTX_UNCORRECTABLE_ERR_MASK

      • DCGM_FI_DEV_CONNECTX_UNCORRECTABLE_ERR_SEVERITY

      • DCGM_FI_DEV_CONNECTX_DEVICE_TEMPERATURE

4.4.1#

New Features#

  • Active health checks (dcgmi diag)

    • Added support for RTX6000D systems (devId 2bb9)

Bug Fixes#

  • Fixed a packaging bug that corrupted a binary installed by the proprietary CUDA13 RPM package that is leveraged by the pulse test diagnostic.

4.4.0#

New Features#

  • Active health checks (dcgmi diag)

    • Added support for several GPUs

      • P2021 (devId 29bb)

      • B300 (devId 3182 subsystem ID 20E610DE)

      • GB300 NVL Bianca (devId 31c2)

  • Background health checks (dcgmi health)

    • Expanded XID monitoring

      • Critical hardware errors monitored regardless of which health subsystems are enabled

        • Memory - Volatile DBE Detected (48)

        • NVLink - Critical Error (74)

        • All subsystems - Fallen Off Bus (79)

        • Memory - Contained Error (94)

        • Memory - Uncontained Error (95)

        • Memory - ECC Unrecovered Error (140)

      • Memory Subsystem (DCGM_HEALTH_WATCH_MEM)

        • MMU Error (31)

        • PBDMA Error (32)

        • Reset Channel Verification Error (43)

        • Pending Page Retirements (63)

        • Row Remap Failure (64)

      • PCIe Subsystem (DCGM_HEALTH_WATCH_PCIE)

        • PCIe Bus Error (38)

        • PCIe Fabric Error (39)

        • PCI Replay Rate (42)

        • PCIe BDF Reset (74)

      • Thermal Subsystem (DCGM_HEALTH_WATCH_THERMAL)

        • Clocks Event Thermal (60)

        • EDPP Power Brake Thermal Limit (61)

        • Thermal Violations (62)

        • Thermal Diode Short Detection (63)

      • Power Subsystem (DCGM_HEALTH_WATCH_POWER)

        • Power State Change (54)

        • Clock Change (56)

        • Clock Change Due to Power (57)

        • Clock Change Due to Thermal (58)

        • Power State Forced Change (78)

      • NVLink Subsystem (DCGM_HEALTH_WATCH_NVLINK)

        • NVLink Error Threshold (67)

        • NVLink Flow Control Error (73)

        • NVLink Error (74)

        • C2C Link Corrected Error (121)

        • NVLink FLA Privilege Error (137)

      • InfoROM Subsystem (DCGM_HEALTH_WATCH_INFOROM)

        • Non-fatal violation of provisioned InfoROM wear limit (93)
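The XID groupings listed above can be written down as a lookup table. This is only a sketch built from the list in these notes: the watch names are kept as plain strings rather than DCGM's bitmask values, and the helper names watches_for_xid and is_reported are hypothetical.

```python
# XIDs monitored regardless of which health subsystems are enabled.
ALWAYS_MONITORED_XIDS = {48, 74, 79, 94, 95, 140}

# XIDs grouped by health watch subsystem, transcribed from the list above.
XIDS_BY_WATCH = {
    "DCGM_HEALTH_WATCH_MEM": {31, 32, 43, 63, 64},
    "DCGM_HEALTH_WATCH_PCIE": {38, 39, 42, 74},
    "DCGM_HEALTH_WATCH_THERMAL": {60, 61, 62, 63},
    "DCGM_HEALTH_WATCH_POWER": {54, 56, 57, 58, 78},
    "DCGM_HEALTH_WATCH_NVLINK": {67, 73, 74, 121, 137},
    "DCGM_HEALTH_WATCH_INFOROM": {93},
}


def watches_for_xid(xid):
    """Return the sorted names of the subsystems that list this XID."""
    return sorted(name for name, xids in XIDS_BY_WATCH.items() if xid in xids)


def is_reported(xid, enabled_watches):
    """Assumed reading of the notes: an XID is reported if it is always
    monitored or if any enabled subsystem lists it."""
    if xid in ALWAYS_MONITORED_XIDS:
        return True
    return any(xid in XIDS_BY_WATCH.get(watch, ()) for watch in enabled_watches)


print(watches_for_xid(74))
print(is_reported(93, ["DCGM_HEALTH_WATCH_MEM"]))
```

Some XIDs appear under more than one heading in the list (for example 63 and 74), which is why the lookup returns a list of subsystems rather than a single name.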

  • Configuration management (dcgmi config)

    • Added support for overwriting the workload power profile.

  • Multi-node diagnostics (dcgmi mndiag)

    • Added support for the following GPUs

      • GB300 NVL Bianca (devId 31c2)

  • System monitoring (dcgmi dmon)

    • Added support for the following GPU performance monitoring (GPM) host memory utilization metrics:

      • DCGM_FI_PROF_HOSTMEM_CACHE_HIT

      • DCGM_FI_PROF_HOSTMEM_CACHE_MISS

    • Added support for the following GPU performance monitoring (GPM) peer memory utilization metrics:

      • DCGM_FI_PROF_PEERMEM_CACHE_HIT

      • DCGM_FI_PROF_PEERMEM_CACHE_MISS

    • Added support for querying the following power smoothing fields:

      • DCGM_FI_DEV_PWR_SMOOTHING_ACTIVE_PRESET_PROFILE

      • DCGM_FI_DEV_PWR_SMOOTHING_ADMIN_OVERRIDE_PERCENT_TMP_FLOOR

      • DCGM_FI_DEV_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_HYST_VAL

      • DCGM_FI_DEV_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_RATE

      • DCGM_FI_DEV_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_UP_RATE

      • DCGM_FI_DEV_PWR_SMOOTHING_APPLIED_TMP_CEIL

      • DCGM_FI_DEV_PWR_SMOOTHING_APPLIED_TMP_FLOOR

      • DCGM_FI_DEV_PWR_SMOOTHING_ENABLED

      • DCGM_FI_DEV_PWR_SMOOTHING_HW_CIRCUITRY_PERCENT_LIFETIME_REMAINING

      • DCGM_FI_DEV_PWR_SMOOTHING_IMM_RAMP_DOWN_ENABLED

      • DCGM_FI_DEV_PWR_SMOOTHING_MAX_NUM_PRESET_PROFILES

      • DCGM_FI_DEV_PWR_SMOOTHING_MAX_PERCENT_TMP_FLOOR_SETTING

      • DCGM_FI_DEV_PWR_SMOOTHING_MIN_PERCENT_TMP_FLOOR_SETTING

      • DCGM_FI_DEV_PWR_SMOOTHING_PRIV_LVL

      • DCGM_FI_DEV_PWR_SMOOTHING_PROFILE_PERCENT_TMP_FLOOR

      • DCGM_FI_DEV_PWR_SMOOTHING_PROFILE_RAMP_DOWN_HYST_VAL

      • DCGM_FI_DEV_PWR_SMOOTHING_PROFILE_RAMP_DOWN_RATE

      • DCGM_FI_DEV_PWR_SMOOTHING_PROFILE_RAMP_UP_RATE

      Note

      Querying these fields requires that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via the Redfish API).

    • Added support for the following link-based NVLink metrics

      • DCGM_FI_DEV_NVLINK_PPCNT_PHYSICAL_LINK_DOWN_COUNTER

      • DCGM_FI_DEV_NVLINK_PPCNT_PHYSICAL_SUCCESSFUL_RECOVERY_EVENTS

      • DCGM_FI_DEV_NVLINK_PPCNT_PLR_RCV_CODES

      • DCGM_FI_DEV_NVLINK_PPCNT_PLR_RCV_CODE_ERR

      • DCGM_FI_DEV_NVLINK_PPCNT_PLR_RCV_UNCORRECTABLE_CODE

      • DCGM_FI_DEV_NVLINK_PPCNT_PLR_SYNC_EVENTS

      • DCGM_FI_DEV_NVLINK_PPCNT_PLR_XMIT_CODES

      • DCGM_FI_DEV_NVLINK_PPCNT_PLR_XMIT_RETRY_CODES

      • DCGM_FI_DEV_NVLINK_PPCNT_PLR_XMIT_RETRY_EVENTS

      • DCGM_FI_DEV_NVLINK_PPCNT_RECOVERY_TIME_BETWEEN_LAST_TWO

      • DCGM_FI_DEV_NVLINK_PPCNT_RECOVERY_TIME_SINCE_LAST

      • DCGM_FI_DEV_NVLINK_PPCNT_RECOVERY_TOTAL_SUCCESSFUL_EVENTS

      • DCGM_FI_DEV_NVLINK_PPRM_OPER_RECOVERY

  • Packaging

    • New packages supporting CUDA 13 systems

      • datacenter-gpu-manager-4-cuda13

      • datacenter-gpu-manager-4-multinode-cuda13

      • datacenter-gpu-manager-4-proprietary-cuda13

      Warning

      See the caveat in the Pre-Requisites subsection of the Installation section of the Getting Started documentation regarding systems with Maxwell, Volta, and Pascal generation GPUs using driver version 580.

    • DCGM packages added to NVIDIA network package repositories for the following architectures and Linux distributions

      • Server Based Systems Architecture (SBSA)

        • Red Hat Enterprise Linux 10

        • Rocky Linux 10

      • x86_64

        • Red Hat Enterprise Linux 10

        • Rocky Linux 10

Improvements#

  • Active health checks (dcgmi diag)

    • The active health checks now monitor spawned threads and processes and report when they appear to become unresponsive.

    • The output of the CPU extended utility diagnostic (EUD) active health check now provides a hint when appropriate to verify inventory information and where to find it.

  • Multi-node diagnostics (dcgmi mndiag)

    • The mnubergemm diagnostic now ensures the mnubergemm executable binary resides at the same path on each participating device.

  • Miscellaneous

    • The dcgmi -v command more clearly distinguishes information for the dcgmi binary and nv-hostengine.

Bug Fixes#

  • Fixed a bug in which dcgmi encounters a segmentation fault when signaled with SIGINT while the active health check is running in the embedded mode.

  • Fixed a bug in which signals sent to terminate in-flight active health checks (dcgmi diag) while the extended utility diagnostic (EUD) plugin was running could result in one or more (potentially hung) orphaned processes.

  • Fixed a bug in which the system monitor (dcgmi dmon) failed to recognize NVIDIA CPUs on some Linux distributions.

  • Fixed a bug involving spurious log messages regarding the DCGM_FI_DEV_NVSWITCH_NON_FATAL_ERRORS field on systems using the NVSwitch Device Monitoring (NVSDM) library.

  • Fixed a bug in which attempting to collect profiling metrics from an nv-hostengine process associated with a non-root user on systems with pre-Hopper generation GPUs would incorrectly report that the profiling module had suffered an unrecoverable error rather than report that insufficient permissions were available to the process.

  • Fixed a bug in which the nvbandwidth active health check encountered an error on Ampere and Ada generation GPUs.

  • Fixed a bug in which the following NVIDIA CPU fields were not properly reported by the system monitor (dcgmi dmon)

    • CPUTP (field 1110)

    • CPUTW (field 1111)

    • CPUTC (field 1112)

    • CPUPU (field 1130)

    • CPUPL (field 1131)

    • CPUMN (field 1141)
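As an illustration, the field IDs listed above can be passed to dcgmi dmon with its -e option. The sketch below only assembles the command line; the helper name dmon_command is hypothetical, and entity selection for NVIDIA CPUs is omitted because it is not described in these notes.

```python
import shlex

# Field tags and IDs as listed in the fix above.
CPU_FIELD_IDS = {
    "CPUTP": 1110,
    "CPUTW": 1111,
    "CPUTC": 1112,
    "CPUPU": 1130,
    "CPUPL": 1131,
    "CPUMN": 1141,
}


def dmon_command(field_ids, samples=1):
    """Build a dcgmi dmon argument list for the given field IDs.

    -e takes a comma-separated list of field IDs and -c limits the number
    of samples collected before dcgmi exits.
    """
    ids = ",".join(str(field_id) for field_id in field_ids)
    return ["dcgmi", "dmon", "-e", ids, "-c", str(samples)]


print(shlex.join(dmon_command(CPU_FIELD_IDS.values())))
```

The resulting list can be handed to subprocess.run on a host where dcgmi is installed.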

4.3.1#

Bug Fixes#

  • Fixed a bug in which the nv-hostengine process continuously consumed the processing resources of a CPU core.

4.3.0#

New Features#

  • Multi-node diagnostic module added

    • This functionality is distributed in dedicated packages. See the updated Getting Started page for more information.

  • Added new NVLink monitoring fields related to the peer-to-peer (P2P) status

    • DCGM_FI_DEV_P2P_NVLINK_STATUS

  • Added generic support for B200 GPUs

Improvements#

  • The background health checks (i.e. the health module) now account for the bit error rate (BER) and symbol error count metrics introduced in NVLink 5.

  • The PCIe active health check now accounts for the bit error rate (BER) and symbol error count metrics introduced in NVLink 5.

  • DCGM packages added to NVIDIA network package repositories for the following architectures and Linux distributions

    • Server Based Systems Architecture (SBSA)

      • Debian 12

      • Amazon Linux 2023

      • Azure Linux 3.0

    • x86_64

      • Amazon Linux 2023

      • Azure Linux 3.0

Bug Fixes#

  • The NVSwitch module now honors the configured watch interval for the purpose of reporting associated metrics.

  • Fixed a bug that resulted in the NVSwitch Device Monitoring (NVSDM) library failing to initialize monitoring for some supported devices.

  • DCGM DEB packages now account for dependencies on the following system packages

    • lshw

    • nologin

    • passwd

  • DCGM RPM packages now account for dependencies on the following system packages

    • lshw

    • login

    • shadow-utils

  • Corrected the definition of dcgm_link_t (found in dcgm_structs.h and dcgm_structs.py)

4.2.3#

New Features#

  • Add new NVLink monitoring fields related to Bit Error Rate (BER):

    • DCGM_FI_DEV_NVLINK_COUNT_SYMBOL_BER_FLOAT

    • DCGM_FI_DEV_NVLINK_COUNT_EFFECTIVE_BER

    • DCGM_FI_DEV_NVLINK_COUNT_EFFECTIVE_BER_FLOAT

    • DCGM_FI_DEV_NVLINK_COUNT_EFFECTIVE_ERRORS

  • Add new NVLink-C2C (chip-to-chip interconnect) monitoring fields:

    • DCGM_FI_DEV_C2C_LINK_ERROR_INTR

    • DCGM_FI_DEV_C2C_LINK_ERROR_REPLAY

    • DCGM_FI_DEV_C2C_LINK_ERROR_REPLAY_B2B

    • DCGM_FI_DEV_C2C_LINK_POWER_STATE

  • Add new NVLink Forward Error Correction (FEC) monitoring fields:

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_0

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_1

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_2

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_3

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_4

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_5

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_6

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_7

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_8

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_9

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_10

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_11

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_12

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_13

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_14

    • DCGM_FI_DEV_NVLINK_COUNT_FEC_HISTORY_15

  • Add new monitoring fields related to clock events:

    • DCGM_FI_DEV_CLOCKS_EVENT_REASON_SW_POWER_CAP_NS

    • DCGM_FI_DEV_CLOCKS_EVENT_REASON_SYNC_BOOST_NS

    • DCGM_FI_DEV_CLOCKS_EVENT_REASON_SW_THERM_SLOWDOWN_NS

    • DCGM_FI_DEV_CLOCKS_EVENT_REASON_HW_THERM_SLOWDOWN_NS

    • DCGM_FI_DEV_CLOCKS_EVENT_REASON_HW_POWER_BRAKE_SLOWDOWN_NS

Improvements#

  • Active health checks (i.e. dcgmi diag)

    • Increase information message capacity from 16 to 128.

    • The nvbandwidth plugin is now supported for H100 80GB HBM3 (devId 2330) GPUs.

    • The memory plugin will now only execute if there is sufficient free memory on the GPU to accommodate the targeted allocation.

    • Failure of one or more diagnostic plug-ins now results in a non-zero exit code from dcgmi diag.
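Because dcgmi diag now exits with a non-zero code when a plug-in fails, a wrapper script can gate on the exit code instead of parsing the output. The following is a minimal sketch: run_check is a hypothetical helper, and the specific non-zero values dcgmi returns are not documented in these notes, so the code only distinguishes zero from non-zero.

```python
import subprocess
import sys


def run_check(command):
    """Run a command and return (passed, exit_code).

    passed is True only when the process exits with code 0.
    """
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0, completed.returncode


# Example usage on a host where dcgmi is installed:
#   passed, code = run_check(["dcgmi", "diag", "-r", "1"])
#   if not passed:
#       sys.exit(f"dcgmi diag reported a failure (exit code {code})")
```

The usage lines are left as comments so the block can be loaded on machines without DCGM installed.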

Bug Fixes#

  • Fixed a bug where NSCQ session initialization failure would lead to a crash in the nv-hostengine process.

  • Blackwell or later systems provide a different set of NVLink error counters than pre-Blackwell systems. The dcgmi nvlink subcommand now accounts for that distinction.

  • Corrected a bug that prevented nv-hostengine from loading the profiling module on platforms with supported Ada generation GPUs.

  • Corrected a bug that would result in rare segmentation faults or spurious failures when executing the pulse test diagnostic plugin

4.2.2#

New Features#

  • The software diagnostic plugin now verifies whether the SRAM ECC error threshold had been exceeded.

  • Added support for B40 (devId 2bb5)

Improvements#

  • In cases where an entity ID is used with dcgmi diag and an error is encountered, diagnostic messages reflect the entity ID instead of the group ID.

  • The DCGM nvbandwidth diagnostic plugin now redirects the standard output of the spawned nvbandwidth process to a log file, dcgm_nvbandwidth.log

Bug Fixes#

  • The nvbandwidth diagnostic plugin no longer introduces persistent modifications to the CUDA_VISIBLE_DEVICES environment variable of the nvvs process

  • The dcgmi groups subcommand now reports an error when attempting to create a group would exceed the group limit.

  • The following GPU NVLink metrics will be reported as GPU entity metrics

    • DCGM_FI_PROF_NVLINK_L0_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L0_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L1_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L1_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L2_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L2_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L3_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L3_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L4_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L4_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L5_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L5_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L6_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L6_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L7_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L7_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L8_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L8_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L9_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L9_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L10_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L10_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L11_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L11_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L12_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L12_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L13_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L13_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L14_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L14_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L15_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L15_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L16_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L16_TX_BYTES

    • DCGM_FI_PROF_NVLINK_L17_RX_BYTES

    • DCGM_FI_PROF_NVLINK_L17_TX_BYTES
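The per-link field names listed above follow a regular pattern, so a script that needs all of them can generate the names rather than hard-code them. This is a sketch only: nvlink_prof_field_names is a hypothetical helper, and the link count of 18 is taken from the L0 to L17 range in the list.

```python
NVLINK_LINK_COUNT = 18  # links L0 through L17, as listed above


def nvlink_prof_field_names(link_count=NVLINK_LINK_COUNT):
    """Return the per-link NVLink profiling field names in the order
    L0 RX, L0 TX, L1 RX, L1 TX, and so on."""
    return [
        f"DCGM_FI_PROF_NVLINK_L{link}_{direction}_BYTES"
        for link in range(link_count)
        for direction in ("RX", "TX")
    ]


names = nvlink_prof_field_names()
print(len(names), names[0], names[-1])
```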

4.2.1#

Version 4.2.1 was not publicly released

4.2.0#

New Features#

  • Added support for a B200 GPU (devId 2941)

4.1.1#

New Features#

  • Added support for H20 NVL16 (devId 230e)

  • Added support for several B200 GPUs

    • devId 20da

    • devId 1999

    • devId 199B10DE

Bug Fixes#

  • Fixed a bug that caused nv-hostengine to crash while collecting NVSwitch errors from the NSCQ library

  • Fixed a bug in which values reported by dcgmi stats would decay to lower values than was correct

  • Fixed a bug that caused the sysmon module to crash

4.1.0#

New Features#

  • Improve the dcgmproftester NVLink testing for newer GPU generations.

  • Add IPv6 support.

  • Add the ability to ignore DCGM Diagnostic failures on the command line.

  • Add monitoring for CX7:

    • DCGM_FI_DEV_CONNECTX_HEALTH

    • DCGM_FI_DEV_CONNECTX_ACTIVE_PCIE_LINK_WIDTH

    • DCGM_FI_DEV_CONNECTX_ACTIVE_PCIE_LINK_SPEED

    • DCGM_FI_DEV_CONNECTX_EXPECT_PCIE_LINK_WIDTH

    • DCGM_FI_DEV_CONNECTX_EXPECT_PCIE_LINK_SPEED

    • DCGM_FI_DEV_CONNECTX_CORRECTABLE_ERR_STATUS

    • DCGM_FI_DEV_CONNECTX_CORRECTABLE_ERR_MASK

    • DCGM_FI_DEV_CONNECTX_UNCORRECTABLE_ERR_STATUS

    • DCGM_FI_DEV_CONNECTX_UNCORRECTABLE_ERR_MASK

    • DCGM_FI_DEV_CONNECTX_UNCORRECTABLE_ERR_SEVERITY

    • DCGM_FI_DEV_CONNECTX_DEVICE_TEMPERATURE

  • Add new NVLink traffic measuring fields:

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L0

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L1

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L2

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L3

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L4

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L5

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L6

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L7

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L8

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L9

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L10

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L11

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L12

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L13

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L14

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L15

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L16

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_L17

    • DCGM_FI_DEV_NVLINK_TX_BANDWIDTH_TOTAL

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L0

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L1

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L2

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L3

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L4

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L5

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L6

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L7

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L8

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L9

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L10

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L11

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L12

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L13

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L14

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L15

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L16

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_L17

    • DCGM_FI_DEV_NVLINK_RX_BANDWIDTH_TOTAL

  • Add new fields to track ECC errors in specific locations:

    • DCGM_FI_DEV_ECC_SBE_VOL_SHM

    • DCGM_FI_DEV_ECC_DBE_VOL_SHM

    • DCGM_FI_DEV_ECC_SBE_VOL_CBU

    • DCGM_FI_DEV_ECC_DBE_VOL_CBU

    • DCGM_FI_DEV_ECC_SBE_AGG_SHM

    • DCGM_FI_DEV_ECC_DBE_AGG_SHM

    • DCGM_FI_DEV_ECC_SBE_AGG_CBU

    • DCGM_FI_DEV_ECC_DBE_AGG_CBU

    • DCGM_FI_DEV_ECC_SBE_VOL_SRM

    • DCGM_FI_DEV_ECC_DBE_VOL_SRM

    • DCGM_FI_DEV_ECC_SBE_AGG_SRM

    • DCGM_FI_DEV_ECC_DBE_AGG_SRM

  • Add platform information fields:

    • DCGM_FI_DEV_PLATFORM_INFINIBAND_GUID

    • DCGM_FI_DEV_PLATFORM_CHASSIS_SERIAL_NUMBER

    • DCGM_FI_DEV_PLATFORM_CHASSIS_SLOT_NUMBER

    • DCGM_FI_DEV_PLATFORM_TRAY_INDEX

    • DCGM_FI_DEV_PLATFORM_HOST_ID

    • DCGM_FI_DEV_PLATFORM_PEER_TYPE

    • DCGM_FI_DEV_PLATFORM_MODULE_ID

  • Add a field to track the TLIMIT

    • DCGM_FI_DEV_GPU_TEMP_LIMIT

Bug Fixes#

  • Fix a bug in retrieving the NVSwitch topology from NVSDM

  • Improve logging for various error conditions.

  • Update the version of the underlying binary for the pulse_test to resolve a crash.

  • Fix a post-install issue in the core RPM.

  • Improve error messages around CUDA APIs in the Diagnostic.

  • To reduce false positives, do not consider thermal violations as independent failures for the Diagnostic; only fail if there are other signs of problems.

  • Fix a bug preventing the EUD from finishing (pause and resume wasn’t working completely).

  • Add missing error specifics in the pulse_test results.

  • Fix comments around the _VIOLATION fields to note the correct units.

4.0.0#

New Features#

Entity Centric Messages#

  • dcgmi diag output has been revised to report errors and info messages along with entity information. This will allow the diagnostic to report GPUs and non-GPU hardware, including NVIDIA Grace CPUs and NVSwitches.

  • Updated dcgmDiagResponse_v struct and dcgmRunDiag_v message format.

NVBandwidth#

  • Added a new plugin that launches NVBandwidth to check inter-GPU communication on a single node. The plugin is supported for CUDA 12.

NVLink5 Monitoring#

  • DCGM will now use the NVSDM library (if available) to monitor NVLink5.

  • Several new fields were added to monitor GPU NVLinks:

    • DCGM_FI_DEV_NVLINK_COUNT_TX_PACKETS

    • DCGM_FI_DEV_NVLINK_COUNT_TX_BYTES

    • DCGM_FI_DEV_NVLINK_COUNT_RX_PACKETS

    • DCGM_FI_DEV_NVLINK_COUNT_RX_BYTES

    • DCGM_FI_DEV_NVLINK_COUNT_RX_MALFORMED_PACKET_ERRORS

    • DCGM_FI_DEV_NVLINK_COUNT_RX_BUFFER_OVERRUN_ERRORS

    • DCGM_FI_DEV_NVLINK_COUNT_RX_ERRORS

    • DCGM_FI_DEV_NVLINK_COUNT_RX_REMOTE_ERRORS

    • DCGM_FI_DEV_NVLINK_COUNT_RX_GENERAL_ERRORS

    • DCGM_FI_DEV_NVLINK_COUNT_LOCAL_LINK_INTEGRITY_ERRORS

    • DCGM_FI_DEV_NVLINK_COUNT_TX_DISCARDS

    • DCGM_FI_DEV_NVLINK_COUNT_LINK_RECOVERY_SUCCESSFUL_EVENTS

    • DCGM_FI_DEV_NVLINK_COUNT_LINK_RECOVERY_FAILED_EVENTS

    • DCGM_FI_DEV_NVLINK_COUNT_LINK_RECOVERY_EVENTS

    • DCGM_FI_DEV_NVLINK_COUNT_RX_SYMBOL_ERRORS

    • DCGM_FI_DEV_NVLINK_COUNT_SYMBOL_BER

    • DCGM_FI_DEV_NVLINK_ERROR_DL_CRC

    • DCGM_FI_DEV_NVLINK_ERROR_DL_RECOVERY

    • DCGM_FI_DEV_NVLINK_ERROR_DL_REPLAY

  • These new fields will also display in the output of dcgmi nvlink -e.

Miscellaneous#

  • In addition to the /var/log/nvidia-dcgm directory that the deb/rpm post-install scripts create automatically, nv-hostengine will attempt at startup to create the log file directory specified by either the DCGM_HOME_DIR environment variable or the --home-dir command line argument.

  • Debug symbol packages are available for non-proprietary packages in the RPM format.

Improvements#

  • NVIDIA Grace CPU serial numbers are now available via DCGM API.

  • Diagnostics run levels 3 and 4 now include Grace CPU EUD.

  • Grace CPU EUD can be run individually via dcgmi diag -r cpu_eud.

  • dcgmi diag output now displays detected Grace CPUs.

    • Their respective serial numbers are reported in JSON format output.

  • The pulse test has added additional patterns to better cover Hopper GPUs.

  • The CUDA kernels used by DCGM are now compiled against CUDA 12.6.3.

  • /dev/kmsg is now parsed to detect some XIDs that were previously undetected.

  • PCIe test errors have been improved for clarity.

Fixed Issues#

  • The occasional hang in the pulse test for Hopper GPUs has been fixed.

  • Many errors that were previously misattributed to multiple GPUs are now correctly attributed to only the offending hardware.

  • A false positive warning on the memtest has been fixed.

  • The rate of PCIe replays required to cause a failure has been corrected (it was previously too low).

  • An incorrect abort due to signal mishandling during the PCIe test was resolved.

  • Fixed an issue that prevented reporting more than 50% utilization for tensor activity on L20 GPUs.

Deprecations and Breaking Changes#

  • New JSON format for dcgmi diag

  • Removed the tmp_dir parameter in the eud plugin (eud and cpu_eud).

  • Subtests of the DCGM Software Diagnostic are no longer individually reported; aggregated results of the software test are now reported instead.

  • NVVS (long-deprecated) will no longer write human-readable output.

  • The dcgmActionValidate_v2() API function now prioritizes the argument group ID. Argument entity IDs will not be considered except when the group ID is set to DCGM_GROUP_NULL.

  • The -g argument to dcgmi diag, used to specify a list of GPUs to run the diagnostic on, has been deprecated and may be removed in a future release. For compatibility with future releases, use -i to specify the list of entities to run the diagnostic on.

  • dcgm.service has been demoted from being a stand-alone systemd unit to being an alias of the nvidia-dcgm.service systemd unit.

  • Installation assets are no longer shipped in a single monolithic package. Instead, installation assets have been split among several packages, allowing clients to opt-out of the installation of assets not applicable to their use case.

    Component packages are as follows:

    • datacenter-gpu-manager-4-core

      • Provides nv-hostengine binary and other CUDA-agnostic installation assets available through the DCGM open source product

    • datacenter-gpu-manager-4-cuda11

      • Provides the CUDA11-specific binaries available through the DCGM open source product

    • datacenter-gpu-manager-4-cuda12

      • Provides the CUDA12-specific binaries available through the DCGM open source product

    • datacenter-gpu-manager-4-proprietary

      • Provides CUDA-agnostic installation assets not distributed as part of the DCGM open source product

    • datacenter-gpu-manager-4-proprietary-cuda11

      • Provides CUDA11 binaries not distributed as part of the DCGM open source product

    • datacenter-gpu-manager-4-proprietary-cuda12

      • Provides CUDA12 binaries not distributed as part of the DCGM open source product

    • datacenter-gpu-manager-4-development

      • Provides files necessary for the development of downstream software dependent on the DCGM library

    Additional information is in the package documentation.

  • Installation paths have been updated to more closely conform to the Filesystem Hierarchy Standard version 3.0.

    • Binaries previously installed to /usr/share/nvidia-validation-suite/ are now installed to /usr/libexec/datacenter-gpu-manager-4/

    • Administrator scripts previously installed to /usr/local/dcgm/scripts/ are now installed to /usr/sbin/

    • Python bindings are now installed to /usr/share/datacenter-gpu-manager-4/bindings/python3/

    • Sample configuration files are now installed to /usr/share/doc/datacenter-gpu-manager-4/examples/

    • C header files for development of software dependent on libdcgm are installed to /usr/include/datacenter-gpu-manager-4

    • CMake find package modules are now installed to /usr/share/cmake/

    • Software development samples are now installed to /usr/src/datacenter-gpu-manager-4/
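
Scripts that hard-code the pre-4.0 locations may need updating. The following is a minimal sketch, assuming only the two old-to-new directory pairs listed above. The helper name migrate_path is hypothetical and not part of DCGM, and the sketch also assumes that files keep their names and relative positions under the new directory, which these notes do not state.

```python
# Old-to-new directory prefixes taken from the installation path list above.
# Only the two entries that name an old location are included.
PATH_MIGRATIONS = {
    "/usr/share/nvidia-validation-suite/": "/usr/libexec/datacenter-gpu-manager-4/",
    "/usr/local/dcgm/scripts/": "/usr/sbin/",
}


def migrate_path(path: str) -> str:
    """Return the DCGM 4.0 location for a pre-4.0 path, or the path unchanged.

    Assumption: the file name below the old directory is preserved as-is.
    """
    for old_prefix, new_prefix in PATH_MIGRATIONS.items():
        if path.startswith(old_prefix):
            return new_prefix + path[len(old_prefix):]
    return path


if __name__ == "__main__":
    # "example-tool" is a placeholder file name, not a real DCGM asset.
    print(migrate_path("/usr/local/dcgm/scripts/example-tool"))
```

Paths outside the two listed directories are returned unchanged, so the helper can be applied to a whole configuration file without side effects on unrelated entries.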

Known Issues#

  • The pulse test crashes intermittently.

3.3.9#

New Features#

  • Added support for H100 144GB HBM3 (devId 2348)

  • Added support for H20 HBM3e (232c)

Improvements#

  • Added SoC power utilization telemetry for Grace CPUs

Fixed Issues#

  • Corrected an issue that caused spurious dcgmproftester failures in MIG environments

  • Corrected an issue that caused dcgmproftester worker processes to crash on shutdown

  • Corrected an issue where sequential DCGM tests would report that GPU resources are busy

  • Corrected an issue that caused diag -r 4 Memtest to fail with a warning on healthy H100 GPUs.

3.3.8#

New Features#

  • DCGM diagnostic now includes the --expectedNumEntities parameter to specify the expected number of GPUs in default groups. This helps identify potential fall-off-the-bus GPUs by failing the diagnostic if the actual GPU count differs from the expected number.

  • The DCGM diagnostic now has an unlimited default timeout, replacing the previous 8-hour limit. Users can set a custom timeout using the --timeout command line argument.

  • DCGM diagnostic now supports the H200NVL GPU (SKU 0x233b).

Improvements#

  • The DCGM Diagnostic now fails early if a pending row remapping is detected.

  • The nvidia-dcgm service has been configured to initiate in the appropriate sequence with other systemd services, including nvidia-mig-manager.

  • The DCGM Diagnostic now supports test parameter values up to 1024 characters, allowing for more detailed customization.

  • The EUD diagnostic now supports multiple specifications of the passthrough_args parameter via the command line. These specifications are subsequently concatenated to form the final parameter value.

  • The DCGM Diagnostic command line now allows multiple instances of the -p/--parameters option. However, each test parameter, except for eud.passthrough_args and cpu_eud.passthrough_args, should still be specified only once.

  • The CPU EUD (dcgmi diag -r cpu_eud) now runs as root, matching the behavior of the GPU EUD (dcgmi diag -r eud).
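
The rules for repeated -p options described above can be sketched as follows. This is an illustrative model, not DCGM's implementation: the function name merge_diag_parameters, the single-space separator used when concatenating, and the choice to raise an error on other duplicates are all assumptions.

```python
# Parameters that may be given more than once, per the notes above.
REPEATABLE = {"eud.passthrough_args", "cpu_eud.passthrough_args"}


def merge_diag_parameters(specs, separator=" "):
    """Merge repeated `name=value` specs into one mapping.

    Repeatable parameters are concatenated in the order given.
    Any other parameter given twice raises ValueError (an assumed
    policy; the notes only say it "should" be specified once).
    """
    merged = {}
    for spec in specs:
        name, sep, value = spec.partition("=")
        if not sep:
            raise ValueError(f"expected name=value, got {spec!r}")
        if name in merged:
            if name not in REPEATABLE:
                raise ValueError(f"{name} specified more than once")
            merged[name] = merged[name] + separator + value
        else:
            merged[name] = value
    return merged


if __name__ == "__main__":
    # The passthrough values here are placeholders, not real EUD arguments.
    print(merge_diag_parameters([
        "eud.passthrough_args=first",
        "pcie.test_nvlink_status=false",
        "eud.passthrough_args=second",
    ]))
```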

Fixed Issues#

  • The DCGM Diagnostic Software Plugin now correctly attributes errors to proper GPU indices.

  • Fixed crashes in the DCGM Diagnostic PCIe and Memory Bandwidth plugins on systems with multiple NUMA nodes.

  • DCGM Health Monitoring thresholds for PCIe bus error rates now depend on the PCIe generation and expected throughput.

  • Fixed Grace CPU utilization computation.

3.3.7#

New Features#

  • Added initial support for the Grace CPU EUD through the new dcgmi diag -r cpu_eud command. This requires installation of the cpueud package.

  • EUD is enabled for aarch64 platform.

  • Critical XID events can now be parsed from kernel logs.

Improvements#

  • DCGM now works without NVIDIA GPU drivers installed, to support environments with only Grace CPUs.

  • The dcgmi output now includes the EUD version.

Fixed Issues#

  • Fixed segmentation fault error in dcgmi during the diagnostic run.

  • Fixed an issue that stopped the dcgmproftester from working in mixed MIG environments.

  • Fixed an issue that did not allow the dcgmproftester to run all tests in a single-GPU environment.

  • T400 and T400 4Gb SKUs are disabled in dcgmproftester.

3.3.6#

New Features#

  • Added support of HBM temperature sensors.

Fixed Issues#

  • Fixed an issue where DCGM reported extremely high temperature values on some GPUs.

  • Fixed overflow in the Memory test.

  • Fixed an issue that could lead to GSP timeout errors in the OpenRM driver.

  • Fixed an issue where the Pulse test and EUD tests could report issues with the GPU even when the GPU is healthy.

  • Fixed an issue that led to incorrect Grace CPU utilization and temperature values.

  • Fixed an issue with duplicated errors in the diag reporting.

  • Fixed an issue that led to a paused DCGM state if the EUD test is interrupted.

3.3.5#

New Features#

  • The DCGM Diagnostic’s diagnostic plugin will now fail if any NaN values are detected in the result matrix.

  • Added support for H200 (devId 2335)

  • Added support for H20 (devId 2329)

Improvements#

  • DCGM Diagnostic’s Targeted Power plugin will now use FP64 math to achieve higher power usage on GH200 (devId 2342)

  • Improved DCGM Diagnostic’s Software plugin’s ability to find installed libraries on the system as part of its library check.

Fixed Issues#

  • Addressed an issue in the SysMon module that made DCGM startups non-deterministic

3.3.3#

New Features#

  • Added support for the L2 GPU.

Improvements#

  • Prevented duplicate errors from being returned in DCGM Diag’s json/text output

Fixed Issues#

  • Fixed reporting of CUDA errors in DCGM Diag to be per-GPU rather than for all GPUs.

3.3.2#

New Features#

  • Added support for L20 GPU.

Fixed Issues#

  • Added the gpuId to the JSON output when the DCGM Diag Deployment plugin fails.

3.3.1#

New Features#

  • Added support for A800 20bd SKU.

  • Added support for water-cooled A800 GPU.

  • Added CPU power and thermal health checks.

  • Added C2C support.

Improvements#

  • All XIDs during diagnostics are now reported.

  • Some logs’ verbosity was reduced from Error to Debug level.

  • Stopped checking NVLink replay counts as a failure condition.

  • Made EUD independent from service-account. Fixed direct run of the EUD diagnostic.

  • The multi-node health check script is now included in the installation packages.

  • Relaxed PCI width testing for QA scripts.

Fixed Issues#

  • Fixed EUD diagnostic when MLE parsing is enabled.

  • Fixed setting of logging severity via dcgmi.

  • Fixed a crash in the pulse test.

  • Resolved an issue that caused the diagnostic to hang on systems with an odd number of GPUs.

3.3.0#

New Features#

  • Added support for monitoring NVIDIA Grace CPUs

  • Added DCGM Diag support for the GPUs of Grace + Hopper systems (devId 2342)

  • Added the following fieldIds for NvSwitch power: DCGM_FI_DEV_NVSWITCH_POWER_VDD, DCGM_FI_DEV_NVSWITCH_POWER_DVDD, DCGM_FI_DEV_NVSWITCH_POWER_HVDD

  • Added DCGM Diag pulse test support for the L4 GPU

Improvements#

  • Reworked DCGM Diag error reporting to include more specific error categories and next steps to aid in automation workflows

  • Data Center Profiling metrics are now allowed on SKUs with brand DCGM_BRAND_NVIDIA_RTX like A6000.

  • Added error id, category, and severity to the dcgmi diag --json output for the Deployment Plugin

Fixed Issues#

  • Added a workaround for DCGM_FI_DEV_MEMORY_TEMP being BLANK on r545 drivers. This is due to NVIDIA Bug 4300930 in the NVML library.

  • Fixed an uninitialized memory bug in the memtest plugin of dcgmi diag -r 4.

3.2.6#

New Features#

  • Added DCGM Diag support for L40S, H100 PCIe (devId 2321), and H800 PCIe (devId 233a)

Improvements#

  • Added logging of health check failures to /var/log/nv-hostengine.log in addition to the dcgmHealthCheck() API returning errors.

Fixed Issues#

  • Fixed dcgmi diag’s Permission and OS Blocks subtest failing within containers.

  • Fixed dcgmi diag -r eud eud.suite_level returning Invalid Parameter

  • Fixed a segfault in DCGM Diag’s nvvs process when GPUs failed to initialize

3.2.5#

New Features#

  • DCGM Diag’s PCIe test will now utilize subprocesses and NUMA to achieve optimal D2H and H2D bandwidth on some AMD CPUs where that is required.

Improvements#

  • The DCGM Diag PCIe plugin now uses Bit Error Rates (BER) instead of static thresholds when detecting excessive PCIe replay.

  • Added a reminder to restart the DCGM service when DCGM Diag warns that the nvvs binary cannot be found.
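
The difference between a static replay threshold and a rate-based check can be illustrated with a short sketch. This is not the plugin's actual computation: the function names, the default max_rate value, and the simplification of counting each replay as one error event per bit transferred are all assumptions made for illustration.

```python
def replay_rate(replay_count: int, bytes_transferred: int) -> float:
    """Replays per bit transferred; 0.0 when nothing was transferred."""
    bits = bytes_transferred * 8
    return replay_count / bits if bits > 0 else 0.0


def exceeds_rate(replay_count: int, bytes_transferred: int,
                 max_rate: float = 1e-12) -> bool:
    """True when the observed replay rate is above max_rate.

    max_rate is a hypothetical default, not a value taken from DCGM.
    """
    return replay_rate(replay_count, bytes_transferred) > max_rate


if __name__ == "__main__":
    # The same replay count can pass or fail depending on traffic volume.
    print(exceeds_rate(10, 10**9))    # little data moved
    print(exceeds_rate(10, 10**15))   # much more data moved
```

A static threshold would treat both calls above identically, because the replay count is the same. A rate-based check scales the tolerated number of replays with the amount of data transferred.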

Fixed Issues#

  • Fixed dcgmi diag not running on ARM64 and PPC64LE platforms in DCGM 3.2.3.

  • Fixed RPATH for the DCGM libraries on platforms where there are dcgm libraries in /lib/ directory (ppc64le rhel).

3.2.3#

New Features#

  • Added a reference implementation of DCGM + NCCL multi-node testing.

  • Added a subtest to DCGM Diagnostic’s PCIe test that does GEMMs concurrent to P2P copies.

  • Added -r production_testing to DCGM Diagnostics to capture production line testing as a specific use case.

  • Added detection of host side PCIe replays to dcgmi diag -r production_testing as a failure condition.

  • Added support for profiling telemetry fieldIds 1001+ for Ada L4

  • Added power telemetry for NvSwitches.

  • Added gather-dcgm-logs.sh to gather all DCGM log files when submitting bugs

Improvements#

  • Removed DCGM’s dependency on OpenMP

  • Added the discrete error_id to the JSON output of DCGM Diag to enable scripting actions based on error codes.

  • The DCGM Diagnostic’s EUD plugin now writes its logs to /var/logs/dcgm like the rest of DCGM.
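
As an illustration of scripting against error codes, the sketch below collects every error_id value from a parsed JSON document. The helper name collect_error_ids and the sample document are hypothetical. The real layout of the DCGM Diag JSON output is not described in these notes, so the sketch walks the structure recursively instead of assuming where error_id appears.

```python
import json


def collect_error_ids(node):
    """Recursively gather every `error_id` value found in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "error_id":
                found.append(value)
            else:
                found.extend(collect_error_ids(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_error_ids(item))
    return found


if __name__ == "__main__":
    # Hypothetical sample; not the actual DCGM Diag output schema.
    sample = json.loads(
        '{"tests": [{"name": "a", "results": [{"status": "Fail", "error_id": 12}]},'
        ' {"name": "b", "results": [{"status": "Pass"}]}]}'
    )
    print(collect_error_ids(sample))
```

A wrapper script could then map the collected codes to actions, for example draining a node when a particular code is present.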

Fixed Issues#

  • Fixed nv-hostengine thrashing the heap under heavy load.

  • Fixed DCP metrics on H100 sometimes returning N/A values under MIG.

Deprecations#

  • dcgm_prometheus.py has been deprecated. Please use DCGM exporter for Prometheus integration

3.1.8#

Improvements#

  • Added a static library of libdcgm.

  • Improved the DCGM Diagnostic PCIe Plugin’s detection of broken P2P between GPUs

Fixed Issues#

  • Fixed an issue where DCGM could hang on systems with NvSwitches

  • Fixed a timing issue where field IDs 1001+ could return N/A values on H100 GPUs.

  • Fixed dcgmproftester11 not working on drivers r515 and older

  • Fixed dcgmproftester10 not working for FP16, FP32, and FP64

  • Fixed minor bugs in the DCGM Diagnostic EUD plugin.

3.1.7#

Improvements#

  • Added support for the NVIDIA L40 and NVIDIA L4 (based on the Ada Lovelace architecture)

  • Added support for the 800-series of NVIDIA GPU products

  • Included metadata on software versions and GPUs detected when running DCGM Diagnostics

  • Updated configuration parameters for the Input EDPp test on H100 and added the ability for users to select a subset of the test patterns.

Fixed Issues#

  • Fixed an issue where DCGM Diagnostics would hang when a GPU could no longer be enumerated on the PCIe bus during a diagnostics run.

  • Fixed an issue where DCGM failed to install on SLES 15 due to the inability to create a nvidia-dcgm group.

  • Fixed a memory leak issue in libdcgmmoduleprofiling.so when monitoring MIG devices with profiling metrics

Known Issues#

  • When profiling metrics are monitored (for example, dcgmi dmon -e 1001), some metrics might be reported as “N/A” for some intervals with an active CUDA context (for example, dcgmproftester12 -t 1007 -d 50 --no-dcgm-validation). This is caused by a timing issue in which samples are cleaned up before they can be used to calculate the metrics. This issue is resolved in DCGM 3.1.8.

  • Running dcgmproftester12 without any arguments may result in an error after a while: Error -24 from InitializeGpus(). Exiting.

3.1.6#

Improvements#

  • Added support for HGX H100 and H100 SXM products.

  • Updated the Input EDPp tests to support H100 products.

  • Added a warning when users attempt to run DCGM Diagnostics on GPUs configured in MIG mode, but no MIG devices are created.

Fixed Issues#

  • Fixed an error with missing symbols when running DCGM Diagnostics run levels 3 & 4. The errors manifest as follows: Unable to merge JSON results for regular Diag and EUD tests and logs will contain this error: Couldn’t load a definition for GetPluginInterfaceVersion in plugin.

  • Fixed an intermittent crash when running the Input EDPp tests.

  • Fixed incorrect failure of DCGM diagnostics on discovering inactive NVLinks.

3.1.3#

New Features#

  • Added support for the NVIDIA Hopper architecture and NVIDIA H100 PCIe product:

    • Added support for Hopper performance monitoring APIs

    • Added support for Hopper Multi-Instance GPU profiles

    • Added support for DCGM GPU diagnostics

  • Added support for NVIDIA Ada architecture, NVIDIA L40 product

  • Added telemetry for NVSwitches. See API documentation (fieldIdentifers) for new fields.

  • Added support for End User Diagnostics (EUD) as a preview feature for specific PCIe products

  • Added support for CUDA 12

  • Added the ability for DCGM Diagnostics to skip the NVLink integration test when NVLinks are not enabled. This can be accomplished by adding the -p pcie.test_nvlink_status=false option to the dcgmi diag command-line.

  • Added support for Red Hat Enterprise Linux (RHEL) 9.

Major API changes and Deprecations#

The following features have been dropped or deprecated starting with DCGM 3.0:

  • The socket protocol based on protobuf has been removed

  • The DCGM introspection APIs have been removed (except for host engine memory usage and host CPU usage)

  • The following field identifiers have been removed:

    • DCGM_FI_DEV_GRAPHICS_PIDS

    • DCGM_FI_DEV_COMPUTE_PIDS

    • DCGM_FI_DEV_GPU_UTIL_SAMPLES

    • DCGM_FI_DEV_MEM_COPY_UTIL_SAMPLES

  • Support for CUDA 9 and CUDA 10 based drivers has been removed; DCGM diagnostics cannot be used on systems with these older driver installations

  • For reading metrics, the dcgmProfWatchFields() API is no longer supported (and will return a DCGM_ST_NOT_SUPPORTED error.) Instead, the more generic dcgmWatchFields() API should be used.

  • The sm_stress test is no longer run as default for -r 3 and -r 4 run levels. To invoke the test separately, dcgmi diag -r sm_stress can be used.

Fixed Issues#

  • The Input EDPp test (“Pulse”) with -r 4 and Memory bandwidth tests are now supported for H100 PCIe in this release

  • Fixed an issue with the Pulse test (under -r 4) which caused the test to hang in some scenarios on A100 systems

  • Fixed an issue in DCGM diagnostics where a failure on one GPU would be attributed to all GPUs in a multi-GPU system.

  • Fixed an issue with the calculation of the DCGM_FI_DEV_FB_USED_PERCENT metric

  • Fixed an issue with package dependencies on the libgomp package on SUSE SLES based distributions

  • Fixed an issue where DCGM diagnostics was not handling driver timeouts correctly

  • Fixed an issue where DCGM diagnostics printed Error: unable to establish a connection to the specified host: localhost when a --host parameter was not passed.

  • Fixed an issue in DCGM diagnostics to handle Ctrl-C signals correctly.

  • Fixed an issue with metrics in MIG mode where all field values would report incorrect values after a few hours.

  • Fixed an issue on A100 in MIG mode where some whole GPU metrics such as temperature, power etc. were returned as 0 for MIG devices.

  • Fixed an issue on A100 in MIG mode to report memory (DCGM_FI_DEV_FB_FREE, DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_TOTAL) per MIG device.

  • Fixed an issue where package installation would fail on RHEL systems.

  • Removed the redundant temperature_max setting from the diag-skus.yaml configuration for DCGM Diagnostics.

  • Fixed an issue where DCGM with R510+ drivers was using an incorrect NVML API to return memory usage. A new field identifier DCGM_FI_DEV_FB_RESERVED was added to distinguish between the actual usage and reserved memory.

Known Issues#

  • On V100, DCGM metrics may be reported as 0 after some time interval when two or more CUDA contexts are active on the GPU.

  • On DGX-2/HGX-2 systems, ensure that nv-hostengine and the Fabric Manager service are started before using dcgmproftester for testing the new profiling metrics. See the Getting Started section in the DCGM User Guide for details on installation.

  • On K80s, nvidia-smi may report hardware throttling (clocks_throttle_reasons.hw_slowdown = ACTIVE) during DCGM Diagnostics (Level 3). The stressful workload results in power transients that engage the HW slowdown mechanism to ensure that the Tesla K80 product operates within the power capping limit for both long term and short term timescales. For Volta or later Tesla products, this reporting issue has been fixed and the workload transients are no longer flagged as “HW Slowdown”. The NVIDIA driver will accurately detect if the slowdown event is due to thermal thresholds being exceeded or external power brake event. It is recommended that customers ignore this failure mode on Tesla K80 if the GPU temperature is within specification.

  • To report NVLINK bandwidth utilization, DCGM programs counters in the HW to extract the desired information. It is currently possible for certain other tools a user might run, including nvprof, to change these settings after DCGM monitoring begins. In such a situation, DCGM may subsequently return errors or invalid values for the NVLINK metrics. There is currently no way within DCGM to prevent other tools from modifying this shared configuration. Once the interfering tool is done, a user of DCGM can repair the reporting by running nvidia-smi nvlink -sc 0bz; nvidia-smi nvlink -sc 1bz.