Field Identifiers#

group Field Identifiers

Field Identifiers.

Defines

DCGM_FI_UNKNOWN 0#

NULL field.

DCGM_FI_DRIVER_VERSION 1#

Driver Version.

DCGM_FI_NVML_VERSION 2#
DCGM_FI_PROCESS_NAME 3#
DCGM_FI_DEV_COUNT 4#

Number of Devices on the node.

DCGM_FI_CUDA_DRIVER_VERSION 5#

CUDA driver version. Retrieves a number with the major value in the thousands place and the minor value in the hundreds place.

CUDA 11.1 = 11100
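As a rough illustration of the encoding described above, the value could be split back into its major and minor parts as sketched below. The function name `decode_cuda_driver_version` is hypothetical and not part of DCGM; the arithmetic assumes exactly the layout stated here (major in the thousands place, minor in the hundreds place).

```c
/* Hypothetical helper, not part of the DCGM API.
 * Assumes the layout described above: encoded = major * 1000 + minor * 100. */
void decode_cuda_driver_version(long long encoded, int *major, int *minor)
{
    *major = (int)(encoded / 1000);         /* everything above the hundreds */
    *minor = (int)((encoded % 1000) / 100); /* the hundreds digit */
}
```

Under that assumption, the documented example value 11100 decodes to major 11 and minor 1.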

DCGM_FI_BIND_UNBIND_EVENT 6#

GPU bind/unbind event notification. Values: SystemReinitializing=1, SystemReinitializationCompleted=2.

Note

Recommended watch frequency: 1 second

DCGM_FI_DEV_NAME 50#

Name of the GPU device.

DCGM_FI_DEV_BRAND 51#

Device Brand.

DCGM_FI_DEV_NVML_INDEX 52#

NVML index of this GPU.

DCGM_FI_DEV_SERIAL 53#

Device Serial Number.

DCGM_FI_DEV_UUID 54#

UUID corresponding to the device.

DCGM_FI_DEV_MINOR_NUMBER 55#

Device node minor number /dev/nvidia#.

DCGM_FI_DEV_OEM_INFOROM_VER 56#

OEM inforom version.

DCGM_FI_DEV_PCI_BUSID 57#

PCI attributes for the device.

DCGM_FI_DEV_PCI_COMBINED_ID 58#

The combined 16-bit device id and 16-bit vendor id.
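A minimal sketch of splitting the combined value into its two halves follows. The helper names are hypothetical, and the bit layout (device ID in the upper 16 bits, vendor ID in the lower 16 bits) is an assumption that this description does not state; verify it against your driver before relying on it.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of the DCGM API.
 * Assumption: device ID occupies the upper 16 bits and vendor ID the
 * lower 16 bits of the combined 32-bit value. */
uint16_t pci_device_id(uint32_t combined)
{
    return (uint16_t)(combined >> 16);
}

uint16_t pci_vendor_id(uint32_t combined)
{
    return (uint16_t)(combined & 0xFFFFu);
}
```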

DCGM_FI_DEV_PCI_SUBSYS_ID 59#

The 32-bit Sub System Device ID.

DCGM_FI_GPU_TOPOLOGY_PCI 60#

Topology of all GPUs on the system via PCI (static)

Topology of all GPUs on the system via NVLINK (static)

DCGM_FI_GPU_TOPOLOGY_AFFINITY 62#

Affinity of all GPUs on the system (static)

DCGM_FI_DEV_CUDA_COMPUTE_CAPABILITY 63#

Cuda compute capability for the device.

The major version is the upper 32 bits and the minor version is the lower 32 bits.
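The packing described above can be undone with a shift and a mask. This is a sketch only; the function names are hypothetical and the value is assumed to be read as an unsigned 64-bit integer.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of the DCGM API.
 * Layout as described above: major in the upper 32 bits, minor in the lower 32. */
uint32_t compute_capability_major(uint64_t value)
{
    return (uint32_t)(value >> 32);
}

uint32_t compute_capability_minor(uint64_t value)
{
    return (uint32_t)(value & 0xFFFFFFFFu);
}
```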

A bitmap of the P2P NVLINK status from this GPU to others on this host.

DCGM_FI_DEV_COMPUTE_MODE 65#

Compute mode for the device.

DCGM_FI_DEV_PERSISTENCE_MODE 66#

Persistence mode for the device. Boolean: 0 is disabled, 1 is enabled.

DCGM_FI_DEV_MIG_MODE 67#

MIG mode for the device. Boolean: 0 is disabled, 1 is enabled.

DCGM_FI_DEV_CUDA_VISIBLE_DEVICES_STR 68#

The string that CUDA_VISIBLE_DEVICES should be set to for this entity (including MIG)

DCGM_FI_DEV_MIG_MAX_SLICES 69#

The maximum number of MIG slices supported by this GPU.

DCGM_FI_DEV_CPU_AFFINITY_0 70#

Device CPU affinity.

part 1/8 = cpus 0 - 63

DCGM_FI_DEV_CPU_AFFINITY_1 71#

Device CPU affinity.

part 2/8 = cpus 64 - 127

DCGM_FI_DEV_CPU_AFFINITY_2 72#

Device CPU affinity.

part 3/8 = cpus 128 - 191

DCGM_FI_DEV_CPU_AFFINITY_3 73#

Device CPU affinity.

part 4/8 = cpus 192 - 255
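The four affinity fields each cover a block of 64 CPUs. The sketch below shows one way to test whether a given CPU is in the affinity set. It assumes each field value is a 64-bit mask in which bit i of field n stands for CPU n * 64 + i; that bit layout is an assumption, and `cpu_in_affinity` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of the DCGM API.
 * parts[n] holds the value of DCGM_FI_DEV_CPU_AFFINITY_n (n = 0..3).
 * Assumption: bit i of parts[n] represents CPU (n * 64 + i). */
bool cpu_in_affinity(const uint64_t parts[4], unsigned cpu)
{
    if (cpu >= 256) {
        return false; /* outside the range covered by the four fields */
    }
    return ((parts[cpu / 64] >> (cpu % 64)) & 1u) != 0;
}
```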

DCGM_FI_DEV_CC_MODE 74#

ConfidentialCompute/AmpereProtectedMemory status for this system: 0 = disabled, 1 = enabled.

DCGM_FI_DEV_MIG_ATTRIBUTES 75#

Attributes for the given MIG device handles.

DCGM_FI_DEV_MIG_GI_INFO 76#

GPU instance profile information.

DCGM_FI_DEV_MIG_CI_INFO 77#

Compute instance profile information.

DCGM_FI_DEV_ECC_INFOROM_VER 80#

ECC inforom version.

DCGM_FI_DEV_POWER_INFOROM_VER 81#

Power management object inforom version.

DCGM_FI_DEV_INFOROM_IMAGE_VER 82#

Inforom image version.

DCGM_FI_DEV_INFOROM_CONFIG_CHECK 83#

Inforom configuration checksum.

DCGM_FI_DEV_INFOROM_CONFIG_VALID 84#

Reads the infoROM from the flash and verifies the checksums.

DCGM_FI_DEV_VBIOS_VERSION 85#

VBIOS version of the device.

DCGM_FI_DEV_MEM_AFFINITY_0 86#

Device Memory node affinity, 0-63.

DCGM_FI_DEV_MEM_AFFINITY_1 87#

Device Memory node affinity, 64-127.

DCGM_FI_DEV_MEM_AFFINITY_2 88#

Device Memory node affinity, 128-191.

DCGM_FI_DEV_MEM_AFFINITY_3 89#

Device Memory node affinity, 192-255.

DCGM_FI_DEV_BAR1_TOTAL 90#

Total BAR1 of the GPU in MB.

DCGM_FI_SYNC_BOOST 91#

Deprecated - Sync boost settings on the node.

DCGM_FI_DEV_BAR1_USED 92#

Used BAR1 of the GPU in MB.

DCGM_FI_DEV_BAR1_FREE 93#

Free BAR1 of the GPU in MB.

DCGM_FI_DEV_GPM_SUPPORT 94#

GPM support for the device.

DCGM_FI_DEV_SM_CLOCK 100#

SM clock for the device.

DCGM_FI_DEV_MEM_CLOCK 101#

Memory clock for the device.

DCGM_FI_DEV_VIDEO_CLOCK 102#

Video encoder/decoder clock for the device.

DCGM_FI_DEV_APP_SM_CLOCK 110#

SM Application clocks.

DCGM_FI_DEV_APP_MEM_CLOCK 111#

Memory Application clocks.

DCGM_FI_DEV_CLOCKS_EVENT_REASONS 112#

Current clock event reasons (bitmask of DCGM_CLOCKS_EVENT_REASON_*)

DCGM_FI_DEV_CLOCK_THROTTLE_REASONS DCGM_FI_DEV_CLOCKS_EVENT_REASONS#

Deprecated: Use DCGM_FI_DEV_CLOCKS_EVENT_REASONS instead.

DCGM_FI_DEV_MAX_SM_CLOCK 113#

Maximum supported SM clock for the device.

DCGM_FI_DEV_MAX_MEM_CLOCK 114#

Maximum supported Memory clock for the device.

DCGM_FI_DEV_MAX_VIDEO_CLOCK 115#

Maximum supported Video encoder/decoder clock for the device.

DCGM_FI_DEV_AUTOBOOST 120#

Auto-boost for the device (1 = enabled, 0 = disabled).

DCGM_FI_DEV_SUPPORTED_CLOCKS 130#

Supported clocks for the device.

DCGM_FI_DEV_MEMORY_TEMP 140#

Memory temperature for the device.

DCGM_FI_DEV_GPU_TEMP 150#

Current temperature readings for the device, in degrees C.

DCGM_FI_DEV_MEM_MAX_OP_TEMP 151#

Maximum operating temperature for the memory of this GPU.

Above this temperature slowdown will occur.

DCGM_FI_DEV_GPU_MAX_OP_TEMP 152#

Maximum operating temperature for this GPU.

DCGM_FI_DEV_GPU_TEMP_LIMIT 153#

Thermal margin temperature (distance to nearest slowdown threshold) for this GPU.

DCGM_FI_DEV_POWER_USAGE 155#

Power usage for the device in Watts.

DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION 156#

Total energy consumption for the GPU in mJ since the driver was last reloaded.
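Because this counter is cumulative, an average power figure can be derived from two samples. The sketch below assumes the caller supplies timestamps in milliseconds, so that mJ divided by ms yields watts; the function name and the choice to return 0.0 for unusable samples are illustrative assumptions.

```c
#include <stdint.h>

/* Hypothetical helper, not part of the DCGM API.
 * Energy samples are in mJ (as documented above); timestamps are assumed
 * to be in milliseconds, so mJ / ms = W. */
double average_power_watts(uint64_t energy_start_mj, uint64_t energy_end_mj,
                           uint64_t t_start_ms, uint64_t t_end_ms)
{
    /* A counter that went backwards suggests a driver reload between samples. */
    if (t_end_ms <= t_start_ms || energy_end_mj < energy_start_mj) {
        return 0.0;
    }
    return (double)(energy_end_mj - energy_start_mj) /
           (double)(t_end_ms - t_start_ms);
}
```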

DCGM_FI_DEV_POWER_USAGE_INSTANT 157#

Current instantaneous power usage of the device in Watts.

DCGM_FI_DEV_SLOWDOWN_TEMP 158#

Slowdown temperature for the device.

DCGM_FI_DEV_SHUTDOWN_TEMP 159#

Shutdown temperature for the device.

DCGM_FI_DEV_POWER_MGMT_LIMIT 160#

Current Power limit for the device.

DCGM_FI_DEV_POWER_MGMT_LIMIT_MIN 161#

Minimum power management limit for the device.

DCGM_FI_DEV_POWER_MGMT_LIMIT_MAX 162#

Maximum power management limit for the device.

DCGM_FI_DEV_POWER_MGMT_LIMIT_DEF 163#

Default power management limit for the device.

DCGM_FI_DEV_ENFORCED_POWER_LIMIT 164#

Effective power limit that the driver enforces after taking into account all limiters.

DCGM_FI_DEV_REQUESTED_POWER_PROFILE_MASK 165#

Requested workload power profile mask (Blackwell and newer).

DCGM_FI_DEV_ENFORCED_POWER_PROFILE_MASK 166#

Enforced workload power profile mask (Blackwell and newer).

DCGM_FI_DEV_VALID_POWER_PROFILE_MASK 167#

Valid workload power profile mask (Blackwell and newer).

DCGM_FI_DEV_FABRIC_MANAGER_STATUS 170#

The status of the fabric manager - a value from dcgmFabricManagerStatus_t.

DCGM_FI_DEV_FABRIC_MANAGER_ERROR_CODE 171#

The failure that happened while starting the Fabric Manager, if any. Note: this is not populated unless the fabric manager completed startup.

DCGM_FI_DEV_FABRIC_CLUSTER_UUID 172#

The uuid of the cluster to which this GPU belongs.

DCGM_FI_DEV_FABRIC_CLIQUE_ID 173#

The ID of the fabric clique to which this GPU belongs.

DCGM_FI_DEV_FABRIC_HEALTH_MASK 174#

GPU Fabric health Status Mask.

Use DCGM_GPU_FABRIC_HEALTH_TEST macro to check the different health statuses. Use DCGM_GPU_FABRIC_HEALTH_GET macro to get the different health statuses.

DCGM_FI_DEV_PSTATE 190#

Performance state (P-State) 0-15.

0=highest

DCGM_FI_DEV_FAN_SPEED 191#

Fan speed for the device in percent 0-100.

DCGM_FI_DEV_PCIE_TX_THROUGHPUT 200#

PCIe Tx utilization information.

Deprecated: Use DCGM_FI_PROF_PCIE_TX_BYTES instead.

DCGM_FI_DEV_PCIE_RX_THROUGHPUT 201#

PCIe Rx utilization information.

Deprecated: Use DCGM_FI_PROF_PCIE_RX_BYTES instead.

DCGM_FI_DEV_PCIE_REPLAY_COUNTER 202#

PCIe replay counter.

DCGM_FI_DEV_GPU_UTIL 203#

GPU Utilization.

DCGM_FI_DEV_MEM_COPY_UTIL 204#

Memory Utilization.

DCGM_FI_DEV_ACCOUNTING_DATA 205#

Process accounting stats.

This field is only supported when the host engine is running as root unless you enable accounting ahead of time. Accounting mode can be enabled by running “nvidia-smi -am 1” as root on the same node the host engine is running on.

DCGM_FI_DEV_ENC_UTIL 206#

Encoder Utilization.

DCGM_FI_DEV_DEC_UTIL 207#

Decoder Utilization.

DCGM_FI_DEV_XID_ERRORS 230#

XID errors.

The value is the specific XID error

PCIe Max Link Generation.

PCIe Max Link Width.

PCIe Current Link Generation.

PCIe Current Link Width.

DCGM_FI_DEV_POWER_VIOLATION 240#

Power Violation time in ns.

DCGM_FI_DEV_THERMAL_VIOLATION 241#

Thermal Violation time in ns.

DCGM_FI_DEV_SYNC_BOOST_VIOLATION 242#

Sync Boost Violation time in ns.

DCGM_FI_DEV_BOARD_LIMIT_VIOLATION 243#

Board violation limit.

DCGM_FI_DEV_LOW_UTIL_VIOLATION 244#

Low utilisation violation limit.

DCGM_FI_DEV_RELIABILITY_VIOLATION 245#

Reliability violation limit.

DCGM_FI_DEV_TOTAL_APP_CLOCKS_VIOLATION 246#

App clock violation limit.

DCGM_FI_DEV_TOTAL_BASE_CLOCKS_VIOLATION 247#

Base clock violation limit.

DCGM_FI_DEV_FB_TOTAL 250#

Total Frame Buffer of the GPU in MB.

DCGM_FI_DEV_FB_FREE 251#

Free Frame Buffer in MB.

DCGM_FI_DEV_FB_USED 252#

Used Frame Buffer in MB.

DCGM_FI_DEV_FB_RESERVED 253#

Reserved Frame Buffer in MB.

DCGM_FI_DEV_FB_USED_PERCENT 254#

Percentage used of Frame Buffer: ‘Used/(Total - Reserved)’.

Range 0.0-1.0
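The formula above can be reproduced from the other frame buffer fields, for example to cross-check a reading. This is a sketch; `fb_used_fraction` is a hypothetical name, and returning 0.0 when no usable memory remains is an illustrative choice.

```c
/* Hypothetical helper, not part of the DCGM API.
 * Implements Used / (Total - Reserved) with all inputs in MB.
 * The result is a fraction in the range 0.0-1.0, not a 0-100 percentage. */
double fb_used_fraction(double used_mb, double total_mb, double reserved_mb)
{
    double usable_mb = total_mb - reserved_mb;
    if (usable_mb <= 0.0) {
        return 0.0; /* avoid dividing by zero or a negative size */
    }
    return used_mb / usable_mb;
}
```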

C2C Link Count.

C2C Link Status. A value of 0 means the link is INACTIVE; a value of 1 means the link is ACTIVE.

DCGM_FI_DEV_C2C_MAX_BANDWIDTH 287#

C2C Max Bandwidth. The value indicates the link speed in MB/s.

DCGM_FI_DEV_ECC_CURRENT 300#

Current ECC mode for the device.

DCGM_FI_DEV_ECC_PENDING 301#

Pending ECC mode for the device.

DCGM_FI_DEV_ECC_SBE_VOL_TOTAL 310#

Total single bit volatile ECC errors.

DCGM_FI_DEV_ECC_DBE_VOL_TOTAL 311#

Total double bit volatile ECC errors.

DCGM_FI_DEV_ECC_SBE_AGG_TOTAL 312#

Total single bit aggregate (persistent) ECC errors. Note: monotonically increasing.

DCGM_FI_DEV_ECC_DBE_AGG_TOTAL 313#

Total double bit aggregate (persistent) ECC errors. Note: monotonically increasing.

DCGM_FI_DEV_ECC_SBE_VOL_L1 314#

L1 cache single bit volatile ECC errors.

DCGM_FI_DEV_ECC_DBE_VOL_L1 315#

L1 cache double bit volatile ECC errors.

DCGM_FI_DEV_ECC_SBE_VOL_L2 316#

L2 cache single bit volatile ECC errors.

DCGM_FI_DEV_ECC_DBE_VOL_L2 317#

L2 cache double bit volatile ECC errors.

DCGM_FI_DEV_ECC_SBE_VOL_DEV 318#

Device memory single bit volatile ECC errors.

DCGM_FI_DEV_ECC_DBE_VOL_DEV 319#

Device memory double bit volatile ECC errors.

DCGM_FI_DEV_ECC_SBE_VOL_REG 320#

Register file single bit volatile ECC errors.

DCGM_FI_DEV_ECC_DBE_VOL_REG 321#

Register file double bit volatile ECC errors.

DCGM_FI_DEV_ECC_SBE_VOL_TEX 322#

Texture memory single bit volatile ECC errors.

DCGM_FI_DEV_ECC_DBE_VOL_TEX 323#

Texture memory double bit volatile ECC errors.

DCGM_FI_DEV_ECC_SBE_AGG_L1 324#

L1 cache single bit aggregate (persistent) ECC errors. Note: monotonically increasing.

DCGM_FI_DEV_ECC_DBE_AGG_L1 325#

L1 cache double bit aggregate (persistent) ECC errors. Note: monotonically increasing.

DCGM_FI_DEV_ECC_SBE_AGG_L2 326#

L2 cache single bit aggregate (persistent) ECC errors. Note: monotonically increasing.

DCGM_FI_DEV_ECC_DBE_AGG_L2 327#

L2 cache double bit aggregate (persistent) ECC errors. Note: monotonically increasing.

DCGM_FI_DEV_ECC_SBE_AGG_DEV 328#

Device memory single bit aggregate (persistent) ECC errors. Note: monotonically increasing.

DCGM_FI_DEV_ECC_DBE_AGG_DEV 329#

Device memory double bit aggregate (persistent) ECC errors. Note: monotonically increasing.

DCGM_FI_DEV_ECC_SBE_AGG_REG 330#

Register file single bit aggregate (persistent) ECC errors. Note: monotonically increasing.

DCGM_FI_DEV_ECC_DBE_AGG_REG 331#

Register file double bit aggregate (persistent) ECC errors. Note: monotonically increasing.

DCGM_FI_DEV_ECC_SBE_AGG_TEX 332#

Texture memory single bit aggregate (persistent) ECC errors. Note: monotonically increasing.

DCGM_FI_DEV_ECC_DBE_AGG_TEX 333#

Texture memory double bit aggregate (persistent) ECC errors. Note: monotonically increasing.

DCGM_FI_DEV_ECC_SBE_VOL_SHM 334#

Texture SHM single bit volatile ECC errors.

DCGM_FI_DEV_ECC_DBE_VOL_SHM 335#

Texture SHM double bit volatile ECC errors.

DCGM_FI_DEV_ECC_SBE_VOL_CBU 336#

CBU single bit ECC volatile errors.

DCGM_FI_DEV_ECC_DBE_VOL_CBU 337#

CBU double bit ECC volatile errors.

DCGM_FI_DEV_ECC_SBE_AGG_SHM 338#

Texture SHM single bit aggregate ECC errors.

DCGM_FI_DEV_ECC_DBE_AGG_SHM 339#

Texture SHM double bit aggregate ECC errors.

DCGM_FI_DEV_ECC_SBE_AGG_CBU 340#

CBU single bit ECC aggregate errors.

DCGM_FI_DEV_ECC_DBE_AGG_CBU 341#

CBU double bit ECC aggregate errors.

DCGM_FI_DEV_ECC_SBE_VOL_SRM 342#

Turing and later fields.

SRAM single bit ECC volatile errors

DCGM_FI_DEV_ECC_DBE_VOL_SRM 343#

SRAM double bit ECC volatile errors.

DCGM_FI_DEV_ECC_SBE_AGG_SRM 344#

SRAM single bit ECC aggregate errors.

DCGM_FI_DEV_ECC_DBE_AGG_SRM 345#

SRAM double bit ECC aggregate errors.

DCGM_FI_DEV_THRESHOLD_SRM 346#

Ampere and later fields.

SRAM Threshold Exceeded boolean (1=true).

DCGM_FI_DEV_DIAG_MEMORY_RESULT 350#

Result of the GPU Memory test. Refers to an int64_t storing a value drawn from the dcgmError_t enumeration.

DCGM_FI_DEV_DIAG_DIAGNOSTIC_RESULT 351#

Result of the Diagnostics test. Refers to an int64_t storing a value drawn from the dcgmError_t enumeration.

DCGM_FI_DEV_DIAG_PCIE_RESULT 352#

Result of the PCIe + NVLink test. Refers to an int64_t storing a value drawn from the dcgmError_t enumeration.

DCGM_FI_DEV_DIAG_TARGETED_STRESS_RESULT 353#

Result of the Targeted Stress test. Refers to an int64_t storing a value drawn from the dcgmError_t enumeration.

DCGM_FI_DEV_DIAG_TARGETED_POWER_RESULT 354#

Result of the Targeted Power test. Refers to an int64_t storing a value drawn from the dcgmError_t enumeration.

DCGM_FI_DEV_DIAG_MEMORY_BANDWIDTH_RESULT 355#

Result of the Memory Bandwidth test. Refers to an int64_t storing a value drawn from the dcgmError_t enumeration.

DCGM_FI_DEV_DIAG_MEMTEST_RESULT 356#

Result of the Memory Stress test. Refers to an int64_t storing a value drawn from the dcgmError_t enumeration.

DCGM_FI_DEV_DIAG_PULSE_TEST_RESULT 357#

Result of the Input Energy Delayed Product power (EDPp) test (a.k.a. the pulse test). Refers to an int64_t storing a value drawn from the dcgmError_t enumeration.

DCGM_FI_DEV_DIAG_EUD_RESULT 358#

Result of the Extended Utility Diagnostics (EUD) test. Refers to an int64_t storing a value drawn from the dcgmError_t enumeration.

DCGM_FI_DEV_DIAG_CPU_EUD_RESULT 359#

Result of the CPU Extended Utility Diagnostics (CPU EUD) test. Refers to an int64_t storing a value drawn from the dcgmError_t enumeration.

DCGM_FI_DEV_DIAG_SOFTWARE_RESULT 360#

Result of the Software test. Refers to an int64_t storing a value drawn from the dcgmError_t enumeration.

DCGM_FI_DEV_DIAG_NVBANDWIDTH_RESULT 361#

Result of the NVBandwidth test. Refers to an int64_t storing a value drawn from the dcgmError_t enumeration.

DCGM_FI_DEV_DIAG_STATUS 362#
DCGM_FI_DEV_DIAG_NCCL_TESTS_RESULT 363#

Result of the nccl-tests test. Refers to an int64_t storing a value drawn from the dcgmError_t enumeration.

DCGM_FI_DEV_BANKS_REMAP_ROWS_AVAIL_MAX 385#

Historical max available spare memory rows per memory bank.

DCGM_FI_DEV_BANKS_REMAP_ROWS_AVAIL_HIGH 386#

Historical high mark of available spare memory rows per memory bank.

DCGM_FI_DEV_BANKS_REMAP_ROWS_AVAIL_PARTIAL 387#

Historical mark of partial available spare memory rows per memory bank.

DCGM_FI_DEV_BANKS_REMAP_ROWS_AVAIL_LOW 388#

Historical low mark of available spare memory rows per memory bank.

DCGM_FI_DEV_BANKS_REMAP_ROWS_AVAIL_NONE 389#

Historical marker of memory banks with no available spare memory rows.

DCGM_FI_DEV_RETIRED_SBE 390#

Number of retired pages because of single bit errors. Note: monotonically increasing.

DCGM_FI_DEV_RETIRED_DBE 391#

Number of retired pages because of double bit errors. Note: monotonically increasing.

DCGM_FI_DEV_RETIRED_PENDING 392#

Number of pages pending retirement.

DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS 393#

Number of remapped rows for uncorrectable errors.

DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS 394#

Number of remapped rows for correctable errors.

DCGM_FI_DEV_ROW_REMAP_FAILURE 395#

Whether remapping of rows has failed.

DCGM_FI_DEV_ROW_REMAP_PENDING 396#

Whether remapping of rows is pending.

DCGM_FI_DEV_VIRTUAL_MODE 500#

Virtualization Mode corresponding to the GPU.

One of DCGM_GPU_VIRTUALIZATION_MODE_* constants.

DCGM_FI_DEV_SUPPORTED_TYPE_INFO 501#

Includes Count and Static info of vGPU types supported on a device.

DCGM_FI_DEV_CREATABLE_VGPU_TYPE_IDS 502#

Includes Count and currently Creatable vGPU types on a device.

DCGM_FI_DEV_VGPU_INSTANCE_IDS 503#

Includes Count and currently Active vGPU Instances on a device.

DCGM_FI_DEV_VGPU_UTILIZATIONS 504#

Utilization values for vGPUs running on the device.

DCGM_FI_DEV_VGPU_PER_PROCESS_UTILIZATION 505#

Utilization values for processes running within vGPU VMs using the device.

DCGM_FI_DEV_ENC_STATS 506#

Current encoder statistics for a given device.

DCGM_FI_DEV_FBC_STATS 507#

Statistics of current active frame buffer capture sessions on a given device.

DCGM_FI_DEV_FBC_SESSIONS_INFO 508#

Information about active frame buffer capture sessions on a target device.

DCGM_FI_DEV_SUPPORTED_VGPU_TYPE_IDS 509#

Includes Count and currently Supported vGPU types on a device.

DCGM_FI_DEV_VGPU_TYPE_INFO 510#

Includes Static info of vGPU types supported on a device.

DCGM_FI_DEV_VGPU_TYPE_NAME 511#

Includes the name of a vGPU type supported on a device.

DCGM_FI_DEV_VGPU_TYPE_CLASS 512#

Includes the class of a vGPU type supported on a device.

DCGM_FI_DEV_VGPU_TYPE_LICENSE 513#

Includes the license info for a vGPU type supported on a device.

DCGM_FI_DEV_VGPU_VM_ID 520#

VM ID of the vGPU instance.

DCGM_FI_DEV_VGPU_VM_NAME 521#

VM name of the vGPU instance.

DCGM_FI_DEV_VGPU_TYPE 522#

vGPU type of the vGPU instance

DCGM_FI_DEV_VGPU_UUID 523#

UUID of the vGPU instance.

DCGM_FI_DEV_VGPU_DRIVER_VERSION 524#

Driver version of the vGPU instance.

DCGM_FI_DEV_VGPU_MEMORY_USAGE 525#

Memory usage of the vGPU instance.

DCGM_FI_DEV_VGPU_LICENSE_STATUS 526#

License status of the vGPU.

0 = vgpu is not licensed

1 = vgpu is licensed

DCGM_FI_DEV_VGPU_FRAME_RATE_LIMIT 527#

Frame rate limit of the vGPU instance.

DCGM_FI_DEV_VGPU_ENC_STATS 528#

Current encoder statistics of the vGPU instance.

DCGM_FI_DEV_VGPU_ENC_SESSIONS_INFO 529#

Information about all active encoder sessions on the vGPU instance.

DCGM_FI_DEV_VGPU_FBC_STATS 530#

Statistics of current active frame buffer capture sessions on the vGPU instance.

DCGM_FI_DEV_VGPU_FBC_SESSIONS_INFO 531#

Information about active frame buffer capture sessions on the vGPU instance.

DCGM_FI_DEV_VGPU_INSTANCE_LICENSE_STATE 532#

License state information of the vGPU instance.

DCGM_FI_DEV_VGPU_PCI_ID 533#

PCI Id of the vGPU instance.

DCGM_FI_DEV_VGPU_VM_GPU_INSTANCE_ID 534#

GPU Instance ID for the given vGPU Instance.

DCGM_FI_FIRST_VGPU_FIELD_ID 520#

Starting field ID of the vGPU instance.

DCGM_FI_LAST_VGPU_FIELD_ID 570#

Last field ID of the vGPU instance.

DCGM_FI_MAX_VGPU_FIELDS DCGM_FI_LAST_VGPU_FIELD_ID - DCGM_FI_FIRST_VGPU_FIELD_ID#

For now, the maximum number of vGPU field IDs is taken as the difference between DCGM_FI_LAST_VGPU_FIELD_ID and DCGM_FI_FIRST_VGPU_FIELD_ID, i.e. 50.
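The arithmetic behind this value can be checked with the three defines listed above. In the sketch below they are restated locally so the snippet compiles without the DCGM headers; `is_vgpu_field_id` is a hypothetical helper, and treating the range as inclusive at both ends is an assumption.

```c
/* Restated from the definitions above so this compiles on its own. */
#define DCGM_FI_FIRST_VGPU_FIELD_ID 520
#define DCGM_FI_LAST_VGPU_FIELD_ID 570
#define DCGM_FI_MAX_VGPU_FIELDS \
    (DCGM_FI_LAST_VGPU_FIELD_ID - DCGM_FI_FIRST_VGPU_FIELD_ID)

/* Hypothetical helper, not part of the DCGM API.
 * Assumption: both ends of the range are inclusive. */
int is_vgpu_field_id(int field_id)
{
    return field_id >= DCGM_FI_FIRST_VGPU_FIELD_ID &&
           field_id <= DCGM_FI_LAST_VGPU_FIELD_ID;
}
```

The macro evaluates to 570 - 520 = 50. If both ends are treated as inclusive, as assumed here, the range spans 51 IDs.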

DCGM_FI_DEV_PLATFORM_INFINIBAND_GUID 571#

Infiniband GUID string with format 0xXXXXXXXXXXXXXXXX for the specified GPU.

DCGM_FI_DEV_PLATFORM_CHASSIS_SERIAL_NUMBER 572#

Serial number of the chassis containing this GPU.

DCGM_FI_DEV_PLATFORM_CHASSIS_SLOT_NUMBER 573#

Slot number in the rack containing the GPU (includes switches)

DCGM_FI_DEV_PLATFORM_TRAY_INDEX 574#

Tray index within the compute slots in the chassis containing this GPU (does not include switches)

DCGM_FI_DEV_PLATFORM_HOST_ID 575#

Index of the node within the slot containing the GPU.

DCGM_FI_DEV_PLATFORM_PEER_TYPE 576#

Platform indicated NVLink-peer type (e.g. switch present or not).

DCGM_FI_DEV_PLATFORM_MODULE_ID 577#

ID of the GPU within the node.

Link-based PRM metrics for NvLink. These fields use dcgm_link_t to specify GPU ID + port number for per-link metrics.

PPRM recovery operation status

Time in seconds since last PRM recovery.

Time in milliseconds between last two recoveries.

Total successful recovery events counter.

Physical layer successful recovery events.

Physical layer link down counter.

PLR received codewords counter.

PLR received code error counter.

PLR received uncorrectable codes counter.

PLR transmitted codewords counter.

PLR transmitted retry codes counter.

PLR transmitted retry events counter.

PLR sync events counter.

DCGM_FI_INTERNAL_FIELDS_0_START 600#

Starting ID for all the internal fields.

DCGM_FI_INTERNAL_FIELDS_0_END 699#

Last ID for all the internal fields.

NVSwitch entity field IDs start here.

NVSwitch latency bins for port 0

DCGM_FI_FIRST_NVSWITCH_FIELD_ID 700#

Starting field ID of the NVSwitch instance.

DCGM_FI_DEV_NVSWITCH_VOLTAGE_MVOLT 701#

NvSwitch voltage.

DCGM_FI_DEV_NVSWITCH_CURRENT_IDDQ 702#

NvSwitch Current IDDQ.

DCGM_FI_DEV_NVSWITCH_CURRENT_IDDQ_REV 703#

NvSwitch Current IDDQ Rev.

DCGM_FI_DEV_NVSWITCH_CURRENT_IDDQ_DVDD 704#

NvSwitch Current IDDQ Rev DVDD.

DCGM_FI_DEV_NVSWITCH_POWER_VDD 705#

NvSwitch Power VDD in watts.

DCGM_FI_DEV_NVSWITCH_POWER_DVDD 706#

NvSwitch Power DVDD in watts.

DCGM_FI_DEV_NVSWITCH_POWER_HVDD 707#

NvSwitch Power HVDD in watts.

NVSwitch Tx Throughput Counter for ports 0-17

NVSwitch Rx Throughput Counter for ports 0-17.

NvSwitch fatal_errors for ports 0-17.

NvSwitch non_fatal_errors for ports 0-17.

NvSwitch replay_count_errors for ports 0-17.

NvSwitch recovery_count_errors for ports 0-17.

NvSwitch filt_err_count_errors for ports 0-17.

NvLink lane crc_err_count_aggregate_errors for ports 0-17.

NvLink lane ecc_err_count_aggregate_errors for ports 0-17.

Nvlink lane latency low lane0 counter.

Nvlink lane latency low lane1 counter.

Nvlink lane latency low lane2 counter.

Nvlink lane latency low lane3 counter.

Nvlink lane latency medium lane0 counter.

Nvlink lane latency medium lane1 counter.

Nvlink lane latency medium lane2 counter.

Nvlink lane latency medium lane3 counter.

Nvlink lane latency high lane0 counter.

Nvlink lane latency high lane1 counter.

Nvlink lane latency high lane2 counter.

Nvlink lane latency high lane3 counter.

Nvlink lane latency panic lane0 counter.

Nvlink lane latency panic lane1 counter.

Nvlink lane latency panic lane2 counter.

Nvlink lane latency panic lane3 counter.

Nvlink lane latency count lane0 counter.

Nvlink lane latency count lane1 counter.

Nvlink lane latency count lane2 counter.

Nvlink lane latency count lane3 counter.

NvLink lane crc_err_count for lane 0 on ports 0-17.

NvLink lane crc_err_count for lane 1 on ports 0-17.

NvLink lane crc_err_count for lane 2 on ports 0-17.

NvLink lane crc_err_count for lane 3 on ports 0-17.

NvLink lane ecc_err_count for lane 0 on ports 0-17.

NvLink lane ecc_err_count for lane 1 on ports 0-17.

NvLink lane ecc_err_count for lane 2 on ports 0-17.

NvLink lane ecc_err_count for lane 3 on ports 0-17.

NvLink lane crc_err_count for lane 4 on ports 0-17.

NvLink lane crc_err_count for lane 5 on ports 0-17.

NvLink lane crc_err_count for lane 6 on ports 0-17.

NvLink lane crc_err_count for lane 7 on ports 0-17.

NvLink lane ecc_err_count for lane 4 on ports 0-17.

NvLink lane ecc_err_count for lane 5 on ports 0-17.

NvLink lane ecc_err_count for lane 6 on ports 0-17.

NvLink lane ecc_err_count for lane 7 on ports 0-17.

NV Link TX Bandwidth Counter for Lane 0.

NV Link TX Bandwidth Counter for Lane 1.

NV Link TX Bandwidth Counter for Lane 2.

NV Link TX Bandwidth Counter for Lane 3.

NV Link TX Bandwidth Counter for Lane 4.

NV Link TX Bandwidth Counter for Lane 5.

NV Link TX Bandwidth Counter for Lane 6.

NV Link TX Bandwidth Counter for Lane 7.

NV Link TX Bandwidth Counter for Lane 8.

NV Link TX Bandwidth Counter for Lane 9.

NV Link TX Bandwidth Counter for Lane 10.

NV Link TX Bandwidth Counter for Lane 11.

NV Link TX Bandwidth Counter for Lane 12.

NV Link TX Bandwidth Counter for Lane 13.

NV Link TX Bandwidth Counter for Lane 14.

NV Link TX Bandwidth Counter for Lane 15.

NV Link TX Bandwidth Counter for Lane 16.

NV Link TX Bandwidth Counter for Lane 17.

NV Link Bandwidth Counter total for all TX Lanes.

DCGM_FI_DEV_NVSWITCH_FATAL_ERRORS 856#

NVSwitch fatal error information.

Note: value field indicates the specific SXid reported

DCGM_FI_DEV_NVSWITCH_NON_FATAL_ERRORS 857#

NVSwitch non fatal error information.

Note: value field indicates the specific SXid reported

DCGM_FI_DEV_NVSWITCH_TEMPERATURE_CURRENT 858#

NVSwitch current temperature.

DCGM_FI_DEV_NVSWITCH_TEMPERATURE_LIMIT_SLOWDOWN 859#

NVSwitch limit slowdown temperature.

DCGM_FI_DEV_NVSWITCH_TEMPERATURE_LIMIT_SHUTDOWN 860#

NVSwitch limit shutdown temperature.

DCGM_FI_DEV_NVSWITCH_THROUGHPUT_TX 861#

NVSwitch throughput Tx.

DCGM_FI_DEV_NVSWITCH_THROUGHPUT_RX 862#

NVSwitch throughput Rx.

DCGM_FI_DEV_NVSWITCH_PHYS_ID 863#
DCGM_FI_DEV_NVSWITCH_RESET_REQUIRED 864#

NVSwitch reset required.

NvSwitch NvLink ID.

DCGM_FI_DEV_NVSWITCH_PCIE_DOMAIN 866#

NvSwitch PCIE domain.

DCGM_FI_DEV_NVSWITCH_PCIE_BUS 867#

NvSwitch PCIE bus.

DCGM_FI_DEV_NVSWITCH_PCIE_DEVICE 868#

NvSwitch PCIE device.

DCGM_FI_DEV_NVSWITCH_PCIE_FUNCTION 869#

NvSwitch PCIE function.

NvLink status.

UNKNOWN:-1 OFF:0 SAFE:1 ACTIVE:2 ERROR:3
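For logging, the numeric status can be mapped back to the names listed above. A minimal sketch follows; `nvlink_status_name` is a hypothetical function, and the "INVALID" fallback for unlisted values is an illustrative choice.

```c
/* Hypothetical helper, not part of the DCGM API.
 * Maps the documented status values to their names. */
const char *nvlink_status_name(long long status)
{
    switch (status) {
    case -1: return "UNKNOWN";
    case 0:  return "OFF";
    case 1:  return "SAFE";
    case 2:  return "ACTIVE";
    case 3:  return "ERROR";
    default: return "INVALID"; /* value not listed in this reference */
    }
}
```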

NvLink device type (GPU/Switch).

NvLink device pcie domain.

NvLink device pcie bus.

NvLink device pcie device.

NvLink device pcie function.

NvLink device link ID.

NvLink device SID.

DCGM_FI_DEV_NVSWITCH_DEVICE_UUID 878#

NvLink device switch/link uid.

NV Link RX Bandwidth Counter for Lane 0.

NV Link RX Bandwidth Counter for Lane 1.

NV Link RX Bandwidth Counter for Lane 2.

NV Link RX Bandwidth Counter for Lane 3.

NV Link RX Bandwidth Counter for Lane 4.

NV Link RX Bandwidth Counter for Lane 5.

NV Link RX Bandwidth Counter for Lane 6.

NV Link RX Bandwidth Counter for Lane 7.

NV Link RX Bandwidth Counter for Lane 8.

NV Link RX Bandwidth Counter for Lane 9.

NV Link RX Bandwidth Counter for Lane 10.

NV Link RX Bandwidth Counter for Lane 11.

NV Link RX Bandwidth Counter for Lane 12.

NV Link RX Bandwidth Counter for Lane 13.

NV Link RX Bandwidth Counter for Lane 14.

NV Link RX Bandwidth Counter for Lane 15.

NV Link RX Bandwidth Counter for Lane 16.

NV Link RX Bandwidth Counter for Lane 17.

NV Link Bandwidth Counter total for all RX Lanes.

DCGM_FI_LAST_NVSWITCH_FIELD_ID 899#

Last field ID of the NVSwitch instance.

DCGM_FI_MAX_NVSWITCH_FIELDS DCGM_FI_LAST_NVSWITCH_FIELD_ID - DCGM_FI_FIRST_NVSWITCH_FIELD_ID + 1#

For now, the maximum number of NVSwitch field IDs is taken as the difference between DCGM_FI_LAST_NVSWITCH_FIELD_ID and DCGM_FI_FIRST_NVSWITCH_FIELD_ID, plus 1, i.e. 200.
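The count can be verified from the first and last NVSwitch field IDs. The sketch below restates the defines locally so it compiles without the DCGM headers, and adds a hypothetical `nvswitch_field_index` helper that converts a field ID into a zero-based index, for example to size and index a lookup table.

```c
/* Restated from the definitions above so this compiles on its own. */
#define DCGM_FI_FIRST_NVSWITCH_FIELD_ID 700
#define DCGM_FI_LAST_NVSWITCH_FIELD_ID 899
#define DCGM_FI_MAX_NVSWITCH_FIELDS \
    (DCGM_FI_LAST_NVSWITCH_FIELD_ID - DCGM_FI_FIRST_NVSWITCH_FIELD_ID + 1)

/* Hypothetical helper, not part of the DCGM API.
 * Returns a zero-based index for an NVSwitch field ID, or -1 if the ID
 * lies outside the NVSwitch range. */
int nvswitch_field_index(int field_id)
{
    if (field_id < DCGM_FI_FIRST_NVSWITCH_FIELD_ID ||
        field_id > DCGM_FI_LAST_NVSWITCH_FIELD_ID) {
        return -1;
    }
    return field_id - DCGM_FI_FIRST_NVSWITCH_FIELD_ID;
}
```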

DCGM_FI_PROF_GR_ENGINE_ACTIVE 1001#

Profiling Fields. These all start with DCGM_FI_PROF_*.

Ratio of time the graphics engine is active. The graphics engine is active if a graphics/compute context is bound and the graphics pipe or compute pipe is busy.

DCGM_FI_PROF_SM_ACTIVE 1002#

The ratio of cycles an SM has at least 1 warp assigned (computed from the number of cycles and elapsed cycles)

DCGM_FI_PROF_SM_OCCUPANCY 1003#

The ratio of the number of warps resident on an SM (number of resident warps as a ratio of the theoretical maximum number of warps per elapsed cycle).

DCGM_FI_PROF_PIPE_TENSOR_ACTIVE 1004#

The ratio of cycles any tensor pipe is active (off the peak sustained elapsed cycles)

DCGM_FI_PROF_DRAM_ACTIVE 1005#

The ratio of cycles the device memory interface is active sending or receiving data.

DCGM_FI_PROF_PIPE_FP64_ACTIVE 1006#

Ratio of cycles the fp64 pipe is active.

DCGM_FI_PROF_PIPE_FP32_ACTIVE 1007#

Ratio of cycles the fp32 pipe is active.

DCGM_FI_PROF_PIPE_FP16_ACTIVE 1008#

Ratio of cycles the fp16 pipe is active.

This does not include HMMA.

DCGM_FI_PROF_PCIE_TX_BYTES 1009#

The number of bytes of active PCIe tx (transmit) data including both header and payload.

Note that this is from the perspective of the GPU, so copying data from device to host (DtoH) would be reflected in this metric.

DCGM_FI_PROF_PCIE_RX_BYTES 1010#

The number of bytes of active PCIe rx (read) data including both header and payload.

Note that this is from the perspective of the GPU, so copying data from host to device (HtoD) would be reflected in this metric.

The total number of bytes of active NvLink tx (transmit) data including both header and payload.

Per-link fields are available below

The total number of bytes of active NvLink rx (read) data including both header and payload.

Per-link fields are available below

DCGM_FI_PROF_PIPE_TENSOR_IMMA_ACTIVE 1013#

The ratio of cycles the tensor (IMMA) pipe is active (off the peak sustained elapsed cycles)

DCGM_FI_PROF_PIPE_TENSOR_HMMA_ACTIVE 1014#

The ratio of cycles the tensor (HMMA) pipe is active (off the peak sustained elapsed cycles)

DCGM_FI_PROF_PIPE_TENSOR_DFMA_ACTIVE 1015#

The ratio of cycles the tensor (DFMA) pipe is active (off the peak sustained elapsed cycles)

DCGM_FI_PROF_PIPE_INT_ACTIVE 1016#

Ratio of cycles the integer pipe is active.

DCGM_FI_PROF_NVDEC0_ACTIVE 1017#

Ratio of cycles each of the NVDEC engines are active.

DCGM_FI_PROF_NVDEC1_ACTIVE 1018#
DCGM_FI_PROF_NVDEC2_ACTIVE 1019#
DCGM_FI_PROF_NVDEC3_ACTIVE 1020#
DCGM_FI_PROF_NVDEC4_ACTIVE 1021#
DCGM_FI_PROF_NVDEC5_ACTIVE 1022#
DCGM_FI_PROF_NVDEC6_ACTIVE 1023#
DCGM_FI_PROF_NVDEC7_ACTIVE 1024#
DCGM_FI_PROF_NVJPG0_ACTIVE 1025#

Ratio of cycles each of the NVJPG engines are active.

DCGM_FI_PROF_NVJPG1_ACTIVE 1026#
DCGM_FI_PROF_NVJPG2_ACTIVE 1027#
DCGM_FI_PROF_NVJPG3_ACTIVE 1028#
DCGM_FI_PROF_NVJPG4_ACTIVE 1029#
DCGM_FI_PROF_NVJPG5_ACTIVE 1030#
DCGM_FI_PROF_NVJPG6_ACTIVE 1031#
DCGM_FI_PROF_NVJPG7_ACTIVE 1032#
DCGM_FI_PROF_NVOFA0_ACTIVE 1033#

Ratio of cycles each of the NVOFA engines are active.

DCGM_FI_PROF_NVOFA1_ACTIVE 1034#

The per-link number of bytes of active NvLink TX (transmit) or RX (receive) data including both header and payload.

For example: DCGM_FI_PROF_NVLINK_L0_TX_BYTES -> L0 TX.

To get the bandwidth for a link, add the RX and TX values together, like total = DCGM_FI_PROF_NVLINK_L0_TX_BYTES + DCGM_FI_PROF_NVLINK_L0_RX_BYTES.
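The same TX + RX sum can be extended across several links. The sketch below assumes the caller has already read the per-link TX and RX values into two arrays of equal length; `nvlink_total_bytes` and its parameters are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the DCGM API.
 * tx[i] and rx[i] hold the per-link TX and RX byte values for link i,
 * as read by the caller from the per-link profiling fields. */
uint64_t nvlink_total_bytes(const uint64_t *tx, const uint64_t *rx,
                            size_t link_count)
{
    uint64_t total = 0;
    for (size_t i = 0; i < link_count; ++i) {
        total += tx[i] + rx[i]; /* per-link total = TX + RX */
    }
    return total;
}
```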

NVLink throughput First.

NVLink throughput Last.

DCGM_FI_PROF_C2C_TX_ALL_BYTES 1076#

The total number of bytes transmitted over the C2C (Chip-to-Chip) interface, including both header and payload data.

DCGM_FI_PROF_C2C_TX_DATA_BYTES 1077#

The number of data-only bytes transmitted over the C2C (Chip-to-Chip) interface.

DCGM_FI_PROF_C2C_RX_ALL_BYTES 1078#

The total number of bytes received over the C2C (Chip-to-Chip) interface, including both header and payload data.

DCGM_FI_PROF_C2C_RX_DATA_BYTES 1079#

The number of data-only bytes received over the C2C (Chip-to-Chip) interface.

DCGM_FI_PROF_HOSTMEM_CACHE_HIT 1080#

Host Memory Cache Hit.

Percentage of requests to Host Memory that were served from cache

DCGM_FI_PROF_HOSTMEM_CACHE_MISS 1081#

Host Memory Cache Miss.

Percentage of requests to Host Memory that were cache misses

DCGM_FI_PROF_PEERMEM_CACHE_HIT 1082#

Peer Memory Cache Hit.

Percentage of requests to Peer Memory that were served from cache

DCGM_FI_PROF_PEERMEM_CACHE_MISS 1083#

Peer Memory Cache Miss.

Percentage of requests to Peer Memory that were cache misses

DCGM_FI_DEV_CPU_UTIL_TOTAL 1100#

CPU Utilization, total.

DCGM_FI_DEV_CPU_UTIL_USER 1101#

CPU Utilization, user.

DCGM_FI_DEV_CPU_UTIL_NICE 1102#

CPU Utilization, nice.

DCGM_FI_DEV_CPU_UTIL_SYS 1103#

CPU Utilization, system time.

DCGM_FI_DEV_CPU_UTIL_IRQ 1104#

CPU Utilization, interrupt servicing.

DCGM_FI_DEV_CPU_TEMP_CURRENT 1110#

CPU temperature.

DCGM_FI_DEV_CPU_TEMP_WARNING 1111#

CPU Warning Temperature.

DCGM_FI_DEV_CPU_TEMP_CRITICAL 1112#

CPU Critical Temperature.

DCGM_FI_DEV_CPU_CLOCK_CURRENT 1120#

CPU instantaneous clock speed.

DCGM_FI_DEV_CPU_POWER_UTIL_CURRENT 1130#

CPU power utilization.

DCGM_FI_DEV_CPU_POWER_LIMIT 1131#

CPU power limit.

DCGM_FI_DEV_SYSIO_POWER_UTIL_CURRENT 1132#

SoC power utilization.

DCGM_FI_DEV_MODULE_POWER_UTIL_CURRENT 1133#

Module power utilization.

DCGM_FI_DEV_CPU_VENDOR 1140#

CPU vendor name.

DCGM_FI_DEV_CPU_MODEL 1141#

CPU model name.

Total Tx packets on the link in NVLink5 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Total Tx bytes on the link in NVLink5 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Total Rx packets on the link in NVLink5 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Total Rx bytes on the link in NVLink5 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Number of packets Rx on a link where packets are malformed Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Number of packets that were discarded on Rx due to buffer overrun Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Total number of packets with errors Rx on a link Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Total number of packets Rx - stomp/EBP marker Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Total number of packets Rx with header mismatch Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Total number of times that the count of local errors exceeded a threshold Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Total number of tx error packets that were discarded Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Number of times link went from Up to recovery, succeeded and link came back up Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Number of times link went from Up to recovery, failed and link was declared down Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Number of times link went from Up to recovery, irrespective of the result Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Number of errors in rx symbols Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

BER for symbol errors - raw value Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

BER for symbol errors - decoded float (derived from DCGM_FI_DEV_NVLINK_COUNT_SYMBOL_BER) Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Effective BER for effective errors - raw value Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Effective BER for effective errors - decoded float (derived from DCGM_FI_DEV_NVLINK_COUNT_EFFECTIVE_BER) Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Sum of the number of errors in each Nvlink packet Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

NVLink ECC Data Error Counter total for all Links.

DCGM_FI_DEV_FIRST_CONNECTX_FIELD_ID 1300#

First field id of ConnectX.

DCGM_FI_DEV_CONNECTX_HEALTH 1300#

Health state of ConnectX.

Active PCIe link width.

Active PCIe link speed.

Expected PCIe link width.

Expected PCIe link speed.

DCGM_FI_DEV_CONNECTX_CORRECTABLE_ERR_STATUS 1305#

Correctable error status.

DCGM_FI_DEV_CONNECTX_CORRECTABLE_ERR_MASK 1306#

Correctable error mask.

DCGM_FI_DEV_CONNECTX_UNCORRECTABLE_ERR_STATUS 1307#

Uncorrectable error status.

DCGM_FI_DEV_CONNECTX_UNCORRECTABLE_ERR_MASK 1308#

Uncorrectable error mask.

DCGM_FI_DEV_CONNECTX_UNCORRECTABLE_ERR_SEVERITY 1309#

Uncorrectable error severity.

DCGM_FI_DEV_CONNECTX_DEVICE_TEMPERATURE 1310#

Device temperature.

DCGM_FI_DEV_LAST_CONNECTX_FIELD_ID 1399#

The last field id of ConnectX.
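The first and last ConnectX field IDs make it possible to test whether an arbitrary field ID belongs to the ConnectX block. The sketch below uses stand-in macros that copy the values 1300 and 1399 from the definitions above, and it assumes the range is inclusive at both ends; the function name `is_connectx_field` is hypothetical.

```c
#include <stdbool.h>

/* Stand-ins for DCGM_FI_DEV_FIRST_CONNECTX_FIELD_ID and
 * DCGM_FI_DEV_LAST_CONNECTX_FIELD_ID, copied from the values listed above. */
#define EXAMPLE_FIRST_CONNECTX_FIELD_ID 1300
#define EXAMPLE_LAST_CONNECTX_FIELD_ID  1399

/* Hypothetical helper: true if field_id lies in the ConnectX range,
 * assuming both bounds are inclusive. */
bool is_connectx_field(unsigned short field_id)
{
    return field_id >= EXAMPLE_FIRST_CONNECTX_FIELD_ID &&
           field_id <= EXAMPLE_LAST_CONNECTX_FIELD_ID;
}
```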

C2C Link CRC Error Counter.

C2C Link Replay Error Counter.

C2C Link Back to Back Replay Error Counter.

C2C Link Power state.

See NVML_C2C_POWER_STATE_*

Count of symbol errors that are corrected in each bin.

Count of symbol errors that are corrected - bin 0 Note: NVLink5+ only. Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Count of symbol errors that are corrected - bin 1 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Count of symbol errors that are corrected - bin 2 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Count of symbol errors that are corrected - bin 3 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Count of symbol errors that are corrected - bin 4 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Count of symbol errors that are corrected - bin 5 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Count of symbol errors that are corrected - bin 6 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Count of symbol errors that are corrected - bin 7 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Count of symbol errors that are corrected - bin 8 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Count of symbol errors that are corrected - bin 9 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Count of symbol errors that are corrected - bin 10 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Count of symbol errors that are corrected - bin 11 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Count of symbol errors that are corrected - bin 12 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Count of symbol errors that are corrected - bin 13 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Count of symbol errors that are corrected - bin 14 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

Count of symbol errors that are corrected - bin 15 Note: NVLink5+ only.

Returns aggregate value across all links. Not supported on NVLink4 and earlier.

DCGM_FI_DEV_CLOCKS_EVENT_REASON_SW_POWER_CAP_NS 1420#

Count, in nanoseconds, of slowdown or shutdown in sampling interval.

Throttling to not exceed currently set power limits in ns

DCGM_FI_DEV_CLOCKS_EVENT_REASON_SYNC_BOOST_NS 1421#

Throttling to match minimum possible clock across Sync Boost Group in ns.

DCGM_FI_DEV_CLOCKS_EVENT_REASON_SW_THERM_SLOWDOWN_NS 1422#

Throttling to ensure ((GPU temp < GPU Max Operating Temp) && (Memory Temp < Memory Max Operating Temp)) in ns.

DCGM_FI_DEV_CLOCKS_EVENT_REASON_HW_THERM_SLOWDOWN_NS 1423#

Throttling due to temperature being too high (reducing core clocks by a factor of 2 or more) in ns.

DCGM_FI_DEV_CLOCKS_EVENT_REASON_HW_POWER_BRAKE_SLOWDOWN_NS 1424#

Throttling due to external power brake assertion trigger (reducing core clocks by a factor of 2 or more) in ns.
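Because the `_NS` fields above report time in nanoseconds, a reading can be turned into a share of the sampling interval. The following is a minimal sketch, assuming the value covers exactly one sampling interval of known length; the function name `throttle_percent` and the `-1.0` error value are illustrative choices, not part of DCGM.

```c
#include <stdint.h>

/* Hypothetical helper: percentage of a sampling interval spent throttled.
 *   throttled_ns: value read from one of the *_NS clock event fields
 *   interval_ns:  length of the sampling interval, in nanoseconds
 * Returns -1.0 if interval_ns is zero. */
double throttle_percent(uint64_t throttled_ns, uint64_t interval_ns)
{
    if (interval_ns == 0) {
        return -1.0;
    }
    return 100.0 * (double)throttled_ns / (double)interval_ns;
}
```

For instance, 250 ms of throttling within a 1 second interval corresponds to `throttle_percent(250000000, 1000000000)`.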

DCGM_FI_DEV_PWR_SMOOTHING_ENABLED 1425#

DCGM Power smoothing fields.

Enablement (0/DISABLED or 1/ENABLED)

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_PRIV_LVL 1426#

Current privilege level.

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_IMM_RAMP_DOWN_ENABLED 1427#

Immediate ramp down enablement (0/DISABLED or 1/ENABLED)

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_APPLIED_TMP_CEIL 1428#

Applied TMP ceiling value in Watts.

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_APPLIED_TMP_FLOOR 1429#

Applied TMP floor value in Watts.

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_MAX_PERCENT_TMP_FLOOR_SETTING 1430#

Max % TMP Floor value.

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_MIN_PERCENT_TMP_FLOOR_SETTING 1431#

Min % TMP Floor value.

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_HW_CIRCUITRY_PERCENT_LIFETIME_REMAINING 1432#

HW Circuitry % lifetime remaining.

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_MAX_NUM_PRESET_PROFILES 1433#

Max number of preset profiles.

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_PROFILE_PERCENT_TMP_FLOOR 1434#

% TMP floor for a given profile

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_PROFILE_RAMP_UP_RATE 1435#

Ramp up rate in mW/s for a given profile.

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_PROFILE_RAMP_DOWN_RATE 1436#

Ramp down rate in mW/s for a given profile.

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_PROFILE_RAMP_DOWN_HYST_VAL 1437#

Ramp down hysteresis value in ms for a given profile.

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_ACTIVE_PRESET_PROFILE 1438#

Active preset profile number.

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_ADMIN_OVERRIDE_PERCENT_TMP_FLOOR 1439#

% TMP floor for a given profile

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_UP_RATE 1440#

Ramp up rate in mW/s for a given profile.

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_RATE 1441#

Ramp down rate in mW/s for a given profile.

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PWR_SMOOTHING_ADMIN_OVERRIDE_RAMP_DOWN_HYST_VAL 1442#

Ramp down hysteresis value in ms for a given profile.

Note

DCGM_FI_DEV_PWR_SMOOTHING_* fields require that power smoothing in-band access privileges have been set to either level 1 or level 2 (e.g. via Redfish API)

DCGM_FI_DEV_PCIE_COUNT_CORRECTABLE_ERRORS 1501#

Field IDs 1443 to 1500 are reserved for power smoothing fields.

DCGM_FI_IMEX_DOMAIN_STATUS 1502#

IMEX domain status (UP, DOWN, DEGRADED). Retrieved from the nvidia-imex-ctl -N -j command.

DCGM_FI_IMEX_DAEMON_STATUS 1503#

IMEX daemon status (numeric values 0-7). Retrieved from the nvidia-imex-ctl -q command. Values: INITIALIZING=0, STARTING_AUTH_SERVER=1, WAITING_FOR_PEERS=2, WAITING_FOR_RECOVERY=3, INIT_GPU=4, READY=5, SHUTTING_DOWN=6, UNAVAILABLE=7.
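When logging this field, the numeric value can be translated back to the names listed above. The following is a minimal sketch; the function `imex_daemon_status_name` is hypothetical and is not provided by DCGM, and only the value-to-name pairs come from this document.

```c
#include <stddef.h>

/* Hypothetical helper: map a DCGM_FI_IMEX_DAEMON_STATUS value (0-7)
 * to the name documented for it. Returns NULL for any other value. */
const char *imex_daemon_status_name(long long value)
{
    static const char *const names[] = {
        "INITIALIZING",         /* 0 */
        "STARTING_AUTH_SERVER", /* 1 */
        "WAITING_FOR_PEERS",    /* 2 */
        "WAITING_FOR_RECOVERY", /* 3 */
        "INIT_GPU",             /* 4 */
        "READY",                /* 5 */
        "SHUTTING_DOWN",        /* 6 */
        "UNAVAILABLE",          /* 7 */
    };
    const long long count = (long long)(sizeof names / sizeof names[0]);

    if (value < 0 || value >= count) {
        return NULL;
    }
    return names[value];
}
```

Returning NULL for out-of-range values lets the caller decide how to report a status that this table does not know about.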

DCGM_FI_DEV_MEMORY_UNREPAIRABLE_FLAG 1507#

Field IDs 1504 to 1506 are reserved for IMEX fields.

Unrepairable memory flag indicating whether memory has unrepairable errors: 1=yes, 0=no.

NVLink State (see NVML_FI_DEV_NVLINK_GET_STATE for return values). This field expects a dcgm_link_t entity to specify the GPU and link index.

Use DCGM_FE_LINK entity group when accessing this field.

InfiniBand Port Counter: Port Transmit Wait (see NVML_PRM_COUNTER_ID_PPCNT_PORTCOUNTERS_PORT_XMIT_WAIT for details). This field expects a dcgm_link_t entity to specify the GPU and link index.

Use DCGM_FE_LINK entity group when accessing this field.

DCGM_FI_DEV_GET_GPU_RECOVERY_ACTION 1523#

GPU Recovery Action (see nvmlDeviceGpuRecoveryAction_t for return values)

DCGM_FI_MAX_FIELDS (DCGM_FI_DEV_GET_GPU_RECOVERY_ACTION + 1)#

One greater than the largest field ID defined above.

This is one greater than the maximum field ID that could be allocated.
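Because the maximum is one past the largest field ID, it can serve directly as the length of a table indexed by field ID. The sketch below uses stand-in macros that copy the value 1523 from the definition above; the `watched` table and the `mark_watched` function are hypothetical and only illustrate the bounds check.

```c
#include <stdbool.h>

/* Stand-ins for DCGM_FI_DEV_GET_GPU_RECOVERY_ACTION and DCGM_FI_MAX_FIELDS,
 * copied from the values listed above. */
#define EXAMPLE_FI_LAST       1523
#define EXAMPLE_FI_MAX_FIELDS (EXAMPLE_FI_LAST + 1)

/* One slot per possible field ID, valid indices 0..EXAMPLE_FI_LAST. */
static bool watched[EXAMPLE_FI_MAX_FIELDS];

/* Hypothetical helper: flag a field ID as watched.
 * Returns false if the ID is outside the table. */
bool mark_watched(unsigned short field_id)
{
    if (field_id >= EXAMPLE_FI_MAX_FIELDS) {
        return false;
    }
    watched[field_id] = true;
    return true;
}
```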

Functions

dcgm_field_meta_p DcgmFieldGetById(unsigned short fieldId)#

Get a pointer to the metadata for a field by its field ID.

See DCGM_FI_? for a list of field IDs.

Parameters:

fieldId – IN: One of the field IDs (DCGM_FI_?)

Returns:

0 on failure; >0 pointer to the field metadata structure if found.

dcgm_field_meta_p DcgmFieldGetByTag(const char *tag)#

Get a pointer to the metadata for a field by its field tag.

Parameters:

tag – IN: Tag for the field of interest

Returns:

0 on failure or if not found; >0 pointer to the field metadata structure if found.

int DcgmFieldsInit(void)#

Initialize the DcgmFields module.

Call this once from inside your program.

Returns:

0 on success; <0 on error.

int DcgmFieldsTerm(void)#

Terminates the DcgmFields module.

Call this once from inside your program.

Returns:

0 on success; <0 on error.

const char *DcgmFieldsGetEntityGroupString(
dcgm_field_entity_group_t entityGroupId
)#

Get the string version of an entityGroupId.

Returns:

  • Pointer to a string such as GPU or NvSwitch

  • Null on error