Structure Definitions

group dcgmStructs

Unnamed Group

DCGM_RUN_FLAGS_VERBOSE

Flag options for running the GPU diagnostic.

Output in verbose mode; include information as well as warnings.

DCGM_RUN_FLAGS_STATSONFAIL

Output stats only on failure.

DCGM_RUN_FLAGS_TRAIN

Train DCGM diagnostic and output a configuration file with golden values.

DCGM_RUN_FLAGS_FORCE_TRAIN

Ignore warnings against training the diagnostic and train anyway.

DCGM_RUN_FLAGS_FAIL_EARLY

Enable fail early checks for the Targeted Stress, Targeted Power, SM Stress, and Diagnostic tests.
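The run flags above are meant to be combined into a single flags word. The following is a minimal sketch of that pattern, assuming each flag occupies a distinct bit. The `EX_RUN_FLAGS_*` macros, their numeric values, and the helper functions are placeholders invented for this illustration; real code should use the `DCGM_RUN_FLAGS_*` macros from dcgm_structs.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder stand-ins for the DCGM_RUN_FLAGS_* macros. The numeric values
 * are assumptions for illustration, not values taken from dcgm_structs.h. */
#define EX_RUN_FLAGS_VERBOSE     0x0001u
#define EX_RUN_FLAGS_STATSONFAIL 0x0002u
#define EX_RUN_FLAGS_FAIL_EARLY  0x0010u

/* Build a flags word for a verbose run with fail-early checks enabled. */
static unsigned int exBuildRunFlags(void)
{
    return EX_RUN_FLAGS_VERBOSE | EX_RUN_FLAGS_FAIL_EARLY;
}

/* Return true if every bit of `flag` is set in `flags`. */
static bool exHasRunFlag(unsigned int flags, unsigned int flag)
{
    assert(flag != 0u);
    return (flags & flag) == flag;
}
```

Testing with `(flags & flag) == flag` rather than `flags & flag` keeps the check correct even if a caller passes a multi-bit mask.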

Unnamed Group

DCGM_TOPO_HINT_F_NONE

Topology hints for dcgmSelectGpusByTopology()

No hints specified

DCGM_TOPO_HINT_F_IGNOREHEALTH

Ignore the health of the GPUs when picking GPUs for job execution.

By default, only healthy GPUs are considered.

Defines

dcgmConnectV2Params_version1

Version 1 for dcgmConnectV2Params_v1.

dcgmConnectV2Params_version2

Version 2 for dcgmConnectV2Params_v2.

dcgmConnectV2Params_version

Latest version for dcgmConnectV2Params_t.

dcgmHostengineHealth_version1
dcgmHostengineHealth_version

Latest version for dcgmHostengineHealth_t.

dcgmGroupInfo_version2

Version 2 for dcgmGroupInfo_v2.

dcgmGroupInfo_version

Latest version for dcgmGroupInfo_t.

DCGM_MAX_INSTANCES_PER_GPU
DCGM_MAX_COMPUTE_INSTANCES_PER_GPU
DCGM_MAX_TOTAL_INSTANCES_PER_GPU
DCGM_MAX_HIERARCHY_INFO
DCGM_MAX_INSTANCES
DCGM_MAX_COMPUTE_INSTANCES
dcgmMigHierarchy_version1
dcgmMigHierarchy_version2
dcgmMigHierarchy_version
DCGM_MAX_NUM_FIELD_GROUPS

Maximum number of field groups that can exist.

DCGM_MAX_FIELD_IDS_PER_FIELD_GROUP

Maximum number of field IDs that can be in a single field group.

dcgmFieldGroupInfo_version1

Version 1 for dcgmFieldGroupInfo_v1.

dcgmFieldGroupInfo_version

Latest version for dcgmFieldGroupInfo_t.

dcgmAllFieldGroup_version1

Version 1 for dcgmAllFieldGroup_v1.

dcgmAllFieldGroup_version

Latest version for dcgmAllFieldGroup_t.

dcgmClockSet_version1

Version 1 for dcgmClockSet_v1.

dcgmClockSet_version

Latest version for dcgmClockSet_t.

dcgmDeviceSupportedClockSets_version1

Version 1 for dcgmDeviceSupportedClockSets_v1.

dcgmDeviceSupportedClockSets_version

Latest version for dcgmDeviceSupportedClockSets_t.

dcgmDevicePidAccountingStats_version1

Version 1 for dcgmDevicePidAccountingStats_v1.

dcgmDevicePidAccountingStats_version

Latest version for dcgmDevicePidAccountingStats_t.

dcgmDeviceThermals_version1

Version 1 for dcgmDeviceThermals_v1.

dcgmDeviceThermals_version

Latest version for dcgmDeviceThermals_t.

dcgmDevicePowerLimits_version1

Version 1 for dcgmDevicePowerLimits_v1.

dcgmDevicePowerLimits_version

Latest version for dcgmDevicePowerLimits_t.

dcgmDeviceIdentifiers_version1

Version 1 for dcgmDeviceIdentifiers_v1.

dcgmDeviceIdentifiers_version

Latest version for dcgmDeviceIdentifiers_t.

dcgmDeviceMemoryUsage_version1

Version 1 for dcgmDeviceMemoryUsage_v1.

dcgmDeviceMemoryUsage_version

Latest version for dcgmDeviceMemoryUsage_t.

dcgmDeviceVgpuUtilInfo_version1

Version 1 for dcgmDeviceVgpuUtilInfo_v1.

dcgmDeviceVgpuUtilInfo_version

Latest version for dcgmDeviceVgpuUtilInfo_t.

dcgmDeviceEncStats_version1

Version 1 for dcgmDeviceEncStats_v1.

dcgmDeviceEncStats_version

Latest version for dcgmDeviceEncStats_t.

dcgmDeviceFbcStats_version1

Version 1 for dcgmDeviceFbcStats_v1.

dcgmDeviceFbcStats_version

Latest version for dcgmDeviceFbcStats_t.

dcgmDeviceFbcSessionInfo_version1

Version 1 for dcgmDeviceFbcSessionInfo_v1.

dcgmDeviceFbcSessionInfo_version

Latest version for dcgmDeviceFbcSessionInfo_t.

dcgmDeviceFbcSessions_version1

Version 1 for dcgmDeviceFbcSessions_v1.

dcgmDeviceFbcSessions_version

Latest version for dcgmDeviceFbcSessions_t.

dcgmDeviceVgpuEncSessions_version1

Version 1 for dcgmDeviceVgpuEncSessions_v1.

dcgmDeviceVgpuEncSessions_version

Latest version for dcgmDeviceVgpuEncSessions_t.

dcgmDeviceVgpuProcessUtilInfo_version1

Version 1 for dcgmDeviceVgpuProcessUtilInfo_v1.

dcgmDeviceVgpuProcessUtilInfo_version

Latest version for dcgmDeviceVgpuProcessUtilInfo_t.

dcgmDeviceVgpuTypeInfo_version1

Version 1 for dcgmDeviceVgpuTypeInfo_v1.

dcgmDeviceVgpuTypeInfo_version2

Version 2 for dcgmDeviceVgpuTypeInfo_v2.

dcgmDeviceVgpuTypeInfo_version

Latest version for dcgmDeviceVgpuTypeInfo_t.

dcgmDevicesSettings_version1
dcgmDeviceSettings_version2
dcgmDeviceSettings_version
dcgmDeviceAttributes_version1

Version 1 for dcgmDeviceAttributes_v1.

dcgmDeviceAttributes_version2

Version 2 for dcgmDeviceAttributes_v2.

dcgmDeviceAttributes_version3

Version 3 for dcgmDeviceAttributes_v3.

dcgmDeviceAttributes_version

Latest version for dcgmDeviceAttributes_t.

DCGM_MAX_VGPU_TYPES_PER_PGPU

Maximum number of vGPU types per physical GPU.

DCGM_DEVICE_UUID_BUFFER_SIZE

Represents the size of a buffer that holds a string related to attributes specific to a vGPU instance.

dcgmConfig_version1

Version 1 for dcgmConfig_v1.

dcgmConfig_version

Latest version for dcgmConfig_t.

dcgmPolicyViolation_version1
dcgmPolicyViolation_version
DCGM_POLICY_COND_IDX_MAX
DCGM_POLICY_COND_MAX
dcgmPolicy_version1

Version 1 for dcgmPolicy_v1.

dcgmPolicy_version

Latest version for dcgmPolicy_t.

dcgmPolicyCallbackResponse_version1

Version 1 for dcgmPolicyCallbackResponse_v1.

dcgmPolicyCallbackResponse_version

Latest version for dcgmPolicyCallbackResponse_t.

DCGM_MAX_BLOB_LENGTH

Set above the size of the largest blob entry.

Currently this is dcgmDeviceVgpuTypeInfo_v1

dcgmFieldValue_version1

Version 1 for dcgmFieldValue_v1.

dcgmFieldValue_version2

Version 2 for dcgmFieldValue_v2.

DCGM_FV_FLAG_LIVE_DATA

Field value flags used by dcgmEntitiesGetLatestValues.

Retrieve live data from the driver rather than cached data. Warning: Setting this flag will result in multiple calls to the NVIDIA driver that will be much slower than retrieving a cached value.

DCGM_HEALTH_WATCH_COUNT_V1

For iterating through the dcgmHealthSystems_v1 enum

DCGM_HEALTH_WATCH_COUNT_V2

For iterating through the dcgmHealthSystems_v2 enum

DCGM_HEALTH_WATCH_MAX_INCIDENTS
dcgmHealthResponse_version4

Version 4 for dcgmHealthResponse_v4.

dcgmHealthResponse_version

Latest version for dcgmHealthResponse_t.

dcgmHealthSetParams_version2

Version 2 for dcgmHealthSetParams_v2.

DCGM_MAX_PID_INFO_NUM
dcgmPidInfo_version2

Version 2 for dcgmPidInfo_v2.

dcgmPidInfo_version

Latest version for dcgmPidInfo_t.

dcgmJobInfo_version3

Version 3 for dcgmJobInfo_v3.

dcgmJobInfo_version

Latest version for dcgmJobInfo_t.

dcgmRunningProcess_version1

Version 1 for dcgmRunningProcess_v1.

dcgmRunningProcess_version

Latest version for dcgmRunningProcess_t.

DCGM_SM_PERF_INDEX
DCGM_TARGETED_PERF_INDEX
DCGM_PER_GPU_TEST_COUNT_V6
DCGM_PER_GPU_TEST_COUNT_V7
DCGM_SWTEST_COUNT
LEVEL_ONE_MAX_RESULTS
dcgmDiagResponse_version6

Version 6 for dcgmDiagResponse_v6.

dcgmDiagResponse_version7

Version 7 for dcgmDiagResponse_v7.

dcgmDiagResponse_version

Latest version for dcgmDiagResponse_t.

DCGM_TOPOLOGY_PATH_PCI(x)
DCGM_AFFINITY_BITMASK_ARRAY_SIZE
dcgmDeviceTopology_version1

Version 1 for dcgmDeviceTopology_v1.

dcgmDeviceTopology_version

Latest version for dcgmDeviceTopology_t.

dcgmGroupTopology_version1

Version 1 for dcgmGroupTopology_v1.

dcgmGroupTopology_version

Latest version for dcgmGroupTopology_t.

dcgmIntrospectContext_version1

Version 1 for dcgmIntrospectContext_t.

dcgmIntrospectContext_version

Latest version for dcgmIntrospectContext_t.

dcgmIntrospectFieldsExecTime_version1

Version 1 for dcgmIntrospectFieldsExecTime_t.

dcgmIntrospectFieldsExecTime_version

Latest version for dcgmIntrospectFieldsExecTime_t.

dcgmIntrospectFullFieldsExecTime_version2

Version 2 for dcgmIntrospectFullFieldsExecTime_t.

dcgmIntrospectFullFieldsExecTime_version

Latest version for dcgmIntrospectFullFieldsExecTime_t.

dcgmIntrospectMemory_version1

Version 1 for dcgmIntrospectMemory_t.

dcgmIntrospectMemory_version

Latest version for dcgmIntrospectMemory_t.

dcgmIntrospectFullMemory_version1

Version 1 for dcgmIntrospectFullMemory_t.

dcgmIntrospectFullMemory_version

Latest version for dcgmIntrospectFullMemory_t.

dcgmIntrospectCpuUtil_version1

Version 1 for dcgmIntrospectCpuUtil_t.

dcgmIntrospectCpuUtil_version

Latest version for dcgmIntrospectCpuUtil_t.

DCGM_MAX_CONFIG_FILE_LEN
DCGM_MAX_TEST_NAMES
DCGM_MAX_TEST_NAMES_LEN
DCGM_MAX_TEST_PARMS
DCGM_MAX_TEST_PARMS_LEN
DCGM_GPU_LIST_LEN
DCGM_FILE_LEN
DCGM_PATH_LEN
DCGM_THROTTLE_MASK_LEN
dcgmRunDiag_version7

Version 7 for dcgmRunDiag_t.

DCGM_GEGE_FLAG_ONLY_SUPPORTED

Flags for dcgmGetEntityGroupEntities’s flags parameter.

Only return entities that are supported by DCGM. This mimics the behavior of dcgmGetAllSupportedDevices().

dcgmTopoSchedHint_version1
dcgmNvLinkStatus_version1

Version 1 of dcgmNvLinkStatus.

dcgmNvLinkStatus_version2

Version 2 of dcgmNvLinkStatus.

DCGM_SUMMARY_MIN
DCGM_SUMMARY_MAX
DCGM_SUMMARY_AVG
DCGM_SUMMARY_SUM
DCGM_SUMMARY_COUNT
DCGM_SUMMARY_INTEGRAL
DCGM_SUMMARY_DIFF
DCGM_SUMMARY_SIZE
dcgmFieldSummaryRequest_version1
DCGM_MODULE_STATUSES_CAPACITY
dcgmModuleGetStatuses_version1

Version 1 of dcgmModuleGetStatuses.

dcgmModuleGetStatuses_version
dcgmStartEmbeddedV2Params_version1

Version 1 for dcgmStartEmbeddedV2Params_v1.

dcgmStartEmbeddedV2Params_version2

Version 2 for dcgmStartEmbeddedV2Params_v2.

DCGM_PROF_MAX_NUM_GROUPS

Maximum number of metric ID groups that can exist in DCGM.

DCGM_PROF_MAX_FIELD_IDS_PER_GROUP

Maximum number of field IDs that can be in a single DCGM profiling metric group.

dcgmProfGetMetricGroups_version2

Version 2 of dcgmProfGetMetricGroups_t.

dcgmProfGetMetricGroups_version
dcgmProfWatchFields_version1

Version 1 of dcgmProfWatchFields_v1.

dcgmProfWatchFields_version
dcgmProfUnwatchFields_version1

Version 1 of dcgmProfUnwatchFields_v1.

dcgmProfUnwatchFields_version
dcgmSettingsSetLoggingSeverity_version1
dcgmSettingsSetLoggingSeverity_version
dcgmVersionInfo_version2

Version 2 of dcgmVersionInfo_v2.

dcgmVersionInfo_version

Typedefs

typedef uintptr_t dcgmHandle_t

Identifier for DCGM Handle.

typedef uintptr_t dcgmGpuGrp_t

Identifier for a group of GPUs. A group can have one or more GPUs.

typedef uintptr_t dcgmFieldGrp_t

Identifier for a group of fields.

typedef uintptr_t dcgmStatus_t

Identifier for list of status codes.

typedef dcgmConnectV2Params_v2 dcgmConnectV2Params_t

Typedef for dcgmConnectV2Params_v2.

typedef dcgmHostengineHealth_v1 dcgmHostengineHealth_t

Typedef for dcgmHostengineHealth_v1.

typedef dcgmGroupInfo_v2 dcgmGroupInfo_t

Typedef for dcgmGroupInfo_v2.

typedef dcgmFieldGroupInfo_v1 dcgmFieldGroupInfo_t
typedef dcgmAllFieldGroup_v1 dcgmAllFieldGroup_t
typedef dcgmClockSet_v1 dcgmClockSet_t

Typedef for dcgmClockSet_v1.

typedef dcgmDeviceSupportedClockSets_v1 dcgmDeviceSupportedClockSets_t

Typedef for dcgmDeviceSupportedClockSets_v1.

typedef dcgmDevicePidAccountingStats_v1 dcgmDevicePidAccountingStats_t

Typedef for dcgmDevicePidAccountingStats_v1.

typedef dcgmDeviceThermals_v1 dcgmDeviceThermals_t

Typedef for dcgmDeviceThermals_v1.

typedef dcgmDevicePowerLimits_v1 dcgmDevicePowerLimits_t

Typedef for dcgmDevicePowerLimits_v1.

typedef dcgmDeviceIdentifiers_v1 dcgmDeviceIdentifiers_t

Typedef for dcgmDeviceIdentifiers_v1.

typedef dcgmDeviceMemoryUsage_v1 dcgmDeviceMemoryUsage_t

Typedef for dcgmDeviceMemoryUsage_v1.

typedef dcgmDeviceVgpuUtilInfo_v1 dcgmDeviceVgpuUtilInfo_t

Typedef for dcgmDeviceVgpuUtilInfo_v1.

typedef dcgmDeviceEncStats_v1 dcgmDeviceEncStats_t

Typedef for dcgmDeviceEncStats_v1.

typedef dcgmDeviceFbcStats_v1 dcgmDeviceFbcStats_t

Typedef for dcgmDeviceFbcStats_v1.

typedef enum dcgmFBCSessionType_enum dcgmFBCSessionType_t
typedef dcgmDeviceFbcSessionInfo_v1 dcgmDeviceFbcSessionInfo_t

Typedef for dcgmDeviceFbcSessionInfo_v1.

typedef dcgmDeviceFbcSessions_v1 dcgmDeviceFbcSessions_t

Typedef for dcgmDeviceFbcSessions_v1.

typedef enum dcgmEncoderQueryType_enum dcgmEncoderType_t
typedef dcgmDeviceVgpuEncSessions_v1 dcgmDeviceVgpuEncSessions_t

Typedef for dcgmDeviceVgpuEncSessions_v1.

typedef dcgmDeviceVgpuProcessUtilInfo_v1 dcgmDeviceVgpuProcessUtilInfo_t

Typedef for dcgmDeviceVgpuProcessUtilInfo_v1.

typedef dcgmDeviceVgpuTypeInfo_v2 dcgmDeviceVgpuTypeInfo_t

Typedef for dcgmDeviceVgpuTypeInfo_v2.

typedef dcgmDeviceSettings_v2 dcgmDeviceSettings_t
typedef dcgmDeviceAttributes_v3 dcgmDeviceAttributes_t

Typedef for dcgmDeviceAttributes_v3.

typedef dcgmConfig_v1 dcgmConfig_t

Typedef for dcgmConfig_v1.

typedef int (*fpRecvUpdates)(void *userData)

Represents a callback to receive updates from asynchronous functions.

Currently the only function that uses this callback is dcgmPolicyRegister, in which case the void * data is a pointer to a dcgmPolicyCallbackResponse_t. For example: dcgmPolicyCallbackResponse_t *callbackResponse = (dcgmPolicyCallbackResponse_t *) userData;
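As a rough sketch of a callback with the fpRecvUpdates shape, the function below casts userData back to a response structure and records what it saw. `exPolicyCallbackResponse_t` is a trimmed, hypothetical stand-in for dcgmPolicyCallbackResponse_t so that the example compiles on its own; its `condition` member, the global tallies, and the function name are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed stand-in for dcgmPolicyCallbackResponse_t. */
typedef struct
{
    unsigned int version;
    unsigned int condition; /* assumed: which policy condition fired */
} exPolicyCallbackResponse_t;

/* Caller-side tallies, updated by the callback. */
static unsigned int exViolationsSeen = 0;
static unsigned int exLastCondition  = 0;

/* Same shape as fpRecvUpdates: int (*)(void *userData). */
static int exOnPolicyUpdate(void *userData)
{
    exPolicyCallbackResponse_t *response = (exPolicyCallbackResponse_t *)userData;

    assert(response != NULL);
    exLastCondition = response->condition;
    exViolationsSeen++;
    return 0;
}
```

A production callback would also need to consider thread safety when touching shared state, since asynchronous updates may not arrive on the caller's thread.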

typedef dcgmPolicyViolation_v1 dcgmPolicyViolation_t
typedef enum dcgmPolicyConditionIdx_enum dcgmPolicyConditionIdx_t

Enumeration for policy conditions.

When used as part of dcgmPolicy_t these have corresponding parameters to allow them to be switched on/off or set specific violation thresholds

typedef enum dcgmPolicyCondition_enum dcgmPolicyCondition_t

Bitmask enumeration for policy conditions.

When used as part of dcgmPolicy_t these have corresponding parameters to allow them to be switched on/off or set specific violation thresholds

typedef struct dcgmPolicyConditionParams_st dcgmPolicyConditionParams_t

Structure for policy condition parameters.

This structure contains a tag that represents the type of the value being passed as well as a “val” which is a union of the possible value types. For example, to pass a true boolean: tag = BOOL, val.boolean = 1.
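The tag-plus-union layout described above can be sketched as follows. `exPolicyConditionParams_t` is a hypothetical mirror written for this example: the BOOL tag and `val.boolean` come from the description, while the numeric tag (`EX_LLONG`), the `llval` member, and the helper functions are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of dcgmPolicyConditionParams_t. The numeric tag and
 * val.llval are assumptions for conditions that take a threshold. */
typedef struct
{
    enum { EX_BOOL, EX_LLONG } tag;
    union
    {
        unsigned int boolean;
        unsigned long long llval;
    } val;
} exPolicyConditionParams_t;

/* Build a boolean parameter: tag = BOOL, val.boolean = 0 or 1. */
static exPolicyConditionParams_t exBoolParam(int enabled)
{
    exPolicyConditionParams_t p = { EX_BOOL, { 0u } };
    p.val.boolean = enabled ? 1u : 0u;
    return p;
}

/* Build a numeric parameter, e.g. a violation threshold. */
static exPolicyConditionParams_t exNumberParam(unsigned long long threshold)
{
    exPolicyConditionParams_t p = { EX_LLONG, { 0u } };
    p.val.llval = threshold;
    return p;
}

/* Read a parameter back as a number, checking the tag before touching val. */
static unsigned long long exParamAsNumber(const exPolicyConditionParams_t *p)
{
    assert(p != NULL);
    return (p->tag == EX_BOOL) ? p->val.boolean : p->val.llval;
}
```

Checking the tag before reading the union is the point of the design: only the member that was last written is meaningful.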

typedef enum dcgmPolicyMode_enum dcgmPolicyMode_t

Enumeration for policy modes.

typedef enum dcgmPolicyIsolation_enum dcgmPolicyIsolation_t

Enumeration for policy isolation modes.

typedef enum dcgmPolicyAction_enum dcgmPolicyAction_t

Enumeration for policy actions.

typedef enum dcgmPolicyValidation_enum dcgmPolicyValidation_t

Enumeration for policy validation actions.

typedef enum dcgmPolicyFailureResp_enum dcgmPolicyFailureResp_t

Enumeration for policy failure responses.

typedef dcgmPolicy_v1 dcgmPolicy_t

Typedef for dcgmPolicy_v1.

typedef dcgmPolicyCallbackResponse_v1 dcgmPolicyCallbackResponse_t

Typedef for dcgmPolicyCallbackResponse_v1.

typedef int (*dcgmFieldValueEnumeration_f)(unsigned int gpuId, dcgmFieldValue_v1 *values, int numValues, void *userData)

User callback function for processing one or more field updates.

This callback will be invoked one or more times per field until all of the expected field values have been enumerated. It is up to the callee to detect when the field ID changes.

Param gpuId

IN: GPU ID of the GPU this field value set belongs to

Param values

IN: Field values. These values must be copied as they will be destroyed as soon as this call returns.

Param numValues

IN: Number of entries that are valid in values[]

Param userData

IN: User data pointer passed to the update function that generated this callback

Return

0 if OK, <0 if enumeration should stop. This allows the callee to abort field value enumeration.
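The callback contract above (copy the values before returning, notice when the field ID changes, return a negative value to stop) can be sketched like this. `exFieldValue_t` is a simplified, hypothetical stand-in for dcgmFieldValue_v1 with only the members this sketch needs, and `exCollector_t`, `EX_MAX_SAVED`, and `exOnFieldValues` are names invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for dcgmFieldValue_v1. */
typedef struct
{
    unsigned short fieldId;
    long long i64;
} exFieldValue_t;

#define EX_MAX_SAVED 8

/* Caller-owned storage handed to the callback through userData. */
typedef struct
{
    exFieldValue_t saved[EX_MAX_SAVED];
    int numSaved;
    int numFieldChanges;
} exCollector_t;

/* Same shape as dcgmFieldValueEnumeration_f, using the stand-in value type. */
static int exOnFieldValues(unsigned int gpuId, exFieldValue_t *values, int numValues, void *userData)
{
    exCollector_t *c = (exCollector_t *)userData;

    (void)gpuId;
    assert(c != NULL && values != NULL);

    for (int i = 0; i < numValues; i++)
    {
        if (c->numSaved == EX_MAX_SAVED)
            return -1; /* out of room: ask the caller to stop enumerating */

        /* Detect a change of field ID relative to the last saved value. */
        if (c->numSaved > 0 && c->saved[c->numSaved - 1].fieldId != values[i].fieldId)
            c->numFieldChanges++;

        /* Copy by value: values[] is not valid after this call returns. */
        c->saved[c->numSaved++] = values[i];
    }
    return 0;
}
```

Passing the collector through userData keeps the callback free of global state, so the same function can serve several concurrent enumerations.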

typedef int (*dcgmFieldValueEntityEnumeration_f)(dcgm_field_entity_group_t entityGroupId, dcgm_field_eid_t entityId, dcgmFieldValue_v1 *values, int numValues, void *userData)

User callback function for processing one or more field updates.

This callback will be invoked one or more times per field until all of the expected field values have been enumerated. It is up to the callee to detect when the field ID changes.

Param entityGroupId

IN: entityGroup of the entity this field value set belongs to

Param entityId

IN: Entity this field value set belongs to

Param values

IN: Field values. These values must be copied as they will be destroyed as soon as this call returns.

Param numValues

IN: Number of entries that are valid in values[]

Param userData

IN: User data pointer passed to the update function that generated this callback

Return

0 if OK, <0 if enumeration should stop. This allows the callee to abort field value enumeration.

typedef enum dcgmHealthSystems_enum dcgmHealthSystems_t

Systems structure used to enable or disable health watch systems.

typedef enum dcgmHealthWatchResult_enum dcgmHealthWatchResults_t

Health Watch test results.

typedef dcgmHealthResponse_v4 dcgmHealthResponse_t

Typedef for dcgmHealthResponse_v4.

typedef dcgmPidInfo_v2 dcgmPidInfo_t

Typedef for dcgmPidInfo_v2.

typedef dcgmJobInfo_v3 dcgmJobInfo_t

Typedef for dcgmJobInfo_v3.

typedef dcgmRunningProcess_v1 dcgmRunningProcess_t

Typedef for dcgmRunningProcess_v1.

typedef enum dcgmDiagResult_enum dcgmDiagResult_t

Diagnostic test results.

typedef enum dcgmPerGpuTestIndices_enum dcgmPerGpuTestIndices_t

Diagnostic per gpu tests - fixed indices for dcgmDiagResponsePerGpu_t.results[].

typedef enum dcgmSoftwareTest_enum dcgmSoftwareTest_t
typedef dcgmDiagResponse_v7 dcgmDiagResponse_t

Typedef for dcgmDiagResponse_v7.

typedef enum dcgmGpuLevel_enum dcgmGpuTopologyLevel_t

Represents level relationships within a system between two GPUs. The enums are spaced to allow for future relationships.

These match the definitions in nvml.h

typedef dcgmDeviceTopology_v1 dcgmDeviceTopology_t

Typedef for dcgmDeviceTopology_v1.

typedef dcgmGroupTopology_v1 dcgmGroupTopology_t

Typedef for dcgmGroupTopology_v1.

typedef enum dcgmIntrospectLevel_enum dcgmIntrospectLevel_t

Identifies a level to retrieve field introspection info for.

typedef dcgmIntrospectContext_v1 dcgmIntrospectContext_t

Typedef for dcgmIntrospectContext_v1.

typedef dcgmIntrospectFieldsExecTime_v1 dcgmIntrospectFieldsExecTime_t

Typedef for dcgmIntrospectFieldsExecTime_t.

typedef dcgmIntrospectFullFieldsExecTime_v2 dcgmIntrospectFullFieldsExecTime_t

Typedef for dcgmIntrospectFullFieldsExecTime_v2.

typedef enum dcgmIntrospectState_enum dcgmIntrospectState_t

State of DCGM metadata gathering.

If it is set to DISABLED then “Metadata” API calls to DCGM are not supported.

typedef dcgmIntrospectMemory_v1 dcgmIntrospectMemory_t

Typedef for dcgmIntrospectMemory_t.

typedef dcgmIntrospectFullMemory_v1 dcgmIntrospectFullMemory_t

Typedef for dcgmIntrospectFullMemory_v1.

typedef dcgmIntrospectCpuUtil_v1 dcgmIntrospectCpuUtil_t

Typedef for dcgmIntrospectCpuUtil_t.

typedef enum dcgmGpuNVLinkErrorType_enum dcgmGpuNVLinkErrorType_t

Identifies a GPU NVLink error type returned by DCGM_FI_DEV_GPU_NVLINK_ERRORS.

typedef dcgmTopoSchedHint_v1 dcgmTopoSchedHint_t
typedef enum dcgmNvLinkLinkState_enum dcgmNvLinkLinkState_t

NvLink link states.

typedef dcgmNvLinkStatus_v2 dcgmNvLinkStatus_t
typedef dcgmFieldSummaryRequest_v1 dcgmFieldSummaryRequest_t
typedef dcgmModuleGetStatuses_v1 dcgmModuleGetStatuses_t
typedef dcgmProfGetMetricGroups_v2 dcgmProfGetMetricGroups_t
typedef dcgmProfWatchFields_v1 dcgmProfWatchFields_t
typedef dcgmProfUnwatchFields_v1 dcgmProfUnwatchFields_t
typedef dcgmSettingsSetLoggingSeverity_v1 dcgmSettingsSetLoggingSeverity_t
typedef dcgmVersionInfo_v2 dcgmVersionInfo_t

Enums

enum DcgmLoggingSeverity_t

DCGM Logging Severities.

These match up with the plog severities defined in Severity.h. Each level includes all of the levels above it. For instance, level 4 includes levels 3, 2, and 1 as well.

Values:

enumerator DcgmLoggingSeverityUnspecified

Don’t care/inherit from the environment

enumerator DcgmLoggingSeverityNone

No logging

enumerator DcgmLoggingSeverityFatal

Fatal Errors

enumerator DcgmLoggingSeverityError

Errors

enumerator DcgmLoggingSeverityWarning

Warnings

enumerator DcgmLoggingSeverityInfo

Informative

enumerator DcgmLoggingSeverityDebug

Debug information (will generate large logs)

enumerator DcgmLoggingSeverityVerbose

Verbose debugging information
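The "each level includes the levels above it" rule amounts to a simple ordered comparison. The sketch below assumes the severities are numbered in the order listed, with larger numbers meaning more verbose output. The `ExSeverity*` names and values are stand-ins for DcgmLoggingSeverity_t written for this example, and `exShouldLog` is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for DcgmLoggingSeverity_t. The numeric values are assumptions
 * chosen so that a higher number means more verbose output. */
typedef enum
{
    ExSeverityUnspecified = -1,
    ExSeverityNone        = 0,
    ExSeverityFatal       = 1,
    ExSeverityError       = 2,
    ExSeverityWarning     = 3,
    ExSeverityInfo        = 4,
    ExSeverityDebug       = 5,
    ExSeverityVerbose     = 6
} exSeverity_t;

/* A message is emitted when its severity is at or below the configured
 * level. This sketch treats Unspecified like None; a real caller would
 * first resolve it from the environment. */
static bool exShouldLog(exSeverity_t configured, exSeverity_t message)
{
    assert(message >= ExSeverityFatal); /* messages carry a concrete level */
    if (configured <= ExSeverityNone)
        return false;
    return message <= configured;
}
```

With the configured level set to Info (4), Warning, Error, and Fatal messages pass the check, while Debug and Verbose messages do not.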

enum dcgmMigProfile_t

Enum for the different kinds of MIG profiles.

Values:

enumerator DcgmMigProfileNone

No profile (for GPUs)

enumerator DcgmMigProfileGpuInstanceSlice1

GPU instance slice 1

enumerator DcgmMigProfileGpuInstanceSlice2

GPU instance slice 2

enumerator DcgmMigProfileGpuInstanceSlice3

GPU instance slice 3

enumerator DcgmMigProfileGpuInstanceSlice4

GPU instance slice 4

enumerator DcgmMigProfileGpuInstanceSlice7

GPU instance slice 7

enumerator DcgmMigProfileGpuInstanceSlice8

GPU instance slice 8

enumerator DcgmMigProfileComputeInstanceSlice1

compute instance slice 1

enumerator DcgmMigProfileComputeInstanceSlice2

compute instance slice 2

enumerator DcgmMigProfileComputeInstanceSlice3

compute instance slice 3

enumerator DcgmMigProfileComputeInstanceSlice4

compute instance slice 4

enumerator DcgmMigProfileComputeInstanceSlice7

compute instance slice 7

enumerator DcgmMigProfileComputeInstanceSlice8

compute instance slice 8

enum dcgmFBCSessionType_enum

Values:

enumerator DCGM_FBC_SESSION_TYPE_UNKNOWN

Unknown.

enumerator DCGM_FBC_SESSION_TYPE_TOSYS

FB capture for a system buffer.

enumerator DCGM_FBC_SESSION_TYPE_CUDA

FB capture for a cuda buffer.

enumerator DCGM_FBC_SESSION_TYPE_VID

FB capture for a Vid buffer.

enumerator DCGM_FBC_SESSION_TYPE_HWENC

FB capture for a NVENC HW buffer.

enum dcgmEncoderQueryType_enum

Values:

enumerator DCGM_ENCODER_QUERY_H264
enumerator DCGM_ENCODER_QUERY_HEVC
enum dcgmPolicyConditionIdx_enum

Enumeration for policy conditions.

When used as part of dcgmPolicy_t these have corresponding parameters to allow them to be switched on/off or set specific violation thresholds

Values:

enumerator DCGM_POLICY_COND_IDX_DBE

Double bit errors - boolean in dcgmPolicyConditionParams_t.

enumerator DCGM_POLICY_COND_IDX_PCI

PCI events/errors - boolean in dcgmPolicyConditionParams_t.

enumerator DCGM_POLICY_COND_IDX_MAX_PAGES_RETIRED

Maximum number of retired pages - number required in dcgmPolicyConditionParams_t.

enumerator DCGM_POLICY_COND_IDX_THERMAL

Thermal violation - number required in dcgmPolicyConditionParams_t.

enumerator DCGM_POLICY_COND_IDX_POWER

Power violation - number required in dcgmPolicyConditionParams_t.

enumerator DCGM_POLICY_COND_IDX_NVLINK

NVLINK errors - boolean in dcgmPolicyConditionParams_t.

enumerator DCGM_POLICY_COND_IDX_XID

XID errors - number required in dcgmPolicyConditionParams_t.

enum dcgmPolicyCondition_enum

Bitmask enumeration for policy conditions.

When used as part of dcgmPolicy_t these have corresponding parameters to allow them to be switched on/off or set specific violation thresholds

Values:

enumerator DCGM_POLICY_COND_DBE

Double bit errors - boolean in dcgmPolicyConditionParams_t.

enumerator DCGM_POLICY_COND_PCI

PCI events/errors - boolean in dcgmPolicyConditionParams_t.

enumerator DCGM_POLICY_COND_MAX_PAGES_RETIRED

Maximum number of retired pages - number required in dcgmPolicyConditionParams_t.

enumerator DCGM_POLICY_COND_THERMAL

Thermal violation - number required in dcgmPolicyConditionParams_t.

enumerator DCGM_POLICY_COND_POWER

Power violation - number required in dcgmPolicyConditionParams_t.

enumerator DCGM_POLICY_COND_NVLINK

NVLINK errors - boolean in dcgmPolicyConditionParams_t.

enumerator DCGM_POLICY_COND_XID

XID errors - number required in dcgmPolicyConditionParams_t.
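The index enumeration and the bitmask enumeration list the same conditions in the same order, which suggests a one-bit-per-index mapping. The sketch below illustrates that relationship under the assumption that the bitmask value of a condition is 1 shifted left by its index. The `EX_COND_IDX_*` names, their numbering, and both helper functions are stand-ins invented for this example and are not taken from dcgm_structs.h.

```c
#include <assert.h>

/* Stand-ins for a few dcgmPolicyConditionIdx_t values. The numbering is an
 * assumption for illustration. */
enum
{
    EX_COND_IDX_DBE     = 0,
    EX_COND_IDX_PCI     = 1,
    EX_COND_IDX_THERMAL = 3,
    EX_COND_IDX_MAX     = 7
};

/* Assumption: a condition's bitmask value is 1 shifted by its index. */
static unsigned int exCondMask(int idx)
{
    assert(idx >= 0 && idx < EX_COND_IDX_MAX);
    return 1u << idx;
}

/* Count how many conditions are set in a bitmask by walking the indices. */
static int exCountConditions(unsigned int mask)
{
    int count = 0;

    for (int idx = 0; idx < EX_COND_IDX_MAX; idx++)
    {
        if (mask & exCondMask(idx))
            count++;
    }
    return count;
}
```

Walking the indices is also a convenient way to visit per-condition parameters stored in an array that is indexed by the index enumeration.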

enum dcgmPolicyMode_enum

Enumeration for policy modes.

Values:

enumerator DCGM_POLICY_MODE_AUTOMATED

automatic mode

enumerator DCGM_POLICY_MODE_MANUAL

manual mode

enum dcgmPolicyIsolation_enum

Enumeration for policy isolation modes.

Values:

enumerator DCGM_POLICY_ISOLATION_NONE

no isolation of GPUs on error

enum dcgmPolicyAction_enum

Enumeration for policy actions.

Values:

enumerator DCGM_POLICY_ACTION_NONE

no action

enumerator DCGM_POLICY_ACTION_GPURESET

Deprecated - perform a GPU reset on violation.

enum dcgmPolicyValidation_enum

Enumeration for policy validation actions.

Values:

enumerator DCGM_POLICY_VALID_NONE

no validation after an action is performed

enumerator DCGM_POLICY_VALID_SV_SHORT

run a short System Validation on the system after failure

enumerator DCGM_POLICY_VALID_SV_MED

run a medium System Validation test after failure

enumerator DCGM_POLICY_VALID_SV_LONG

run an extensive System Validation test after failure

enumerator DCGM_POLICY_VALID_SV_XLONG

run a more extensive System Validation test after failure

enum dcgmPolicyFailureResp_enum

Enumeration for policy failure responses.

Values:

enumerator DCGM_POLICY_FAILURE_NONE

on failure of validation perform no action

enum dcgmHealthSystems_enum

Systems structure used to enable or disable health watch systems.

Values:

enumerator DCGM_HEALTH_WATCH_PCIE

PCIe system watches (must have 1m of data before query)

enumerator DCGM_HEALTH_WATCH_NVLINK

NVLINK system watches.

enumerator DCGM_HEALTH_WATCH_PMU

Power management unit watches.

enumerator DCGM_HEALTH_WATCH_MCU

Micro-controller unit watches.

enumerator DCGM_HEALTH_WATCH_MEM

Memory watches.

enumerator DCGM_HEALTH_WATCH_SM

Streaming multiprocessor watches.

enumerator DCGM_HEALTH_WATCH_INFOROM

Inforom watches.

enumerator DCGM_HEALTH_WATCH_THERMAL

Temperature watches (must have 1m of data before query)

enumerator DCGM_HEALTH_WATCH_POWER

Power watches (must have 1m of data before query)

enumerator DCGM_HEALTH_WATCH_DRIVER

Driver-related watches.

enumerator DCGM_HEALTH_WATCH_NVSWITCH_NONFATAL

Non-fatal errors in NvSwitch.

enumerator DCGM_HEALTH_WATCH_NVSWITCH_FATAL

Fatal errors in NvSwitch.

enumerator DCGM_HEALTH_WATCH_ALL

All watches enabled.

enum dcgmHealthWatchResult_enum

Health Watch test results.

Values:

enumerator DCGM_HEALTH_RESULT_PASS

All results within this system are reporting normal.

enumerator DCGM_HEALTH_RESULT_WARN

A warning has been issued, refer to the response for more information.

enumerator DCGM_HEALTH_RESULT_FAIL

A failure has been issued, refer to the response for more information.

enum dcgmDiagnosticLevel_t

Enumeration for diagnostic levels.

Values:

enumerator DCGM_DIAG_LVL_INVALID

Uninitialized.

enumerator DCGM_DIAG_LVL_SHORT

run a very basic health check on the system

enumerator DCGM_DIAG_LVL_MED

run a medium-length diagnostic (a few minutes)

enumerator DCGM_DIAG_LVL_LONG

run an extensive diagnostic (several minutes)

enumerator DCGM_DIAG_LVL_XLONG

run a very extensive diagnostic (many minutes)

enum dcgmDiagResult_enum

Diagnostic test results.

Values:

enumerator DCGM_DIAG_RESULT_PASS

This test passed the diagnostics.

enumerator DCGM_DIAG_RESULT_SKIP

This test was skipped.

enumerator DCGM_DIAG_RESULT_WARN

This test passed with warnings.

enumerator DCGM_DIAG_RESULT_FAIL

This test failed the diagnostics.

enumerator DCGM_DIAG_RESULT_NOT_RUN

This test wasn’t executed.

enum dcgmPerGpuTestIndices_enum

Diagnostic per gpu tests - fixed indices for dcgmDiagResponsePerGpu_t.results[].

Values:

enumerator DCGM_MEMORY_INDEX

Memory test index.

enumerator DCGM_DIAGNOSTIC_INDEX

Diagnostic test index.

enumerator DCGM_PCI_INDEX

PCIe test index.

enumerator DCGM_SM_STRESS_INDEX

SM Stress test index.

enumerator DCGM_TARGETED_STRESS_INDEX

Targeted Stress test index.

enumerator DCGM_TARGETED_POWER_INDEX

Targeted Power test index.

enumerator DCGM_MEMORY_BANDWIDTH_INDEX

Memory bandwidth test index.

enumerator DCGM_MEMTEST_INDEX

Memtest test index.

enumerator DCGM_PULSE_TEST_INDEX

Pulse test index.

enumerator DCGM_SOFTWARE_INDEX

Software test index.

enumerator DCGM_CONTEXT_CREATE_INDEX

Context create test index.

enumerator DCGM_UNKNOWN_INDEX

Unknown test.
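Because these enumerators are fixed indices into a per-GPU results array, a result can be looked up directly by test. The following sketch shows that access pattern with hypothetical stand-in types: `exDiagResponsePerGpu_t`, the `EX_*` enumerators and their values, and `exCountFailedTests` are invented for this example and only loosely mirror the DCGM structures.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for three per-GPU test indices plus a count; values assumed. */
enum
{
    EX_MEMORY_INDEX       = 0,
    EX_DIAGNOSTIC_INDEX   = 1,
    EX_PCI_INDEX          = 2,
    EX_PER_GPU_TEST_COUNT = 3
};

/* Stand-in for the diagnostic result values. */
typedef enum
{
    EX_RESULT_PASS,
    EX_RESULT_SKIP,
    EX_RESULT_WARN,
    EX_RESULT_FAIL,
    EX_RESULT_NOT_RUN
} exDiagResult_t;

/* Hypothetical per-GPU response: one result slot per test index. */
typedef struct
{
    unsigned int gpuId;
    exDiagResult_t results[EX_PER_GPU_TEST_COUNT];
} exDiagResponsePerGpu_t;

/* Count the tests that failed on one GPU. */
static int exCountFailedTests(const exDiagResponsePerGpu_t *gpu)
{
    int failed = 0;

    assert(gpu != NULL);
    for (int i = 0; i < EX_PER_GPU_TEST_COUNT; i++)
    {
        if (gpu->results[i] == EX_RESULT_FAIL)
            failed++;
    }
    return failed;
}
```

A single test can then be checked with an expression such as `gpu.results[EX_PCI_INDEX] == EX_RESULT_FAIL`.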

enum dcgmSoftwareTest_enum

Values:

enumerator DCGM_SWTEST_BLACKLIST

test for presence of blacklisted drivers (e.g. nouveau)

enumerator DCGM_SWTEST_NVML_LIBRARY

test for presence (and version) of NVML lib

enumerator DCGM_SWTEST_CUDA_MAIN_LIBRARY

test for presence (and version) of CUDA lib

enumerator DCGM_SWTEST_CUDA_RUNTIME_LIBRARY

test for presence (and version) of CUDA RT lib

enumerator DCGM_SWTEST_PERMISSIONS

test for character device permissions

enumerator DCGM_SWTEST_PERSISTENCE_MODE

test for persistence mode enabled

enumerator DCGM_SWTEST_ENVIRONMENT

test for CUDA environment vars that may slow tests

enumerator DCGM_SWTEST_PAGE_RETIREMENT

test for pending frame buffer page retirement

enumerator DCGM_SWTEST_GRAPHICS_PROCESSES

test for graphics processes running

enumerator DCGM_SWTEST_INFOROM

test for inforom corruption

enum dcgmGpuLevel_enum

Represents level relationships within a system between two GPUs. The enums are spaced to allow for future relationships.

These match the definitions in nvml.h

Values:

enumerator DCGM_TOPOLOGY_UNINITIALIZED
enumerator DCGM_TOPOLOGY_BOARD

multi-GPU board

enumerator DCGM_TOPOLOGY_SINGLE

all devices that only need traverse a single PCIe switch

enumerator DCGM_TOPOLOGY_MULTIPLE

all devices that need not traverse a host bridge

enumerator DCGM_TOPOLOGY_HOSTBRIDGE

all devices that are connected to the same host bridge

enumerator DCGM_TOPOLOGY_CPU

all devices that are connected to the same CPU but possibly multiple host bridges

enumerator DCGM_TOPOLOGY_SYSTEM

all devices in the system

enumerator DCGM_TOPOLOGY_NVLINK1

GPUs connected via a single NVLINK link.

enumerator DCGM_TOPOLOGY_NVLINK2

GPUs connected via two NVLINK links.

enumerator DCGM_TOPOLOGY_NVLINK3

GPUs connected via three NVLINK links.

enumerator DCGM_TOPOLOGY_NVLINK4

GPUs connected via four NVLINK links.

enumerator DCGM_TOPOLOGY_NVLINK5

GPUs connected via five NVLINK links.

enumerator DCGM_TOPOLOGY_NVLINK6

GPUs connected via six NVLINK links.

enumerator DCGM_TOPOLOGY_NVLINK7

GPUs connected via seven NVLINK links.

enumerator DCGM_TOPOLOGY_NVLINK8

GPUs connected via eight NVLINK links.

enumerator DCGM_TOPOLOGY_NVLINK9

GPUs connected via nine NVLINK links.

enumerator DCGM_TOPOLOGY_NVLINK10

GPUs connected via ten NVLINK links.

enumerator DCGM_TOPOLOGY_NVLINK11

GPUs connected via eleven NVLINK links.

enumerator DCGM_TOPOLOGY_NVLINK12

GPUs connected via twelve NVLINK links.

enum dcgmIntrospectLevel_enum

Identifies a level to retrieve field introspection info for.

Values:

enumerator DCGM_INTROSPECT_LVL_INVALID

Invalid value.

enumerator DCGM_INTROSPECT_LVL_FIELD

Introspection data is grouped by field ID.

enumerator DCGM_INTROSPECT_LVL_FIELD_GROUP

Introspection data is grouped by field group.

enumerator DCGM_INTROSPECT_LVL_ALL_FIELDS

Introspection data is aggregated for all fields.

enum dcgmIntrospectState_enum

State of DCGM metadata gathering.

If it is set to DISABLED then “Metadata” API calls to DCGM are not supported.

Values:

enumerator DCGM_INTROSPECT_STATE_DISABLED
enumerator DCGM_INTROSPECT_STATE_ENABLED
enum dcgmGpuNVLinkErrorType_enum

Identifies a GPU NVLink error type returned by DCGM_FI_DEV_GPU_NVLINK_ERRORS.

Values:

NVLink link recovery error occurred.

NVLink link fatal error occurred.

enum dcgmNvLinkLinkState_enum

NvLink link states.

Values:

enumerator DcgmNvLinkLinkStateNotSupported

NvLink is unsupported by this GPU (Default for GPUs)

enumerator DcgmNvLinkLinkStateDisabled

NvLink is supported for this link but this link is disabled (Default for NvSwitches)

enumerator DcgmNvLinkLinkStateDown

This NvLink link is down (inactive)

enumerator DcgmNvLinkLinkStateUp

This NvLink link is up (active)

enum dcgmModuleId_t

Module IDs.

Values:

enumerator DcgmModuleIdCore

Core DCGM - always loaded.

enumerator DcgmModuleIdNvSwitch

NvSwitch Module.

enumerator DcgmModuleIdVGPU

VGPU Module.

enumerator DcgmModuleIdIntrospect

Introspection Module.

enumerator DcgmModuleIdHealth

Health Module.

enumerator DcgmModuleIdPolicy

Policy Module.

enumerator DcgmModuleIdConfig

Config Module.

enumerator DcgmModuleIdDiag

GPU Diagnostic Module.

enumerator DcgmModuleIdProfiling

Profiling Module.

enumerator DcgmModuleIdCount

Always last. 1 greater than largest value above.

enum dcgmModuleStatus_t

Module Status.

Modules are lazy loaded, so they will be in status DcgmModuleStatusNotLoaded until they are used. Once modules are used, they will move to another status.

Values:

enumerator DcgmModuleStatusNotLoaded

Module has not been loaded yet.

enumerator DcgmModuleStatusBlacklisted

Module has been blacklisted from being loaded.

enumerator DcgmModuleStatusFailed

Loading the module failed.

enumerator DcgmModuleStatusLoaded

Module has been loaded.

enumerator DcgmModuleStatusUnloaded

Module has been unloaded, happens during shutdown.

struct dcgmConnectV2Params_v1
#include <dcgm_structs.h>

Connection options for dcgmConnect_v2 (v1)

NOTE: This version is deprecated. Use dcgmConnectV2Params_v2 instead.

Public Members

unsigned int version

Version number. Use dcgmConnectV2Params_version

unsigned int persistAfterDisconnect

Whether to persist DCGM state modified by this connection once the connection is terminated. Normally, all field watches created by a connection are removed once a connection goes away. 1 = do not clean up after this connection. 0 = clean up after this connection

struct dcgmConnectV2Params_v2
#include <dcgm_structs.h>

Connection options for dcgmConnect_v2 (v2)

Public Members

unsigned int version

Version number. Use dcgmConnectV2Params_version

unsigned int persistAfterDisconnect

Whether to persist DCGM state modified by this connection once the connection is terminated. Normally, all field watches created by a connection are removed once a connection goes away. 1 = do not clean up after this connection. 0 = clean up after this connection

unsigned int timeoutMs

When attempting to connect to the specified host engine, how long should we wait in milliseconds before giving up

unsigned int addressIsUnixSocket

Whether or not the passed-in address is a unix socket filename (1) or a TCP/IP address (0)
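
As a rough illustration of how these connection options might be filled in, the sketch below uses a local stand-in struct with the same members rather than the real header. The names exampleConnectParams_v2, EXAMPLE_MAKE_VERSION and exampleInitConnectParams are hypothetical, and the version encoding (struct size in the low bits, version number in the top byte) is an assumption; real code should include dcgm_structs.h and use dcgmConnectV2Params_version.

```c
#include <assert.h>
#include <string.h>

/* Local stand-in mirroring the documented members. Real code would use
 * dcgmConnectV2Params_v2 from <dcgm_structs.h>. */
typedef struct
{
    unsigned int version;
    unsigned int persistAfterDisconnect;
    unsigned int timeoutMs;
    unsigned int addressIsUnixSocket;
} exampleConnectParams_v2;

/* Assumed encoding: struct size in the low bits, version number in the top byte. */
#define EXAMPLE_MAKE_VERSION(type, ver) \
    ((unsigned int)(sizeof(type) | ((unsigned long)(ver) << 24)))

/* Zero the struct, then set the version, the connect timeout and the address kind. */
static void exampleInitConnectParams(exampleConnectParams_v2 *p,
                                     unsigned int timeoutMs,
                                     int useUnixSocket)
{
    assert(p != NULL);
    memset(p, 0, sizeof(*p));
    p->version                = EXAMPLE_MAKE_VERSION(exampleConnectParams_v2, 2);
    p->persistAfterDisconnect = 0; /* clean up field watches on disconnect */
    p->timeoutMs              = timeoutMs;
    p->addressIsUnixSocket    = useUnixSocket ? 1u : 0u;
}
```

Zeroing the struct first keeps any members added in later versions at a predictable default.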

struct dcgmHostengineHealth_v1
#include <dcgm_structs.h>

Typedef for dcgmHostengineHealth_v1.

Public Members

unsigned int version

The version of this request.

unsigned int overallHealth

0 to indicate healthy, or a code to indicate the error

struct dcgmGroupEntityPair_t
#include <dcgm_structs.h>

Represents an entityGroupId + entityId pair to uniquely identify a given entityId inside a group of entities.

Added in DCGM 1.5.0

Public Members

dcgm_field_entity_group_t entityGroupId

Entity Group ID entity belongs to.

dcgm_field_eid_t entityId

Entity ID of the entity.

struct dcgmGroupInfo_v2
#include <dcgm_structs.h>

Structure to store information for DCGM group.

Added in DCGM 1.5.0

Public Members

unsigned int version

Version Number (use dcgmGroupInfo_version2)

unsigned int count

count of entityIds returned in entityList

char groupName[256]

Group Name.

dcgmGroupEntityPair_t entityList[64]

List of the entities that are in this group.
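
Because only the first count entries of entityList are meaningful, code that walks the list should stop at count. The sketch below shows that pattern on local stand-in types; exampleEntityPair, exampleGroupInfo and exampleCountEntities are hypothetical names, and the entity group IDs are plain integers here rather than dcgm_field_entity_group_t values.

```c
#include <assert.h>
#include <stddef.h>

#define EXAMPLE_MAX_ENTITIES 64 /* matches entityList[64] above */

typedef struct
{
    unsigned int entityGroupId; /* stand-in for dcgm_field_entity_group_t */
    unsigned int entityId;      /* stand-in for dcgm_field_eid_t */
} exampleEntityPair;

typedef struct
{
    unsigned int version;
    unsigned int count;
    char groupName[256];
    exampleEntityPair entityList[EXAMPLE_MAX_ENTITIES];
} exampleGroupInfo;

/* Count entities belonging to one entity group, reading only the first
 * `count` entries and never more than the array can hold. */
static unsigned int exampleCountEntities(const exampleGroupInfo *info,
                                         unsigned int entityGroupId)
{
    assert(info != NULL);
    unsigned int valid = info->count < EXAMPLE_MAX_ENTITIES ? info->count
                                                            : EXAMPLE_MAX_ENTITIES;
    unsigned int matches = 0;
    for (unsigned int i = 0; i < valid; i++)
    {
        if (info->entityList[i].entityGroupId == entityGroupId)
            matches++;
    }
    return matches;
}
```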

struct dcgmMigHierarchyInfo_t
#include <dcgm_structs.h>

Represents a pair of entity pairings to uniquely identify an entity and its place in the hierarchy.

Public Members

dcgmGroupEntityPair_t entity

Entity id and type for the entity in question.

dcgmGroupEntityPair_t parent

Entity id and type for the parent of the entity in question.

dcgmMigProfile_t sliceProfile

Entity MIG profile identifier.

struct dcgmMigEntityInfo_t
#include <dcgm_structs.h>

Provides additional information about location of MIG entities.

Public Members

char gpuUuid[128]

GPU UUID

unsigned int nvmlGpuIndex

GPU index from NVML

unsigned int nvmlInstanceId

GPU instance index within GPU. 0 to N. -1 for GPU entities

unsigned int nvmlComputeInstanceId

GPU Compute instance index within GPU instance. 0 to N. -1 for GPU Instance and GPU entities

unsigned int nvmlMigProfileId

Unique profile ID for GPU or Compute instances. -1 for GPU entities

See also

nvmlComputeInstanceProfileInfo_st

See also

nvmlGpuInstanceProfileInfo_st

unsigned int nvmlProfileSlices

Number of slices in the MIG profile

struct dcgmMigHierarchyInfo_v2
struct dcgmMigHierarchy_v1
#include <dcgm_structs.h>

Structure to store the GPU hierarchy for a system.

Added in DCGM 2.0

struct dcgmMigHierarchy_v2
struct dcgmFieldGroupInfo_v1
#include <dcgm_structs.h>

Structure to represent information about a field group.

Public Members

unsigned int version

Version number (dcgmFieldGroupInfo_version)

unsigned int numFieldIds

Number of entries in fieldIds[] that are valid.

dcgmFieldGrp_t fieldGroupId

ID of this field group.

char fieldGroupName[256]

Field Group Name.

unsigned short fieldIds[128]

Field ids that belong to this group.

struct dcgmAllFieldGroup_v1

Public Members

unsigned int version

Version number (dcgmAllFieldGroup_version)

unsigned int numFieldGroups

Number of entries in fieldGroups[] that are populated.

dcgmFieldGroupInfo_t fieldGroups[64]

Info about each field group.

struct dcgmErrorInfo_t
#include <dcgm_structs.h>

Structure to represent error attributes.

Public Members

unsigned int gpuId

Represents GPU ID.

short fieldId

One of DCGM_FI_?

int status

One of DCGM_ST_?

struct dcgmClockSet_v1
#include <dcgm_structs.h>

Represents a set of memory, SM, and video clocks for a device.

These can be current values or target values, depending on context.

Public Members

int version

Version Number (dcgmClockSet_version)

unsigned int memClock

Memory Clock (Memory Clock value OR DCGM_INT32_BLANK to Ignore/Use compatible value with smClk)

unsigned int smClock

SM Clock (SM Clock value OR DCGM_INT32_BLANK to Ignore/Use compatible value with memClk)

struct dcgmDeviceSupportedClockSets_v1
#include <dcgm_structs.h>

Represents list of supported clock sets for a device.

Public Members

unsigned int version

Version Number (dcgmDeviceSupportedClockSets_version)

unsigned int count

Number of supported clocks.

dcgmClockSet_t clockSet[256]

Valid clock sets for the device. Up to count entries are filled.

struct dcgmDevicePidAccountingStats_v1
#include <dcgm_structs.h>

Represents accounting data for one process.

Public Members

unsigned int version

Version Number. Should match dcgmDevicePidAccountingStats_version.

unsigned int pid

Process id of the process these stats are for.

unsigned int gpuUtilization

Percent of time over the process’s lifetime during which one or more kernels was executing on the GPU.

Set to DCGM_INT32_NOT_SUPPORTED if not supported.

unsigned int memoryUtilization

Percent of time over the process’s lifetime during which global (device) memory was being read or written.

Set to DCGM_INT32_NOT_SUPPORTED if not supported.

unsigned long long maxMemoryUsage

Maximum total memory in bytes that was ever allocated by the process.

Set to DCGM_INT64_NOT_SUPPORTED if not supported.

unsigned long long startTimestamp

CPU Timestamp in usec representing start time for the process.

unsigned long long activeTimeUsec

Amount of time in usec during which the compute context was active.

Note that this does not mean the context was being used. endTimestamp can be computed as startTimestamp + activeTimeUsec.
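
The end time of a process is not stored directly, so it has to be derived from the two timing members. A minimal sketch of that arithmetic follows, using a hypothetical stand-in struct (exampleAccountingTimes) that carries only the two members involved; both values are in microseconds.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in holding only the timing members of the accounting struct. */
typedef struct
{
    unsigned long long startTimestamp; /* usec */
    unsigned long long activeTimeUsec; /* usec */
} exampleAccountingTimes;

/* End time in usec: start time plus the time the compute context was active. */
static unsigned long long exampleEndTimestampUsec(const exampleAccountingTimes *s)
{
    assert(s != NULL);
    return s->startTimestamp + s->activeTimeUsec;
}
```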

struct dcgmDeviceThermals_v1
#include <dcgm_structs.h>

Represents thermal information.

Public Members

unsigned int version

Version Number.

unsigned int slowdownTemp

Slowdown temperature.

unsigned int shutdownTemp

Shutdown temperature.

struct dcgmDevicePowerLimits_v1
#include <dcgm_structs.h>

Represents various power limits.

Public Members

unsigned int version

Version Number.

unsigned int curPowerLimit

Power management limit associated with this device (in W)

unsigned int defaultPowerLimit

Power management limit effective at device boot (in W)

unsigned int enforcedPowerLimit

Effective power limit that the driver enforces after taking into account all limiters (in W)

unsigned int minPowerLimit

Minimum power management limit (in W)

unsigned int maxPowerLimit

Maximum power management limit (in W)
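
One possible use of the minimum and maximum limits is to sanity-check a requested power cap before applying it. The sketch below clamps a request into the reported range; examplePowerRange and exampleClampPowerLimit are hypothetical names, and the sketch does not handle blank or not-supported sentinel values.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in holding only the range members of the power limits struct. */
typedef struct
{
    unsigned int minPowerLimit; /* W */
    unsigned int maxPowerLimit; /* W */
} examplePowerRange;

/* Return the requested limit, pulled into [minPowerLimit, maxPowerLimit]. */
static unsigned int exampleClampPowerLimit(const examplePowerRange *r,
                                           unsigned int requestedWatts)
{
    assert(r != NULL && r->minPowerLimit <= r->maxPowerLimit);
    if (requestedWatts < r->minPowerLimit)
        return r->minPowerLimit;
    if (requestedWatts > r->maxPowerLimit)
        return r->maxPowerLimit;
    return requestedWatts;
}
```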

struct dcgmDeviceIdentifiers_v1
#include <dcgm_structs.h>

Represents device identifiers.

Public Members

unsigned int version

Version Number (dcgmDeviceIdentifiers_version)

char brandName[256]

Brand Name.

char deviceName[256]

Name of the device.

char pciBusId[256]

PCI Bus ID.

char serial[256]

Serial for the device.

char uuid[256]

UUID for the device.

char vbios[256]

VBIOS version.

char inforomImageVersion[256]

Inforom Image version.

unsigned int pciDeviceId

The combined 16-bit device id and 16-bit vendor id.

unsigned int pciSubSystemId

The 32-bit Sub System Device ID.

char driverVersion[256]

Driver Version.

unsigned int virtualizationMode

Virtualization Mode.

struct dcgmDeviceMemoryUsage_v1
#include <dcgm_structs.h>

Represents device memory and usage.

Public Members

unsigned int version

Version Number (dcgmDeviceMemoryUsage_version)

unsigned int bar1Total

Total BAR1 size in megabytes.

unsigned int fbTotal

Total framebuffer memory in megabytes.

unsigned int fbUsed

Used framebuffer memory in megabytes.

unsigned int fbFree

Free framebuffer memory in megabytes.

struct dcgmDeviceVgpuUtilInfo_v1
#include <dcgm_structs.h>

Represents utilization values for vGPUs running on the device.

Public Members

unsigned int version

Version Number (dcgmDeviceVgpuUtilInfo_version)

unsigned int vgpuId

vGPU instance ID

unsigned int smUtil

GPU utilization for vGPU.

unsigned int memUtil

Memory utilization for vGPU.

unsigned int encUtil

Encoder utilization for vGPU.

unsigned int decUtil

Decoder utilization for vGPU.

struct dcgmDeviceEncStats_v1
#include <dcgm_structs.h>

Represents current encoder statistics for the given device/vGPU instance.

Public Members

unsigned int version

Version Number (dcgmDeviceEncStats_version)

unsigned int sessionCount

Count of active encoder sessions.

unsigned int averageFps

Trailing average FPS of all active sessions.

unsigned int averageLatency

Encode latency in milliseconds.

struct dcgmDeviceFbcStats_v1
#include <dcgm_structs.h>

Represents current frame buffer capture sessions statistics for the given device/vGPU instance.

Public Members

unsigned int version

Version Number (dcgmDeviceFbcStats_version)

unsigned int sessionCount

Count of active FBC sessions.

unsigned int averageFps

Moving average new frames captured per second.

unsigned int averageLatency

Moving average new frame capture latency in microseconds.

struct dcgmDeviceFbcSessionInfo_v1
#include <dcgm_structs.h>

Represents information about active FBC session on the given device/vGPU instance.

Public Members

unsigned int version

Version Number (dcgmDeviceFbcSessionInfo_version)

unsigned int sessionId

Unique session ID.

unsigned int pid

Owning process ID.

unsigned int vgpuId

vGPU instance ID (only valid on vGPU hosts, otherwise zero)

unsigned int displayOrdinal

Display identifier.

dcgmFBCSessionType_t sessionType

Type of frame buffer capture session.

unsigned int sessionFlags

Session flags.

unsigned int hMaxResolution

Max horizontal resolution supported by the capture session.

unsigned int vMaxResolution

Max vertical resolution supported by the capture session.

unsigned int hResolution

Horizontal resolution requested by caller in capture call.

unsigned int vResolution

Vertical resolution requested by caller in capture call.

unsigned int averageFps

Moving average new frames captured per second.

unsigned int averageLatency

Moving average new frame capture latency in microseconds.

struct dcgmDeviceFbcSessions_v1
#include <dcgm_structs.h>

Represents all the active FBC sessions on the given device/vGPU instance.

Public Members

unsigned int version

Version Number (dcgmDeviceFbcSessions_version)

unsigned int sessionCount

Count of active FBC sessions.

dcgmDeviceFbcSessionInfo_t sessionInfo[256]

Info about the active FBC session.

struct dcgmDeviceVgpuEncSessions_v1
#include <dcgm_structs.h>

Represents information about active encoder sessions on the given vGPU instance.

Public Members

unsigned int version

Version Number (dcgmDeviceVgpuEncSessions_version)

unsigned int vgpuId

vGPU instance ID

unsigned int sessionId

Unique session ID.

unsigned int pid

Process ID.

dcgmEncoderType_t codecType

Video encoder type.

unsigned int hResolution

Current encode horizontal resolution.

unsigned int vResolution

Current encode vertical resolution.

unsigned int averageFps

Moving average encode frames per second.

unsigned int averageLatency

Moving average encode latency in milliseconds.

struct dcgmDeviceVgpuProcessUtilInfo_v1
#include <dcgm_structs.h>

Represents utilization values for processes running in vGPU VMs using the device.

Public Members

unsigned int version

Version Number (dcgmDeviceVgpuProcessUtilInfo_version)

unsigned int vgpuId

vGPU instance ID

unsigned int vgpuProcessSamplesCount

Count of processes running in the vGPU VM for which utilization rates are being reported in this cycle.

unsigned int pid

Process ID of the process running in the vGPU VM.

char processName[64]

Process Name of process running in the vGPU VM.

unsigned int smUtil

GPU utilization of process running in the vGPU VM.

unsigned int memUtil

Memory utilization of process running in the vGPU VM.

unsigned int encUtil

Encoder utilization of process running in the vGPU VM.

unsigned int decUtil

Decoder utilization of process running in the vGPU VM.

struct dcgmDeviceVgpuTypeInfo_v1
#include <dcgm_structs.h>

Represents static info related to vGPUs supported on the device.

Public Members

unsigned int version

Version number (dcgmDeviceVgpuTypeInfo_version)

union dcgmDeviceVgpuTypeInfo_v1::[anonymous] vgpuTypeInfo

vGPU type ID and Supported vGPU type count

char vgpuTypeName[64]

vGPU type Name

char vgpuTypeClass[64]

Class of vGPU type.

char vgpuTypeLicense[128]

license of vGPU type

int deviceId

device ID of vGPU type

int subsystemId

Subsystem ID of vGPU type.

int numDisplayHeads

Count of vGPU’s supported display heads.

int maxInstances

maximum number of vGPU instances creatable on a device for given vGPU type

int frameRateLimit

Frame rate limit value of the vGPU type.

int maxResolutionX

vGPU display head’s maximum supported resolution in X dimension

int maxResolutionY

vGPU display head’s maximum supported resolution in Y dimension

int fbTotal

vGPU Total framebuffer size in megabytes

struct dcgmDeviceVgpuTypeInfo_v2

Public Members

unsigned int version

Version number (dcgmDeviceVgpuTypeInfo_version2)

union dcgmDeviceVgpuTypeInfo_v2::[anonymous] vgpuTypeInfo

vGPU type ID and Supported vGPU type count

char vgpuTypeName[64]

vGPU type Name

char vgpuTypeClass[64]

Class of vGPU type.

char vgpuTypeLicense[128]

license of vGPU type

int deviceId

device ID of vGPU type

int subsystemId

Subsystem ID of vGPU type.

int numDisplayHeads

Count of vGPU’s supported display heads.

int maxInstances

maximum number of vGPU instances creatable on a device for given vGPU type

int frameRateLimit

Frame rate limit value of the vGPU type.

int maxResolutionX

vGPU display head’s maximum supported resolution in X dimension

int maxResolutionY

vGPU display head’s maximum supported resolution in Y dimension

int fbTotal

vGPU Total framebuffer size in megabytes

int gpuInstanceProfileId

GPU Instance Profile ID for the given vGPU type.

struct dcgmDeviceSettings_v1
struct dcgmDeviceSettings_v2
struct dcgmDeviceAttributes_v1
#include <dcgm_structs.h>

Represents attributes corresponding to a device.

Public Members

unsigned int version

Version number (dcgmDeviceAttributes_version)

dcgmDeviceSupportedClockSets_t clockSets

Supported clocks for the device.

dcgmDeviceThermals_t thermalSettings

Thermal settings for the device.

dcgmDevicePowerLimits_t powerLimits

Various power limits for the device.

dcgmDeviceIdentifiers_t identifiers

Identifiers for the device.

dcgmDeviceMemoryUsage_t memoryUsage

Memory usage info for the device.

char unused[208]

Unused Space. Set to 0 for now.

struct dcgmDeviceAttributes_v2

Public Members

unsigned int version

Version number (dcgmDeviceAttributes_version)

dcgmDeviceSupportedClockSets_t clockSets

Supported clocks for the device.

dcgmDeviceThermals_t thermalSettings

Thermal settings for the device.

dcgmDevicePowerLimits_t powerLimits

Various power limits for the device.

dcgmDeviceIdentifiers_t identifiers

Identifiers for the device.

dcgmDeviceMemoryUsage_t memoryUsage

Memory usage info for the device.

dcgmDeviceSettings_v1 settings

Basic device settings.

struct dcgmDeviceAttributes_v3

Public Members

unsigned int version

Version number (dcgmDeviceAttributes_version)

dcgmDeviceSupportedClockSets_t clockSets

Supported clocks for the device.

dcgmDeviceThermals_t thermalSettings

Thermal settings for the device.

dcgmDevicePowerLimits_t powerLimits

Various power limits for the device.

dcgmDeviceIdentifiers_t identifiers

Identifiers for the device.

dcgmDeviceMemoryUsage_t memoryUsage

Memory usage info for the device.

dcgmDeviceSettings_v2 settings

Basic device settings.

struct dcgmConfigPerfStateSettings_t
#include <dcgm_structs.h>

Used to represent Performance state settings.

Public Members

unsigned int syncBoost

Sync Boost Mode (0: Disabled, 1 : Enabled, DCGM_INT32_BLANK : Ignored).

Note that using this setting may result in lower clocks than targetClocks

dcgmClockSet_t targetClocks

Target clocks.

Set smClock and memClock to DCGM_INT32_BLANK to ignore/use compatible values. For GPUs > Maxwell, setting this implies autoBoost=0

struct dcgmConfigPowerLimit_t
#include <dcgm_structs.h>

Used to represent the power capping limit for each GPU in the group or to represent the power budget for the entire group.

Public Members

dcgmConfigPowerLimitType_t type

Flag to represent power cap for each GPU or power budget for the group of GPUs.

unsigned int val

Power Limit in Watts (Set a value OR DCGM_INT32_BLANK to Ignore)

struct dcgmConfig_v1
#include <dcgm_structs.h>

Structure to represent default and target configuration for a device.

Public Members

unsigned int version

Version number (dcgmConfig_version)

unsigned int gpuId

GPU ID.

unsigned int eccMode

ECC Mode (0: Disabled, 1 : Enabled, DCGM_INT32_BLANK : Ignored)

unsigned int computeMode

Compute Mode (One of DCGM_CONFIG_COMPUTEMODE_? OR DCGM_INT32_BLANK to Ignore)

dcgmConfigPerfStateSettings_t perfState

Performance State Settings (clocks / boost mode)

dcgmConfigPowerLimit_t powerLimit

Power Limits.

struct dcgmPolicyViolation_v1

Public Members

unsigned int version

Version number (dcgmPolicyViolation_version)

unsigned int notifyOnEccDbe

true/false notification on ECC Double Bit Errors

unsigned int notifyOnPciEvent

true/false notification on PCI Events

unsigned int notifyOnMaxRetiredPages

number of retired pages to occur before notification

struct dcgmPolicyConditionParams_st
#include <dcgm_structs.h>

Structure for policy condition parameters.

This structure contains a tag that represents the type of the value being passed as well as a “val” which is a union of the possible value types. For example, to pass a true boolean: tag = BOOL, val.boolean = 1.
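
The tag-plus-union layout described above can be sketched as follows. This is a stand-in type, not the real definition: the member names boolean and val come from the description, while the second tag, the llval member, and all example-prefixed names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in tagged union: `tag` says which member of `val` is meaningful. */
typedef struct
{
    enum
    {
        EXAMPLE_TAG_BOOL,
        EXAMPLE_TAG_LLONG
    } tag;
    union
    {
        unsigned int boolean;
        long long llval; /* assumed second value type */
    } val;
} examplePolicyParam;

/* Build a boolean parameter: tag = BOOL, val.boolean = 0 or 1. */
static examplePolicyParam exampleBoolParam(int enabled)
{
    examplePolicyParam p;
    p.tag         = EXAMPLE_TAG_BOOL;
    p.val.boolean = enabled ? 1u : 0u;
    return p;
}

/* Build an integer parameter, for example a numeric threshold. */
static examplePolicyParam exampleLlongParam(long long value)
{
    examplePolicyParam p;
    p.tag       = EXAMPLE_TAG_LLONG;
    p.val.llval = value;
    return p;
}

/* Read the value back, checking the tag before touching the union. */
static long long exampleParamAsLlong(const examplePolicyParam *p)
{
    assert(p != NULL);
    return p->tag == EXAMPLE_TAG_BOOL ? (long long)p->val.boolean : p->val.llval;
}
```

Reading the union only after checking the tag avoids interpreting a member that was never written.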

struct dcgmPolicyViolationNotify_t
#include <dcgm_structs.h>

Structure to fill when a user queries for policy violations.

Public Members

unsigned int gpuId

gpu ID

unsigned int violationOccurred

a violation based on the bit values in dcgmPolicyCondition_t

struct dcgmPolicy_v1
#include <dcgm_structs.h>

Define the structure that specifies a policy to be enforced for a GPU.

Public Members

unsigned int version

version number (dcgmPolicy_version)

dcgmPolicyCondition_t condition

Condition(s) to access dcgmPolicyCondition_t.

dcgmPolicyMode_t mode

Mode of operation dcgmPolicyMode_t.

dcgmPolicyIsolation_t isolation

Isolation level after a policy violation dcgmPolicyIsolation_t.

dcgmPolicyAction_t action

Action to perform after a policy violation dcgmPolicyAction_t action.

dcgmPolicyValidation_t validation

Validation to perform after action is taken dcgmPolicyValidation_t.

dcgmPolicyFailureResp_t response

Response to a validation failure dcgmPolicyFailureResp_t.

dcgmPolicyConditionParams_t parms[7]

Parameters for the condition fields.

struct dcgmPolicyConditionDbe_t
#include <dcgm_structs.h>

Define the ECC DBE return structure.

Public Members

long long timestamp

timestamp of the error

enum dcgmPolicyConditionDbe_t::[anonymous] location

location of the error

unsigned int numerrors

number of errors

struct dcgmPolicyConditionPci_t
#include <dcgm_structs.h>

Define the PCI replay error return structure.

Public Members

long long timestamp

timestamp of the error

unsigned int counter

value of the PCIe replay counter

struct dcgmPolicyConditionMpr_t
#include <dcgm_structs.h>

Define the maximum pending retired pages limit return structure.

Public Members

long long timestamp

timestamp of the error

unsigned int sbepages

number of pending pages due to SBE

unsigned int dbepages

number of pending pages due to DBE

struct dcgmPolicyConditionThermal_t
#include <dcgm_structs.h>

Define the thermal policy violations return structure.

Public Members

long long timestamp

timestamp of the error

unsigned int thermalViolation

Temperature reached that violated policy.

struct dcgmPolicyConditionPower_t
#include <dcgm_structs.h>

Define the power policy violations return structure.

Public Members

long long timestamp

timestamp of the error

unsigned int powerViolation

Power value reached that violated policy.

struct dcgmPolicyConditionNvlink_t
#include <dcgm_structs.h>

Define the nvlink policy violations return structure.

Public Members

timestamp of the error

Nvlink counter field ID that violated policy.

Nvlink counter value that violated policy.

struct dcgmPolicyConditionXID_t
#include <dcgm_structs.h>

Define the xid policy violations return structure.

Public Members

long long timestamp

Timestamp of the error.

unsigned int errnum

The XID error number.

struct dcgmPolicyCallbackResponse_v1
#include <dcgm_structs.h>

Define the structure that is given to the callback function.

Public Members

unsigned int version

version number (dcgmPolicyCallbackResponse_version)

dcgmPolicyCondition_t condition

Condition that was violated.

dcgmPolicyConditionDbe_t dbe

ECC DBE return structure.

dcgmPolicyConditionPci_t pci

PCI replay error return structure.

dcgmPolicyConditionMpr_t mpr

Max retired pages limit return structure.

dcgmPolicyConditionThermal_t thermal

Thermal policy violations return structure.

dcgmPolicyConditionPower_t power

Power policy violations return structure.

dcgmPolicyConditionNvlink_t nvlink

Nvlink policy violations return structure.

dcgmPolicyConditionXID_t xid

XID policy violations return structure.

struct dcgmFieldValue_v1
#include <dcgm_structs.h>

This structure is used to represent value for the field to be queried.

Public Members

unsigned int version

version number (dcgmFieldValue_version1)

unsigned short fieldId

One of DCGM_FI_?

unsigned short fieldType

One of DCGM_FT_?

int status

Status of querying the field. DCGM_ST_OK or one of DCGM_ST_?

int64_t ts

Timestamp in usec since 1970.

int64_t i64

Int64 value.

double dbl

Double value.

char str[256]

NULL terminated string.

char blob[4096]

Binary blob.

union dcgmFieldValue_v1::[anonymous] value

Value.

struct dcgmFieldValue_v2
#include <dcgm_structs.h>

This structure is used to represent value for the field to be queried.

Public Members

unsigned int version

version number (dcgmFieldValue_version2)

dcgm_field_entity_group_t entityGroupId

Entity group this field value’s entity belongs to.

dcgm_field_eid_t entityId

Entity this field value belongs to.

unsigned short fieldId

One of DCGM_FI_?

unsigned short fieldType

One of DCGM_FT_?

int status

Status of querying the field. DCGM_ST_OK or one of DCGM_ST_?

unsigned int unused

Unused for now to align ts to an 8-byte boundary.

int64_t ts

Timestamp in usec since 1970.

int64_t i64

Int64 value.

double dbl

Double value.

char str[256]

NULL terminated string.

char blob[4096]

Binary blob.

union dcgmFieldValue_v2::[anonymous] value

Value.
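
Since value is a union, a reader has to consult status and fieldType before choosing which member to interpret. The sketch below shows one way to do that on a reduced stand-in struct. The type codes ('i', 'd', 's') and the use of 0 for a successful status are assumptions in this sketch, and exampleFieldValue and exampleFormatValue are hypothetical names; real code should use the DCGM_FT_? and DCGM_ST_? constants from the headers.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed field type codes standing in for DCGM_FT_? */
#define EXAMPLE_FT_INT64  'i'
#define EXAMPLE_FT_DOUBLE 'd'
#define EXAMPLE_FT_STRING 's'

/* Reduced stand-in for the field value struct. */
typedef struct
{
    unsigned short fieldType;
    int status; /* 0 is taken to mean success in this sketch */
    union
    {
        int64_t i64;
        double dbl;
        char str[256];
    } value;
} exampleFieldValue;

/* Format the active union member into buf. Returns 0 on success and -1 if
 * the status is not OK or the field type is not handled here. */
static int exampleFormatValue(const exampleFieldValue *fv, char *buf, size_t len)
{
    assert(fv != NULL && buf != NULL);
    if (fv->status != 0)
        return -1;
    switch (fv->fieldType)
    {
        case EXAMPLE_FT_INT64:
            snprintf(buf, len, "%lld", (long long)fv->value.i64);
            return 0;
        case EXAMPLE_FT_DOUBLE:
            snprintf(buf, len, "%.2f", fv->value.dbl);
            return 0;
        case EXAMPLE_FT_STRING:
            snprintf(buf, len, "%s", fv->value.str);
            return 0;
        default:
            return -1;
    }
}
```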

struct dcgmStatSummaryInt64_t
#include <dcgm_structs.h>

Summary of time series data in int64 format.

Each value will either be set or be a BLANK value. Check for blank with the DCGM_INT64_IS_BLANK() macro.

See also

See dcgmvalue.h for the actual values of BLANK values

Public Members

long long minValue

Minimum value of the samples looked at.

long long maxValue

Maximum value of the samples looked at.

long long average

Simple average of the samples looked at. Blank values are ignored for this calculation.
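
To make the blank-handling rule concrete, the sketch below computes a min/max/average summary over a sample array while skipping blank samples. The sentinel value and the "greater than or equal" blank test are assumptions in this sketch (the real definitions are in dcgmvalue.h), skipping blanks for min and max as well as for the average is also an assumption, and all example-prefixed names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sentinel: values at or above this are treated as blank. */
#define EXAMPLE_INT64_BLANK 0x7ffffffffffffff0LL
#define EXAMPLE_INT64_IS_BLANK(v) ((v) >= EXAMPLE_INT64_BLANK)

typedef struct
{
    long long minValue;
    long long maxValue;
    long long average;
} exampleSummaryInt64;

/* Summarize samples, ignoring blanks. If every sample is blank (or n is 0),
 * all three members stay blank. The running sum is not overflow-checked. */
static exampleSummaryInt64 exampleSummarize(const long long *samples, size_t n)
{
    exampleSummaryInt64 s = { EXAMPLE_INT64_BLANK, EXAMPLE_INT64_BLANK, EXAMPLE_INT64_BLANK };
    long long sum = 0;
    size_t used   = 0;

    assert(samples != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        long long v = samples[i];
        if (EXAMPLE_INT64_IS_BLANK(v))
            continue;
        if (used == 0 || v < s.minValue)
            s.minValue = v;
        if (used == 0 || v > s.maxValue)
            s.maxValue = v;
        sum += v;
        used++;
    }
    if (used > 0)
        s.average = sum / (long long)used;
    return s;
}
```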

struct dcgmStatSummaryInt32_t
#include <dcgm_structs.h>

Same as dcgmStatSummaryInt64_t, but with 32-bit integer values.

Public Members

int minValue

Minimum value of the samples looked at.

int maxValue

Maximum value of the samples looked at.

int average

Simple average of the samples looked at. Blank values are ignored for this calculation.

struct dcgmStatSummaryFp64_t
#include <dcgm_structs.h>

Summary of time series data in double-precision format.

Each value will either be set or be a BLANK value. Check for blank with the DCGM_FP64_IS_BLANK() macro.

See also

See dcgmvalue.h for the actual values of BLANK values

Public Members

double minValue

Minimum value of the samples looked at.

double maxValue

Maximum value of the samples looked at.

double average

Simple average of the samples looked at. Blank values are ignored for this calculation.

struct dcgmDiagErrorDetail_t
struct dcgmIncidentInfo_t

Public Members

dcgmHealthSystems_t system

system to which this information belongs

dcgmHealthWatchResults_t health

health diagnosis of this incident

dcgmDiagErrorDetail_t error

Information about the error(s) and their error codes.

dcgmGroupEntityPair_t entityInfo

identify which entity has this error

struct dcgmHealthResponse_v4
#include <dcgm_structs.h>

Health response structure version 4 - Simply list the incidents instead of reporting by entity.

Since DCGM 2.0

Public Members

unsigned int version

The version number of this struct.

dcgmHealthWatchResults_t overallHealth

The overall health of this entire host.

unsigned int incidentCount

The number of health incidents reported in this struct.

dcgmIncidentInfo_t incidents[64]

Report of the errors detected.

struct dcgmHealthSetParams_v2
#include <dcgm_structs.h>

Structure used to set health watches via the dcgmHealthSet_v2 API.

Public Members

unsigned int version

Version of this struct. Should be dcgmHealthSet_version2

dcgmGpuGrp_t groupId

Group ID representing collection of one or more entities. Look at dcgmGroupCreate for details on creating the group. Alternatively, pass in the group id as DCGM_GROUP_ALL_GPUS to perform operation on all the GPUs or DCGM_GROUP_ALL_NVSWITCHES to perform operation on all the NvSwitches.

dcgmHealthSystems_t systems

An enum representing systems that should be enabled for health checks logically OR’d together. Refer to dcgmHealthSystems_t for details.

long long updateInterval

How often to query the underlying health information from the NVIDIA driver in usec. This should be the same as how often you call dcgmHealthCheck

double maxKeepAge

How long to keep data cached for this field in seconds. This should be at least your maximum time between calling dcgmHealthCheck
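
The two timing members use different units (microseconds for updateInterval, seconds for maxKeepAge), which is easy to get wrong. The sketch below derives both from a single planned health-check period. The stand-in struct exampleHealthTiming, the helper name, and the factor of two used as a keep-age margin are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in holding only the timing members of the health set params. */
typedef struct
{
    long long updateInterval; /* usec */
    double maxKeepAge;        /* seconds */
} exampleHealthTiming;

/* Derive both members from how often the caller plans to run a health check. */
static exampleHealthTiming exampleTimingForCheckPeriod(double checkPeriodSeconds)
{
    exampleHealthTiming t;
    assert(checkPeriodSeconds > 0.0);
    /* Query the driver as often as the health check is called. */
    t.updateInterval = (long long)(checkPeriodSeconds * 1000000.0);
    /* Keep data for at least one full period; 2x is an assumed margin. */
    t.maxKeepAge = checkPeriodSeconds * 2.0;
    return t;
}
```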

struct dcgmProcessUtilInfo_t
#include <dcgm_structs.h>

per process utilization rates

struct dcgmProcessUtilSample_t
#include <dcgm_structs.h>

Internal structure used to get the PID and the corresponding utilization rate.

struct dcgmPidSingleInfo_t
#include <dcgm_structs.h>

Info corresponding to single PID.

Public Members

unsigned int gpuId

ID of the GPU this pertains to. GPU_ID_INVALID = summary information for multiple GPUs.

long long energyConsumed

Energy consumed by the GPU in milliwatt-seconds.

dcgmStatSummaryInt64_t pcieRxBandwidth

PCI-E bytes read from the GPU.

dcgmStatSummaryInt64_t pcieTxBandwidth

PCI-E bytes written to the GPU.

long long pcieReplays

Count of PCI-E replays that occurred.

long long startTime

Process start time in microseconds since 1970.

long long endTime

Process end time in microseconds since 1970 or reported as 0 if the process is not completed.

dcgmProcessUtilInfo_t processUtilization

Process SM and Memory Utilization (in percent)

dcgmStatSummaryInt32_t smUtilization

GPU SM Utilization in percent.

dcgmStatSummaryInt32_t memoryUtilization

GPU Memory Utilization in percent.

unsigned int eccSingleBit

Deprecated - Count of ECC single bit errors that occurred.

unsigned int eccDoubleBit

Count of ECC double bit errors that occurred.

dcgmStatSummaryInt32_t memoryClock

Memory clock in MHz.

dcgmStatSummaryInt32_t smClock

SM clock in MHz.

int numXidCriticalErrors

Number of valid entries in xidCriticalErrorsTs.

long long xidCriticalErrorsTs[10]

Timestamps of the critical XID errors that occurred.

int numOtherComputePids

Count of otherComputePids entries that are valid.

unsigned int otherComputePids[16]

Other compute processes that ran. 0=no process.

int numOtherGraphicsPids

Count of otherGraphicsPids entries that are valid.

unsigned int otherGraphicsPids[16]

Other graphics processes that ran. 0=no process.

long long maxGpuMemoryUsed

Maximum amount of GPU memory that was used in bytes.

long long powerViolationTime

Number of microseconds we were at reduced clocks due to power violation.

long long thermalViolationTime

Number of microseconds we were at reduced clocks due to thermal violation.

long long reliabilityViolationTime

Amount of microseconds we were at reduced clocks due to the reliability limit.

long long boardLimitViolationTime

Amount of microseconds we were at reduced clocks due to being at the board’s max voltage.

long long lowUtilizationTime

Amount of microseconds we were at reduced clocks due to low utilization.

long long syncBoostTime

Amount of microseconds we were at reduced clocks due to sync boost.

dcgmHealthWatchResults_t overallHealth

The overall health of the system. dcgmHealthWatchResults_t.

dcgmHealthSystems_t system

system to which this information belongs

dcgmHealthWatchResults_t health

health of the specified system on this GPU

struct dcgmPidInfo_v2
#include <dcgm_structs.h>

To store process statistics.

Public Members

unsigned int version

Version of this message (dcgmPidInfo_version)

unsigned int pid

PID of the process.

int numGpus

Number of GPUs that are valid in gpus[].

dcgmPidSingleInfo_t summary

Summary information for all GPUs listed in gpus[].

dcgmPidSingleInfo_t gpus[32]

Per-GPU information for this PID.

struct dcgmGpuUsageInfo_t
#include <dcgm_structs.h>

Info corresponding to the job on a GPU.

Public Members

unsigned int gpuId

ID of the GPU this pertains to. GPU_ID_INVALID = summary information for multiple GPUs.

long long energyConsumed

Energy consumed in milliwatt-seconds.

dcgmStatSummaryFp64_t powerUsage

Power usage Min/Max/Avg in watts.

dcgmStatSummaryInt64_t pcieRxBandwidth

PCI-E bytes read from the GPU.

dcgmStatSummaryInt64_t pcieTxBandwidth

PCI-E bytes written to the GPU.

long long pcieReplays

Count of PCI-E replays that occurred.

long long startTime

User provided job start time in microseconds since 1970.

long long endTime

User provided job end time in microseconds since 1970.

dcgmStatSummaryInt32_t smUtilization

GPU SM Utilization in percent.

dcgmStatSummaryInt32_t memoryUtilization

GPU Memory Utilization in percent.

unsigned int eccSingleBit

Deprecated - Count of ECC single bit errors that occurred.

unsigned int eccDoubleBit

Count of ECC double bit errors that occurred.

dcgmStatSummaryInt32_t memoryClock

Memory clock in MHz.

dcgmStatSummaryInt32_t smClock

SM clock in MHz.

int numXidCriticalErrors

Number of valid entries in xidCriticalErrorsTs.

long long xidCriticalErrorsTs[10]

Timestamps of the critical XID errors that occurred.

int numComputePids

Count of computePidInfo entries that are valid.

dcgmProcessUtilInfo_t computePidInfo[16]

List of compute processes that ran during the job.

0=no process

int numGraphicsPids

Count of graphicsPidInfo entries that are valid.

dcgmProcessUtilInfo_t graphicsPidInfo[16]

List of graphics processes that ran during the job.

0=no process

long long maxGpuMemoryUsed

Maximum amount of GPU memory that was used in bytes.

long long powerViolationTime

Number of microseconds we were at reduced clocks due to power violation.

long long thermalViolationTime

Number of microseconds we were at reduced clocks due to thermal violation.

long long reliabilityViolationTime

Number of microseconds we were at reduced clocks due to the reliability limit.

long long boardLimitViolationTime

Number of microseconds we were at reduced clocks due to being at the board’s max voltage.

long long lowUtilizationTime

Number of microseconds we were at reduced clocks due to low utilization.

long long syncBoostTime

Number of microseconds we were at reduced clocks due to sync boost.

dcgmHealthWatchResults_t overallHealth

The overall health of the system. dcgmHealthWatchResults_t.

dcgmHealthSystems_t system

system to which this information belongs

dcgmHealthWatchResults_t health

health of the specified system on this GPU

struct dcgmJobInfo_v3
#include <dcgm_structs.h>

To store job statistics. The following fields are not applicable in the summary info:

  • pcieRxBandwidth (Min/Max)

  • pcieTxBandwidth (Min/Max)

  • smUtilization (Min/Max)

  • memoryUtilization (Min/Max)

  • memoryClock (Min/Max)

  • smClock (Min/Max)

  • processSamples

In the summary, the average value of each field above is the average of the per-GPU averages of that field.

Public Members

unsigned int version

Version of this message (dcgmJobInfo_version)

int numGpus

Number of GPUs that are valid in gpus[].

dcgmGpuUsageInfo_t summary

Summary information for all GPUs listed in gpus[].

dcgmGpuUsageInfo_t gpus[32]

Per-GPU information for this job.

struct dcgmRunningProcess_v1
#include <dcgm_structs.h>

Running process information for a compute or graphics process.

Public Members

unsigned int version

Version of this message (dcgmRunningProcess_version)

unsigned int pid

PID of the process.

unsigned long long memoryUsed

GPU memory used by this process in bytes.

struct dcgmDiagTestResult_v1

Public Members

dcgmDiagResult_t status

The result of the test.

char warning[1024]

Warning returned from the test, if any.

char info[1024]

Information details returned from the test, if any.

struct dcgmDiagTestResult_v2

Public Members

dcgmDiagResult_t status

The result of the test.

dcgmDiagErrorDetail_t error

The error message and error code, if any.

char info[1024]

Information details returned from the test, if any.

struct dcgmDiagResponsePerGpu_v2
#include <dcgm_structs.h>

Per GPU diagnostics result structure.

Public Members

unsigned int gpuId

ID of the GPU this information pertains to.

unsigned int hwDiagnosticReturn

Per GPU hardware diagnostic test return code.

dcgmDiagTestResult_v2 results[7]

Array with a result for each per-gpu test.

struct dcgmDiagResponsePerGpu_v3
#include <dcgm_structs.h>

Per gpu response structure v3.

Since DCGM 2.4

Public Members

unsigned int gpuId

ID of the GPU this information pertains to.

unsigned int hwDiagnosticReturn

Per GPU hardware diagnostic test return code.

dcgmDiagTestResult_v2 results[9]

Array with a result for each per-gpu test.

struct dcgmDiagResponse_v6
#include <dcgm_structs.h>

Global diagnostics result structure v6.

Since DCGM 2.0

Public Members

unsigned int version

version number (dcgmDiagResponse_version)

unsigned int gpuCount

number of valid per GPU results

unsigned int levelOneTestCount

number of valid levelOne results

dcgmDiagTestResult_v2 levelOneResults[16]

Basic, system-wide test results.

dcgmDiagResponsePerGpu_v2 perGpuResponses[32]

per GPU test results

dcgmDiagErrorDetail_t systemError

System-wide error reported from NVVS.

char trainingMsg[1024]

Training Message.

struct dcgmDiagResponse_v7
#include <dcgm_structs.h>

Global diagnostics result structure v7.

Since DCGM 2.4

Public Members

unsigned int version

version number (dcgmDiagResponse_version)

unsigned int gpuCount

number of valid per GPU results

unsigned int levelOneTestCount

number of valid levelOne results

dcgmDiagTestResult_v2 levelOneResults[16]

Basic, system-wide test results.

dcgmDiagResponsePerGpu_v3 perGpuResponses[32]

per GPU test results

dcgmDiagErrorDetail_t systemError

System-wide error reported from NVVS.

char trainingMsg[1024]

Training Message.

struct dcgmDeviceTopology_v1
#include <dcgm_structs.h>

Device topology information.

Public Members

unsigned int version

version number (dcgmDeviceTopology_version)

unsigned long cpuAffinityMask[8]

affinity mask for the specified GPU

A 1 represents affinity to the CPU in that bit position. Supports up to 256 cores.

unsigned int numGpus

number of valid entries in gpuPaths

unsigned int gpuId

gpuId that this path leads to

dcgmGpuTopologyLevel_t path

path to the gpuId from this GPU.

Note that this is a bit-mask of DCGM_TOPOLOGY_* values and can contain both PCIe topology and NvLink topology where applicable. For instance: 0x210 = DCGM_TOPOLOGY_CPU | DCGM_TOPOLOGY_NVLINK2. Use the macros DCGM_TOPOLOGY_PATH_NVLINK and DCGM_TOPOLOGY_PATH_PCI to mask the NvLink and PCI paths, respectively.

unsigned int localNvLinkIds

Bits representing the local links connected to gpuId. For example, if this field == 3, links 0 and 1 are connected. This field is only valid if NvLinks actually exist between the GPUs.

struct dcgmGroupTopology_v1
#include <dcgm_structs.h>

Group topology information.

Public Members

unsigned int version

version number (dcgmGroupTopology_version)

unsigned long groupCpuAffinityMask[8]

the CPU affinity mask for all GPUs in the group

A 1 represents affinity to the CPU in that bit position. Supports up to 256 cores.

unsigned int numaOptimalFlag

a zero value indicates that 1 or more GPUs in the group have a different CPU affinity and thus may not be optimal for certain algorithms

dcgmGpuTopologyLevel_t slowestPath

the slowest path amongst GPUs in the group

struct dcgmIntrospectContext_v1
#include <dcgm_structs.h>

Identifies the retrieval context for introspection API calls.

Public Members

unsigned int version

version number (dcgmIntrospectContext_version)

dcgmIntrospectLevel_t introspectLvl

Introspect Level dcgmIntrospectLevel_t.

dcgmGpuGrp_t fieldGroupId

Only needed if introspectLvl is DCGM_INTROSPECT_LVL_FIELD_GROUP.

unsigned short fieldId

Only needed if introspectLvl is DCGM_INTROSPECT_LVL_FIELD.

unsigned long long contextId

Overloaded way to access both fieldGroupId and fieldId.

struct dcgmIntrospectFieldsExecTime_v1
#include <dcgm_structs.h>

DCGM Execution time info for a set of fields.

Public Members

unsigned int version

version number (dcgmIntrospectFieldsExecTime_version)

long long meanUpdateFreqUsec

the mean update frequency of all fields

double recentUpdateUsec

the sum of every field’s most recent execution time after they have been normalized to meanUpdateFreqUsec.

This is roughly how long it takes to update fields every meanUpdateFreqUsec.

long long totalEverUpdateUsec

The total amount of time, ever, that has been spent updating all the fields.

struct dcgmIntrospectFullFieldsExecTime_v2
#include <dcgm_structs.h>

Full introspection info for field execution time.

Since DCGM 2.0

Public Members

unsigned int version

version number (dcgmIntrospectFullFieldsExecTime_version)

dcgmIntrospectFieldsExecTime_v1 aggregateInfo

info that includes global and device scope

int hasGlobalInfo

0 means globalInfo is populated, !0 means it’s not

dcgmIntrospectFieldsExecTime_v1 globalInfo

info that only includes global field scope

unsigned short gpuInfoCount

count of how many entries in gpuInfo are populated

unsigned int gpuIdsForGpuInfo[32]

the GPU ID at a given index identifies which GPU the corresponding entry in gpuInfo is from

dcgmIntrospectFieldsExecTime_v1 gpuInfo[32]

info that is separated by the GPU ID that the watches were for

struct dcgmIntrospectMemory_v1
#include <dcgm_structs.h>

DCGM Memory usage information.

Public Members

unsigned int version

version number (dcgmIntrospectMemory_version)

long long bytesUsed

number of bytes

struct dcgmIntrospectFullMemory_v1
#include <dcgm_structs.h>

Full introspection info for field memory.

Public Members

unsigned int version

version number (dcgmIntrospectFullMemory_version)

dcgmIntrospectMemory_v1 aggregateInfo

info that includes global and device scope

int hasGlobalInfo

0 means globalInfo is populated, !0 means it’s not

dcgmIntrospectMemory_v1 globalInfo

info that only includes global field scope

unsigned short gpuInfoCount

count of how many entries in gpuInfo are populated

unsigned int gpuIdsForGpuInfo[32]

the GPU ID at a given index identifies which GPU the corresponding entry in gpuInfo is from

dcgmIntrospectMemory_v1 gpuInfo[32]

info that is divided by the GPU ID that the watches were for

struct dcgmIntrospectCpuUtil_v1
#include <dcgm_structs.h>

DCGM CPU Utilization information.

Multiply values by 100 to get them in %.

Public Members

unsigned int version

version number (dcgmIntrospectCpuUtil_version)

double total

fraction of device’s CPU resources that were used

double kernel

fraction of device’s CPU resources that were used in kernel mode

double user

fraction of device’s CPU resources that were used in user mode

struct dcgmRunDiag_v7

Public Members

unsigned int version

version of this message

unsigned int flags

flags specifying binary options for running it. See DCGM_RUN_FLAGS_*

unsigned int debugLevel

0-5 for the debug level the GPU diagnostic will use for logging.

dcgmGpuGrp_t groupId

group of GPUs to verify. Cannot be specified together with gpuList.

dcgmPolicyValidation_t validate

0-3 for which tests to run. Optional.

char testNames[20][50]

Specified list of test names. Optional.

char testParms[100][100]

Parameters to set for specified tests, in the format: testName.parameterName=parameterValue. Optional.

char fakeGpuList[50]

Comma-separated list of GPUs. Cannot be specified with the groupId.

char gpuList[50]

Comma-separated list of GPUs. Cannot be specified with the groupId.

char debugLogFile[128]

Alternate name for the debug log file that should be used.

char statsPath[128]

Path that the plugin’s statistics files should be written to.

char configFileContents[10000]

Contents of nvvs config file (likely yaml)

char throttleMask[50]

Throttle reasons to ignore, as either an integer mask or a CSV list of reasons.

char pluginPath[128]

Custom path to the diagnostic plugins - No longer supported as of 2.2.9.

unsigned int trainingIterations

Number of iterations for training.

unsigned int trainingVariance

Acceptable training variance as a percentage of the value. (0-100)

unsigned int trainingTolerance

Acceptable training tolerance as a percentage of the value. (0-100)

char goldenValuesFile[128]

The path where the golden values should be recorded.

unsigned int failCheckInterval

How often the fail early checks should occur when enabled.

struct dcgmTopoSchedHint_v1

Public Members

unsigned int version

version of this message

uint64_t inputGpuIds

bit-mask of the GPU ids to choose from

uint32_t numGpus

the number of GPUs that DCGM should choose

uint64_t hintFlags

Hints to ignore certain factors for the scheduling hint.

struct dcgmNvLinkGpuLinkStatus_v1
#include <dcgm_structs.h>

State of NvLink links for a GPU.

Public Members

dcgm_field_eid_t entityId

Entity ID of the GPU (gpuId)

dcgmNvLinkLinkState_t linkState[6]

Per-GPU link states.

struct dcgmNvLinkGpuLinkStatus_v2

Public Members

dcgm_field_eid_t entityId

Entity ID of the GPU (gpuId)

dcgmNvLinkLinkState_t linkState[12]

Per-GPU link states.

struct dcgmNvLinkNvSwitchLinkStatus_t
#include <dcgm_structs.h>

State of NvLink links for a NvSwitch.

Public Members

dcgm_field_eid_t entityId

Entity ID of the NvSwitch (physicalId)

dcgmNvLinkLinkState_t linkState[36]

Per-NvSwitch link states.

struct dcgmNvLinkStatus_v1
#include <dcgm_structs.h>

Status of all of the NvLinks in a given system.

Public Members

unsigned int version

Version of this request. Should be dcgmNvLinkStatus_version1.

unsigned int numGpus

Number of entries in gpus[] that are populated.

dcgmNvLinkGpuLinkStatus_v1 gpus[32]

Per-GPU NvLink link statuses.

unsigned int numNvSwitches

Number of entries in nvSwitches[] that are populated.

dcgmNvLinkNvSwitchLinkStatus_t nvSwitches[12]

Per-NvSwitch link statuses.

struct dcgmNvLinkStatus_v2

Public Members

unsigned int version

Version of this request. Should be dcgmNvLinkStatus_version2.

unsigned int numGpus

Number of entries in gpus[] that are populated.

dcgmNvLinkGpuLinkStatus_v2 gpus[32]

Per-GPU NvLink link statuses.

unsigned int numNvSwitches

Number of entries in nvSwitches[] that are populated.

dcgmNvLinkNvSwitchLinkStatus_t nvSwitches[12]

Per-NvSwitch link statuses.

struct dcgmSummaryResponse_t

Public Members

unsigned int fieldType

type of field that is summarized (int64 or fp64)

unsigned int summaryCount

the number of populated summaries in values

union dcgmSummaryResponse_t::[anonymous] values[7]

array for storing the values of each summary.

The summaries are stored in order. For example, if MIN and MAX are requested, then 0 will be MIN and 1 will be MAX. If AVG and DIFF are requested, then 0 will be AVG and 1 will be DIFF.

struct dcgmFieldSummaryRequest_v1

Public Members

unsigned int version

version of this message - dcgmFieldSummaryRequest_v1

unsigned short fieldId

field id to be summarized

dcgm_field_entity_group_t entityGroupId

the type of entity whose field we’re getting

dcgm_field_eid_t entityId

ordinal id for this entity

uint32_t summaryTypeMask

bit-mask of DCGM_SUMMARY_*, the requested summaries

uint64_t startTime

start time for the interval being summarized.

0 means to use any data before.

uint64_t endTime

end time for the interval being summarized.

0 means to use any data after.

dcgmSummaryResponse_t response

response data for this request

struct dcgmModuleGetStatusesModule_t
#include <dcgm_structs.h>

Status of all of the modules of the host engine.

Public Members

dcgmModuleId_t id

ID of this module.

dcgmModuleStatus_t status

Status of this module.

struct dcgmModuleGetStatuses_v1

Public Members

unsigned int version

Version of this request. Should be dcgmModuleGetStatuses_version1.

unsigned int numStatuses

Number of entries in statuses[] that are populated.

dcgmModuleGetStatusesModule_t statuses[16]

Per-module status information.

struct dcgmStartEmbeddedV2Params_v1
#include <dcgm_structs.h>

Options for dcgmStartEmbedded_v2.

Added in DCGM 2.0.0

Public Members

unsigned int version

Version number. Use dcgmStartEmbeddedV2Params_version1

dcgmOperationMode_t opMode

IN: Collect data automatically or manually when asked by the user.

dcgmHandle_t dcgmHandle

OUT: DCGM Handle to use for API calls

const char *logFile

IN: File that DCGM should log to. NULL = do not log. ‘-’ = stdout

DcgmLoggingSeverity_t severity

IN: Severity at which DCGM should log to logFile

unsigned int blackListCount

IN: Number of modules to be blacklisted in blackList[]

unsigned int unused

IN: Unused. Set to 0. Aligns structure to 8-bytes

struct dcgmStartEmbeddedV2Params_v2
#include <dcgm_structs.h>

Options for dcgmStartEmbedded_v2.

Added in DCGM 2.4.0

Public Members

unsigned int version

Version number. Use dcgmStartEmbeddedV2Params_version2

dcgmOperationMode_t opMode

IN: Collect data automatically or manually when asked by the user.

dcgmHandle_t dcgmHandle

OUT: DCGM Handle to use for API calls

const char *logFile

IN: File that DCGM should log to. NULL = do not log. ‘-’ = stdout

DcgmLoggingSeverity_t severity

IN: Severity at which DCGM should log to logFile

unsigned int blackListCount

IN: Number of modules to be blacklisted in blackList[]

const char *serviceAccount

IN: Service account for unprivileged processes

dcgmModuleId_t blackList[DcgmModuleIdCount]

IN: IDs of modules to blacklist

char _padding[4]

IN: Unused. Aligns the struct to 8 bytes.

struct dcgmProfMetricGroupInfo_t
#include <dcgm_structs.h>

Structure to return all of the profiling metric groups that are available for the given groupId.

Public Members

unsigned short majorId

Major ID of this metric group.

Metric groups with the same majorId cannot be watched concurrently with other metric groups with the same majorId

unsigned short minorId

Minor ID of this metric group.

This distinguishes metric groups within the same major metric group from each other

unsigned int numFieldIds

Number of field IDs that are populated in fieldIds[].

unsigned short fieldIds[8]

DCGM Field IDs that are part of this profiling group. See DCGM_FI_PROF_* definitions in dcgm_fields.h for details.

struct dcgmProfGetMetricGroups_v2

Input parameters

unsigned int version

Version of this request. Should be dcgmProfGetMetricGroups_version.

unsigned int unused

Not used for now. Set to 0.

dcgmGpuGrp_t groupId

Group of GPUs we should get the metric groups for.

These must all be the exact same GPU or DCGM_ST_GROUP_INCOMPATIBLE will be returned

Output

unsigned int numMetricGroups

Number of entries in metricGroups[] that are populated.

unsigned int unused1

Not used for now. Set to 0.

dcgmProfMetricGroupInfo_t metricGroups[10]

Info for each metric group.

struct dcgmProfWatchFields_v1
#include <dcgm_structs.h>

Structure to pass to dcgmProfWatchFields() when watching profiling metrics.

Public Members

unsigned int version

Version of this request. Should be dcgmProfWatchFields_version.

dcgmGpuGrp_t groupId

Group ID representing collection of one or more GPUs.

Look at dcgmGroupCreate for details on creating the group. Alternatively, pass in the group id as DCGM_GROUP_ALL_GPUS to perform operation on all the GPUs. The GPUs of the group must all be identical or DCGM_ST_GROUP_INCOMPATIBLE will be returned by this API.

unsigned int numFieldIds

Number of field IDs that are being passed in fieldIds[].

unsigned short fieldIds[16]

DCGM_FI_PROF_? field IDs to watch.

long long updateFreq

How often to update this field in usec.

Note that profiling metrics may need to be sampled more frequently than this value. See dcgmProfMetricGroupInfo_t.minUpdateFreqUsec of the metric group matching metricGroupTag to see what this minimum is. If minUpdateFreqUsec < updateFreq then samples will be aggregated to updateFreq intervals in DCGM’s internal cache.

double maxKeepAge

How long to keep data for every fieldId in seconds.

int maxKeepSamples

Maximum number of samples to keep for each fieldId. 0=no limit.

unsigned int flags

For future use. Set to 0 for now.

struct dcgmProfUnwatchFields_v1
#include <dcgm_structs.h>

Structure to pass to dcgmProfUnwatchFields when unwatching profiling metrics.

Public Members

unsigned int version

Version of this request. Should be dcgmProfUnwatchFields_version.

dcgmGpuGrp_t groupId

Group ID representing collection of one or more GPUs.

Look at dcgmGroupCreate for details on creating the group. Alternatively, pass in the group id as DCGM_GROUP_ALL_GPUS to perform operation on all the GPUs. The GPUs of the group must all be identical or DCGM_ST_GROUP_INCOMPATIBLE will be returned by this API.

unsigned int flags

For future use. Set to 0 for now.

struct dcgmSettingsSetLoggingSeverity_v1
#include <dcgm_structs.h>

Version 1 of dcgmSettingsSetLoggingSeverity_t.

struct dcgmVersionInfo_v2
#include <dcgm_structs.h>

Structure to describe the DCGM build environment ver 2.0.

Public Members

char rawBuildInfoString[256 * 2]

Raw form of the DCGM build info.

There may be multiple kv-pairs separated by semicolon (;).

Within each pair, the key and the value are separated by a colon char (:). Only the very first colon is considered as a separation.

Values can contain colon chars. Values and Keys cannot contain semicolon chars.

Usually defined keys are:

  • version : DCGM Version.

  • arch : Target DCGM Architecture.

  • buildid : Build ID. Usually a sequential number.

  • commit : Commit ID (Usually a git commit hash).

  • author : Author of the commit above.

  • branch : Branch (Usually a git branch that was used for the build).

  • buildtype : Build Type.

  • builddate : Date of the build.

  • buildplatform : Platform where the build was made.

Any or all keys may be absent.

These values are for reference only and are not supposed to participate in any complicated logic.