Data Structures¶
- group dcgmStructs
Unnamed Group
-
DCGM_RUN_FLAGS_VERBOSE¶
Flag options for running the GPU diagnostic.
Output in verbose mode; include information as well as warnings.
-
DCGM_RUN_FLAGS_STATSONFAIL¶
Output stats only on failure.
-
DCGM_RUN_FLAGS_TRAIN¶
Train DCGM diagnostic and output a configuration file with golden values.
-
DCGM_RUN_FLAGS_FORCE_TRAIN¶
Ignore warnings against training the diagnostic and train anyway.
-
DCGM_RUN_FLAGS_FAIL_EARLY¶
Enable fail early checks for the Targeted Stress, Targeted Power, SM Stress, and Diagnostic tests.
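Flag constants like these are usually combined with bitwise OR and tested with bitwise AND. The following is a minimal sketch of that pattern. The `EXAMPLE_` constants and their numeric values are local stand-ins invented for illustration, not values taken from dcgm_structs.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the DCGM_RUN_FLAGS_* defines.
 * The numeric values are assumptions for illustration only. */
#define EXAMPLE_RUN_FLAGS_VERBOSE     0x0001u
#define EXAMPLE_RUN_FLAGS_STATSONFAIL 0x0002u
#define EXAMPLE_RUN_FLAGS_FAIL_EARLY  0x0010u

/* Returns true when every bit of `flag` is set in `flags`. */
static bool run_flag_is_set(unsigned int flags, unsigned int flag)
{
    assert(flag != 0u); /* an empty flag would trivially match */
    return (flags & flag) == flag;
}

/* Builds a flag word requesting verbose output and fail-early checks. */
static unsigned int example_run_flags(void)
{
    return EXAMPLE_RUN_FLAGS_VERBOSE | EXAMPLE_RUN_FLAGS_FAIL_EARLY;
}
```

In real code the DCGM_RUN_FLAGS_* names from the header would replace the stand-ins.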
Unnamed Group
-
DCGM_TOPO_HINT_F_NONE¶
Topology hints for dcgmSelectGpusByTopology()
No hints specified
-
DCGM_TOPO_HINT_F_IGNOREHEALTH¶
Ignore the health of the GPUs when picking GPUs for job execution.
By default, only healthy GPUs are considered.
Defines
-
dcgmConnectV2Params_version1¶
Version 1 for dcgmConnectV2Params_v1.
-
dcgmConnectV2Params_version2¶
Version 2 for dcgmConnectV2Params_v2.
-
dcgmConnectV2Params_version¶
Latest version for dcgmConnectV2Params_t.
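The `_versionN` defines pair each structure generation with a version word, and the unsuffixed define tracks the latest one. The sketch below shows one plausible way such a word can be built and checked, assuming it packs the structure size into the low 24 bits and the version number into the high 8 bits. `EXAMPLE_MAKE_VERSION`, the `exampleParams_*` types, and their fields are hypothetical stand-ins, not definitions from dcgm_structs.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical version encoding: struct size in the low 24 bits,
 * version number in the high 8 bits. This layout is an assumption. */
#define EXAMPLE_MAKE_VERSION(type, ver) \
    ((unsigned int)(sizeof(type) | ((unsigned long)(ver) << 24U)))

/* Hypothetical stand-ins for two generations of a parameter struct. */
typedef struct
{
    unsigned int version; /* set by the caller before use */
    unsigned int persistAfterDisconnect;
} exampleParams_v1;

typedef struct
{
    unsigned int version;
    unsigned int persistAfterDisconnect;
    unsigned int timeoutMs; /* field added in the second generation */
} exampleParams_v2;

#define exampleParams_version1 EXAMPLE_MAKE_VERSION(exampleParams_v1, 1)
#define exampleParams_version2 EXAMPLE_MAKE_VERSION(exampleParams_v2, 2)
#define exampleParams_version  exampleParams_version2 /* latest */

/* Extracts the version number from an encoded version word. */
static unsigned int example_version_number(unsigned int encoded)
{
    return encoded >> 24U;
}

/* Extracts the struct size from an encoded version word. */
static size_t example_version_size(unsigned int encoded)
{
    return (size_t)(encoded & 0x00FFFFFFu);
}

/* Zero-initializes a v2 struct and stamps it with the latest version. */
static void example_params_init(exampleParams_v2 *params)
{
    assert(params != NULL);
    memset(params, 0, sizeof(*params));
    params->version = exampleParams_version;
}
```

Under this assumed encoding, a receiver can reject a structure whose size or generation it does not recognize before reading any other field.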
-
dcgmHostengineHealth_version1¶
-
dcgmHostengineHealth_version¶
Latest version for dcgmHostengineHealth_t.
-
dcgmGroupInfo_version2¶
Version 2 for dcgmGroupInfo_v2.
-
dcgmGroupInfo_version¶
Latest version for dcgmGroupInfo_t.
-
DCGM_MAX_INSTANCES_PER_GPU¶
-
DCGM_MAX_COMPUTE_INSTANCES_PER_GPU¶
-
DCGM_MAX_TOTAL_INSTANCES_PER_GPU¶
-
DCGM_MAX_HIERARCHY_INFO¶
-
DCGM_MAX_INSTANCES¶
-
DCGM_MAX_COMPUTE_INSTANCES¶
-
dcgmMigHierarchy_version1¶
-
dcgmMigHierarchy_version2¶
-
dcgmMigHierarchy_version¶
-
DCGM_MAX_NUM_FIELD_GROUPS¶
Maximum number of field groups that can exist.
-
DCGM_MAX_FIELD_IDS_PER_FIELD_GROUP¶
Maximum number of field IDs that can be in a single field group.
-
dcgmFieldGroupInfo_version1¶
Version 1 for dcgmFieldGroupInfo_v1.
-
dcgmFieldGroupInfo_version¶
Latest version for dcgmFieldGroupInfo_t.
-
dcgmAllFieldGroup_version1¶
Version 1 for dcgmAllFieldGroup_v1.
-
dcgmAllFieldGroup_version¶
Latest version for dcgmAllFieldGroup_t.
-
dcgmClockSet_version1¶
Version 1 for dcgmClockSet_v1.
-
dcgmClockSet_version¶
Latest version for dcgmClockSet_t.
-
dcgmDeviceSupportedClockSets_version1¶
Version 1 for dcgmDeviceSupportedClockSets_v1.
-
dcgmDeviceSupportedClockSets_version¶
Latest version for dcgmDeviceSupportedClockSets_t.
-
dcgmDevicePidAccountingStats_version1¶
Version 1 for dcgmDevicePidAccountingStats_v1.
-
dcgmDevicePidAccountingStats_version¶
Latest version for dcgmDevicePidAccountingStats_t.
-
dcgmDeviceThermals_version1¶
Version 1 for dcgmDeviceThermals_v1.
-
dcgmDeviceThermals_version¶
Latest version for dcgmDeviceThermals_t.
-
dcgmDevicePowerLimits_version1¶
Version 1 for dcgmDevicePowerLimits_v1.
-
dcgmDevicePowerLimits_version¶
Latest version for dcgmDevicePowerLimits_t.
-
dcgmDeviceIdentifiers_version1¶
Version 1 for dcgmDeviceIdentifiers_v1.
-
dcgmDeviceIdentifiers_version¶
Latest version for dcgmDeviceIdentifiers_t.
-
dcgmDeviceMemoryUsage_version1¶
Version 1 for dcgmDeviceMemoryUsage_v1.
-
dcgmDeviceMemoryUsage_version¶
Latest version for dcgmDeviceMemoryUsage_t.
-
dcgmDeviceVgpuUtilInfo_version1¶
Version 1 for dcgmDeviceVgpuUtilInfo_v1.
-
dcgmDeviceVgpuUtilInfo_version¶
Latest version for dcgmDeviceVgpuUtilInfo_t.
-
dcgmDeviceEncStats_version1¶
Version 1 for dcgmDeviceEncStats_v1.
-
dcgmDeviceEncStats_version¶
Latest version for dcgmDeviceEncStats_t.
-
dcgmDeviceFbcStats_version1¶
Version 1 for dcgmDeviceFbcStats_v1.
-
dcgmDeviceFbcStats_version¶
Latest version for dcgmDeviceFbcStats_t.
-
dcgmDeviceFbcSessionInfo_version1¶
Version 1 for dcgmDeviceFbcSessionInfo_v1.
-
dcgmDeviceFbcSessionInfo_version¶
Latest version for dcgmDeviceFbcSessionInfo_t.
-
dcgmDeviceFbcSessions_version1¶
Version 1 for dcgmDeviceFbcSessions_v1.
-
dcgmDeviceFbcSessions_version¶
Latest version for dcgmDeviceFbcSessions_t.
-
dcgmDeviceVgpuEncSessions_version1¶
Version 1 for dcgmDeviceVgpuEncSessions_v1.
-
dcgmDeviceVgpuEncSessions_version¶
Latest version for dcgmDeviceVgpuEncSessions_t.
-
dcgmDeviceVgpuProcessUtilInfo_version1¶
Version 1 for dcgmDeviceVgpuProcessUtilInfo_v1.
-
dcgmDeviceVgpuProcessUtilInfo_version¶
Latest version for dcgmDeviceVgpuProcessUtilInfo_t.
-
dcgmDeviceVgpuTypeInfo_version1¶
Version 1 for dcgmDeviceVgpuTypeInfo_v1.
-
dcgmDeviceVgpuTypeInfo_version2¶
Version 2 for dcgmDeviceVgpuTypeInfo_v2.
-
dcgmDeviceVgpuTypeInfo_version¶
Latest version for dcgmDeviceVgpuTypeInfo_t.
-
dcgmDevicesSettings_version1¶
-
dcgmDeviceSettings_version2¶
-
dcgmDeviceSettings_version¶
-
dcgmDeviceAttributes_version1¶
Version 1 for dcgmDeviceAttributes_v1.
-
dcgmDeviceAttributes_version2¶
Version 2 for dcgmDeviceAttributes_v2.
-
dcgmDeviceAttributes_version3¶
Version 3 for dcgmDeviceAttributes_v3.
-
dcgmDeviceAttributes_version¶
Latest version for dcgmDeviceAttributes_t.
-
DCGM_MAX_VGPU_TYPES_PER_PGPU¶
Maximum number of vGPU types per physical GPU.
-
DCGM_DEVICE_UUID_BUFFER_SIZE¶
Size of a buffer that holds strings related to attributes specific to a vGPU instance.
-
dcgmConfig_version1¶
Version 1 for dcgmConfig_v1.
-
dcgmConfig_version¶
Latest version for dcgmConfig_t.
-
dcgmPolicyViolation_version1¶
-
dcgmPolicyViolation_version¶
-
DCGM_POLICY_COND_IDX_MAX¶
-
DCGM_POLICY_COND_MAX¶
-
dcgmPolicy_version1¶
Version 1 for dcgmPolicy_v1.
-
dcgmPolicy_version¶
Latest version for dcgmPolicy_t.
-
dcgmPolicyCallbackResponse_version1¶
Version 1 for dcgmPolicyCallbackResponse_v1.
-
dcgmPolicyCallbackResponse_version¶
Latest version for dcgmPolicyCallbackResponse_t.
-
DCGM_MAX_BLOB_LENGTH¶
Set above the size of the largest blob entry.
Currently this is dcgmDeviceVgpuTypeInfo_v1.
-
dcgmFieldValue_version1¶
Version 1 for dcgmFieldValue_v1.
-
dcgmFieldValue_version2¶
Version 2 for dcgmFieldValue_v2.
-
DCGM_FV_FLAG_LIVE_DATA¶
Field value flags used by dcgmEntitiesGetLatestValues.
Retrieve live data from the driver rather than cached data. Warning: Setting this flag will result in multiple calls to the NVIDIA driver that will be much slower than retrieving a cached value.
-
DCGM_HEALTH_WATCH_COUNT_V1¶
For iterating through the dcgmHealthSystems_v1 enum
-
DCGM_HEALTH_WATCH_COUNT_V2¶
For iterating through the dcgmHealthSystems_v2 enum
-
DCGM_HEALTH_WATCH_MAX_INCIDENTS¶
-
dcgmHealthResponse_version4¶
Version 4 for dcgmHealthResponse_v4.
-
dcgmHealthResponse_version¶
Latest version for dcgmHealthResponse_t.
-
dcgmHealthSetParams_version2¶
Version 2 for dcgmHealthSetParams_v2.
-
DCGM_MAX_PID_INFO_NUM¶
-
dcgmPidInfo_version2¶
Version 2 for dcgmPidInfo_v2.
-
dcgmPidInfo_version¶
Latest version for dcgmPidInfo_t.
-
dcgmJobInfo_version3¶
Version 3 for dcgmJobInfo_v3.
-
dcgmJobInfo_version¶
Latest version for dcgmJobInfo_t.
-
dcgmRunningProcess_version1¶
Version 1 for dcgmRunningProcess_v1.
-
dcgmRunningProcess_version¶
Latest version for dcgmRunningProcess_t.
-
DCGM_SM_PERF_INDEX¶
-
DCGM_TARGETED_PERF_INDEX¶
-
DCGM_PER_GPU_TEST_COUNT_V6¶
-
DCGM_PER_GPU_TEST_COUNT_V7¶
-
DCGM_SWTEST_COUNT¶
-
LEVEL_ONE_MAX_RESULTS¶
-
dcgmDiagResponse_version6¶
Version 6 for dcgmDiagResponse_v6.
-
dcgmDiagResponse_version7¶
Version 7 for dcgmDiagResponse_v7.
-
dcgmDiagResponse_version¶
Latest version for dcgmDiagResponse_t.
-
DCGM_TOPOLOGY_PATH_PCI(x)¶
-
DCGM_TOPOLOGY_PATH_NVLINK(x)¶
-
DCGM_AFFINITY_BITMASK_ARRAY_SIZE¶
-
dcgmDeviceTopology_version1¶
Version 1 for dcgmDeviceTopology_v1.
-
dcgmDeviceTopology_version¶
Latest version for dcgmDeviceTopology_t.
-
dcgmGroupTopology_version1¶
Version 1 for dcgmGroupTopology_v1.
-
dcgmGroupTopology_version¶
Latest version for dcgmGroupTopology_t.
-
dcgmIntrospectContext_version1¶
Version 1 for dcgmIntrospectContext_t.
-
dcgmIntrospectContext_version¶
Latest version for dcgmIntrospectContext_t.
-
dcgmIntrospectFieldsExecTime_version1¶
Version 1 for dcgmIntrospectFieldsExecTime_t.
-
dcgmIntrospectFieldsExecTime_version¶
Latest version for dcgmIntrospectFieldsExecTime_t.
-
dcgmIntrospectFullFieldsExecTime_version2¶
Version 2 for dcgmIntrospectFullFieldsExecTime_v2.
-
dcgmIntrospectFullFieldsExecTime_version¶
Latest version for dcgmIntrospectFullFieldsExecTime_t.
-
dcgmIntrospectMemory_version1¶
Version 1 for dcgmIntrospectMemory_t.
-
dcgmIntrospectMemory_version¶
Latest version for dcgmIntrospectMemory_t.
-
dcgmIntrospectFullMemory_version1¶
Version 1 for dcgmIntrospectFullMemory_t.
-
dcgmIntrospectFullMemory_version¶
Latest version for dcgmIntrospectFullMemory_t.
-
dcgmIntrospectCpuUtil_version1¶
Version 1 for dcgmIntrospectCpuUtil_t.
-
dcgmIntrospectCpuUtil_version¶
Latest version for dcgmIntrospectCpuUtil_t.
-
DCGM_MAX_CONFIG_FILE_LEN¶
-
DCGM_MAX_TEST_NAMES¶
-
DCGM_MAX_TEST_NAMES_LEN¶
-
DCGM_MAX_TEST_PARMS¶
-
DCGM_MAX_TEST_PARMS_LEN¶
-
DCGM_GPU_LIST_LEN¶
-
DCGM_FILE_LEN¶
-
DCGM_PATH_LEN¶
-
DCGM_THROTTLE_MASK_LEN¶
-
dcgmRunDiag_version7¶
Version 7 for dcgmRunDiag_t.
-
DCGM_GEGE_FLAG_ONLY_SUPPORTED¶
Flags for dcgmGetEntityGroupEntities’s flags parameter.
Only return entities that are supported by DCGM. This mimics the behavior of dcgmGetAllSupportedDevices().
-
dcgmTopoSchedHint_version1¶
-
dcgmNvLinkStatus_version1¶
Version 1 of dcgmNvLinkStatus.
-
dcgmNvLinkStatus_version2¶
Version 2 of dcgmNvLinkStatus.
-
DCGM_SUMMARY_MIN¶
-
DCGM_SUMMARY_MAX¶
-
DCGM_SUMMARY_AVG¶
-
DCGM_SUMMARY_SUM¶
-
DCGM_SUMMARY_COUNT¶
-
DCGM_SUMMARY_INTEGRAL¶
-
DCGM_SUMMARY_DIFF¶
-
DCGM_SUMMARY_SIZE¶
-
dcgmFieldSummaryRequest_version1¶
-
DCGM_MODULE_STATUSES_CAPACITY¶
-
dcgmModuleGetStatuses_version1¶
Version 1 of dcgmModuleGetStatuses.
-
dcgmModuleGetStatuses_version¶
-
dcgmStartEmbeddedV2Params_version1¶
Version 1 for dcgmStartEmbeddedV2Params_v1.
-
dcgmStartEmbeddedV2Params_version2¶
Version 2 for dcgmStartEmbeddedV2Params_v2.
-
DCGM_PROF_MAX_NUM_GROUPS¶
Maximum number of metric ID groups that can exist in DCGM.
-
DCGM_PROF_MAX_FIELD_IDS_PER_GROUP¶
Maximum number of field IDs that can be in a single DCGM profiling metric group.
-
dcgmProfGetMetricGroups_version2¶
Version 2 of dcgmProfGetMetricGroups_v2.
-
dcgmProfGetMetricGroups_version¶
-
dcgmProfWatchFields_version1¶
Version 1 of dcgmProfWatchFields_v1.
-
dcgmProfWatchFields_version¶
-
dcgmProfUnwatchFields_version1¶
Version 1 of dcgmProfUnwatchFields_v1.
-
dcgmProfUnwatchFields_version¶
-
dcgmSettingsSetLoggingSeverity_version1¶
-
dcgmSettingsSetLoggingSeverity_version¶
-
dcgmVersionInfo_version2¶
Version 2 of dcgmVersionInfo_v2.
-
dcgmVersionInfo_version¶
Typedefs
-
typedef uintptr_t dcgmHandle_t¶
Identifier for DCGM Handle.
-
typedef uintptr_t dcgmGpuGrp_t¶
Identifier for a group of GPUs. A group can have one or more GPUs.
-
typedef uintptr_t dcgmFieldGrp_t¶
Identifier for a group of fields.
-
typedef uintptr_t dcgmStatus_t¶
Identifier for list of status codes.
-
typedef dcgmConnectV2Params_v2 dcgmConnectV2Params_t¶
Typedef for dcgmConnectV2Params_v2.
-
typedef dcgmHostengineHealth_v1 dcgmHostengineHealth_t¶
Typedef for dcgmHostengineHealth_v1.
-
typedef dcgmGroupInfo_v2 dcgmGroupInfo_t¶
Typedef for dcgmGroupInfo_v2.
-
typedef dcgmFieldGroupInfo_v1 dcgmFieldGroupInfo_t¶
-
typedef dcgmAllFieldGroup_v1 dcgmAllFieldGroup_t¶
-
typedef dcgmClockSet_v1 dcgmClockSet_t¶
Typedef for dcgmClockSet_v1.
-
typedef dcgmDeviceSupportedClockSets_v1 dcgmDeviceSupportedClockSets_t¶
Typedef for dcgmDeviceSupportedClockSets_v1.
-
typedef dcgmDevicePidAccountingStats_v1 dcgmDevicePidAccountingStats_t¶
Typedef for dcgmDevicePidAccountingStats_v1.
-
typedef dcgmDeviceThermals_v1 dcgmDeviceThermals_t¶
Typedef for dcgmDeviceThermals_v1.
-
typedef dcgmDevicePowerLimits_v1 dcgmDevicePowerLimits_t¶
Typedef for dcgmDevicePowerLimits_v1.
-
typedef dcgmDeviceIdentifiers_v1 dcgmDeviceIdentifiers_t¶
Typedef for dcgmDeviceIdentifiers_v1.
-
typedef dcgmDeviceMemoryUsage_v1 dcgmDeviceMemoryUsage_t¶
Typedef for dcgmDeviceMemoryUsage_v1.
-
typedef dcgmDeviceVgpuUtilInfo_v1 dcgmDeviceVgpuUtilInfo_t¶
Typedef for dcgmDeviceVgpuUtilInfo_v1.
-
typedef dcgmDeviceEncStats_v1 dcgmDeviceEncStats_t¶
Typedef for dcgmDeviceEncStats_v1.
-
typedef dcgmDeviceFbcStats_v1 dcgmDeviceFbcStats_t¶
Typedef for dcgmDeviceFbcStats_v1.
-
typedef enum dcgmFBCSessionType_enum dcgmFBCSessionType_t¶
-
typedef dcgmDeviceFbcSessionInfo_v1 dcgmDeviceFbcSessionInfo_t¶
Typedef for dcgmDeviceFbcSessionInfo_v1.
-
typedef dcgmDeviceFbcSessions_v1 dcgmDeviceFbcSessions_t¶
Typedef for dcgmDeviceFbcSessions_v1.
-
typedef enum dcgmEncoderQueryType_enum dcgmEncoderType_t¶
-
typedef dcgmDeviceVgpuEncSessions_v1 dcgmDeviceVgpuEncSessions_t¶
Typedef for dcgmDeviceVgpuEncSessions_v1.
-
typedef dcgmDeviceVgpuProcessUtilInfo_v1 dcgmDeviceVgpuProcessUtilInfo_t¶
Typedef for dcgmDeviceVgpuProcessUtilInfo_v1.
-
typedef dcgmDeviceVgpuTypeInfo_v2 dcgmDeviceVgpuTypeInfo_t¶
Typedef for dcgmDeviceVgpuTypeInfo_v2.
-
typedef dcgmDeviceSettings_v2 dcgmDeviceSettings_t¶
-
typedef dcgmDeviceAttributes_v3 dcgmDeviceAttributes_t¶
Typedef for dcgmDeviceAttributes_v3.
-
typedef dcgmConfig_v1 dcgmConfig_t¶
Typedef for dcgmConfig_v1.
-
typedef int (*fpRecvUpdates)(void *userData)¶
Represents a callback to receive updates from asynchronous functions.
Currently, the only function that takes this callback is dcgmPolicyRegister, and the void * data will be a pointer to a dcgmPolicyCallbackResponse_t. Example: dcgmPolicyCallbackResponse_t *callbackResponse = (dcgmPolicyCallbackResponse_t *) userData;
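The cast described above can be sketched in a complete callback as follows. `exampleCallbackResponse_t` is a hypothetical, simplified stand-in for dcgmPolicyCallbackResponse_t, `example_dispatch` merely imitates a library invoking the callback, and the meaning of the return value is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for dcgmPolicyCallbackResponse_t.
 * Only the idea of a "condition" member is kept for illustration. */
typedef struct
{
    unsigned int version;
    unsigned int condition; /* which policy condition was violated */
} exampleCallbackResponse_t;

/* Same shape as fpRecvUpdates: int (*)(void *userData). */
typedef int (*exampleRecvUpdates_f)(void *userData);

static unsigned int g_lastCondition = 0; /* most recent violation seen */

/* Callback: casts the opaque pointer back to the response type.
 * Returning 0 for success and -1 for bad input is an assumed convention. */
static int on_policy_violation(void *userData)
{
    exampleCallbackResponse_t *response = (exampleCallbackResponse_t *)userData;
    if (response == NULL)
    {
        return -1; /* nothing to inspect */
    }
    g_lastCondition = response->condition;
    return 0;
}

/* Hypothetical dispatcher standing in for the library invoking the callback. */
static int example_dispatch(exampleRecvUpdates_f callback, exampleCallbackResponse_t *response)
{
    assert(callback != NULL);
    return callback(response);
}
```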
-
typedef dcgmPolicyViolation_v1 dcgmPolicyViolation_t¶
-
typedef enum dcgmPolicyConditionIdx_enum dcgmPolicyConditionIdx_t¶
Enumeration for policy conditions.
When used as part of dcgmPolicy_t, these have corresponding parameters that allow them to be switched on or off, or to set specific violation thresholds.
-
typedef enum dcgmPolicyCondition_enum dcgmPolicyCondition_t¶
Bitmask enumeration for policy conditions.
When used as part of dcgmPolicy_t, these have corresponding parameters that allow them to be switched on or off, or to set specific violation thresholds.
-
typedef struct dcgmPolicyConditionParams_st dcgmPolicyConditionParams_t¶
Structure for policy condition parameters.
This structure contains a tag that represents the type of the value being passed as well as a “val” which is a union of the possible value types. For example, to pass a true boolean: tag = BOOL, val.boolean = 1.
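The tag-plus-union layout described above can be sketched as follows. The type below is a hypothetical stand-in: only `tag`, `val`, and `boolean` come from the description, while the enumerator names and the `llval` member are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag values naming the active union member. */
typedef enum
{
    EXAMPLE_TAG_BOOL,
    EXAMPLE_TAG_LLONG
} exampleParamTag_t;

/* Hypothetical stand-in mirroring the described layout. */
typedef struct
{
    exampleParamTag_t tag;
    union
    {
        unsigned int boolean;
        unsigned long long llval; /* assumed numeric member */
    } val;
} exampleConditionParams_t;

/* Builds a parameter that carries a boolean (tag = BOOL, val.boolean = 1). */
static exampleConditionParams_t example_bool_param(int enabled)
{
    exampleConditionParams_t p;
    p.tag         = EXAMPLE_TAG_BOOL;
    p.val.boolean = enabled ? 1u : 0u;
    return p;
}

/* Builds a parameter that carries a numeric threshold. */
static exampleConditionParams_t example_number_param(unsigned long long threshold)
{
    exampleConditionParams_t p;
    p.tag       = EXAMPLE_TAG_LLONG;
    p.val.llval = threshold;
    return p;
}

/* Reads the value as a number, checking the tag before touching the union. */
static unsigned long long example_param_value(const exampleConditionParams_t *p)
{
    assert(p != NULL);
    return (p->tag == EXAMPLE_TAG_BOOL) ? (unsigned long long)p->val.boolean : p->val.llval;
}
```

Checking the tag before reading keeps the code from interpreting a boolean as a threshold or the reverse.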
-
typedef enum dcgmPolicyMode_enum dcgmPolicyMode_t¶
Enumeration for policy modes.
-
typedef enum dcgmPolicyIsolation_enum dcgmPolicyIsolation_t¶
Enumeration for policy isolation modes.
-
typedef enum dcgmPolicyAction_enum dcgmPolicyAction_t¶
Enumeration for policy actions.
-
typedef enum dcgmPolicyValidation_enum dcgmPolicyValidation_t¶
Enumeration for policy validation actions.
-
typedef enum dcgmPolicyFailureResp_enum dcgmPolicyFailureResp_t¶
Enumeration for policy failure responses.
-
typedef dcgmPolicy_v1 dcgmPolicy_t¶
Typedef for dcgmPolicy_v1.
-
typedef dcgmPolicyCallbackResponse_v1 dcgmPolicyCallbackResponse_t¶
Typedef for dcgmPolicyCallbackResponse_v1.
-
typedef int (*dcgmFieldValueEnumeration_f)(unsigned int gpuId, dcgmFieldValue_v1 *values, int numValues, void *userData)¶
User callback function for processing one or more field updates.
This callback will be invoked one or more times per field until all of the expected field values have been enumerated. It is up to the callee to detect when the field ID changes.
- Param gpuId
IN: GPU ID of the GPU this field value set belongs to
- Param values
IN: Field values. These values must be copied as they will be destroyed as soon as this call returns.
- Param numValues
IN: Number of entries that are valid in values[]
- Param userData
IN: User data pointer passed to the update function that generated this callback
- Return
0 if OK; <0 if enumeration should stop. This allows the callee to abort field value enumeration.
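A callback that follows the contract above (copy the values before returning, return 0 to continue, return a negative value to stop) might look like this sketch. `exampleFieldValue_t` is a hypothetical, heavily simplified stand-in for dcgmFieldValue_v1, and the collector type is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-in for dcgmFieldValue_v1. */
typedef struct
{
    unsigned short fieldId;
    long long value;
} exampleFieldValue_t;

#define EXAMPLE_MAX_SAVED 4

/* Caller-owned storage handed to the callback through userData. */
typedef struct
{
    exampleFieldValue_t saved[EXAMPLE_MAX_SAVED];
    int count;
} exampleCollector_t;

/* Same parameter shape as dcgmFieldValueEnumeration_f, using the stand-in type.
 * Copies the values because the source buffer is only valid during the call. */
static int example_collect(unsigned int gpuId, exampleFieldValue_t *values, int numValues, void *userData)
{
    exampleCollector_t *collector = (exampleCollector_t *)userData;
    (void)gpuId; /* unused in this sketch */
    assert(collector != NULL);

    for (int i = 0; i < numValues; i++)
    {
        if (collector->count >= EXAMPLE_MAX_SAVED)
        {
            return -1; /* <0 asks the enumeration to stop */
        }
        memcpy(&collector->saved[collector->count], &values[i], sizeof(values[i]));
        collector->count++;
    }
    return 0; /* 0 means keep going */
}
```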
-
typedef int (*dcgmFieldValueEntityEnumeration_f)(dcgm_field_entity_group_t entityGroupId, dcgm_field_eid_t entityId, dcgmFieldValue_v1 *values, int numValues, void *userData)¶
User callback function for processing one or more field updates.
This callback will be invoked one or more times per field until all of the expected field values have been enumerated. It is up to the callee to detect when the field ID changes.
- Param entityGroupId
IN: entityGroup of the entity this field value set belongs to
- Param entityId
IN: Entity this field value set belongs to
- Param values
IN: Field values. These values must be copied as they will be destroyed as soon as this call returns.
- Param numValues
IN: Number of entries that are valid in values[]
- Param userData
IN: User data pointer passed to the update function that generated this callback
- Return
0 if OK; <0 if enumeration should stop. This allows the callee to abort field value enumeration.
-
typedef enum dcgmHealthSystems_enum dcgmHealthSystems_t¶
Systems structure used to enable or disable health watch systems.
-
typedef enum dcgmHealthWatchResult_enum dcgmHealthWatchResults_t¶
Health Watch test results.
-
typedef dcgmHealthResponse_v4 dcgmHealthResponse_t¶
Typedef for dcgmHealthResponse_v4.
-
typedef dcgmPidInfo_v2 dcgmPidInfo_t¶
Typedef for dcgmPidInfo_v2.
-
typedef dcgmJobInfo_v3 dcgmJobInfo_t¶
Typedef for dcgmJobInfo_v3.
-
typedef dcgmRunningProcess_v1 dcgmRunningProcess_t¶
Typedef for dcgmRunningProcess_v1.
-
typedef enum dcgmDiagResult_enum dcgmDiagResult_t¶
Diagnostic test results.
-
typedef enum dcgmPerGpuTestIndices_enum dcgmPerGpuTestIndices_t¶
Diagnostic per gpu tests - fixed indices for dcgmDiagResponsePerGpu_t.results[].
-
typedef enum dcgmSoftwareTest_enum dcgmSoftwareTest_t¶
-
typedef dcgmDiagResponse_v7 dcgmDiagResponse_t¶
Typedef for dcgmDiagResponse_v7.
-
typedef enum dcgmGpuLevel_enum dcgmGpuTopologyLevel_t¶
Represents level relationships within a system between two GPUs. The enums are spaced to allow for future relationships.
These match the definitions in nvml.h
-
typedef dcgmDeviceTopology_v1 dcgmDeviceTopology_t¶
Typedef for dcgmDeviceTopology_v1.
-
typedef dcgmGroupTopology_v1 dcgmGroupTopology_t¶
Typedef for dcgmGroupTopology_v1.
-
typedef enum dcgmIntrospectLevel_enum dcgmIntrospectLevel_t¶
Identifies a level to retrieve field introspection info for.
-
typedef dcgmIntrospectContext_v1 dcgmIntrospectContext_t¶
Typedef for dcgmIntrospectContext_v1.
-
typedef dcgmIntrospectFieldsExecTime_v1 dcgmIntrospectFieldsExecTime_t¶
Typedef for dcgmIntrospectFieldsExecTime_v1.
-
typedef dcgmIntrospectFullFieldsExecTime_v2 dcgmIntrospectFullFieldsExecTime_t¶
Typedef for dcgmIntrospectFullFieldsExecTime_v2.
-
typedef enum dcgmIntrospectState_enum dcgmIntrospectState_t¶
State of DCGM metadata gathering.
If it is set to DISABLED then “Metadata” API calls to DCGM are not supported.
-
typedef dcgmIntrospectMemory_v1 dcgmIntrospectMemory_t¶
Typedef for dcgmIntrospectMemory_v1.
-
typedef dcgmIntrospectFullMemory_v1 dcgmIntrospectFullMemory_t¶
Typedef for dcgmIntrospectFullMemory_v1.
-
typedef dcgmIntrospectCpuUtil_v1 dcgmIntrospectCpuUtil_t¶
Typedef for dcgmIntrospectCpuUtil_v1.
-
typedef enum dcgmGpuNVLinkErrorType_enum dcgmGpuNVLinkErrorType_t¶
Identifies a GPU NVLink error type returned by DCGM_FI_DEV_GPU_NVLINK_ERRORS.
-
typedef dcgmTopoSchedHint_v1 dcgmTopoSchedHint_t¶
-
typedef enum dcgmNvLinkLinkState_enum dcgmNvLinkLinkState_t¶
NvLink link states.
-
typedef dcgmNvLinkStatus_v2 dcgmNvLinkStatus_t¶
-
typedef dcgmFieldSummaryRequest_v1 dcgmFieldSummaryRequest_t¶
-
typedef dcgmModuleGetStatuses_v1 dcgmModuleGetStatuses_t¶
-
typedef dcgmProfGetMetricGroups_v2 dcgmProfGetMetricGroups_t¶
-
typedef dcgmProfWatchFields_v1 dcgmProfWatchFields_t¶
-
typedef dcgmProfUnwatchFields_v1 dcgmProfUnwatchFields_t¶
-
typedef dcgmSettingsSetLoggingSeverity_v1 dcgmSettingsSetLoggingSeverity_t¶
-
typedef dcgmVersionInfo_v2 dcgmVersionInfo_t¶
Enums
-
enum DcgmLoggingSeverity_t¶
DCGM Logging Severities.
These match up with the plog severities defined in Severity.h. Each level includes all of the levels above it. For instance, level 4 includes levels 3, 2, and 1 as well.
Values:
-
enumerator DcgmLoggingSeverityUnspecified¶
Don’t care/inherit from the environment
-
enumerator DcgmLoggingSeverityNone¶
No logging
-
enumerator DcgmLoggingSeverityFatal¶
Fatal Errors
-
enumerator DcgmLoggingSeverityError¶
Errors
-
enumerator DcgmLoggingSeverityWarning¶
Warnings
-
enumerator DcgmLoggingSeverityInfo¶
Informative
-
enumerator DcgmLoggingSeverityDebug¶
Debug information (will generate large logs)
-
enumerator DcgmLoggingSeverityVerbose¶
Verbose debugging information
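The rule that each level includes the levels listed before it can be expressed as a simple comparison. In the sketch below, `exampleSeverity_t` is a hypothetical stand-in for DcgmLoggingSeverity_t. The numeric values follow the listing order and are assumptions, as is the choice to treat "unspecified" like "none".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for DcgmLoggingSeverity_t. The numeric values
 * are assumptions for illustration. */
typedef enum
{
    ExampleSeverityUnspecified = -1,
    ExampleSeverityNone        = 0,
    ExampleSeverityFatal       = 1,
    ExampleSeverityError       = 2,
    ExampleSeverityWarning     = 3,
    ExampleSeverityInfo        = 4,
    ExampleSeverityDebug       = 5,
    ExampleSeverityVerbose     = 6
} exampleSeverity_t;

/* A message is emitted when its severity is at or below the configured level,
 * so a configured level of 4 also admits levels 3, 2, and 1. */
static bool example_should_log(exampleSeverity_t configured, exampleSeverity_t message)
{
    assert((int)message >= (int)ExampleSeverityFatal); /* messages carry a real severity */
    if ((int)configured <= (int)ExampleSeverityNone)
    {
        return false; /* None or Unspecified: nothing is emitted in this sketch */
    }
    return (int)message <= (int)configured;
}
```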
-
enum dcgmMigProfile_t¶
Enum for the different kinds of MIG profiles.
Values:
-
enumerator DcgmMigProfileNone¶
No profile (for GPUs)
-
enumerator DcgmMigProfileGpuInstanceSlice1¶
GPU instance slice 1
-
enumerator DcgmMigProfileGpuInstanceSlice2¶
GPU instance slice 2
-
enumerator DcgmMigProfileGpuInstanceSlice3¶
GPU instance slice 3
-
enumerator DcgmMigProfileGpuInstanceSlice4¶
GPU instance slice 4
-
enumerator DcgmMigProfileGpuInstanceSlice7¶
GPU instance slice 7
-
enumerator DcgmMigProfileGpuInstanceSlice8¶
GPU instance slice 8
-
enumerator DcgmMigProfileComputeInstanceSlice1¶
compute instance slice 1
-
enumerator DcgmMigProfileComputeInstanceSlice2¶
compute instance slice 2
-
enumerator DcgmMigProfileComputeInstanceSlice3¶
compute instance slice 3
-
enumerator DcgmMigProfileComputeInstanceSlice4¶
compute instance slice 4
-
enumerator DcgmMigProfileComputeInstanceSlice7¶
compute instance slice 7
-
enumerator DcgmMigProfileComputeInstanceSlice8¶
compute instance slice 8
-
enum dcgmFBCSessionType_enum¶
Values:
-
enumerator DCGM_FBC_SESSION_TYPE_UNKNOWN¶
Unknown.
-
enumerator DCGM_FBC_SESSION_TYPE_TOSYS¶
FB capture for a system buffer.
-
enumerator DCGM_FBC_SESSION_TYPE_CUDA¶
FB capture for a cuda buffer.
-
enumerator DCGM_FBC_SESSION_TYPE_VID¶
FB capture for a Vid buffer.
-
enumerator DCGM_FBC_SESSION_TYPE_HWENC¶
FB capture for a NVENC HW buffer.
-
enum dcgmEncoderQueryType_enum¶
Values:
-
enumerator DCGM_ENCODER_QUERY_H264¶
-
enumerator DCGM_ENCODER_QUERY_HEVC¶
-
enum dcgmPolicyConditionIdx_enum¶
Enumeration for policy conditions.
When used as part of dcgmPolicy_t, these have corresponding parameters that allow them to be switched on or off, or to set specific violation thresholds.
Values:
-
enumerator DCGM_POLICY_COND_IDX_DBE¶
Double bit errors — boolean in dcgmPolicyConditionParams_t.
-
enumerator DCGM_POLICY_COND_IDX_PCI¶
PCI events/errors — boolean in dcgmPolicyConditionParams_t.
-
enumerator DCGM_POLICY_COND_IDX_MAX_PAGES_RETIRED¶
Maximum number of retired pages — number required in dcgmPolicyConditionParams_t.
-
enumerator DCGM_POLICY_COND_IDX_THERMAL¶
Thermal violation — number required in dcgmPolicyConditionParams_t.
-
enumerator DCGM_POLICY_COND_IDX_POWER¶
Power violation — number required in dcgmPolicyConditionParams_t.
-
enumerator DCGM_POLICY_COND_IDX_NVLINK¶
NVLINK errors — boolean in dcgmPolicyConditionParams_t.
-
enumerator DCGM_POLICY_COND_IDX_XID¶
XID errors — number required in dcgmPolicyConditionParams_t.
-
enum dcgmPolicyCondition_enum¶
Bitmask enumeration for policy conditions.
When used as part of dcgmPolicy_t, these have corresponding parameters that allow them to be switched on or off, or to set specific violation thresholds.
Values:
-
enumerator DCGM_POLICY_COND_DBE¶
Double bit errors — boolean in dcgmPolicyConditionParams_t.
-
enumerator DCGM_POLICY_COND_PCI¶
PCI events/errors — boolean in dcgmPolicyConditionParams_t.
-
enumerator DCGM_POLICY_COND_MAX_PAGES_RETIRED¶
Maximum number of retired pages — number required in dcgmPolicyConditionParams_t.
-
enumerator DCGM_POLICY_COND_THERMAL¶
Thermal violation — number required in dcgmPolicyConditionParams_t.
-
enumerator DCGM_POLICY_COND_POWER¶
Power violation — number required in dcgmPolicyConditionParams_t.
-
enumerator DCGM_POLICY_COND_NVLINK¶
NVLINK errors — boolean in dcgmPolicyConditionParams_t.
-
enumerator DCGM_POLICY_COND_XID¶
XID errors — number required in dcgmPolicyConditionParams_t.
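The index enumeration and the bitmask enumeration list the same conditions in the same order, which suggests that each bitmask value is a single bit at the position given by the matching index. The sketch below works under that assumption. The `EXAMPLE_` names and the values 0 through 6 are hypothetical stand-ins rather than definitions from dcgm_structs.h.

```c
#include <assert.h>

/* Hypothetical stand-ins for the index enumeration. The values 0..6 follow
 * the listing order and are assumptions. */
typedef enum
{
    EXAMPLE_COND_IDX_DBE = 0,
    EXAMPLE_COND_IDX_PCI,
    EXAMPLE_COND_IDX_MAX_PAGES_RETIRED,
    EXAMPLE_COND_IDX_THERMAL,
    EXAMPLE_COND_IDX_POWER,
    EXAMPLE_COND_IDX_NVLINK,
    EXAMPLE_COND_IDX_XID,
    EXAMPLE_COND_IDX_MAX /* number of conditions */
} exampleCondIdx_t;

/* Assumed relationship: the bitmask for a condition is 1 << its index. */
static unsigned int example_cond_mask(exampleCondIdx_t idx)
{
    assert((int)idx >= 0 && (int)idx < (int)EXAMPLE_COND_IDX_MAX);
    return 1u << (unsigned int)idx;
}

/* Counts how many conditions are enabled in a combined mask. */
static int example_count_enabled(unsigned int mask)
{
    int enabled = 0;
    for (int idx = 0; idx < (int)EXAMPLE_COND_IDX_MAX; idx++)
    {
        if (mask & (1u << (unsigned int)idx))
        {
            enabled++;
        }
    }
    return enabled;
}
```

An index suits array lookups, such as a per-condition parameter array, while the bitmask suits enabling several conditions at once.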
-
enum dcgmPolicyMode_enum¶
Enumeration for policy modes.
Values:
-
enumerator DCGM_POLICY_MODE_AUTOMATED¶
automatic mode
-
enumerator DCGM_POLICY_MODE_MANUAL¶
manual mode
-
enum dcgmPolicyIsolation_enum¶
Enumeration for policy isolation modes.
Values:
-
enumerator DCGM_POLICY_ISOLATION_NONE¶
no isolation of GPUs on error
-
enum dcgmPolicyAction_enum¶
Enumeration for policy actions.
Values:
-
enumerator DCGM_POLICY_ACTION_NONE¶
no action
-
enumerator DCGM_POLICY_ACTION_GPURESET¶
Deprecated - perform a GPU reset on violation.
-
enum dcgmPolicyValidation_enum¶
Enumeration for policy validation actions.
Values:
-
enumerator DCGM_POLICY_VALID_NONE¶
no validation after an action is performed
-
enumerator DCGM_POLICY_VALID_SV_SHORT¶
run a short System Validation on the system after failure
-
enumerator DCGM_POLICY_VALID_SV_MED¶
run a medium System Validation test after failure
-
enumerator DCGM_POLICY_VALID_SV_LONG¶
run an extensive System Validation test after failure
-
enumerator DCGM_POLICY_VALID_SV_XLONG¶
run a more extensive System Validation test after failure
-
enum dcgmPolicyFailureResp_enum¶
Enumeration for policy failure responses.
Values:
-
enumerator DCGM_POLICY_FAILURE_NONE¶
on failure of validation perform no action
-
enum dcgmHealthSystems_enum¶
Systems structure used to enable or disable health watch systems.
Values:
-
enumerator DCGM_HEALTH_WATCH_PCIE¶
PCIe system watches (must have 1m of data before query)
-
enumerator DCGM_HEALTH_WATCH_NVLINK¶
NVLINK system watches.
-
enumerator DCGM_HEALTH_WATCH_PMU¶
Power management unit watches.
-
enumerator DCGM_HEALTH_WATCH_MCU¶
Micro-controller unit watches.
-
enumerator DCGM_HEALTH_WATCH_MEM¶
Memory watches.
-
enumerator DCGM_HEALTH_WATCH_SM¶
Streaming multiprocessor watches.
-
enumerator DCGM_HEALTH_WATCH_INFOROM¶
Inforom watches.
-
enumerator DCGM_HEALTH_WATCH_THERMAL¶
Temperature watches (must have 1m of data before query)
-
enumerator DCGM_HEALTH_WATCH_POWER¶
Power watches (must have 1m of data before query)
-
enumerator DCGM_HEALTH_WATCH_DRIVER¶
Driver-related watches.
-
enumerator DCGM_HEALTH_WATCH_NVSWITCH_NONFATAL¶
Non-fatal errors in NvSwitch.
-
enumerator DCGM_HEALTH_WATCH_NVSWITCH_FATAL¶
Fatal errors in NvSwitch.
-
enumerator DCGM_HEALTH_WATCH_ALL¶
All watches enabled.
-
enum dcgmHealthWatchResult_enum¶
Health Watch test results.
Values:
-
enumerator DCGM_HEALTH_RESULT_PASS¶
All results within this system are reporting normal.
-
enumerator DCGM_HEALTH_RESULT_WARN¶
A warning has been issued, refer to the response for more information.
-
enumerator DCGM_HEALTH_RESULT_FAIL¶
A failure has been issued, refer to the response for more information.
-
enum dcgmDiagnosticLevel_t¶
Enumeration for diagnostic levels.
Values:
-
enumerator DCGM_DIAG_LVL_INVALID¶
Uninitialized.
-
enumerator DCGM_DIAG_LVL_SHORT¶
run a very basic health check on the system
-
enumerator DCGM_DIAG_LVL_MED¶
run a medium-length diagnostic (a few minutes)
-
enumerator DCGM_DIAG_LVL_LONG¶
run an extensive diagnostic (several minutes)
-
enumerator DCGM_DIAG_LVL_XLONG¶
run a very extensive diagnostic (many minutes)
-
enum dcgmDiagResult_enum¶
Diagnostic test results.
Values:
-
enumerator DCGM_DIAG_RESULT_PASS¶
This test passed the diagnostics.
-
enumerator DCGM_DIAG_RESULT_SKIP¶
This test was skipped.
-
enumerator DCGM_DIAG_RESULT_WARN¶
This test passed with warnings.
-
enumerator DCGM_DIAG_RESULT_FAIL¶
This test failed the diagnostics.
-
enumerator DCGM_DIAG_RESULT_NOT_RUN¶
This test wasn’t executed.
-
enum dcgmPerGpuTestIndices_enum¶
Diagnostic per gpu tests - fixed indices for dcgmDiagResponsePerGpu_t.results[].
Values:
-
enumerator DCGM_MEMORY_INDEX¶
Memory test index.
-
enumerator DCGM_DIAGNOSTIC_INDEX¶
Diagnostic test index.
-
enumerator DCGM_PCI_INDEX¶
PCIe test index.
-
enumerator DCGM_SM_STRESS_INDEX¶
SM Stress test index.
-
enumerator DCGM_TARGETED_STRESS_INDEX¶
Targeted Stress test index.
-
enumerator DCGM_TARGETED_POWER_INDEX¶
Targeted Power test index.
-
enumerator DCGM_MEMORY_BANDWIDTH_INDEX¶
Memory bandwidth test index.
-
enumerator DCGM_MEMTEST_INDEX¶
Memtest test index.
-
enumerator DCGM_PULSE_TEST_INDEX¶
Pulse test index.
-
enumerator DCGM_SOFTWARE_INDEX¶
Software test index.
-
enumerator DCGM_CONTEXT_CREATE_INDEX¶
Context create test index.
-
enumerator DCGM_UNKNOWN_INDEX¶
Unknown test.
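Because the enumerators serve as fixed indices into a per-GPU results array, a result can be looked up directly by test. The sketch below uses shortened, hypothetical stand-ins for the index enum, the result enum, and the per-GPU response structure. None of the `example` names or values come from dcgm_structs.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, shortened stand-in for the per-GPU test index enum. */
typedef enum
{
    EXAMPLE_MEMORY_INDEX = 0,
    EXAMPLE_DIAGNOSTIC_INDEX,
    EXAMPLE_PCI_INDEX,
    EXAMPLE_TEST_COUNT /* number of slots in results[] */
} exampleTestIndex_t;

/* Hypothetical stand-in for the diagnostic result enum. */
typedef enum
{
    EXAMPLE_RESULT_PASS = 0,
    EXAMPLE_RESULT_SKIP,
    EXAMPLE_RESULT_WARN,
    EXAMPLE_RESULT_FAIL,
    EXAMPLE_RESULT_NOT_RUN
} exampleResult_t;

/* Simplified per-GPU response: one result slot per test index. */
typedef struct
{
    unsigned int gpuId;
    exampleResult_t results[EXAMPLE_TEST_COUNT];
} examplePerGpu_t;

/* Counts failed tests by walking the fixed-index results array. */
static int example_count_failures(const examplePerGpu_t *gpu)
{
    int failures = 0;
    assert(gpu != NULL);
    for (int i = 0; i < (int)EXAMPLE_TEST_COUNT; i++)
    {
        if (gpu->results[i] == EXAMPLE_RESULT_FAIL)
        {
            failures++;
        }
    }
    return failures;
}
```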
-
enum dcgmSoftwareTest_enum¶
Values:
-
enumerator DCGM_SWTEST_BLACKLIST¶
test for presence of blacklisted drivers (e.g. nouveau)
-
enumerator DCGM_SWTEST_NVML_LIBRARY¶
test for presence (and version) of NVML lib
-
enumerator DCGM_SWTEST_CUDA_MAIN_LIBRARY¶
test for presence (and version) of CUDA lib
-
enumerator DCGM_SWTEST_CUDA_RUNTIME_LIBRARY¶
test for presence (and version) of CUDA RT lib
-
enumerator DCGM_SWTEST_PERMISSIONS¶
test for character device permissions
-
enumerator DCGM_SWTEST_PERSISTENCE_MODE¶
test for persistence mode enabled
-
enumerator DCGM_SWTEST_ENVIRONMENT¶
test for CUDA environment vars that may slow tests
-
enumerator DCGM_SWTEST_PAGE_RETIREMENT¶
test for pending frame buffer page retirement
-
enumerator DCGM_SWTEST_GRAPHICS_PROCESSES¶
test for graphics processes running
-
enumerator DCGM_SWTEST_INFOROM¶
test for inforom corruption
-
enum dcgmGpuLevel_enum¶
Represents level relationships within a system between two GPUs. The enums are spaced to allow for future relationships.
These match the definitions in nvml.h
Values:
-
enumerator DCGM_TOPOLOGY_UNINITIALIZED¶
-
enumerator DCGM_TOPOLOGY_BOARD¶
multi-GPU board
-
enumerator DCGM_TOPOLOGY_SINGLE¶
all devices that only need traverse a single PCIe switch
-
enumerator DCGM_TOPOLOGY_MULTIPLE¶
all devices that need not traverse a host bridge
-
enumerator DCGM_TOPOLOGY_HOSTBRIDGE¶
all devices that are connected to the same host bridge
-
enumerator DCGM_TOPOLOGY_CPU¶
all devices that are connected to the same CPU but possibly multiple host bridges
-
enumerator DCGM_TOPOLOGY_SYSTEM¶
all devices in the system
-
enumerator DCGM_TOPOLOGY_NVLINK1¶
GPUs connected via a single NVLINK link.
-
enumerator DCGM_TOPOLOGY_NVLINK2¶
GPUs connected via two NVLINK links.
-
enumerator DCGM_TOPOLOGY_NVLINK3¶
GPUs connected via three NVLINK links.
-
enumerator DCGM_TOPOLOGY_NVLINK4¶
GPUs connected via four NVLINK links.
-
enumerator DCGM_TOPOLOGY_NVLINK5¶
GPUs connected via five NVLINK links.
-
enumerator DCGM_TOPOLOGY_NVLINK6¶
GPUs connected via six NVLINK links.
-
enumerator DCGM_TOPOLOGY_NVLINK7¶
GPUs connected via seven NVLINK links.
-
enumerator DCGM_TOPOLOGY_NVLINK8¶
GPUs connected via eight NVLINK links.
-
enumerator DCGM_TOPOLOGY_NVLINK9¶
GPUs connected via nine NVLINK links.
-
enumerator DCGM_TOPOLOGY_NVLINK10¶
GPUs connected via ten NVLINK links.
-
enumerator DCGM_TOPOLOGY_NVLINK11¶
GPUs connected via eleven NVLINK links.
-
enumerator DCGM_TOPOLOGY_NVLINK12¶
GPUs connected via twelve NVLINK links.
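A single topology value can describe both a PCI relationship and an NVLink relationship, and the DCGM_TOPOLOGY_PATH_PCI(x) and DCGM_TOPOLOGY_PATH_NVLINK(x) macros appear intended to separate the two. The sketch below assumes PCI levels occupy the low byte and NVLink link counts occupy single bits above it. All `EXAMPLE_` values and masks are hypothetical stand-ins, not the definitions from dcgm_structs.h.

```c
#include <assert.h>

/* Hypothetical stand-in values: PCI relationships in the low byte, NVLink
 * link counts as single bits above it. These numbers are assumptions. */
#define EXAMPLE_TOPOLOGY_SINGLE     0x02u
#define EXAMPLE_TOPOLOGY_HOSTBRIDGE 0x08u
#define EXAMPLE_TOPOLOGY_NVLINK1    0x0100u
#define EXAMPLE_TOPOLOGY_NVLINK2    0x0200u

/* Assumed behavior of the DCGM_TOPOLOGY_PATH_* macros: keep one half. */
#define EXAMPLE_TOPOLOGY_PATH_PCI(x)    ((unsigned int)(x) & 0xFFu)
#define EXAMPLE_TOPOLOGY_PATH_NVLINK(x) ((unsigned int)(x) & 0xFFFFFF00u)

/* Returns the number of NVLink links encoded in a path, or 0 if none.
 * Assumes a link count of N is stored as bit (8 + N - 1). */
static int example_nvlink_count(unsigned int path)
{
    unsigned int nvlink = EXAMPLE_TOPOLOGY_PATH_NVLINK(path) >> 8;
    int count = 0;
    assert((nvlink & (nvlink - 1u)) == 0u); /* at most one NVLink bit expected */
    while (nvlink != 0u)
    {
        count++;
        nvlink >>= 1;
    }
    return count;
}
```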
-
enum dcgmIntrospectLevel_enum¶
Identifies a level to retrieve field introspection info for.
Values:
-
enumerator DCGM_INTROSPECT_LVL_INVALID¶
Invalid value.
-
enumerator DCGM_INTROSPECT_LVL_FIELD¶
Introspection data is grouped by field ID.
-
enumerator DCGM_INTROSPECT_LVL_FIELD_GROUP¶
Introspection data is grouped by field group.
-
enumerator DCGM_INTROSPECT_LVL_ALL_FIELDS¶
Introspection data is aggregated for all fields.
-
enum dcgmIntrospectState_enum¶
State of DCGM metadata gathering.
If it is set to DISABLED then “Metadata” API calls to DCGM are not supported.
Values:
-
enumerator DCGM_INTROSPECT_STATE_DISABLED¶
-
enumerator DCGM_INTROSPECT_STATE_ENABLED¶
-
enum dcgmGpuNVLinkErrorType_enum¶
Identifies a GPU NVLink error type returned by DCGM_FI_DEV_GPU_NVLINK_ERRORS.
Values:
-
enumerator DCGM_GPU_NVLINK_ERROR_RECOVERY_REQUIRED¶
NVLink link recovery error occurred.
-
enumerator DCGM_GPU_NVLINK_ERROR_FATAL¶
NVLink link fatal error occurred.
-
enum dcgmNvLinkLinkState_enum¶
NvLink link states.
Values:
-
enumerator DcgmNvLinkLinkStateNotSupported¶
NvLink is unsupported by this GPU (Default for GPUs)
-
enumerator DcgmNvLinkLinkStateDisabled¶
NvLink is supported for this link but this link is disabled (Default for NvSwitches)
-
enumerator DcgmNvLinkLinkStateDown¶
This NvLink link is down (inactive)
-
enumerator DcgmNvLinkLinkStateUp¶
This NvLink link is up (active)
-
enum dcgmModuleId_t¶
Module IDs.
Values:
-
enumerator DcgmModuleIdCore¶
Core DCGM - always loaded.
-
enumerator DcgmModuleIdNvSwitch¶
NvSwitch Module.
-
enumerator DcgmModuleIdVGPU¶
VGPU Module.
-
enumerator DcgmModuleIdIntrospect¶
Introspection Module.
-
enumerator DcgmModuleIdHealth¶
Health Module.
-
enumerator DcgmModuleIdPolicy¶
Policy Module.
-
enumerator DcgmModuleIdConfig¶
Config Module.
-
enumerator DcgmModuleIdDiag¶
GPU Diagnostic Module.
-
enumerator DcgmModuleIdProfiling¶
Profiling Module.
-
enumerator DcgmModuleIdCount¶
Always last. 1 greater than largest value above.
-
enum dcgmModuleStatus_t¶
Module Status.
Modules are lazy loaded, so they will be in status DcgmModuleStatusNotLoaded until they are used. Once modules are used, they will move to another status.
Values:
-
enumerator DcgmModuleStatusNotLoaded¶
Module has not been loaded yet.
-
enumerator DcgmModuleStatusBlacklisted¶
Module has been blacklisted from being loaded.
-
enumerator DcgmModuleStatusFailed¶
Loading the module failed.
-
enumerator DcgmModuleStatusLoaded¶
Module has been loaded.
-
enumerator DcgmModuleStatusUnloaded¶
Module has been unloaded, happens during shutdown.
-
struct dcgmConnectV2Params_v1¶
- #include <dcgm_structs.h>
Connection options for dcgmConnect_v2 (v1)
NOTE: This version is deprecated. Use dcgmConnectV2Params_v2 instead.
Public Members
-
unsigned int version¶
Version number. Use dcgmConnectV2Params_version
-
unsigned int persistAfterDisconnect¶
Whether to persist DCGM state modified by this connection once the connection is terminated. Normally, all field watches created by a connection are removed once a connection goes away. 1 = do not clean up after this connection. 0 = clean up after this connection
-
struct dcgmConnectV2Params_v2¶
- #include <dcgm_structs.h>
Connection options for dcgmConnect_v2 (v2)
Public Members
-
unsigned int version¶
Version number. Use dcgmConnectV2Params_version
-
unsigned int persistAfterDisconnect¶
Whether to persist DCGM state modified by this connection once the connection is terminated. Normally, all field watches created by a connection are removed once a connection goes away. 1 = do not clean up after this connection. 0 = clean up after this connection
-
unsigned int timeoutMs¶
When attempting to connect to the specified host engine, how long should we wait in milliseconds before giving up
-
unsigned int addressIsUnixSocket¶
Whether or not the passed-in address is a unix socket filename (1) or a TCP/IP address (0)
-
struct dcgmHostengineHealth_v1¶
- #include <dcgm_structs.h>
Typedef for dcgmHostengineHealth_v1.
-
struct dcgmGroupEntityPair_t¶
- #include <dcgm_structs.h>
Represents an entityGroupId + entityId pair to uniquely identify a given entityId inside a group of entities.
Added in DCGM 1.5.0
Public Members
-
dcgm_field_entity_group_t entityGroupId¶
Entity Group ID entity belongs to.
-
dcgm_field_eid_t entityId¶
Entity ID of the entity.
-
struct dcgmGroupInfo_v2¶
- #include <dcgm_structs.h>
Structure to store information for DCGM group.
Added in DCGM 1.5.0
Public Members
-
unsigned int version¶
Version Number (use dcgmGroupInfo_version2)
-
unsigned int count¶
count of entityIds returned in entityList
-
char groupName[256]¶
Group Name.
-
dcgmGroupEntityPair_t entityList[64]¶
List of the entities that are in this group.
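As a rough illustration of how count and entityList relate, the sketch below walks only the first count entries and clamps count to the array size. The types `group_info`, `entity_pair`, and `entity_group` and the enumerator values are hypothetical stand-ins that mirror the documented members; real code would include dcgm_structs.h and use dcgmGroupInfo_v2 directly.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_GROUP_ENTITIES 64 /* matches the documented entityList[64] */

/* Hypothetical stand-in for dcgm_field_entity_group_t; values are illustrative. */
typedef enum { ENTITY_GROUP_GPU = 1, ENTITY_GROUP_SWITCH = 3 } entity_group;

/* Hypothetical stand-in for dcgmGroupEntityPair_t. */
typedef struct {
    entity_group entityGroupId;
    unsigned int entityId;
} entity_pair;

/* Hypothetical stand-in mirroring the documented members of dcgmGroupInfo_v2. */
typedef struct {
    unsigned int version;
    unsigned int count;
    char groupName[256];
    entity_pair entityList[SKETCH_MAX_GROUP_ENTITIES];
} group_info;

/* Count the entities of one kind, reading at most `count` valid entries. */
static size_t count_entities_in_group(const group_info *g, entity_group kind)
{
    assert(g != NULL);
    size_t limit = g->count < SKETCH_MAX_GROUP_ENTITIES ? g->count
                                                        : SKETCH_MAX_GROUP_ENTITIES;
    size_t n = 0;
    for (size_t i = 0; i < limit; ++i) {
        if (g->entityList[i].entityGroupId == kind)
            ++n;
    }
    return n;
}
```

Clamping is a defensive choice made for this sketch, so that a corrupted count cannot cause a read past the end of the array.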
-
struct dcgmMigHierarchyInfo_t¶
- #include <dcgm_structs.h>
Represents a pair of entity pairings to uniquely identify an entity and its place in the hierarchy.
Public Members
-
dcgmGroupEntityPair_t entity¶
Entity id and type for the entity in question.
-
dcgmGroupEntityPair_t parent¶
Entity id and type for the parent of the entity in question.
-
dcgmMigProfile_t sliceProfile¶
Entity MIG profile identifier.
-
struct dcgmMigEntityInfo_t¶
- #include <dcgm_structs.h>
Provides additional information about location of MIG entities.
Public Members
-
char gpuUuid[128]¶
GPU UUID
-
unsigned int nvmlGpuIndex¶
GPU index from NVML
-
unsigned int nvmlInstanceId¶
GPU instance index within GPU. 0 to N. -1 for GPU entities
-
unsigned int nvmlComputeInstanceId¶
GPU Compute instance index within GPU instance. 0 to N. -1 for GPU Instance and GPU entities
-
unsigned int nvmlMigProfileId¶
Unique profile ID for GPU or Compute instances. -1 for GPU entities
See also
nvmlComputeInstanceProfileInfo_st
See also
nvmlGpuInstanceProfileInfo_st
-
unsigned int nvmlProfileSlices¶
Number of slices in the MIG profile
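The ID members above are declared unsigned int, yet the descriptions use -1 to mean "not applicable". In C, -1 stored in an unsigned int becomes UINT_MAX, so that is the value to compare against. The sketch below classifies an entity on that basis. `mig_entity_info`, `mig_kind`, and `MIG_ID_NOT_APPLICABLE` are hypothetical names introduced for illustration, not part of DCGM.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in mirroring the documented members of dcgmMigEntityInfo_t. */
typedef struct {
    char gpuUuid[128];
    unsigned int nvmlGpuIndex;
    unsigned int nvmlInstanceId;
    unsigned int nvmlComputeInstanceId;
    unsigned int nvmlMigProfileId;
    unsigned int nvmlProfileSlices;
} mig_entity_info;

/* -1 converted to unsigned int is UINT_MAX. */
#define MIG_ID_NOT_APPLICABLE UINT_MAX

typedef enum {
    MIG_KIND_GPU,
    MIG_KIND_GPU_INSTANCE,
    MIG_KIND_COMPUTE_INSTANCE
} mig_kind;

/* Decide what kind of entity a record describes from its sentinel fields. */
static mig_kind classify_mig_entity(const mig_entity_info *info)
{
    assert(info != NULL);
    if (info->nvmlInstanceId == MIG_ID_NOT_APPLICABLE)
        return MIG_KIND_GPU;              /* no GPU instance: a whole GPU */
    if (info->nvmlComputeInstanceId == MIG_ID_NOT_APPLICABLE)
        return MIG_KIND_GPU_INSTANCE;     /* GPU instance without a compute instance */
    return MIG_KIND_COMPUTE_INSTANCE;
}
```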
-
struct dcgmMigHierarchyInfo_v2¶
-
struct dcgmMigHierarchy_v1¶
- #include <dcgm_structs.h>
Structure to store the GPU hierarchy for a system.
Added in DCGM 2.0
-
struct dcgmMigHierarchy_v2¶
-
struct dcgmFieldGroupInfo_v1¶
- #include <dcgm_structs.h>
Structure to represent information about a field group.
Public Members
-
unsigned int version¶
Version number (dcgmFieldGroupInfo_version)
-
unsigned int numFieldIds¶
Number of entries in fieldIds[] that are valid.
-
dcgmFieldGrp_t fieldGroupId¶
ID of this field group.
-
char fieldGroupName[256]¶
Field Group Name.
-
unsigned short fieldIds[128]¶
Field ids that belong to this group.
-
struct dcgmAllFieldGroup_v1¶
Public Members
-
unsigned int version¶
Version number (dcgmAllFieldGroup_version)
-
unsigned int numFieldGroups¶
Number of entries in fieldGroups[] that are populated.
-
dcgmFieldGroupInfo_t fieldGroups[64]¶
Info about each field group.
-
struct dcgmErrorInfo_t¶
- #include <dcgm_structs.h>
Structure to represent error attributes.
-
struct dcgmClockSet_v1¶
- #include <dcgm_structs.h>
Represents a set of memory, SM, and video clocks for a device.
These can be current values or target values, depending on context.
-
struct dcgmDeviceSupportedClockSets_v1¶
- #include <dcgm_structs.h>
Represents list of supported clock sets for a device.
Public Members
-
unsigned int version¶
Version Number (dcgmDeviceSupportedClockSets_version)
-
unsigned int count¶
Number of supported clocks.
-
dcgmClockSet_t clockSet[256]¶
Valid clock sets for the device. Up to count entries are filled.
-
struct dcgmDevicePidAccountingStats_v1¶
- #include <dcgm_structs.h>
Represents accounting data for one process.
Public Members
-
unsigned int version¶
Version Number. Should match dcgmDevicePidAccountingStats_version.
-
unsigned int pid¶
Process id of the process these stats are for.
-
unsigned int gpuUtilization¶
Percent of time over the process’s lifetime during which one or more kernels was executing on the GPU.
Set to DCGM_INT32_NOT_SUPPORTED if not supported.
-
unsigned int memoryUtilization¶
Percent of time over the process’s lifetime during which global (device) memory was being read or written.
Set to DCGM_INT32_NOT_SUPPORTED if not supported.
-
unsigned long long maxMemoryUsage¶
Maximum total memory in bytes that was ever allocated by the process.
Set to DCGM_INT64_NOT_SUPPORTED if not supported.
-
unsigned long long startTimestamp¶
CPU Timestamp in usec representing start time for the process.
-
unsigned long long activeTimeUsec¶
Amount of time in usec during which the compute context was active.
Note that this does not mean the context was being used. endTimestamp can be computed as startTimestamp + activeTimeUsec.
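The struct has no endTimestamp member, so a caller has to derive it. A minimal sketch of that arithmetic follows, using a hypothetical stand-in type `pid_accounting_times` that carries only the two timing members documented above. Both values are in microseconds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in holding the timing members of dcgmDevicePidAccountingStats_v1. */
typedef struct {
    unsigned long long startTimestamp; /* CPU timestamp in usec */
    unsigned long long activeTimeUsec; /* time the compute context was active, in usec */
} pid_accounting_times;

/* End timestamp in usec, derived as startTimestamp + activeTimeUsec. */
static unsigned long long accounting_end_timestamp(const pid_accounting_times *s)
{
    assert(s != NULL);
    return s->startTimestamp + s->activeTimeUsec;
}
```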
-
struct dcgmDeviceThermals_v1¶
- #include <dcgm_structs.h>
Represents thermal information.
-
struct dcgmDevicePowerLimits_v1¶
- #include <dcgm_structs.h>
Represents various power limits.
Public Members
-
unsigned int version¶
Version Number.
-
unsigned int curPowerLimit¶
Power management limit associated with this device (in W)
-
unsigned int defaultPowerLimit¶
Power management limit effective at device boot (in W)
-
unsigned int enforcedPowerLimit¶
Effective power limit that the driver enforces after taking into account all limiters (in W)
-
unsigned int minPowerLimit¶
Minimum power management limit (in W)
-
unsigned int maxPowerLimit¶
Maximum power management limit (in W)
-
struct dcgmDeviceIdentifiers_v1¶
- #include <dcgm_structs.h>
Represents device identifiers.
Public Members
-
unsigned int version¶
Version Number (dcgmDeviceIdentifiers_version)
-
char brandName[256]¶
Brand Name.
-
char deviceName[256]¶
Name of the device.
-
char pciBusId[256]¶
PCI Bus ID.
-
char serial[256]¶
Serial for the device.
-
char uuid[256]¶
UUID for the device.
-
char vbios[256]¶
VBIOS version.
-
char inforomImageVersion[256]¶
Inforom Image version.
-
unsigned int pciDeviceId¶
The combined 16-bit device id and 16-bit vendor id.
-
unsigned int pciSubSystemId¶
The 32-bit Sub System Device ID.
-
char driverVersion[256]¶
Driver Version.
-
unsigned int virtualizationMode¶
Virtualization Mode.
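pciDeviceId packs two 16-bit IDs into one 32-bit value. The sketch below splits it, assuming the device ID sits in the upper 16 bits and the vendor ID in the lower 16 bits, which is the layout NVML uses for the same field. That layout is an assumption here and should be checked against the header. The helper name `split_pci_device_id` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Split a combined PCI ID into its vendor and device halves.
 * Assumed layout: bits 31..16 = device ID, bits 15..0 = vendor ID.
 */
static void split_pci_device_id(unsigned int combined, uint16_t *vendorId, uint16_t *deviceId)
{
    assert(vendorId != NULL && deviceId != NULL);
    *vendorId = (uint16_t)(combined & 0xFFFFu);
    *deviceId = (uint16_t)((combined >> 16) & 0xFFFFu);
}
```

Under this assumption, a combined value of 0x1DB410DE splits into vendor 0x10DE and device 0x1DB4.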
-
struct dcgmDeviceMemoryUsage_v1¶
- #include <dcgm_structs.h>
Represents device memory and usage.
Public Members
-
unsigned int version¶
Version Number (dcgmDeviceMemoryUsage_version)
-
unsigned int bar1Total¶
Total BAR1 size in megabytes.
-
unsigned int fbTotal¶
Total framebuffer memory in megabytes.
-
unsigned int fbUsed¶
Used framebuffer memory in megabytes.
-
unsigned int fbFree¶
Free framebuffer memory in megabytes.
-
struct dcgmDeviceVgpuUtilInfo_v1¶
- #include <dcgm_structs.h>
Represents utilization values for vGPUs running on the device.
Public Members
-
unsigned int version¶
Version Number (dcgmDeviceVgpuUtilInfo_version)
-
unsigned int vgpuId¶
vGPU instance ID
-
unsigned int smUtil¶
GPU utilization for vGPU.
-
unsigned int memUtil¶
Memory utilization for vGPU.
-
unsigned int encUtil¶
Encoder utilization for vGPU.
-
unsigned int decUtil¶
Decoder utilization for vGPU.
-
struct dcgmDeviceEncStats_v1¶
- #include <dcgm_structs.h>
Represents current encoder statistics for the given device/vGPU instance.
-
struct dcgmDeviceFbcStats_v1¶
- #include <dcgm_structs.h>
Represents current frame buffer capture sessions statistics for the given device/vGPU instance.
-
struct dcgmDeviceFbcSessionInfo_v1¶
- #include <dcgm_structs.h>
Represents information about an active FBC session on the given device/vGPU instance.
Public Members
-
unsigned int version¶
Version Number (dcgmDeviceFbcSessionInfo_version)
-
unsigned int sessionId¶
Unique session ID.
-
unsigned int pid¶
Owning process ID.
-
unsigned int vgpuId¶
vGPU instance ID (only valid on vGPU hosts, otherwise zero)
-
unsigned int displayOrdinal¶
Display identifier.
-
dcgmFBCSessionType_t sessionType¶
Type of frame buffer capture session.
-
unsigned int sessionFlags¶
Session flags.
-
unsigned int hMaxResolution¶
Max horizontal resolution supported by the capture session.
-
unsigned int vMaxResolution¶
Max vertical resolution supported by the capture session.
-
unsigned int hResolution¶
Horizontal resolution requested by caller in capture call.
-
unsigned int vResolution¶
Vertical resolution requested by caller in capture call.
-
unsigned int averageFps¶
Moving average new frames captured per second.
-
unsigned int averageLatency¶
Moving average new frame capture latency in microseconds.
-
struct dcgmDeviceFbcSessions_v1¶
- #include <dcgm_structs.h>
Represents all the active FBC sessions on the given device/vGPU instance.
Public Members
-
unsigned int version¶
Version Number (dcgmDeviceFbcSessions_version)
-
unsigned int sessionCount¶
Count of active FBC sessions.
-
dcgmDeviceFbcSessionInfo_t sessionInfo[256]¶
Info about the active FBC session.
-
struct dcgmDeviceVgpuEncSessions_v1¶
- #include <dcgm_structs.h>
Represents information about active encoder sessions on the given vGPU instance.
Public Members
-
unsigned int version¶
Version Number (dcgmDeviceVgpuEncSessions_version)
-
unsigned int vgpuId¶
vGPU instance ID
-
unsigned int sessionId¶
Unique session ID.
-
unsigned int pid¶
Process ID.
-
dcgmEncoderType_t codecType¶
Video encoder type.
-
unsigned int hResolution¶
Current encode horizontal resolution.
-
unsigned int vResolution¶
Current encode vertical resolution.
-
unsigned int averageFps¶
Moving average encode frames per second.
-
unsigned int averageLatency¶
Moving average encode latency in milliseconds.
-
struct dcgmDeviceVgpuProcessUtilInfo_v1¶
- #include <dcgm_structs.h>
Represents utilization values for processes running in vGPU VMs using the device.
Public Members
-
unsigned int version¶
Version Number (dcgmDeviceVgpuProcessUtilInfo_version)
-
unsigned int vgpuId¶
vGPU instance ID
-
unsigned int vgpuProcessSamplesCount¶
Count of processes running in the vGPU VM for which utilization rates are being reported in this cycle.
-
unsigned int pid¶
Process ID of the process running in the vGPU VM.
-
char processName[64]¶
Process Name of process running in the vGPU VM.
-
unsigned int smUtil¶
GPU utilization of process running in the vGPU VM.
-
unsigned int memUtil¶
Memory utilization of process running in the vGPU VM.
-
unsigned int encUtil¶
Encoder utilization of process running in the vGPU VM.
-
unsigned int decUtil¶
Decoder utilization of process running in the vGPU VM.
-
struct dcgmDeviceVgpuTypeInfo_v1¶
- #include <dcgm_structs.h>
Represents static info related to vGPUs supported on the device.
Public Members
-
unsigned int version¶
Version number (dcgmDeviceVgpuTypeInfo_version)
-
union dcgmDeviceVgpuTypeInfo_v1::[anonymous] vgpuTypeInfo¶
vGPU type ID and Supported vGPU type count
-
char vgpuTypeName[64]¶
vGPU type Name
-
char vgpuTypeClass[64]¶
Class of vGPU type.
-
char vgpuTypeLicense[128]¶
license of vGPU type
-
int deviceId¶
device ID of vGPU type
-
int subsystemId¶
Subsystem ID of vGPU type.
-
int numDisplayHeads¶
Count of vGPU’s supported display heads.
-
int maxInstances¶
maximum number of vGPU instances creatable on a device for given vGPU type
-
int frameRateLimit¶
Frame rate limit value of the vGPU type.
-
int maxResolutionX¶
vGPU display head’s maximum supported resolution in X dimension
-
int maxResolutionY¶
vGPU display head’s maximum supported resolution in Y dimension
-
int fbTotal¶
vGPU Total framebuffer size in megabytes
-
struct dcgmDeviceVgpuTypeInfo_v2¶
Public Members
-
unsigned int version¶
Version number (dcgmDeviceVgpuTypeInfo_version2)
-
union dcgmDeviceVgpuTypeInfo_v2::[anonymous] vgpuTypeInfo¶
vGPU type ID and Supported vGPU type count
-
char vgpuTypeName[64]¶
vGPU type Name
-
char vgpuTypeClass[64]¶
Class of vGPU type.
-
char vgpuTypeLicense[128]¶
license of vGPU type
-
int deviceId¶
device ID of vGPU type
-
int subsystemId¶
Subsystem ID of vGPU type.
-
int numDisplayHeads¶
Count of vGPU’s supported display heads.
-
int maxInstances¶
maximum number of vGPU instances creatable on a device for given vGPU type
-
int frameRateLimit¶
Frame rate limit value of the vGPU type.
-
int maxResolutionX¶
vGPU display head’s maximum supported resolution in X dimension
-
int maxResolutionY¶
vGPU display head’s maximum supported resolution in Y dimension
-
int fbTotal¶
vGPU Total framebuffer size in megabytes
-
int gpuInstanceProfileId¶
GPU Instance Profile ID for the given vGPU type.
-
struct dcgmDeviceSettings_v1¶
-
struct dcgmDeviceSettings_v2¶
-
struct dcgmDeviceAttributes_v1¶
- #include <dcgm_structs.h>
Represents attributes corresponding to a device.
Public Members
-
unsigned int version¶
Version number (dcgmDeviceAttributes_version)
-
dcgmDeviceSupportedClockSets_t clockSets¶
Supported clocks for the device.
-
dcgmDeviceThermals_t thermalSettings¶
Thermal settings for the device.
-
dcgmDevicePowerLimits_t powerLimits¶
Various power limits for the device.
-
dcgmDeviceIdentifiers_t identifiers¶
Identifiers for the device.
-
dcgmDeviceMemoryUsage_t memoryUsage¶
Memory usage info for the device.
-
char unused[208]¶
Unused Space. Set to 0 for now.
-
struct dcgmDeviceAttributes_v2¶
Public Members
-
unsigned int version¶
Version number (dcgmDeviceAttributes_version)
-
dcgmDeviceSupportedClockSets_t clockSets¶
Supported clocks for the device.
-
dcgmDeviceThermals_t thermalSettings¶
Thermal settings for the device.
-
dcgmDevicePowerLimits_t powerLimits¶
Various power limits for the device.
-
dcgmDeviceIdentifiers_t identifiers¶
Identifiers for the device.
-
dcgmDeviceMemoryUsage_t memoryUsage¶
Memory usage info for the device.
-
dcgmDeviceSettings_v1 settings¶
Basic device settings.
-
struct dcgmDeviceAttributes_v3¶
Public Members
-
unsigned int version¶
Version number (dcgmDeviceAttributes_version)
-
dcgmDeviceSupportedClockSets_t clockSets¶
Supported clocks for the device.
-
dcgmDeviceThermals_t thermalSettings¶
Thermal settings for the device.
-
dcgmDevicePowerLimits_t powerLimits¶
Various power limits for the device.
-
dcgmDeviceIdentifiers_t identifiers¶
Identifiers for the device.
-
dcgmDeviceMemoryUsage_t memoryUsage¶
Memory usage info for the device.
-
dcgmDeviceSettings_v2 settings¶
Basic device settings.
-
struct dcgmConfigPerfStateSettings_t¶
- #include <dcgm_structs.h>
Used to represent Performance state settings.
Public Members
-
unsigned int syncBoost¶
Sync Boost Mode (0: Disabled, 1 : Enabled, DCGM_INT32_BLANK : Ignored).
Note that using this setting may result in lower clocks than targetClocks
-
dcgmClockSet_t targetClocks¶
Target clocks.
Set smClock and memClock to DCGM_INT32_BLANK to ignore/use compatible values. For GPUs > Maxwell, setting this implies autoBoost=0
-
struct dcgmConfigPowerLimit_t¶
- #include <dcgm_structs.h>
Used to represent the power capping limit for each GPU in the group or to represent the power budget for the entire group.
Public Members
-
dcgmConfigPowerLimitType_t type¶
Flag to represent power cap for each GPU or power budget for the group of GPUs.
-
unsigned int val¶
Power Limit in Watts (Set a value OR DCGM_INT32_BLANK to Ignore)
-
struct dcgmConfig_v1¶
- #include <dcgm_structs.h>
Structure to represent default and target configuration for a device.
Public Members
-
unsigned int version¶
Version number (dcgmConfig_version)
-
unsigned int gpuId¶
GPU ID.
-
unsigned int eccMode¶
ECC Mode (0: Disabled, 1 : Enabled, DCGM_INT32_BLANK : Ignored)
-
unsigned int computeMode¶
Compute Mode (One of DCGM_CONFIG_COMPUTEMODE_? OR DCGM_INT32_BLANK to Ignore)
-
dcgmConfigPerfStateSettings_t perfState¶
Performance State Settings (clocks / boost mode)
-
dcgmConfigPowerLimit_t powerLimit¶
Power Limits.
-
struct dcgmPolicyViolation_v1¶
Public Members
-
unsigned int version¶
Version number (dcgmPolicyViolation_version)
-
unsigned int notifyOnEccDbe¶
true/false notification on ECC Double Bit Errors
-
unsigned int notifyOnPciEvent¶
true/false notification on PCI Events
-
unsigned int notifyOnMaxRetiredPages¶
number of retired pages to occur before notification
-
struct dcgmPolicyConditionParams_st¶
- #include <dcgm_structs.h>
Structure for policy condition parameters.
This structure contains a tag that represents the type of the value being passed as well as a “val” which is a union of the possible value types. For example, to pass a true boolean: tag = BOOL, val.boolean = 1.
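The tag-plus-union pattern described above can be sketched as follows. The type `policy_condition_param`, the tag names, and the `llval` member are hypothetical stand-ins chosen for illustration. Only the BOOL tag and `val.boolean` are named in the description, so the exact declarations should be checked in dcgm_structs.h.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag values; the real enumerators live in dcgm_structs.h. */
typedef enum { PARAM_TAG_BOOL, PARAM_TAG_LLONG } param_tag;

/* Hypothetical stand-in for the tagged union described above. */
typedef struct {
    param_tag tag;            /* says which union member is valid */
    union {
        unsigned int boolean; /* valid when tag == PARAM_TAG_BOOL */
        long long llval;      /* valid when tag == PARAM_TAG_LLONG (assumed member) */
    } val;
} policy_condition_param;

/* Build a boolean parameter: tag = BOOL, val.boolean = 0 or 1. */
static policy_condition_param make_bool_param(int enabled)
{
    policy_condition_param p;
    p.tag = PARAM_TAG_BOOL;
    p.val.boolean = enabled ? 1u : 0u;
    return p;
}

/* Build a numeric parameter, for example a threshold. */
static policy_condition_param make_llong_param(long long value)
{
    policy_condition_param p;
    p.tag = PARAM_TAG_LLONG;
    p.val.llval = value;
    return p;
}

/* Read the parameter as a number, looking only at the member the tag selects. */
static long long param_as_llong(const policy_condition_param *p)
{
    assert(p != NULL);
    return (p->tag == PARAM_TAG_BOOL) ? (long long)p->val.boolean : p->val.llval;
}
```

Reading through the tag avoids interpreting the bytes of one union member as another.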
-
struct dcgmPolicyViolationNotify_t¶
- #include <dcgm_structs.h>
Structure to fill when a user queries for policy violations.
Public Members
-
unsigned int gpuId¶
gpu ID
-
unsigned int violationOccurred¶
a violation based on the bit values in dcgmPolicyCondition_t
-
struct dcgmPolicy_v1¶
- #include <dcgm_structs.h>
Define the structure that specifies a policy to be enforced for a GPU.
Public Members
-
unsigned int version¶
version number (dcgmPolicy_version)
-
dcgmPolicyCondition_t condition¶
Condition(s) to access dcgmPolicyCondition_t.
-
dcgmPolicyMode_t mode¶
Mode of operation dcgmPolicyMode_t.
-
dcgmPolicyIsolation_t isolation¶
Isolation level after a policy violation dcgmPolicyIsolation_t.
-
dcgmPolicyAction_t action¶
Action to perform after a policy violation dcgmPolicyAction_t action.
-
dcgmPolicyValidation_t validation¶
Validation to perform after action is taken dcgmPolicyValidation_t.
-
dcgmPolicyFailureResp_t response¶
Failure to validation response dcgmPolicyFailureResp_t.
-
dcgmPolicyConditionParams_t parms[7]¶
Parameters for the condition fields.
-
struct dcgmPolicyConditionDbe_t¶
- #include <dcgm_structs.h>
Define the ECC DBE return structure.
Public Members
-
long long timestamp¶
timestamp of the error
-
enum dcgmPolicyConditionDbe_t::[anonymous] location¶
location of the error
-
unsigned int numerrors¶
number of errors
-
struct dcgmPolicyConditionPci_t¶
- #include <dcgm_structs.h>
Define the PCI replay error return structure.
-
struct dcgmPolicyConditionMpr_t¶
- #include <dcgm_structs.h>
Define the maximum pending retired pages limit return structure.
-
struct dcgmPolicyConditionThermal_t¶
- #include <dcgm_structs.h>
Define the thermal policy violations return structure.
-
struct dcgmPolicyConditionPower_t¶
- #include <dcgm_structs.h>
Define the power policy violations return structure.
-
struct dcgmPolicyConditionNvlink_t¶
- #include <dcgm_structs.h>
Define the nvlink policy violations return structure.
-
struct dcgmPolicyConditionXID_t¶
- #include <dcgm_structs.h>
Define the xid policy violations return structure.
-
struct dcgmPolicyCallbackResponse_v1¶
- #include <dcgm_structs.h>
Define the structure that is given to the callback function.
Public Members
-
unsigned int version¶
version number (dcgmPolicyCallbackResponse_version)
-
dcgmPolicyCondition_t condition¶
Condition that was violated.
-
dcgmPolicyConditionDbe_t dbe¶
ECC DBE return structure.
-
dcgmPolicyConditionPci_t pci¶
PCI replay error return structure.
-
dcgmPolicyConditionMpr_t mpr¶
Max retired pages limit return structure.
-
dcgmPolicyConditionThermal_t thermal¶
Thermal policy violations return structure.
-
dcgmPolicyConditionPower_t power¶
Power policy violations return structure.
-
dcgmPolicyConditionNvlink_t nvlink¶
Nvlink policy violations return structure.
-
dcgmPolicyConditionXID_t xid¶
XID policy violations return structure.
-
struct dcgmFieldValue_v1¶
- #include <dcgm_structs.h>
This structure is used to represent value for the field to be queried.
Public Members
-
unsigned int version¶
version number (dcgmFieldValue_version1)
-
unsigned short fieldId¶
One of DCGM_FI_?
-
unsigned short fieldType¶
One of DCGM_FT_?
-
int status¶
Status of querying the field. DCGM_ST_OK or one of DCGM_ST_?
-
int64_t ts¶
Timestamp in usec since 1970.
-
int64_t i64¶
Int64 value.
-
double dbl¶
Double value.
-
char str[256]¶
NULL terminated string.
-
char blob[4096]¶
Binary blob.
-
union dcgmFieldValue_v1::[anonymous] value¶
Value.
-
struct dcgmFieldValue_v2¶
- #include <dcgm_structs.h>
This structure is used to represent value for the field to be queried.
Public Members
-
unsigned int version¶
version number (dcgmFieldValue_version2)
-
dcgm_field_entity_group_t entityGroupId¶
Entity group this field value’s entity belongs to.
-
dcgm_field_eid_t entityId¶
Entity this field value belongs to.
-
unsigned short fieldId¶
One of DCGM_FI_?
-
unsigned short fieldType¶
One of DCGM_FT_?
-
int status¶
Status of querying the field. DCGM_ST_OK or one of DCGM_ST_?
-
unsigned int unused¶
Unused for now to align ts to an 8-byte boundary.
-
int64_t ts¶
Timestamp in usec since 1970.
-
int64_t i64¶
Int64 value.
-
double dbl¶
Double value.
-
char str[256]¶
NULL terminated string.
-
char blob[4096]¶
Binary blob.
-
union dcgmFieldValue_v2::[anonymous] value¶
Value.
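Because value is a union, a reader should check status first and then use fieldType to pick the member to read. The sketch below shows that order of checks on a reduced, hypothetical stand-in type `field_value`. The `FT_*` character codes are assumed to match the DCGM_FT_? constants and should be verified against dcgm_fields.h. Zero is used in place of DCGM_ST_OK.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed field-type codes standing in for DCGM_FT_DOUBLE, DCGM_FT_INT64, DCGM_FT_STRING. */
enum { FT_DOUBLE = 'd', FT_INT64 = 'i', FT_STRING = 's' };

/* Hypothetical, reduced stand-in for dcgmFieldValue_v2. */
typedef struct {
    unsigned short fieldId;
    unsigned short fieldType;
    int status;   /* 0 stands in for DCGM_ST_OK */
    int64_t ts;   /* usec since 1970 */
    union {
        int64_t i64;
        double dbl;
        char str[256];
    } value;
} field_value;

/* Format a field value as text; returns what snprintf returns. */
static int format_field_value(const field_value *fv, char *out, size_t outLen)
{
    assert(fv != NULL && out != NULL && outLen > 0);
    if (fv->status != 0)
        return snprintf(out, outLen, "error %d", fv->status);
    switch (fv->fieldType) {
    case FT_INT64:
        return snprintf(out, outLen, "%lld", (long long)fv->value.i64);
    case FT_DOUBLE:
        return snprintf(out, outLen, "%.2f", fv->value.dbl);
    case FT_STRING:
        return snprintf(out, outLen, "%s", fv->value.str);
    default:
        return snprintf(out, outLen, "unsupported type %d", (int)fv->fieldType);
    }
}
```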
-
struct dcgmStatSummaryInt64_t¶
- #include <dcgm_structs.h>
Summary of time series data in int64 format.
Each value will either be set or be a BLANK value. Check for blank with the DCGM_INT64_IS_BLANK() macro.
See also
See dcgmvalue.h for the actual values of BLANK values
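A caller should test each summary value for blankness before doing arithmetic on it. The sketch below illustrates the idea with hypothetical stand-ins: `SKETCH_INT64_BLANK` and `SKETCH_INT64_IS_BLANK` play the role of the real macros, under the assumption that blank sentinels occupy the top of the int64 range, and the member names of `stat_summary_i64` are likewise assumed. Real code should use DCGM_INT64_IS_BLANK() from dcgmvalue.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed sentinel base; consult dcgmvalue.h for the real definition. */
#define SKETCH_INT64_BLANK INT64_C(0x7ffffffffffffff0)
#define SKETCH_INT64_IS_BLANK(v) ((v) >= SKETCH_INT64_BLANK)

/* Hypothetical stand-in for dcgmStatSummaryInt64_t. */
typedef struct {
    int64_t minValue;
    int64_t maxValue;
    int64_t average;
} stat_summary_i64;

/*
 * Compute max - min only when both ends are set.
 * Returns 1 and writes *range on success, 0 if either value is blank.
 */
static int summary_range(const stat_summary_i64 *s, int64_t *range)
{
    assert(s != NULL && range != NULL);
    if (SKETCH_INT64_IS_BLANK(s->minValue) || SKETCH_INT64_IS_BLANK(s->maxValue))
        return 0;
    *range = s->maxValue - s->minValue;
    return 1;
}
```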
-
struct dcgmStatSummaryInt32_t¶
- #include <dcgm_structs.h>
Same as dcgmStatSummaryInt64_t, but with 32-bit integer values.
-
struct dcgmStatSummaryFp64_t¶
- #include <dcgm_structs.h>
Summary of time series data in double-precision format.
Each value will either be set or be a BLANK value. Check for blank with the DCGM_FP64_IS_BLANK() macro.
See also
See dcgmvalue.h for the actual values of BLANK values
-
struct dcgmDiagErrorDetail_t¶
-
struct dcgmIncidentInfo_t¶
Public Members
-
dcgmHealthSystems_t system¶
system to which this information belongs
-
dcgmHealthWatchResults_t health¶
health diagnosis of this incident
-
dcgmDiagErrorDetail_t error¶
Information about the error(s) and their error codes.
-
dcgmGroupEntityPair_t entityInfo¶
identify which entity has this error
-
struct dcgmHealthResponse_v4¶
- #include <dcgm_structs.h>
Health response structure version 4 - Simply list the incidents instead of reporting by entity.
Since DCGM 2.0
Public Members
-
unsigned int version¶
The version number of this struct.
-
dcgmHealthWatchResults_t overallHealth¶
The overall health of this entire host.
-
unsigned int incidentCount¶
The number of health incidents reported in this struct.
-
dcgmIncidentInfo_t incidents[64]¶
Report of the errors detected.
-
struct dcgmHealthSetParams_v2¶
- #include <dcgm_structs.h>
Structure used to set health watches via the dcgmHealthSet_v2 API.
Public Members
-
unsigned int version¶
Version of this struct. Should be dcgmHealthSet_version2
-
dcgmGpuGrp_t groupId¶
Group ID representing collection of one or more entities. Look at dcgmGroupCreate for details on creating the group. Alternatively, pass in the group id as DCGM_GROUP_ALL_GPUS to perform operation on all the GPUs or DCGM_GROUP_ALL_NVSWITCHES to perform operation on all the NvSwitches.
-
dcgmHealthSystems_t systems¶
An enum representing systems that should be enabled for health checks logically OR’d together. Refer to dcgmHealthSystems_t for details.
-
long long updateInterval¶
How often to query the underlying health information from the NVIDIA driver in usec. This should be the same as how often you call dcgmHealthCheck
-
double maxKeepAge¶
How long to keep data cached for this field in seconds. This should be at least your maximum time between calling dcgmHealthCheck
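updateInterval and maxKeepAge use different units (microseconds and seconds), which is easy to get wrong. The sketch below derives both from a single polling period in seconds. `health_watch_timing` and `timing_for_check_period` are hypothetical names. The factor of two on maxKeepAge is an arbitrary safety margin chosen for this sketch, not a documented recommendation.

```c
#include <assert.h>

/* Hypothetical holder for the two timing members of dcgmHealthSetParams_v2. */
typedef struct {
    long long updateInterval; /* microseconds */
    double maxKeepAge;        /* seconds */
} health_watch_timing;

/* Derive both settings from how often the caller intends to call dcgmHealthCheck. */
static health_watch_timing timing_for_check_period(double checkPeriodSec)
{
    assert(checkPeriodSec > 0.0);
    health_watch_timing t;
    t.updateInterval = (long long)(checkPeriodSec * 1000000.0); /* sec -> usec */
    t.maxKeepAge = 2.0 * checkPeriodSec; /* at least one period; 2x is an arbitrary margin */
    return t;
}
```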
-
struct dcgmProcessUtilInfo_t¶
- #include <dcgm_structs.h>
per process utilization rates
-
struct dcgmProcessUtilSample_t¶
- #include <dcgm_structs.h>
Internal structure used to get the PID and the corresponding utilization rate.
-
struct dcgmPidSingleInfo_t¶
- #include <dcgm_structs.h>
Info corresponding to single PID.
Public Members
-
unsigned int gpuId¶
ID of the GPU this pertains to. GPU_ID_INVALID = summary information for multiple GPUs.
-
long long energyConsumed¶
Energy consumed by the GPU in milliwatt-seconds.
-
dcgmStatSummaryInt64_t pcieRxBandwidth¶
PCI-E bytes read from the GPU.
-
dcgmStatSummaryInt64_t pcieTxBandwidth¶
PCI-E bytes written to the GPU.
-
long long pcieReplays¶
Count of PCI-E replays that occurred.
-
long long startTime¶
Process start time in microseconds since 1970.
-
long long endTime¶
Process end time in microseconds since 1970 or reported as 0 if the process is not completed.
-
dcgmProcessUtilInfo_t processUtilization¶
Process SM and Memory Utilization (in percent)
-
dcgmStatSummaryInt32_t smUtilization¶
GPU SM Utilization in percent.
-
dcgmStatSummaryInt32_t memoryUtilization¶
GPU Memory Utilization in percent.
-
unsigned int eccSingleBit¶
Deprecated - Count of ECC single bit errors that occurred.
-
unsigned int eccDoubleBit¶
Count of ECC double bit errors that occurred.
-
dcgmStatSummaryInt32_t memoryClock¶
Memory clock in MHz.
-
dcgmStatSummaryInt32_t smClock¶
SM clock in MHz.
-
int numXidCriticalErrors¶
Number of valid entries in xidCriticalErrorsTs.
-
long long xidCriticalErrorsTs[10]¶
Timestamps of the critical XID errors that occurred.
-
int numOtherComputePids¶
Count of otherComputePids entries that are valid.
-
unsigned int otherComputePids[16]¶
Other compute processes that ran. 0=no process.
-
int numOtherGraphicsPids¶
Count of otherGraphicsPids entries that are valid.
-
unsigned int otherGraphicsPids[16]¶
Other graphics processes that ran. 0=no process.
-
long long maxGpuMemoryUsed¶
Maximum amount of GPU memory that was used in bytes.
-
long long powerViolationTime¶
Number of microseconds we were at reduced clocks due to power violation.
-
long long thermalViolationTime¶
Number of microseconds we were at reduced clocks due to thermal violation.
-
long long reliabilityViolationTime¶
Number of microseconds we were at reduced clocks due to the reliability limit.
-
long long boardLimitViolationTime¶
Number of microseconds we were at reduced clocks due to being at the board’s max voltage.
-
long long lowUtilizationTime¶
Number of microseconds we were at reduced clocks due to low utilization.
-
long long syncBoostTime¶
Number of microseconds we were at reduced clocks due to sync boost.
-
dcgmHealthWatchResults_t overallHealth¶
The overall health of the system. dcgmHealthWatchResults_t.
-
dcgmHealthSystems_t system¶
system to which this information belongs
-
dcgmHealthWatchResults_t health¶
health of the specified system on this GPU
-
struct dcgmPidInfo_v2¶
- #include <dcgm_structs.h>
To store process statistics.
Public Members
-
unsigned int version¶
Version of this message (dcgmPidInfo_version)
-
unsigned int pid¶
PID of the process.
-
int numGpus¶
Number of GPUs that are valid in gpus[].
-
dcgmPidSingleInfo_t summary¶
Summary information for all GPUs listed in gpus[].
-
dcgmPidSingleInfo_t gpus[32]¶
Per-GPU information for this PID.
-
struct dcgmGpuUsageInfo_t¶
- #include <dcgm_structs.h>
Info corresponding to the job on a GPU.
Public Members
-
unsigned int gpuId¶
ID of the GPU this pertains to. GPU_ID_INVALID = summary information for multiple GPUs.
-
long long energyConsumed¶
Energy consumed in milliwatt-seconds.
-
dcgmStatSummaryFp64_t powerUsage¶
Power usage Min/Max/Avg in watts.
-
dcgmStatSummaryInt64_t pcieRxBandwidth¶
PCI-E bytes read from the GPU.
-
dcgmStatSummaryInt64_t pcieTxBandwidth¶
PCI-E bytes written to the GPU.
-
long long pcieReplays¶
Count of PCI-E replays that occurred.
-
long long startTime¶
User provided job start time in microseconds since 1970.
-
long long endTime¶
User provided job end time in microseconds since 1970.
-
dcgmStatSummaryInt32_t smUtilization¶
GPU SM Utilization in percent.
-
dcgmStatSummaryInt32_t memoryUtilization¶
GPU Memory Utilization in percent.
-
unsigned int eccSingleBit¶
Deprecated - Count of ECC single bit errors that occurred.
-
unsigned int eccDoubleBit¶
Count of ECC double bit errors that occurred.
-
dcgmStatSummaryInt32_t memoryClock¶
Memory clock in MHz.
-
dcgmStatSummaryInt32_t smClock¶
SM clock in MHz.
-
int numXidCriticalErrors¶
Number of valid entries in xidCriticalErrorsTs.
-
long long xidCriticalErrorsTs[10]¶
Timestamps of the critical XID errors that occurred.
-
int numComputePids¶
Count of computePidInfo entries that are valid.
-
dcgmProcessUtilInfo_t computePidInfo[16]¶
List of compute processes that ran during the job.
0=no process
-
int numGraphicsPids¶
Count of graphicsPidInfo entries that are valid.
-
dcgmProcessUtilInfo_t graphicsPidInfo[16]¶
List of graphics processes that ran during the job.
0=no process
-
long long maxGpuMemoryUsed¶
Maximum amount of GPU memory that was used in bytes.
-
long long powerViolationTime¶
Number of microseconds we were at reduced clocks due to power violation.
-
long long thermalViolationTime¶
Number of microseconds we were at reduced clocks due to thermal violation.
-
long long reliabilityViolationTime¶
Number of microseconds we were at reduced clocks due to the reliability limit.
-
long long boardLimitViolationTime¶
Number of microseconds we were at reduced clocks due to being at the board’s max voltage.
-
long long lowUtilizationTime¶
Number of microseconds we were at reduced clocks due to low utilization.
-
long long syncBoostTime¶
Number of microseconds we were at reduced clocks due to sync boost.
-
dcgmHealthWatchResults_t overallHealth¶
The overall health of the system. See dcgmHealthWatchResults_t.
-
dcgmHealthSystems_t system¶
system to which this information belongs
-
dcgmHealthWatchResults_t health¶
health of the specified system on this GPU
-
struct dcgmJobInfo_v3¶
- #include <dcgm_structs.h>
To store job statistics.
The following fields are not applicable in the summary info:
pcieRxBandwidth (Min/Max)
pcieTxBandwidth (Min/Max)
smUtilization (Min/Max)
memoryUtilization (Min/Max)
memoryClock (Min/Max)
smClock (Min/Max)
processSamples
The average value in the above fields (in the summary) is the average of the averages of the respective fields from all GPUs.
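As a rough illustration of the "average of the averages" rule above, the following sketch weights every GPU equally regardless of how many samples it contributed. The helper name `ex_average_of_averages` is hypothetical, and the sketch assumes every per-GPU average passed in is a valid (non-blank) value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: combines per-GPU averages into a summary average.
 * Each GPU counts equally; sample counts are not taken into account. */
static double ex_average_of_averages(const double *perGpuAvg, size_t numGpus)
{
    double sum = 0.0;

    assert(perGpuAvg != NULL || numGpus == 0);
    if (numGpus == 0)
        return 0.0; /* nothing to average */

    for (size_t i = 0; i < numGpus; ++i)
        sum += perGpuAvg[i];

    return sum / (double)numGpus;
}
```

Because each GPU counts once, a GPU that ran for only part of the job influences the summary as much as one that ran for all of it.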
Public Members
-
unsigned int version¶
Version of this message (dcgmJobInfo_version)
-
int numGpus¶
Number of GPUs that are valid in gpus[].
-
dcgmGpuUsageInfo_t summary¶
Summary information for all GPUs listed in gpus[].
-
dcgmGpuUsageInfo_t gpus[32]¶
Per-GPU information for this job.
-
struct dcgmRunningProcess_v1¶
- #include <dcgm_structs.h>
Running process information for a compute or graphics process.
-
struct dcgmDiagTestResult_v1¶
Public Members
-
dcgmDiagResult_t status¶
The result of the test.
-
char warning[1024]¶
Warning returned from the test, if any.
-
char info[1024]¶
Information details returned from the test, if any.
-
struct dcgmDiagTestResult_v2¶
Public Members
-
dcgmDiagResult_t status¶
The result of the test.
-
dcgmDiagErrorDetail_t error¶
The error message and error code, if any.
-
char info[1024]¶
Information details returned from the test, if any.
-
struct dcgmDiagResponsePerGpu_v2¶
- #include <dcgm_structs.h>
Per GPU diagnostics result structure.
Public Members
-
unsigned int gpuId¶
ID of the GPU this information pertains to.
-
unsigned int hwDiagnosticReturn¶
Per GPU hardware diagnostic test return code.
-
dcgmDiagTestResult_v2 results[7]¶
Array with a result for each per-gpu test.
-
struct dcgmDiagResponsePerGpu_v3¶
- #include <dcgm_structs.h>
Per gpu response structure v3.
Since DCGM 2.4
Public Members
-
unsigned int gpuId¶
ID of the GPU this information pertains to.
-
unsigned int hwDiagnosticReturn¶
Per GPU hardware diagnostic test return code.
-
dcgmDiagTestResult_v2 results[9]¶
Array with a result for each per-gpu test.
-
struct dcgmDiagResponse_v6¶
- #include <dcgm_structs.h>
Global diagnostics result structure v6.
Since DCGM 2.0
Public Members
-
unsigned int version¶
version number (dcgmDiagResponse_version)
-
unsigned int gpuCount¶
number of valid per GPU results
-
unsigned int levelOneTestCount¶
number of valid levelOne results
-
dcgmDiagTestResult_v2 levelOneResults[16]¶
Basic, system-wide test results.
-
dcgmDiagResponsePerGpu_v2 perGpuResponses[32]¶
per GPU test results
-
dcgmDiagErrorDetail_t systemError¶
System-wide error reported from NVVS.
-
char trainingMsg[1024]¶
Training Message.
-
struct dcgmDiagResponse_v7¶
- #include <dcgm_structs.h>
Global diagnostics result structure v7.
Since DCGM 2.4
Public Members
-
unsigned int version¶
version number (dcgmDiagResponse_version)
-
unsigned int gpuCount¶
number of valid per GPU results
-
unsigned int levelOneTestCount¶
number of valid levelOne results
-
dcgmDiagTestResult_v2 levelOneResults[16]¶
Basic, system-wide test results.
-
dcgmDiagResponsePerGpu_v3 perGpuResponses[32]¶
per GPU test results
-
dcgmDiagErrorDetail_t systemError¶
System-wide error reported from NVVS.
-
char trainingMsg[1024]¶
Training Message.
-
struct dcgmDeviceTopology_v1¶
- #include <dcgm_structs.h>
Device topology information.
Public Members
-
unsigned int version¶
version number (dcgmDeviceTopology_version)
-
unsigned long cpuAffinityMask[8]¶
affinity mask for the specified GPU.
A 1 represents affinity to the CPU in that bit position. Supports up to 256 cores.
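The bit-per-CPU layout described above can be tested with a small helper. This is only a sketch: `ex_cpu_has_affinity` and `EX_AFFINITY_WORDS` are hypothetical names, and the mapping of CPU n to bit (n mod word width) of word (n div word width) is an assumption about how the words are ordered, not something the text states.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Stand-in for the array length used by cpuAffinityMask[8]. */
#define EX_AFFINITY_WORDS 8

/* Returns 1 if the bit for `cpu` is set in the mask, 0 otherwise.
 * Assumption: CPU n lives in word (n / bitsPerWord), bit (n % bitsPerWord). */
static int ex_cpu_has_affinity(const unsigned long mask[EX_AFFINITY_WORDS], unsigned int cpu)
{
    const unsigned int bitsPerWord = (unsigned int)(sizeof(unsigned long) * CHAR_BIT);

    assert(mask != NULL);
    if (cpu / bitsPerWord >= EX_AFFINITY_WORDS)
        return 0; /* beyond what the mask can describe */

    return (int)((mask[cpu / bitsPerWord] >> (cpu % bitsPerWord)) & 1UL);
}
```

The same helper would apply to groupCpuAffinityMask in dcgmGroupTopology_v1, which uses the same array shape.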
-
unsigned int numGpus¶
number of valid entries in gpuPaths
-
unsigned int gpuId¶
gpuId that this path leads to
-
dcgmGpuTopologyLevel_t path¶
path to the gpuId from this GPU.
Note that this is a bit-mask of DCGM_TOPOLOGY_* values and can contain both PCIe topology and NvLink topology where applicable. For instance: 0x210 = DCGM_TOPOLOGY_CPU | DCGM_TOPOLOGY_NVLINK2. Use the macros DCGM_TOPOLOGY_PATH_NVLINK and DCGM_TOPOLOGY_PATH_PCI to mask the NvLink and PCI paths, respectively.
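To make the 0x210 example concrete, the sketch below splits a combined path value into its two parts. Everything prefixed `EX_` or `ex_` is a hypothetical stand-in: the individual bit values and the low-byte/high-bits split are assumptions chosen to be consistent with the 0x210 example, and real code should rely on the DCGM_TOPOLOGY_* values and DCGM_TOPOLOGY_PATH_* macros from dcgm_structs.h instead.

```c
#include <assert.h>

/* Hypothetical stand-ins, consistent with 0x210 = CPU | NVLINK2. */
#define EX_TOPOLOGY_CPU     0x10u
#define EX_TOPOLOGY_NVLINK2 0x200u
#define EX_PCI_BITS         0x000000FFu /* assumption: PCIe level is in the low byte */
#define EX_NVLINK_BITS      0xFFFFFF00u /* assumption: NvLink level is above it */

/* Keep only the PCIe portion of a combined path value. */
static unsigned int ex_pci_part(unsigned int path)
{
    return path & EX_PCI_BITS;
}

/* Keep only the NvLink portion of a combined path value. */
static unsigned int ex_nvlink_part(unsigned int path)
{
    return path & EX_NVLINK_BITS;
}

/* Self-check against the 0x210 example from the text. */
static void ex_check_example(void)
{
    assert(ex_pci_part(0x210u) == EX_TOPOLOGY_CPU);
    assert(ex_nvlink_part(0x210u) == EX_TOPOLOGY_NVLINK2);
}
```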
-
unsigned int localNvLinkIds¶
bits representing the local links connected to gpuId.
For example, if this field == 3, links 0 and 1 are connected. The field is only valid if NvLinks actually exist between the GPUs.
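The link bit-field above can be expanded into a list of link indices as follows. The helper name `ex_connected_links` is hypothetical; the sketch only assumes that bit i corresponds to link i, as the "3 means links 0 and 1" example implies.

```c
#include <assert.h>
#include <stddef.h>

/* Writes the indices of the set bits in localNvLinkIds to out[] (at most
 * outCap entries) and returns the total number of set bits. */
static size_t ex_connected_links(unsigned int localNvLinkIds, unsigned int *out, size_t outCap)
{
    size_t n = 0;

    assert(out != NULL || outCap == 0);
    for (unsigned int link = 0; localNvLinkIds != 0; ++link, localNvLinkIds >>= 1) {
        if (localNvLinkIds & 1u) {
            if (n < outCap)
                out[n] = link; /* record only while there is room */
            ++n;
        }
    }
    return n;
}
```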
-
struct dcgmGroupTopology_v1¶
- #include <dcgm_structs.h>
Group topology information.
Public Members
-
unsigned int version¶
version number (dcgmGroupTopology_version)
-
unsigned long groupCpuAffinityMask[8]¶
the CPU affinity mask for all GPUs in the group
a 1 represents affinity to the CPU in that bit position supports up to 256 cores
-
unsigned int numaOptimalFlag¶
a zero value indicates that 1 or more GPUs in the group have a different CPU affinity and thus may not be optimal for certain algorithms
-
dcgmGpuTopologyLevel_t slowestPath¶
the slowest path amongst GPUs in the group
-
struct dcgmIntrospectContext_v1¶
- #include <dcgm_structs.h>
Identifies the retrieval context for introspection API calls.
Public Members
-
unsigned int version¶
version number (dcgmIntrospectContext_version)
-
dcgmIntrospectLevel_t introspectLvl¶
Introspect Level dcgmIntrospectLevel_t.
-
dcgmGpuGrp_t fieldGroupId¶
Only needed if introspectLvl is DCGM_INTROSPECT_LVL_FIELD_GROUP.
-
unsigned short fieldId¶
Only needed if introspectLvl is DCGM_INTROSPECT_LVL_FIELD.
-
unsigned long long contextId¶
Overloaded way to access both fieldGroupId and fieldId.
-
struct dcgmIntrospectFieldsExecTime_v1¶
- #include <dcgm_structs.h>
DCGM Execution time info for a set of fields.
Public Members
-
unsigned int version¶
version number (dcgmIntrospectFieldsExecTime_version)
-
long long meanUpdateFreqUsec¶
the mean update frequency of all fields
-
double recentUpdateUsec¶
the sum of every field’s most recent execution time after they have been normalized to meanUpdateFreqUsec.
This is roughly how long it takes to update fields every meanUpdateFreqUsec.
-
long long totalEverUpdateUsec¶
The total amount of time, ever, that has been spent updating all the fields.
-
struct dcgmIntrospectFullFieldsExecTime_v2¶
- #include <dcgm_structs.h>
Full introspection info for field execution time.
Since DCGM 2.0
Public Members
-
unsigned int version¶
version number (dcgmIntrospectFullFieldsExecTime_version)
-
dcgmIntrospectFieldsExecTime_v1 aggregateInfo¶
info that includes global and device scope
-
int hasGlobalInfo¶
0 means globalInfo is populated, !0 means it’s not
-
dcgmIntrospectFieldsExecTime_v1 globalInfo¶
info that only includes global field scope
-
unsigned int gpuIdsForGpuInfo[32]¶
the GPU ID at a given index identifies which GPU the corresponding entry in gpuInfo is from
-
dcgmIntrospectFieldsExecTime_v1 gpuInfo[32]¶
info that is separated by the GPU ID that the watches were for
-
struct dcgmIntrospectMemory_v1¶
- #include <dcgm_structs.h>
DCGM Memory usage information.
-
struct dcgmIntrospectFullMemory_v1¶
- #include <dcgm_structs.h>
Full introspection info for field memory.
Public Members
-
unsigned int version¶
version number (dcgmIntrospectFullMemory_version)
-
dcgmIntrospectMemory_v1 aggregateInfo¶
info that includes global and device scope
-
int hasGlobalInfo¶
0 means globalInfo is populated, !0 means it’s not
-
dcgmIntrospectMemory_v1 globalInfo¶
info that only includes global field scope
-
unsigned int gpuIdsForGpuInfo[32]¶
the GPU ID at a given index identifies which GPU the corresponding entry in gpuInfo is from
-
dcgmIntrospectMemory_v1 gpuInfo[32]¶
info that is divided by the GPU ID that the watches were for
-
struct dcgmIntrospectCpuUtil_v1¶
- #include <dcgm_structs.h>
DCGM CPU Utilization information.
Multiply values by 100 to get them in %.
Public Members
-
unsigned int version¶
version number (dcgmIntrospectCpuUtil_version)
-
double total¶
fraction of device’s CPU resources that were used
-
double kernel¶
fraction of device’s CPU resources that were used in kernel mode
-
double user¶
fraction of device’s CPU resources that were used in user mode
-
struct dcgmRunDiag_v7¶
Public Members
-
unsigned int version¶
version of this message
-
unsigned int flags¶
flags specifying binary options for running it. See DCGM_RUN_FLAGS_*
-
unsigned int debugLevel¶
0-5 for the debug level the GPU diagnostic will use for logging.
-
dcgmGpuGrp_t groupId¶
group of GPUs to verify. Cannot be specified together with gpuList.
-
dcgmPolicyValidation_t validate¶
0-3 for which tests to run. Optional.
-
char testNames[20][50]¶
Specified list of test names. Optional.
-
char testParms[100][100]¶
Parameters to set for specified tests, in the format: testName.parameterName=parameterValue. Optional.
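One way to build a single testParms entry in the format above is sketched here. The helper `ex_format_test_parm` and the constant `EX_TEST_PARM_LEN` are hypothetical; the length of 100 mirrors the second dimension of testParms[100][100], and the test and parameter names a caller passes in are whatever the diagnostic actually accepts.

```c
#include <assert.h>
#include <stdio.h>

/* Mirrors the per-entry size of testParms[100][100]. */
#define EX_TEST_PARM_LEN 100

/* Formats "testName.parameterName=parameterValue" into dst.
 * Returns 0 on success, -1 if the result would not fit (dst is then truncated). */
static int ex_format_test_parm(char dst[EX_TEST_PARM_LEN],
                               const char *testName,
                               const char *parameterName,
                               const char *parameterValue)
{
    int n;

    assert(dst != NULL && testName != NULL && parameterName != NULL && parameterValue != NULL);
    n = snprintf(dst, EX_TEST_PARM_LEN, "%s.%s=%s", testName, parameterName, parameterValue);

    return (n < 0 || n >= EX_TEST_PARM_LEN) ? -1 : 0;
}
```

Checking the snprintf return value matters here because a silently truncated entry would still look like a valid parameter string.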
-
char fakeGpuList[50]¶
Comma-separated list of fake GPUs. Cannot be specified with the groupId.
-
char gpuList[50]¶
Comma-separated list of GPUs. Cannot be specified with the groupId.
-
char debugLogFile[128]¶
Alternate name for the debug log file that should be used.
-
char statsPath[128]¶
Path that the plugin’s statistics files should be written to.
-
char configFileContents[10000]¶
Contents of nvvs config file (likely yaml)
-
char throttleMask[50]¶
Throttle reasons to ignore, as either an integer mask or a CSV list of reasons.
-
char pluginPath[128]¶
Custom path to the diagnostic plugins - No longer supported as of 2.2.9.
-
unsigned int trainingIterations¶
Number of iterations for training.
-
unsigned int trainingVariance¶
Acceptable training variance as a percentage of the value. (0-100)
-
unsigned int trainingTolerance¶
Acceptable training tolerance as a percentage of the value. (0-100)
-
char goldenValuesFile[128]¶
The path where the golden values should be recorded.
-
unsigned int failCheckInterval¶
How often the fail early checks should occur when enabled.
-
struct dcgmTopoSchedHint_v1¶
-
struct dcgmNvLinkGpuLinkStatus_v1¶
- #include <dcgm_structs.h>
State of NvLink links for a GPU.
Public Members
-
dcgm_field_eid_t entityId¶
Entity ID of the GPU (gpuId)
-
dcgmNvLinkLinkState_t linkState[6]¶
Per-GPU link states.
-
struct dcgmNvLinkGpuLinkStatus_v2¶
Public Members
-
dcgm_field_eid_t entityId¶
Entity ID of the GPU (gpuId)
-
dcgmNvLinkLinkState_t linkState[12]¶
Per-GPU link states.
-
struct dcgmNvLinkNvSwitchLinkStatus_t¶
- #include <dcgm_structs.h>
State of NvLink links for a NvSwitch.
Public Members
-
dcgm_field_eid_t entityId¶
Entity ID of the NvSwitch (physicalId)
-
dcgmNvLinkLinkState_t linkState[36]¶
Per-NvSwitch link states.
-
struct dcgmNvLinkStatus_v1¶
- #include <dcgm_structs.h>
Status of all of the NvLinks in a given system.
Public Members
-
unsigned int version¶
Version of this request. Should be dcgmNvLinkStatus_version1.
-
unsigned int numGpus¶
Number of entries in gpus[] that are populated.
-
dcgmNvLinkGpuLinkStatus_v1 gpus[32]¶
Per-GPU NvLink link statuses.
-
unsigned int numNvSwitches¶
Number of entries in nvSwitches[] that are populated.
-
dcgmNvLinkNvSwitchLinkStatus_t nvSwitches[12]¶
Per-NvSwitch link statuses.
-
struct dcgmNvLinkStatus_v2¶
Public Members
-
unsigned int version¶
Version of this request. Should be dcgmNvLinkStatus_version2.
-
unsigned int numGpus¶
Number of entries in gpus[] that are populated.
-
dcgmNvLinkGpuLinkStatus_v2 gpus[32]¶
Per-GPU NvLink link statuses.
-
unsigned int numNvSwitches¶
Number of entries in nvSwitches[] that are populated.
-
dcgmNvLinkNvSwitchLinkStatus_t nvSwitches[12]¶
Per-NvSwitch link statuses.
-
struct dcgmSummaryResponse_t¶
Public Members
-
unsigned int fieldType¶
type of field that is summarized (int64 or fp64)
-
union dcgmSummaryResponse_t::[anonymous] values[7]¶
array for storing the values of each summary.
The summaries are stored in order. For example, if MIN and MAX are requested, then index 0 will be MIN and index 1 will be MAX. If AVG and DIFF are requested, then index 0 will be AVG and index 1 will be DIFF.
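The ordering rule above suggests that the position of a summary in values[] can be derived from the request's summaryTypeMask. The sketch below assumes results are packed in ascending bit order of that mask, which matches both examples in the text but is not stated outright. The `EX_SUMMARY_*` constants are hypothetical stand-ins for the DCGM_SUMMARY_* bits, with assumed values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for DCGM_SUMMARY_* bits (assumed values). */
#define EX_SUMMARY_MIN  0x01u
#define EX_SUMMARY_MAX  0x02u
#define EX_SUMMARY_AVG  0x04u
#define EX_SUMMARY_DIFF 0x40u

/* Returns the index in values[] that holds `summaryBit`, or -1 if that
 * summary was not requested. Assumption: ascending bit order packing. */
static int ex_summary_index(uint32_t summaryTypeMask, uint32_t summaryBit)
{
    int index = 0;

    /* summaryBit must have exactly one bit set */
    assert(summaryBit != 0 && (summaryBit & (summaryBit - 1)) == 0);
    if ((summaryTypeMask & summaryBit) == 0)
        return -1;

    /* count requested summaries whose bit is lower than summaryBit */
    for (uint32_t bit = 1; bit != summaryBit; bit <<= 1) {
        if (summaryTypeMask & bit)
            ++index;
    }
    return index;
}
```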
-
struct dcgmFieldSummaryRequest_v1¶
Public Members
-
unsigned int version¶
version of this message - dcgmFieldSummaryRequest_version1
-
unsigned short fieldId¶
field id to be summarized
-
dcgm_field_entity_group_t entityGroupId¶
the type of entity whose field we’re getting
-
dcgm_field_eid_t entityId¶
ordinal id for this entity
-
uint32_t summaryTypeMask¶
bit-mask of DCGM_SUMMARY_*, the requested summaries
-
uint64_t startTime¶
start time for the interval being summarized.
0 means to use any data before.
-
uint64_t endTime¶
end time for the interval being summarized.
0 means to use any data after.
-
dcgmSummaryResponse_t response¶
response data for this request
-
struct dcgmModuleGetStatusesModule_t¶
- #include <dcgm_structs.h>
Status of all of the modules of the host engine.
Public Members
-
dcgmModuleId_t id¶
ID of this module.
-
dcgmModuleStatus_t status¶
Status of this module.
-
struct dcgmModuleGetStatuses_v1¶
Public Members
-
unsigned int version¶
Version of this request. Should be dcgmModuleGetStatuses_version1.
-
unsigned int numStatuses¶
Number of entries in statuses[] that are populated.
-
dcgmModuleGetStatusesModule_t statuses[16]¶
Per-module status information.
-
struct dcgmStartEmbeddedV2Params_v1¶
- #include <dcgm_structs.h>
Options for dcgmStartEmbedded_v2.
Added in DCGM 2.0.0
Public Members
-
unsigned int version¶
Version number. Use dcgmStartEmbeddedV2Params_version1
-
dcgmOperationMode_t opMode¶
IN: Collect data automatically or manually when asked by the user.
-
dcgmHandle_t dcgmHandle¶
OUT: DCGM Handle to use for API calls
-
const char *logFile¶
IN: File that DCGM should log to. NULL = do not log. ‘-’ = stdout
-
DcgmLoggingSeverity_t severity¶
IN: Severity at which DCGM should log to logFile
-
unsigned int blackListCount¶
IN: Number of modules to be blacklisted in blackList[]
-
unsigned int unused¶
IN: Unused. Set to 0. Aligns structure to 8-bytes
-
struct dcgmStartEmbeddedV2Params_v2¶
- #include <dcgm_structs.h>
Options for dcgmStartEmbedded_v2.
Added in DCGM 2.4.0
Public Members
-
unsigned int version¶
Version number. Use dcgmStartEmbeddedV2Params_version2
-
dcgmOperationMode_t opMode¶
IN: Collect data automatically or manually when asked by the user.
-
dcgmHandle_t dcgmHandle¶
OUT: DCGM Handle to use for API calls
-
const char *logFile¶
IN: File that DCGM should log to. NULL = do not log. ‘-’ = stdout
-
DcgmLoggingSeverity_t severity¶
IN: Severity at which DCGM should log to logFile
-
unsigned int blackListCount¶
IN: Number of modules to be blacklisted in blackList[]
-
const char *serviceAccount¶
IN: Service account for unprivileged processes
-
dcgmModuleId_t blackList[DcgmModuleIdCount]¶
IN: IDs of modules to blacklist
-
char _padding[4]¶
IN: Unused. Aligns the struct to 8 bytes.
-
struct dcgmProfMetricGroupInfo_t¶
- #include <dcgm_structs.h>
Structure to return all of the profiling metric groups that are available for the given groupId.
Public Members
-
unsigned short majorId¶
Major ID of this metric group.
Metric groups with the same majorId cannot be watched concurrently with other metric groups with the same majorId
-
unsigned short minorId¶
Minor ID of this metric group.
This distinguishes metric groups within the same major metric group from each other
-
unsigned int numFieldIds¶
Number of field IDs that are populated in fieldIds[].
-
unsigned short fieldIds[8]¶
DCGM Field IDs that are part of this profiling group. See DCGM_FI_PROF_* definitions in dcgm_fields.h for details.
-
struct dcgmProfGetMetricGroups_v2¶
Input parameters
-
unsigned int version¶
Version of this request. Should be dcgmProfGetMetricGroups_version.
-
unsigned int unused¶
Not used for now. Set to 0.
-
dcgmGpuGrp_t groupId¶
Group of GPUs we should get the metric groups for.
These must all be the exact same GPU or DCGM_ST_GROUP_INCOMPATIBLE will be returned
Output
-
unsigned int numMetricGroups¶
Number of entries in metricGroups[] that are populated.
-
unsigned int unused1¶
Not used for now. Set to 0.
-
dcgmProfMetricGroupInfo_t metricGroups[10]¶
Info for each metric group.
-
struct dcgmProfWatchFields_v1¶
- #include <dcgm_structs.h>
Structure to pass to dcgmProfWatchFields() when watching profiling metrics.
Public Members
-
unsigned int version¶
Version of this request. Should be dcgmProfWatchFields_version.
-
dcgmGpuGrp_t groupId¶
Group ID representing collection of one or more GPUs.
Look at dcgmGroupCreate for details on creating the group. Alternatively, pass in the group id as DCGM_GROUP_ALL_GPUS to perform operation on all the GPUs. The GPUs of the group must all be identical or DCGM_ST_GROUP_INCOMPATIBLE will be returned by this API.
-
unsigned int numFieldIds¶
Number of field IDs that are being passed in fieldIds[].
-
unsigned short fieldIds[16]¶
DCGM_FI_PROF_? field IDs to watch.
-
long long updateFreq¶
How often to update this field in usec.
Note that profiling metrics may need to be sampled more frequently than this value. See dcgmProfMetricGroupInfo_t.minUpdateFreqUsec of the metric group matching metricGroupTag to see what this minimum is. If minUpdateFreqUsec < updateFreq then samples will be aggregated to updateFreq intervals in DCGM’s internal cache.
-
double maxKeepAge¶
How long to keep data for every fieldId in seconds.
-
int maxKeepSamples¶
Maximum number of samples to keep for each fieldId. 0=no limit.
-
unsigned int flags¶
For future use. Set to 0 for now.
-
struct dcgmProfUnwatchFields_v1¶
- #include <dcgm_structs.h>
Structure to pass to dcgmProfUnwatchFields when unwatching profiling metrics.
Public Members
-
unsigned int version¶
Version of this request. Should be dcgmProfUnwatchFields_version.
-
dcgmGpuGrp_t groupId¶
Group ID representing collection of one or more GPUs.
Look at dcgmGroupCreate for details on creating the group. Alternatively, pass in the group id as DCGM_GROUP_ALL_GPUS to perform operation on all the GPUs. The GPUs of the group must all be identical or DCGM_ST_GROUP_INCOMPATIBLE will be returned by this API.
-
unsigned int flags¶
For future use. Set to 0 for now.
-
struct dcgmSettingsSetLoggingSeverity_v1¶
- #include <dcgm_structs.h>
Version 1 of dcgmSettingsSetLoggingSeverity_t.
-
struct dcgmVersionInfo_v2¶
- #include <dcgm_structs.h>
Structure to describe the DCGM build environment ver 2.0.
Public Members
-
char rawBuildInfoString[256 * 2]¶
Raw form of the DCGM build info.
There may be multiple key-value pairs separated by semicolons (;).
Within each pair, the key and value are separated by a colon (:). Only the very first colon is treated as the separator.
Values can contain colons. Keys and values cannot contain semicolons.
Commonly defined keys are:
version : DCGM version.
arch : Target DCGM architecture.
buildid : Build ID. Usually a sequential number.
commit : Commit ID (usually a git commit hash).
author : Author of the commit above.
branch : Branch (usually the git branch that was used for the build).
buildtype : Build type.
builddate : Date of the build.
buildplatform : Platform where the build was made.
Any or all keys may be absent.
These values are for reference only and are not intended to drive program logic.
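The parsing rules for rawBuildInfoString (semicolon between pairs, first colon splits key from value) can be sketched as a small lookup function. The name `ex_build_info_get` is hypothetical, and the sketch assumes keys appear without surrounding whitespace; the sample string used to exercise it would be made up, not real build output.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Looks up `key` in a "k:v;k:v" string and copies its value into out
 * (truncating if needed). Returns 1 if the key was found, 0 otherwise. */
static int ex_build_info_get(const char *raw, const char *key, char *out, size_t outSize)
{
    size_t keyLen;

    assert(raw != NULL && key != NULL && out != NULL && outSize > 0);
    keyLen = strlen(key);

    while (*raw != '\0') {
        const char *pairEnd = strchr(raw, ';');
        size_t pairLen = pairEnd ? (size_t)(pairEnd - raw) : strlen(raw);
        /* only the first colon inside the pair separates key from value */
        const char *colon = memchr(raw, ':', pairLen);

        if (colon != NULL && (size_t)(colon - raw) == keyLen && memcmp(raw, key, keyLen) == 0) {
            size_t valueLen = pairLen - keyLen - 1;
            if (valueLen >= outSize)
                valueLen = outSize - 1; /* truncate to fit */
            memcpy(out, colon + 1, valueLen);
            out[valueLen] = '\0';
            return 1;
        }

        raw += pairLen;
        if (*raw == ';')
            ++raw; /* skip the pair separator */
    }
    return 0; /* key absent, which the text says is allowed */
}
```

Callers should treat a missing key as normal, since any or all keys may be absent.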
-
DCGM_RUN_FLAGS_VERBOSE¶