Profiling
- group DCGMAPI_PROFILING
This chapter describes the methods that watch profiling fields from within DCGM.
Functions
-
dcgmReturn_t dcgmProfGetSupportedMetricGroups(dcgmHandle_t pDcgmHandle, dcgmProfGetMetricGroups_t *metricGroups)
Get all of the profiling metric groups for a given GPU group.
Profiling metrics are watched in groups of fields that are all watched together. For instance, DCGM_FI_PROF_GR_ENGINE_ACTIVITY might be in the same group as DCGM_FI_PROF_SM_EFFICIENCY. Watching that group would result in DCGM storing values for both metrics.
Some groups cannot be watched concurrently with others because they use the same hardware resource. For instance, you may not be able to watch DCGM_FI_PROF_TENSOR_OP_UTIL at the same time as DCGM_FI_PROF_GR_ENGINE_ACTIVITY on your hardware, yet you may be able to watch DCGM_FI_PROF_TENSOR_OP_UTIL at the same time as DCGM_FI_PROF_NVLINK_TX_DATA.
Metrics that can be watched concurrently will have different .majorId fields in their dcgmProfMetricGroupInfo_t.
See dcgmGroupCreate for details on creating a GPU group. See dcgmProfWatchFields to actually watch a metric group.
- Parameters
pDcgmHandle – IN: DCGM Handle
metricGroups –
IN/OUT: Metric groups supported for metricGroups->groupId.
metricGroups->version should be set to dcgmProfGetMetricGroups_version upon calling.
- Returns
DCGM_ST_OK if the request succeeds.
DCGM_ST_BADPARAM if a parameter is missing or bad.
DCGM_ST_GROUP_INCOMPATIBLE if metricGroups->groupId’s GPUs are not identical GPUs.
DCGM_ST_NOT_SUPPORTED if profiling metrics are not supported for the given GPU group.
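The majorId rule described above lends itself to a simple pre-check before requesting a watch. The following is a minimal sketch, not DCGM code: metric_group_stub_t is a hypothetical stand-in that carries only a majorId member (its integer width is an assumption), and groups_have_distinct_major_ids is an illustrative helper name. Real code would read majorId from the dcgmProfMetricGroupInfo_t entries filled in by dcgmProfGetSupportedMetricGroups.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the single member of dcgmProfMetricGroupInfo_t
 * that this sketch needs. The real struct is defined by the DCGM headers. */
typedef struct {
    unsigned short majorId; /* width is an assumption */
} metric_group_stub_t;

/* Returns 1 if every group has a distinct majorId, 0 if any two share one. */
int groups_have_distinct_major_ids(const metric_group_stub_t *groups, size_t count)
{
    assert(groups != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            if (groups[i].majorId == groups[j].majorId) {
                return 0; /* two groups share a majorId */
            }
        }
    }
    return 1;
}
```

A check like this is only a hint. The return code of dcgmProfWatchFields remains the authoritative answer on whether a set of groups can be watched together.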
-
dcgmReturn_t dcgmProfWatchFields(dcgmHandle_t pDcgmHandle, dcgmProfWatchFields_t *watchFields)
Request that DCGM start recording updates for a given list of profiling field IDs.
Once metrics have been watched by this API, any of the normal DCGM field-value retrieval APIs can be used on the underlying fieldIds of this metric group. See dcgmGetLatestValues_v2, dcgmGetLatestValuesForFields, dcgmEntityGetLatestValues, and dcgmEntitiesGetLatestValues.
- Parameters
pDcgmHandle – IN: DCGM Handle
watchFields – IN: Details of which metric groups to watch for which GPUs. See dcgmProfWatchFields_v1 for details of what should be put in each struct member. watchFields->version should be set to dcgmProfWatchFields_version upon calling.
- Returns
DCGM_ST_OK if the call was successful.
DCGM_ST_BADPARAM if a parameter is invalid.
DCGM_ST_NOT_SUPPORTED if profiling metric group metricGroupTag is not supported for the given GPU group.
DCGM_ST_GROUP_INCOMPATIBLE if groupId’s GPUs are not identical. Profiling metrics are only supported for homogeneous groups of GPUs.
DCGM_ST_PROFILING_MULTI_PASS if any of the metric groups could not be watched concurrently because the hardware would need multiple passes to gather them.
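Because dcgmProfWatchFields takes a list of profiling field IDs, a caller has to assemble that list before the call. The sketch below shows one way to do so while skipping duplicates and respecting a fixed capacity. Everything in it is illustrative: watch_request_stub_t, STUB_MAX_FIELD_IDS, and watch_request_add_field are hypothetical names, and the real request layout is the one defined by dcgmProfWatchFields_v1 in the DCGM headers.

```c
#include <assert.h>
#include <stddef.h>

#define STUB_MAX_FIELD_IDS 16 /* hypothetical capacity, not a DCGM constant */

/* Hypothetical stand-in for the field-ID portion of a watch request. */
typedef struct {
    unsigned int numFieldIds;
    unsigned short fieldIds[STUB_MAX_FIELD_IDS];
} watch_request_stub_t;

/* Adds fieldId to the request unless it is already present.
 * Returns 0 on success (including the duplicate case), -1 if the list is full. */
int watch_request_add_field(watch_request_stub_t *req, unsigned short fieldId)
{
    assert(req != NULL);

    for (unsigned int i = 0; i < req->numFieldIds; i++) {
        if (req->fieldIds[i] == fieldId) {
            return 0; /* already listed */
        }
    }
    if (req->numFieldIds >= STUB_MAX_FIELD_IDS) {
        return -1; /* no room left */
    }
    req->fieldIds[req->numFieldIds++] = fieldId;
    return 0;
}
```

In real code the same loop would write into the request struct from the DCGM headers, with its version member set as described above, before the struct is passed to dcgmProfWatchFields.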
-
dcgmReturn_t dcgmProfUnwatchFields(dcgmHandle_t pDcgmHandle, dcgmProfUnwatchFields_t *unwatchFields)
Request that DCGM stop recording updates for all profiling field IDs for all GPUs.
- Parameters
pDcgmHandle – IN: DCGM Handle
unwatchFields – IN: Details of which metric groups to unwatch for which GPUs. See dcgmProfUnwatchFields_v1 for details of what should be put in each struct member. unwatchFields->version should be set to dcgmProfUnwatchFields_version upon calling.
- Returns
DCGM_ST_OK if the call was successful.
DCGM_ST_BADPARAM if a parameter is invalid.
-
dcgmReturn_t dcgmProfPause(dcgmHandle_t pDcgmHandle)
Pause profiling activities in DCGM.
Use this when you are monitoring profiling fields from DCGM but still want to run developer tools such as nvprof, Nsight Systems, and Nsight Compute. Profiling fields start with DCGM_FI_PROF_ and are in the field ID range 1001-1012.
Call this API before you launch one of those tools, and call dcgmProfResume() after the tool has completed.
DCGM will save BLANK values while profiling is paused.
Calling this while profiling activities are already paused is fine and will be treated as a no-op.
- Parameters
pDcgmHandle – IN: DCGM Handle
- Returns
DCGM_ST_OK if the call was successful.
DCGM_ST_BADPARAM if a parameter is invalid.
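Since DCGM stores BLANK values while profiling is paused, code that aggregates profiling samples may want to skip them. The sketch below averages only the non-blank samples. The blank test is passed in as a callback because this example does not include the DCGM headers. DEMO_BLANK and demo_is_blank are placeholders invented for the example and do not reflect DCGM's actual blank representation.

```c
#include <assert.h>
#include <stddef.h>

/* Callback that reports whether a sample is a blank placeholder. */
typedef int (*is_blank_fn)(double value);

#define DEMO_BLANK (-1.0) /* placeholder sentinel, NOT DCGM's blank value */

/* Demo predicate; real code would use the blank check from the DCGM headers. */
int demo_is_blank(double value)
{
    return value == DEMO_BLANK;
}

/* Writes the mean of the non-blank samples to *mean_out and returns 0.
 * Returns -1 (leaving *mean_out untouched) if every sample is blank. */
int mean_of_non_blank(const double *samples, size_t count,
                      is_blank_fn is_blank, double *mean_out)
{
    assert(is_blank != NULL && mean_out != NULL);
    assert(samples != NULL || count == 0);

    double sum = 0.0;
    size_t used = 0;
    for (size_t i = 0; i < count; i++) {
        if (!is_blank(samples[i])) {
            sum += samples[i];
            used++;
        }
    }
    if (used == 0) {
        return -1; /* nothing but blanks, e.g. the whole window was paused */
    }
    *mean_out = sum / (double)used;
    return 0;
}
```

Returning an error when every sample is blank keeps a fully paused interval from being reported as a measured value of zero.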
-
dcgmReturn_t dcgmProfResume(dcgmHandle_t pDcgmHandle)
Resume profiling activities in DCGM that were previously paused with dcgmProfPause().
Call this API after you have completed running other NVIDIA developer tools to reenable DCGM profiling metrics.
DCGM will save BLANK values while profiling is paused.
Calling this while profiling activities have already been resumed is fine and will be treated as a no-op.
- Parameters
pDcgmHandle – IN: DCGM Handle
- Returns
DCGM_ST_OK if the call was successful.
DCGM_ST_BADPARAM if a parameter is invalid.
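The pause-then-resume sequence described for dcgmProfPause and dcgmProfResume can be wrapped so that profiling is always resumed, even when the developer tool fails. The sketch below is one possible shape for such a wrapper and makes several assumptions: the callbacks are hypothetical stand-ins (real ones would call dcgmProfPause and dcgmProfResume with a valid handle and map DCGM_ST_OK to 0), and the demo_* functions exist only to make the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Callbacks return 0 on success and nonzero on failure. */
typedef int (*prof_toggle_fn)(void *ctx);
typedef int (*tool_fn)(void *ctx);

/* Pauses profiling, runs the tool, then resumes profiling.
 * The tool is not run if pausing fails. Resume is always attempted once
 * the pause has succeeded. The tool's error takes priority in the result. */
int run_with_profiling_paused(prof_toggle_fn pause, prof_toggle_fn resume,
                              tool_fn tool, void *ctx)
{
    assert(pause != NULL && resume != NULL && tool != NULL);

    int pause_rc = pause(ctx);
    if (pause_rc != 0) {
        return pause_rc;
    }
    int tool_rc = tool(ctx);
    int resume_rc = resume(ctx);
    return tool_rc != 0 ? tool_rc : resume_rc;
}

/* Demo state and callbacks that record what happened. */
typedef struct {
    int paused;
    int tool_ran_while_paused;
    int resumes;
    int tool_result;
} demo_state_t;

int demo_pause(void *ctx)
{
    demo_state_t *s = ctx;
    s->paused = 1;
    return 0;
}

int demo_resume(void *ctx)
{
    demo_state_t *s = ctx;
    s->paused = 0;
    s->resumes++;
    return 0;
}

int demo_tool(void *ctx)
{
    demo_state_t *s = ctx;
    s->tool_ran_while_paused = s->paused;
    return s->tool_result;
}
```

Because both calls are documented as no-ops when profiling is already in the requested state, a wrapper like this should be safe to use even if another part of the program has already paused profiling.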