Profiling
This chapter describes the methods that watch profiling fields from within DCGM.
Functions
-
dcgmReturn_t dcgmProfGetSupportedMetricGroups(dcgmHandle_t pDcgmHandle, dcgmProfGetMetricGroups_t *metricGroups)
Get all of the profiling metric groups for a given GPU group.
Profiling metrics are watched in groups: all fields in a group are watched together. For instance, DCGM_FI_PROF_GR_ENGINE_ACTIVITY might be in the same group as DCGM_FI_PROF_SM_EFFICIENCY, in which case watching that group results in DCGM storing values for both metrics.
Some groups cannot be watched concurrently with others because they use the same hardware resource. For instance, you may not be able to watch DCGM_FI_PROF_TENSOR_OP_UTIL at the same time as DCGM_FI_PROF_GR_ENGINE_ACTIVITY on your hardware, yet you may still be able to watch DCGM_FI_PROF_TENSOR_OP_UTIL alongside DCGM_FI_PROF_NVLINK_TX_DATA.
Metrics that can be watched concurrently will have different .majorId fields in their dcgmProfMetricGroupInfo_t.
See dcgmGroupCreate for details on creating a GPU group.
See dcgmWatchFields to actually watch the underlying profiling fields.
- Parameters:
pDcgmHandle – IN: DCGM Handle
metricGroups –
IN/OUT: Metric groups supported for metricGroups->groupId.
metricGroups->version should be set to dcgmProfGetMetricGroups_version upon calling.
- Returns:
DCGM_ST_OK if the request succeeds.
DCGM_ST_BADPARAM if a parameter is missing or bad.
DCGM_ST_GROUP_INCOMPATIBLE if metricGroups->groupId’s GPUs are not identical GPUs.
DCGM_ST_NOT_SUPPORTED if profiling metrics are not supported for the given GPU group.
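The majorId rule described for dcgmProfGetSupportedMetricGroups can be illustrated with a small, self-contained sketch. It does not call DCGM: metric_group_id_t is a hypothetical stand-in that models only the .majorId field of dcgmProfMetricGroupInfo_t, and both helper functions are names invented for this example. In real code the IDs would be read from the entries DCGM fills in for the metricGroups argument.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for dcgmProfMetricGroupInfo_t. Only the majorId
 * field mentioned in the text is modeled; the real struct is defined in
 * the DCGM headers and has more members. */
typedef struct {
    unsigned short majorId;
} metric_group_id_t;

/* Per the description above, two metric groups can be watched at the
 * same time when their majorId values differ. */
bool can_watch_together(metric_group_id_t a, metric_group_id_t b)
{
    return a.majorId != b.majorId;
}

/* Returns true when every pair in the candidate set has distinct majorId
 * values, i.e. the whole set could be watched concurrently. An empty or
 * single-element set is trivially watchable. */
bool all_watchable_together(const metric_group_id_t *groups, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            if (!can_watch_together(groups[i], groups[j])) {
                return false;
            }
        }
    }
    return true;
}
```

A caller could run a check like this on the groups it intends to watch before passing the corresponding field IDs to dcgmWatchFields, so that conflicts are caught up front.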
-
dcgmReturn_t dcgmProfPause(dcgmHandle_t pDcgmHandle)
Pause profiling activities in DCGM.
Use this when you are monitoring profiling fields from DCGM but still want to run developer tools such as nvprof, Nsight Systems, and Nsight Compute. Profiling fields start with DCGM_FI_PROF_ and are in the field ID range 1001-1012.
Call this API before you launch one of those tools and dcgmProfResume() after the tool has completed.
DCGM will save BLANK values while profiling is paused.
Calling this while profiling activities are already paused is fine and will be treated as a no-op.
- Parameters:
pDcgmHandle – IN: DCGM Handle
- Returns:
DCGM_ST_OK if the call was successful.
DCGM_ST_BADPARAM if a parameter is invalid.
-
dcgmReturn_t dcgmProfResume(dcgmHandle_t pDcgmHandle)
Resume profiling activities in DCGM that were previously paused with dcgmProfPause().
Call this API after you have finished running other NVIDIA developer tools to re-enable DCGM profiling metrics.
DCGM will save BLANK values while profiling is paused.
Calling this while profiling activities have already been resumed is fine and will be treated as a no-op.
- Parameters:
pDcgmHandle – IN: DCGM Handle
- Returns:
DCGM_ST_OK if the call was successful.
DCGM_ST_BADPARAM if a parameter is invalid.
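The pause, run tool, resume sequence described for dcgmProfPause() and dcgmProfResume() can be sketched as a small wrapper. To keep the sketch compilable without the DCGM headers, everything in it is a stand-in: status_t and STATUS_OK play the role of dcgmReturn_t and DCGM_ST_OK, a void pointer plays the role of dcgmHandle_t, and the fake_* functions only record the call order. None of these names come from DCGM.

```c
#include <stddef.h>

/* Hypothetical stand-ins for dcgmReturn_t, DCGM_ST_OK and DCGM_ST_BADPARAM. */
typedef int status_t;
#define STATUS_OK 0
#define STATUS_BADPARAM (-1)

/* Shape of a pause/resume call and of the external tool launcher.
 * The handle is a void pointer here only so the fakes can share state. */
typedef status_t (*prof_call_fn)(void *handle);
typedef int (*tool_fn)(void *handle);

/* Pause profiling, run the tool, then resume. The tool is not started if
 * pausing fails; resume is attempted whenever the pause succeeded, even if
 * the tool itself reports an error. */
status_t run_with_profiling_paused(void *handle, prof_call_fn pause_fn,
                                   prof_call_fn resume_fn, tool_fn tool,
                                   int *tool_rc)
{
    status_t st = pause_fn(handle);
    if (st != STATUS_OK) {
        return st;
    }
    int rc = tool(handle);
    if (tool_rc != NULL) {
        *tool_rc = rc;
    }
    return resume_fn(handle);
}

/* Fakes used to demonstrate the call order: each appends one letter. */
typedef struct {
    char log[8];
    size_t len;
} trace_t;

void trace_add(trace_t *t, char c)
{
    if (t->len + 1 < sizeof t->log) {
        t->log[t->len++] = c;
        t->log[t->len] = '\0';
    }
}

status_t fake_pause(void *h)      { trace_add((trace_t *)h, 'P'); return STATUS_OK; }
status_t fake_pause_fail(void *h) { trace_add((trace_t *)h, 'P'); return STATUS_BADPARAM; }
status_t fake_resume(void *h)     { trace_add((trace_t *)h, 'R'); return STATUS_OK; }
int fake_tool(void *h)            { trace_add((trace_t *)h, 'T'); return 3; }
```

In a real program the two function pointers would wrap dcgmProfPause and dcgmProfResume, and the tool callback would launch the external profiler and wait for it to exit. Because both DCGM calls are documented as no-ops when already in the requested state, a wrapper like this can be nested or repeated without tracking the current pause state.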