Profiling

group DCGMAPI_PROFILING

This chapter describes the methods that watch profiling fields from within DCGM.

Functions

dcgmReturn_t dcgmProfGetSupportedMetricGroups(dcgmHandle_t pDcgmHandle, dcgmProfGetMetricGroups_t *metricGroups)

Get all of the profiling metric groups for a given GPU group.

Profiling metrics are watched in groups of fields that are all watched together. For instance, DCGM_FI_PROF_GR_ENGINE_ACTIVITY might be in the same group as DCGM_FI_PROF_SM_EFFICIENCY. Watching this group would result in DCGM storing values for both of these metrics.

Some groups cannot be watched concurrently with others because they use the same hardware resource. For instance, you may not be able to watch DCGM_FI_PROF_TENSOR_OP_UTIL at the same time as DCGM_FI_PROF_GR_ENGINE_ACTIVITY on your hardware. However, you may be able to watch DCGM_FI_PROF_TENSOR_OP_UTIL at the same time as DCGM_FI_PROF_NVLINK_TX_DATA.

Metrics that can be watched concurrently will have different .majorId fields in their dcgmProfMetricGroupInfo_t.

See dcgmGroupCreate for details on creating a GPU group. See dcgmProfWatchFields to actually watch a metric group.
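The .majorId rule above can be checked before requesting a watch. The following is a minimal sketch, not DCGM code: example_group_id_t and example_can_watch_together are hypothetical names, and the struct is a stand-in that mirrors only the ID pair of dcgmProfMetricGroupInfo_t so that the example compiles without the DCGM headers. It treats a selection as watchable together only when every group has a different majorId.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the ID pair of dcgmProfMetricGroupInfo_t.
 * Real code would read these values from the array filled in by
 * dcgmProfGetSupportedMetricGroups. */
typedef struct {
    unsigned short majorId; /* Groups sharing a majorId are assumed to conflict. */
    unsigned short minorId; /* Not used by this check. */
} example_group_id_t;

/* Returns 1 if every selected group has a distinct majorId, 0 otherwise. */
int example_can_watch_together(const example_group_id_t *groups, size_t count)
{
    assert(groups != NULL || count == 0);

    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            if (groups[i].majorId == groups[j].majorId) {
                return 0; /* Two groups need the same hardware resource. */
            }
        }
    }
    return 1;
}
```

The pairwise comparison is quadratic, which should be acceptable for the small number of metric groups a GPU typically reports.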

Parameters
  • pDcgmHandle – IN: DCGM Handle

  • metricGroups – IN/OUT: Metric groups supported for metricGroups->groupId. metricGroups->version should be set to dcgmProfGetMetricGroups_version upon calling.

Returns

dcgmReturn_t dcgmProfWatchFields(dcgmHandle_t pDcgmHandle, dcgmProfWatchFields_t *watchFields)

Request that DCGM start recording updates for a given list of profiling field IDs.

Once metrics have been watched by this API, any of the normal DCGM field-value retrieval APIs can be used on the underlying fieldIds of this metric group. See dcgmGetLatestValues_v2, dcgmGetLatestValuesForFields, dcgmEntityGetLatestValues, and dcgmEntitiesGetLatestValues.

Parameters
  • pDcgmHandle – IN: DCGM Handle

  • watchFields – IN: Details of which metric groups to watch for which GPUs. See dcgmProfWatchFields_v1 for details of what should be put in each struct member. watchFields->version should be set to dcgmProfWatchFields_version upon calling.

Returns

dcgmReturn_t dcgmProfUnwatchFields(dcgmHandle_t pDcgmHandle, dcgmProfUnwatchFields_t *unwatchFields)

Request that DCGM stop recording updates for all profiling field IDs for all GPUs.

Parameters
  • pDcgmHandle – IN: DCGM Handle

  • unwatchFields – IN: Details of which metric groups to unwatch for which GPUs. See dcgmProfUnwatchFields_v1 for details of what should be put in each struct member. unwatchFields->version should be set to dcgmProfUnwatchFields_version upon calling.

Returns

dcgmReturn_t dcgmProfPause(dcgmHandle_t pDcgmHandle)

Pause profiling activities in DCGM.

Use this when you are monitoring profiling fields from DCGM but still want to run developer tools like nvprof, Nsight Systems, and Nsight Compute. Profiling fields start with DCGM_FI_PROF_ and are in the field ID range 1001-1012.

Call this API before you launch one of those tools and dcgmProfResume() after the tool has completed.

DCGM will save BLANK values while profiling is paused.

Calling this while profiling activities are already paused is fine and will be treated as a no-op.
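The pause, run, resume sequence described above can be wrapped in one helper so that profiling is resumed even when the tool fails. This is a sketch under stated assumptions rather than part of DCGM: every example_ name is hypothetical, the handle and callback types are stand-ins so the code compiles without the DCGM headers, and a status of 0 is assumed to mean success. In a real program the pause and resume callbacks would be thin adapters around dcgmProfPause and dcgmProfResume.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in types; real code would use dcgmHandle_t and dcgmReturn_t. */
typedef uintptr_t example_handle_t;
typedef int (*example_prof_ctl_fn)(example_handle_t handle);
typedef int (*example_tool_fn)(void *ctx);

/* Pauses profiling, runs the tool, then resumes profiling even if the tool failed.
 * Returns the pause status if pausing failed, otherwise the tool status if it is
 * nonzero, otherwise the resume status. ASSUMPTION: 0 means success. */
int example_run_with_profiling_paused(example_handle_t handle,
                                      example_prof_ctl_fn pause_fn,
                                      example_prof_ctl_fn resume_fn,
                                      example_tool_fn tool_fn,
                                      void *ctx)
{
    assert(pause_fn != NULL && resume_fn != NULL && tool_fn != NULL);

    int pause_status = pause_fn(handle);
    if (pause_status != 0) {
        return pause_status; /* Do not start the tool if pausing failed. */
    }

    int tool_status = tool_fn(ctx);

    /* Always resume, so that BLANK values stop being recorded. */
    int resume_status = resume_fn(handle);

    return (tool_status != 0) ? tool_status : resume_status;
}

/* Fake callbacks that record call order, for trying the wrapper without a GPU. */
char example_trace[8];
size_t example_trace_len = 0;

static void example_record(char step)
{
    if (example_trace_len < sizeof(example_trace)) {
        example_trace[example_trace_len++] = step;
    }
}

int example_fake_pause(example_handle_t handle)
{
    (void)handle;
    example_record('P');
    return 0;
}

int example_fake_resume(example_handle_t handle)
{
    (void)handle;
    example_record('R');
    return 0;
}

/* Returns the int pointed to by ctx as the tool's exit status, or 0 if ctx is NULL. */
int example_fake_tool(void *ctx)
{
    example_record('T');
    return (ctx != NULL) ? *(int *)ctx : 0;
}
```

Calling example_run_with_profiling_paused with the three fake callbacks records the steps in the order pause, tool, resume. Because pausing and resuming are described as no-ops when already in the requested state, the wrapper does not track the current state itself.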

Parameters
  • pDcgmHandle – IN: DCGM Handle

Returns

dcgmReturn_t dcgmProfResume(dcgmHandle_t pDcgmHandle)

Resume profiling activities in DCGM that were previously paused with dcgmProfPause().

Call this API after you have finished running other NVIDIA developer tools to re-enable DCGM profiling metrics.

DCGM will save BLANK values while profiling is paused.

Calling this while profiling activities have already been resumed is fine and will be treated as a no-op.
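Because BLANK values are saved while profiling is paused, code that aggregates profiling samples across a pause may want to skip them. The sketch below averages only the non-blank samples. It is illustrative only: example_mean_skipping_blanks is a hypothetical helper, and the rule that a sample is blank when it is greater than or equal to a caller-supplied threshold is an assumption of this example. Real code should rely on the blank-value definitions in the DCGM headers.

```c
#include <assert.h>
#include <stddef.h>

/* Averages the samples that are not blank.
 * ASSUMPTION: a sample counts as blank when it is >= blank_threshold. The
 * threshold is supplied by the caller and is not taken from the DCGM headers.
 * Returns the number of samples averaged; *mean_out is written only when
 * at least one sample was used. */
size_t example_mean_skipping_blanks(const double *samples, size_t count,
                                    double blank_threshold, double *mean_out)
{
    assert(mean_out != NULL);
    assert(samples != NULL || count == 0);

    double sum = 0.0;
    size_t used = 0;

    for (size_t i = 0; i < count; i++) {
        if (samples[i] >= blank_threshold) {
            continue; /* Treated as recorded while profiling was paused. */
        }
        sum += samples[i];
        used++;
    }

    if (used > 0) {
        *mean_out = sum / (double)used;
    }
    return used;
}
```

Returning the count lets the caller distinguish an interval that was entirely paused from one with a genuine mean of zero.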

Parameters
  • pDcgmHandle – IN: DCGM Handle

Returns