Job Statistics

group DCGMAPI_JOB_STATS

The client can invoke DCGM APIs to start and stop collecting the stats at the process boundaries (during prologue and epilogue).

This will enable DCGM to monitor all the PIDs while the job is in progress, and provide a summary of active processes and resource usage during the window of interest.

Functions

dcgmReturn_t dcgmWatchJobFields(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, long long updateFreq, double maxKeepAge, int maxKeepSamples)

Request that DCGM start recording stats for the fields that can be queried with dcgmJobGetStats().

Note that the first update of the field will not occur until the next field update cycle. To force a field update cycle, call dcgmUpdateAllFields(1).

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to operate on all GPUs.

  • updateFreq – IN: How often to update this field, in microseconds

  • maxKeepAge – IN: How long to keep data for this field in seconds

  • maxKeepSamples – IN: Maximum number of samples to keep. 0=no limit

Returns:

  • DCGM_ST_OK if the call was successful

  • DCGM_ST_BADPARAM if a parameter is invalid

  • DCGM_ST_REQUIRES_ROOT if the host engine is being run as non-root, and accounting mode could not be enabled (requires root). Run “nvidia-smi -am 1” as root on the node before starting DCGM to fix this.
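
The arguments use different units: updateFreq is in microseconds, while maxKeepAge is in seconds. Below is a minimal sketch of a helper that keeps the two apart. The names job_watch_params and make_job_watch_params are hypothetical and not part of DCGM. The closing comment shows where the values would be passed, assuming a valid handle and group already exist.

```c
#include <assert.h>

/* Hypothetical helper type, not part of DCGM: holds the three tuning
 * arguments of dcgmWatchJobFields() in the units documented above. */
typedef struct {
    long long updateFreq;     /* microseconds */
    double    maxKeepAge;     /* seconds */
    int       maxKeepSamples; /* 0 = no limit */
} job_watch_params;

/* Build the parameters from human-friendly values given in seconds. */
static job_watch_params make_job_watch_params(double updateEverySec,
                                              double keepForSec,
                                              int maxSamples)
{
    job_watch_params p;

    assert(updateEverySec > 0.0);
    assert(keepForSec >= 0.0);
    assert(maxSamples >= 0);

    /* Seconds to microseconds, rounded to the nearest microsecond. */
    p.updateFreq     = (long long)(updateEverySec * 1000000.0 + 0.5);
    p.maxKeepAge     = keepForSec;
    p.maxKeepSamples = maxSamples;
    return p;
}

/* Intended use (requires the DCGM headers and a connected handle):
 *
 *   job_watch_params p = make_job_watch_params(1.0, 3600.0, 0);
 *   dcgmWatchJobFields(handle, groupId,
 *                      p.updateFreq, p.maxKeepAge, p.maxKeepSamples);
 *   dcgmUpdateAllFields(handle, 1);   // force the first update cycle
 */
```

For example, make_job_watch_params(1.0, 3600.0, 0) yields an updateFreq of 1000000 (one update per second), keeps data for one hour, and places no limit on the number of samples.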

dcgmReturn_t dcgmJobStartStats(dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, char jobId[64])

This API is used by the client to notify DCGM about the job to be started.

It should be invoked as part of the job prologue.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • groupId – IN: Group ID representing a collection of one or more GPUs. See dcgmGroupCreate for details on creating the group. Alternatively, pass DCGM_GROUP_ALL_GPUS as the group ID to operate on all GPUs.

  • jobId – IN: User provided string to represent the job
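
Because jobId is declared as char[64], an identifier taken from a scheduler or environment variable has to fit in that buffer. The sketch below copies a string into a zero-filled 64-byte buffer and rejects anything that would not fit. make_job_id and JOB_ID_LEN are hypothetical names, not part of DCGM, and the requirement that the string be NUL-terminated within the 64 bytes is an assumption.

```c
#include <assert.h>
#include <string.h>

#define JOB_ID_LEN 64 /* matches the char jobId[64] parameter above */

/* Hypothetical helper, not part of DCGM.
 * Copies src into dst, which must be JOB_ID_LEN bytes long.
 * Returns 0 on success, or -1 if src is NULL, empty, or too long to fit
 * together with its terminating NUL (assumed to be required).
 * On failure dst is left as an empty string. */
static int make_job_id(char dst[JOB_ID_LEN], const char *src)
{
    size_t n;

    assert(dst != NULL);
    memset(dst, 0, JOB_ID_LEN);

    if (src == NULL)
        return -1;

    n = strlen(src);
    if (n == 0 || n >= JOB_ID_LEN)
        return -1;

    memcpy(dst, src, n); /* the rest of dst is already zero */
    return 0;
}

/* Intended use (requires the DCGM headers and a connected handle):
 *
 *   char jobId[JOB_ID_LEN];
 *   if (make_job_id(jobId, schedulerJobId) == 0)
 *       dcgmJobStartStats(handle, groupId, jobId);
 */
```

Rejecting an over-long identifier, rather than truncating it, avoids two distinct jobs silently mapping to the same jobId.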

Returns:

dcgmReturn_t dcgmJobStopStats(dcgmHandle_t pDcgmHandle, char jobId[64])

This API is used by the client to notify DCGM to stop collecting stats for the job represented by jobId.

It should be invoked as part of the job epilogue. The job id remains available for viewing the stats at any point, but it cannot be used to start a new job until it is removed with dcgmJobRemove. You must call dcgmWatchJobFields() before this call to enable job watching.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • jobId – IN: User provided string to represent the job

Returns:

dcgmReturn_t dcgmJobGetStats(dcgmHandle_t pDcgmHandle, char jobId[64], dcgmJobInfo_t *pJobInfo)

Get stats for the job identified by jobId.

The stats can be retrieved at any point while the job is in progress. If you want to reuse this jobId, call dcgmJobRemove after this call.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • jobId – IN: User provided string to represent the job

  • pJobInfo – IN/OUT: Structure to return information about the job. .version should be set to dcgmJobInfo_version before this call.

Returns:

dcgmReturn_t dcgmJobRemove(dcgmHandle_t pDcgmHandle, char jobId[64])

This API tells DCGM to stop tracking the job given by jobId.

After this call, you will no longer be able to call dcgmJobGetStats() on this jobId. However, you will be able to reuse jobId after this call.
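
The rules for reusing a job id that are spread across dcgmJobStartStats, dcgmJobStopStats, dcgmJobGetStats, and dcgmJobRemove can be summarized in a toy state model. This is only an illustration of the documented behavior, not DCGM code. All names are hypothetical, and the -1 results for cases the text does not cover (for example, stopping a job that was never started) are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a single job id's lifecycle (hypothetical, not DCGM code). */
typedef enum {
    JOB_UNUSED,  /* id is not tracked and may be used to start a job */
    JOB_RUNNING, /* dcgmJobStartStats() was called */
    JOB_STOPPED  /* dcgmJobStopStats() was called; stats remain viewable */
} job_state;

/* Models dcgmJobStartStats(): an id that is still tracked cannot start. */
static int model_start(job_state *s)
{
    assert(s != NULL);
    if (*s != JOB_UNUSED)
        return -1;
    *s = JOB_RUNNING;
    return 0;
}

/* Models dcgmJobStopStats(): collection ends, the id stays tracked. */
static int model_stop(job_state *s)
{
    assert(s != NULL);
    if (*s != JOB_RUNNING)
        return -1;
    *s = JOB_STOPPED;
    return 0;
}

/* Models dcgmJobGetStats(): allowed while running and after stopping. */
static int model_can_get_stats(job_state s)
{
    return s != JOB_UNUSED;
}

/* Models dcgmJobRemove(): tracking ends and the id becomes reusable. */
static int model_remove(job_state *s)
{
    assert(s != NULL);
    if (*s == JOB_UNUSED)
        return -1;
    *s = JOB_UNUSED;
    return 0;
}
```

In this model, a start after a stop fails until model_remove has been called, which mirrors the statement that a stopped job id cannot start a new job but becomes reusable after dcgmJobRemove.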

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • jobId – IN: User provided string to represent the job

Returns:

dcgmReturn_t dcgmJobRemoveAll(dcgmHandle_t pDcgmHandle)

This API tells DCGM to stop tracking all jobs.

After this call, you will no longer be able to call dcgmJobGetStats() on any jobs until you call dcgmJobStartStats again. You will be able to reuse any previously used jobIds after this call.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

Returns: