1.6. Job Statistics

The client can invoke DCGM APIs to start and stop collecting stats at the process boundaries (during the job prologue and epilogue). This enables DCGM to monitor all PIDs while the job is in progress and to provide a summary of active processes and resource usage during the window of interest.
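The function descriptions in this section imply a small lifecycle for each jobId: started, stopped (stats still viewable, id not reusable), and removed (id reusable). The following is a toy model of those rules, not DCGM code. The names job_state, job_event, and job_apply are hypothetical. The handling of cases the text does not cover (stopping a job that is not running, removing an unknown or still-running job) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of one jobId's lifecycle as described in this section.
 * NOT part of DCGM; all names here are hypothetical. */
typedef enum { JOB_UNUSED, JOB_RUNNING, JOB_STOPPED } job_state;
typedef enum { EV_START, EV_STOP, EV_GET_STATS, EV_REMOVE } job_event;

/* Returns 1 if the event is allowed in *state (updating *state), else 0. */
static int job_apply(job_state *state, job_event ev)
{
    assert(state != NULL);
    switch (ev) {
    case EV_START:      /* dcgmJobStartStats: id must not already be in use */
        if (*state != JOB_UNUSED) return 0;
        *state = JOB_RUNNING;
        return 1;
    case EV_STOP:       /* dcgmJobStopStats (assumed: only a running job) */
        if (*state != JOB_RUNNING) return 0;
        *state = JOB_STOPPED;
        return 1;
    case EV_GET_STATS:  /* dcgmJobGetStats: while running or after stop */
        return *state != JOB_UNUSED;
    case EV_REMOVE:     /* dcgmJobRemove: frees the id for reuse
                         * (assumed: allowed for running or stopped jobs) */
        if (*state == JOB_UNUSED) return 0;
        *state = JOB_UNUSED;
        return 1;
    }
    return 0;
}
```

In this model, a start request on a stopped jobId is rejected until a remove event returns the id to the unused state, which mirrors the reuse rule stated for dcgmJobStopStats and dcgmJobRemove.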

Functions

dcgmReturn_t dcgmJobGetStats ( dcgmHandle_t pDcgmHandle, char  jobId[64], dcgmJobInfo_t* pJobInfo )
dcgmReturn_t dcgmJobRemove ( dcgmHandle_t pDcgmHandle, char  jobId[64] )
dcgmReturn_t dcgmJobRemoveAll ( dcgmHandle_t pDcgmHandle )
dcgmReturn_t dcgmJobStartStats ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, char  jobId[64] )
dcgmReturn_t dcgmJobStopStats ( dcgmHandle_t pDcgmHandle, char  jobId[64] )
dcgmReturn_t dcgmWatchJobFields ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, long long updateFreq, double  maxKeepAge, int  maxKeepSamples )

Functions

dcgmReturn_t dcgmJobGetStats ( dcgmHandle_t pDcgmHandle, char  jobId[64], dcgmJobInfo_t* pJobInfo )
Parameters
pDcgmHandle
IN: DCGM Handle
jobId
IN: User provided string to represent the job
pJobInfo
IN/OUT: Structure in which information about the job is returned. Set pJobInfo->version to dcgmJobInfo_version before this call.
Returns

Description

Get stats for the job identified by jobId. The stats can be retrieved at any point while the job is in progress. To reuse this jobId, call dcgmJobRemove() after this call.

dcgmReturn_t dcgmJobRemove ( dcgmHandle_t pDcgmHandle, char  jobId[64] )
Parameters
pDcgmHandle
IN: DCGM Handle
jobId
IN: User provided string to represent the job
Returns

Description

This API tells DCGM to stop tracking the job given by jobId. After this call, you will no longer be able to call dcgmJobGetStats() on this jobId. However, you will be able to reuse jobId after this call.

dcgmReturn_t dcgmJobRemoveAll ( dcgmHandle_t pDcgmHandle )
Parameters
pDcgmHandle
IN: DCGM Handle
Returns

Description

This API tells DCGM to stop tracking all jobs. After this call, you will no longer be able to call dcgmJobGetStats() on any job until you call dcgmJobStartStats() again. You will be able to reuse any previously used jobIds after this call.

dcgmReturn_t dcgmJobStartStats ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, char  jobId[64] )
Parameters
pDcgmHandle
IN: DCGM Handle
groupId
IN: Group ID representing collection of one or more GPUs. Look at dcgmGroupCreate for details on creating the group. Alternatively, pass in the group id as DCGM_GROUP_ALL_GPUS to perform operation on all the GPUs.
jobId
IN: User provided string to represent the job
Returns

Description

This API is used by the client to notify DCGM about the job to be started. Should be invoked as part of the job prologue.
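Because jobId is declared as char jobId[64], an identifier taken from a scheduler has to fit in that buffer. A minimal sketch of preparing such a buffer follows. The helper make_job_id and the macro JOB_ID_BUF_LEN are hypothetical, and the sketch assumes the string must be NUL-terminated within the 64 bytes, which this section does not state explicitly.

```c
#include <assert.h>
#include <string.h>

#define JOB_ID_BUF_LEN 64  /* matches the char jobId[64] parameter */

/* Hypothetical helper, not part of DCGM: copy src into a zero-filled
 * 64-byte buffer. Returns 0 on success, or -1 if src is NULL, empty,
 * or too long to fit together with its terminating NUL. */
static int make_job_id(char dst[JOB_ID_BUF_LEN], const char *src)
{
    assert(dst != NULL);
    if (src == NULL || src[0] == '\0')
        return -1;
    size_t len = strlen(src);
    if (len >= JOB_ID_BUF_LEN)
        return -1;                     /* reject rather than truncate silently */
    memset(dst, 0, JOB_ID_BUF_LEN);    /* zero fill guarantees termination */
    memcpy(dst, src, len);
    return 0;
}
```

Rejecting an over-long identifier, rather than truncating it, avoids two distinct jobs collapsing onto the same jobId.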

dcgmReturn_t dcgmJobStopStats ( dcgmHandle_t pDcgmHandle, char  jobId[64] )
Parameters
pDcgmHandle
IN: DCGM Handle
jobId
IN: User provided string to represent the job
Returns

Description

This API is used by the client to notify DCGM to stop collecting stats for the job identified by jobId. Should be invoked as part of the job epilogue. The jobId remains available for viewing the stats at any point but cannot be used to start a new job until it is removed with dcgmJobRemove(). You must call dcgmWatchJobFields() before this call to enable watching of the job.

dcgmReturn_t dcgmWatchJobFields ( dcgmHandle_t pDcgmHandle, dcgmGpuGrp_t groupId, long long updateFreq, double  maxKeepAge, int  maxKeepSamples )
Parameters
pDcgmHandle
IN: DCGM Handle
groupId
IN: Group ID representing collection of one or more GPUs. Look at dcgmGroupCreate for details on creating the group. Alternatively, pass in the group id as DCGM_GROUP_ALL_GPUS to perform operation on all the GPUs.
updateFreq
IN: How often to update this field, in microseconds
maxKeepAge
IN: How long to keep data for this field, in seconds
maxKeepSamples
IN: Maximum number of samples to keep. 0=no limit
Returns

  • DCGM_ST_OK if the call was successful
  • DCGM_ST_BADPARAM if a parameter is invalid
  • DCGM_ST_REQUIRES_ROOT if the host engine is being run as non-root, and accounting mode could not be enabled (requires root). Run "nvidia-smi -am 1" as root on the node before starting DCGM to fix this.

Description

Request that DCGM start recording stats for the fields that are queried with dcgmJobGetStats().

Note that the first update of the field will not occur until the next field update cycle. To force a field update cycle, call dcgmUpdateAllFields(1).
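Since updateFreq is given in microseconds while maxKeepAge is given in seconds, the two are easy to mix up. The sketch below builds both from values in seconds. The struct job_watch_params and the function make_job_watch_params are hypothetical helpers, not part of DCGM, and the choice of 0 (no limit) for maxKeepSamples is only an illustrative default.

```c
#include <assert.h>

/* Hypothetical helper type, not part of DCGM: the three tuning arguments
 * of dcgmWatchJobFields() in the units the parameter list describes. */
typedef struct {
    long long updateFreq;    /* microseconds between field updates */
    double    maxKeepAge;    /* seconds of history to retain */
    int       maxKeepSamples;/* 0 = no limit */
} job_watch_params;

/* Build the arguments from a sampling interval and a retention window,
 * both expressed in seconds. */
static job_watch_params make_job_watch_params(double interval_sec,
                                              double keep_sec)
{
    /* Retaining less than one sampling interval would keep no useful data. */
    assert(interval_sec > 0.0 && keep_sec >= interval_sec);

    job_watch_params p;
    p.updateFreq     = (long long)(interval_sec * 1000000.0 + 0.5); /* round */
    p.maxKeepAge     = keep_sec;
    p.maxKeepSamples = 0;   /* illustrative default: let maxKeepAge bound it */
    return p;
}
```

The resulting fields would then be passed, in order, as the updateFreq, maxKeepAge, and maxKeepSamples arguments of dcgmWatchJobFields().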