2.2.1. Discovery
[System]
The following APIs are used to discover the GPUs on a node and query their attributes.
Functions
- dcgmReturn_t dcgmGetAllDevices ( dcgmHandle_t pDcgmHandle, unsigned int gpuIdList[DCGM_MAX_NUM_DEVICES], int* count )
- dcgmReturn_t dcgmGetAllSupportedDevices ( dcgmHandle_t pDcgmHandle, unsigned int gpuIdList[DCGM_MAX_NUM_DEVICES], int* count )
- dcgmReturn_t dcgmGetDeviceAttributes ( dcgmHandle_t pDcgmHandle, unsigned int gpuId, dcgmDeviceAttributes_t* pDcgmAttr )
- dcgmReturn_t dcgmGetEntityGroupEntities ( dcgmHandle_t dcgmHandle, dcgm_field_entity_group_t entityGroup, dcgm_field_eid_t* entities, int* numEntities, unsigned int flags )
- dcgmReturn_t dcgmGetNvLinkLinkStatus ( dcgmHandle_t dcgmHandle, dcgmNvLinkStatus_v1* linkStatus )
Functions
- dcgmReturn_t dcgmGetAllDevices ( dcgmHandle_t pDcgmHandle, unsigned int gpuIdList[DCGM_MAX_NUM_DEVICES], int* count )
Parameters
- pDcgmHandle
- IN : DCGM Handle
- gpuIdList
- OUT : Array to be filled with the GPU Ids present on the system.
- count
- OUT : Number of GPUs returned in gpuIdList.
Returns
- DCGM_ST_OK if the call was successful.
- DCGM_ST_BADPARAM if gpuIdList or count was not valid.
Description
Gets the identifiers of all the devices on the system. Each identifier is the DCGM GPU Id of one GPU on the system and is immutable during the lifespan of the engine. The list should be queried again if the engine is restarted.
The list returned by this function includes the gpuIds of GPUs that are not supported by DCGM. To get only the gpuIds of GPUs that are supported by DCGM, use dcgmGetAllSupportedDevices().
- dcgmReturn_t dcgmGetAllSupportedDevices ( dcgmHandle_t pDcgmHandle, unsigned int gpuIdList[DCGM_MAX_NUM_DEVICES], int* count )
Parameters
- pDcgmHandle
- IN : DCGM Handle
- gpuIdList
- OUT : Array to be filled with the GPU Ids of the DCGM-supported GPUs present on the system.
- count
- OUT : Number of GPUs returned in gpuIdList.
Returns
- DCGM_ST_OK if the call was successful.
- DCGM_ST_BADPARAM if gpuIdList or count was not valid.
Description
Gets the identifiers of all the DCGM-supported devices on the system. Each identifier is the DCGM GPU Id of one supported GPU on the system and is immutable during the lifespan of the engine. The list should be queried again if the engine is restarted.
The list returned by this function includes only the gpuIds of GPUs that are supported by DCGM. To get the gpuIds of all GPUs in the system, use dcgmGetAllDevices().
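Because dcgmGetAllDevices() returns every GPU and dcgmGetAllSupportedDevices() returns only the supported subset, the GPUs that DCGM does not support can be found by subtracting one list from the other. The helper below is a minimal sketch of that step. The name `find_unsupported_gpus` is hypothetical and not part of DCGM; it works on plain arrays so that it compiles without the DCGM headers. In real code the two input arrays would be the ones filled by the two discovery calls.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a DCGM API): copies into `unsupported` every ID
 * that appears in `all` but not in `supported`, and returns how many IDs
 * were copied. `unsupported` must have room for at least `numAll` entries.
 *
 * In real code, `all` would be filled by dcgmGetAllDevices() and
 * `supported` by dcgmGetAllSupportedDevices().
 */
int find_unsupported_gpus(const unsigned int *all, int numAll,
                          const unsigned int *supported, int numSupported,
                          unsigned int *unsupported)
{
    int numUnsupported = 0;

    assert(numAll >= 0 && numSupported >= 0);

    for (int i = 0; i < numAll; i++) {
        int isSupported = 0;

        /* A linear scan is enough: each list holds at most
         * DCGM_MAX_NUM_DEVICES entries. */
        for (int j = 0; j < numSupported; j++) {
            if (all[i] == supported[j]) {
                isSupported = 1;
                break;
            }
        }
        if (!isSupported) {
            unsupported[numUnsupported++] = all[i];
        }
    }
    return numUnsupported;
}
```

The output order follows the order of `all`, so the result lines up with whatever order the first discovery call returned.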
- dcgmReturn_t dcgmGetDeviceAttributes ( dcgmHandle_t pDcgmHandle, unsigned int gpuId, dcgmDeviceAttributes_t* pDcgmAttr )
Parameters
- pDcgmHandle
- IN : DCGM Handle
- gpuId
- IN : GPU Id for which the attributes should be fetched.
- pDcgmAttr
- IN/OUT : Device attributes corresponding to gpuId. pDcgmAttr->version should be set to dcgmDeviceAttributes_version before this call.
Returns
- DCGM_ST_OK if the call was successful.
- DCGM_ST_VER_MISMATCH if pDcgmAttr->version is not set or is invalid.
Description
Gets the device attributes for the given gpuId. If an attribute cannot be retrieved, the corresponding field is populated with one of the DCGM_BLANK_VALUES defined in dcgm_structs.h.
- dcgmReturn_t dcgmGetEntityGroupEntities ( dcgmHandle_t dcgmHandle, dcgm_field_entity_group_t entityGroup, dcgm_field_eid_t* entities, int* numEntities, unsigned int flags )
Parameters
- dcgmHandle
- IN: DCGM Handle
- entityGroup
- IN: Entity group to list entities of
- entities
- OUT: Array of entities for entityGroup
- numEntities
- IN/OUT: Upon calling, this should be the number of entities that entities[] can hold. Upon return, this will contain the number of entities actually saved to entities[].
- flags
- IN: Flags to modify the behavior of this request. See DCGM_GEGE_FLAG_* defines in dcgm_structs.h
Returns
- DCGM_ST_OK if the call was successful.
- DCGM_ST_INSUFFICIENT_SIZE if numEntities was not large enough to hold the number of entities in the entityGroup. numEntities will contain the capacity needed to complete this request successfully.
- DCGM_ST_NOT_SUPPORTED if the given entityGroup does not support enumeration.
- DCGM_ST_BADPARAM if any parameter is invalid.
Description
Gets the list of entities that exist in a given entity group. This API can be used in place of dcgmGetAllDevices().
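The numEntities contract described above lends itself to a grow-and-retry loop: call once with a buffer of some starting capacity and, if the call reports DCGM_ST_INSUFFICIENT_SIZE, reallocate to the capacity written back into numEntities and call again. The sketch below runs that loop against a callback rather than against DCGM directly, so it compiles without the DCGM headers. Every name in it (`list_status_t`, `list_entities_fn`, `list_all_entities`, `fake_list`) is hypothetical, and `unsigned int` is used as a stand-in for dcgm_field_eid_t. In real code the callback would be a thin wrapper that calls dcgmGetEntityGroupEntities() and maps its return code onto the three statuses.

```c
#include <stdlib.h>

/* Hypothetical status codes standing in for DCGM_ST_OK,
 * DCGM_ST_INSUFFICIENT_SIZE, and any other dcgmReturn_t value. */
typedef enum { LIST_OK, LIST_INSUFFICIENT_SIZE, LIST_ERROR } list_status_t;

/* Hypothetical callback with the same numEntities contract as
 * dcgmGetEntityGroupEntities(): on entry *numEntities is the capacity of
 * entities[]; on return it is the count stored, or the capacity needed. */
typedef list_status_t (*list_entities_fn)(void *ctx, unsigned int *entities,
                                          int *numEntities);

/*
 * Calls `list` with a growing buffer until the result fits. On success,
 * returns a malloc'd array (the caller frees it) and stores its length in
 * *count. Returns NULL on a hard error or allocation failure.
 */
unsigned int *list_all_entities(list_entities_fn list, void *ctx,
                                int initialCapacity, int *count)
{
    int capacity = initialCapacity > 0 ? initialCapacity : 1;
    unsigned int *buf = NULL;

    /* Bounded retries, in case the entity count keeps changing. */
    for (int attempt = 0; attempt < 4; attempt++) {
        unsigned int *grown = realloc(buf, (size_t)capacity * sizeof *buf);
        if (grown == NULL) {
            break;
        }
        buf = grown;

        int n = capacity;
        list_status_t st = list(ctx, buf, &n);
        if (st == LIST_OK) {
            *count = n;
            return buf;
        }
        if (st != LIST_INSUFFICIENT_SIZE || n <= capacity) {
            break; /* hard error, or no usable size hint */
        }
        capacity = n; /* retry with the capacity the callee asked for */
    }
    free(buf);
    return NULL;
}

/* Demo callback for trying the loop out: pretends the entity group holds
 * *ctx entities with IDs 0 .. N-1. */
list_status_t fake_list(void *ctx, unsigned int *entities, int *numEntities)
{
    int total = *(int *)ctx;

    if (*numEntities < total) {
        *numEntities = total;
        return LIST_INSUFFICIENT_SIZE;
    }
    for (int i = 0; i < total; i++) {
        entities[i] = (unsigned int)i;
    }
    *numEntities = total;
    return LIST_OK;
}
```

Starting from a small capacity and letting the callee report the required size avoids hard-coding an upper bound for entity groups whose maximum size is not known in advance.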
- dcgmReturn_t dcgmGetNvLinkLinkStatus ( dcgmHandle_t dcgmHandle, dcgmNvLinkStatus_v1* linkStatus )
Parameters
- dcgmHandle
- IN: DCGM Handle
- linkStatus
- OUT: Structure in which to store NvLink link statuses. linkStatus->version should be set to dcgmNvLinkStatus_version1 before calling this function.
Returns
- DCGM_ST_OK if the call was successful.
- DCGM_ST_NOT_SUPPORTED if the given entityGroup does not support enumeration.
- DCGM_ST_BADPARAM if any parameter is invalid.
Description
Gets the NvLink link status for every NvLink in the system, including the NvLinks of both GPUs and NvSwitches. Note that only NvSwitches and GPUs that are visible to the current environment are returned in this structure.