Administrative

group DCGMAPI_Admin

This chapter describes the administration interfaces for DCGM.

It is the user’s responsibility to call dcgmInit() before calling any other methods, and dcgmShutdown() once DCGM is no longer being used. The APIs in the Administrative module fall into the following categories:

Init and Shutdown

group DCGMAPI_Admin_InitShut

Describes APIs to initialize and shut down the DCGM Engine.

Functions

dcgmReturn_t dcgmInit(void)

This method is used to initialize DCGM within this process.

This must be called before dcgmStartEmbedded() or dcgmConnect().

Returns:

dcgmReturn_t dcgmShutdown(void)

This method is used to shut down DCGM.

Any embedded host engines or remote connections will automatically be shut down as well.

Returns:

dcgmReturn_t dcgmStartEmbedded(dcgmOperationMode_t opMode, dcgmHandle_t *pDcgmHandle)

Start an embedded host engine agent within this process.

The agent is loaded as a shared library. This mode is provided to avoid the extra jitter of managing an additional autonomous agent. In this mode, the user has to periodically call APIs such as dcgmPolicyTrigger and dcgmUpdateAllFields, which tell DCGM to wake up and perform the data collection and operations needed for policy management.
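The call order described here (initialize, start the embedded agent, drive updates periodically, stop, shut down) can be sketched as follows. This is a minimal illustration, not a reference implementation. So that it compiles without the DCGM SDK, it uses stand-in stubs unless the hypothetical macro EXAMPLE_USE_REAL_DCGM is defined. The stubs, the header names dcgm_agent.h and dcgm_structs.h, the constant DCGM_OPERATION_MODE_MANUAL, and the dcgmUpdateAllFields(handle, waitForUpdate) signature are assumptions not stated in this chapter; example_run_manual_cycles and example_update_calls are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#ifdef EXAMPLE_USE_REAL_DCGM
/* Assumed header names for the real library. */
#include <dcgm_agent.h>
#include <dcgm_structs.h>
#else
/* Stand-in stubs (NOT the real implementation): they only succeed and
 * count update calls, so the call order can be exercised anywhere. */
typedef int dcgmReturn_t;
typedef uintptr_t dcgmHandle_t;
typedef int dcgmOperationMode_t;
enum { DCGM_ST_OK = 0 };
enum { DCGM_OPERATION_MODE_MANUAL = 2 }; /* assumed constant */
static int example_update_calls = 0;

static dcgmReturn_t dcgmInit(void) { return DCGM_ST_OK; }
static dcgmReturn_t dcgmShutdown(void) { return DCGM_ST_OK; }
static dcgmReturn_t dcgmStartEmbedded(dcgmOperationMode_t opMode, dcgmHandle_t *pDcgmHandle)
{
    (void)opMode;
    *pDcgmHandle = 1; /* arbitrary non-zero stand-in handle */
    return DCGM_ST_OK;
}
static dcgmReturn_t dcgmStopEmbedded(dcgmHandle_t pDcgmHandle)
{
    (void)pDcgmHandle;
    return DCGM_ST_OK;
}
/* Assumed signature: (handle, waitForUpdate). */
static dcgmReturn_t dcgmUpdateAllFields(dcgmHandle_t pDcgmHandle, int waitForUpdate)
{
    (void)pDcgmHandle;
    (void)waitForUpdate;
    example_update_calls++;
    return DCGM_ST_OK;
}
#endif

/* Runs `cycles` manual update passes on an embedded host engine and
 * returns the first error encountered, or DCGM_ST_OK. */
static dcgmReturn_t example_run_manual_cycles(int cycles)
{
    assert(cycles >= 0);

    dcgmHandle_t handle = 0;
    dcgmReturn_t rc = dcgmInit(); /* must precede dcgmStartEmbedded */
    if (rc != DCGM_ST_OK)
        return rc;

    rc = dcgmStartEmbedded(DCGM_OPERATION_MODE_MANUAL, &handle);
    if (rc == DCGM_ST_OK) {
        /* The caller is responsible for waking DCGM up periodically. */
        for (int i = 0; i < cycles && rc == DCGM_ST_OK; ++i)
            rc = dcgmUpdateAllFields(handle, 1);

        dcgmReturn_t stopRc = dcgmStopEmbedded(handle);
        if (rc == DCGM_ST_OK)
            rc = stopRc;
    }

    dcgmShutdown(); /* always pair dcgmInit with dcgmShutdown */
    return rc;
}
```

A real application would typically sleep between passes and also call dcgmPolicyTrigger where policy management is in use; both are omitted here to keep the sketch short.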

Parameters:
  • opMode – IN: Collect data automatically or manually when asked by the user.

  • pDcgmHandle – OUT: DCGM Handle to use for API calls

Returns:

dcgmReturn_t dcgmStartEmbedded_v2(dcgmStartEmbeddedV2Params_v1 *params)

Start an embedded host engine agent within this process.

The agent is loaded as a shared library. This mode is provided to avoid the extra jitter of managing an additional autonomous agent. In this mode, the user has to periodically call APIs such as dcgmPolicyTrigger and dcgmUpdateAllFields, which tell DCGM to wake up and perform the data collection and operations needed for policy management.

Note

This function takes a versioned argument and can be called with either of two struct types. The behavior depends on the params->version value.
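To illustrate how a single entry point can accept two struct types distinguished by a leading version field, here is a minimal sketch. Every name in it (exampleParams_v1, exampleParams_v2, EXAMPLE_MAKE_VERSION, example_start, the extra field) is hypothetical, and the version encoding (struct size in the low bits, revision number in the high byte) is an assumption chosen for illustration; it is not a description of the actual DCGM structs or of how the library dispatches internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical version tag: struct size in the low 24 bits, revision above. */
#define EXAMPLE_MAKE_VERSION(type, rev) \
    ((unsigned int)(sizeof(type) | ((unsigned int)(rev) << 24)))

/* Hypothetical first revision of the parameter struct. */
typedef struct {
    unsigned int version; /* must be the first member in every revision */
    int opMode;
} exampleParams_v1;

/* Hypothetical second revision: same prefix plus one extra field. */
typedef struct {
    unsigned int version;
    int opMode;
    const char *extra;
} exampleParams_v2;

#define EXAMPLE_PARAMS_VERSION1 EXAMPLE_MAKE_VERSION(exampleParams_v1, 1)
#define EXAMPLE_PARAMS_VERSION2 EXAMPLE_MAKE_VERSION(exampleParams_v2, 2)

/* Inspects the version field and returns the revision that was handled
 * (1 or 2), or -1 if the pointer is NULL or the version is unknown. */
static int example_start(void *params)
{
    /* The dispatch below relies on version being at offset 0. */
    assert(offsetof(exampleParams_v1, version) == 0);
    assert(offsetof(exampleParams_v2, version) == 0);

    if (params == NULL)
        return -1;

    unsigned int version;
    memcpy(&version, params, sizeof version);

    if (version == EXAMPLE_PARAMS_VERSION1) {
        exampleParams_v1 *p = params;
        (void)p->opMode; /* only v1 fields may be touched here */
        return 1;
    }
    if (version == EXAMPLE_PARAMS_VERSION2) {
        exampleParams_v2 *p = params;
        (void)p->extra; /* v2-only field is safe to read here */
        return 2;
    }
    return -1; /* unknown or uninitialized version */
}
```

The practical consequence for callers is that the version field has to be set to the constant matching the struct type actually passed, before the call is made.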

Parameters:

params – IN/OUT: A pointer to either dcgmStartEmbeddedV2Params_v1 or dcgmStartEmbeddedV2Params_v2.

Returns:

  • DCGM_ST_OK if DCGM was started successfully within our process

  • DCGM_ST_UNINITIALIZED if DCGM has not been initialized with dcgmInit yet

dcgmReturn_t dcgmStopEmbedded(dcgmHandle_t pDcgmHandle)

Stop the embedded host engine within this process that was started with dcgmStartEmbedded.

Parameters:

pDcgmHandle – IN: DCGM Handle of the embedded host engine that came from dcgmStartEmbedded

Returns:

dcgmReturn_t dcgmConnect(const char *ipAddress, dcgmHandle_t *pDcgmHandle)

This method is used to connect to a stand-alone host engine process.

Remote host engines are started by running the nv-hostengine command.

NOTE: dcgmConnect_v2 provides additional connection options.

Parameters:
  • ipAddress – IN: Valid IP address for the remote host engine to connect to. If ipAddress is specified as x.x.x.x, it will attempt to connect to the default port specified by DCGM_HE_PORT_NUMBER. If ipAddress is specified as x.x.x.x:yyyy, it will attempt to connect to the port specified by yyyy.

  • pDcgmHandle – OUT: DCGM Handle of the remote host engine
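The two accepted address forms (x.x.x.x and x.x.x.x:yyyy) can be illustrated with a small parsing sketch. This is not how the library itself parses the string; example_split_address and EXAMPLE_DEFAULT_PORT are hypothetical names, and the value 5555 is an assumed stand-in for DCGM_HE_PORT_NUMBER that should be checked against the installed headers.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: stand-in for DCGM_HE_PORT_NUMBER. */
#define EXAMPLE_DEFAULT_PORT 5555u

/* Splits "host" or "host:port" into its parts.
 * Returns 0 on success, -1 if the input is malformed or host is too long. */
static int example_split_address(const char *addr, char *host, size_t hostLen,
                                 unsigned int *port)
{
    assert(addr != NULL && host != NULL && port != NULL && hostLen > 0);

    const char *colon = strchr(addr, ':');
    size_t n = colon ? (size_t)(colon - addr) : strlen(addr);
    if (n == 0 || n >= hostLen)
        return -1;

    memcpy(host, addr, n);
    host[n] = '\0';

    if (colon == NULL) {
        /* No explicit port: fall back to the default. */
        *port = EXAMPLE_DEFAULT_PORT;
        return 0;
    }

    /* Require the port to start with a digit (rejects "", " 80", "-1"). */
    if (!isdigit((unsigned char)colon[1]))
        return -1;

    char *end = NULL;
    unsigned long p = strtoul(colon + 1, &end, 10);
    if (*end != '\0' || p == 0 || p > 65535)
        return -1;

    *port = (unsigned int)p;
    return 0;
}
```

A helper like this can be useful for validating user input or logging the effective port before handing the original string to dcgmConnect unchanged.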

Returns:

dcgmReturn_t dcgmConnect_v2(const char *ipAddress, dcgmConnectV2Params_t *connectParams, dcgmHandle_t *pDcgmHandle)

This method is used to connect to a stand-alone host engine process.

Remote host engines are started by running the nv-hostengine command.

Parameters:
  • ipAddress – IN: Valid IP address for the remote host engine to connect to. If ipAddress is specified as x.x.x.x, it will attempt to connect to the default port specified by DCGM_HE_PORT_NUMBER. If ipAddress is specified as x.x.x.x:yyyy, it will attempt to connect to the port specified by yyyy.

  • connectParams – IN: Additional connection parameters. See dcgmConnectV2Params_t for details.

  • pDcgmHandle – OUT: DCGM Handle of the remote host engine

Returns:

dcgmReturn_t dcgmDisconnect(dcgmHandle_t pDcgmHandle)

This method is used to disconnect from a stand-alone host engine process.

Parameters:

pDcgmHandle – IN: DCGM Handle that came from dcgmConnect

Returns:

Auxiliary information about the DCGM engine

group DCGMAPI_Admin_Info

Describes APIs to get generic information about the DCGM Engine.

Functions

dcgmReturn_t dcgmVersionInfo(dcgmVersionInfo_t *pVersionInfo)

This method is used to return information about the build environment where DCGM was built.

Parameters:

pVersionInfo – OUT: Build environment information

Returns:

dcgmReturn_t dcgmHostengineVersionInfo(dcgmHandle_t pDcgmHandle, dcgmVersionInfo_t *pVersionInfo)

This method is used to return information about the build environment of the hostengine.

Parameters:
  • pDcgmHandle – IN: DCGM Handle that came from dcgmConnect

  • pVersionInfo – OUT: Build environment information

Returns:

dcgmReturn_t dcgmHostengineSetLoggingSeverity(dcgmHandle_t pDcgmHandle, dcgmSettingsSetLoggingSeverity_t *logging)

This method is used to set the logging severity on the host engine for the specified logger.

Parameters:
  • pDcgmHandle – IN: DCGM Handle

  • logging – IN: dcgmSettingsSetLoggingSeverity_t struct containing the target logger and severity

Returns:

dcgmReturn_t dcgmHostengineIsHealthy(dcgmHandle_t pDcgmHandle, dcgmHostengineHealth_t *heHealth)

This function is used to return whether or not the host engine considers itself healthy.

Parameters:
  • pDcgmHandle – IN: The handle to DCGM

  • heHealth – OUT: Struct describing the health of the host engine. If heHealth.hostengineHealth is 0, the host engine is healthy. A non-zero value indicates that it is not healthy, with the error code identifying the cause.

Returns:

const char *errorString(dcgmReturn_t result)

This function describes DCGM error codes in human-readable form.

Parameters:

result – IN: DCGM return code to describe

Returns:

  • Human-readable string with the DCGM error code description if the code is valid.

  • nullptr if there is no such error code.

dcgmReturn_t dcgmModuleIdToName(dcgmModuleId_t id, char const **name)

This function returns the name of the DCGM module with the given module ID.

Parameters:
  • id – IN: Module ID to name.

  • name – OUT: Module name will be provided via this argument.

Returns: