Enums and Macros

group dcgmReturnEnums

Defines

MAKE_DCGM_VERSION(typeName, ver) (unsigned int)(sizeof(typeName) | ((unsigned long)(ver) << 24U))

Creates a unique version number for each struct.
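As a rough illustration of how the packed value is laid out, the sketch below copies the macro from the definition above and applies it to a made-up struct. The struct `exampleStruct_t`, the constant `EXAMPLE_STRUCT_VERSION1`, and the two helper functions are hypothetical names introduced only for this example; they are not part of DCGM. The sketch assumes the struct is smaller than 2 ** 24 bytes, so the size and the revision do not overlap.

```c
#include <stddef.h>

/* Macro copied from the definition above. */
#define MAKE_DCGM_VERSION(typeName, ver) \
    (unsigned int)(sizeof(typeName) | ((unsigned long)(ver) << 24U))

/* Hypothetical struct, for illustration only; not part of DCGM. */
typedef struct
{
    unsigned int version; /* the caller would store the packed version here */
    int payload[4];
} exampleStruct_t;

/* Packed version for revision 1 of the hypothetical struct. */
#define EXAMPLE_STRUCT_VERSION1 MAKE_DCGM_VERSION(exampleStruct_t, 1)

/* Hypothetical helper: the low 24 bits hold the struct size in bytes. */
static unsigned int versionStructSize(unsigned int packed)
{
    return packed & 0x00ffffffU;
}

/* Hypothetical helper: the bits from position 24 upward hold the revision. */
static unsigned int versionRevision(unsigned int packed)
{
    return packed >> 24;
}
```

Because the struct size is folded into the value, two builds that disagree about the layout of a struct produce different version numbers even when the revision is the same.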

DCGM_BLANK_VALUES

Represents the values of a field that the Host Engine can return when the operation is not successful.

DCGM_INT32_BLANK 0x7ffffff0

Base value for a 32-bit integer blank.

Can be used as an unspecified blank.

DCGM_INT64_BLANK 0x7ffffffffffffff0

Base value for a 64-bit integer blank.

Can be used as an unspecified blank.

DCGM_FP64_BLANK 140737488355328.0

Base value for double blank.

Equal to 2 ** 47. FP64 has 52 bits of mantissa, so a value of magnitude 2 ** 47 can still be incremented by 1 and represent each offset from 0 to 15 exactly.
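The representability claim can be checked with a small sketch, assuming IEEE 754 double precision. The macros are copied from this page; the function `fp64BlankOffsetsAreExact` is a hypothetical helper written for this example and is not part of DCGM.

```c
/* Macros copied from the definitions on this page. */
#define DCGM_FP64_BLANK 140737488355328.0
#define DCGM_FP64_NOT_FOUND (DCGM_FP64_BLANK + 1.0)
#define DCGM_FP64_IS_BLANK(val) (((val) >= DCGM_FP64_BLANK ? 1 : 0))

/*
 * Hypothetical helper: returns 1 if every code DCGM_FP64_BLANK + k,
 * for k = 0..15, is stored exactly and is detected as blank.
 */
static int fp64BlankOffsetsAreExact(void)
{
    int k;

    for (k = 0; k <= 15; k++)
    {
        /* volatile forces the sum to be stored as a 64-bit double */
        volatile double code = DCGM_FP64_BLANK + (double)k;

        /* If the sum had been rounded, the difference would not equal k. */
        if (code - DCGM_FP64_BLANK != (double)k)
            return 0;

        if (!DCGM_FP64_IS_BLANK(code))
            return 0;
    }

    return 1;
}
```

At 2 ** 47 the spacing between adjacent doubles is 2 ** -5, so whole-number offsets are far from any rounding boundary.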

DCGM_STR_BLANK "<<<NULL>>>"

Base value for string blank.

DCGM_INT32_NOT_FOUND (DCGM_INT32_BLANK + 1)

Represents an error where INT32 data was not found.

DCGM_INT64_NOT_FOUND (DCGM_INT64_BLANK + 1)

Represents an error where INT64 data was not found.

DCGM_FP64_NOT_FOUND (DCGM_FP64_BLANK + 1.0)

Represents an error where FP64 data was not found.

DCGM_STR_NOT_FOUND "<<<NOT_FOUND>>>"

Represents an error where STR data was not found.

DCGM_INT32_NOT_SUPPORTED (DCGM_INT32_BLANK + 2)

Represents an error where fetching the INT32 value is not supported.

DCGM_INT64_NOT_SUPPORTED (DCGM_INT64_BLANK + 2)

Represents an error where fetching the INT64 value is not supported.

DCGM_FP64_NOT_SUPPORTED (DCGM_FP64_BLANK + 2.0)

Represents an error where fetching the FP64 value is not supported.

DCGM_STR_NOT_SUPPORTED "<<<NOT_SUPPORTED>>>"

Represents an error where fetching the STR value is not supported.

DCGM_INT32_NOT_PERMISSIONED (DCGM_INT32_BLANK + 3)

Represents an error where fetching the INT32 value is not allowed with our current credentials.

DCGM_INT64_NOT_PERMISSIONED (DCGM_INT64_BLANK + 3)

Represents an error where fetching the INT64 value is not allowed with our current credentials.

DCGM_FP64_NOT_PERMISSIONED (DCGM_FP64_BLANK + 3.0)

Represents an error where fetching the FP64 value is not allowed with our current credentials.

DCGM_STR_NOT_PERMISSIONED "<<<NOT_PERM>>>"

Represents an error where fetching the STR value is not allowed with our current credentials.

DCGM_INT32_IS_BLANK(val) (((val) >= DCGM_INT32_BLANK) ? 1 : 0)

Macro to check if an INT32 value is blank or not.

DCGM_INT64_IS_BLANK(val) (((val) >= DCGM_INT64_BLANK) ? 1 : 0)

Macro to check if an INT64 value is blank or not.

DCGM_FP64_IS_BLANK(val) (((val) >= DCGM_FP64_BLANK ? 1 : 0))

Macro to check if an FP64 value is blank or not.

DCGM_STR_IS_BLANK(val) (val == strstr(val, "<<<") && strstr(val, ">>>"))

Macro to check if a STR value is blank or not. Works on (char *).

Looks for <<< at first position and >>> inside string
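A minimal sketch of how the blank checks and the specific blank codes might be used together is shown below. The macros are copied from this page; `describeInt64` and `isBlankString` are hypothetical helpers introduced for this example and are not part of DCGM. Note that DCGM_STR_IS_BLANK expands its argument more than once and compares pointers, so the sketch passes it a plain pointer variable rather than an expression or a string literal.

```c
#include <string.h>

/* Macros copied from the definitions on this page. */
#define DCGM_INT64_BLANK 0x7ffffffffffffff0
#define DCGM_INT64_NOT_FOUND (DCGM_INT64_BLANK + 1)
#define DCGM_INT64_NOT_SUPPORTED (DCGM_INT64_BLANK + 2)
#define DCGM_INT64_NOT_PERMISSIONED (DCGM_INT64_BLANK + 3)
#define DCGM_INT64_IS_BLANK(val) (((val) >= DCGM_INT64_BLANK) ? 1 : 0)
#define DCGM_STR_IS_BLANK(val) (val == strstr(val, "<<<") && strstr(val, ">>>"))

/* Hypothetical helper: maps an INT64 field value to a short description. */
static const char *describeInt64(long long val)
{
    /* Anything below the blank base is an ordinary value. */
    if (!DCGM_INT64_IS_BLANK(val))
        return "valid";

    if (val == DCGM_INT64_NOT_FOUND)
        return "not found";
    if (val == DCGM_INT64_NOT_SUPPORTED)
        return "not supported";
    if (val == DCGM_INT64_NOT_PERMISSIONED)
        return "not permissioned";

    /* The base value and any other code at or above it. */
    return "blank";
}

/* Hypothetical helper: wraps the string macro so it always sees a variable. */
static int isBlankString(const char *s)
{
    return DCGM_STR_IS_BLANK(s) ? 1 : 0;
}
```

Checking the general IS_BLANK macro first and the specific codes second keeps ordinary values on the fast path and still treats unrecognized blank codes as blank.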

DCGM_MAX_NUM_DEVICES 32 /* DCGM 2.0 and newer = 32. DCGM 1.8 and older = 16. */

Max number of GPUs supported by DCGM.

Number of NvLink links per GPU supported by DCGM: 18 for Hopper, 12 for Ampere, 6 for Volta, and 4 for Pascal.

Number of nvlink errors supported by DCGM.

NVML_NVLINK_ERROR_DL_ECC_DATA not currently supported

See also

NVML_NVLINK_ERROR_COUNT

Number of nvlink error types:

See also

NVML_NVLINK_ERROR_COUNT TODO: update with refactor of ampere-next nvlink APIs (JIRA DCGM-2628)

Maximum NvLink links pre-Ampere.

Maximum NvLink links pre-Hopper.

DCGM_MAX_NUM_SWITCHES 12

Max number of NvSwitches supported by DCGM.

DCGM_MAX_XID_INFO 10

Max number of XID info to store.

Number of NvLink links per NvSwitch supported by DCGM.

Number of Lines per NvSwitch NvLink supported by DCGM.

DCGM_MAX_VGPU_INSTANCES_PER_PGPU 32

Maximum number of vGPU instances per physical GPU.

DCGM_MAX_NUM_CPUS 8

Max number of CPU nodes.

DCGM_MAX_NUM_CPU_CORES 1024

Max number of CPU cores.

DCGM_MAX_STR_LENGTH 256

Max length of the DCGM string field.

DCGM_MAX_AGE_USEC_DEFAULT 30000000

Default maximum age of samples kept, in microseconds (30 seconds).

DCGM_MAX_CLOCKS 256

Max number of clocks supported for a device.

DCGM_MAX_NUM_GROUPS 64

Max limit on the number of groups supported by DCGM.

DCGM_MAX_FBC_SESSIONS 256

Max number of active FBC sessions.

DCGM_VGPU_NAME_BUFFER_SIZE 64

Represents the size of a buffer that holds a vGPU type name, a vGPU class type, or the name of a process running on a vGPU instance.

DCGM_GRID_LICENSE_BUFFER_SIZE 128

Represents the size of a buffer that holds a vGPU license string.

DCGM_CONFIG_COMPUTEMODE_DEFAULT 0

Default compute mode: multiple contexts per device.

DCGM_CONFIG_COMPUTEMODE_PROHIBITED 1

Compute-prohibited mode: no contexts per device.

DCGM_CONFIG_COMPUTEMODE_EXCLUSIVE_PROCESS 2

Compute-exclusive-process mode: only one context per device, usable from multiple threads at a time.

DCGM_HE_PORT_NUMBER 5555

Default Port Number for DCGM Host Engine.

DCGM_GROUP_ALL_GPUS 0x7fffffff

Identifiers for special DCGM groups.

DCGM_GROUP_ALL_NVSWITCHES 0x7ffffffe
DCGM_GROUP_ALL_INSTANCES 0x7ffffffd
DCGM_GROUP_ALL_COMPUTE_INSTANCES 0x7ffffffc
DCGM_GROUP_ALL_ENTITIES 0x7ffffffb
DCGM_GROUP_MAX_ENTITIES 64

Maximum number of entities per entity group.

Typedefs

typedef enum dcgmOperationMode_enum dcgmOperationMode_t

Operation mode for DCGM.

DCGM can run in auto-mode, where it runs additional threads in the background to collect any metrics of interest and automatically manages any operations needed for policy management.

DCGM can also operate in manual-mode, where its execution is controlled by the user. In this mode, the user has to periodically call APIs such as dcgmPolicyTrigger and dcgmUpdateAllFields, which tell DCGM to wake up and perform data collection and operations needed for policy management.

typedef enum dcgmOrder_enum dcgmOrder_t

When more than one value is returned from a query, which order should it be returned in?

typedef enum dcgmReturn_enum dcgmReturn_t

Return values for DCGM API calls.

typedef enum dcgmGroupType_enum dcgmGroupType_t

Type of GPU groups.

typedef enum dcgmChipArchitecture_enum dcgmChipArchitecture_t

Simplified chip architecture.

Note that these are made to match nvmlChipArchitecture_t and thus do not start at 0.

typedef enum dcgmConfigType_enum dcgmConfigType_t

Represents the type of configuration to be fetched from the GPUs.

typedef enum dcgmConfigPowerLimitType_enum dcgmConfigPowerLimitType_t

Represents the power cap for each member of the group.

Enums

enum dcgmOperationMode_enum

Operation mode for DCGM.

DCGM can run in auto-mode, where it runs additional threads in the background to collect any metrics of interest and automatically manages any operations needed for policy management.

DCGM can also operate in manual-mode, where its execution is controlled by the user. In this mode, the user has to periodically call APIs such as dcgmPolicyTrigger and dcgmUpdateAllFields, which tell DCGM to wake up and perform data collection and operations needed for policy management.

Values:

enumerator DCGM_OPERATION_MODE_AUTO
enumerator DCGM_OPERATION_MODE_MANUAL

enum dcgmOrder_enum

When more than one value is returned from a query, which order should it be returned in?

Values:

enumerator DCGM_ORDER_ASCENDING

Data with earliest (lowest) timestamps returned first.

enumerator DCGM_ORDER_DESCENDING

Data with latest (highest) timestamps returned first.

enum dcgmReturn_enum

Return values for DCGM API calls.

Values:

enumerator DCGM_ST_OK

Success.

enumerator DCGM_ST_BADPARAM

A bad parameter was passed to a function.

enumerator DCGM_ST_GENERIC_ERROR

A generic, unspecified error.

enumerator DCGM_ST_MEMORY

An out of memory error occurred.

enumerator DCGM_ST_NOT_CONFIGURED

Setting not configured.

enumerator DCGM_ST_NOT_SUPPORTED

Feature not supported.

enumerator DCGM_ST_INIT_ERROR

DCGM Init error.

enumerator DCGM_ST_NVML_ERROR

NVML returned an error.

enumerator DCGM_ST_PENDING

Object is in a pending state, waiting on something else.

enumerator DCGM_ST_UNINITIALIZED

Object is in an undefined state.

enumerator DCGM_ST_TIMEOUT

Requested operation timed out.

enumerator DCGM_ST_VER_MISMATCH

Version mismatch between received and understood API.

enumerator DCGM_ST_UNKNOWN_FIELD

Unknown field id.

enumerator DCGM_ST_NO_DATA

No data is available.

enumerator DCGM_ST_STALE_DATA

Data is considered stale.

enumerator DCGM_ST_NOT_WATCHED

The given field id is not being updated by the cache manager.

enumerator DCGM_ST_NO_PERMISSION

Do not have permission to perform the desired action.

enumerator DCGM_ST_GPU_IS_LOST

GPU is no longer reachable.

enumerator DCGM_ST_RESET_REQUIRED

GPU requires a reset.

enumerator DCGM_ST_FUNCTION_NOT_FOUND

The function that was requested was not found (bindings-only error).

enumerator DCGM_ST_CONNECTION_NOT_VALID

The connection to the host engine is not valid any longer.

enumerator DCGM_ST_GPU_NOT_SUPPORTED

This GPU is not supported by DCGM.

enumerator DCGM_ST_GROUP_INCOMPATIBLE

The GPUs of the provided group are not compatible with each other for the requested operation.

enumerator DCGM_ST_MAX_LIMIT

Max limit reached for the object.

enumerator DCGM_ST_LIBRARY_NOT_FOUND

DCGM library could not be found.

enumerator DCGM_ST_DUPLICATE_KEY

Duplicate key passed to a function.

enumerator DCGM_ST_GPU_IN_SYNC_BOOST_GROUP

GPU is already a part of a sync boost group.

enumerator DCGM_ST_GPU_NOT_IN_SYNC_BOOST_GROUP

GPU is not a part of a sync boost group.

enumerator DCGM_ST_REQUIRES_ROOT

This operation cannot be performed when the host engine is running as non-root.

enumerator DCGM_ST_NVVS_ERROR

DCGM GPU Diagnostic was successfully executed, but reported an error.

enumerator DCGM_ST_INSUFFICIENT_SIZE

An input argument is not large enough.

enumerator DCGM_ST_FIELD_UNSUPPORTED_BY_API

The given field ID is not supported by the API being called.

enumerator DCGM_ST_MODULE_NOT_LOADED

This request is serviced by a module of DCGM that is not currently loaded.

enumerator DCGM_ST_IN_USE

The requested operation could not be completed because the affected resource is in use.

enumerator DCGM_ST_GROUP_IS_EMPTY

This group is empty and the requested operation is not valid on an empty group.

enumerator DCGM_ST_PROFILING_NOT_SUPPORTED

Profiling is not supported for this group of GPUs or GPU.

enumerator DCGM_ST_PROFILING_LIBRARY_ERROR

The third-party Profiling module returned an unrecoverable error.

enumerator DCGM_ST_PROFILING_MULTI_PASS

The requested profiling metrics cannot be collected in a single pass.

enumerator DCGM_ST_DIAG_ALREADY_RUNNING

A diag instance is already running; a new diag cannot run until the current one finishes.

enumerator DCGM_ST_DIAG_BAD_JSON

The DCGM GPU Diagnostic returned JSON that cannot be parsed.

enumerator DCGM_ST_DIAG_BAD_LAUNCH

Error while launching the DCGM GPU Diagnostic.

enumerator DCGM_ST_DIAG_UNUSED

Unused.

enumerator DCGM_ST_DIAG_THRESHOLD_EXCEEDED

A field value met or exceeded the error threshold.

enumerator DCGM_ST_INSUFFICIENT_DRIVER_VERSION

The installed driver version is insufficient for this API.

enumerator DCGM_ST_INSTANCE_NOT_FOUND

The specified GPU instance does not exist.

enumerator DCGM_ST_COMPUTE_INSTANCE_NOT_FOUND

The specified GPU compute instance does not exist.

enumerator DCGM_ST_CHILD_NOT_KILLED

Could not kill a child process within the allotted retries.

enumerator DCGM_ST_3RD_PARTY_LIBRARY_ERROR

Detected an error in a 3rd-party library.

enumerator DCGM_ST_INSUFFICIENT_RESOURCES

Not enough resources available.

enumerator DCGM_ST_PLUGIN_EXCEPTION

Exception thrown from a diagnostic plugin.

enumerator DCGM_ST_NVVS_ISOLATE_ERROR

The diagnostic returned an error that indicates the need for isolation.

enumerator DCGM_ST_NVVS_BINARY_NOT_FOUND

The NVVS binary was not found in the specified location.

enumerator DCGM_ST_NVVS_KILLED

The NVVS process was killed by a signal.

enumerator DCGM_ST_PAUSED

The hostengine and all modules are paused.

enumerator DCGM_ST_ALREADY_INITIALIZED

The object is already initialized.

enum dcgmGroupType_enum

Type of GPU groups.

Values:

enumerator DCGM_GROUP_DEFAULT

All the GPUs on the node are added to the group.

enumerator DCGM_GROUP_EMPTY

Creates an empty group.

enumerator DCGM_GROUP_DEFAULT_NVSWITCHES

All NvSwitches of the node are added to the group.

enumerator DCGM_GROUP_DEFAULT_INSTANCES

All GPU instances of the node are added to the group.

enumerator DCGM_GROUP_DEFAULT_COMPUTE_INSTANCES

All compute instances of the node are added to the group.

enumerator DCGM_GROUP_DEFAULT_EVERYTHING

All entities are added to this default group.

enum dcgmChipArchitecture_enum

Simplified chip architecture.

Note that these are made to match nvmlChipArchitecture_t and thus do not start at 0.

Values:

enumerator DCGM_CHIP_ARCH_OLDER

All GPUs older than Kepler.

enumerator DCGM_CHIP_ARCH_KEPLER

All Kepler-architecture parts.

enumerator DCGM_CHIP_ARCH_MAXWELL

All Maxwell-architecture parts.

enumerator DCGM_CHIP_ARCH_PASCAL

All Pascal-architecture parts.

enumerator DCGM_CHIP_ARCH_VOLTA

All Volta-architecture parts.

enumerator DCGM_CHIP_ARCH_TURING

All Turing-architecture parts.

enumerator DCGM_CHIP_ARCH_AMPERE

All Ampere-architecture parts.

enumerator DCGM_CHIP_ARCH_ADA

All Ada-architecture parts.

enumerator DCGM_CHIP_ARCH_HOPPER

All Hopper-architecture parts.

enumerator DCGM_CHIP_ARCH_COUNT

Keep this second to last, exclude unknown.

enumerator DCGM_CHIP_ARCH_UNKNOWN

Anything else, presumably something newer.

enum dcgmConfigType_enum

Represents the type of configuration to be fetched from the GPUs.

Values:

enumerator DCGM_CONFIG_TARGET_STATE

The target configuration values to be applied.

enumerator DCGM_CONFIG_CURRENT_STATE

The current configuration state.

enum dcgmConfigPowerLimitType_enum

Represents the power cap for each member of the group.

Values:

enumerator DCGM_CONFIG_POWER_CAP_INDIVIDUAL

Represents the power cap to be applied for each member of the group.

enumerator DCGM_CONFIG_POWER_BUDGET_GROUP

Represents the power budget for the entire group.

Functions

const char *errorString(dcgmReturn_t result)