Enums and Macros

Defines

MAKE_DCGM_VERSION(typeName, ver) (unsigned int)(sizeof(typeName) | ((unsigned long)(ver) << 24U)): Creates a unique version number for each struct.
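As a rough illustration of how the macro packs its two inputs, the sketch below applies it to a made-up struct. The names exampleStats_v1, exampleStats_version1, exampleStatsInit, exampleVersionNumber, and exampleStructSize are hypothetical and not part of DCGM. The decoding helpers assume a 32-bit unsigned int and a struct smaller than 2^24 bytes, so the size occupies the low 24 bits and the version number the 8 bits above them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Reproduced from the definition above. */
#define MAKE_DCGM_VERSION(typeName, ver) \
    (unsigned int)(sizeof(typeName) | ((unsigned long)(ver) << 24U))

/* Hypothetical struct for illustration; not part of DCGM. */
typedef struct
{
    unsigned int version; /* caller sets this to exampleStats_version1 */
    unsigned int gpuId;
    double value;
} exampleStats_v1;

#define exampleStats_version1 MAKE_DCGM_VERSION(exampleStats_v1, 1)

/* Zeroes the struct and stamps it with its packed version. */
static void exampleStatsInit(exampleStats_v1 *stats)
{
    assert(stats != NULL);
    memset(stats, 0, sizeof(*stats));
    stats->version = exampleStats_version1;
}

/* Upper 8 bits of the packed value: the version number. */
static unsigned int exampleVersionNumber(unsigned int packed)
{
    return packed >> 24U;
}

/* Lower 24 bits of the packed value: sizeof() of the struct. */
static size_t exampleStructSize(unsigned int packed)
{
    return (size_t)(packed & 0x00ffffffU);
}
```

Because the struct size is part of the packed value, two revisions of a struct that differ in layout would most likely produce different version values even if the same version number were reused by mistake.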

DCGM_BLANK_VALUES: Represents value of the field which can be returned by Host Engine in case the operation is not successful.

DCGM_INT8_BLANK 0x70

Base value for an 8-bit integer blank.

Can be used as an unspecified blank.

DCGM_INT32_BLANK 0x7ffffff0

Base value for a 32-bit integer blank.

Can be used as an unspecified blank.

DCGM_INT64_BLANK 0x7ffffffffffffff0

Base value for a 64-bit integer blank.

Can be used as an unspecified blank.

DCGM_FP64_BLANK 140737488355328.0

Base value for double blank.

Equal to 2 ** 47. FP64 has 52 bits of mantissa, so a value of this magnitude can still be incremented by 1 exactly, which lets each offset from 0 to 15 be represented as a distinct value.
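The arithmetic behind that choice can be checked directly: with a 52-bit mantissa, doubles between 2^47 and 2^48 are spaced 2^(47-52) = 1/32 apart, so adding a small whole number to the base value is exact. The sketch below assumes IEEE 754 doubles; exampleFp64BlankOffset is a hypothetical helper, not a DCGM function.

```c
#include <assert.h>

#define DCGM_FP64_BLANK 140737488355328.0 /* 2^47, as documented above */

/* Returns the whole-number offset (0-15) of val above DCGM_FP64_BLANK,
 * or -1 if val is not one of those sixteen values. */
static int exampleFp64BlankOffset(double val)
{
    for (int k = 0; k <= 15; k++)
    {
        /* Each candidate must be strictly above the previous one,
         * otherwise the offsets could not be told apart. */
        assert(DCGM_FP64_BLANK + (k + 1) > DCGM_FP64_BLANK + k);
        if (val == DCGM_FP64_BLANK + k)
            return k;
    }
    return -1;
}
```

Exact equality is safe here only because every compared value is exactly representable; ordinary measured FP64 values should not be compared this way.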

DCGM_STR_BLANK "<<<NULL>>>": Base value for string blank.

DCGM_INT32_NOT_FOUND (DCGM_INT32_BLANK + 1): Represents an error where INT32 data was not found.

DCGM_INT64_NOT_FOUND (DCGM_INT64_BLANK + 1): Represents an error where INT64 data was not found.

DCGM_FP64_NOT_FOUND (DCGM_FP64_BLANK + 1.0): Represents an error where FP64 data was not found.

DCGM_STR_NOT_FOUND "<<<NOT_FOUND>>>": Represents an error where STR data was not found.

DCGM_INT32_NOT_SUPPORTED (DCGM_INT32_BLANK + 2): Represents an error where fetching the INT32 value is not supported.

DCGM_INT64_NOT_SUPPORTED (DCGM_INT64_BLANK + 2): Represents an error where fetching the INT64 value is not supported.

DCGM_FP64_NOT_SUPPORTED (DCGM_FP64_BLANK + 2.0): Represents an error where fetching the FP64 value is not supported.

DCGM_STR_NOT_SUPPORTED "<<<NOT_SUPPORTED>>>": Represents an error where fetching the STR value is not supported.

DCGM_INT32_NOT_PERMISSIONED (DCGM_INT32_BLANK + 3): Represents an error where fetching the INT32 value is not allowed with the current credentials.

DCGM_INT64_NOT_PERMISSIONED (DCGM_INT64_BLANK + 3): Represents an error where fetching the INT64 value is not allowed with the current credentials.

DCGM_FP64_NOT_PERMISSIONED (DCGM_FP64_BLANK + 3.0): Represents an error where fetching the FP64 value is not allowed with the current credentials.

DCGM_STR_NOT_PERMISSIONED "<<<NOT_PERM>>>": Represents an error where fetching the STR value is not allowed with the current credentials.
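One way a caller might tell the INT64 sentinels apart is sketched below. The macros are copied from the definitions above; exampleValueStatus_t and exampleReadInt64 are hypothetical names introduced for this sketch. Treating every other value at or above DCGM_INT64_BLANK as a generic blank is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copied from the definitions above. */
#define DCGM_INT64_BLANK 0x7ffffffffffffff0
#define DCGM_INT64_NOT_FOUND (DCGM_INT64_BLANK + 1)
#define DCGM_INT64_NOT_SUPPORTED (DCGM_INT64_BLANK + 2)
#define DCGM_INT64_NOT_PERMISSIONED (DCGM_INT64_BLANK + 3)

/* Hypothetical result type for this sketch. */
typedef enum
{
    EXAMPLE_VALUE_OK,
    EXAMPLE_VALUE_BLANK,
    EXAMPLE_VALUE_NOT_FOUND,
    EXAMPLE_VALUE_NOT_SUPPORTED,
    EXAMPLE_VALUE_NOT_PERMISSIONED
} exampleValueStatus_t;

/* Classifies a raw INT64 field value. *out is written only when the
 * value is a real measurement rather than a sentinel. */
static exampleValueStatus_t exampleReadInt64(int64_t raw, int64_t *out)
{
    assert(out != NULL);
    if (raw < DCGM_INT64_BLANK)
    {
        *out = raw;
        return EXAMPLE_VALUE_OK;
    }
    switch (raw)
    {
        case DCGM_INT64_NOT_FOUND:
            return EXAMPLE_VALUE_NOT_FOUND;
        case DCGM_INT64_NOT_SUPPORTED:
            return EXAMPLE_VALUE_NOT_SUPPORTED;
        case DCGM_INT64_NOT_PERMISSIONED:
            return EXAMPLE_VALUE_NOT_PERMISSIONED;
        default:
            return EXAMPLE_VALUE_BLANK; /* base blank or an unlisted offset */
    }
}
```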

DCGM_INT8_IS_BLANK(val) (((val) >= DCGM_INT8_BLANK) ? 1 : 0): Macro to check if an INT8 value is blank or not.

DCGM_INT32_IS_BLANK(val) (((val) >= DCGM_INT32_BLANK) ? 1 : 0): Macro to check if an INT32 value is blank or not.

DCGM_INT64_IS_BLANK(val) (((val) >= DCGM_INT64_BLANK) ? 1 : 0): Macro to check if an INT64 value is blank or not.

DCGM_FP64_IS_BLANK(val) (((val) >= DCGM_FP64_BLANK ? 1 : 0)): Macro to check if an FP64 value is blank or not.

DCGM_STR_IS_BLANK(val) (val == strstr(val, "<<<") && strstr(val, ">>>"))

Macro to check if a STR value is blank or not. Works on (char *).

Looks for <<< at the first position and >>> anywhere inside the string.
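A minimal sketch of the blank checks in use follows. The macros are copied from the definitions above; exampleStrIsUsable and exampleInt32IsUsable are hypothetical wrappers. Note that DCGM_STR_IS_BLANK evaluates its argument more than once, so the argument should be free of side effects.

```c
#include <assert.h>
#include <string.h>

/* Copied from the definitions above. */
#define DCGM_INT32_BLANK 0x7ffffff0
#define DCGM_INT32_IS_BLANK(val) (((val) >= DCGM_INT32_BLANK) ? 1 : 0)
#define DCGM_STR_NOT_SUPPORTED "<<<NOT_SUPPORTED>>>"
#define DCGM_STR_IS_BLANK(val) (val == strstr(val, "<<<") && strstr(val, ">>>"))

/* Returns 1 if the string carries real data, 0 if it is a blank marker. */
static int exampleStrIsUsable(const char *s)
{
    assert(s != NULL); /* strstr() must not be given a null pointer */
    return !DCGM_STR_IS_BLANK(s);
}

/* Returns 1 if the integer carries real data, 0 if it is a blank marker. */
static int exampleInt32IsUsable(int v)
{
    return !DCGM_INT32_IS_BLANK(v);
}
```

Because the string check requires <<< at position zero, a string such as "x<<<y>>>" is treated as ordinary data.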

DCGM_MAX_NUM_DEVICES 32 /* DCGM 2.0 and newer = 32. DCGM 1.8 and older = 16. */: Max number of GPUs supported by DCGM.

DCGM_NVLINK_MAX_LINKS_PER_GPU 18: Number of NvLink links per GPU supported by DCGM: 18 for Hopper, 12 for Ampere, 6 for Volta, and 4 for Pascal.

DCGM_NVLINK_ERROR_COUNT 4

Number of NvLink errors supported by DCGM.

NVML_NVLINK_ERROR_DL_ECC_DATA is not currently supported.

See also

NVML_NVLINK_ERROR_COUNT

TODO: update with refactor of ampere-next nvlink APIs (JIRA DCGM-2628)

DCGM_NVLINK_MAX_LINKS_PER_GPU_LEGACY1 6: Maximum NvLink links pre-Ampere.

DCGM_NVLINK_MAX_LINKS_PER_GPU_LEGACY2 12: Maximum NvLink links pre-Hopper.
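The three link-count macros can be related to GPU generations as in the sketch below, which uses the per-generation counts stated for DCGM_NVLINK_MAX_LINKS_PER_GPU. The exampleGpuGen_t enum and exampleNvLinkCount function are hypothetical; the sketch deliberately avoids the real dcgmChipArchitecture_t type so that it compiles without the DCGM headers. Generations not listed in this section are left out.

```c
#include <assert.h>

/* Copied from the definitions above. */
#define DCGM_NVLINK_MAX_LINKS_PER_GPU 18
#define DCGM_NVLINK_MAX_LINKS_PER_GPU_LEGACY1 6
#define DCGM_NVLINK_MAX_LINKS_PER_GPU_LEGACY2 12

/* Hypothetical generation tag for this sketch only. */
typedef enum
{
    EXAMPLE_GPU_PASCAL,
    EXAMPLE_GPU_VOLTA,
    EXAMPLE_GPU_AMPERE,
    EXAMPLE_GPU_HOPPER
} exampleGpuGen_t;

/* Returns the documented NvLink link count for a generation. */
static unsigned int exampleNvLinkCount(exampleGpuGen_t gen)
{
    unsigned int links = 0;
    switch (gen)
    {
        case EXAMPLE_GPU_PASCAL:
            links = 4; /* no dedicated macro in this section */
            break;
        case EXAMPLE_GPU_VOLTA:
            links = DCGM_NVLINK_MAX_LINKS_PER_GPU_LEGACY1;
            break;
        case EXAMPLE_GPU_AMPERE:
            links = DCGM_NVLINK_MAX_LINKS_PER_GPU_LEGACY2;
            break;
        case EXAMPLE_GPU_HOPPER:
            links = DCGM_NVLINK_MAX_LINKS_PER_GPU;
            break;
    }
    /* An array sized with the current maximum fits every generation. */
    assert(links <= DCGM_NVLINK_MAX_LINKS_PER_GPU);
    return links;
}
```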

DCGM_MAX_NUM_SWITCHES 12: Max number of NvSwitches supported by DCGM.

DCGM_MAX_XID_INFO 10: Max number of XID info entries to store.

DCGM_NVLINK_MAX_LINKS_PER_NVSWITCH 256: Number of NvLink links per NvSwitch supported by DCGM.

DCGM_LANE_MAX_LANES_PER_NVSWICH_LINK 4: Number of Lanes per NvSwitch NvLink supported by DCGM.

DCGM_MAX_VGPU_INSTANCES_PER_PGPU 32: Maximum number of vGPU instances per physical GPU.

DCGM_MAX_NUM_CPUS 8: Max number of CPU nodes.

DCGM_MAX_NUM_CPU_CORES 1024: Max number of CPU cores.

DCGM_MAX_STR_LENGTH 256: Max length of the DCGM string field.

DCGM_MAX_AGE_USEC_DEFAULT 30000000: Default maximum age of samples kept (usec)

DCGM_MAX_CLOCKS 256: Max number of clocks supported for a device.

DCGM_MAX_NUM_GROUPS 64: Max limit on the number of groups supported by DCGM.

DCGM_MAX_FBC_SESSIONS 256: Max number of active FBC sessions.

DCGM_VGPU_NAME_BUFFER_SIZE 64: Represents the size of a buffer that holds a vGPU type name, a vGPU class type, or the name of a process running on a vGPU instance.

DCGM_GRID_LICENSE_BUFFER_SIZE 128: Represents the size of a buffer that holds a vGPU license string.

DCGM_CONFIG_COMPUTEMODE_DEFAULT 0: Default compute mode — multiple contexts per device.

DCGM_CONFIG_COMPUTEMODE_PROHIBITED 1: Compute-prohibited mode — no contexts per device.

DCGM_CONFIG_COMPUTEMODE_EXCLUSIVE_PROCESS 2: Compute-exclusive-process mode — only one context per device, usable from multiple threads at a time.

DCGM_HE_PORT_NUMBER 5555: Default Port Number for DCGM Host Engine.

DCGM_DEFAULT_SOCKET_PATH "/tmp/nv-hostengine": Default socket path for DCGM Host Engine.

DCGM_UNIX_SOCKET_PREFIX "unix://": Unix socket prefix for DCGM Host Engine.
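The three connection-related macros can be combined into address strings as sketched below. The helper names are hypothetical, and the "host:port" form for TCP addresses is an assumption of this sketch; this section does not state which API calls accept these strings.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Copied from the definitions above. */
#define DCGM_HE_PORT_NUMBER 5555
#define DCGM_DEFAULT_SOCKET_PATH "/tmp/nv-hostengine"
#define DCGM_UNIX_SOCKET_PREFIX "unix://"

/* Writes "<host>:<default port>" to out. Returns 0, or -1 if truncated. */
static int exampleTcpAddress(char *out, size_t outLen, const char *host)
{
    assert(out != NULL && host != NULL);
    int n = snprintf(out, outLen, "%s:%d", host, DCGM_HE_PORT_NUMBER);
    return (n >= 0 && (size_t)n < outLen) ? 0 : -1;
}

/* Writes the prefixed default socket path to out. Returns 0, or -1 if truncated. */
static int exampleUnixAddress(char *out, size_t outLen)
{
    assert(out != NULL);
    int n = snprintf(out, outLen, "%s%s", DCGM_UNIX_SOCKET_PREFIX, DCGM_DEFAULT_SOCKET_PATH);
    return (n >= 0 && (size_t)n < outLen) ? 0 : -1;
}

/* Returns 1 if addr starts with the Unix socket prefix, 0 otherwise. */
static int exampleIsUnixAddress(const char *addr)
{
    assert(addr != NULL);
    return strncmp(addr, DCGM_UNIX_SOCKET_PREFIX, strlen(DCGM_UNIX_SOCKET_PREFIX)) == 0;
}
```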

DCGM_GROUP_ALL_GPUS 0x7fffffff: Identifiers for special DCGM groups.

DCGM_GROUP_ALL_NVSWITCHES 0x7ffffffe

DCGM_GROUP_ALL_INSTANCES 0x7ffffffd

DCGM_GROUP_ALL_COMPUTE_INSTANCES 0x7ffffffc

DCGM_GROUP_ALL_ENTITIES 0x7ffffffb

DCGM_GROUP_NULL 0x7ffffffa

DCGM_GROUP_MAX_ENTITIES_V1 64: Maximum number of entities per entity group.

DCGM_GROUP_MAX_ENTITIES_V2 1024

Typedefs

typedef enum dcgmOperationMode_enum dcgmOperationMode_t

Operation mode for DCGM.

DCGM can run in auto-mode, where it runs additional threads in the background to collect any metrics of interest and automatically manages any operations needed for policy management.

DCGM can also operate in manual-mode, where its execution is controlled by the user. In this mode, the user has to periodically call APIs such as dcgmPolicyTrigger and dcgmUpdateAllFields, which tell DCGM to wake up and perform data collection and operations needed for policy management.

typedef enum dcgmOrder_enum dcgmOrder_t: When more than one value is returned from a query, which order should they be returned in?

typedef enum dcgmReturn_enum dcgmReturn_t: Return values for DCGM API calls.

typedef enum dcgmGroupType_enum dcgmGroupType_t: Type of GPU groups.

typedef enum dcgmChipArchitecture_enum dcgmChipArchitecture_t

Simplified chip architecture.

Note that these are made to match nvmlChipArchitecture_t and thus do not start at 0.

typedef enum dcgmConfigType_enum dcgmConfigType_t: Represents the type of configuration to be fetched from the GPUs.

typedef enum dcgmConfigPowerLimitType_enum dcgmConfigPowerLimitType_t: Represents the power cap for each member of the group.

Enums

enum dcgmOperationMode_enum

Operation mode for DCGM.

DCGM can run in auto-mode, where it runs additional threads in the background to collect any metrics of interest and automatically manages any operations needed for policy management.

DCGM can also operate in manual-mode, where its execution is controlled by the user. In this mode, the user has to periodically call APIs such as dcgmPolicyTrigger and dcgmUpdateAllFields, which tell DCGM to wake up and perform data collection and operations needed for policy management.

Values:

enumerator DCGM_OPERATION_MODE_AUTO

enumerator DCGM_OPERATION_MODE_MANUAL
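The periodic calls that manual mode requires could be organized as in the sketch below. To keep it runnable without a host engine, the two DCGM calls are passed in as callbacks. The exampleStep_fn type, exampleRunManualCycles, and the two stub steps are hypothetical; in real code the callbacks would wrap dcgmUpdateAllFields and dcgmPolicyTrigger. Calling the update step before the policy step is an assumption of this sketch, not something this section specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: performs one DCGM call, returns 0 on success. */
typedef int (*exampleStep_fn)(void *ctx);

/* Runs up to `cycles` manual-mode iterations and stops at the first failure.
 * Returns the number of iterations in which both steps succeeded. */
static int exampleRunManualCycles(exampleStep_fn updateFields,
                                  exampleStep_fn triggerPolicy,
                                  void *ctx,
                                  int cycles)
{
    assert(updateFields != NULL && triggerPolicy != NULL);
    int completed = 0;
    while (completed < cycles)
    {
        if (updateFields(ctx) != 0)
            break; /* data collection failed */
        if (triggerPolicy(ctx) != 0)
            break; /* policy evaluation failed */
        completed++;
        /* A real loop would sleep here for the chosen sampling interval. */
    }
    return completed;
}

/* Stub step: counts how often it was called and always succeeds. */
static int exampleCountingStep(void *ctx)
{
    int *calls = ctx;
    (*calls)++;
    return 0;
}

/* Stub step: always fails. */
static int exampleFailingStep(void *ctx)
{
    (void)ctx;
    return -1;
}
```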

enum dcgmOrder_enum

When more than one value is returned from a query, which order should they be returned in?

Values:

enumerator DCGM_ORDER_ASCENDING: Data with earliest (lowest) timestamps returned first.

enumerator DCGM_ORDER_DESCENDING: Data with latest (highest) timestamps returned first.
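The two orderings can be pictured with a small sort over timestamped samples. Everything in the sketch below is a stand-in: exampleOrder_t mirrors the two enumerators above without claiming their numeric values, and exampleSample_t is a hypothetical record, not a DCGM type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for dcgmOrder_t; numeric values are not taken from DCGM. */
typedef enum
{
    EXAMPLE_ORDER_ASCENDING,
    EXAMPLE_ORDER_DESCENDING
} exampleOrder_t;

/* Hypothetical sample record. */
typedef struct
{
    int64_t timestampUsec;
    double value;
} exampleSample_t;

/* Earliest (lowest) timestamp first. */
static int exampleCmpAscending(const void *a, const void *b)
{
    const exampleSample_t *x = a;
    const exampleSample_t *y = b;
    return (x->timestampUsec > y->timestampUsec) - (x->timestampUsec < y->timestampUsec);
}

/* Latest (highest) timestamp first. */
static int exampleCmpDescending(const void *a, const void *b)
{
    return exampleCmpAscending(b, a);
}

/* Sorts samples in place according to the requested order. */
static void exampleSortSamples(exampleSample_t *samples, size_t count, exampleOrder_t order)
{
    if (count == 0)
        return;
    assert(samples != NULL);
    qsort(samples, count, sizeof(*samples),
          order == EXAMPLE_ORDER_ASCENDING ? exampleCmpAscending : exampleCmpDescending);
}
```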

enum dcgmReturn_enum

Return values for DCGM API calls.

Values:

enumerator DCGM_ST_OK: Success.

enumerator DCGM_ST_BADPARAM: A bad parameter was passed to a function.

enumerator DCGM_ST_GENERIC_ERROR: A generic, unspecified error.

enumerator DCGM_ST_MEMORY: An out of memory error occurred.

enumerator DCGM_ST_NOT_CONFIGURED: Setting not configured.

enumerator DCGM_ST_NOT_SUPPORTED: Feature not supported.

enumerator DCGM_ST_INIT_ERROR: DCGM Init error.

enumerator DCGM_ST_NVML_ERROR: NVML returned an error.

enumerator DCGM_ST_PENDING: Object is in a pending state, waiting on something else.

enumerator DCGM_ST_UNINITIALIZED: Object is in an undefined state.

enumerator DCGM_ST_TIMEOUT: Requested operation timed out.

enumerator DCGM_ST_VER_MISMATCH: Version mismatch between received and understood API.

enumerator DCGM_ST_UNKNOWN_FIELD: Unknown field id.

enumerator DCGM_ST_NO_DATA: No data is available.

enumerator DCGM_ST_STALE_DATA: Data is considered stale.

enumerator DCGM_ST_NOT_WATCHED: The given field id is not being updated by the cache manager.

enumerator DCGM_ST_NO_PERMISSION: Do not have permission to perform the desired action.

enumerator DCGM_ST_GPU_IS_LOST: GPU is no longer reachable.

enumerator DCGM_ST_RESET_REQUIRED: GPU requires a reset.

enumerator DCGM_ST_FUNCTION_NOT_FOUND: The function that was requested was not found (bindings-only error).

enumerator DCGM_ST_CONNECTION_NOT_VALID: The connection to the host engine is not valid any longer.

enumerator DCGM_ST_GPU_NOT_SUPPORTED: This GPU is not supported by DCGM.

enumerator DCGM_ST_GROUP_INCOMPATIBLE: The GPUs of the provided group are not compatible with each other for the requested operation.

enumerator DCGM_ST_MAX_LIMIT: Max limit reached for the object.

enumerator DCGM_ST_LIBRARY_NOT_FOUND: DCGM library could not be found.

enumerator DCGM_ST_DUPLICATE_KEY: Duplicate key passed to a function.

enumerator DCGM_ST_GPU_IN_SYNC_BOOST_GROUP: GPU is already a part of a sync boost group.

enumerator DCGM_ST_GPU_NOT_IN_SYNC_BOOST_GROUP: GPU is not a part of a sync boost group.

enumerator DCGM_ST_REQUIRES_ROOT: This operation cannot be performed when the host engine is running as non-root.

enumerator DCGM_ST_NVVS_ERROR: DCGM GPU Diagnostic was successfully executed, but reported an error.

enumerator DCGM_ST_INSUFFICIENT_SIZE: An input argument is not large enough.

enumerator DCGM_ST_FIELD_UNSUPPORTED_BY_API: The given field ID is not supported by the API being called.

enumerator DCGM_ST_MODULE_NOT_LOADED: This request is serviced by a module of DCGM that is not currently loaded.

enumerator DCGM_ST_IN_USE: The requested operation could not be completed because the affected resource is in use.

enumerator DCGM_ST_GROUP_IS_EMPTY: This group is empty and the requested operation is not valid on an empty group.

enumerator DCGM_ST_PROFILING_NOT_SUPPORTED: Profiling is not supported for this group of GPUs or GPU.

enumerator DCGM_ST_PROFILING_LIBRARY_ERROR: The third-party Profiling module returned an unrecoverable error.

enumerator DCGM_ST_PROFILING_MULTI_PASS: The requested profiling metrics cannot be collected in a single pass.

enumerator DCGM_ST_DIAG_ALREADY_RUNNING: A diag instance is already running, cannot run a new diag until the current one finishes.

enumerator DCGM_ST_DIAG_BAD_JSON: The DCGM GPU Diagnostic returned JSON that cannot be parsed.

enumerator DCGM_ST_DIAG_BAD_LAUNCH: Error while launching the DCGM GPU Diagnostic.

enumerator DCGM_ST_DIAG_UNUSED: Unused.

enumerator DCGM_ST_DIAG_THRESHOLD_EXCEEDED: A field value met or exceeded the error threshold.

enumerator DCGM_ST_INSUFFICIENT_DRIVER_VERSION: The installed driver version is insufficient for this API.

enumerator DCGM_ST_INSTANCE_NOT_FOUND: The specified GPU instance does not exist.

enumerator DCGM_ST_COMPUTE_INSTANCE_NOT_FOUND: The specified GPU compute instance does not exist.

enumerator DCGM_ST_CHILD_NOT_KILLED: Could not kill a child process within the allowed retries.

enumerator DCGM_ST_3RD_PARTY_LIBRARY_ERROR: Detected an error in a 3rd-party library.

enumerator DCGM_ST_INSUFFICIENT_RESOURCES: Not enough resources available.

enumerator DCGM_ST_PLUGIN_EXCEPTION: Exception thrown from a diagnostic plugin.

enumerator DCGM_ST_NVVS_ISOLATE_ERROR: The diagnostic returned an error that indicates the need for isolation.

enumerator DCGM_ST_NVVS_BINARY_NOT_FOUND: The NVVS binary was not found in the specified location.

enumerator DCGM_ST_NVVS_KILLED: The NVVS process was killed by a signal.

enumerator DCGM_ST_PAUSED: The hostengine and all modules are paused.

enumerator DCGM_ST_ALREADY_INITIALIZED: The object is already initialized.

enumerator DCGM_ST_NVML_NOT_LOADED: Cannot perform operation because NVML isn’t loaded.

enumerator DCGM_ST_NVML_DRIVER_TIMEOUT: Cannot perform operation because an NVML driver timeout error was detected.

enumerator DCGM_ST_NVVS_NO_AVAILABLE_TEST: NVVS returned no available tests (NVVS_ST_TEST_NOT_FOUND).

enumerator DCGM_ST_MNDIAG_CONNECTION_NOT_AVAILABLE: No connection is currently authorized for mndiag.

enumerator DCGM_ST_MNDIAG_CONNECTION_UNAUTHORIZED: The connection is not authorized for mndiag operations.

enumerator DCGM_ST_REMOTE_SSH_CONNECTION_FAILED: An SSH connection to a remote hostengine failed.

enumerator DCGM_ST_CHILD_SPAWN_FAILED: A child process could not be spawned.

enumerator DCGM_ST_FILE_IO_ERROR: A file operation failed.

enumerator DCGM_ST_CHILD_SIGNAL_RECEIVED: A child process received a signal.

enumerator DCGM_ST_CALLER_ALREADY_STOPPED: The caller is already stopped.
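Callers often want to separate return codes that may clear up on their own from those that will not. The sketch below shows one possible policy. The exampleReturn_t enum mirrors a handful of the enumerators above under different names, and its numeric values are placeholders rather than the values in the DCGM header. Which codes count as retryable is a judgment made for this sketch, based only on the descriptions above.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a few dcgmReturn_t values; numeric values are placeholders. */
typedef enum
{
    EXAMPLE_ST_OK,
    EXAMPLE_ST_BADPARAM,
    EXAMPLE_ST_TIMEOUT,
    EXAMPLE_ST_IN_USE,
    EXAMPLE_ST_DIAG_ALREADY_RUNNING,
    EXAMPLE_ST_GPU_IS_LOST
} exampleReturn_t;

/* Hypothetical callback type wrapping one API call. */
typedef exampleReturn_t (*exampleCall_fn)(void *ctx);

/* Returns 1 for codes this sketch treats as transient, 0 otherwise. */
static int exampleIsRetryable(exampleReturn_t st)
{
    switch (st)
    {
        case EXAMPLE_ST_TIMEOUT:
        case EXAMPLE_ST_IN_USE:
        case EXAMPLE_ST_DIAG_ALREADY_RUNNING:
            return 1;
        default:
            return 0;
    }
}

/* Calls `call` until it returns a non-retryable code or attempts run out. */
static exampleReturn_t exampleCallWithRetry(exampleCall_fn call, void *ctx, int maxAttempts)
{
    assert(call != NULL && maxAttempts > 0);
    exampleReturn_t st = EXAMPLE_ST_OK;
    for (int attempt = 0; attempt < maxAttempts; attempt++)
    {
        st = call(ctx);
        if (!exampleIsRetryable(st))
            break; /* success or a permanent failure */
    }
    return st;
}

/* Stub call: reports "in use" while *ctx is positive, then succeeds. */
static exampleReturn_t exampleBusyThenOk(void *ctx)
{
    int *busy = ctx;
    if (*busy > 0)
    {
        (*busy)--;
        return EXAMPLE_ST_IN_USE;
    }
    return EXAMPLE_ST_OK;
}
```

A production loop would normally wait between attempts; the delay is omitted here to keep the sketch short.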

enum dcgmGroupType_enum

Type of GPU groups.

Values:

enumerator DCGM_GROUP_DEFAULT: All the GPUs on the node are added to the group.

enumerator DCGM_GROUP_EMPTY: Creates an empty group.

enumerator DCGM_GROUP_DEFAULT_NVSWITCHES: All NvSwitches of the node are added to the group.

enumerator DCGM_GROUP_DEFAULT_INSTANCES: All GPU instances of the node are added to the group.

enumerator DCGM_GROUP_DEFAULT_COMPUTE_INSTANCES: All compute instances of the node are added to the group.

enumerator DCGM_GROUP_DEFAULT_EVERYTHING: All entities are added to this default group.

enum dcgmChipArchitecture_enum

Simplified chip architecture.

Note that these are made to match nvmlChipArchitecture_t and thus do not start at 0.

Values:

enumerator DCGM_CHIP_ARCH_OLDER: All GPUs older than Kepler.

enumerator DCGM_CHIP_ARCH_KEPLER: All Kepler-architecture parts.

enumerator DCGM_CHIP_ARCH_MAXWELL: All Maxwell-architecture parts.

enumerator DCGM_CHIP_ARCH_PASCAL: All Pascal-architecture parts.

enumerator DCGM_CHIP_ARCH_VOLTA: All Volta-architecture parts.

enumerator DCGM_CHIP_ARCH_TURING: All Turing-architecture parts.

enumerator DCGM_CHIP_ARCH_AMPERE: All Ampere-architecture parts.

enumerator DCGM_CHIP_ARCH_ADA: All Ada-architecture parts.

enumerator DCGM_CHIP_ARCH_HOPPER: All Hopper-architecture parts.

enumerator DCGM_CHIP_ARCH_BLACKWELL: All Blackwell-architecture parts.

enumerator DCGM_CHIP_ARCH_COUNT: Keep this second to last; it excludes DCGM_CHIP_ARCH_UNKNOWN.

enumerator DCGM_CHIP_ARCH_UNKNOWN: Anything else, presumably something newer.
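Placing a count enumerator second to last makes it usable as an array bound and as a range check, as the sketch below illustrates. The exampleArch_t enum is a stand-in that copies the ordering above; starting it at 1 reflects the note that the values do not start at 0, but the exact numbers are an assumption here and the real header should be consulted. exampleArchIsKnown and exampleArchName are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for dcgmChipArchitecture_t; numeric values are assumptions. */
typedef enum
{
    EXAMPLE_ARCH_OLDER = 1, /* assumption: nonzero first value */
    EXAMPLE_ARCH_KEPLER,
    EXAMPLE_ARCH_MAXWELL,
    EXAMPLE_ARCH_PASCAL,
    EXAMPLE_ARCH_VOLTA,
    EXAMPLE_ARCH_TURING,
    EXAMPLE_ARCH_AMPERE,
    EXAMPLE_ARCH_ADA,
    EXAMPLE_ARCH_HOPPER,
    EXAMPLE_ARCH_BLACKWELL,
    EXAMPLE_ARCH_COUNT,  /* second to last: one past the last known value */
    EXAMPLE_ARCH_UNKNOWN
} exampleArch_t;

/* Name table bounded by the count enumerator; index 0 is unused. */
static const char *const exampleArchNames[EXAMPLE_ARCH_COUNT] = {
    [EXAMPLE_ARCH_OLDER] = "Older",     [EXAMPLE_ARCH_KEPLER] = "Kepler",
    [EXAMPLE_ARCH_MAXWELL] = "Maxwell", [EXAMPLE_ARCH_PASCAL] = "Pascal",
    [EXAMPLE_ARCH_VOLTA] = "Volta",     [EXAMPLE_ARCH_TURING] = "Turing",
    [EXAMPLE_ARCH_AMPERE] = "Ampere",   [EXAMPLE_ARCH_ADA] = "Ada",
    [EXAMPLE_ARCH_HOPPER] = "Hopper",   [EXAMPLE_ARCH_BLACKWELL] = "Blackwell",
};

/* Returns 1 if arch is one of the named architectures, 0 otherwise. */
static int exampleArchIsKnown(int arch)
{
    return arch >= EXAMPLE_ARCH_OLDER && arch < EXAMPLE_ARCH_COUNT;
}

/* Returns a display name, falling back to "Unknown" for anything else. */
static const char *exampleArchName(int arch)
{
    if (!exampleArchIsKnown(arch))
        return "Unknown";
    assert(exampleArchNames[arch] != NULL);
    return exampleArchNames[arch];
}
```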

enum dcgmConfigType_enum

Represents the type of configuration to be fetched from the GPUs.

Values:

enumerator DCGM_CONFIG_TARGET_STATE: The target configuration values to be applied.

enumerator DCGM_CONFIG_CURRENT_STATE: The current configuration state.

enum dcgmConfigPowerLimitType_enum

Represents the power cap for each member of the group.

Values:

enumerator DCGM_CONFIG_POWER_CAP_INDIVIDUAL: Represents the power cap to be applied for each member of the group.

enumerator DCGM_CONFIG_POWER_BUDGET_GROUP: Represents the power budget for the entire group.

Functions

const char *errorString(dcgmReturn_t result)