Enums and Macros
- group dcgmReturnEnums
Defines
-
MAKE_DCGM_VERSION(typeName, ver) (unsigned int)(sizeof(typeName) | ((unsigned long)(ver) << 24U))
Creates a unique version number for each struct.
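The macro definition packs two pieces of information into one integer: the struct size in the low 24 bits and the version number in the bits above. The sketch below illustrates that layout under stated assumptions: `exampleStruct_v1` and the `example_*` helpers are hypothetical names introduced here, not part of DCGM.

```c
#include <assert.h>

/* Macro as listed in this reference. */
#define MAKE_DCGM_VERSION(typeName, ver) (unsigned int)(sizeof(typeName) | ((unsigned long)(ver) << 24U))

/* Hypothetical struct, used only for illustration; not part of DCGM. */
typedef struct
{
    unsigned int version; /* caller would store the packed version here */
    int payload[4];
} exampleStruct_v1;

/* Hypothetical helper: low 24 bits hold sizeof(struct). */
static unsigned int example_version_size(unsigned int v)
{
    return v & 0x00ffffffU;
}

/* Hypothetical helper: bits 24 and above hold the version number. */
static unsigned int example_version_number(unsigned int v)
{
    return v >> 24U;
}

static unsigned int example_make_version(void)
{
    /* The size must fit below bit 24, or it would collide with the version bits. */
    assert(sizeof(exampleStruct_v1) < (1UL << 24));
    return MAKE_DCGM_VERSION(exampleStruct_v1, 1);
}
```

Because the size is part of the value, two structs with the same version number but different layouts still produce different version values.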
-
DCGM_BLANK_VALUES
Represents value of the field which can be returned by Host Engine in case the operation is not successful.
-
DCGM_INT8_BLANK 0x70
Base value for 8 bits integer blank.
Can be used as an unspecified blank.
-
DCGM_INT32_BLANK 0x7ffffff0
Base value for 32 bits integer blank.
Can be used as an unspecified blank.
-
DCGM_INT64_BLANK 0x7ffffffffffffff0
Base value for 64 bits integer blank.
Can be used as an unspecified blank.
-
DCGM_FP64_BLANK 140737488355328.0
Base value for double blank.
Equal to 2 ** 47. FP64 has 52 bits of mantissa, so values at this magnitude can still be incremented by 1, which keeps each offset from 0 to 15 exactly representable.
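The representability argument can be checked directly. This is a minimal sketch, assuming IEEE 754 doubles; the function name `example_fp64_blank_offsets_are_exact` is hypothetical.

```c
#include <assert.h>

/* Value as listed in this reference. */
#define DCGM_FP64_BLANK 140737488355328.0

/*
 * Hypothetical check: returns 1 if every step from BLANK + k to BLANK + k + 1
 * (k = 0..14) is exactly 1.0, meaning the offsets 0..15 are all distinct and
 * exactly representable as doubles.
 */
static int example_fp64_blank_offsets_are_exact(void)
{
    /* The base value is 2 ** 47. */
    assert(DCGM_FP64_BLANK == (double)(1ULL << 47));

    for (int k = 0; k < 15; k++)
    {
        double a = DCGM_FP64_BLANK + (double)k;
        double b = DCGM_FP64_BLANK + (double)(k + 1);
        if (b - a != 1.0)
        {
            return 0;
        }
    }
    return 1;
}
```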
-
DCGM_STR_BLANK "<<<NULL>>>"
Base value for string blank.
-
DCGM_INT32_NOT_FOUND (DCGM_INT32_BLANK + 1)
Represents an error where INT32 data was not found.
-
DCGM_INT64_NOT_FOUND (DCGM_INT64_BLANK + 1)
Represents an error where INT64 data was not found.
-
DCGM_FP64_NOT_FOUND (DCGM_FP64_BLANK + 1.0)
Represents an error where FP64 data was not found.
-
DCGM_STR_NOT_FOUND "<<<NOT_FOUND>>>"
Represents an error where STR data was not found.
-
DCGM_INT32_NOT_SUPPORTED (DCGM_INT32_BLANK + 2)
Represents an error where fetching the INT32 value is not supported.
-
DCGM_INT64_NOT_SUPPORTED (DCGM_INT64_BLANK + 2)
Represents an error where fetching the INT64 value is not supported.
-
DCGM_FP64_NOT_SUPPORTED (DCGM_FP64_BLANK + 2.0)
Represents an error where fetching the FP64 value is not supported.
-
DCGM_STR_NOT_SUPPORTED "<<<NOT_SUPPORTED>>>"
Represents an error where fetching the STR value is not supported.
-
DCGM_INT32_NOT_PERMISSIONED (DCGM_INT32_BLANK + 3)
Represents an error where fetching the INT32 value is not allowed with the current credentials.
-
DCGM_INT64_NOT_PERMISSIONED (DCGM_INT64_BLANK + 3)
Represents an error where fetching the INT64 value is not allowed with the current credentials.
-
DCGM_FP64_NOT_PERMISSIONED (DCGM_FP64_BLANK + 3.0)
Represents an error where fetching the FP64 value is not allowed with the current credentials.
-
DCGM_STR_NOT_PERMISSIONED "<<<NOT_PERM>>>"
Represents an error where fetching the STR value is not allowed with the current credentials.
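Each error sentinel is a small offset above the blank base value, so a caller can test for the specific sentinels first and fall back to a generic blank check. The sketch below shows one possible way to do this for INT64 values; the `exampleValueKind_t` enum and `example_classify_int64` function are hypothetical and not part of DCGM.

```c
#include <assert.h>
#include <stdint.h>

/* Values as listed in this reference. */
#define DCGM_INT64_BLANK 0x7ffffffffffffff0
#define DCGM_INT64_NOT_FOUND (DCGM_INT64_BLANK + 1)
#define DCGM_INT64_NOT_SUPPORTED (DCGM_INT64_BLANK + 2)
#define DCGM_INT64_NOT_PERMISSIONED (DCGM_INT64_BLANK + 3)

/* Every specific sentinel sits above the base, so a >= check also catches it. */
static_assert(DCGM_INT64_NOT_PERMISSIONED > DCGM_INT64_BLANK,
              "sentinels must lie above the blank base value");

/* Hypothetical result type for this sketch. */
typedef enum
{
    EXAMPLE_VALUE_OK,
    EXAMPLE_VALUE_BLANK,
    EXAMPLE_VALUE_NOT_FOUND,
    EXAMPLE_VALUE_NOT_SUPPORTED,
    EXAMPLE_VALUE_NOT_PERMISSIONED
} exampleValueKind_t;

/* Hypothetical helper: map an INT64 field value to the kind of result it encodes. */
static exampleValueKind_t example_classify_int64(int64_t val)
{
    if (val == DCGM_INT64_NOT_FOUND)
        return EXAMPLE_VALUE_NOT_FOUND;
    if (val == DCGM_INT64_NOT_SUPPORTED)
        return EXAMPLE_VALUE_NOT_SUPPORTED;
    if (val == DCGM_INT64_NOT_PERMISSIONED)
        return EXAMPLE_VALUE_NOT_PERMISSIONED;
    if (val >= DCGM_INT64_BLANK)
        return EXAMPLE_VALUE_BLANK; /* base value or any other blank offset */
    return EXAMPLE_VALUE_OK;
}
```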
-
DCGM_INT8_IS_BLANK(val) (((val) >= DCGM_INT8_BLANK) ? 1 : 0)
Macro to check whether an INT8 value is blank.
-
DCGM_INT32_IS_BLANK(val) (((val) >= DCGM_INT32_BLANK) ? 1 : 0)
Macro to check whether an INT32 value is blank.
-
DCGM_INT64_IS_BLANK(val) (((val) >= DCGM_INT64_BLANK) ? 1 : 0)
Macro to check whether an INT64 value is blank.
-
DCGM_FP64_IS_BLANK(val) (((val) >= DCGM_FP64_BLANK ? 1 : 0))
Macro to check whether an FP64 value is blank.
-
DCGM_STR_IS_BLANK(val) (val == strstr(val, "<<<") && strstr(val, ">>>"))
Macro to check whether a STR value is blank. Works on (char *).
Looks for <<< at the first position and >>> anywhere inside the string.
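A typical use of the blank-check macros is to substitute a fallback before a value is displayed or aggregated. The following is a minimal sketch under that assumption; `example_int32_or_default` and `example_str_or_default` are hypothetical helpers, while the macros are copied from the definitions in this reference.

```c
#include <assert.h>
#include <string.h>

/* Definitions as listed in this reference. */
#define DCGM_INT32_BLANK 0x7ffffff0
#define DCGM_INT32_IS_BLANK(val) (((val) >= DCGM_INT32_BLANK) ? 1 : 0)
#define DCGM_STR_IS_BLANK(val) (val == strstr(val, "<<<") && strstr(val, ">>>"))

/* Hypothetical helper: return fallback when val is any INT32 blank value. */
static int example_int32_or_default(int val, int fallback)
{
    return DCGM_INT32_IS_BLANK(val) ? fallback : val;
}

/* Hypothetical helper: return fallback when val is any STR blank value. */
static const char *example_str_or_default(const char *val, const char *fallback)
{
    /* strstr() requires a valid string, so a NULL pointer is a caller error. */
    assert(val != NULL);
    return DCGM_STR_IS_BLANK(val) ? fallback : val;
}
```

Note that the `>=` comparison treats every value at or above the base as blank, so the NOT_FOUND, NOT_SUPPORTED, and NOT_PERMISSIONED sentinels are covered by the same check.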
-
DCGM_MAX_NUM_DEVICES 32 /* DCGM 2.0 and newer = 32. DCGM 1.8 and older = 16. */
Max number of GPUs supported by DCGM.
-
DCGM_NVLINK_MAX_LINKS_PER_GPU 18
Number of NvLink links per GPU supported by DCGM: 18 for Hopper, 12 for Ampere, 6 for Volta, and 4 for Pascal.
-
DCGM_NVLINK_ERROR_COUNT 4
Number of NvLink errors supported by DCGM.
NVML_NVLINK_ERROR_DL_ECC_DATA is not currently supported.
See also
NVML_NVLINK_ERROR_COUNT
-
DCGM_HEALTH_WATCH_NVLINK_ERROR_NUM_FIELDS 4
Number of NvLink error types.
See also
NVML_NVLINK_ERROR_COUNT
TODO: update with refactor of ampere-next nvlink APIs (JIRA DCGM-2628)
-
DCGM_NVLINK_MAX_LINKS_PER_GPU_LEGACY1 6
Maximum NvLink links pre-Ampere.
-
DCGM_NVLINK_MAX_LINKS_PER_GPU_LEGACY2 12
Maximum NvLink links pre-Hopper.
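The per-architecture link counts given for DCGM_NVLINK_MAX_LINKS_PER_GPU and its two legacy variants can be summarized as a lookup. This is only a sketch: `exampleArch_t` is a hypothetical stand-in rather than dcgmChipArchitecture_t, and `example_nvlink_links` is not a DCGM function.

```c
#include <assert.h>

/* Values as listed in this reference. */
#define DCGM_NVLINK_MAX_LINKS_PER_GPU 18
#define DCGM_NVLINK_MAX_LINKS_PER_GPU_LEGACY1 6
#define DCGM_NVLINK_MAX_LINKS_PER_GPU_LEGACY2 12

/* Hypothetical architecture tag for this sketch only. */
typedef enum
{
    EXAMPLE_ARCH_PASCAL,
    EXAMPLE_ARCH_VOLTA,
    EXAMPLE_ARCH_AMPERE,
    EXAMPLE_ARCH_HOPPER
} exampleArch_t;

/* Hypothetical helper: link count per GPU for the architectures named above. */
static unsigned int example_nvlink_links(exampleArch_t arch)
{
    unsigned int links;

    switch (arch)
    {
        case EXAMPLE_ARCH_PASCAL:
            links = 4;
            break;
        case EXAMPLE_ARCH_VOLTA:
            links = DCGM_NVLINK_MAX_LINKS_PER_GPU_LEGACY1;
            break;
        case EXAMPLE_ARCH_AMPERE:
            links = DCGM_NVLINK_MAX_LINKS_PER_GPU_LEGACY2;
            break;
        case EXAMPLE_ARCH_HOPPER:
            links = DCGM_NVLINK_MAX_LINKS_PER_GPU;
            break;
        default:
            links = 0; /* unknown to this sketch */
            break;
    }

    /* Arrays sized with the current maximum can hold any of these counts. */
    assert(links <= DCGM_NVLINK_MAX_LINKS_PER_GPU);
    return links;
}
```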
-
DCGM_MAX_NUM_SWITCHES 12
Max number of NvSwitches supported by DCGM.
-
DCGM_MAX_XID_INFO 10
Max number of XID info entries to store.
-
DCGM_NVLINK_MAX_LINKS_PER_NVSWITCH 256
Number of NvLink links per NvSwitch supported by DCGM.
-
DCGM_LANE_MAX_LANES_PER_NVSWICH_LINK 4
Number of Lanes per NvSwitch NvLink supported by DCGM.
-
DCGM_MAX_VGPU_INSTANCES_PER_PGPU 32
Maximum number of vGPU instances per physical GPU.
-
DCGM_MAX_NUM_CPUS 8
Max number of CPU nodes.
-
DCGM_MAX_NUM_CPU_CORES 1024
Max number of CPU cores.
-
DCGM_MAX_STR_LENGTH 256
Max length of the DCGM string field.
-
DCGM_MAX_AGE_USEC_DEFAULT 30000000
Default maximum age of samples kept, in microseconds.
-
DCGM_MAX_CLOCKS 256
Max number of clocks supported for a device.
-
DCGM_MAX_NUM_GROUPS 64
Max limit on the number of groups supported by DCGM.
-
DCGM_MAX_FBC_SESSIONS 256
Max number of active FBC sessions.
-
DCGM_VGPU_NAME_BUFFER_SIZE 64
Represents the size of a buffer that holds a vGPU type name, a vGPU class type, or the name of a process running on a vGPU instance.
-
DCGM_GRID_LICENSE_BUFFER_SIZE 128
Represents the size of a buffer that holds a vGPU license string.
-
DCGM_CONFIG_COMPUTEMODE_DEFAULT 0
Default compute mode — multiple contexts per device.
-
DCGM_CONFIG_COMPUTEMODE_PROHIBITED 1
Compute-prohibited mode — no contexts per device.
-
DCGM_CONFIG_COMPUTEMODE_EXCLUSIVE_PROCESS 2
Compute-exclusive-process mode — only one context per device, usable from multiple threads at a time.
-
DCGM_HE_PORT_NUMBER 5555
Default Port Number for DCGM Host Engine.
-
DCGM_GROUP_ALL_GPUS 0x7fffffff
Identifiers for special DCGM groups.
-
DCGM_GROUP_ALL_NVSWITCHES 0x7ffffffe
-
DCGM_GROUP_ALL_INSTANCES 0x7ffffffd
-
DCGM_GROUP_ALL_COMPUTE_INSTANCES 0x7ffffffc
-
DCGM_GROUP_ALL_ENTITIES 0x7ffffffb
-
DCGM_GROUP_NULL 0x7ffffffa
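The six special group identifiers occupy a contiguous range at the top of the positive 32-bit integer space. The sketch below uses that observation to test whether an ID is one of them; `example_is_reserved_group` is a hypothetical helper, not a DCGM function.

```c
#include <assert.h>

/* Values as listed in this reference. */
#define DCGM_GROUP_ALL_GPUS 0x7fffffff
#define DCGM_GROUP_ALL_NVSWITCHES 0x7ffffffe
#define DCGM_GROUP_ALL_INSTANCES 0x7ffffffd
#define DCGM_GROUP_ALL_COMPUTE_INSTANCES 0x7ffffffc
#define DCGM_GROUP_ALL_ENTITIES 0x7ffffffb
#define DCGM_GROUP_NULL 0x7ffffffa

/* The range check below relies on the six IDs being contiguous. */
static_assert(DCGM_GROUP_ALL_GPUS - DCGM_GROUP_NULL == 5,
              "special group IDs are expected to be contiguous");

/* Hypothetical helper: returns 1 if groupId is one of the special IDs above. */
static int example_is_reserved_group(unsigned long groupId)
{
    return groupId >= DCGM_GROUP_NULL && groupId <= DCGM_GROUP_ALL_GPUS;
}
```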
-
DCGM_GROUP_MAX_ENTITIES_V1 64
Maximum number of entities per entity group.
-
DCGM_GROUP_MAX_ENTITIES_V2 1024
Typedefs
-
typedef enum dcgmOperationMode_enum dcgmOperationMode_t
Operation mode for DCGM.
DCGM can run in auto mode, where it runs additional background threads to collect metrics of interest and automatically manages the operations needed for policy management.
DCGM can also operate in manual mode, where its execution is controlled by the user. In this mode, the user must periodically call APIs such as dcgmPolicyTrigger and dcgmUpdateAllFields, which tell DCGM to wake up and perform the data collection and operations needed for policy management.
-
typedef enum dcgmOrder_enum dcgmOrder_t
The order in which values are returned when a query returns more than one value.
-
typedef enum dcgmReturn_enum dcgmReturn_t
Return values for DCGM API calls.
-
typedef enum dcgmGroupType_enum dcgmGroupType_t
Type of GPU groups.
-
typedef enum dcgmChipArchitecture_enum dcgmChipArchitecture_t
Simplified chip architecture.
Note that these are made to match nvmlChipArchitecture_t and thus do not start at 0.
-
typedef enum dcgmConfigType_enum dcgmConfigType_t
Represents the type of configuration to be fetched from the GPUs.
-
typedef enum dcgmConfigPowerLimitType_enum dcgmConfigPowerLimitType_t
Represents the power cap for each member of the group.
Enums
-
enum dcgmOperationMode_enum
Operation mode for DCGM.
DCGM can run in auto mode, where it runs additional background threads to collect metrics of interest and automatically manages the operations needed for policy management.
DCGM can also operate in manual mode, where its execution is controlled by the user. In this mode, the user must periodically call APIs such as dcgmPolicyTrigger and dcgmUpdateAllFields, which tell DCGM to wake up and perform the data collection and operations needed for policy management.
Values:
-
enumerator DCGM_OPERATION_MODE_AUTO
-
enumerator DCGM_OPERATION_MODE_MANUAL
-
enum dcgmOrder_enum
The order in which values are returned when a query returns more than one value.
Values:
-
enumerator DCGM_ORDER_ASCENDING
Data with earliest (lowest) timestamps returned first.
-
enumerator DCGM_ORDER_DESCENDING
Data with latest (highest) timestamps returned first.
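The two orderings differ only in the direction of the timestamp comparison. The sketch below reproduces that behavior on a caller-side array to make the semantics concrete; it does not show how DCGM orders results internally. `exampleOrder_t`, `exampleSample_t`, and `example_sort_samples` are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for dcgmOrder_t; numeric values are not taken from DCGM. */
typedef enum
{
    EXAMPLE_ORDER_ASCENDING,
    EXAMPLE_ORDER_DESCENDING
} exampleOrder_t;

/* Hypothetical sample record. */
typedef struct
{
    long long timestampUsec;
    double value;
} exampleSample_t;

/* Earliest (lowest) timestamp first. */
static int example_cmp_ascending(const void *a, const void *b)
{
    const exampleSample_t *x = a;
    const exampleSample_t *y = b;
    return (x->timestampUsec > y->timestampUsec) - (x->timestampUsec < y->timestampUsec);
}

/* Latest (highest) timestamp first: the same comparison with arguments swapped. */
static int example_cmp_descending(const void *a, const void *b)
{
    return example_cmp_ascending(b, a);
}

static void example_sort_samples(exampleSample_t *samples, size_t count, exampleOrder_t order)
{
    assert(samples != NULL || count == 0);
    if (count == 0)
    {
        return;
    }
    qsort(samples, count, sizeof(*samples),
          order == EXAMPLE_ORDER_ASCENDING ? example_cmp_ascending : example_cmp_descending);
}
```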
-
enum dcgmReturn_enum
Return values for DCGM API calls.
Values:
-
enumerator DCGM_ST_OK
Success.
-
enumerator DCGM_ST_BADPARAM
A bad parameter was passed to a function.
-
enumerator DCGM_ST_GENERIC_ERROR
A generic, unspecified error.
-
enumerator DCGM_ST_MEMORY
An out of memory error occurred.
-
enumerator DCGM_ST_NOT_CONFIGURED
Setting not configured.
-
enumerator DCGM_ST_NOT_SUPPORTED
Feature not supported.
-
enumerator DCGM_ST_INIT_ERROR
DCGM Init error.
-
enumerator DCGM_ST_NVML_ERROR
NVML returned an error.
-
enumerator DCGM_ST_PENDING
Object is in a pending state, waiting on something else.
-
enumerator DCGM_ST_UNINITIALIZED
Object is in undefined state.
-
enumerator DCGM_ST_TIMEOUT
Requested operation timed out.
-
enumerator DCGM_ST_VER_MISMATCH
Version mismatch between received and understood API.
-
enumerator DCGM_ST_UNKNOWN_FIELD
Unknown field id.
-
enumerator DCGM_ST_NO_DATA
No data is available.
-
enumerator DCGM_ST_STALE_DATA
Data is considered stale.
-
enumerator DCGM_ST_NOT_WATCHED
The given field id is not being updated by the cache manager.
-
enumerator DCGM_ST_NO_PERMISSION
Do not have permission to perform the desired action.
-
enumerator DCGM_ST_GPU_IS_LOST
GPU is no longer reachable.
-
enumerator DCGM_ST_RESET_REQUIRED
GPU requires a reset.
-
enumerator DCGM_ST_FUNCTION_NOT_FOUND
The function that was requested was not found (bindings-only error).
-
enumerator DCGM_ST_CONNECTION_NOT_VALID
The connection to the host engine is not valid any longer.
-
enumerator DCGM_ST_GPU_NOT_SUPPORTED
This GPU is not supported by DCGM.
-
enumerator DCGM_ST_GROUP_INCOMPATIBLE
The GPUs of the provided group are not compatible with each other for the requested operation.
-
enumerator DCGM_ST_MAX_LIMIT
Max limit reached for the object.
-
enumerator DCGM_ST_LIBRARY_NOT_FOUND
DCGM library could not be found.
-
enumerator DCGM_ST_DUPLICATE_KEY
Duplicate key passed to a function.
-
enumerator DCGM_ST_GPU_IN_SYNC_BOOST_GROUP
GPU is already a part of a sync boost group.
-
enumerator DCGM_ST_GPU_NOT_IN_SYNC_BOOST_GROUP
GPU is not a part of a sync boost group.
-
enumerator DCGM_ST_REQUIRES_ROOT
This operation cannot be performed when the host engine is running as non-root.
-
enumerator DCGM_ST_NVVS_ERROR
DCGM GPU Diagnostic was successfully executed, but reported an error.
-
enumerator DCGM_ST_INSUFFICIENT_SIZE
An input argument is not large enough.
-
enumerator DCGM_ST_FIELD_UNSUPPORTED_BY_API
The given field ID is not supported by the API being called.
-
enumerator DCGM_ST_MODULE_NOT_LOADED
This request is serviced by a module of DCGM that is not currently loaded.
-
enumerator DCGM_ST_IN_USE
The requested operation could not be completed because the affected resource is in use.
-
enumerator DCGM_ST_GROUP_IS_EMPTY
This group is empty and the requested operation is not valid on an empty group.
-
enumerator DCGM_ST_PROFILING_NOT_SUPPORTED
Profiling is not supported for this group of GPUs or GPU.
-
enumerator DCGM_ST_PROFILING_LIBRARY_ERROR
The third-party Profiling module returned an unrecoverable error.
-
enumerator DCGM_ST_PROFILING_MULTI_PASS
The requested profiling metrics cannot be collected in a single pass.
-
enumerator DCGM_ST_DIAG_ALREADY_RUNNING
A diag instance is already running; a new diag cannot be run until the current one finishes.
-
enumerator DCGM_ST_DIAG_BAD_JSON
The DCGM GPU Diagnostic returned JSON that cannot be parsed.
-
enumerator DCGM_ST_DIAG_BAD_LAUNCH
Error while launching the DCGM GPU Diagnostic.
-
enumerator DCGM_ST_DIAG_UNUSED
Unused.
-
enumerator DCGM_ST_DIAG_THRESHOLD_EXCEEDED
A field value met or exceeded the error threshold.
-
enumerator DCGM_ST_INSUFFICIENT_DRIVER_VERSION
The installed driver version is insufficient for this API.
-
enumerator DCGM_ST_INSTANCE_NOT_FOUND
The specified GPU instance does not exist.
-
enumerator DCGM_ST_COMPUTE_INSTANCE_NOT_FOUND
The specified GPU compute instance does not exist.
-
enumerator DCGM_ST_CHILD_NOT_KILLED
Could not kill a child process within the allowed retries.
-
enumerator DCGM_ST_3RD_PARTY_LIBRARY_ERROR
Detected an error in a 3rd-party library.
-
enumerator DCGM_ST_INSUFFICIENT_RESOURCES
Not enough resources available.
-
enumerator DCGM_ST_PLUGIN_EXCEPTION
Exception thrown from a diagnostic plugin.
-
enumerator DCGM_ST_NVVS_ISOLATE_ERROR
The diagnostic returned an error that indicates the need for isolation.
-
enumerator DCGM_ST_NVVS_BINARY_NOT_FOUND
The NVVS binary was not found in the specified location.
-
enumerator DCGM_ST_NVVS_KILLED
The NVVS process was killed by a signal.
-
enumerator DCGM_ST_PAUSED
The hostengine and all modules are paused.
-
enumerator DCGM_ST_ALREADY_INITIALIZED
The object is already initialized.
-
enumerator DCGM_ST_NVML_NOT_LOADED
Cannot perform operation because NVML isn’t loaded.
-
enumerator DCGM_ST_NVML_DRIVER_TIMEOUT
Cannot perform operation because an NVML driver timeout error was detected.
-
enumerator DCGM_ST_NVVS_NO_AVAILABLE_TEST
NVVS reported no available tests (NVVS_ST_TEST_NOT_FOUND).
-
enum dcgmGroupType_enum
Type of GPU groups.
Values:
-
enumerator DCGM_GROUP_DEFAULT
All the GPUs on the node are added to the group.
-
enumerator DCGM_GROUP_EMPTY
Creates an empty group.
-
enumerator DCGM_GROUP_DEFAULT_NVSWITCHES
All NvSwitches of the node are added to the group.
-
enumerator DCGM_GROUP_DEFAULT_INSTANCES
All GPU instances of the node are added to the group.
-
enumerator DCGM_GROUP_DEFAULT_COMPUTE_INSTANCES
All compute instances of the node are added to the group.
-
enumerator DCGM_GROUP_DEFAULT_EVERYTHING
All entities are added to this default group.
-
enum dcgmChipArchitecture_enum
Simplified chip architecture.
Note that these are made to match nvmlChipArchitecture_t and thus do not start at 0.
Values:
-
enumerator DCGM_CHIP_ARCH_OLDER
All GPUs older than Kepler.
-
enumerator DCGM_CHIP_ARCH_KEPLER
All Kepler-architecture parts.
-
enumerator DCGM_CHIP_ARCH_MAXWELL
All Maxwell-architecture parts.
-
enumerator DCGM_CHIP_ARCH_PASCAL
All Pascal-architecture parts.
-
enumerator DCGM_CHIP_ARCH_VOLTA
All Volta-architecture parts.
-
enumerator DCGM_CHIP_ARCH_TURING
All Turing-architecture parts.
-
enumerator DCGM_CHIP_ARCH_AMPERE
All Ampere-architecture parts.
-
enumerator DCGM_CHIP_ARCH_ADA
All Ada-architecture parts.
-
enumerator DCGM_CHIP_ARCH_HOPPER
All Hopper-architecture parts.
-
enumerator DCGM_CHIP_ARCH_BLACKWELL
All Blackwell-architecture parts.
-
enumerator DCGM_CHIP_ARCH_COUNT
Keep this second to last, exclude unknown.
-
enumerator DCGM_CHIP_ARCH_UNKNOWN
Anything else, presumably something newer.
-
enum dcgmConfigType_enum
Represents the type of configuration to be fetched from the GPUs.
Values:
-
enumerator DCGM_CONFIG_TARGET_STATE
The target configuration values to be applied.
-
enumerator DCGM_CONFIG_CURRENT_STATE
The current configuration state.
-
enum dcgmConfigPowerLimitType_enum
Represents the power cap for each member of the group.
Values:
-
enumerator DCGM_CONFIG_POWER_CAP_INDIVIDUAL
Represents the power cap to be applied for each member of the group.
-
enumerator DCGM_CONFIG_POWER_BUDGET_GROUP
Represents the power budget for the entire group.
Functions
-
const char *errorString(dcgmReturn_t result)