4.21. Drain states
This chapter describes NVML methods that control a device's drain state and whether the device is recognized by NVML and the NVIDIA kernel driver. Combined with out-of-band tools, these methods can be used to power GPUs on and off, enable robust reset scenarios, etc.
Functions
- nvmlReturn_t nvmlDeviceDiscoverGpus ( nvmlPciInfo_t* pciInfo )
- nvmlReturn_t nvmlDeviceModifyDrainState ( nvmlPciInfo_t* pciInfo, nvmlEnableState_t newState )
- nvmlReturn_t nvmlDeviceQueryDrainState ( nvmlPciInfo_t* pciInfo, nvmlEnableState_t* currentState )
- nvmlReturn_t nvmlDeviceRemoveGpu_v2 ( nvmlPciInfo_t* pciInfo, nvmlDetachGpuState_t gpuState, nvmlPcieLinkState_t linkState )
Functions
- nvmlReturn_t nvmlDeviceDiscoverGpus ( nvmlPciInfo_t* pciInfo )
Parameters
- pciInfo
- The PCI tree to be searched. Only the domain, bus, and device fields are used in this call.
Returns
- NVML_SUCCESS if the rediscovery request completed successfully
- NVML_ERROR_UNINITIALIZED if the library has not been successfully initialized
- NVML_ERROR_INVALID_ARGUMENT if pciInfo is invalid
- NVML_ERROR_NOT_SUPPORTED if the operating system does not support this feature
- NVML_ERROR_OPERATING_SYSTEM if the operating system is denying this feature
- NVML_ERROR_NO_PERMISSION if the calling process has insufficient permissions to perform operation
- NVML_ERROR_UNKNOWN on any unexpected error
Description
Request the OS and the NVIDIA kernel driver to rediscover a portion of the PCI subsystem, looking for GPUs that were previously removed. The portion of the PCI tree can be narrowed by specifying a domain, bus, and device. If all three are zero, the entire PCI tree is searched. Note that for long-running NVML processes the enumeration will change based on how many GPUs are discovered and where they are inserted in bus order.
In addition, all newly discovered GPUs will be initialized and their ECC scrubbed, which may take several seconds per GPU. Also, existing device handles are no longer guaranteed to be valid after discovery.
Must be run as administrator. For Linux only.
For Pascal or newer fully supported devices. Some Kepler devices supported.
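As an illustration of how the search scope described above might be prepared, the following minimal sketch parses a PCI bus ID string of the form domain:bus:device.function into the three fields that this call reads. The pci_scope_t type and both helper names are hypothetical stand-ins introduced for this example; they are not part of NVML, and a real caller would copy the parsed values into the domain, bus, and device fields of nvmlPciInfo_t.

```c
#include <stdio.h>

/* Hypothetical stand-in for the three nvmlPciInfo_t fields that
 * nvmlDeviceDiscoverGpus() reads; not an NVML type. */
typedef struct {
    unsigned int domain;
    unsigned int bus;
    unsigned int device;
} pci_scope_t;

/* Parse a hexadecimal "domain:bus:device.function" string.
 * Returns 0 on success and -1 on malformed input; *scope is only
 * written on success. The function number is parsed but discarded. */
static int parse_pci_scope(const char *bus_id, pci_scope_t *scope)
{
    unsigned int domain, bus, device, function;
    char trailing;

    if (bus_id == NULL || scope == NULL)
        return -1;

    /* Exactly four conversions must succeed; a fifth means trailing junk. */
    if (sscanf(bus_id, "%x:%x:%x.%x%c",
               &domain, &bus, &device, &function, &trailing) != 4)
        return -1;

    scope->domain = domain;
    scope->bus = bus;
    scope->device = device;
    return 0;
}

/* All-zero fields request a search of the entire PCI tree. */
static int scope_is_whole_tree(const pci_scope_t *scope)
{
    return scope->domain == 0 && scope->bus == 0 && scope->device == 0;
}
```

Because an all-zero scope means "search everything", a caller that wants to limit the rescan can check scope_is_whole_tree() before issuing the request.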
- nvmlReturn_t nvmlDeviceModifyDrainState ( nvmlPciInfo_t* pciInfo, nvmlEnableState_t newState )
Parameters
- pciInfo
- The PCI address of the GPU whose drain state is to be modified
- newState
- The drain state that should be entered, see nvmlEnableState_t
Returns
- NVML_SUCCESS if the drain state was successfully modified
- NVML_ERROR_UNINITIALIZED if the library has not been successfully initialized
- NVML_ERROR_INVALID_ARGUMENT if pciInfo or newState is invalid
- NVML_ERROR_NOT_SUPPORTED if the device doesn't support this feature
- NVML_ERROR_NO_PERMISSION if the calling process has insufficient permissions to perform operation
- NVML_ERROR_IN_USE if the device has persistence mode turned on
- NVML_ERROR_UNKNOWN on any unexpected error
Description
Modify the drain state of a GPU. This method forces a GPU to no longer accept new incoming requests. Any new NVML process will no longer see this GPU. Persistence mode for this GPU must be turned off before this call is made. Must be called as administrator. For Linux only.
For Pascal or newer fully supported devices. Some Kepler devices supported.
- nvmlReturn_t nvmlDeviceQueryDrainState ( nvmlPciInfo_t* pciInfo, nvmlEnableState_t* currentState )
Parameters
- pciInfo
- The PCI address of the GPU whose drain state is to be queried
- currentState
- The current drain state for this GPU, see nvmlEnableState_t
Returns
- NVML_SUCCESS if the drain state was successfully queried
- NVML_ERROR_UNINITIALIZED if the library has not been successfully initialized
- NVML_ERROR_INVALID_ARGUMENT if pciInfo or currentState is invalid
- NVML_ERROR_NOT_SUPPORTED if the device doesn't support this feature
- NVML_ERROR_UNKNOWN on any unexpected error
Description
Query the drain state of a GPU. This method is used to check whether a GPU is currently in a draining state. For Linux only.
For Pascal or newer fully supported devices. Some Kepler devices supported.
- nvmlReturn_t nvmlDeviceRemoveGpu_v2 ( nvmlPciInfo_t* pciInfo, nvmlDetachGpuState_t gpuState, nvmlPcieLinkState_t linkState )
Parameters
- pciInfo
- The PCI address of the GPU to be removed
- gpuState
- Whether the GPU is to be removed from the OS, see nvmlDetachGpuState_t
- linkState
- Requested upstream PCIe link state, see nvmlPcieLinkState_t
Returns
- NVML_SUCCESS if the GPU was successfully removed
- NVML_ERROR_UNINITIALIZED if the library has not been successfully initialized
- NVML_ERROR_INVALID_ARGUMENT if pciInfo is invalid
- NVML_ERROR_NOT_SUPPORTED if the device doesn't support this feature
- NVML_ERROR_IN_USE if the device is still in use and cannot be removed
Description
This method removes the specified GPU from the view of both NVML and the NVIDIA kernel driver, as long as no other processes are attached. If other processes are attached, this call returns NVML_ERROR_IN_USE and the GPU is returned to its original "draining" state. Note: the only situation in which a process can still be attached after nvmlDeviceModifyDrainState() is called to initiate the draining state is if that process was using the GPU before the call was made and is still using it. Also note that persistence mode counts as an attachment to the GPU, so it must be disabled prior to this call.
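Since a draining GPU accepts no new work, one plausible way to handle NVML_ERROR_IN_USE is to wait for the remaining processes to exit and try again a bounded number of times. The sketch below shows that retry logic only. All type and function names in it are hypothetical, the retry policy is an assumption rather than documented NVML behavior, and the removal attempt is passed in as a callback; in real code that callback would call nvmlDeviceRemoveGpu_v2() and map its return value onto remove_result_t.

```c
#include <stddef.h>

/* Outcome of one removal attempt; hypothetical names, not NVML types. */
typedef enum { REMOVE_DONE, REMOVE_IN_USE, REMOVE_FAILED } remove_result_t;

/* One removal attempt. A real callback would wrap nvmlDeviceRemoveGpu_v2()
 * and translate NVML_SUCCESS, NVML_ERROR_IN_USE, and other codes. */
typedef remove_result_t (*remove_attempt_fn)(void *ctx);

/* Optional pause between attempts (for example a sleep); may be NULL. */
typedef void (*wait_fn)(void *ctx);

/* Retry while the GPU is still in use, up to max_attempts tries.
 * Returns the result of the last attempt, or REMOVE_FAILED if none ran. */
static remove_result_t remove_with_retry(remove_attempt_fn attempt, wait_fn wait,
                                         void *ctx, unsigned int max_attempts,
                                         unsigned int *attempts_made)
{
    remove_result_t result = REMOVE_FAILED;
    unsigned int tries = 0;

    if (attempt != NULL) {
        while (tries < max_attempts) {
            result = attempt(ctx);
            ++tries;
            if (result != REMOVE_IN_USE)
                break;                  /* success or hard error: stop */
            if (wait != NULL && tries < max_attempts)
                wait(ctx);              /* give attached processes time to exit */
        }
    }
    if (attempts_made != NULL)
        *attempts_made = tries;
    return result;
}

/* Simulated attempt for demonstration only: ctx points to the number of
 * processes still attached, and one of them exits on every attempt. */
static remove_result_t simulated_attempt(void *ctx)
{
    unsigned int *attached = ctx;

    if (*attached == 0)
        return REMOVE_DONE;
    --*attached;
    return REMOVE_IN_USE;
}
```

With two processes attached at the start, remove_with_retry(simulated_attempt, NULL, &attached, 5, &tries) succeeds on the third attempt. Keeping the attempt behind a callback is a design choice made here so the loop can be exercised without a GPU.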
For long-running NVML processes please note that this will change the enumeration of current GPUs. For example, if there are four GPUs present and GPU1 is removed, the new enumeration will be 0-2. Also, device handles after the removed GPU will not be valid and must be re-established. Must be run as administrator. For Linux only.
For Pascal or newer fully supported devices. Some Kepler devices supported.
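The renumbering described above can be pictured with a small simulation. This is a sketch of the bookkeeping a long-running process might keep for itself, not NVML code: drop_gpu() and the array of PCI bus numbers are hypothetical, and the bus values used in the usage note are made up for illustration.

```c
#include <stddef.h>

/* bus_by_index[i] holds the PCI bus number of the GPU enumerated at index i.
 * Remove the entry at index `removed` and shift later entries down by one,
 * mirroring how the enumeration compacts after a GPU is removed.
 * Returns the new GPU count, or the unchanged count if `removed` is out of range. */
static size_t drop_gpu(unsigned int *bus_by_index, size_t count, size_t removed)
{
    if (bus_by_index == NULL || removed >= count)
        return count;

    for (size_t i = removed; i + 1 < count; ++i)
        bus_by_index[i] = bus_by_index[i + 1];

    return count - 1;
}
```

Starting from four GPUs on buses {0x1a, 0x3b, 0x5e, 0x86} and removing index 1, the function returns 3 and the indices 0-2 now map to {0x1a, 0x5e, 0x86}. Every index after the removed GPU refers to a different physical device than before, which is why handles obtained earlier must be re-established.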