NVML API Reference Guide (PDF) - v535 (older) - Last updated January 31, 2024 - Send Feedback

4.20. Drain states

This chapter describes methods that NVML can perform against a device to control its drain state and its recognition by NVML and the NVIDIA kernel driver. These methods can be used with out-of-band tools to power GPUs on and off, enable robust reset scenarios, and so on.

Functions

nvmlReturn_t nvmlDeviceDiscoverGpus ( nvmlPciInfo_t* pciInfo )
nvmlReturn_t nvmlDeviceModifyDrainState ( nvmlPciInfo_t* pciInfo, nvmlEnableState_t newState )
nvmlReturn_t nvmlDeviceQueryDrainState ( nvmlPciInfo_t* pciInfo, nvmlEnableState_t* currentState )
nvmlReturn_t nvmlDeviceRemoveGpu_v2 ( nvmlPciInfo_t* pciInfo, nvmlDetachGpuState_t gpuState, nvmlPcieLinkState_t linkState )

Functions

nvmlReturn_t nvmlDeviceDiscoverGpus ( nvmlPciInfo_t* pciInfo )
Parameters
pciInfo
The PCI tree to be searched. Only the domain, bus, and device fields are used in this call.
Returns

Description

Request the OS and the NVIDIA kernel driver to rediscover a portion of the PCI subsystem, looking for GPUs that were previously removed. The portion of the PCI tree can be narrowed by specifying a domain, bus, and device. If all three are zero, the entire PCI tree will be searched. Note that for long-running NVML processes, the GPU enumeration will change depending on how many GPUs are discovered and where they are inserted in bus order.

In addition, all newly discovered GPUs will be initialized and their ECC scrubbed, which may take several seconds per GPU. Also, existing device handles are no longer guaranteed to be valid after discovery.

Must be run as administrator. For Linux only.

For Pascal or newer fully supported devices. Some Kepler devices supported.
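
The narrowing rule described above can be sketched in C. The helper names parse_pci_address and selects_whole_tree are hypothetical and are not part of NVML. The sketch assumes the address is written in the common hexadecimal domain:bus:device.function form, and that the three parsed values would then be copied into the domain, bus, and device fields of the nvmlPciInfo_t passed to nvmlDeviceDiscoverGpus().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not an NVML function): parse a PCI address written as
 * "domain:bus:device.function" in hexadecimal, for example "0000:3b:00.0".
 * Returns 1 on success and 0 on malformed input. The function number is
 * validated but discarded, because only domain, bus, and device are used.
 */
static int parse_pci_address(const char *text, unsigned int *domain,
                             unsigned int *bus, unsigned int *device)
{
    unsigned int d, b, dev, fn;
    char trailing;

    assert(text != NULL && domain != NULL && bus != NULL && device != NULL);

    /* Exactly four fields must convert; a fifth match means trailing junk. */
    if (sscanf(text, "%x:%x:%x.%x%c", &d, &b, &dev, &fn, &trailing) != 4)
        return 0;
    /* PCI limits: 8-bit bus, 5-bit device, 3-bit function. */
    if (b > 0xffu || dev > 0x1fu || fn > 0x7u)
        return 0;

    *domain = d;
    *bus = b;
    *device = dev;
    return 1;
}

/* All-zero fields request a search of the entire PCI tree. */
static int selects_whole_tree(unsigned int domain, unsigned int bus,
                              unsigned int device)
{
    return domain == 0 && bus == 0 && device == 0;
}
```

A caller would typically zero-initialize the nvmlPciInfo_t, fill in the three fields only when parse_pci_address() succeeds, and otherwise leave them at zero to rescan everything.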

nvmlReturn_t nvmlDeviceModifyDrainState ( nvmlPciInfo_t* pciInfo, nvmlEnableState_t newState )
Parameters
pciInfo
The PCI address of the GPU whose drain state is to be modified
newState
The drain state that should be entered, see nvmlEnableState_t
Returns

Description

Modify the drain state of a GPU. This method forces a GPU to no longer accept new incoming requests. Any new NVML process will no longer see this GPU. Persistence mode for this GPU must be turned off before this call is made. Must be called as administrator. For Linux only.

For Pascal or newer fully supported devices. Some Kepler devices supported.

nvmlReturn_t nvmlDeviceQueryDrainState ( nvmlPciInfo_t* pciInfo, nvmlEnableState_t* currentState )
Parameters
pciInfo
The PCI address of the GPU whose drain state is to be queried
currentState
The current drain state for this GPU, see nvmlEnableState_t
Returns

Description

Query the drain state of a GPU. This method is used to check whether a GPU is currently in a draining state. For Linux only.

For Pascal or newer fully supported devices. Some Kepler devices supported.

nvmlReturn_t nvmlDeviceRemoveGpu_v2 ( nvmlPciInfo_t* pciInfo, nvmlDetachGpuState_t gpuState, nvmlPcieLinkState_t linkState )
Parameters
pciInfo
The PCI address of the GPU to be removed
gpuState
Whether the GPU is to be removed from the OS, see nvmlDetachGpuState_t
linkState
Requested upstream PCIe link state, see nvmlPcieLinkState_t
Returns

Description

This method will remove the specified GPU from the view of both NVML and the NVIDIA kernel driver as long as no other processes are attached. If other processes are attached, this call will return NVML_ERROR_IN_USE and the GPU will be returned to its original "draining" state. Note: the only situation where a process can still be attached after nvmlDeviceModifyDrainState() is called to initiate the draining state is if that process was using, and is still using, the GPU before the call was made. Also note that persistence mode counts as an attachment to the GPU, so it must be disabled prior to this call.

For long-running NVML processes please note that this will change the enumeration of current GPUs. For example, if there are four GPUs present and GPU1 is removed, the new enumeration will be 0-2. Also, device handles after the removed GPU will not be valid and must be re-established. Must be run as administrator. For Linux only.
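
The renumbering in the four-GPU example can be written out as a small C sketch. Both helpers are hypothetical and not part of NVML, and the sketch assumes that the remaining GPUs keep their relative order after the removal.

```c
#include <assert.h>

/*
 * Hypothetical helper (not an NVML function): given the index a GPU had
 * before a removal, return its index afterwards, or -1 if it is the GPU
 * that was removed. Assumes the remaining GPUs keep their relative order.
 */
static int index_after_removal(int old_index, int removed_index)
{
    assert(old_index >= 0 && removed_index >= 0);

    if (old_index == removed_index)
        return -1;                 /* this GPU is no longer enumerated */
    if (old_index > removed_index)
        return old_index - 1;      /* shifted down by one slot */
    return old_index;              /* enumerated before the removed GPU */
}

/*
 * Handles for GPUs enumerated after the removed one must be re-established.
 * The removed GPU itself has no handle left to refresh.
 */
static int handle_needs_refresh(int old_index, int removed_index)
{
    return old_index > removed_index;
}
```

With four GPUs and GPU1 removed, the old indices 0, 2, and 3 map to 0, 1, and 2, and only the handles for the old GPU2 and GPU3 need to be obtained again.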

For Pascal or newer fully supported devices. Some Kepler devices supported.
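
The ordering constraints in this chapter (disable persistence mode, enter the drain state, then remove, retrying while processes are still attached) can be sketched as follows. This is a minimal model rather than working NVML code: step_status, gpu_ops, drain_and_remove, and the fake_* backend are all hypothetical names, and the NVML calls appear only in comments so that the sequencing can be exercised without a GPU or administrator rights. The retry policy is also an assumption; the text above only states that a GPU in use stays in its draining state.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the nvmlReturn_t outcomes this sketch distinguishes. */
typedef enum { STEP_OK = 0, STEP_IN_USE, STEP_FAILED } step_status;

/*
 * Hypothetical operations table. A real tool would implement each entry
 * with the NVML call named in the comment.
 */
typedef struct {
    step_status (*disable_persistence)(void *ctx); /* nvmlDeviceSetPersistenceMode */
    step_status (*enter_drain)(void *ctx);         /* nvmlDeviceModifyDrainState   */
    step_status (*remove_gpu)(void *ctx);          /* nvmlDeviceRemoveGpu_v2       */
} gpu_ops;

/* Run the steps in order; retry removal while the GPU is still in use. */
static step_status drain_and_remove(const gpu_ops *ops, void *ctx,
                                    int max_attempts)
{
    step_status st;
    int attempt;

    assert(ops != NULL && max_attempts > 0);

    st = ops->disable_persistence(ctx);
    if (st != STEP_OK)
        return st;
    st = ops->enter_drain(ctx);
    if (st != STEP_OK)
        return st;

    for (attempt = 0; attempt < max_attempts; ++attempt) {
        st = ops->remove_gpu(ctx);
        if (st != STEP_IN_USE)
            return st;             /* removed, or failed for another reason */
        /* A real tool would wait here for attached processes to exit. */
    }
    return STEP_IN_USE;            /* still attached; GPU remains draining */
}

/* Simulated backend, used only to demonstrate the sequencing. */
typedef struct {
    int busy_attempts;   /* removal attempts that will report "in use" */
    int persistence_on;
    int draining;
    int removed;
} fake_gpu;

static step_status fake_disable_persistence(void *ctx)
{
    ((fake_gpu *)ctx)->persistence_on = 0;
    return STEP_OK;
}

static step_status fake_enter_drain(void *ctx)
{
    fake_gpu *g = (fake_gpu *)ctx;
    if (g->persistence_on)
        return STEP_FAILED;        /* persistence mode must be off first */
    g->draining = 1;
    return STEP_OK;
}

static step_status fake_remove_gpu(void *ctx)
{
    fake_gpu *g = (fake_gpu *)ctx;
    if (g->busy_attempts > 0) {
        g->busy_attempts--;
        return STEP_IN_USE;
    }
    g->removed = 1;
    return STEP_OK;
}

static const gpu_ops fake_ops = {
    fake_disable_persistence, fake_enter_drain, fake_remove_gpu
};
```

Once the GPU has been serviced, a later call to nvmlDeviceDiscoverGpus() with the same domain, bus, and device would ask the driver to bring it back, after which device handles should be obtained again.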

