Detached GPUs
Overview
Beginning with DCGM 4.5 and driver 590, DCGM supports GPU binding and unbinding. DCGM can launch even when no GPU is detected, and administrators can attach GPUs to the system on the fly. DCGM monitors events and automatically recognizes newly attached or detached GPUs. In addition, DCGM supports dynamically attaching and detaching the driver, allowing driver updates without stopping DCGM.
Ways to Detach GPUs
Use DCGM Public APIs
DCGM offers two public APIs for detaching and attaching drivers: dcgmDetachDriver and dcgmAttachDriver.
When dcgmDetachDriver is called, DCGM detaches from NVML and sets all GPUs to a detached state. When dcgmAttachDriver is called, DCGM attaches to NVML, sets all discovered GPUs to an OK state, and removes any still-detached GPUs from groups along with their watched fields.
dcgmi offers command-line equivalents for these two APIs:
$ dcgmi set --attach-driver
$ dcgmi set --detach-driver
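The two commands above can be combined into a driver-update cycle. The sketch below is print-only by default (DRY_RUN=1), and the modprobe reload step is an assumption about how the driver is swapped on a given system:

```shell
#!/usr/bin/env bash
# Sketch: update the NVIDIA kernel driver without stopping nv-hostengine.
# DRY_RUN=1 (default) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

run dcgmi set --detach-driver   # DCGM detaches NVML; GPUs enter the detached state
run modprobe -r nvidia          # assumed reload mechanism; adjust to your update flow
run modprobe nvidia
run dcgmi set --attach-driver   # DCGM reattaches NVML; found GPUs return to OK
```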
Use sysfs
A GPU can also be detached and attached through sysfs, by unbinding it from and binding it to the nvidia driver:
# echo "<PCI Bus ID>" > /sys/bus/pci/devices/<PCI Bus ID>/driver/unbind
# echo "<PCI Bus ID>" > /sys/bus/pci/drivers/nvidia/bind
For example:
# echo "0000:65:00.0" > /sys/bus/pci/devices/0000\:65\:00.0/driver/unbind
# echo "0000:65:00.0" > /sys/bus/pci/drivers/nvidia/bind
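The same sequence can be wrapped in small helpers. This is a sketch: the helper names are made up here, DRY_RUN=1 (the default) only prints what would be written, and the PCI bus ID is the example from above:

```shell
#!/usr/bin/env bash
# Helpers that build the sysfs paths for unbinding/binding a GPU.
DRY_RUN=${DRY_RUN:-1}

unbind_path() { echo "/sys/bus/pci/devices/$1/driver/unbind"; }
bind_path()   { echo "/sys/bus/pci/drivers/nvidia/bind"; }

sysfs_write() {                 # write $1 into $2, or just print in dry-run mode
  if [ "$DRY_RUN" = "1" ]; then
    echo "would write $1 to $2"
  else
    echo "$1" > "$2"
  fi
}

bdf="0000:65:00.0"                           # example PCI bus ID
sysfs_write "$bdf" "$(unbind_path "$bdf")"   # detach from the nvidia driver
sysfs_write "$bdf" "$(bind_path)"            # reattach
```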
Behavior of Detached GPUs
GPU Entity ID
The DCGM GPU ID stays the same throughout the entire life cycle of the DCGM main process (nv‑hostengine), even after attach and detach operations.
Group
Detached GPUs and their associated MIG instances will be removed from groups.
Detached GPUs cannot be added to a group.
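For reference, the normal group workflow looks like the sketch below (print-only; the group name, group ID 1, and GPU ID 0 are illustrative). Running the second command with a detached GPU's ID would be rejected:

```shell
#!/usr/bin/env bash
# Print-only sketch of the dcgmi group workflow; adding a detached
# GPU ID in the second step would be rejected by DCGM.
create_cmd="dcgmi group -c mygpus"   # create an empty group named "mygpus"
add_cmd="dcgmi group -g 1 -a 0"      # add GPU 0 to group 1 (illustrative IDs)
echo "$create_cmd"
echo "$add_cmd"
```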
Configuration
The DCGM-applied configuration will be reset upon GPU detachment, so the configuration will not be automatically reapplied upon reattachment.
DCGM can neither set nor retrieve configuration on detached GPUs.
Watches and Monitoring
DCGM cannot watch fields on detached GPUs.
All cached values and watched fields are removed from detached GPUs.
Watched fields on groups
Fields watched on DCGM_GROUP_ALL_GPUS will be automatically applied to newly attached GPUs.
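Since fields watched across all GPUs extend to newly attached ones, watching against all GPUs is the simplest way to cover hot-attach. The sketch below is print-only; field IDs 150 and 155 are DCGM's GPU-temperature and power-usage fields, and the dmon flags are assumed from typical dcgmi usage:

```shell
#!/usr/bin/env bash
# Print-only sketch: watch GPU temperature (field 150) and power usage
# (field 155) every second across all GPUs; newly attached GPUs are
# covered automatically.
dmon_cmd="dcgmi dmon -e 150,155 -d 1000"
echo "$dmon_cmd"   # run on a host with DCGM installed
```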
Topology
dcgmGetDeviceTopology on a detached GPU: the result reports 0 reachable GPUs.
dcgmGetDeviceTopology on a normal GPU: detached GPUs are excluded from the reported paths.
dcgmSelectGpusByTopology will not include detached GPUs in its output.
Policy
Policies for groups
Policies set on DCGM_GROUP_ALL_GPUS will be automatically applied to newly attached GPUs.
Registered policies will not be triggered on detached GPUs.
Registered policies will be removed from detached GPUs.
Policies cannot be set on detached GPUs.
Profiling
DCGM cannot watch profiling fields on detached GPUs.
In the profiling module, all watches on detached GPUs will be removed.
Watched profiling fields on groups
Profiling fields watched on DCGM_GROUP_ALL_GPUS will be automatically applied to newly attached GPUs.
Diagnostic
A running diagnostic will be stopped during the attach or detach process.
When a group is specified, only active GPUs will be tested:
Detached GPUs will not be tested.
If only detached GPUs are specified, the diagnostic will fail immediately.
If both detached and active GPUs are specified, only the active GPUs will be tested.
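A short diagnostic run against a group illustrates the rules above: detached members are skipped and only active GPUs are tested. Print-only sketch with an illustrative group ID:

```shell
#!/usr/bin/env bash
# Print-only sketch: run the short (-r 1) diagnostic on group 2
# (illustrative ID); detached GPUs in the group are skipped.
diag_cmd="dcgmi diag -r 1 -g 2"
echo "$diag_cmd"
```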
Multi-Node Diagnostics
If GPUs are detached or attached while an MnDiag operation is running, the attach and detach processes will wait until MnDiag completes.
Health
Health watches on groups
Health watches set on DCGM_GROUP_ALL_GPUS will be automatically applied to newly attached GPUs.
Health checks will not report incidents on detached GPUs.
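As a sketch, health watches can be set and checked per group with dcgmi; the group ID below is illustrative, and the -s a flag (watch all health systems) is an assumption about the installed dcgmi version. Print-only:

```shell
#!/usr/bin/env bash
# Print-only sketch: enable all health watches on group 1 (illustrative ID),
# then check it; detached GPUs in the group produce no incidents.
set_cmd="dcgmi health -g 1 -s a"
check_cmd="dcgmi health -g 1 -c"
echo "$set_cmd"
echo "$check_cmd"
```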