Detached GPUs

Overview

Beginning with DCGM 4.5 and driver 590, DCGM supports GPU binding and unbinding. DCGM can launch even when no GPU is detected, and an administrator can attach GPUs to the system on the fly. DCGM monitors events and automatically recognizes newly attached or detached GPUs. In addition, DCGM supports dynamically attaching and detaching the driver, allowing driver updates without stopping DCGM.

Ways to Detach GPUs

Use DCGM Public APIs

DCGM offers two public APIs for detaching and attaching drivers: dcgmDetachDriver and dcgmAttachDriver.

When dcgmDetachDriver is called, DCGM detaches from NVML and marks all GPUs as detached. When dcgmAttachDriver is called, DCGM reattaches NVML, marks all discovered GPUs as OK, and removes any still-detached GPUs from groups along with their watched fields.

dcgmi offers command-line interfaces for these two APIs.

$ dcgmi set --detach-driver
$ dcgmi set --attach-driver
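These commands can also be driven from a script. A minimal sketch follows; the dcgmi flags are the ones documented above, and the `run` callable is injectable so the wrapper can be exercised without dcgmi installed:

```python
import subprocess


def set_driver_state(attach: bool, run=subprocess.run) -> bool:
    """Attach or detach the driver via `dcgmi set`.

    `run` defaults to subprocess.run but can be replaced for testing;
    the flags are the documented dcgmi options.
    """
    flag = "--attach-driver" if attach else "--detach-driver"
    result = run(["dcgmi", "set", flag])
    return result.returncode == 0
```

Checking the return code lets a maintenance script abort early if the detach did not succeed before, for example, starting a driver update.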

Use sysfs

These sysfs entries can be used to detach and attach a GPU:

# echo "<PCI Bus ID>" > /sys/bus/pci/devices/<PCI Bus ID>/driver/unbind
# echo "<PCI Bus ID>" > /sys/bus/pci/drivers/nvidia/bind

For example:

# echo "0000:65:00.0" > /sys/bus/pci/devices/0000\:65\:00.0/driver/unbind
# echo "0000:65:00.0" > /sys/bus/pci/drivers/nvidia/bind
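The same two sysfs writes can be wrapped in a small helper. This sketch only assumes the sysfs paths shown above; actually writing to them requires root and a real device, so the path construction is kept separate from the writes:

```python
def sysfs_bind_paths(pci_bus_id: str) -> tuple[str, str]:
    """Return the (unbind, bind) sysfs paths for a PCI bus ID."""
    unbind = f"/sys/bus/pci/devices/{pci_bus_id}/driver/unbind"
    bind = "/sys/bus/pci/drivers/nvidia/bind"
    return unbind, bind


def detach_gpu(pci_bus_id: str) -> None:
    """Unbind the GPU at `pci_bus_id` from its driver (requires root)."""
    unbind, _ = sysfs_bind_paths(pci_bus_id)
    with open(unbind, "w") as f:
        f.write(pci_bus_id)


def attach_gpu(pci_bus_id: str) -> None:
    """Bind the GPU at `pci_bus_id` to the nvidia driver (requires root)."""
    _, bind = sysfs_bind_paths(pci_bus_id)
    with open(bind, "w") as f:
        f.write(pci_bus_id)
```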

Behavior of Detached GPUs

GPU Entity Id

The DCGM GPU ID stays the same throughout the entire life cycle of the DCGM main process (nv‑hostengine), even after attach and detach operations.

Group

  • Detached GPUs and their associated MIG instances will be removed from groups.

  • Detached GPUs cannot be added to a group.
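The two rules above can be illustrated with a toy model; this is only an illustration of the documented behavior, not DCGM's internal data structures:

```python
class GpuGroup:
    """Toy model of a DCGM GPU group under attach/detach rules."""

    def __init__(self):
        self.members: set = set()

    def add(self, gpu_id: int, detached: bool) -> bool:
        # Detached GPUs cannot be added to a group.
        if detached:
            return False
        self.members.add(gpu_id)
        return True

    def on_detach(self, gpu_id: int) -> None:
        # A GPU (and its MIG instances) leaves every group on detach.
        self.members.discard(gpu_id)
```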

Configuration

  • The configuration applied through DCGM will be reset when a GPU is detached; it will not be automatically reapplied upon reattachment.

  • DCGM cannot set or retrieve configuration on detached GPUs.

Watches and Monitoring

  • DCGM cannot watch fields on detached GPUs.

  • All cached values and watched fields are removed from detached GPUs.

  • Fields watched on the group DCGM_GROUP_ALL_GPUS will be automatically applied to newly attached GPUs.
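The watch lifecycle above can be sketched as a toy table (a conceptual model of the documented rules, not DCGM's implementation):

```python
class WatchTable:
    """Toy model: DCGM_GROUP_ALL_GPUS watches follow newly attached GPUs."""

    def __init__(self):
        self.all_gpus_fields = set()  # fields watched on ALL_GPUS
        self.per_gpu = {}             # gpu_id -> set of watched fields

    def watch_all_gpus(self, field_id: int) -> None:
        # An ALL_GPUS watch applies to every currently attached GPU ...
        self.all_gpus_fields.add(field_id)
        for fields in self.per_gpu.values():
            fields.add(field_id)

    def on_attach(self, gpu_id: int) -> None:
        # ... and is automatically applied to newly attached GPUs.
        self.per_gpu[gpu_id] = set(self.all_gpus_fields)

    def on_detach(self, gpu_id: int) -> None:
        # All cached values and watched fields are removed on detach.
        self.per_gpu.pop(gpu_id, None)
```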

Topology

  • dcgmGetDeviceTopology

    • On a detached GPU: the call reports zero reachable GPUs.

    • On an active GPU: detached GPUs are excluded from the reported paths.

  • dcgmSelectGpusByTopology will not include detached GPUs in the output.

Policy

  • Policies for group DCGM_GROUP_ALL_GPUS will be automatically applied to newly attached GPUs.

  • Registered policies will not be triggered on detached GPUs.

  • Registered policies will be removed from detached GPUs.

  • Policies cannot be set on detached GPUs.

Profiling

  • DCGM cannot watch profiling fields on detached GPUs.

  • In the profiling module, all watches on detached GPUs will be removed.

  • Watched profiling fields on group DCGM_GROUP_ALL_GPUS will be automatically applied to newly attached GPUs.

Diagnostic

  • A running diagnostic will be stopped during the attach or detach process.

  • When specifying a group, only active GPUs will be tested.

  • Detached GPUs will not be tested.

    • If only detached GPUs are specified, the diagnostic will fail immediately.

    • If both detached and active GPUs are specified, only the active GPUs will be tested.
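The selection rules above amount to a small filter; the sketch below is illustrative only, and the names are not DCGM APIs:

```python
def select_diag_gpus(requested, detached):
    """Return the GPUs a diagnostic would actually test.

    Detached GPUs are skipped; if every requested GPU is detached,
    the diagnostic fails immediately.
    """
    active = [gpu for gpu in requested if gpu not in detached]
    if not active:
        raise RuntimeError("all requested GPUs are detached; diagnostic fails")
    return active
```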

Multi-Node Diagnostics

When GPUs are detached or attached during a multi-node diagnostic (MnDiag) run, the attach or detach operation will wait until MnDiag completes.

Health

  • Health watches on group DCGM_GROUP_ALL_GPUS will be automatically applied to newly attached GPUs.

  • Health checks will not report incidents on detached GPUs.