The DRA provider discovers NVLink domain membership by reading nvidia.com/gpu.clique node labels set by the NVIDIA GPU Operator’s Dynamic Resource Allocation (DRA) driver. It is a Kubernetes-only provider that authenticates with the in-cluster service account, so no credentials are required.
Important: The DRA provider produces block topology only (topology/block — NVLink domain membership). It does not discover switch tree topology. If you need both switch tree and NVLink domain topology, use the InfiniBand or NetQ provider instead.
On GB200 NVL72 and similar Multi-Node NVLink (MNNVL) hardware, groups of nodes share a high-bandwidth NVLink fabric (1.8 TB/s chip-to-chip). Workloads that span these nodes — distributed training, disaggregated inference — benefit significantly from being placed within the same NVLink domain.
Kubernetes exposes this through ComputeDomains, a DRA-based abstraction that represents a set of nodes sharing an NVLink/MNNVL domain as a first-class scheduling object. The GPU Operator’s DRA driver labels each node with nvidia.com/gpu.clique to encode its NVLink clique membership. Schedulers like KAI Scheduler consume these labels — via Topograph — to make topology-aware placement decisions.
The DRA provider is Topograph’s integration point for this ecosystem. For more background, see:
Use the DRA provider when:
If you also need switch tree topology — for example to express the full fabric hierarchy for topology/tree scheduling — use the InfiniBand or NetQ provider instead.
The nvidia.com/gpu.clique labels are applied automatically by the GPU Operator’s DRA driver — these are not manually configured by users.
Topograph reads these labels from the Kubernetes API:
1. Lists the cluster's nodes (filtered by nodeSelector if provided).
2. For each node with the nvidia.com/gpu.clique label, reads the clique ID and groups nodes by NVLink domain.

If no nodes with matching labels are found, Topograph returns a 502 error with a diagnostic message indicating which label and annotations to check.
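The discovery steps above can be sketched in Python. This is a minimal illustration, not Topograph's implementation (Topograph queries the Kubernetes API directly). It assumes nodes are plain dicts shaped like Kubernetes Node objects (metadata.name, metadata.labels), treats nodeSelector as an equality-only label match, and uses a hypothetical NoCliqueLabelsError to stand in for the 502 response. The clique IDs in the usage example are made up.

```python
from collections import defaultdict

CLIQUE_LABEL = "nvidia.com/gpu.clique"


class NoCliqueLabelsError(Exception):
    """Hypothetical stand-in for the 502 error Topograph returns."""


def matches_selector(labels, node_selector):
    # Equality-only match: every selector key must be present with the same value.
    return all(labels.get(k) == v for k, v in (node_selector or {}).items())


def group_by_clique(nodes, node_selector=None):
    """Group node names by the value of their nvidia.com/gpu.clique label."""
    domains = defaultdict(list)
    for node in nodes:
        meta = node.get("metadata", {})
        labels = meta.get("labels") or {}
        # Step 1: skip nodes filtered out by the optional nodeSelector.
        if not matches_selector(labels, node_selector):
            continue
        # Step 2: nodes without the clique label do not join any NVLink domain.
        clique_id = labels.get(CLIQUE_LABEL)
        if clique_id:
            domains[clique_id].append(meta["name"])
    if not domains:
        raise NoCliqueLabelsError(
            f"no nodes with label {CLIQUE_LABEL!r} matched the request"
        )
    return {clique: sorted(names) for clique, names in domains.items()}


# Usage with made-up nodes: three GPU nodes in two domains, one CPU node.
nodes = [
    {"metadata": {"name": "node-a", "labels": {CLIQUE_LABEL: "uuid-1.0", "pool": "gb200"}}},
    {"metadata": {"name": "node-b", "labels": {CLIQUE_LABEL: "uuid-1.0", "pool": "gb200"}}},
    {"metadata": {"name": "node-c", "labels": {CLIQUE_LABEL: "uuid-1.1", "pool": "gb200"}}},
    {"metadata": {"name": "cpu-1", "labels": {"pool": "cpu"}}},
]
print(group_by_clique(nodes, {"pool": "gb200"}))
# {'uuid-1.0': ['node-a', 'node-b'], 'uuid-1.1': ['node-c']}
```

Grouping by the full label value keeps the sketch independent of how the clique ID string is structured.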
- nvidia.com/gpu.clique labels, applied automatically by the DRA driver

No credentials are required. The provider uses the in-cluster service account automatically.
Set provider: dra in your Topograph config:
For Slinky (Slurm-on-Kubernetes) deployments:
To filter participating nodes via nodeSelector, pass parameters in the topology request payload:
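As an illustration of such a request, the sketch below builds a JSON body that restricts the DRA provider to nodes carrying a hypothetical pool=gb200 label. The payload layout (provider.name, provider.params.nodeSelector, engine.name) and the engine name "k8s" are assumptions made for this sketch, not a confirmed schema; check the Topograph API reference for the exact field names before relying on it.

```python
import json


def build_topology_request(node_selector, engine="k8s"):
    """Build a hypothetical Topograph request body for the DRA provider.

    Field names are assumptions for illustration, not a confirmed schema.
    """
    return {
        "provider": {
            "name": "dra",
            # Only nodes whose labels match every key/value pair are considered.
            "params": {"nodeSelector": dict(node_selector)},
        },
        "engine": {"name": engine},
    }


payload = build_topology_request({"pool": "gb200"})
print(json.dumps(payload, indent=2))
```

Copying the selector with dict() keeps later changes to the caller's dict from leaking into an already built payload.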
After triggering topology generation, inspect the node labels applied by Topograph:
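For larger clusters, this inspection can be scripted. The sketch below takes the parsed output of kubectl get nodes -o json and checks that every NVLink domain (one nvidia.com/gpu.clique value) maps to exactly one block label and that no block mixes domains. The block label key Topograph writes is documented with the Kubernetes engine; the key used here, example.com/topology-block, is a hypothetical placeholder, and the one-domain-per-block expectation is this sketch's assumption.

```python
from collections import defaultdict

CLIQUE_LABEL = "nvidia.com/gpu.clique"


def check_block_labels(node_list, block_label):
    """Return a list of problems; an empty list means the labels look consistent.

    node_list: parsed output of `kubectl get nodes -o json`.
    block_label: the label key Topograph writes (hypothetical in this sketch).
    """
    blocks_per_clique = defaultdict(set)
    cliques_per_block = defaultdict(set)
    problems = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        labels = node["metadata"].get("labels") or {}
        clique = labels.get(CLIQUE_LABEL)
        if clique is None:
            continue  # not part of any NVLink domain
        block = labels.get(block_label)
        if block is None:
            problems.append(f"{name}: missing {block_label}")
            continue
        blocks_per_clique[clique].add(block)
        cliques_per_block[block].add(clique)
    # A domain split across blocks, or a block mixing domains, is suspicious.
    for clique, blocks in sorted(blocks_per_clique.items()):
        if len(blocks) > 1:
            problems.append(f"clique {clique}: split across blocks {sorted(blocks)}")
    for block, cliques in sorted(cliques_per_block.items()):
        if len(cliques) > 1:
            problems.append(f"block {block}: mixes cliques {sorted(cliques)}")
    return problems


example_label = "example.com/topology-block"  # hypothetical label key
nodes = {"items": [
    {"metadata": {"name": "node-a", "labels": {CLIQUE_LABEL: "c1", example_label: "block1"}}},
    {"metadata": {"name": "node-b", "labels": {CLIQUE_LABEL: "c1", example_label: "block1"}}},
    {"metadata": {"name": "node-c", "labels": {CLIQUE_LABEL: "c2"}}},
]}
print(check_block_labels(nodes, example_label))
# ['node-c: missing example.com/topology-block']
```

In practice, the input can come from saving kubectl get nodes -o json to a file and loading it with json.load.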
If topology generation returns a 502 error, check that the expected nodes have the nvidia.com/gpu.clique label and the topograph.nvidia.com/region / topograph.nvidia.com/instance annotations (the latter two are set by Topograph itself during topology discovery):
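Across many nodes, a small script can make this check faster. The sketch below is a hypothetical helper, not part of Topograph: it scans the parsed output of kubectl get nodes -o json and reports, per node, which of the label and annotation keys named above are absent.

```python
CLIQUE_LABEL = "nvidia.com/gpu.clique"
TOPOGRAPH_ANNOTATIONS = (
    "topograph.nvidia.com/region",
    "topograph.nvidia.com/instance",
)


def missing_metadata(node_list):
    """Map node name -> list of required label/annotation keys it lacks."""
    report = {}
    for node in node_list.get("items", []):
        meta = node["metadata"]
        labels = meta.get("labels") or {}
        annotations = meta.get("annotations") or {}
        missing = []
        if CLIQUE_LABEL not in labels:
            missing.append(f"label {CLIQUE_LABEL}")
        # The region/instance annotations are written by Topograph during discovery.
        missing += [
            f"annotation {key}" for key in TOPOGRAPH_ANNOTATIONS if key not in annotations
        ]
        if missing:
            report[meta["name"]] = missing
    return report


nodes = {"items": [
    {"metadata": {
        "name": "node-a",
        "labels": {CLIQUE_LABEL: "c1"},
        "annotations": {
            "topograph.nvidia.com/region": "r1",
            "topograph.nvidia.com/instance": "i-a",
        },
    }},
    {"metadata": {
        "name": "node-b",
        "labels": {},
        "annotations": {"topograph.nvidia.com/region": "r1"},
    }},
]}
print(missing_metadata(nodes))
# {'node-b': ['label nvidia.com/gpu.clique', 'annotation topograph.nvidia.com/instance']}
```

Nodes that have every key are omitted from the report, so an empty dict means the metadata the provider looks for is present.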
See the Kubernetes engine documentation for details on the label schema.