NetQ Topology Provider
NVIDIA NetQ collects telemetry from Ethernet switches, DPUs, hosts, and NVLink fabrics, normalizes it, and streams it into a central analytics layer where metrics, events, and topology data are correlated in near real time. It exposes this processed data through a unified monitoring and operations platform supporting alerting, validation, and visibility for large-scale GPU and Ethernet/NVLink environments.
The Topograph NetQ provider queries the NetQ API to extract fabric topology and NVLink domain data, translating it into the format expected by your workload manager. It builds both a switch tree (for Slurm topology/tree or Kubernetes labels) and an NVLink domain map (for topology/block).
Topology discovery scope is determined by the NetQ server and the premises accessible to the configured account. No CSP credentials are required.
When to Use This Provider
Spectrum-X environments: NetQ is the standard management plane for Spectrum-X and is the recommended provider — it has authoritative, real-time visibility into the fabric that ibnetdiscover-based approaches cannot provide.
Multi-node NVLink (MNNVL) environments: NetQ includes NVLink Management (previously packaged as NMX-M), which provides native visibility into NVLink fabric topology, domain membership, and partitions at the infrastructure level. Note that for Kubernetes MNNVL scheduling, the DRA provider is the appropriate Topograph integration — it reads nvidia.com/gpu.clique labels set by the GPU Operator’s DRA driver and feeds them directly to Kubernetes schedulers. NetQ and DRA operate at different layers and can coexist.
Traditional IB environments: If NetQ is already deployed and managing your IB fabric, use this provider to leverage its existing topology data. If NetQ is not present, use the InfiniBand provider instead.
Observed vs. Intended Topology
The NetQ provider reports what the fabric actually looks like right now, not what configuration files say it should look like. Because NetQ exposes live telemetry, Topograph can observe link states below the hard-failure threshold — degraded links that are still technically up but impacting performance. That signal is invisible to ibnetdiscover-based discovery and unreported by any cloud placement API. At scale, where nodes cycle continuously and link degradation is a constant background rate, this observed-topology view is substantively different from the static view that hand-maintained labels or placement snapshots provide.
Output
The NetQ provider produces the same topology representation as the InfiniBand providers, consumed by whichever engine you configure:
- Slurm engine (`engine: slurm`) — writes a `topology.conf` file for Slurm topology-aware scheduling
- Kubernetes engine (`engine: k8s`) — applies `network.topology.nvidia.com/` labels to nodes
- Slinky engine (`engine: slinky`) — writes topology data to a Kubernetes ConfigMap
See the engine documentation (docs/engines/) for details on each output format.
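For example, with the Slurm engine the generated file follows the standard Slurm `topology.conf` format; the switch and node names below are purely illustrative:

```
# Illustrative only: real switch names come from the NetQ topology graph
SwitchName=leaf01 Nodes=node[01-04]
SwitchName=leaf02 Nodes=node[05-08]
SwitchName=spine01 Switches=leaf[01-02]
```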
Prerequisites
- A running NetQ server accessible from the Topograph host
- A NetQ account with access to at least one premises with topology data
Credentials
Parameters
Configuration
Credentials via File
Store credentials in a YAML file:
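The exact keys are defined by your Topograph release; the layout below is an assumption shown for illustration only:

```yaml
# Hypothetical credential file: confirm the exact keys against your Topograph version
username: netq-admin@example.com
password: <netq-password>
```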
Reference the file in your Topograph config:
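A provider stanza along these lines is a reasonable sketch; the field names (`provider`, `params`, `credentials_file`) are assumptions, so confirm them against the Topograph configuration reference:

```yaml
# Hypothetical Topograph configuration snippet: field names are illustrative assumptions
provider:
  name: netq
  params:
    url: https://netq.example.com
    credentials_file: /etc/topograph/netq-credentials.yaml
```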
Credentials via API Request Payload
Pass credentials directly in the topology request:
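A request body of roughly this shape matches the provider/engine split described in the Output section; the field names are assumptions rather than the authoritative schema, which lives in the Topograph API reference:

```json
{
  "provider": {
    "name": "netq",
    "creds": {
      "username": "netq-admin@example.com",
      "password": "<netq-password>"
    }
  },
  "engine": {
    "name": "slurm"
  }
}
```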
How It Works
The provider makes two independent API calls and combines their results:
Switch tree (topology/tree):
- Authenticates via `POST api/netq/auth/v1/login` to obtain an access token and list of premises
- For each premises with topology data, selects it via `GET api/netq/auth/v1/select/opid/{opid}`
- Fetches the fabric topology graph via `POST api/netq/telemetry/v1/object/topologygraph/fetch-topology`
- Parses the tier-based node and link graph into a switch tree; Clos topologies are reduced to a canonical tree representation
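Driven by hand, the same sequence looks roughly like the curl sketch below; the endpoints are the ones listed above, but the request bodies and headers are simplified assumptions rather than the exact NetQ API contract:

```sh
# 1. Log in; the response carries an access token and the list of accessible premises
curl -s -X POST https://netq.example.com/api/netq/auth/v1/login \
  -H 'Content-Type: application/json' \
  -d '{"username": "netq-admin@example.com", "password": "<netq-password>"}'

# 2. Select a premises by its opid, using the token from step 1
curl -s -H "Authorization: Bearer $TOKEN" \
  https://netq.example.com/api/netq/auth/v1/select/opid/<opid>

# 3. Fetch the fabric topology graph for the selected premises
curl -s -X POST -H "Authorization: Bearer $TOKEN" \
  https://netq.example.com/api/netq/telemetry/v1/object/topologygraph/fetch-topology
```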
NVLink domains (topology/block):
- Fetches compute node records via `GET nmx/v1/compute-nodes` using Basic auth — the `nmx` path reflects the NetQ NVLink Management API, previously known as NMX-M
- Groups nodes by `DomainUUID` to build the NVLink domain map. The `DomainUUID` is a NetQ/NMX identifier and differs in format from the `ClusterUUID.CliqueId` value used by `nvidia.com/gpu.clique` and the InfiniBand provider — both identify the same physical NVLink domain, but the values are not directly comparable as strings.
NVLink domain discovery is best-effort — if it fails, Topograph logs a warning and returns the switch tree only. Multi-premises environments are supported: Topograph iterates over all accessible premises and merges their topology graphs.
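Conceptually, the domain map simply groups hosts that report the same `DomainUUID`; an illustrative result with made-up identifiers:

```yaml
# Made-up UUIDs and hostnames, shown only to illustrate the grouping
nvlink-domains:
  domain-uuid-aaaa:
    - gpu-node-01
    - gpu-node-02
  domain-uuid-bbbb:
    - gpu-node-03
    - gpu-node-04
```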
Verifying the Output
After triggering topology generation, query the result endpoint:
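For example (the host, port, and query parameter are placeholders; use the address and result endpoint exposed by your Topograph deployment):

```sh
# Placeholder endpoint: substitute your Topograph service address and the request ID returned by the generate call
curl "http://localhost:49021/v1/topology?uid=<request-uid>"
```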
For the Slurm engine, verify the generated topology.conf reflects the expected switch hierarchy. See the Slurm engine documentation for details.