NetQ Topology Provider

NVIDIA NetQ collects telemetry from Ethernet switches, DPUs, hosts, and NVLink fabrics, normalizes it, and streams it into a central analytics layer where metrics, events, and topology data are correlated in near real time. It exposes this processed data through a unified monitoring and operations platform supporting alerting, validation, and visibility for large-scale GPU and Ethernet/NVLink environments.

The Topograph NetQ provider queries the NetQ API to extract fabric topology and NVLink domain data, translating it into the format expected by your workload manager. It builds both a switch tree (for Slurm topology/tree or Kubernetes labels) and an NVLink domain map (for topology/block).

Topology discovery scope is determined by the NetQ server and the premises accessible to the configured account. No CSP credentials are required.

When to Use This Provider

Spectrum-X environments: NetQ is the standard management plane for Spectrum-X and is the recommended provider — it has authoritative, real-time visibility into the fabric that ibnetdiscover-based approaches cannot provide.

Multi-node NVLink (MNNVL) environments: NetQ includes NVLink Management (previously packaged as NMX-M), which provides native visibility into NVLink fabric topology, domain membership, and partitions at the infrastructure level. Note that for Kubernetes MNNVL scheduling, the DRA provider is the appropriate Topograph integration — it reads nvidia.com/gpu.clique labels set by the GPU Operator’s DRA driver and feeds them directly to Kubernetes schedulers. NetQ and DRA operate at different layers and can coexist.

Traditional IB environments: If NetQ is already deployed and managing your IB fabric, use this provider to leverage its existing topology data. If NetQ is not present, use the InfiniBand provider instead.

| Scenario | Recommended Topograph provider |
|---|---|
| Spectrum-X fabric | NetQ |
| MNNVL fabric, infrastructure visibility | NetQ |
| MNNVL fabric, Kubernetes scheduling | DRA |
| Traditional IB fabric, NetQ deployed | NetQ |
| Traditional IB fabric, no NetQ | infiniband-bm or infiniband-k8s |

Observed vs. Intended Topology

The NetQ provider reports what the fabric actually looks like right now, not what configuration files say it should look like. Because NetQ exposes live telemetry, Topograph can observe link states below the hard-failure threshold — degraded links that are still technically up but impacting performance. That signal is invisible to ibnetdiscover-based discovery and unreported by any cloud placement API. At scale, where nodes cycle continuously and link degradation is a constant background rate, this observed-topology view is substantively different from the static view that hand-maintained labels or placement snapshots provide.

Output

The NetQ provider produces the same topology representation as the InfiniBand providers, consumed by whichever engine you configure:

  • Slurm engine (engine: slurm) — writes a topology.conf file for Slurm topology-aware scheduling
  • Kubernetes engine (engine: k8s) — applies network.topology.nvidia.com/ labels to nodes
  • Slinky engine (engine: slinky) — writes topology data to a Kubernetes ConfigMap

See the engine documentation (docs/engines/) for details on each output format.
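
For the Kubernetes engine, a quick sanity check after a run is to look for node labels under the network.topology.nvidia.com/ prefix. The command below is only an illustration; the exact label keys are defined by the engine and documented there.

$ kubectl get nodes --show-labels | grep network.topology.nvidia.com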

Prerequisites

  • A running NetQ server accessible from the Topograph host
  • A NetQ account with access to at least one premises with topology data

Credentials

| Field | Required | Description |
|---|---|---|
| username | Yes | NetQ account username |
| password | Yes | NetQ account password |

Parameters

| Field | Required | Description |
|---|---|---|
| apiUrl | Yes | Base URL of the NetQ server (e.g. https://netq.example.com) |

Configuration

Credentials via File

Store credentials in a YAML file:

username: <USERNAME>
password: <PASSWORD>
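
Because the file holds a plaintext password, it is sensible to restrict it to the user running Topograph; the path here is a placeholder matching the example above:

$ chmod 600 /path/to/credentials.yaml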

Reference the file in your Topograph config:

http:
  port: 49021
  ssl: false

provider: netq
engine: slurm

credentialsPath: /path/to/credentials.yaml

Credentials via API Request Payload

Pass credentials directly in the topology request:

{
  "provider": {
    "name": "netq",
    "creds": {
      "username": "<USERNAME>",
      "password": "<PASSWORD>"
    },
    "params": {
      "apiUrl": "https://netq.example.com"
    }
  },
  "engine": {
    "name": "slurm"
  }
}

How It Works

The provider makes two independent API calls and combines their results:

Switch tree (topology/tree):

  1. Authenticates via POST api/netq/auth/v1/login to obtain an access token and list of premises
  2. Selects each premises that has topology data via GET api/netq/auth/v1/select/opid/{opid}
  3. Fetches the fabric topology graph via POST api/netq/telemetry/v1/object/topologygraph/fetch-topology
  4. Parses the tier-based node and link graph into a switch tree; Clos topologies are reduced to a canonical tree representation
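
The same calls can be exercised manually with curl to verify connectivity and credentials before running Topograph. This is only a sketch: the login payload, the access-token field name, the bearer-token header, and the opid value are assumptions for illustration; the provider handles the actual request and response schemas.

$ NETQ=https://netq.example.com
$ TOKEN=$(curl -sk -X POST "$NETQ/api/netq/auth/v1/login" \
    -H "Content-Type: application/json" \
    -d '{"username":"<USERNAME>","password":"<PASSWORD>"}' | jq -r '.access_token')   # token field name assumed
$ curl -sk -H "Authorization: Bearer $TOKEN" "$NETQ/api/netq/auth/v1/select/opid/<OPID>"
$ curl -sk -X POST -H "Authorization: Bearer $TOKEN" \
    "$NETQ/api/netq/telemetry/v1/object/topologygraph/fetch-topology"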

NVLink domains (topology/block):

  1. Fetches compute node records via GET nmx/v1/compute-nodes using Basic auth — the nmx path reflects the NetQ NVLink Management API, previously known as NMX-M
  2. Groups nodes by DomainUUID to build the NVLink domain map. The DomainUUID is a NetQ/NMX identifier and differs in format from the ClusterUUID.CliqueId value used by nvidia.com/gpu.clique and the InfiniBand provider — both identify the same physical NVLink domain, but the values are not directly comparable as strings.
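
As a rough manual check of the same data, the compute-node records can be fetched with Basic auth and grouped by DomainUUID with jq. This assumes the response is a flat JSON array with a DomainUUID field on each record; the real schema may wrap the records differently, so treat it purely as an illustration.

$ curl -sk -u "<USERNAME>:<PASSWORD>" "https://netq.example.com/nmx/v1/compute-nodes" \
    | jq 'group_by(.DomainUUID) | map({domain: .[0].DomainUUID, nodes: length})'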

NVLink domain discovery is best-effort — if it fails, Topograph logs a warning and returns the switch tree only. Multi-premises environments are supported: Topograph iterates over all accessible premises and merges their topology graphs.

Verifying the Output

Trigger topology generation, then query the result endpoint with the returned request ID:

$ id=$(curl -s -X POST -H "Content-Type: application/json" -d @payload.json http://localhost:49021/v1/generate)
$ curl -s "http://localhost:49021/v1/topology?uid=$id"

For the Slurm engine, verify the generated topology.conf reflects the expected switch hierarchy. See the Slurm engine documentation for details.
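
One way to check is to list the SwitchName entries and compare the hierarchy against the physical fabric. The file path and the switch/node names below are purely illustrative; the actual location and naming come from your Slurm engine configuration and fabric.

# Example of the kind of hierarchy to expect (names are illustrative):
# SwitchName=core0  Switches=leaf[0-1]
# SwitchName=leaf0  Nodes=node[001-016]
# SwitchName=leaf1  Nodes=node[017-032]
$ grep '^SwitchName' /etc/slurm/topology.conf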