Network Transport and Fabric Visibility

Backend Switch Fabric API

The purpose of this API is to expose sufficient information about the cluster's network topology to enable efficient scheduling, placement, and optimization of multi-node GPU workloads. Understanding the network hierarchy between compute instances and switches, as well as both intra- and inter-node NVLink domains, is essential for minimizing communication latency and maximizing throughput. These requirements apply to the North-South, East-West, and NVLink networks (not the management network). See the appendix for a DGXC recommended reference implementation.

| Req ID | Test Details (Legend) | Requirement Area | Description |
|---|---|---|---|
| NET01 | INFO | Backend Switch Fabric API | For each compute node, the API must provide visibility into the backend network switches connecting the node to the core. Identification: each switch must be identified by a unique, stable identifier; a "switch" may represent a physical switch or a logical connectivity domain. Structure: the API may be gRPC or REST, and a response may include multiple nodes (pagination expected). Topology: switch information may be returned as an ordered array of IDs (e.g., leaf, spine, core) or as separate fields for each tier. |
| NET02 | INFO | NVLink Domain API | Requirement: for compute nodes supporting NVLink (e.g., GB200, GB300, Vera Rubin), the API shall return the unique identifier of the NVLink domain associated with each node. Implementation: may be a separate API method or part of the Backend Switch Fabric API. |
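As an illustration, a response from a hypothetical REST implementation satisfying NET01 and NET02 might look like the sketch below. The field names (`nodes`, `switches`, `nvlink_domain_id`, `next_page_token`) and the helper function are assumptions for illustration only; the requirements constrain the information content, not the schema.

```python
# Hypothetical paginated response body from a REST implementation of the
# Backend Switch Fabric API. All field names are illustrative, not normative.
sample_response = {
    "nodes": [
        {
            "node_id": "node-0017",
            # NET01: ordered switch path from leaf toward core; each entry
            # is a unique, stable identifier (physical or logical switch).
            "switches": ["leaf-12", "spine-03", "core-01"],
            # NET02: NVLink domain identifier (NVLink-capable nodes only).
            "nvlink_domain_id": "nvl-domain-4",
        },
        {
            "node_id": "node-0018",
            "switches": ["leaf-12", "spine-03", "core-01"],
            "nvlink_domain_id": "nvl-domain-4",
        },
    ],
    # Pagination token; empty when no further pages remain.
    "next_page_token": "",
}

def shared_switch_tier(resp: dict, a: str, b: str) -> int:
    """Return the lowest tier index (0 = leaf) at which two nodes share a
    switch, or -1 if they share none. A scheduler could use this to rank
    node pairs by expected network proximity."""
    by_id = {n["node_id"]: n["switches"] for n in resp["nodes"]}
    for tier, (sa, sb) in enumerate(zip(by_id[a], by_id[b])):
        if sa == sb:
            return tier
    return -1

# Both sample nodes hang off the same leaf switch, so they share tier 0.
print(shared_switch_tier(sample_response, "node-0017", "node-0018"))  # 0
```

The ordered-array form makes proximity checks like this a simple positional comparison; an implementation that instead returns separate per-tier fields (also allowed by NET01) would compare field by field.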