gRPC API
To ensure compatibility, use the NVIDIA-provided NMX-Controller gRPC proto file to implement the gRPC client.
Property | Type | Description | Notes |
partitionId | uint32 | The partition APIs allow the use of partitionId to identify a specific partition instance. | Values of 1 to 32766. Note: The value 32766 is reserved for the default partition |
partitionName | String | The partition APIs allow the use of partitionName to identify a specific partition instance. partitionName (if provided) must be unique in the domain. | Up to 255 ASCII characters. Note: The value "Default Partition" is reserved for the default partition |
returnCode | enum (ST_ReturnCode) | The value of In a unsolicited server notification, returnCode value is always | |
GatewayId | String | Uniquely identifies the client establishing the gRPC connection. | |
ServerHeader | A message sent in every response and notification from the server. The message includes:
| ||
KeyValPair | A key (string) and its associated value (string). | ||
ConfigKey | An identifier of a static configuration parameter. Includes configFileName (string) and key (string). | ||
Context | String | A string that should be set to | |
Location | A message that provides details on a node location:
| ||
LocationInfo | A message that provides details on a node location with additional physical properties:
| ||
gpuLocation | A message that provides details on a GPU location:
|
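The partitionId and partitionName constraints in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the NMX-Controller proto: the function and constant names are hypothetical, "ASCII" is assumed to mean code points below 128, and an empty name is assumed to be invalid when a name is provided.

```python
DEFAULT_PARTITION_ID = 32766                   # reserved for the default partition
DEFAULT_PARTITION_NAME = "Default Partition"   # reserved for the default partition
MAX_NAME_LEN = 255


def is_valid_partition_id(partition_id: int) -> bool:
    """True when the ID lies in the documented range 1..32766."""
    return 1 <= partition_id <= DEFAULT_PARTITION_ID


def is_user_partition_id(partition_id: int) -> bool:
    """True for a valid ID that is not the reserved default-partition ID."""
    return is_valid_partition_id(partition_id) and partition_id != DEFAULT_PARTITION_ID


def is_valid_partition_name(name: str) -> bool:
    """True for a non-empty ASCII name of at most 255 characters (assumption)."""
    return 0 < len(name) <= MAX_NAME_LEN and name.isascii()


def is_user_partition_name(name: str) -> bool:
    """True for a valid name that is not the reserved default-partition name."""
    return is_valid_partition_name(name) and name != DEFAULT_PARTITION_NAME
```

Uniqueness of partitionName within the domain cannot be verified locally and is left to the server.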
Hello(ClientHello) returns (ServerHello)
The first RPC call that the client must send after the gRPC connection is established.
Request:
Message/Parameter | Type | Description | Notes |
ClientHello.gatewayId | string | Client identifier | |
ClientHello.major_version | ProtoMsgMajorVersion | Major version | |
ClientHello.minor_version | ProtoMsgMinorVersion | Minor version |
Response:
Message/Parameter | Type | Description | Notes |
ServerHello.serverHeader | ServerHeader | Base server info and return code | |
ServerHello.components_ver | Array of KeyValPair | List of NMX-Controller components, and their version | |
ServerHello.capabilities | Array of string | List of NMX-Controller capabilities | |
ServerHello.host_os_details | String | NMX-Controller host OS details | |
ClientHello.major_version | ProtoMsgMajorVersion | Major version | |
ClientHello.minor_version | ProtoMsgMinorVersion | Minor version |
ServerHello.serverHeader.returnCode:
NMX_ST_SUCCESS - Client hello was successful
NMX_ST_BADPARAM - Missing or invalid gatewayId (e.g. empty string)
NMX_ST_VERSION_MISMATCH - Major version of client and server protos do not match
Server closes the gRPC connection if:
Client ProtoMsgMajorVersion is not the same as server
Client attempts to call any RPC before successful completion of Hello() RPC
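The handshake rules above can be summarized as a small decision function. This is only an illustrative model of the documented outcomes, not server code: the function names are hypothetical, and the order of the checks (gatewayId before version) is an assumption, since the text does not state which error takes precedence.

```python
def hello_outcome(gateway_id: str, client_major: int, server_major: int) -> str:
    """Return the name of the return code that the documented rules imply."""
    if not gateway_id:
        # Missing or empty gatewayId
        return "NMX_ST_BADPARAM"
    if client_major != server_major:
        # Major versions of the client and server protos differ
        return "NMX_ST_VERSION_MISMATCH"
    return "NMX_ST_SUCCESS"


def may_call_other_rpcs(outcome: str) -> bool:
    """Other RPCs are allowed only after Hello() completed successfully."""
    return outcome == "NMX_ST_SUCCESS"
```

A client would typically stop and report the error when `may_call_other_rpcs` is false, because the server closes the connection in those cases.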
Subscribe(SubscribeRequest) returns (stream ServerNotification)
The client calls this RPC to subscribe to asynchronous push notifications. SubscribeRequest carries the gatewayID and the notifyOnSelfChange flag, which should be set to false.
Asynchronous push notifications:
Message/Parameter | Type | Description | Notes |
ServerNotification.subscriptionResponse | SubscriptionResponse | Confirmation of subscription | Sent only to requesting client |
ServerNotification.staticConfigResponse | StaticConfigResponse | Notification of static config change. Includes changed items | Sent to all subscribed clients except for the requesting client |
ServerNotification.CreatePartitionResponse | CreatePartitionResponse | Notification of partition creation. Includes partitionId | Sent to all subscribed clients except for the requesting client |
ServerNotification.DeletePartitionResponse | DeletePartitionResponse | Notification of partition deletion. Includes partitionId | Sent to all subscribed clients except for the requesting client |
ServerNotification.UpdatePartitionResponse | UpdatePartitionResponse | Notification of partition configuration change. Includes partitionId | Sent to all subscribed clients except for the requesting client |
ServerNotification.fmEvent.fmEventControlPlaneStateChange | FmEventControlPlaneStateChange | Notification of change in control plane state | Sent to all subscribed clients |
ServerNotification.fmEvent.fmEventTopologyChange | FmEventTopologyChange | Notification of change in discovered topology | Sent to all subscribed clients |
ServerNotification.fmEvent.fmEventPartitionChange | FmEventPartitionChange | Notification of change in partition health. Includes partitionId | Sent to all subscribed clients |
ServerNotification.healthStateChanged | HealthStateChanged | Notification of change in NMX-C health | Sent to all subscribed clients |
ServerNotification.serverHeader.returnCode:
NMX_ST_SUCCESS - Client subscription was successful
NMX_ST_BADPARAM - Missing or invalid gatewayId
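The delivery rules in the notification table can be modeled as a lookup from notification kind to recipient set. The sketch below is a hypothetical illustration of those rules with notifyOnSelfChange set to false; the helper name and the use of plain gateway ID strings are assumptions, not part of the proto.

```python
REQUESTER_ONLY = {"subscriptionResponse"}
ALL_EXCEPT_REQUESTER = {
    "staticConfigResponse",
    "CreatePartitionResponse",
    "DeletePartitionResponse",
    "UpdatePartitionResponse",
}
ALL_SUBSCRIBERS = {
    "fmEventControlPlaneStateChange",
    "fmEventTopologyChange",
    "fmEventPartitionChange",
    "healthStateChanged",
}


def recipients(kind: str, subscribers: set, requester: str = None) -> set:
    """Return the gateway IDs that receive a notification of the given kind."""
    if kind in REQUESTER_ONLY:
        return {requester}
    if kind in ALL_EXCEPT_REQUESTER:
        return set(subscribers) - {requester}
    if kind in ALL_SUBSCRIBERS:
        return set(subscribers)
    raise ValueError(f"unknown notification kind: {kind}")
```

For example, a partition created by one client is announced to every other subscribed client, while a topology change is announced to all of them.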
GetStaticConfig(GetStaticConfigRequest) returns (StaticConfigResponse)
The client calls this RPC to read the current static configuration. The client may request either full files or individual parameters from the files.
Request:
Message/Parameter | Type | Description | Notes |
GetStaticConfigRequest.configKeys | Array of configKey | List of configuration parameters | Either configKeys or configFiles should be provided |
GetStaticConfigRequest.configFiles | Array of configFileName | List of configuration files | Either configKeys or configFiles should be provided |
Supported values for configFileName:
sm_config
fm_config
rdm_config
chassis_mapping
Response:
Message/Parameter | Type | Description | Notes |
staticConfig.configKeyVals | Array of configKeyVals | List of configuration parameters and their values | Either configKeyVals or configFileContents is provided, depending on request |
staticConfig.configFileContents | Array of ConfigFileContent | List of configuration files and their content | Either configKeyVals or configFileContents is provided, depending on request |
StaticConfigResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - Client request for static configuration was successful
NMX_ST_BADPARAM - Missing or invalid input parameter (e.g. empty strings)
Example of fm_config file content:
# Description: Determine whether a default partition needs to be created
# Possible Values:
# 0 - No partitions are created during GFM initialization. GFM disables routing until an API request
# to create a partition is successful.
# 1(default) - Creates a default partition during GFM initialization. GFM creates the partition to include
# all GPUs in the topology and enables routing so that all GPUs can communicate to each other.
MNNVL_ENABLE_DEFAULT_PARTITION=1
# Description: Determine resiliency mode for default partition and when it is unspecified on a user partition
# Possible Values:
# 1 - resiliency mode RESILIENCY_MODE_FULL_BANDWIDTH
# 2(default) - resiliency mode RESILIENCY_MODE_ADAPTIVE_BANDWIDTH
# 3 - resiliency mode RESILIENCY_MODE_USER_ACTION_REQ
MNNVL_DEFAULT_RESILIENCY_MODE=2
# Description: Set type of default partition (specified by location or gpuuid)
# Possible Values:
# 1 - Creates a default partition using locations of GPUs
# 2(default) - Creates a default partition using GPU UIDs
MNNVL_DEFAULT_PARTITION_TYPE=2
MNNVL_TOPOLOGY=gb200_nvl36r1_c2g4_topology
SetStaticConfig(SetStaticConfigRequest) returns (ReturnCode)
The client calls this RPC to update the static configuration. Client may request to update either full files, or parameters in the files.
Request:
Message/Parameter | Type | Description | Notes |
SetStaticConfigRequest.staticConfig.configKeyVals | Array of configKey | List of configuration parameters and their values | Either configKeyVals or configFileContents should be provided |
SetStaticConfigRequest.staticConfig.configFileContents | Array of configFileContent | List of configuration files and their content | Either configKeyVals or configFileContents should be provided |
Supported values for configFileName:
sm_config
fm_config
rdm_config
chassis_mapping
All config files comply with the Linux INI format (https://en.wikipedia.org/wiki/INI_file), with the following restrictions:
Format "key = value"
No support for sections
No support for hierarchy
Case insensitive
Support for comments
No support for duplicate names
Support for Quoted values
Support for Line continuation
Support for Escape characters
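To illustrate the restricted format, the following is a minimal parser sketch. It covers only a subset of the rules above: "key = value" lines, comment lines, case-insensitive keys, and rejection of duplicate names. Quoted values, line continuation, and escape characters are not handled. The function name is hypothetical, "#" as the comment marker is taken from the fm_config example, and applying case insensitivity to keys only is an assumption.

```python
def parse_config(text: str) -> dict:
    """Parse 'key = value' lines into a dict keyed by upper-cased name."""
    values = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment line
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"line {line_no}: expected 'key = value'")
        name = key.strip().upper()  # keys are compared case-insensitively
        if name in values:
            raise ValueError(f"line {line_no}: duplicate key {name}")
        values[name] = value.strip()
    return values
```

A client could use such a check before sending file content, so that obviously malformed input is caught locally rather than by the server.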
The fm_config file must include the MNNVL_TOPOLOGY parameter with one of the following values:
gb200_nvl36r1_c2g4_topology - 36 GPUs, Single chassis, 2 CPUs, 4 GPUs per compute tray
gb200_nvl36r1_c2g2_topology - 36 GPUs, Single chassis, 2 CPUs, 2 GPUs per compute tray
gb200_nvl72r1_c2g4_topology - 72 GPUs, Single chassis, 2 CPUs, 4 GPUs per compute tray
gb200_nvl72r2_c2g4_topology - 72 GPUs, Two chassis, 2 CPUs, 4 GPUs per compute tray
gb200_nvl72r2_c2g2_topology - 72 GPUs, Two chassis, 2 CPUs, 2 GPUs per compute tray
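The supported MNNVL_TOPOLOGY values can be kept in a lookup table so that a value is validated before it is written to fm_config. This is a sketch only: the type and function names are hypothetical, and the compute tray count is simple arithmetic derived from the listed numbers rather than a value stated in this document.

```python
from typing import NamedTuple


class TopologyShape(NamedTuple):
    gpus: int           # total GPUs
    chassis: int        # number of chassis
    cpus: int           # CPU count as listed for the topology
    gpus_per_tray: int  # GPUs per compute tray


TOPOLOGIES = {
    "gb200_nvl36r1_c2g4_topology": TopologyShape(36, 1, 2, 4),
    "gb200_nvl36r1_c2g2_topology": TopologyShape(36, 1, 2, 2),
    "gb200_nvl72r1_c2g4_topology": TopologyShape(72, 1, 2, 4),
    "gb200_nvl72r2_c2g4_topology": TopologyShape(72, 2, 2, 4),
    "gb200_nvl72r2_c2g2_topology": TopologyShape(72, 2, 2, 2),
}


def lookup_topology(name: str) -> TopologyShape:
    """Return the shape for a supported topology name, or raise ValueError."""
    try:
        return TOPOLOGIES[name]
    except KeyError:
        raise ValueError(f"unsupported MNNVL_TOPOLOGY value: {name}") from None


def compute_trays(shape: TopologyShape) -> int:
    """Derived value: total GPUs divided by GPUs per compute tray."""
    return shape.gpus // shape.gpus_per_tray
```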
The chassis_mapping file maps each chassis-id to its chassis-serial-number. Example of the file format:
chassisId1 ABC123ABC123A
chassisId2 XYZ987XYZ987X
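A client may want to sanity-check a chassis mapping before submitting it. The sketch below assumes one whitespace-separated "chassis-id serial-number" pair per line, which is an assumption about the file layout; the function name is hypothetical, and chassis IDs are treated as opaque tokens. It rejects duplicate IDs and duplicate serial numbers only.

```python
def parse_chassis_mapping(text: str) -> dict:
    """Parse 'chassis-id serial-number' lines into {chassis_id: serial}."""
    mapping = {}
    seen_serials = set()
    for line_no, raw in enumerate(text.splitlines(), start=1):
        parts = raw.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2:
            raise ValueError(f"line {line_no}: expected '<chassis-id> <serial>'")
        chassis_id, serial = parts
        if chassis_id in mapping:
            raise ValueError(f"line {line_no}: duplicate chassis id {chassis_id}")
        if serial in seen_serials:
            raise ValueError(f"line {line_no}: duplicate serial number {serial}")
        mapping[chassis_id] = serial
        seen_serials.add(serial)
    return mapping
```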
ReturnCode:
NMX_ST_SUCCESS - Client request for static configuration was successful
NMX_ST_BADPARAM - Missing or invalid input parameter
GetDomainProperties(GetDomainPropertiesRequest) returns (DomainProperties)
Domain Properties provide an overview of the current topology that NMX-C manages. They give the maximum number of expected resources (including but not limited to GPUs, switches, compute nodes, switch nodes, partitions, and NVLinks) in the domain.
The client calls this RPC to get static properties of the domain. The GetDomainPropertiesRequest has the context and gatewayID.
Response:
Message/Parameter | Type | Description | Notes |
DomainProperties.serverHeader | ServerHeader | Base server info and return code | |
DomainProperties.maxComputeNodes | uint32 | Maximum number of Compute Nodes in the NVLink Domain | |
DomainProperties.maxComputeNodesPerChassis | uint32 | Maximum number of Compute Nodes in a chassis | Number of chassis in the NVLink Domain = DomainProperties.maxComputeNodes / DomainProperties.maxComputeNodesPerChassis |
DomainProperties.maxGpusPerComputeNode | uint32 | Maximum number of GPUs in a Compute Node | |
DomainProperties.maxGpuNvLinks | uint32 | Maximum number of NVLinks in a GPU | |
DomainProperties.lineRateMBps | uint32 | Maximum line rate in MBps of an NVLink | |
DomainProperties.maxSwitchNodes | uint32 | Maximum number of Switch Nodes in the NVLink Domain | |
DomainProperties.maxSwitchNodesPerChassis | uint32 | Maximum number of Switch Nodes in a chassis | |
DomainProperties.maxSwitchesPerSwitchNode | uint32 | Maximum number of Switches in a Switch Node | |
DomainProperties.maxSwitchNvLinks | uint32 | Maximum number of NVLinks in a Switch | |
DomainProperties.maxNumPartitions | uint32 | Maximum number of partitions that can be created in the NVLink Domain | |
DomainProperties.maxNumAlids | uint32 | Maximum number of Alids for a GPU | |
DomainProperties.maxMulticastGroups | uint32 | Maximum number of Multicast Groups available in the NVLink Domain | |
DomainProperties.maxNumPorts | uint32 | Total aggregate number of ports across switches and GPUs in the NVLink Domain |
DomainProperties.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
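The chassis count formula from the Notes column (maxComputeNodes / maxComputeNodesPerChassis) can be applied to a received DomainProperties message as follows. This sketch represents the message as a plain dict with the documented field names; the helper names and the sample values are made up for illustration, and the GPU total is a derived product that this document does not state explicitly.

```python
def chassis_count(props: dict) -> int:
    """Number of chassis = maxComputeNodes / maxComputeNodesPerChassis."""
    per_chassis = props["maxComputeNodesPerChassis"]
    if per_chassis <= 0:
        raise ValueError("maxComputeNodesPerChassis must be positive")
    return props["maxComputeNodes"] // per_chassis


def max_gpus(props: dict) -> int:
    """Derived upper bound: compute nodes times GPUs per compute node."""
    return props["maxComputeNodes"] * props["maxGpusPerComputeNode"]


# Illustrative values only, not taken from a real domain
sample = {
    "maxComputeNodes": 36,
    "maxComputeNodesPerChassis": 18,
    "maxGpusPerComputeNode": 2,
}
print(chassis_count(sample), max_gpus(sample))
```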
GetDomainStateInfo(GetDomainStateInfoRequest) returns (DomainStateInfo)
The client calls this RPC to get the dynamic properties of the domain. The GetDomainStateInfoRequest has the context and gatewayID.
Response:
Message/Parameter | Type | Description | Notes |
DomainStateInfo.serverHeader | ServerHeader | Base server info and return code | |
DomainStateInfo.controlPlaneState | ControlPlaneState | State of the domain control plane | |
DomainStateInfo.availableMulticastGroups | uint32 | Number of available multicast groups | |
DomainStateInfo.configStatusDescription | string | Additional details of the control plane state | |
DomainStateInfo.nmxControllerHealth | NmxControllerHealth | NMX-Controller service health |
DomainStateInfo.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
The following are the values of controlPlaneState:
NMX_CONTROL_PLANE_STATE_UNDEFINED = 0; //!< Control plane state is undefined
NMX_CONTROL_PLANE_STATE_OFFLINE = 1; //!< Control plane state is offline
NMX_CONTROL_PLANE_STATE_STANDBY = 2; //!< Control plane state is standby, NvLink Domain is not operational
NMX_CONTROL_PLANE_STATE_CONFIGURED = 3; //!< Control plane state is configured, NvLink Domain is operational
NMX_CONTROL_PLANE_STATE_TIMEOUT = 4; //!< Control plane state is timeout
NMX_CONTROL_PLANE_STATE_ERROR = 5; //!< Control plane state is error, user provided an invalid configuration
NMX_CONTROL_PLANE_STATE_UNCONFIGURED = 6; //!< Control plane state is unconfigured, pending user provided configuration
Control plane states and their associated configuration status description strings are explained below:
NMX_CONTROL_PLANE_STATE_UNCONFIGURED - Pending required FM configuration.
CONFIG_PENDING_UUID - Pending NVLink Domain UUID. FM waits indefinitely until set.
CONFIG_PENDING_TOPOLOGY - Pending MNNVL_TOPOLOGY config. FM waits indefinitely until set.
CONFIG_PENDING_CHASSIS_ID_MAPPING - Pending chassis Id mapping. FM waits for GFM_WAIT_TIMEOUT_SEC until set. If not set, FM allocates the mapping during the initial resource discovery.
CONFIG_RECEIVED - FM received all the required configuration.
NMX_CONTROL_PLANE_STATE_ERROR - Error validating FM configuration. Restart NMX-C after the configuration error is fixed.
CONFIG_ERROR_INCORRECT_TOPOLOGY_FILE - Encountered error while processing the topology file specified in MNNVL_TOPOLOGY
CONFIG_ERROR_CHASSIS_ID_MAPPING_COUNT - Detected mismatch between number of entries in the chassis Id mapping specified and expected number of chassis read from the topology file specified in MNNVL_TOPOLOGY
CONFIG_ERROR_CHASSIS_ID_MAPPING_OUT_OF_RANGE - Detected chassis Id value outside of the allowed range. Allowed range is 1 to n, where n is the number of chassis in the NVLink Domain.
CONFIG_ERROR_DUPLICATE_CHASSIS_SERIAL_NUMBER - Detected duplicate chassis serial number in the chassis Id mapping
CONFIG_ERROR_MISSING_CHASSIS - Detected fewer than expected chassis serial numbers during the initial resource discovery.
CONFIG_ERROR_ADDITIONAL_CHASSIS_DETECTED - Detected more than expected chassis serial number(s) during the initial resource discovery
NMX_CONTROL_PLANE_STATE_DEGRADED - FM detected a misconfiguration. The NVLink Domain continues to work with limited capability. Fixing the misconfiguration restores full functionality.
CONFIG_ERROR_MISWIRED_TRUNK_PORTS - Incorrect/mis-wired trunk connection(s) detected. Use the GetConnInfoList() gRPC API to get details.
NMX_CONTROL_PLANE_STATE_CONFIGURED - FM completed configuration validation and initialization
CONFIG_DONE - FM completed configuration validation and initialization
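A client that polls GetDomainStateInfo can map the control plane state and its status string to a next step. The following sketch mirrors the numeric values listed above in a Python IntEnum; the class, function, and returned strings are hypothetical. NMX_CONTROL_PLANE_STATE_DEGRADED is omitted because its numeric value is not listed in this document.

```python
from enum import IntEnum


class ControlPlaneState(IntEnum):
    UNDEFINED = 0
    OFFLINE = 1
    STANDBY = 2
    CONFIGURED = 3
    TIMEOUT = 4
    ERROR = 5
    UNCONFIGURED = 6


def next_step(state: int, status: str = "") -> str:
    """Suggest a client action for a controlPlaneState value."""
    state = ControlPlaneState(state)
    if state is ControlPlaneState.CONFIGURED:
        return "operational"
    if state is ControlPlaneState.UNCONFIGURED:
        # e.g. CONFIG_PENDING_UUID or CONFIG_PENDING_TOPOLOGY
        return f"provide configuration: {status}"
    if state is ControlPlaneState.ERROR:
        # NMX-C must be restarted after the configuration error is fixed
        return f"fix configuration and restart NMX-C: {status}"
    return "not operational"
```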
The following are the values of nmxControllerHealth:
NMX_CONTROLLER_HEALTH_UNKNOWN = 0;
NMX_CONTROLLER_HEALTH_HEALTHY = 1;
NMX_CONTROLLER_HEALTH_DEGRADED = 2;
NMX_CONTROLLER_HEALTH_UNHEALTHY = 3;
GetTopologyInfo returns (FmTopologyInfo)
The client calls this RPC to receive information on the currently discovered topology:
Devices (GPUs and Switches) and their properties and state (DeviceTopoInfo contains one of SwitchTopoInfo or GpuTopoInfo)
Device ports and their properties and state (PortTopoInfo)
Device connectivity information
Response:
Message/Parameter | Type | Description | Notes |
FmTopologyInfo.serverHeader | ServerHeader | Base server info and return code | |
FmTopologyInfo.deviceTopoInfo | Array of DeviceTopoInfo | List of discovered devices |
FmTopologyInfo.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
DeviceTopoInfo message:
Message/Parameter | Type | Description | Notes |
switchTopoInfo | SwitchTopoInfo | Details of switch device | Either switchTopoInfo or gpuTopoInfo is provided, depending on the discovered device type |
gpuTopoInfo | GpuTopoInfo | Details of GPU device | Either switchTopoInfo or gpuTopoInfo is provided, depending on the discovered device type |
GpuTopoInfo message:
Message/Parameter | Type | Description | Notes |
loc | LocationInfo | Node location | |
topologyId | uint64 | Indicates switch tray model | Value of 128 is scale-out GB200 NVL switch tray; 129 is non-scale-out GB200 NVL switch tray |
deviceUid | uint64 | Device unique identifier | |
deviceId | uint32 | Device enumeration within the node | From 1 |
numPorts | uint64 | Total number of ports | |
systemUid | uint64 | Node unique identifier | |
vendorId | uint32 | Device vendor ID | |
devicePcieId | uint64 | Device PCIe ID | |
description | string | Device description | |
partitionId | Array of PartitionId | List of partitions the device is associated with | |
deviceHealth | GpuHealth | NVLink Health of the GPU | |
portTopoInfo | Array of PortTopoInfo | List of device ports | |
aLids | Array of uint64 | List of device labels for internal routing |
SwitchTopoInfo message:
Message/Parameter | Type | Description | Notes |
loc | LocationInfo | Node location | |
topologyId | uint64 | Indicates switch tray model | |
deviceUid | uint64 | Device unique identifier | |
deviceId | uint32 | Device enumeration within the node | |
numPorts | uint64 | Total number of ports | |
systemUid | uint64 | Node unique identifier | |
vendorId | uint32 | Device vendor ID | |
devicePcieId | uint64 | Device PCIe ID | |
description | string | Device description | |
partitionId | Array of PartitionId | List of partitions the device is associated with | |
deviceHealth | SwitchHealth | NVLink Health of the Switch | |
portTopoInfo | Array of PortTopoInfo | List of device ports |
PortTopoInfo message:
Message/Parameter | Type | Description | Notes |
portType | PortType | Type of device port | |
portUid | uint64 | Port identifier | |
portNum | uint64 | Port number of device | From 1 |
peerPortDeviceUid | uint64 | Peer device unique identifier | |
peerPortNum | uint64 | Peer port number on peer device | From 1 |
physicalState | PhysicalPortState | NVLink port physical state | |
logicalState | LogicalPortState | NVLink port logical state | |
subnetPrefix | uint64 | NVLink port routing subnet | |
isSdnPort | boolean | NVLink management port indicator | |
partitionIdList | Array of PartitionId | List of partitions the port is associated with | |
cageNum | uint32 | Front panel cage number | Provided only when portType=PORT_TYPE_SWITCH_TRUNK. From 1 |
cagePortNum | uint32 | Front panel port number in cage | Provided only when portType=PORT_TYPE_SWITCH_TRUNK. From 1 |
cageSplitPortNum | uint32 | Front split port number in cage | Provided only when portType=PORT_TYPE_SWITCH_TRUNK. From 1 |
baseLid | uint64 | Base label for routing | Provided only when portType=PORT_TYPE_GPU |
systemPortNum | uint64 | Port number per tray | Provided only when portType=PORT_TYPE_SWITCH_ACCESS. From 1 |
computePortNum | uint64 | Port number per GPU | Provided only when portType=PORT_TYPE_GPU. From 0 |
containAndDrain | bool | Indication that the port is in contain and drain state | Value of TRUE indicates active contain and drain state |
rail | uint32 | Rail of the port | |
plane | uint32 | Plane of the port | |
linkRateMbps | uint32 | Rate of the port link in Mbits/sec |
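Connectivity between devices can be reconstructed from the peer fields of each PortTopoInfo entry. The sketch below is a hypothetical post-processing step on a decoded FmTopologyInfo response: devices are represented as plain dicts with the documented field names, and a peerPortDeviceUid of 0 (or a missing value) is assumed to mean that the port has no peer. Each link is reported once even when both endpoints list it.

```python
def extract_links(devices: list) -> set:
    """Return undirected links as frozensets of (deviceUid, portNum) endpoints."""
    links = set()
    for dev in devices:
        for port in dev.get("portTopoInfo", []):
            peer_uid = port.get("peerPortDeviceUid")
            if not peer_uid:
                continue  # assumption: 0 or missing means no connected peer
            local_end = (dev["deviceUid"], port["portNum"])
            peer_end = (peer_uid, port["peerPortNum"])
            # frozenset makes A-B and B-A compare equal
            links.add(frozenset((local_end, peer_end)))
    return links
```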
GetComputeNodeCount(GetComputeNodeCountRequest) returns (GetComputeNodeCountResponse)
The client calls this RPC to get the number of compute nodes. It allows the client to filter the compute nodes based on Attribute, Health and Location.
Attribute reflects the allocation status of a compute node to a partition:
NMX_COMPUTE_NODE_ATTR_ALL = 1 - all compute nodes in the NVLink Domain irrespective of the allocation status
NMX_COMPUTE_NODE_ATTR_FREE = 2 - compute nodes in the NVLink Domain that are not allocated to any partition
NMX_COMPUTE_NODE_ATTR_FULLY_ALLOCATED = 3 - compute nodes in the NVLink Domain where all GPUs are allocated to partitions
NMX_COMPUTE_NODE_ATTR_PARTIALLY_ALLOCATED = 4 - compute nodes in the NVLink Domain where one or more (but not all) GPUs are allocated to partitions
Compute node health:
NMX_COMPUTE_NODE_HEALTH_HEALTHY = 1 - Fully healthy
NMX_COMPUTE_NODE_HEALTH_DEGRADED = 2 - Some GPUs are degraded to NO_NVLINK
NMX_COMPUTE_NODE_HEALTH_UNHEALTHY = 3 - Unable to participate in NVLink
Location and health are optional filters.
Request:
Message/Parameter | Type | Description | Notes |
GetComputeNodeCountRequest.attr | ComputeNodeAttr | Filter based on partition allocation status of the compute node | |
GetComputeNodeCountRequest.chassisId | uint64 | Chassis ID | [Optional] Set to 0 when not used |
GetComputeNodeCountRequest.nodeHealth | ComputeNodeHealth | NVLink health of the compute node | [Optional] Set to COMPUTE_NODE_HEALTH_UNKNOWN when not used |
Response message:
Message/Parameter | Type | Description | Notes |
GetComputeNodeCountResponse.numNodes | uint32 | Number of compute nodes matching the filters |
GetComputeNodeCountResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
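The attribute and optional filters of GetComputeNodeCount can be illustrated with a local model of the counting logic. This is not the server implementation: the node dicts, the `allocatedGpus` and `chassisId` keys, and the function names are hypothetical, and the value 0 for the unknown health state is an assumption.

```python
ATTR_ALL = 1
ATTR_FREE = 2
ATTR_FULLY_ALLOCATED = 3
ATTR_PARTIALLY_ALLOCATED = 4
HEALTH_UNKNOWN = 0  # assumed numeric value of the "not used" health filter


def allocation_attr(num_gpus: int, allocated_gpus: int) -> int:
    """Classify a compute node by how many of its GPUs are in partitions."""
    if allocated_gpus == 0:
        return ATTR_FREE
    if allocated_gpus == num_gpus:
        return ATTR_FULLY_ALLOCATED
    return ATTR_PARTIALLY_ALLOCATED


def count_nodes(nodes, attr, chassis_id=0, node_health=HEALTH_UNKNOWN) -> int:
    """Count nodes matching attr; chassis_id=0 and unknown health disable a filter."""
    count = 0
    for node in nodes:
        if attr != ATTR_ALL and allocation_attr(node["numGPUs"], node["allocatedGpus"]) != attr:
            continue
        if chassis_id and node["chassisId"] != chassis_id:
            continue
        if node_health and node["nodeHealth"] != node_health:
            continue
        count += 1
    return count
```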
GetComputeNodeLocationList(GetComputeNodeLocationListRequest) returns (GetComputeNodeLocationListResponse)
The client calls this RPC to get the locations of the compute nodes. It allows the client to filter the compute nodes based on Attribute, Health and Location.
Attribute reflects the allocation status of a compute node to a partition:
NMX_COMPUTE_NODE_ATTR_ALL = 1 - all compute nodes in the NVLink Domain irrespective of the allocation status
NMX_COMPUTE_NODE_ATTR_FREE = 2 - compute nodes in the NVLink Domain that are not allocated to any partition
NMX_COMPUTE_NODE_ATTR_FULLY_ALLOCATED = 3 - compute nodes in the NVLink Domain where all GPUs are allocated to partitions
NMX_COMPUTE_NODE_ATTR_PARTIALLY_ALLOCATED = 4 - compute nodes in the NVLink Domain where one or more (but not all) GPUs are allocated to partitions
Compute node health:
NMX_COMPUTE_NODE_HEALTH_HEALTHY = 1 - Fully healthy
NMX_COMPUTE_NODE_HEALTH_DEGRADED = 2 - Some GPUs are degraded to NO_NVLINK
NMX_COMPUTE_NODE_HEALTH_UNHEALTHY = 3 - Unable to participate in NVLink
Location and health are optional filters.
Request message:
Message/Parameter | Type | Description | Notes |
GetComputeNodeLocationListRequest.attr | ComputeNodeAttr | Filter based on partition allocation status of the compute node | |
GetComputeNodeLocationListRequest.chassisId | uint64 | Chassis ID | [Optional] Set to 0 when not used |
GetComputeNodeLocationListRequest.nodeHealth | ComputeNodeHealth | NVLink health of the compute node | [Optional] Set to COMPUTE_NODE_HEALTH_UNKNOWN when not used |
GetComputeNodeLocationListRequest.numNodes | uint32 | Limit on number of nodes in the response |
Response message:
Message/Parameter | Type | Description | Notes |
GetComputeNodeLocationListResponse.locList | Array of Location | List of locations |
GetComputeNodeLocationListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
GetComputeNodeInfoList(GetComputeNodeInfoListRequest) returns (GetComputeNodeInfoListResponse)
The client calls this RPC to get information on the compute nodes. The client uses it to get both static information (location and number of GPUs in the node) and dynamic information (partition IDs and health) about compute nodes.
Request message:
Message/Parameter | Type | Description | Notes |
GetComputeNodeInfoListRequest.locList | Array of Location | List of locations | [Optional] If the list is empty, the response includes all nodes |
Response message:
Message/Parameter | Type | Description | Notes |
GetComputeNodeInfoListResponse.nodeInfoList | Array of ComputeNodeInfo | List of compute nodes |
GetComputeNodeInfoListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
ComputeNodeInfo message:
Message/Parameter | Type | Description | Notes |
ComputeNodeInfo.loc | LocationInfo | Node location | |
ComputeNodeInfo.numGPUs | uint32 | Number of GPUs for the node | |
ComputeNodeInfo.nodeHealth | ComputeNodeHealth | NVLink health of the compute node | |
ComputeNodeInfo.partitionIdList | Array of PartitionId | List of partitions the device is associated with |
GetGpuInfoList(GetGpuInfoListRequest) returns (GetGpuInfoListResponse)
The client calls this RPC to get information on the GPUs.
The client can filter GPUs based on location or partition by passing the attribute NMX_GPU_ATTR_LOCATION or NMX_GPU_ATTR_PARTITION_ID respectively.
Notes:
Specifying NMX_GPU_ATTR_PARTITION_ID and partitionId=0 provides information on GPUs that are not associated with any partition
Specifying NMX_GPU_ATTR_ALL ignores the location or partition values set in the request
Request message:
Message/Parameter | Type | Description | Notes |
GetGpuInfoListRequest.attr | GpuAttr | Filter based on GPUs that belong to a partition or location | |
GetGpuInfoListRequest.numGpus | uint32 | Limit on number of GPUs for response | [Optional] Set to 0 when not used, and response includes all GPUs matching the filter |
GetGpuInfoListRequest.loc | Location | Location | Values used only when attr=GPU_ATTR_LOCATION_ID |
GetGpuInfoListRequest.partitionId | PartitionId | Partition ID | Value used only when attr=GPU_ATTR_PARTITION_ID |
GetGpuInfoListRequest.gpuHealth | GpuHealth | GPU health | [Optional] Set to 0 (GPU_HEALTH_UNKNOWN), when not used |
GPU health values:
NMX_GPU_HEALTH_HEALTHY = 1 //!< Fully healthy
NMX_GPU_HEALTH_DEGRADED = 2 //!< One or more links are down
NMX_GPU_HEALTH_NO_NVLINK = 3 //!< Unable to participate in NVLink partition
NMX_GPU_HEALTH_DEGRADED_BW = 4 //!< GPU operates in degraded bandwidth
Note: To get all healthy GPUs not associated with any partition, set:
attr to NMX_GPU_ATTR_PARTITION_ID
partitionId to 0
Response message:
Message/Parameter | Type | Description | Notes |
GetGpuInfoListResponse.gpuInfoList | Array of GpuInfo | A list of GPU information |
GetGpuInfoListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
GpuInfo message:
Message/Parameter | Type | Description | Notes |
GpuInfo.loc | LocationInfo | Node location | |
GpuInfo.gpuId | uint32 | GPU enumeration within the node | From 1 |
GpuInfo.gpuUid | uint64 | GPU unique identifier | |
GpuInfo.gpuHealth | GpuHealth | NVLink health of the GPU | |
GpuInfo.partitionId | PartitionId | Partitions the device is associated with |
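The filter semantics of GetGpuInfoList described above can be modeled locally to make the interaction of attr, partitionId, gpuHealth, and numGpus concrete. The sketch is a hypothetical emulation, not the server logic: attributes are represented by their names as strings, locations are compared by simple equality, and a GPU without a partition is assumed to carry partitionId 0 in GpuInfo.

```python
def select_gpus(gpus, attr, loc=None, partition_id=0, gpu_health=0, num_gpus=0):
    """Return GpuInfo-like dicts matching the request filters."""
    selected = []
    for gpu in gpus:
        # NMX_GPU_ATTR_ALL ignores both loc and partition_id
        if attr == "NMX_GPU_ATTR_LOCATION" and gpu["loc"] != loc:
            continue
        if attr == "NMX_GPU_ATTR_PARTITION_ID" and gpu["partitionId"] != partition_id:
            continue
        # gpu_health == 0 (GPU_HEALTH_UNKNOWN) disables the health filter
        if gpu_health and gpu["gpuHealth"] != gpu_health:
            continue
        selected.append(gpu)
        # num_gpus == 0 means no limit on the response size
        if num_gpus and len(selected) == num_gpus:
            break
    return selected
```

For instance, healthy GPUs that are free to be placed in a new partition correspond to `attr="NMX_GPU_ATTR_PARTITION_ID"`, `partition_id=0`, and `gpu_health=1`.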
GetSwitchNodeCount(GetSwitchNodeCountRequest) returns (GetSwitchNodeCountResponse)
The client calls this RPC to get the number of switch nodes.
The client can filter the switch nodes using attribute and health. Note that NVLink5 topologies do not have switch nodes of type NMX_SWITCH_NODE_ATTR_L2.
Request message:
Message/Parameter | Type | Description | Notes |
GetSwitchNodeCountRequest.attr | SwitchNodeAttr | Filter based on the type of switch node | |
GetSwitchNodeCountRequest.nodeHealth | SwitchNodeHealth | NVLink health of the switch node | [Optional] Set to SWITCH_NODE_HEALTH_UNKNOWN when not used |
GetSwitchNodeCountRequest.numNodes | uint32 | Limit on number of nodes for the response | [Optional] Set to 0 for no limit |
Response message:
Message/Parameter | Type | Description | Notes |
GetSwitchNodeCountResponse.numNodes | uint32 | Number of nodes matching the filter |
GetSwitchNodeCountResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
GetSwitchNodeLocationList(GetSwitchNodeLocationListRequest) returns (GetSwitchNodeLocationListResponse)
The client calls this RPC to get the location of switch nodes.
The client can filter the switch nodes using attribute and health. Note that NVLink5 topologies do not have switch nodes of type NMX_SWITCH_NODE_ATTR_L2.
Request message:
Message/Parameter | Type | Description | Notes |
GetSwitchNodeLocationListRequest.attr | SwitchNodeAttr | Filter based on the type of switch node | |
GetSwitchNodeLocationListRequest.nodeHealth | SwitchNodeHealth | NVLink health of the switch node | [Optional] Set to SWITCH_NODE_HEALTH_UNKNOWN when not used |
GetSwitchNodeLocationListRequest.numNodes | uint32 | Limit on number of nodes for the response | [Optional] Set to 0 for no limit |
Response message:
Message/Parameter | Type | Description | Notes |
GetSwitchNodeLocationListResponse.locList | Array of Location | List of node locations |
GetSwitchNodeLocationListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
GetSwitchNodeInfoList(GetSwitchNodeInfoListRequest) returns (GetSwitchNodeInfoListResponse)
The client calls this RPC to get information on the switch nodes. The client uses it to get both static information (location and number of switches in the node) and dynamic information (partition IDs and health) about switch nodes.
Request message:
Message/Parameter | Type | Description | Notes |
GetSwitchNodeInfoListRequest.locList | Array of Location | List of node locations | [Optional] If the list is empty, the response includes all nodes |
Response message:
Message/Parameter | Type | Description | Notes |
GetSwitchNodeInfoListResponse.nodeInfoList | Array of SwitchNodeInfo | List of switch nodes |
GetSwitchNodeInfoListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
SwitchNodeInfo message:
Message/Parameter | Type | Description | Notes |
SwitchNodeInfo.loc | LocationInfo | Node location | |
SwitchNodeInfo.numSwitches | uint32 | Number of switches in the node | |
SwitchNodeInfo.nodeHealth | SwitchNodeHealth | NVLink health of the switch node | |
SwitchNodeInfo.partitionIdList | Array of PartitionId | List of partitions the node is associated with |
GetSwitchInfoList(GetSwitchInfoListRequest) returns (GetSwitchInfoListResponse)
The client calls this RPC to get information on the switches.
Request message:
Message/Parameter | Type | Description | Notes |
GetSwitchInfoListRequest.numSwitches | uint32 | Limit on number of switches for response | [Optional] Set to 0 when not used. The response includes all switches matching the filter |
GetSwitchInfoListRequest.loc | Location | Location |
Response message:
Message/Parameter | Type | Description | Notes |
GetSwitchInfoListResponse.switchInfoList | Array of SwitchInfo | List of switch info |
GetSwitchInfoListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
SwitchInfo message:
Message/Parameter | Type | Description | Notes |
SwitchInfo.loc | LocationInfo | Node location | |
SwitchInfo.switchId | uint32 | Switch enumeration within the node | From 1 |
SwitchInfo.switchUid | uint64 | Switch unique identifier | |
SwitchInfo.health | SwitchHealth | NVLink health of the switch | |
SwitchInfo.numPorts | uint32 | Number of ports |
GetPartitionCount(GetPartitionCountRequest) returns (GetPartitionCountResponse)
The client calls this RPC to get the number of partitions. The number of partitions can be filtered using the numGpus, numNodes or health filters. The infoAttr value decides which filter to choose between numGpus and numNodes. If set to ATTR_ALL, numGpus and numNodes are ignored.
enum PartitionInfoAttr {
    NMX_PARTITION_INFO_ATTR_UNDEFINED = 0;
    NMX_PARTITION_INFO_ATTR_ALL = 1; //!< All Partitions
    NMX_PARTITION_INFO_ATTR_NUM_GPUS = 2; //!< Number of Partitions with a specific GPU size
    NMX_PARTITION_INFO_ATTR_NUM_COMPUTE_NODES = 3; //!< Number of Partitions with a specific number of Compute nodes
}
enum PartitionHealth {
    NMX_PARTITION_HEALTH_UNKNOWN = 0;
    NMX_PARTITION_HEALTH_HEALTHY = 1; //!< Partition is healthy
    NMX_PARTITION_HEALTH_DEGRADED_BANDWIDTH = 2; //!< Partition is in degraded bandwidth
    NMX_PARTITION_HEALTH_DEGRADED = 3; //!< One or more GPUs has routing disabled
    NMX_PARTITION_HEALTH_UNHEALTHY = 4; //!< Partition is unhealthy
}
Request message:
Message/Parameter | Type | Description | Notes |
GetPartitionCountRequest.infoAttr | PartitionInfoAttr | Filter based on number of GPUs or compute nodes in a partition | |
GetPartitionCountRequest.numGpus | uint32 | Number of GPUs in a partition | Used only when infoAttr=PARTITION_INFO_ATTR_NUM_GPUS |
GetPartitionCountRequest.numNodes | uint32 | Number of compute nodes in a partition | Used only when infoAttr=PARTITION_INFO_ATTR_NUM_COMPUTE_NODES |
GetPartitionCountRequest.health | PartitionHealth | NVLink health of the partition | [Optional] Set to PARTITION_HEALTH_UNKNOWN when not used |
Response message:
Message/Parameter | Type | Description | Notes |
GetPartitionCountResponse.numPartitions | uint32 | Number of partitions matching the filter |
GetPartitionCountResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
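The filter rules described for GetPartitionCount can be modelled locally as a minimal sketch. The enum values mirror the definitions above; the PartitionSummary record and the count_partitions helper are hypothetical and are not part of the proto file. Combining the health filter with the infoAttr filter using a logical AND, and rejecting ATTR_UNDEFINED, are assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum


class PartitionInfoAttr(IntEnum):
    UNDEFINED = 0
    ALL = 1
    NUM_GPUS = 2
    NUM_COMPUTE_NODES = 3


class PartitionHealth(IntEnum):
    UNKNOWN = 0
    HEALTHY = 1
    DEGRADED_BANDWIDTH = 2
    DEGRADED = 3
    UNHEALTHY = 4


@dataclass
class PartitionSummary:
    """Hypothetical local record used only for this illustration."""
    num_gpus: int
    num_nodes: int
    health: PartitionHealth


def count_partitions(partitions, info_attr, num_gpus=0, num_nodes=0,
                     health=PartitionHealth.UNKNOWN):
    """Count the partitions that match the request filters."""
    if info_attr == PartitionInfoAttr.UNDEFINED:
        # Assumption: the text does not say how UNDEFINED is handled.
        raise ValueError("infoAttr must be ALL, NUM_GPUS or NUM_COMPUTE_NODES")

    def matches(p):
        # With ATTR_ALL, numGpus and numNodes are ignored.
        if info_attr == PartitionInfoAttr.NUM_GPUS and p.num_gpus != num_gpus:
            return False
        if (info_attr == PartitionInfoAttr.NUM_COMPUTE_NODES
                and p.num_nodes != num_nodes):
            return False
        # PARTITION_HEALTH_UNKNOWN means the health filter is not used.
        if health != PartitionHealth.UNKNOWN and p.health != health:
            return False
        return True

    return sum(1 for p in partitions if matches(p))
```

The same predicate would apply to GetPartitionIdList, which uses identical filter fields and returns the matching partitions instead of their count.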
GetPartitionIdList(GetPartitionIdListRequest) returns (GetPartitionIdListResponse)
The client calls this RPC to get the partition IDs. The partition IDs can be filtered using the numGpus, numNodes or health filters. The infoAttr value decides which filter to choose between numGpus and numNodes. If set to ATTR_ALL, numGpus and numNodes are ignored.
enum PartitionInfoAttr {
    NMX_PARTITION_INFO_ATTR_UNDEFINED = 0;
    NMX_PARTITION_INFO_ATTR_ALL = 1; //!< All Partitions
    NMX_PARTITION_INFO_ATTR_NUM_GPUS = 2; //!< Number of Partitions with a specific GPU size
    NMX_PARTITION_INFO_ATTR_NUM_COMPUTE_NODES = 3; //!< Number of Partitions with a specific number of Compute nodes
}
enum PartitionHealth {
    NMX_PARTITION_HEALTH_UNKNOWN = 0;
    NMX_PARTITION_HEALTH_HEALTHY = 1; //!< Partition is healthy
    NMX_PARTITION_HEALTH_DEGRADED_BANDWIDTH = 2; //!< Partition is in degraded bandwidth
    NMX_PARTITION_HEALTH_DEGRADED = 3; //!< One or more GPUs has routing disabled
    NMX_PARTITION_HEALTH_UNHEALTHY = 4; //!< Partition is unhealthy
}
Request message:
Message/Parameter | Type | Description | Notes |
GetPartitionIdListRequest.infoAttr | PartitionInfoAttr | Filter based on number of GPUs or compute nodes in a partition | |
GetPartitionIdListRequest.numGpus | uint32 | Number of GPUs in a partition | Used only when infoAttr=PARTITION_INFO_ATTR_NUM_GPUS |
GetPartitionIdListRequest.numNodes | uint32 | Number of compute nodes in a partition | Used only when infoAttr=PARTITION_INFO_ATTR_NUM_COMPUTE_NODES |
GetPartitionIdListRequest.health | PartitionHealth | NVLink health of the partition | [Optional] Set to PARTITION_HEALTH_UNKNOWN when not used |
GetPartitionIdListRequest.numPartitions | uint32 | Number of partitions |
Response message:
Message/Parameter | Type | Description | Notes |
GetPartitionIdListResponse.partitionList | Array of Partition | List of partitions |
GetPartitionIdListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
Partition message:
Message/Parameter | Type | Description | Notes |
Partition.partitionId | uint32 | Partition ID | |
Partition.numGpus | uint32 | Number of GPUs in partition |
GetPartitionInfoList(GetPartitionInfoListRequest) returns (GetPartitionInfoListResponse)
The client calls this RPC to get partition information. The user can pass in a list of partitionIds, a list of partitionNames, or both. Only valid partitionIds or partitionNames are considered. If both lists are empty, information for all partitions is returned.
Request message:
Message/Parameter | Type | Description | Notes |
GetPartitionInfoListRequest.partitionIdList | Array of PartitionId | List of partition IDs | [Optional] If neither IDs nor names are provided, the response includes all provisioned partitions |
GetPartitionInfoListRequest.partitionNameList | Array of strings | List of partition names | [Optional] If neither IDs nor names are provided, the response includes all provisioned partitions |
Response message:
Message/Parameter | Type | Description | Notes |
GetPartitionInfoListResponse.partitionInfoList | Array of PartitionInfo | List of partition info |
GetPartitionInfoListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
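The selection rules for GetPartitionInfoList can be illustrated with a small local model. This is a sketch only: select_partitions is a hypothetical helper, the partitions are represented as a plain dictionary of ID to name, and treating the two lists as a union is an assumption, since the text does not state how IDs and names combine when both are given.

```python
def select_partitions(partitions, partition_ids=(), partition_names=()):
    """Return the subset of `partitions` (dict of id -> name) that a request
    with the given ID and name lists would select.

    Unknown IDs and names are silently ignored, following the statement that
    only valid partitionIds or partitionNames are considered.
    """
    # Both lists empty: information for all partitions is returned.
    if not partition_ids and not partition_names:
        return dict(partitions)

    wanted_ids = set(partition_ids)
    wanted_names = set(partition_names)
    # Assumption: a partition is selected if it matches either list.
    return {
        pid: name
        for pid, name in partitions.items()
        if pid in wanted_ids or name in wanted_names
    }
```

For example, with partitions {1: "alpha", 2: "beta", 3: "gamma"}, passing the ID list [1] together with the name list ["gamma"] selects partitions 1 and 3 under this model.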
PartitionInfo message:
Message/Parameter | Type | Description | Notes |
partitionId | PartitionId | Partition ID | |
name | string | Partition name | |
numGpus | uint32 | Number of GPUs in partition | |
gpuLocationList | Array of GpuLocation | GPUs location | |
gpuUidList | Array of uint64 | GPUs unique IDs | |
health | PartitionHealth | NVLink health of the partition | |
partitionType | PartitionType | Partition type | PARTITION_TYPE_LOCATION_BASED if GPUs are associated by location; PARTITION_TYPE_GPUUID_BASED if GPUs are associated by gpuUid |
numAllocatedMulticastGroups | uint32 | Number of multicast groups allocated to the partition | |
attr.resiliencyMode | ResiliencyMode | Resiliency mode | RESILIENCY_MODE_UNDEFINED, RESILIENCY_MODE_FULL_BANDWIDTH, RESILIENCY_MODE_ADAPTIVE_BANDWIDTH, or RESILIENCY_MODE_USER_ACTION_REQUIRED |
attr.MulticastGroupsLimit | uint32 | Limit on number of multicast groups in partition |
The ResiliencyMode values have the following meanings:
Full Bandwidth: On a trunk link failure, the partition attempts an automatic recovery. If spare trunk links are not available, GPUs are excluded from the fabric to maintain full bandwidth for the rest of the GPUs.
Adaptive Bandwidth: On a trunk link failure, the partition attempts an automatic recovery. If spare trunk links are not available, the partition's GPUs operate with a lower bandwidth than optimal.
User Action Required: On a trunk link failure, the partition attempts an automatic recovery. If spare trunk links are not available, the partition goes into an unhealthy state that requires user action for recovery. Example actions are providing additional trunk links or removing GPUs from the partition.
CreatePartition(CreatePartitionRequest) returns (CreatePartitionResponse)
The client calls this RPC to create a partition. The user provides a list of GPU UIDs or a list of GPU locations. If both lists are empty, the RPC returns NMX_ST_BADPARAM. The user can specify the attributes for the partition to be created.
If the partition creation is requested with the resiliency mode NMX_RESILIENCY_MODE_UNDEFINED, a default resiliency mode as specified in the configuration file MNNVL_DEFAULT_RESILIENCY_MODE is used.
If the partition creation is requested with a multicastGroupsLimit that cannot be satisfied, the RPC returns NMX_ST_RESOURCE_EXHAUSTED. If the value is not a multiple of 4, the RPC returns NMX_ST_BADPARAM.
Partition IDs are allocated from 1 to 0x7FFD. Partition ID 0x7FFE is reserved for the default partition. The user can specify a partitionId as part of the creation request. If the specified ID is greater than 0x7FFD, the RPC returns NMX_ST_BADPARAM.
If a partition creation request succeeds, and a later request to create another partition with the same set of parameters is received, the RPC returns NMX_ST_PARTITION_EXISTS.
If a partition cannot be created owing to exhaustion of partition IDs, the RPC returns NMX_ST_RESOURCE_EXHAUSTED.
If a requested GPU is already part of another partition, the RPC returns NMX_ST_RESOURCE_IN_USE.
If a requested GPU does not have a valid UID or a valid location, the RPC returns NMX_ST_RESOURCE_BAD.
If the requested partitionId is already in use by another partition, the RPC returns NMX_ST_PARTITION_ID_IN_USE.
If the requested partitionName is already in use by another partition, the RPC returns NMX_ST_PARTITION_NAME_IN_USE.
If the creation fails due to any other error, the RPC returns NMX_ST_GENERIC_ERROR.
Request message:
Message/Parameter | Type | Description | Notes |
string | Partition name | [Optional] Must be unique in domain if provided | |
CreatePartitionRequest.gpuResourceId | Array of GpuResourceId | Either gpuLocation or gpuUid | The GPU can be allocated either by GPU location or GPU unique ID |
CreatePartitionRequest.attr.resiliencyMode | ResiliencyMode | Resiliency mode | RESILIENCY_MODE_UNDEFINED, RESILIENCY_MODE_FULL_BANDWIDTH, RESILIENCY_MODE_ADAPTIVE_BANDWIDTH, or RESILIENCY_MODE_USER_ACTION_REQUIRED |
CreatePartitionRequest.attr.MulticastGroupsLimit | uint32 | Limit on number of multicast groups in partition | |
CreatePartitionRequest.partitionId | PartitionId | Partition ID | [Optional] Set to 0 to have the system auto-generate the ID |
Response message:
Message/Parameter | Type | Description | Notes |
CreatePartitionResponse.partitionId | PartitionId | Partition ID | User provided partition ID, or system auto-generated ID (if user did not specify) |
CreatePartitionResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter: both GPU lists are empty, the requested multicastGroupsLimit is not a multiple of 4, or the requested partitionId is greater than 0x7FFD (allowed range is 0x1-0x7FFD)
NMX_ST_GENERIC_ERROR - Call to GFM API has failed due to an internal error
NMX_ST_RESOURCE_EXHAUSTED - Requested multicastGroupsLimit cannot be satisfied, partitionIds are exhausted
NMX_ST_PARTITION_EXISTS - Partition with requested parameters already exists
NMX_ST_RESOURCE_IN_USE - Requested resource is already part of another partition
NMX_ST_RESOURCE_BAD - Requested resource does not have a valid UID or a valid location
NMX_ST_PARTITION_ID_IN_USE - Requested partitionId is already in use
NMX_ST_PARTITION_NAME_IN_USE - Requested partitionName is already in use
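A client may want to catch the documented NMX_ST_BADPARAM conditions before sending a CreatePartition request. The following is a minimal sketch of such a pre-check, not the server's implementation: precheck_create_partition and its parameter names are hypothetical, return codes are represented as plain strings, and treating a multicastGroupsLimit of 0 as valid (it is a multiple of 4) is an assumption.

```python
MAX_USER_PARTITION_ID = 0x7FFD   # highest ID a user may request
DEFAULT_PARTITION_ID = 0x7FFE    # reserved for the default partition


def precheck_create_partition(gpu_uids, gpu_locations,
                              multicast_groups_limit=0, partition_id=0):
    """Return "NMX_ST_BADPARAM" if the request would be rejected for one of
    the documented parameter errors, otherwise None.

    Only client-checkable rules are covered. Conditions that depend on
    server state (IDs or names in use, resource exhaustion, GPUs already
    assigned) cannot be predicted here.
    """
    # Both GPU lists empty.
    if not gpu_uids and not gpu_locations:
        return "NMX_ST_BADPARAM"
    # multicastGroupsLimit must be a multiple of 4.
    if multicast_groups_limit % 4 != 0:
        return "NMX_ST_BADPARAM"
    # partitionId 0 asks the system to auto-generate an ID; otherwise the
    # allowed range is 0x1-0x7FFD.
    if partition_id > MAX_USER_PARTITION_ID:
        return "NMX_ST_BADPARAM"
    return None
```

A request that passes this pre-check can still fail on the server with any of the other return codes listed above.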
DeletePartition(DeletePartitionRequest) returns (DeletePartitionResponse)
The client calls this RPC to delete a partition. The user provides a valid partition name, a partition ID, or both as part of the request.
Request message:
Message/Parameter | Type | Description | Notes |
DeletePartitionRequest.partitionId | PartitionId | Partition ID | Optional if the partition name is provided |
string | Partition name | Optional if the partition ID is provided |
Response message:
Message/Parameter | Type | Description | Notes |
DeletePartitionResponse.partitionId | PartitionId |
DeletePartitionResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
NMX_ST_PARTITION_ID_NOT_IN_USE - Requested partitionId not in use
AddGpusToPartition(UpdatePartitionRequest) returns (UpdatePartitionResponse)
The client calls this RPC to add GPUs to a partition. In a partition that is contained within a chassis boundary, the reroute flag is ignored. In a partition that uses trunk links and hence crosses the chassis boundary, the reroute flag determines whether the trunk link routing is adjusted:
If reroute = true (default), the trunk link routing is adjusted when additional GPUs are added to the partition. This can disrupt applications running in the partition.
If reroute = false, the trunk link routing is not adjusted when additional GPUs are added to the partition.
The RPC also allows a special operation called "reroute" where both the locationList and gpuUidList are empty, and reroute is set to true. This allows the client to adjust the trunk link routing (i.e. "reroute") for the partition to use an optimal number of trunk links. This operation can cause traffic disruption and must be used with caution.
If either the location list or the gpuUid list is populated and its type (location or GPU UID) does not match the type with which the partition currently operates, the RPC returns NMX_ST_NOT_SUPPORTED. The type of the partition can be determined from the PartitionInfo message returned by the GetPartitionInfoList() RPC call.
If a requested GPU does not have a valid UID or a valid location, the RPC returns NMX_ST_RESOURCE_BAD.
If the requested partition ID is not in use, the RPC returns NMX_ST_PARTITION_ID_NOT_IN_USE.
If the GPU to be added is already part of a partition, the RPC returns NMX_ST_RESOURCE_IN_USE.
Request message:
Message/Parameter | Type | Description | Notes |
UpdatePartitionRequest.partitionId | PartitionId | Partition ID | Optional if the partition name is provided |
UpdatePartitionRequest.locationList | Array of GpuLocation | List of GPU locations | Provide only if PartitionType=PARTITION_TYPE_LOCATION_BASED |
UpdatePartitionRequest.gpuUid | Array of gpuUid | List of GPU UIDs | Provide only if PartitionType=PARTITION_TYPE_GPUUID_BASED |
string | Partition name | Optional if the partition ID is provided | |
UpdatePartitionRequest.reroute | Boolean | Reroute partition on update | Default is true. This flag will be deprecated and then removed in future releases |
The user can request a partition reroute by setting:
locationList to empty array
gpuUid to empty array
reroute to true
Response message:
Message/Parameter | Type | Description | Notes |
UpdatePartitionResponse.partitionId | PartitionId |
UpdatePartitionResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter, requested partitionId does not exist
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
NMX_ST_NOT_SUPPORTED - Requested type (location or GPU UID) does not match the partition type
NMX_ST_RESOURCE_BAD - Requested resource does not have a valid UID or a valid location
NMX_ST_PARTITION_ID_NOT_IN_USE - Requested partitionId not in use
NMX_ST_RESOURCE_IN_USE - Requested resource is already part of another partition
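The interaction between the two GPU lists, the partition type and the reroute flag can be sketched as a client-side classification. This is an illustrative model only: precheck_add_gpus is a hypothetical helper, the PartitionType values are placeholder strings because the numeric enum values are not given here, and the outcomes "add", "reroute-only" and "unspecified" are labels invented for this example.

```python
from enum import Enum


class PartitionType(Enum):
    # Placeholder values; the real numeric values come from the proto file.
    LOCATION_BASED = "PARTITION_TYPE_LOCATION_BASED"
    GPUUID_BASED = "PARTITION_TYPE_GPUUID_BASED"


def precheck_add_gpus(partition_type, location_list, gpu_uid_list,
                      reroute=True):
    """Classify an UpdatePartitionRequest before it is sent.

    Returns "reroute-only", "add", "NMX_ST_NOT_SUPPORTED", or "unspecified"
    (a case the text does not define).
    """
    if not location_list and not gpu_uid_list:
        # Both lists empty with reroute=true is the special reroute
        # operation. The behaviour with reroute=false is not described.
        return "reroute-only" if reroute else "unspecified"
    # A populated list must match the type the partition operates with.
    if location_list and partition_type != PartitionType.LOCATION_BASED:
        return "NMX_ST_NOT_SUPPORTED"
    if gpu_uid_list and partition_type != PartitionType.GPUUID_BASED:
        return "NMX_ST_NOT_SUPPORTED"
    return "add"
```

Under this model a request that populates both lists always reports NMX_ST_NOT_SUPPORTED, because one of the lists necessarily mismatches the partition type; whether the server behaves the same way is an assumption.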
RemoveGpusFromPartition(UpdatePartitionRequest) returns (UpdatePartitionResponse)
The client calls this RPC to remove GPUs from a partition. The UpdatePartitionRequest and UpdatePartitionResponse messages are the same as in AddGpusToPartition.
If the number of GPUs to be removed is equal to the number of GPUs in the partition, the RPC returns NMX_ST_NOT_SUPPORTED.
If the number of GPUs to be removed is greater than the number of GPUs in the partition, the RPC returns NMX_ST_BADPARAM.
If the GPU to be removed is not part of the partition, the RPC returns NMX_ST_RESOURCE_NOT_IN_USE.
UpdatePartitionResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter, or the requested number of GPUs to be removed is greater than the number of GPUs in the partition
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
NMX_ST_NOT_SUPPORTED - Requested number of GPUs to be removed is equal to the number of GPUs in the partition
NMX_ST_RESOURCE_NOT_IN_USE - Requested resource to be removed is not part of the partition
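The three removal rules can be expressed as a short client-side check. This is a sketch under stated assumptions: precheck_remove_gpus is a hypothetical helper, GPUs are represented by any hashable identifier (UID or location), and the order in which the conditions are evaluated is an assumption, since the text does not say which code wins when several conditions apply.

```python
def precheck_remove_gpus(partition_gpus, gpus_to_remove):
    """Predict the documented error for a RemoveGpusFromPartition request.

    Returns the name of the expected return code, or None when none of the
    documented removal errors applies.
    """
    members = set(partition_gpus)
    requested = set(gpus_to_remove)

    # More GPUs requested than the partition contains.
    if len(requested) > len(members):
        return "NMX_ST_BADPARAM"
    # Removing every GPU would empty the partition.
    if len(requested) == len(members):
        return "NMX_ST_NOT_SUPPORTED"
    # At least one requested GPU is not part of the partition.
    if not requested <= members:
        return "NMX_ST_RESOURCE_NOT_IN_USE"
    return None
```

To remove all GPUs of a partition, a client would presumably call DeletePartition instead; that reading follows from the NMX_ST_NOT_SUPPORTED rule and is not stated explicitly.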
GetConnCount(GetConnCountRequest) returns (GetConnCountResponse)
The client calls this RPC to get the number of fabric connections. It allows the client to filter the connections based on Attribute, Type and Location.
enum ConnAttr {
    NMX_NVLINK_CONN_ATTR_UNKNOWN = 0;
    NMX_NVLINK_CONN_ATTR_EXPECTED = 1; //!< All expected connections as per FM Topology
    NMX_NVLINK_CONN_ATTR_DISCOVERED = 2; //!< All discovered connections
    NMX_NVLINK_CONN_ATTR_EXPECTED_ACTIVE = 3; //!< All expected active connections as per FM Topology
    NMX_NVLINK_CONN_ATTR_EXPECTED_INACTIVE = 4; //!< All expected inactive connections
    NMX_NVLINK_CONN_ATTR_UNEXPECTED = 5; //!< All unexpected connections which are discovered
}
enum ConnType {
    NMX_NVLINK_CONN_TYPE_UNKNOWN = 0;
    NMX_NVLINK_CONN_TYPE_ALL = 1; //!< Dump the GPU and switch connections
    NMX_NVLINK_CONN_TYPE_GPU = 2; //!< Dump the GPU connections
    NMX_NVLINK_CONN_TYPE_SWITCH = 3; //!< Dump the Switch connections
}
The following combinations of attributes and types illustrate how connection information can be queried:
Connection category | Connection Attribute | Connection Type |
Access discovered | CONN_ATTR_DISCOVERED | CONN_TYPE_GPU |
Trunk discovered | CONN_ATTR_DISCOVERED | CONN_TYPE_SWITCH |
All discovered | CONN_ATTR_DISCOVERED | CONN_TYPE_ALL |
Access expected | CONN_ATTR_EXPECTED | CONN_TYPE_GPU |
Access inactive | CONN_ATTR_EXPECTED_INACTIVE | CONN_TYPE_GPU |
Trunk unexpected | CONN_ATTR_UNEXPECTED | CONN_TYPE_SWITCH |
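The combinations in the table above can be captured in a lookup that a client uses to fill connAttr and connType. This is a minimal sketch: the enum values mirror the ConnAttr and ConnType definitions above, while CATEGORY_FILTERS and filters_for are hypothetical names introduced for illustration.

```python
from enum import IntEnum


class ConnAttr(IntEnum):
    UNKNOWN = 0
    EXPECTED = 1
    DISCOVERED = 2
    EXPECTED_ACTIVE = 3
    EXPECTED_INACTIVE = 4
    UNEXPECTED = 5


class ConnType(IntEnum):
    UNKNOWN = 0
    ALL = 1
    GPU = 2
    SWITCH = 3


# Connection category -> (connAttr, connType), one entry per table row.
CATEGORY_FILTERS = {
    "Access discovered": (ConnAttr.DISCOVERED, ConnType.GPU),
    "Trunk discovered": (ConnAttr.DISCOVERED, ConnType.SWITCH),
    "All discovered": (ConnAttr.DISCOVERED, ConnType.ALL),
    "Access expected": (ConnAttr.EXPECTED, ConnType.GPU),
    "Access inactive": (ConnAttr.EXPECTED_INACTIVE, ConnType.GPU),
    "Trunk unexpected": (ConnAttr.UNEXPECTED, ConnType.SWITCH),
}


def filters_for(category):
    """Return the (ConnAttr, ConnType) pair for a connection category."""
    try:
        return CATEGORY_FILTERS[category]
    except KeyError:
        raise ValueError(f"unknown connection category: {category!r}") from None
```

In this sketch, access connections map to the GPU connection type and trunk connections map to the switch connection type, exactly as the table rows pair them.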
Request message:
Message/Parameter | Type | Description | Notes |
GetConnCountRequest.connType | ConnType | Filter based on connection type | Connection Types are NVLINK_CONN_TYPE_GPU, NVLINK_CONN_TYPE_SWITCH. Specify NVLINK_CONN_TYPE_ALL to include both |
GetConnCountRequest.connAttr | ConnAttr | Filter based on discovered and expected connections | Discovered connections can be Active/Unexpected. Expected connections can be Active/Inactive/Missing. |
GetConnCountRequest.loc | Location | Filter connections for a specific location |
Response message:
Message/Parameter | Type | Description | Notes |
GetConnCountResponse.numConns | uint32 | Number of connections | |
GetConnCountResponse.timestamp | string | Timestamp from when the connection database was last updated |
GetConnCountResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
GetConnInfoList(GetConnInfoListRequest) returns (GetConnInfoListResponse)
The client calls this RPC to get information on the fabric connections. It allows the client to filter the connections based on Attribute, Type and Location.
enum ConnAttr {
    NMX_NVLINK_CONN_ATTR_UNKNOWN = 0;
    NMX_NVLINK_CONN_ATTR_EXPECTED = 1; //!< All expected connections as per FM Topology
    NMX_NVLINK_CONN_ATTR_DISCOVERED = 2; //!< All discovered connections
    NMX_NVLINK_CONN_ATTR_EXPECTED_ACTIVE = 3; //!< All expected active connections as per FM Topology
    NMX_NVLINK_CONN_ATTR_EXPECTED_INACTIVE = 4; //!< All expected inactive connections
    NMX_NVLINK_CONN_ATTR_UNEXPECTED = 5; //!< All unexpected connections which are discovered
}
enum ConnType {
    NMX_NVLINK_CONN_TYPE_UNKNOWN = 0;
    NMX_NVLINK_CONN_TYPE_ALL = 1; //!< Dump the GPU and switch connections
    NMX_NVLINK_CONN_TYPE_GPU = 2; //!< Dump the GPU connections
    NMX_NVLINK_CONN_TYPE_SWITCH = 3; //!< Dump the Switch connections
}
The API returns a list of connections and the state of each connection:
enum ConnState {
    NMX_NVLINK_CONN_STATE_UNKNOWN = 0;
    NMX_NVLINK_CONN_STATE_ACTIVE = 1; //!< Active link or connection state
    NMX_NVLINK_CONN_STATE_INACTIVE = 2; //!< Inactive link or connection state
}
GetConnInfoListRequest message:
Message/Parameter | Type | Description | Notes |
GetConnInfoListRequest.connType | ConnType | Filter based on connection type | |
GetConnInfoListRequest.connAttr | ConnAttr | Filter based on discovered and expected connections | |
GetConnInfoListRequest.loc | Location | Filter connections for a specific location | |
GetConnInfoListRequest.numConns | uint32 | Number of connections |
Response message:
Message/Parameter | Type | Description | Notes |
GetConnInfoListResponse.connInfoList | Array of ConnInfo | List of connection information | |
GetConnInfoListResponse.timestamp | string | Timestamp from when the connection database was last updated |
GetConnInfoListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
ConnInfo message:
Message/Parameter | Type | Description | Notes |
endPointA | LinkEndPoint | One end of the connection (a device port) | |
endPointB | LinkEndPoint | The other end of the connection (a device port) | |
connType | ConnType | Connection type | |
connState | ConnState | Connection state | NVLINK_CONN_STATE_ACTIVE or NVLINK_CONN_STATE_INACTIVE |
LinkEndPoint message:
Message/Parameter | Type | Description | Notes |
loc | Location | Location of the node of the endpoint | |
switchOrGpuId | uint32 | Location of the device (GPU/switch) within the node of the endpoint | |
cageNum | uint32 | Cage Number of the endpoint | Valid only for NVLINK_CONN_TYPE_SWITCH, otherwise set to 0 |
cagePortNum | uint32 | Cage Port Number of the endpoint | Valid only for NVLINK_CONN_TYPE_SWITCH, otherwise set to 0 |
cagePortSplitNum | uint32 | Cage Split Port Number of the endpoint | Valid only for NVLINK_CONN_TYPE_SWITCH, otherwise set to 0 |
portNum | uint32 | Port Number of the endpoint on the device (GPU/switch) |
GetConnInfoCombined(GetConnInfoCombinedRequest) returns (ConnInfoCombined)
The client calls this RPC to get information on the fabric trunk connections that are mis-wired.
Response message:
Message/Parameter | Type | Description | Notes |
ConnInfoCombined.unexpectedConnList | Array of ConnInfo | List of mis-wired trunk connections |
ConnInfoCombined.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully completed the call
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
FactoryReset(FactoryResetRequest) returns (ReturnCode)
The client calls this RPC to perform a factory reset of the NMX-Controller. After this call completes, the NMX-Controller configuration and state are as initially delivered from the factory.
ReturnCode:
NMX_ST_SUCCESS - successfully completed the call
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call has failed