Date: 2025-08-21

Hello() is the first RPC that a client must send after the gRPC connection is established.

Request:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| ClientHello.gatewayId | String | Client identifier | |
| ClientHello.major_version | ProtoMsgMajorVersion | Major version | |
| ClientHello.minor_version | ProtoMsgMinorVersion | Minor version | |

Response:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| ServerHello.serverHeader | ServerHeader | Base server info and return code | |
| ServerHello.components_ver | Array of KeyValPair | List of NMX-Controller components and their versions | |
| ServerHello.capabilities | Array of string | List of NMX-Controller capabilities | |
| ServerHello.host_os_details | String | NMX-Controller host OS details | |
| ClientHello.major_version | ProtoMsgMajorVersion | Major version | |
| ClientHello.minor_version | ProtoMsgMinorVersion | Minor version | |

ServerHello.serverHeader.returnCode:

NMX_ST_SUCCESS - Client hello was successful

NMX_ST_BADPARAM - Missing or invalid gatewayId (e.g. empty string)

NMX_ST_VERSION_MISMATCH - Major version of client and server protos do not match

Server closes the gRPC connection if:

Client ProtoMsgMajorVersion is not the same as server

Client attempts to call any other RPC before successful completion of Hello() RPC.
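The handshake rules above can be modeled as a small decision function. This is a minimal sketch, not the NMX-C implementation: `evaluate_hello` and its return tuple are hypothetical, the order of the checks is an assumption, and so is leaving the connection open on NMX_ST_BADPARAM. Only the return codes and the close-on-version-mismatch behavior come from the text.

```python
from typing import Tuple


def evaluate_hello(gateway_id: str, client_major: int, server_major: int) -> Tuple[str, bool]:
    """Return (return_code, server_closes_connection) for a Hello() call."""
    # A missing or empty gatewayId is documented as NMX_ST_BADPARAM.
    if not gateway_id:
        return "NMX_ST_BADPARAM", False
    # A major-version mismatch yields NMX_ST_VERSION_MISMATCH, and the
    # server is documented to close the gRPC connection in that case.
    if client_major != server_major:
        return "NMX_ST_VERSION_MISMATCH", True
    return "NMX_ST_SUCCESS", False
```

A client could use the same logic as a pre-flight check before sending ClientHello, so that an empty identifier is caught locally.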

The client calls this RPC to subscribe to asynchronous push notifications. SubscribeRequest contains the gatewayID and the notifyOnSelfChange flag, which should be set to false.

Asynchronous push notifications:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| ServerNotification.subscriptionResponse | SubscriptionResponse | Confirmation of subscription | Sent only to requesting client |
| ServerNotification.staticConfigResponse | StaticConfigResponse | Notification of static config change. Includes changed items | Sent to all subscribed clients except for the requesting client |
| ServerNotification.CreatePartitionResponse | CreatePartitionResponse | Notification of partition creation. Includes partitionId | Sent to all subscribed clients except for the requesting client |
| ServerNotification.DeletePartitionResponse | DeletePartitionResponse | Notification of partition deletion. Includes partitionId | Sent to all subscribed clients except for the requesting client |
| ServerNotification.UpdatePartitionResponse | UpdatePartitionResponse | Notification of partition configuration change. Includes partitionId | Sent to all subscribed clients except for the requesting client |
| ServerNotification.fmEvent.fmEventControlPlaneStateChange | FmEventControlPlaneStateChange | Notification of change in control plane state | Sent to all subscribed clients |
| ServerNotification.fmEvent.fmEventTopologyChange | FmEventTopologyChange | Notification of change in discovered topology | Sent to all subscribed clients |
| ServerNotification.fmEvent.fmEventPartitionChange | FmEventPartitionChange | Notification of change in partition health. Includes partitionId | Sent to all subscribed clients |
| ServerNotification.healthStateChanged | HealthStateChanged | Notification of change in NMX-C health | Sent to all subscribed clients |

ServerNotification.serverHeader.returnCode:

NMX_ST_SUCCESS - Client subscription was successful

NMX_ST_BADPARAM - Missing or invalid gatewayId
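The delivery rules in the notification table can be summarized in a few lines. The sketch below is illustrative only: `recipients` and the plain-string notification kinds are hypothetical, and it assumes notifyOnSelfChange is false, as the text recommends.

```python
from typing import List, Optional

# Notification kinds that the table marks "except for the requesting client".
EXCLUDES_REQUESTER = {
    "staticConfigResponse",
    "CreatePartitionResponse",
    "DeletePartitionResponse",
    "UpdatePartitionResponse",
}


def recipients(kind: str, subscribers: List[str], requester: Optional[str] = None) -> List[str]:
    """Return the gateway IDs that should receive a notification of this kind."""
    if kind == "subscriptionResponse":
        # The confirmation goes only to the client that subscribed.
        return [requester] if requester in subscribers else []
    if kind in EXCLUDES_REQUESTER:
        return [s for s in subscribers if s != requester]
    # fmEvent.* and healthStateChanged go to every subscribed client.
    return list(subscribers)
```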

The client calls this RPC to read the current static configuration. The client may request either full files or individual parameters from the files.

Request:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetStaticConfigRequest.configKeys | Array of configKey | List of configuration parameters | Either configKeys or configFiles should be provided |
| GetStaticConfigRequest.configFiles | Array of configFileName | List of configuration files | Either configKeys or configFiles should be provided |

Supported values for configFileName:

sm_config

fm_config

rdm_config

chassis_mapping

Response:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| staticConfig.configKeyVals | Array of configKeyVals | List of configuration parameters and their values | Either configKeyVals or configFileContents is provided, depending on request |
| staticConfig.configFileContents | Array of ConfigFileContent | List of configuration files and their content | Either configKeyVals or configFileContents is provided, depending on request |

StaticConfigResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - Client request for static configuration was successful

NMX_ST_BADPARAM - Missing or invalid input parameter (e.g. empty strings)

Example of fm_config file content:

MNNVL_ENABLE_DEFAULT_PARTITION=1
MNNVL_DEFAULT_RESILIENCY_MODE=2
MNNVL_DEFAULT_PARTITION_TYPE=2
MNNVL_TOPOLOGY=gb200_nvl36r1_c2g4_topology





The client calls this RPC to update the static configuration. The client may request to update either full files or individual parameters in the files.

Request:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| SetStaticConfigRequest.staticConfig.configKeyVals | Array of configKey | List of configuration parameters and their values | Either configKeyVals or configFileContents should be provided |
| SetStaticConfigRequest.staticConfig.configFileContents | Array of configFileContent | List of configuration files and their content | Either configKeyVals or configFileContents should be provided |

Supported values for configFileName:

sm_config

fm_config

rdm_config

chassis_mapping

All config files comply with the INI file format (https://en.wikipedia.org/wiki/INI_file), with the following restrictions:

format “key = value”

No support for sections

No support for hierarchy

Case insensitive

Support for comments

No support for duplicate names

Support for Quoted values

Support for Line continuation

Support for Escape characters
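To make the restrictions concrete, here is a minimal sketch of a reader for this flat "key = value" format. It is not the NMX-C parser: `parse_config` is a hypothetical helper, it assumes that "#" and ";" start comment lines and that case insensitivity applies to keys, and it deliberately leaves out line continuation and escape characters.

```python
from typing import Dict


def parse_config(text: str) -> Dict[str, str]:
    """Parse flat "key = value" lines into a dict with lower-cased keys."""
    result: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: '#' and ';' introduce comment lines.
        if not line or line[0] in "#;":
            continue
        if line.startswith("["):
            raise ValueError("sections are not supported")
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key = value, got: {raw!r}")
        name = key.strip().lower()  # keys are treated as case-insensitive
        if name in result:
            raise ValueError(f"duplicate name: {name}")
        value = value.strip()
        # Strip one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        result[name] = value
    return result
```

Rejecting duplicates and section headers up front mirrors the restrictions listed above, so a malformed file fails locally before it is sent in a SetStaticConfigRequest.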

The fm_config file must include the MNNVL_TOPOLOGY parameter with one of the following values:

gb200_nvl36r1_c2g4_topology - 36 GPUs, Single chassis, 2 CPUs, 4 GPUs per compute tray

gb200_nvl36r1_c2g2_topology - 36 GPUs, Single chassis, 2 CPUs, 2 GPUs per compute tray

gb200_nvl72r1_c2g4_topology - 72 GPUs, Single chassis, 2 CPUs, 4 GPUs per compute tray

gb200_nvl72r2_c2g4_topology - 72 GPUs, Two chassis, 2 CPUs, 4 GPUs per compute tray

gb200_nvl72r2_c2g2_topology - 72 GPUs, Two chassis, 2 CPUs, 2 GPUs per compute tray
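The supported MNNVL_TOPOLOGY values can be captured in a lookup table so that a client validates the value before sending it. This is a sketch under stated assumptions: the `Topology` tuple and `lookup_topology` are hypothetical names, and the numbers are copied from the list above rather than parsed from the topology name.

```python
from typing import NamedTuple


class Topology(NamedTuple):
    gpus: int           # total GPUs in the domain
    chassis: int        # number of chassis
    cpus: int           # CPU count as listed for the topology
    gpus_per_tray: int  # GPUs per compute tray


TOPOLOGIES = {
    "gb200_nvl36r1_c2g4_topology": Topology(36, 1, 2, 4),
    "gb200_nvl36r1_c2g2_topology": Topology(36, 1, 2, 2),
    "gb200_nvl72r1_c2g4_topology": Topology(72, 1, 2, 4),
    "gb200_nvl72r2_c2g4_topology": Topology(72, 2, 2, 4),
    "gb200_nvl72r2_c2g2_topology": Topology(72, 2, 2, 2),
}


def lookup_topology(name: str) -> Topology:
    """Return the documented properties of a topology, or raise ValueError."""
    try:
        return TOPOLOGIES[name]
    except KeyError:
        raise ValueError(f"unsupported MNNVL_TOPOLOGY value: {name!r}") from None
```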

The chassis_mapping file maps each chassis ID to a chassis serial number. Example of the file format:

chassisId1 ABC123ABC123A
chassisId2 XYZ987XYZ987X

ReturnCode:

NMX_ST_SUCCESS - Client request for static configuration was successful

NMX_ST_NMX_CONTROLLER_INVALID_CONFIG_FILE - config file parsing error

NMX_ST_BADPARAM - Missing or invalid input parameter

Domain Properties provide an overview of the current topology that NMX-C manages. They give the maximum number of expected resources (including but not limited to GPUs, switches, compute nodes, switch nodes, partitions, and NVLinks) in the domain.

The client calls this RPC to get static properties of the domain. The GetDomainPropertiesRequest has the context and gatewayID.

Response:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| DomainProperties.serverHeader | ServerHeader | Base server info and return code | |
| DomainProperties.maxComputeNodes | uint32 | Maximum number of Compute Nodes in the NVLink Domain | |
| DomainProperties.maxComputeNodesPerChassis | uint32 | Maximum number of Compute Nodes in a chassis | Number of chassis in the NVLink Domain = DomainProperties.maxComputeNodes / DomainProperties.maxComputeNodesPerChassis |
| DomainProperties.maxGpusPerComputeNode | uint32 | Maximum number of GPUs in a Compute Node | |
| DomainProperties.maxGpuNvLinks | uint32 | Maximum number of NVLinks in a GPU | |
| DomainProperties.lineRateMBps | uint32 | Maximum line rate in MBps of an NVLink | |
| DomainProperties.maxSwitchNodes | uint32 | Maximum number of Switch Nodes in the NVLink Domain | |
| DomainProperties.maxSwitchNodesPerChassis | uint32 | Maximum number of Switch Nodes in a chassis | |
| DomainProperties.maxSwitchesPerSwitchNode | uint32 | Maximum number of Switches in a Switch Node | |
| DomainProperties.maxSwitchNvLinks | uint32 | Maximum number of NVLinks in a Switch | |
| DomainProperties.maxNumPartitions | uint32 | Maximum number of partitions that can be created in the NVLink Domain | |
| DomainProperties.maxNumAlids | uint32 | Maximum number of Alids for a GPU | |
| DomainProperties.maxMulticastGroups | uint32 | Maximum number of Multicast Groups available in the NVLink Domain | |
| DomainProperties.maxNumPorts | uint32 | Total aggregate number of ports across switches and GPUs in the NVLink Domain | |

DomainProperties.serverHeader.returnCode:

NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed
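The note on maxComputeNodesPerChassis gives a formula for the number of chassis. A short sketch of that arithmetic follows; `chassis_count` is a hypothetical helper, and the sample values in the usage note are illustrative rather than taken from a real domain.

```python
def chassis_count(max_compute_nodes: int, max_compute_nodes_per_chassis: int) -> int:
    """Number of chassis = maxComputeNodes / maxComputeNodesPerChassis."""
    if max_compute_nodes_per_chassis <= 0:
        raise ValueError("maxComputeNodesPerChassis must be positive")
    # Integer division: both inputs are uint32 counts.
    return max_compute_nodes // max_compute_nodes_per_chassis
```

For example, `chassis_count(36, 18)` evaluates to 2.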

The client calls this RPC to get the dynamic properties of the domain. The GetDomainStateInfoRequest has the context and gatewayID.

Response:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| DomainStateInfo.serverHeader | ServerHeader | Base server info and return code | |
| DomainStateInfo.controlPlaneState | ControlPlaneState | State of the domain control plane | |
| DomainStateInfo.availableMulticastGroups | uint32 | Number of available multicast groups | |
| DomainStateInfo.configStatusDescription | string | Additional details of the control plane state | |
| DomainStateInfo.nmxControllerHealth | NmxControllerHealth | NMX-Controller service health | |

DomainStateInfo.serverHeader.returnCode:

NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

The following are the values of controlPlaneState:

NMX_CONTROL_PLANE_STATE_UNDEFINED = 0;
NMX_CONTROL_PLANE_STATE_OFFLINE = 1;
NMX_CONTROL_PLANE_STATE_STANDBY = 2;
NMX_CONTROL_PLANE_STATE_CONFIGURED = 3;
NMX_CONTROL_PLANE_STATE_TIMEOUT = 4;
NMX_CONTROL_PLANE_STATE_ERROR = 5;
NMX_CONTROL_PLANE_STATE_UNCONFIGURED = 6;

Control plane states and their associated configuration status description strings are explained below.

NMX_CONTROL_PLANE_STATE_UNCONFIGURED - Pending required FM configuration.

CONFIG_PENDING_UUID - Pending NVLink Domain UUID. FM waits indefinitely until set.

CONFIG_PENDING_TOPOLOGY - Pending MNNVL_TOPOLOGY config. FM waits indefinitely until set.

CONFIG_PENDING_CHASSIS_ID_MAPPING - Pending chassis Id mapping. FM waits for GFM_WAIT_TIMEOUT_SEC until set. If not set, FM allocates the mapping during the initial resource discovery.

CONFIG_RECEIVED - FM received all the required configuration.

NMX_CONTROL_PLANE_STATE_ERROR - Error validating FM configuration. Restart NMX-C after the configuration error is fixed.

CONFIG_ERROR_INCORRECT_TOPOLOGY_FILE - Encountered an error while processing the topology file specified in MNNVL_TOPOLOGY.

CONFIG_ERROR_CHASSIS_ID_MAPPING_COUNT - Detected a mismatch between the number of entries in the specified chassis Id mapping and the expected number of chassis read from the topology file specified in MNNVL_TOPOLOGY.

CONFIG_ERROR_CHASSIS_ID_MAPPING_OUT_OF_RANGE - Detected a chassis Id value outside of the allowed range. The allowed range is 1 to n, where n is the number of chassis in the NVLink Domain.

CONFIG_ERROR_DUPLICATE_CHASSIS_SERIAL_NUMBER - Detected a duplicate chassis serial number in the chassis Id mapping.

CONFIG_ERROR_MISSING_CHASSIS - Detected fewer chassis serial numbers than expected during the initial resource discovery.

CONFIG_ERROR_ADDITIONAL_CHASSIS_DETECTED - Detected more chassis serial numbers than expected during the initial resource discovery.

NMX_CONTROL_PLANE_STATE_DEGRADED - FM detected a misconfiguration. The NVLink Domain continues to work with limited capability. Fixing the misconfiguration restores full functionality.

CONFIG_ERROR_MISWIRED_TRUNK_PORTS - Incorrect or mis-wired trunk connection(s) detected. Use the GetConnInfoList() gRPC API to get details.

NMX_CONTROL_PLANE_STATE_CONFIGURED - FM completed configuration validation and initialization.

CONFIG_DONE - FM completed configuration validation and initialization.
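A client that polls GetDomainStateInfo may want to map a configStatusDescription string back to its control plane state and decide whether a restart is needed. The mapping below is copied from the descriptions above. The helper names are hypothetical, and the sketch assumes the string is returned exactly as spelled here.

```python
from typing import Optional

STATUS_TO_STATE = {
    "CONFIG_PENDING_UUID": "NMX_CONTROL_PLANE_STATE_UNCONFIGURED",
    "CONFIG_PENDING_TOPOLOGY": "NMX_CONTROL_PLANE_STATE_UNCONFIGURED",
    "CONFIG_PENDING_CHASSIS_ID_MAPPING": "NMX_CONTROL_PLANE_STATE_UNCONFIGURED",
    "CONFIG_RECEIVED": "NMX_CONTROL_PLANE_STATE_UNCONFIGURED",
    "CONFIG_ERROR_INCORRECT_TOPOLOGY_FILE": "NMX_CONTROL_PLANE_STATE_ERROR",
    "CONFIG_ERROR_CHASSIS_ID_MAPPING_COUNT": "NMX_CONTROL_PLANE_STATE_ERROR",
    "CONFIG_ERROR_CHASSIS_ID_MAPPING_OUT_OF_RANGE": "NMX_CONTROL_PLANE_STATE_ERROR",
    "CONFIG_ERROR_DUPLICATE_CHASSIS_SERIAL_NUMBER": "NMX_CONTROL_PLANE_STATE_ERROR",
    "CONFIG_ERROR_MISSING_CHASSIS": "NMX_CONTROL_PLANE_STATE_ERROR",
    "CONFIG_ERROR_ADDITIONAL_CHASSIS_DETECTED": "NMX_CONTROL_PLANE_STATE_ERROR",
    "CONFIG_ERROR_MISWIRED_TRUNK_PORTS": "NMX_CONTROL_PLANE_STATE_DEGRADED",
    "CONFIG_DONE": "NMX_CONTROL_PLANE_STATE_CONFIGURED",
}


def state_for_status(status: str) -> Optional[str]:
    """Return the control plane state a status string belongs to, if known."""
    return STATUS_TO_STATE.get(status)


def requires_restart(status: str) -> bool:
    # Only the ERROR state is documented as needing an NMX-C restart
    # after the configuration is fixed.
    return state_for_status(status) == "NMX_CONTROL_PLANE_STATE_ERROR"
```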



The following are the values of nmxControllerHealth:

NMX_CONTROLLER_HEALTH_UNKNOWN = 0;
NMX_CONTROLLER_HEALTH_HEALTHY = 1;
NMX_CONTROLLER_HEALTH_DEGRADED = 2;
NMX_CONTROLLER_HEALTH_UNHEALTHY = 3;
NMX_CONTROLLER_HEALTH_UNHEALTHY_DB_CORRUPTED = 4;





The client calls this RPC to receive information on the currently discovered topology:

Devices (GPUs and Switches) and their properties and state (DeviceTopoInfo contains one of SwitchTopoInfo or GpuTopoInfo)

Device ports and their properties and state (PortTopoInfo)

Device connectivity information

Response:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| FmTopologyInfo.serverHeader | ServerHeader | Base server info and return code | |
| FmTopologyInfo.deviceTopoInfo | Array of DeviceTopoInfo | List of discovered devices | |

FmTopologyInfo.serverHeader.returnCode:

NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

DeviceTopoInfo message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| switchTopoInfo | SwitchTopoInfo | Details of switch device | Either switchTopoInfo or gpuTopoInfo is provided, depending on the discovered device type |
| gpuTopoInfo | GpuTopoInfo | Details of GPU device | Either switchTopoInfo or gpuTopoInfo is provided, depending on the discovered device type |

gpuTopoInfo message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| loc | LocationInfo | Node location | |
| topologyId | uint64 | Indicates switch tray model | Value of 128 is a scale-out GB200 NVL switch tray; 129 is a non-scale-out GB200 NVL switch tray |
| deviceUid | uint64 | Device unique identifier | |
| deviceId | uint32 | Device enumeration within the node | From 1 |
| numPorts | uint64 | Total number of ports | |
| systemUid | uint64 | Node unique identifier | |
| vendorId | uint32 | Device vendor ID | |
| devicePcieId | uint64 | Device PCIe ID | |
| description | string | Device description | |
| partitionId | Array of PartitionId | List of partitions the device is associated with | |
| deviceHealth | GpuHealth | NVLink health of the GPU | |
| portTopoInfo | Array of PortTopoInfo | List of device ports | |
| aLids | Array of uint64 | List of device labels for internal routing | |

SwitchTopoInfo message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| loc | LocationInfo | Node location | |
| topologyId | uint64 | Indicates switch tray model | |
| deviceUid | uint64 | Device unique identifier | |
| deviceId | uint32 | Device enumeration within the node | |
| numPorts | uint64 | Total number of ports | |
| systemUid | uint64 | Node unique identifier | |
| vendorId | uint32 | Device vendor ID | |
| devicePcieId | uint64 | Device PCIe ID | |
| description | string | Device description | |
| partitionId | Array of PartitionId | List of partitions the device is associated with | |
| deviceHealth | SwitchHealth | NVLink health of the switch | |
| portTopoInfo | Array of PortTopoInfo | List of device ports | |

PortTopoInfo message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| portType | PortType | Type of device port | For GPU: PORT_TYPE_GPU. For a switch: PORT_TYPE_SWITCH_ACCESS (port towards GPU), PORT_TYPE_SWITCH_TRUNK (relevant only for scale-out switch trays, scale-out port), PORT_TYPE_FNM (internal port, for internal management traffic) |
| portUid | uint64 | Port identifier | For GPU: Unique port identifier. For a switch: Equal to device unique identifier |
| portNum | uint64 | Port number of device | From 1 |
| peerPortDeviceUid | uint64 | Peer device unique identifier | |
| peerPortNum | uint64 | Peer port number on peer device | From 1 |
| physicalState | PhysicalPortState | NVLink port physical state | |
| logicalState | LogicalPortState | NVLink port logical state | |
| subnetPrefix | uint64 | NVLink port routing subnet | |
| isSdnPort | boolean | NVLink management port indicator | |
| partitionIdList | Array of PartitionId | List of partitions the port is associated with | |
| cageNum | uint32 | Front panel cage number | Provided only when portType=PORT_TYPE_SWITCH_TRUNK. From 1 |
| cagePortNum | uint32 | Front panel port number in cage | Provided only when portType=PORT_TYPE_SWITCH_TRUNK. From 1 |
| cageSplitPortNum | uint32 | Front split port number in cage | Provided only when portType=PORT_TYPE_SWITCH_TRUNK. From 1 |
| baseLid | uint64 | Base label for routing | Provided only when portType=PORT_TYPE_GPU |
| systemPortNum | uint64 | Port number per tray | Provided only when portType=PORT_TYPE_SWITCH_ACCESS. From 1 |
| computePortNum | uint64 | Port number per GPU | Provided only when portType=PORT_TYPE_GPU. From 0 |
| containAndDrain | bool | Indication that the port is in contain and drain state | Value of TRUE indicates active contain and drain state |
| rail | uint32 | Rail of the port | |
| plane | uint32 | Plane of the port | |
| linkRateMbps | uint32 | Rate of the port link in Mbits/sec | |
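Several PortTopoInfo fields are provided only for particular port types. The sketch below records those conditions so that a client knows which optional fields to read. `expected_fields` is a hypothetical helper, and the port types are shown as plain strings rather than the generated enum.

```python
from typing import Dict, Tuple

# Fields the PortTopoInfo notes mark "Provided only when portType=...".
CONDITIONAL_FIELDS: Dict[str, Tuple[str, ...]] = {
    "PORT_TYPE_SWITCH_TRUNK": ("cageNum", "cagePortNum", "cageSplitPortNum"),
    "PORT_TYPE_SWITCH_ACCESS": ("systemPortNum",),
    "PORT_TYPE_GPU": ("baseLid", "computePortNum"),
}


def expected_fields(port_type: str) -> Tuple[str, ...]:
    """Return the conditional fields that apply to a given port type."""
    return CONDITIONAL_FIELDS.get(port_type, ())
```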

The client calls this RPC to get the number of compute nodes. It allows the client to filter the compute nodes based on Attribute, Health and Location.

Attribute reflects the allocation status of a compute node to a partition:

NMX_COMPUTE_NODE_ATTR_ALL = 1 - all compute nodes in the NVLink Domain, irrespective of allocation status

NMX_COMPUTE_NODE_ATTR_FREE = 2 - compute nodes in the NVLink Domain that are not allocated to any partition

NMX_COMPUTE_NODE_ATTR_FULLY_ALLOCATED = 3 - compute nodes in the NVLink Domain where all GPUs are allocated to partitions

NMX_COMPUTE_NODE_ATTR_PARTIALLY_ALLOCATED = 4 - compute nodes in the NVLink Domain where one or more (but not all) GPUs are allocated to partitions

Compute node health:

NMX_COMPUTE_NODE_HEALTH_HEALTHY = 1 - Fully healthy

NMX_COMPUTE_NODE_HEALTH_DEGRADED = 2 - Some GPUs are degraded to NO_NVLINK

NMX_COMPUTE_NODE_HEALTH_UNHEALTHY = 3 - Unable to participate in NVLink

Location and health are optional filters.

Request:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetComputeNodeCountRequest.attr | ComputeNodeAttr | Filter based on partition allocation status of the compute node | |
| GetComputeNodeCountRequest.chassisId | uint64 | Chassis ID | [Optional] Set to 0 when not used |
| GetComputeNodeCountRequest.nodeHealth | ComputeNodeHealth | NVLink health of the compute node | [Optional] Set to COMPUTE_NODE_HEALTH_UNKNOWN when not used |

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetComputeNodeCountResponse.numNodes | uint32 | Number of compute nodes matching the filters | |

GetComputeNodeCountResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed
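The attribute filter above amounts to classifying each compute node by how many of its GPUs are allocated. A minimal sketch follows, using the documented enum values. The function names are hypothetical, and treating "not allocated to any partition" as "zero GPUs allocated" is an assumption.

```python
from enum import IntEnum


class ComputeNodeAttr(IntEnum):
    ALL = 1
    FREE = 2
    FULLY_ALLOCATED = 3
    PARTIALLY_ALLOCATED = 4


def allocation_status(allocated_gpus: int, total_gpus: int) -> ComputeNodeAttr:
    """Classify a node by the number of its GPUs that belong to partitions."""
    if total_gpus <= 0 or not 0 <= allocated_gpus <= total_gpus:
        raise ValueError("invalid GPU counts")
    if allocated_gpus == 0:
        return ComputeNodeAttr.FREE
    if allocated_gpus == total_gpus:
        return ComputeNodeAttr.FULLY_ALLOCATED
    return ComputeNodeAttr.PARTIALLY_ALLOCATED


def matches_attr(attr: ComputeNodeAttr, allocated_gpus: int, total_gpus: int) -> bool:
    """True if a node with these GPU counts passes the attribute filter."""
    return attr == ComputeNodeAttr.ALL or attr == allocation_status(allocated_gpus, total_gpus)
```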

The client calls this RPC to get the locations of the compute nodes. It allows the client to filter the compute nodes based on Attribute, Health and Location.

Attribute reflects the allocation status of a compute node to a partition:

NMX_COMPUTE_NODE_ATTR_ALL = 1 - all compute nodes in the NVLink Domain, irrespective of allocation status

NMX_COMPUTE_NODE_ATTR_FREE = 2 - compute nodes in the NVLink Domain that are not allocated to any partition

NMX_COMPUTE_NODE_ATTR_FULLY_ALLOCATED = 3 - compute nodes in the NVLink Domain where all GPUs are allocated to partitions

NMX_COMPUTE_NODE_ATTR_PARTIALLY_ALLOCATED = 4 - compute nodes in the NVLink Domain where one or more (but not all) GPUs are allocated to partitions

Compute node health:

NMX_COMPUTE_NODE_HEALTH_HEALTHY = 1 - Fully healthy

NMX_COMPUTE_NODE_HEALTH_DEGRADED = 2 - Some GPUs are degraded to NO_NVLINK

NMX_COMPUTE_NODE_HEALTH_UNHEALTHY = 3 - Unable to participate in NVLink

Location and health are optional filters.

Request message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetComputeNodeLocationListRequest.attr | ComputeNodeAttr | Filter based on partition allocation status of the compute node | |
| GetComputeNodeLocationListRequest.chassisId | uint64 | Chassis ID | [Optional] Set to 0 when not used |
| GetComputeNodeLocationListRequest.nodeHealth | ComputeNodeHealth | NVLink health of the compute node | [Optional] Set to COMPUTE_NODE_HEALTH_UNKNOWN when not used |
| GetComputeNodeLocationListRequest.numNodes | uint32 | Limit on number of nodes in the response | |

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetComputeNodeLocationListResponse.locList | Array of Location | List of locations | |

GetComputeNodeLocationListResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

The client calls this RPC to get information on the compute nodes. The client uses it to get both static information (location and number of GPUs in the node) and dynamic information (partition IDs and health) about compute nodes.

Request message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetComputeNodeInfoListRequest.locList | Array of Location | List of locations | [Optional] If the list is empty, the response includes all nodes |

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetComputeNodeInfoListResponse.nodeInfoList | Array of ComputeNodeInfo | | |

GetComputeNodeInfoListResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

ComputeNodeInfo message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| ComputeNodeInfo.loc | LocationInfo | Node location | |
| ComputeNodeInfo.numGPUs | uint32 | Number of GPUs for the node | |
| ComputeNodeInfo.nodeHealth | ComputeNodeHealth | NVLink health of the compute node | |
| ComputeNodeInfo.partitionIdList | Array of PartitionId | List of partitions the device is associated with | |

The client calls this RPC to get information on the GPUs.

The client can filter GPUs based on location or partition by passing the attribute NMX_GPU_ATTR_LOCATION or NMX_GPU_ATTR_PARTITION_ID respectively.

Notes:

Specifying NMX_GPU_ATTR_PARTITION_ID and partitionId=0 provides info on GPUs that are not associated with any partition

Specifying NMX_GPU_ATTR_ALL ignores the location or partition values set in the request

Request message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetGpuInfoListRequest.attr | GpuAttr | Filter based on GPUs that belong to a partition or location | |
| GetGpuInfoListRequest.numGpus | uint32 | Limit on number of GPUs for response | [Optional] Set to 0 when not used; the response then includes all GPUs matching the filter |
| GetGpuInfoListRequest.loc | Location | Location | Values used only when attr=GPU_ATTR_LOCATION_ID |
| GetGpuInfoListRequest.partitionId | PartitionId | Partition ID | Value used only when attr=GPU_ATTR_PARTITION_ID |
| GetGpuInfoListRequest.gpuHealth | GpuHealth | GPU health | [Optional] Set to 0 (GPU_HEALTH_UNKNOWN) when not used |

GPU health values:

NMX_GPU_HEALTH_HEALTHY = 1 - Fully healthy

NMX_GPU_HEALTH_DEGRADED = 2 - One or more links are down

NMX_GPU_HEALTH_NO_NVLINK = 3 - Unable to participate in NVLink partition

NMX_GPU_HEALTH_DEGRADED_BW = 4 - GPU operates in degraded bandwidth

Note: To get all healthy GPUs not associated with any partition, set:

attr to NMX_GPU_ATTR_PARTITION_ID

partitionId to 0

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetGpuInfoListResponse.gpuInfoList | Array of GpuInfo | A list of GPU information | |

GetGpuInfoListResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

GpuInfo message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GpuInfo.loc | LocationInfo | Node location | |
| GpuInfo.gpuId | uint32 | GPU enumeration within the node | From 1 |
| GpuInfo.gpuUid | uint64 | GPU unique identifier | |
| GpuInfo.gpuHealth | GpuHealth | NVLink health of the GPU | |
| GpuInfo.partitionId | PartitionId | Partitions the device is associated with | |

The client calls this RPC to get the number of switch nodes.

The client can filter the switch nodes using attribute and health. Note that NVLink5 topologies do not have switch nodes of type NMX_SWITCH_NODE_ATTR_L2.

Request message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetSwitchNodeCountRequest.attr | SwitchNodeAttr | Filter based on the type of switch node | |
| GetSwitchNodeCountRequest.nodeHealth | SwitchNodeHealth | NVLink health of the switch node | [Optional] Set to SWITCH_NODE_HEALTH_UNKNOWN when not used |
| GetSwitchNodeCountRequest.numNodes | uint32 | Limit on number of nodes for the response | [Optional] Set to 0 for no limit |

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetSwitchNodeCountResponse.numNodes | uint32 | Number of nodes matching the filter | |

GetSwitchNodeCountResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

The client calls this RPC to get the location of switch nodes.

The client can filter the switch nodes using attribute and health. Note that NVLink5 topologies do not have switch nodes of type NMX_SWITCH_NODE_ATTR_L2.

Request message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetSwitchNodeLocationListRequest.attr | SwitchNodeAttr | Filter based on the type of switch node | |
| GetSwitchNodeLocationListRequest.nodeHealth | SwitchNodeHealth | NVLink health of the switch node | [Optional] Set to SWITCH_NODE_HEALTH_UNKNOWN when not used |
| GetSwitchNodeLocationListRequest.numNodes | uint32 | Limit on number of nodes for the response | [Optional] Set to 0 for no limit |

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetSwitchNodeLocationListResponse.locList | Array of Location | List of node locations | |

GetSwitchNodeLocationListResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

The client calls this RPC to get information on the switch nodes. The client uses it to get both static information (location and number of switches in the node) and dynamic information (partition IDs and health) about switch nodes.

Request message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetSwitchNodeInfoListRequest.locList | Array of Location | List of node locations | [Optional] If the list is empty, the response includes all nodes |

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetSwitchNodeInfoListResponse.nodeInfoList | Array of SwitchNodeInfo | List of switch nodes | |

GetSwitchNodeInfoListResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

SwitchNodeInfo message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| SwitchNodeInfo.loc | LocationInfo | Node location | |
| SwitchNodeInfo.numSwitches | uint32 | Number of switches in the node | |
| SwitchNodeInfo.nodeHealth | SwitchNodeHealth | NVLink health of the switch node | |
| SwitchNodeInfo.partitionIdList | Array of PartitionId | List of partitions the node is associated with | |

The client calls this RPC to get information on the switches.

Request message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetSwitchInfoListRequest.numSwitches | uint32 | Limit on number of switches for response | [Optional] Set to 0 when not used. The response includes all switches matching the filter |
| GetSwitchInfoListRequest.loc | Location | Location | |

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetSwitchInfoListResponse.switchInfoList | Array of SwitchInfo | List of switch info | |

GetSwitchInfoListResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

SwitchInfo message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| SwitchInfo.loc | LocationInfo | Node location | |
| SwitchInfo.switchId | uint32 | Switch enumeration within the node | From 1 |
| SwitchInfo.switchUid | uint64 | Switch unique identifier | |
| SwitchInfo.health | SwitchHealth | NVLink health of the switch | |
| SwitchInfo.numPorts | uint32 | Number of ports | |

The client calls this RPC to get the number of partitions. The number of partitions can be filtered using the numGpus, numNodes or health filters. The infoAttr value decides which filter to choose between numGpus and numNodes. If set to ATTR_ALL, numGpus and numNodes are ignored.

enum PartitionInfoAttr {
    NMX_PARTITION_INFO_ATTR_UNDEFINED = 0;
    NMX_PARTITION_INFO_ATTR_ALL = 1;
    NMX_PARTITION_INFO_ATTR_NUM_GPUS = 2;
    NMX_PARTITION_INFO_ATTR_NUM_COMPUTE_NODES = 3;
}

enum PartitionHealth {
    NMX_PARTITION_HEALTH_UNKNOWN = 0;
    NMX_PARTITION_HEALTH_HEALTHY = 1;
    NMX_PARTITION_HEALTH_DEGRADED_BANDWIDTH = 2;
    NMX_PARTITION_HEALTH_DEGRADED = 3;
    NMX_PARTITION_HEALTH_UNHEALTHY = 4;
}

Request message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetPartitionCountRequest.infoAttr | PartitionInfoAttr | Filter based on number of GPUs or compute nodes in a partition | |
| GetPartitionCountRequest.numGpus | uint32 | Number of GPUs in a partition | Values used only when attr=PARTITION_INFO_ATTR_NUM_GPUS |
| GetPartitionCountRequest.numNodes | uint32 | Number of compute nodes in a partition | Values used only when attr=PARTITION_INFO_ATTR_NUM_COMPUTE_NODES |
| GetPartitionCountRequest.health | PartitionHealth | NVLink health of the partition | [Optional] Set to PARTITION_HEALTH_UNKNOWN when not used |

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetPartitionCountResponse.numPartitions | uint32 | Number of partitions matching the filter | |

GetPartitionCountResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed
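The interaction between infoAttr, numGpus, numNodes and health can be sketched as a predicate over a single partition. This is a hypothetical client-side model, not server code. It assumes that the numGpus and numNodes filters match on equality and that an UNDEFINED infoAttr matches nothing; neither point is stated in the text.

```python
from enum import IntEnum


class PartitionInfoAttr(IntEnum):
    UNDEFINED = 0
    ALL = 1
    NUM_GPUS = 2
    NUM_COMPUTE_NODES = 3


class PartitionHealth(IntEnum):
    UNKNOWN = 0
    HEALTHY = 1
    DEGRADED_BANDWIDTH = 2
    DEGRADED = 3
    UNHEALTHY = 4


def partition_matches(info_attr, part_gpus, part_nodes, part_health,
                      num_gpus=0, num_nodes=0, health=PartitionHealth.UNKNOWN):
    """Return True if a partition passes the request filters."""
    # UNKNOWN health means the health filter is not used.
    if health != PartitionHealth.UNKNOWN and part_health != health:
        return False
    if info_attr == PartitionInfoAttr.ALL:
        return True  # numGpus and numNodes are ignored
    if info_attr == PartitionInfoAttr.NUM_GPUS:
        return part_gpus == num_gpus  # assumption: exact match
    if info_attr == PartitionInfoAttr.NUM_COMPUTE_NODES:
        return part_nodes == num_nodes  # assumption: exact match
    return False  # assumption: UNDEFINED matches nothing
```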

The client calls this RPC to get the partition IDs. The partition IDs can be filtered using the numGpus, numNodes or health filters. The infoAttr value decides which filter to choose between numGpus and numNodes. If set to ATTR_ALL, numGpus and numNodes are ignored.

enum PartitionInfoAttr {
    NMX_PARTITION_INFO_ATTR_UNDEFINED = 0;
    NMX_PARTITION_INFO_ATTR_ALL = 1;
    NMX_PARTITION_INFO_ATTR_NUM_GPUS = 2;
    NMX_PARTITION_INFO_ATTR_NUM_COMPUTE_NODES = 3;
}

enum PartitionHealth {
    NMX_PARTITION_HEALTH_UNKNOWN = 0;
    NMX_PARTITION_HEALTH_HEALTHY = 1;
    NMX_PARTITION_HEALTH_DEGRADED_BANDWIDTH = 2;
    NMX_PARTITION_HEALTH_DEGRADED = 3;
    NMX_PARTITION_HEALTH_UNHEALTHY = 4;
}

Request message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetPartitionIdListRequest.infoAttr | PartitionInfoAttr | Filter based on number of GPUs or compute nodes in a partition | |
| GetPartitionIdListRequest.numGpus | uint32 | Number of GPUs in a partition | Values used only when attr=PARTITION_INFO_ATTR_NUM_GPUS |
| GetPartitionIdListRequest.numNodes | uint32 | Number of compute nodes in a partition | Values used only when attr=PARTITION_INFO_ATTR_NUM_COMPUTE_NODES |
| GetPartitionIdListRequest.health | PartitionHealth | NVLink health of the partition | [Optional] Set to PARTITION_HEALTH_UNKNOWN when not used |
| GetPartitionIdListRequest.numPartitions | uint32 | Number of partitions | |

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetPartitionIdListResponse.partitionList | Array of Partition | List of partitions | |

GetPartitionIdListResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

Partition message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| Partition.partitionId | uint32 | Partition ID | |
| Partition.numGpus | uint32 | Number of GPUs in partition | |

The client calls this RPC to get partition information. The user can pass in a list of partitionIds, a list of partitionNames, or both. Only valid partitionIds or partitionNames are considered. If both lists are empty, information for all partitions is returned.

Request message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetPartitionInfoListRequest.partitionIdList | Array of PartitionId | List of partition IDs | [Optional] If IDs/Names are not provided, the response includes all provisioned partitions |
| GetPartitionInfoListRequest.partitionNameList | Array of strings | List of partition names | [Optional] If IDs/Names are not provided, the response includes all provisioned partitions |

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetPartitionInfoListResponse.partitionInfoList | Array of PartitionInfo | List of partition info | |

GetPartitionInfoListResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

PartitionInfo message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| partitionId | PartitionId | Partition ID | |
| name | string | Partition name | |
| numGpus | uint32 | Number of GPUs in partition | |
| gpuLocationList | Array of GpuLocation | GPUs location | |
| gpuUidList | Array of uint64 | GPUs unique IDs | |
| health | PartitionHealth | NVLink health of the partition | |
| partitionType | PartitionType | Partition type | PARTITION_TYPE_LOCATION_BASED if GPUs are associated by location; PARTITION_TYPE_GPUUID_BASED if GPUs are associated by gpuUid |
| numAllocatedMulticastGroups | uint32 | Number of allocated multicast groups to the partition | |
| attr.resiliencyMode | ResiliencyMode | Resiliency mode | RESILIENCY_MODE_UNDEFINED, RESILIENCY_MODE_FULL_BANDWIDTH, RESILIENCY_MODE_ADAPTIVE_BANDWIDTH, RESILIENCY_MODE_USER_ACTION_REQUIRED |
| attr.MulticastGroupsLimit | uint32 | Limit on number of multicast groups in partition | |

The ResiliencyMode values have the following meanings:

Full Bandwidth: On a trunk link failure, the partition attempts an automatic recovery. If spare trunk links are not available, GPUs are excluded from the fabric to maintain full bandwidth for the remaining GPUs.

Adaptive Bandwidth: On a trunk link failure, the partition attempts an automatic recovery. If spare trunk links are not available, the partition's GPUs operate with a lower bandwidth than optimal.

User Action Required: On a trunk link failure, the partition attempts an automatic recovery. If spare trunk links are not available, the partition goes into an unhealthy state that requires user action for recovery. Example actions are providing additional trunk links or removing GPUs from the partition.
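The three modes differ only in what happens when no spare trunk links are available. The sketch below captures that decision as a small function. The enum, the function name, and the returned outcome labels are illustrative assumptions, and the sketch assumes that automatic recovery succeeds whenever spare trunk links exist, which the text implies but does not state.

```python
from enum import Enum


class ResiliencyMode(Enum):
    # Illustrative mirror of the documented modes (UNDEFINED omitted).
    FULL_BANDWIDTH = "full_bandwidth"
    ADAPTIVE_BANDWIDTH = "adaptive_bandwidth"
    USER_ACTION_REQUIRED = "user_action_required"


def outcome_after_trunk_failure(mode, spare_trunks_available):
    """Describe the partition state after a trunk link failure."""
    # Every mode first attempts automatic recovery with spare trunk links.
    if spare_trunks_available:
        return "recovered"
    if mode is ResiliencyMode.FULL_BANDWIDTH:
        return "gpus_excluded"      # remaining GPUs keep full bandwidth
    if mode is ResiliencyMode.ADAPTIVE_BANDWIDTH:
        return "reduced_bandwidth"  # all GPUs stay, at lower bandwidth
    if mode is ResiliencyMode.USER_ACTION_REQUIRED:
        return "unhealthy"          # waits for user action
    raise ValueError(f"unhandled resiliency mode: {mode}")
```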

The client calls this RPC to create a partition. The user provides a list of GPU UIDs or a list of GPU locations. If both lists are empty, the RPC returns NMX_ST_BADPARAM. The user can specify the attributes for the partition to be created.

If the partition creation is requested with the resiliency mode NMX_RESILIENCY_MODE_UNDEFINED, the default resiliency mode specified by MNNVL_DEFAULT_RESILIENCY_MODE in the configuration file is used.

If the partition creation is requested with a multicastGroupsLimit that cannot be satisfied, the RPC returns NMX_ST_RESOURCE_EXHAUSTED. If the value is not a multiple of 4, the RPC returns NMX_ST_BADPARAM.

Partition IDs are allocated from 1 to 0x7FFD. Partition ID 0x7FFE is reserved for the default partition. The user can specify a partitionId as part of the creation request. If the specified ID is greater than 0x7FFD, the RPC returns NMX_ST_BADPARAM.

If a partition creation request succeeds, and a later request to create another partition with the same set of parameters is received, the RPC returns NMX_ST_PARTITION_EXISTS

If a partition cannot be created owing to exhaustion of partitionIds, the RPC returns NMX_ST_RESOURCE_EXHAUSTED

If a requested GPU is already part of another partition, the RPC returns NMX_ST_RESOURCE_USED_IN_ANOTHER_PARTITION

If a requested GPU does not have a valid UID or a valid location, the RPC returns NMX_ST_RESOURCE_BAD

If the requested partitionId is already in use by another partition, the RPC returns NMX_ST_PARTITION_ID_IN_USE

If the requested partitionName is already in use by another partition, the RPC returns NMX_ST_PARTITION_NAME_IN_USE

If the creation fails due to any other error, the RPC returns NMX_ST_GENERIC_ERROR
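A client can catch the purely parameter-related failures before sending the request. The sketch below models only the NMX_ST_BADPARAM conditions listed above; the function and constant names are hypothetical, and conditions that depend on server state (IDs or names in use, resource exhaustion, GPU membership) are deliberately left out.

```python
MAX_USER_PARTITION_ID = 0x7FFD  # IDs 1..0x7FFD are allocatable
DEFAULT_PARTITION_ID = 0x7FFE   # reserved for the default partition


def precheck_create_partition(gpu_uids, gpu_locations, partition_id=0,
                              multicast_groups_limit=0):
    """Return "NMX_ST_BADPARAM" if a documented parameter rule is violated.

    Returns None when none of the modeled rules is violated. A partition_id
    of 0 asks the system to auto-generate the ID.
    """
    if not gpu_uids and not gpu_locations:
        return "NMX_ST_BADPARAM"  # both GPU lists are empty
    if partition_id > MAX_USER_PARTITION_ID:
        return "NMX_ST_BADPARAM"  # outside the allowed ID range
    if multicast_groups_limit % 4 != 0:
        return "NMX_ST_BADPARAM"  # limit must be a multiple of 4
    return None
```

A result of None here does not guarantee success; the server can still reject the request for any of the state-dependent reasons.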

Request message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| CreatePartitionRequest.name | string | Partition name | [Optional] Must be unique in domain if provided |
| CreatePartitionRequest.gpuResourceId | Array of GpuResourceId | Either gpuLocation or gpuUid | The GPU can be allocated either by GPU location or GPU unique ID |
| CreatePartitionRequest.attr.resiliencyMode | ResiliencyMode | Resiliency mode | RESILIENCY_MODE_UNDEFINED, RESILIENCY_MODE_FULL_BANDWIDTH, RESILIENCY_MODE_ADAPTIVE_BANDWIDTH, RESILIENCY_MODE_USER_ACTION_REQUIRED |
| CreatePartitionRequest.attr.MulticastGroupsLimit | uint32 | Limit on number of multicast groups in partition | |
| CreatePartitionRequest.partitionId | PartitionId | Partition ID | [Optional] Set to 0 for system to auto-generate ID |

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| CreatePartitionResponse.partitionId | PartitionId | Partition ID | User-provided partition ID, or system auto-generated ID (if the user did not specify one) |

CreatePartitionResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - Partition was created successfully

NMX_ST_BADPARAM - Invalid input parameter, requested multicastGroupsLimit is not a multiple of 4, or requested partitionId is greater than 0x7FFD (allowed range is 0x1-0x7FFD)

NMX_ST_GENERIC_ERROR - Call to GFM API has failed due to an internal error

NMX_ST_RESOURCE_EXHAUSTED - Requested multicastGroupsLimit cannot be satisfied, or partitionIds are exhausted

NMX_ST_PARTITION_EXISTS - Partition with requested parameters already exists

NMX_ST_RESOURCE_USED_IN_ANOTHER_PARTITION - Requested GPU is already member of another partition

NMX_ST_RESOURCE_USED_IN_THIS_PARTITION - Requested GPU is already member of the same partition

NMX_ST_RESOURCE_BAD - Requested resource does not have a valid UID or a location

NMX_ST_PARTITION_ID_IN_USE - Requested partitionId is already in use

NMX_ST_PARTITION_NAME_IN_USE - Requested partitionName is already in use

NMX_ST_PARTITION_MISWIRED_TRUNKS - miswired trunks are detected

NMX_ST_PARTITION_INSUFFICIENT_TRUNKS - insufficient trunk links to complete the operation

NMX_ST_PARTITION_MISSING_SWITCHES - missing switches in the NVL domain

NMX_ST_PARTITION_SUBNET_ERROR - error occurred in subnet manager

NMX_ST_PARTITION_ROUTE_PROGRAMING_ERROR - partition has route programming error

The client calls this RPC to delete a partition. The user provides a valid partition name, a partition ID, or both as part of the request.

Request message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| DeletePartitionRequest.partitionId | PartitionId | Partition ID | Optional if partition name is provided |
| DeletePartitionRequest.name | string | Partition name | Optional if partition ID is provided |

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| DeletePartitionResponse.partitionId | PartitionId | | |

DeletePartitionResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - Partition was deleted successfully

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

NMX_ST_PARTITION_ID_NOT_IN_USE - Requested partitionId not in use

The client calls this RPC to add GPUs to a partition. In a partition that is contained within a chassis boundary, the reroute flag is ignored. In a partition that uses trunk links, and hence crosses the chassis boundary, the reroute flag determines whether the trunk link routing is adjusted:

If reroute = true (default), the trunk link routing is adjusted when additional GPUs are added to the partition. This can disrupt applications running in the partition.

If reroute = false, the trunk link routing is not adjusted when additional GPUs are added to the partition.

The RPC also allows a special operation called "reroute", where both the locationList and gpuUidList are empty and reroute is set to true. This allows the client to adjust the trunk link routing (i.e. "reroute") for the partition to use an optimal number of trunk links. This operation can cause traffic disruption and must be used with caution.

When either the location list or the gpuUid list is populated and it does not match the type (location or GPU UID) with which the partition currently operates, the RPC returns NMX_ST_NOT_SUPPORTED. The type of the partition can be determined from the PartitionInfo message returned by the GetPartitionInfoList() RPC call.

If a requested GPU does not have a valid UID or a valid location, the RPC returns NMX_ST_RESOURCE_BAD

If the requested partition ID is not in use, the RPC returns NMX_ST_PARTITION_ID_NOT_IN_USE

If the GPU to be added is already part of the partition, the RPC returns NMX_ST_RESOURCE_USED_IN_THIS_PARTITION

If the GPU to be added is already part of another partition, the RPC returns NMX_ST_RESOURCE_USED_IN_ANOTHER_PARTITION
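The membership and type checks above can be modeled as follows. This is a hedged sketch: the dictionary layout, the function name, and in particular the order in which the checks run are assumptions, since the text does not say which condition the server evaluates first. The NMX_ST_RESOURCE_BAD check is omitted because it depends on device state.

```python
def check_add_gpus(partitions, partition_id, request_type, gpus):
    """Return the documented status name for the first failed check.

    partitions maps partition ID -> {"type": "location" | "gpu_uid",
    "gpus": set of GPU identifiers} (hypothetical layout).
    Returns None if none of the modeled checks fails.
    """
    target = partitions.get(partition_id)
    if target is None:
        return "NMX_ST_PARTITION_ID_NOT_IN_USE"
    # A populated list must match the type the partition operates with.
    if gpus and request_type != target["type"]:
        return "NMX_ST_NOT_SUPPORTED"
    for gpu in gpus:
        if gpu in target["gpus"]:
            return "NMX_ST_RESOURCE_USED_IN_THIS_PARTITION"
        for other_id, other in partitions.items():
            if other_id != partition_id and gpu in other["gpus"]:
                return "NMX_ST_RESOURCE_USED_IN_ANOTHER_PARTITION"
    return None
```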

Request message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| UpdatePartitionRequest.partitionId | PartitionId | Partition ID | Optional if partition name is provided |
| UpdatePartitionRequest.locationList | Array of GpuLocation | List of GPU locations | Provide only if PartitionType=PARTITION_TYPE_LOCATION_BASED |
| UpdatePartitionRequest.gpuUid | Array of gpuUid | | Provide only if PartitionType=PARTITION_TYPE_GPUUID_BASED |
| UpdatePartitionRequest.name | string | Partition name | Optional if partition ID is provided |
| UpdatePartitionRequest.reroute | Boolean | Reroute partition on update | Default is true; will be deprecated and then removed in future releases |

The user can request a partition reroute by setting:

locationList to empty array

gpuUid to empty array

reroute to true
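As an illustration of the three settings above, the sketch below builds a reroute-only request and recognizes one. A plain dictionary stands in for the generated UpdatePartitionRequest message class, whose actual Python API is not shown in this document; the helper names are hypothetical, while the keys follow the field names in the request table.

```python
def make_reroute_request(partition_id):
    """Build a reroute-only update: no GPUs listed, reroute set to true."""
    return {
        "partitionId": partition_id,
        "locationList": [],
        "gpuUid": [],
        "reroute": True,
    }


def is_reroute_only(request):
    """True when both GPU lists are empty and reroute is true (the default)."""
    return bool(
        not request.get("locationList")
        and not request.get("gpuUid")
        and request.get("reroute", True)
    )
```

Because a reroute can disrupt traffic, a client may want to call a check like `is_reroute_only` and ask for confirmation before sending such a request.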

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| UpdatePartitionResponse.partitionId | PartitionId | | |

UpdatePartitionResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - Partition was updated successfully

NMX_ST_BADPARAM - Invalid input parameter, requested partitionId does not exist

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

NMX_ST_NOT_SUPPORTED - Requested type (location or GPU UID) does not match the partition type

NMX_ST_RESOURCE_BAD - Requested resource does not have a valid UID or a location

NMX_ST_PARTITION_ID_NOT_IN_USE - Requested partitionId not in use

NMX_ST_RESOURCE_USED_IN_ANOTHER_PARTITION - Requested GPU is already member of another partition

NMX_ST_RESOURCE_USED_IN_THIS_PARTITION - Requested GPU is already member of the same partition

NMX_ST_PARTITION_MISWIRED_TRUNKS - miswired trunks are detected

NMX_ST_PARTITION_INSUFFICIENT_TRUNKS - insufficient trunk links to complete the operation

NMX_ST_PARTITION_MISSING_SWITCHES - missing switches in the NVL domain

NMX_ST_PARTITION_SUBNET_ERROR - error occurred in subnet manager

NMX_ST_PARTITION_ROUTE_PROGRAMING_ERROR - partition has route programming error

The client calls this RPC to remove GPUs from a partition. The UpdatePartitionRequest and UpdatePartitionResponse messages are the same as in AddGpusToPartition.

If the number of GPUs to be removed is equal to the number of GPUs in the partition, the RPC returns NMX_ST_NOT_SUPPORTED

If the number of GPUs to be removed is greater than the number of GPUs in the partition, the RPC returns NMX_ST_BADPARAM

If the GPU to be removed is not part of the partition, the RPC returns NMX_ST_RESOURCE_NOT_IN_USE
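The three removal rules can be expressed as a short check. This is a sketch under assumptions: the function name is hypothetical, and the order of the checks (count rules before the membership rule) is a guess, since the text does not define precedence.

```python
def check_remove_gpus(partition_gpus, gpus_to_remove):
    """Return the documented status name for the first failed rule.

    Returns None if none of the modeled rules fails.
    """
    current = set(partition_gpus)
    requested = set(gpus_to_remove)
    if len(requested) > len(current):
        return "NMX_ST_BADPARAM"             # more GPUs than the partition has
    if len(requested) == len(current):
        return "NMX_ST_NOT_SUPPORTED"        # would remove every GPU
    if not requested <= current:
        return "NMX_ST_RESOURCE_NOT_IN_USE"  # a GPU is not in the partition
    return None
```

Note that removing all GPUs is rejected rather than treated as a delete; the partition has to be deleted explicitly.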

UpdatePartitionResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - Partition was updated successfully

NMX_ST_BADPARAM - Invalid input parameter, or requested number of GPUs to be removed is greater than the number of GPUs in the partition

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

NMX_ST_NOT_SUPPORTED - Requested number of GPUs to be removed is equal to the number of GPUs in the partition

NMX_ST_RESOURCE_NOT_IN_USE - Requested resource to be removed is not part of the partition

NMX_ST_PARTITION_MISWIRED_TRUNKS - miswired trunks are detected

NMX_ST_PARTITION_INSUFFICIENT_TRUNKS - insufficient trunk links to complete the operation

NMX_ST_PARTITION_MISSING_SWITCHES - missing switches in the NVL domain

NMX_ST_PARTITION_SUBNET_ERROR - error occurred in subnet manager

NMX_ST_PARTITION_ROUTE_PROGRAMING_ERROR - partition has route programming error

The client calls this RPC to get the number of fabric connections. It allows the client to filter the connections based on Attribute, Type and Location.

enum ConnAttr {
    NMX_NVLINK_CONN_ATTR_UNKNOWN = 0;
    NMX_NVLINK_CONN_ATTR_EXPECTED = 1;
    NMX_NVLINK_CONN_ATTR_DISCOVERED = 2;
    NMX_NVLINK_CONN_ATTR_EXPECTED_ACTIVE = 3;
    NMX_NVLINK_CONN_ATTR_EXPECTED_INACTIVE = 4;
    NMX_NVLINK_CONN_ATTR_UNEXPECTED = 5;
}

enum ConnType {
    NMX_NVLINK_CONN_TYPE_UNKNOWN = 0;
    NMX_NVLINK_CONN_TYPE_ALL = 1;
    NMX_NVLINK_CONN_TYPE_GPU = 2;
    NMX_NVLINK_CONN_TYPE_SWITCH = 3;
}

The following combinations of attribute and type illustrate how to query specific categories of connections:

| Connection category | Connection Attribute | Connection Type |
|---|---|---|
| Access discovered | CONN_ATTR_DISCOVERED | CONN_TYPE_GPU |
| Trunk discovered | CONN_ATTR_DISCOVERED | CONN_TYPE_SWITCH |
| All discovered | CONN_ATTR_DISCOVERED | CONN_TYPE_ALL |
| Access expected | CONN_ATTR_EXPECTED | CONN_TYPE_GPU |
| Access inactive | CONN_ATTR_EXPECTED_INACTIVE | CONN_TYPE_GPU |
| Trunk unexpected | CONN_ATTR_UNEXPECTED | CONN_TYPE_SWITCH |
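The attribute/type combinations listed above lend themselves to a lookup table on the client side. The sketch below mirrors the two enums with Python `IntEnum` classes (shortened member names, same numeric values) and maps each category to its filter pair. The `CATEGORY_FILTERS` name and the category keys are illustrative, not part of the API.

```python
from enum import IntEnum


class ConnAttr(IntEnum):
    UNKNOWN = 0
    EXPECTED = 1
    DISCOVERED = 2
    EXPECTED_ACTIVE = 3
    EXPECTED_INACTIVE = 4
    UNEXPECTED = 5


class ConnType(IntEnum):
    UNKNOWN = 0
    ALL = 1
    GPU = 2
    SWITCH = 3


# Category -> (connAttr, connType) to place in the request.
CATEGORY_FILTERS = {
    "access discovered": (ConnAttr.DISCOVERED, ConnType.GPU),
    "trunk discovered": (ConnAttr.DISCOVERED, ConnType.SWITCH),
    "all discovered": (ConnAttr.DISCOVERED, ConnType.ALL),
    "access expected": (ConnAttr.EXPECTED, ConnType.GPU),
    "access inactive": (ConnAttr.EXPECTED_INACTIVE, ConnType.GPU),
    "trunk unexpected": (ConnAttr.UNEXPECTED, ConnType.SWITCH),
}
```

In this table, GPU-type connections correspond to access links and switch-type connections to trunk links.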

Request message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetConnCountRequest.connType | ConnType | Filter based on connection type | Connection types are NVLINK_CONN_TYPE_GPU and NVLINK_CONN_TYPE_SWITCH. Specify NVLINK_CONN_TYPE_ALL to include both |
| GetConnCountRequest.connAttr | ConnAttr | Filter based on discovered and expected connections | Discovered connections can be Active/Unexpected. Expected connections can be Active/Inactive/Missing. |
| GetConnCountRequest.loc | Location | Filter connections for a specific location | |

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetConnCountResponse.numConns | uint32 | Number of connections | |
| GetConnCountResponse.timestamp | string | Timestamp from when the connection database was last updated | |

GetConnCountResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - Successfully returned the connection count

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

The client calls this RPC to get information on the fabric connections. It allows the client to filter the connections based on Attribute, Type and Location.

enum ConnAttr {
    NMX_NVLINK_CONN_ATTR_UNKNOWN = 0;
    NMX_NVLINK_CONN_ATTR_EXPECTED = 1;
    NMX_NVLINK_CONN_ATTR_DISCOVERED = 2;
    NMX_NVLINK_CONN_ATTR_EXPECTED_ACTIVE = 3;
    NMX_NVLINK_CONN_ATTR_EXPECTED_INACTIVE = 4;
    NMX_NVLINK_CONN_ATTR_UNEXPECTED = 5;
}

enum ConnType {
    NMX_NVLINK_CONN_TYPE_UNKNOWN = 0;
    NMX_NVLINK_CONN_TYPE_ALL = 1;
    NMX_NVLINK_CONN_TYPE_GPU = 2;
    NMX_NVLINK_CONN_TYPE_SWITCH = 3;
}

The API returns a list of connections and the state of each connection:

enum ConnState {
    NMX_NVLINK_CONN_STATE_UNKNOWN = 0;
    NMX_NVLINK_CONN_STATE_ACTIVE = 1;
    NMX_NVLINK_CONN_STATE_INACTIVE = 2;
}

GetConnInfoListRequest message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetConnInfoListRequest.connType | ConnType | Filter based on connection type | |
| GetConnInfoListRequest.connAttr | ConnAttr | Filter based on discovered and expected connections | |
| GetConnInfoListRequest.loc | Location | Filter connections for a specific location | |
| GetConnInfoListRequest.numConns | uint32 | Number of connections | |

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| GetConnInfoListResponse.connInfoList | Array of ConnInfo | List of connection information | |
| GetConnInfoListResponse.timestamp | string | Timestamp from when the connection database was last updated | |

GetConnInfoListResponse.serverHeader.returnCode:

NMX_ST_SUCCESS - Successfully returned the connection information

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

ConnInfo message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| endPointA | LinkEndPoint | One end of the connection (a device port) | |
| endPointB | LinkEndPoint | The other end of the connection (a device port) | |
| connType | ConnType | Connection type | |
| connState | ConnState | Connection state | NVLINK_CONN_STATE_ACTIVE or NVLINK_CONN_STATE_INACTIVE |

LinkEndPoint message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| loc | Location | Location of the node of the endpoint | |
| switchOrGpuId | uint32 | Location of the device (GPU/switch) within the node of the endpoint | |
| cageNum | uint32 | Cage number of the endpoint | Valid only for NVLINK_CONN_TYPE_SWITCH, otherwise set to 0 |
| cagePortNum | uint32 | Cage port number of the endpoint | Valid only for NVLINK_CONN_TYPE_SWITCH, otherwise set to 0 |
| cagePortSplitNum | uint32 | Cage split port number of the endpoint | Valid only for NVLINK_CONN_TYPE_SWITCH, otherwise set to 0 |
| portNum | uint32 | Port number of the endpoint on the device (GPU/switch) | |
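To show how the ConnInfo and LinkEndPoint messages fit together, the sketch below models them as Python dataclasses and filters a list for inactive connections. These classes are illustrative stand-ins, not the generated protobuf types: field names are converted to snake_case, `loc` is represented as a plain string because the Location message is not described here, and the cage fields default to 0 as the notes specify for non-switch connections.

```python
from dataclasses import dataclass
from enum import IntEnum


class ConnType(IntEnum):
    UNKNOWN = 0
    ALL = 1
    GPU = 2
    SWITCH = 3


class ConnState(IntEnum):
    UNKNOWN = 0
    ACTIVE = 1
    INACTIVE = 2


@dataclass(frozen=True)
class LinkEndPoint:
    loc: str                 # stand-in for the Location message
    switch_or_gpu_id: int
    port_num: int
    # Cage fields are meaningful only for switch connections, else 0.
    cage_num: int = 0
    cage_port_num: int = 0
    cage_port_split_num: int = 0


@dataclass(frozen=True)
class ConnInfo:
    end_point_a: LinkEndPoint
    end_point_b: LinkEndPoint
    conn_type: ConnType
    conn_state: ConnState


def inactive_connections(conn_infos):
    """Return the connections whose state is INACTIVE."""
    return [c for c in conn_infos if c.conn_state is ConnState.INACTIVE]
```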

The client calls this RPC to get information on the fabric trunk connections that are mis-wired.

Response message:

| Message/Parameter | Type | Description | Notes |
|---|---|---|---|
| ConnInfoCombined.unexpectedConnList | Array of ConnInfo | List of mis-wired trunk connections | |

ConnInfoCombined.serverHeader.returnCode:

NMX_ST_SUCCESS - successfully completed the call

NMX_ST_BADPARAM - Invalid input parameter

NMX_ST_GENERIC_ERROR - Call to GFM API has failed

The client calls this RPC to perform a factory reset of the NMX-Controller. After this call completes, the NMX-Controller configuration and state are as initially delivered from the factory.

ReturnCode: