On This Page
- Common Messages and Properties used in Multiple APIs
- Common Return Codes
- RPC APIs
- Hello(ClientHello) returns (ServerHello)
- Subscribe(SubscribeRequest) returns (stream ServerNotification)
- GetStaticConfig(GetStaticConfigRequest) returns (StaticConfigResponse)
- SetStaticConfig(SetStaticConfigRequest) returns (ReturnCode)
- GetDomainProperties(GetDomainPropertiesRequest) returns (DomainProperties)
- GetDomainStateInfo(GetDomainStateInfoRequest) returns (DomainStateInfo)
- GetTopologyInfo returns (FmTopologyInfo)
- GetComputeNodeCount(GetComputeNodeCountRequest) returns (GetComputeNodeCountResponse)
- GetComputeNodeLocationList(GetComputeNodeLocationListRequest) returns (GetComputeNodeLocationListResponse)
- GetComputeNodeInfoList(GetComputeNodeInfoListRequest) returns (GetComputeNodeInfoListResponse)
- GetGpuInfoList(GetGpuInfoListRequest) returns (GetGpuInfoListResponse)
- GetSwitchNodeCount(GetSwitchNodeCountRequest) returns (GetSwitchNodeCountResponse)
- GetSwitchNodeLocationList(GetSwitchNodeLocationListRequest) returns (GetSwitchNodeLocationListResponse)
- GetSwitchNodeInfoList(GetSwitchNodeInfoListRequest) returns (GetSwitchNodeInfoListResponse)
- GetSwitchInfoList(GetSwitchInfoListRequest) returns (GetSwitchInfoListResponse)
- GetPartitionCount(GetPartitionCountRequest) returns (GetPartitionCountResponse)
- GetPartitionIdList(GetPartitionIdListRequest) returns (GetPartitionIdListResponse)
- GetPartitionInfoList(GetPartitionInfoListRequest) returns (GetPartitionInfoListResponse)
- CreatePartition(CreatePartitionRequest) returns (CreatePartitionResponse)
- DeletePartition(DeletePartitionRequest) returns (DeletePartitionResponse)
- AddGpusToPartition(UpdatePartitionRequest) returns (UpdatePartitionResponse)
- RemoveGpusFromPartition(UpdatePartitionRequest) returns (UpdatePartitionResponse)
- GetConnCount(GetConnCountRequest) returns (GetConnCountResponse)
- GetConnInfoList(GetConnInfoListRequest) returns (GetConnInfoListResponse)
- GetConnInfoCombined(GetConnInfoCombinedRequest) returns (ConnInfoCombined)
- FactoryReset(FactoryResetRequest) returns (ReturnCode)
gRPC API Documentation
To ensure compatibility, implement the gRPC client from the NVIDIA-provided NMX-Controller gRPC proto file.
Property
Type
Description
Notes
partitionId
uint32
The partition APIs accept partitionId to identify a specific partition instance.
Values of 1 to 32766
Note: The value of 32766 is reserved for default partition
partitionName
String
The partition APIs accept partitionName to identify a specific partition instance.
partitionName (if provided) must be unique within the domain.
Values up to 255 ASCII characters
Note: The value of "Default Partition" is reserved for default partition
returnCode
enum (ST_ReturnCode)
The value of
In an unsolicited server notification, returnCode value is always
GatewayId
String
Uniquely identifies the client establishing the gRPC connection.
ServerHeader
A message sent in every response and notification from the server. The message includes:
KeyValPair
A key (string) and its associated value (string).
ConfigKey
An identifier of a static configuration parameter. Includes configFileName (string) and key (string).
Context
String
A string that should be set to
Location
A message that provides details on a node location:
LocationInfo
A message that provides details on a node location with additional physical properties:
gpuLocation
A message that provides details on a GPU location:
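The partitionId and partitionName constraints listed in the table above can be checked on the client before a request is sent. The following is a minimal sketch under stated assumptions: the helper names are hypothetical and not part of the proto file, and the server remains the authority on what it accepts.

```python
from typing import Optional

DEFAULT_PARTITION_ID = 32766                   # reserved for the default partition
DEFAULT_PARTITION_NAME = "Default Partition"   # reserved for the default partition
MAX_PARTITION_NAME_LEN = 255


def check_partition_id(partition_id: int) -> Optional[str]:
    """Return an error string, or None if the ID can name a user partition."""
    if not 1 <= partition_id <= DEFAULT_PARTITION_ID:
        return "partitionId must be in the range 1 to 32766"
    if partition_id == DEFAULT_PARTITION_ID:
        return "partitionId 32766 is reserved for the default partition"
    return None


def check_partition_name(name: str) -> Optional[str]:
    """Return an error string, or None if the name can name a user partition."""
    if len(name) > MAX_PARTITION_NAME_LEN:
        return "partitionName is limited to 255 characters"
    if not name.isascii():
        return "partitionName must contain only ASCII characters"
    if name == DEFAULT_PARTITION_NAME:
        return "this partitionName is reserved for the default partition"
    return None
```

Uniqueness of partitionName within the domain cannot be checked locally and is left to the server.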
The following return codes are used in multiple RPC calls:
NMX_ST_CONNECTION_NOT_VALID – returned when any RPC call is received before the Hello() RPC has completed successfully; the server then closes the connection
NMX_ST_UNINITIALIZED - returned for any RPC call, except the calls listed below, that is issued while the Control Plane is in an outage condition or its state is other than NMX_CONTROL_PLANE_STATE_CONFIGURED (see GetDomainStateInfo for additional information)
The following RPC calls are allowed in any state of the Control Plane:
Hello
Subscribe
FactoryReset
GetStaticConfig
SetStaticConfig
GetDomainStateInfo
NMX_ST_NMX_CONTROLLER_INTERNAL_ERROR - RPC call processing failed due to internal error
NMX_ST_NMX_CONTROLLER_DB_ERROR - RPC call processing failed due to DB read/write error
NMX_ST_NMX_CONTROLLER_DB_CORRUPTION – NMX-C is not in operational state due to corrupted DB persistence file. FactoryReset call should be performed as a remedy
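The gating rules above can be summarized as a small client-side predicate. This is an illustrative sketch only: the function name is hypothetical, states are passed as plain strings, and the "outage condition" mentioned above is not modeled because the text does not say how a client would detect it.

```python
from typing import Optional

# RPCs the text lists as allowed in any Control Plane state.
ALLOWED_IN_ANY_STATE = frozenset({
    "Hello", "Subscribe", "FactoryReset",
    "GetStaticConfig", "SetStaticConfig", "GetDomainStateInfo",
})
CONFIGURED = "NMX_CONTROL_PLANE_STATE_CONFIGURED"


def expected_gate_code(rpc_name: str, hello_completed: bool,
                       control_plane_state: str) -> Optional[str]:
    """Return the common return code the call is expected to hit, or None."""
    # Any RPC other than Hello is rejected until Hello has succeeded.
    if rpc_name != "Hello" and not hello_completed:
        return "NMX_ST_CONNECTION_NOT_VALID"
    # Most RPCs require the Control Plane to be in the CONFIGURED state.
    if rpc_name not in ALLOWED_IN_ANY_STATE and control_plane_state != CONFIGURED:
        return "NMX_ST_UNINITIALIZED"
    return None
```

A client could use such a check to avoid issuing calls that are certain to fail, for example by polling GetDomainStateInfo until the state is NMX_CONTROL_PLANE_STATE_CONFIGURED.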
Hello(ClientHello) returns (ServerHello)
The first RPC call that the client must send after the gRPC connection is established.
Request:
Message/Parameter
Type
Description
Notes
ClientHello.gatewayId
String
Client identifier
ClientHello.major_version
ProtoMsgMajorVersion
Major version
ClientHello.minor_version
ProtoMsgMinorVersion
Minor version
Response:
Message/Parameter
Type
Description
Notes
ServerHello.serverHeader
ServerHeader
Base server info and return code
ServerHello.components_ver
Array of KeyValPair
List of NMX-Controller components, and their version
ServerHello.capabilities
Array of string
List of NMX-Controller capabilities
ServerHello.host_os_details
String
NMX-Controller host OS details
ClientHello.major_version
ProtoMsgMajorVersion
Major version
ClientHello.minor_version
ProtoMsgMinorVersion
Minor version
ServerHello.serverHeader.returnCode:
NMX_ST_SUCCESS - Client hello was successful
NMX_ST_BADPARAM - Missing or invalid gatewayId (e.g. empty string)
NMX_ST_VERSION_MISMATCH - Major version of client and server protos do not match
Server closes the gRPC connection if:
Client ProtoMsgMajorVersion is not the same as server
Client attempts to call any other RPC before successful completion of Hello() RPC.
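The Hello return codes above can be anticipated with a pre-check before the call is made. The sketch below is an assumption-laden illustration: the function name is hypothetical, the versions are treated as plain integers, and the order in which the server evaluates the two conditions is not stated in this document.

```python
def precheck_hello(gateway_id: str, client_major: int, server_major: int) -> str:
    """Predict the returnCode of Hello() from the rules listed above."""
    # An empty gatewayId is given above as an example of an invalid value.
    if not gateway_id:
        return "NMX_ST_BADPARAM"
    # Only the major version has to match; the minor version is not compared.
    if client_major != server_major:
        return "NMX_ST_VERSION_MISMATCH"
    return "NMX_ST_SUCCESS"
```

In practice the server major version is known only from the proto file the client was built against, so this check is mainly useful in tests.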
Subscribe(SubscribeRequest) returns (stream ServerNotification)
The client calls this RPC to subscribe to asynchronous push notifications. SubscribeRequest carries the gatewayID and the notifyOnSelfChange flag, which should be set to false.
Asynchronous push notifications:
Message/Parameter
Type
Description
Notes
ServerNotification.subscriptionResponse
SubscriptionResponse
Confirmation of subscription
Sent only to requesting client
ServerNotification.staticConfigResponse
StaticConfigResponse
Notification of static config change
Includes changed items
Sent to all subscribed clients except for the requesting client
ServerNotification.CreatePartitionResponse
CreatePartitionResponse
Notification of partition creation
includes partitionId
Sent to all subscribed clients except for the requesting client
ServerNotification.DeletePartitionResponse
DeletePartitionResponse
Notification of partition deletion
includes partitionId
Sent to all subscribed clients except for the requesting client
ServerNotification.UpdatePartitionResponse
UpdatePartitionResponse
Notification of partition configuration change
includes partitionId
Sent to all subscribed clients except for the requesting client
ServerNotification.fmEvent.fmEventControlPlaneStateChange
FmEventControlPlaneStateChange
Notification of change in control plane state
Sent to all subscribed clients
ServerNotification.fmEvent.fmEventTopologyChange
FmEventTopologyChange
Notification of change in discovered topology
Sent to all subscribed clients
ServerNotification.fmEvent.fmEventPartitionChange
FmEventPartitionChange
Notification of change in partition health
includes partitionId
Sent to all subscribed clients
ServerNotification.healthStateChanged
HealthStateChanged
Notification of change in NMX-C health
Sent to all subscribed clients
ServerNotification.serverHeader.returnCode:
NMX_ST_SUCCESS - Client subscription was successful
NMX_ST_BADPARAM - Missing or invalid gatewayId
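The delivery rules in the notification table above can be modeled as follows. This is a sketch for reasoning about which clients see which notification, not server code; the function name and the use of gateway ID strings as subscriber handles are assumptions.

```python
from typing import List, Optional, Sequence

# Notifications caused by a client request; not echoed to the requester.
EXCLUDES_REQUESTER = frozenset({
    "staticConfigResponse", "CreatePartitionResponse",
    "DeletePartitionResponse", "UpdatePartitionResponse",
})
# Notifications that go to every subscribed client.
SENT_TO_ALL = frozenset({
    "fmEventControlPlaneStateChange", "fmEventTopologyChange",
    "fmEventPartitionChange", "healthStateChanged",
})


def recipients(kind: str, subscribers: Sequence[str],
               requester: Optional[str] = None) -> List[str]:
    """Return the gateway IDs expected to receive a notification of this kind."""
    if kind == "subscriptionResponse":
        return [requester] if requester is not None else []
    if kind in EXCLUDES_REQUESTER:
        return [s for s in subscribers if s != requester]
    if kind in SENT_TO_ALL:
        return list(subscribers)
    raise ValueError(f"unknown notification kind: {kind}")
```

A client that changes the configuration therefore learns the outcome from the RPC response itself, while other subscribers learn it from the stream.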
GetStaticConfig(GetStaticConfigRequest) returns (StaticConfigResponse)
The client calls this RPC to read the current static configuration. The client may request either full files or individual parameters from the files.
Request:
Message/Parameter
Type
Description
Notes
GetStaticConfigRequest.configKeys
Array of configKey
List of configuration parameters
Either configKeys or configFiles should be provided
GetStaticConfigRequest.configFiles
Array of configFileName
List of configuration files
Either configKeys or configFiles should be provided
Supported values for configFileName:
sm_config
fm_config
rdm_config
chassis_mapping
Response:
Message/Parameter
Type
Description
Notes
staticConfig.configKeyVals
Array of configKeyVals
List of configuration parameters and their values
Either configKeyVals or configFileContents is provided, depending on request
staticConfig.configFileContents
Array of ConfigFileContent
List of configuration files and their content
Either configKeyVals or configFileContents is provided, depending on request
StaticConfigResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - Client request for static configuration was successful
NMX_ST_BADPARAM - Missing or invalid input parameter (e.g. empty strings)
Example of fm_config file content:
# Description: Determine whether a default partition needs to be created
# Possible Values:
# 0 - No partitions are created during GFM initialization. GFM disables routing until an API request
# to create a partition is successful.
# 1(default) - Creates a default partition during GFM initialization. GFM creates the partition to include
# all GPUs in the topology and enables routing so that all GPUs can communicate to each other.
MNNVL_ENABLE_DEFAULT_PARTITION=1
# Description: Determine resiliency mode for default partition and when it is unspecified on a user partition
# Possible Values:
# 1 - resiliency mode RESILIENCY_MODE_FULL_BANDWIDTH
# 2(default) - resiliency mode RESILIENCY_MODE_ADAPTIVE_BANDWIDTH
# 3 - resiliency mode RESILIENCY_MODE_USER_ACTION_REQ
MNNVL_DEFAULT_RESILIENCY_MODE=2
# Description: Set type of default partition (specified by location or gpuuid)
# Possible Values:
# 1 - Creates a default partition using locations of GPUs
# 2(default) - Creates a default partition using GPU UIDs
MNNVL_DEFAULT_PARTITION_TYPE=2
MNNVL_TOPOLOGY=gb200_nvl36r1_c2g4_topology
SetStaticConfig(SetStaticConfigRequest) returns (ReturnCode)
The client calls this RPC to update the static configuration. Client may request to update either full files, or parameters in the files.
Request:
Message/Parameter
Type
Description
Notes
SetStaticConfigRequest.staticConfig.configKeyVals
Array of configKey
List of configuration parameters and their values
Either configKeyVals or configFileContents should be provided
SetStaticConfigRequest.staticConfig.configFileContents
Array of configFileContent
List of configuration files and their content
Either configKeyVals or configFileContents should be provided
Supported values for configFileName:
sm_config
fm_config
rdm_config
chassis_mapping
All config files comply with the INI format (https://en.wikipedia.org/wiki/INI_file), with the following restrictions:
Format: “key = value”
No support for sections
No support for hierarchy
Case insensitive
Support for comments
No support for duplicate names
Support for Quoted values
Support for Line continuation
Support for Escape characters
The fm_config file must include the MNNVL_TOPOLOGY parameter with one of the following values:
gb200_nvl36r1_c2g4_topology - 36 GPUs, Single chassis, 2 CPUs, 4 GPUs per compute tray
gb200_nvl36r1_c2g2_topology - 36 GPUs, Single chassis, 2 CPUs, 2 GPUs per compute tray
gb200_nvl72r1_c2g4_topology - 72 GPUs, Single chassis, 2 CPUs, 4 GPUs per compute tray
gb200_nvl72r2_c2g4_topology - 72 GPUs, Two chassis, 2 CPUs, 4 GPUs per compute tray
gb200_nvl72r2_c2g2_topology - 72 GPUs, Two chassis, 2 CPUs, 2 GPUs per compute tray
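Before sending an fm_config file with SetStaticConfig, a client may want to confirm locally that it sets MNNVL_TOPOLOGY to a supported value. The sketch below is a simplified reader, not the parser NMX-C uses: it handles only "key = value" lines and "#" comments, treats keys case-insensitively by upper-casing them (an assumption about what "case insensitive" covers), rejects duplicate keys, and does not implement quoted values, line continuation, or escape characters.

```python
from typing import Dict

SUPPORTED_TOPOLOGIES = frozenset({
    "gb200_nvl36r1_c2g4_topology",
    "gb200_nvl36r1_c2g2_topology",
    "gb200_nvl72r1_c2g4_topology",
    "gb200_nvl72r2_c2g4_topology",
    "gb200_nvl72r2_c2g2_topology",
})


def parse_flat_config(text: str) -> Dict[str, str]:
    """Parse section-less 'key = value' lines into a dict keyed by upper-cased name."""
    values: Dict[str, str] = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        key, sep, value = line.partition("=")
        key = key.strip().upper()
        if not sep or not key:
            raise ValueError(f"line {lineno}: expected 'key = value'")
        if key in values:
            raise ValueError(f"line {lineno}: duplicate key {key}")
        values[key] = value.strip()
    return values


def has_supported_topology(config: Dict[str, str]) -> bool:
    """True if MNNVL_TOPOLOGY is present and set to one of the listed values."""
    return config.get("MNNVL_TOPOLOGY") in SUPPORTED_TOPOLOGIES
```

A failed local check does not replace the server-side validation, which reports NMX_ST_NMX_CONTROLLER_INVALID_CONFIG_FILE on a parsing error.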
The chassis_mapping file maps each chassis ID to a chassis serial number. Example of the file format:
chassisId1 ABC123ABC123A
chassisId2 XYZ987XYZ987X
ReturnCode:
NMX_ST_SUCCESS - Client request for static configuration was successful
NMX_ST_NMX_CONTROLLER_INVALID_CONFIG_FILE - config file parsing error
NMX_ST_BADPARAM - Missing or invalid input parameter
GetDomainProperties(GetDomainPropertiesRequest) returns (DomainProperties)
Domain Properties provide an overview of the current topology that NMX-C manages. They give the maximum number of expected resources (including but not limited to GPUs, switches, compute nodes, switch nodes, partitions, and NVLinks) in the domain.
The client calls this RPC to get static properties of the domain. The GetDomainPropertiesRequest has the context and gatewayID.
Response:
Message/Parameter
Type
Description
Notes
DomainProperties.serverHeader
ServerHeader
Base server info and return code
DomainProperties.maxComputeNodes
uint32
Maximum number of Compute Nodes in the NVLink Domain
DomainProperties.maxComputeNodesPerChassis
uint32
Maximum number of Compute Nodes in a chassis
Number of chassis in the NVLink Domain = DomainProperties.maxComputeNodes / DomainProperties.maxComputeNodesPerChassis
DomainProperties.maxGpusPerComputeNode
uint32
Maximum number of GPUs in a Compute Node
DomainProperties.maxGpuNvLinks
uint32
Maximum number of NVLinks in a GPU
DomainProperties.lineRateMBps
uint32
Maximum line rate in MBps of an NVLink
DomainProperties.maxSwitchNodes
uint32
Maximum number of Switch Nodes in the NVLink Domain
DomainProperties.maxSwitchNodesPerChassis
uint32
Maximum number of Switch Nodes in a chassis
DomainProperties.maxSwitchesPerSwitchNode
uint32
Maximum number of Switches in a Switch Node
DomainProperties.maxSwitchNvLinks
uint32
Maximum number of NVLinks in a Switch
DomainProperties.maxNumPartitions
uint32
Maximum number of partitions that can be created in the NVLink Domain
DomainProperties.maxNumAlids
uint32
Maximum number of Alids for a GPU
DomainProperties.maxMulticastGroups
uint32
Maximum number of Multicast Groups available in the NVLink Domain
DomainProperties.maxNumPorts
uint32
Total aggregate number of ports across switches and GPUs in the NVLink Domain
DomainProperties.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
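The note on maxComputeNodesPerChassis gives the number of chassis as a quotient of two DomainProperties fields. A small helper, shown here as a sketch with a hypothetical name, makes the arithmetic explicit; rejecting a non-exact division is an assumption of this example rather than documented behavior.

```python
def chassis_count(max_compute_nodes: int, max_compute_nodes_per_chassis: int) -> int:
    """Number of chassis = maxComputeNodes / maxComputeNodesPerChassis."""
    if max_compute_nodes_per_chassis <= 0:
        raise ValueError("maxComputeNodesPerChassis must be positive")
    count, remainder = divmod(max_compute_nodes, max_compute_nodes_per_chassis)
    if remainder:
        raise ValueError("maxComputeNodes is not a multiple of maxComputeNodesPerChassis")
    return count
```

For example, with made-up values of 18 compute nodes in the domain and 9 per chassis, the helper returns 2.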
GetDomainStateInfo(GetDomainStateInfoRequest) returns (DomainStateInfo)
The client calls this RPC to get the dynamic properties of the domain. The GetDomainStateInfoRequest has the context and gatewayID.
Response:
Message/Parameter
Type
Description
Notes
DomainStateInfo.serverHeader
ServerHeader
Base server info and return code
DomainStateInfo.controlPlaneState
ControlPlaneState
State of the domain control plane
DomainStateInfo.availableMulticastGroups
uint32
Number of available multicast groups
DomainStateInfo.configStatusDescription
string
Additional details of the control plane state
DomainStateInfo.nmxControllerHealth
NmxControllerHealth
NMX-Controller service health
DomainStateInfo.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
The following are the values of controlPlaneState:
NMX_CONTROL_PLANE_STATE_UNDEFINED = 0; //!< Control plane state is undefined
NMX_CONTROL_PLANE_STATE_OFFLINE = 1; //!< Control plane state is offline
NMX_CONTROL_PLANE_STATE_STANDBY = 2; //!< Control plane state is standby, NvLink Domain is not operational
NMX_CONTROL_PLANE_STATE_CONFIGURED = 3; //!< Control plane state is configured, NvLink Domain is operational
NMX_CONTROL_PLANE_STATE_TIMEOUT = 4; //!< Control plane state is timeout
NMX_CONTROL_PLANE_STATE_ERROR = 5; //!< Control plane state is error, user provided an invalid configuration
NMX_CONTROL_PLANE_STATE_UNCONFIGURED = 6; //!< Control plane state is unconfigured, pending user provided configuration
The Control Plane states and their associated configuration status description strings are explained below:
NMX_CONTROL_PLANE_STATE_UNCONFIGURED - Pending required FM configuration.
CONFIG_PENDING_UUID - Pending NVLink Domain UUID. FM waits indefinitely until set.
CONFIG_PENDING_TOPOLOGY - Pending MNNVL_TOPOLOGY config. FM waits indefinitely until set.
CONFIG_PENDING_CHASSIS_ID_MAPPING - Pending chassis Id mapping. FM waits for GFM_WAIT_TIMEOUT_SEC until set. If not set, FM allocates the mapping during the initial resource discovery.
CONFIG_RECEIVED - FM received all the required configuration.
NMX_CONTROL_PLANE_STATE_ERROR - Error validating FM configuration. Restart NMX-C after the configuration error is fixed.
CONFIG_ERROR_INCORRECT_TOPOLOGY_FILE - Encountered error while processing the topology file specified in MNNVL_TOPOLOGY
CONFIG_ERROR_CHASSIS_ID_MAPPING_COUNT - Detected mismatch between number of entries in the chassis Id mapping specified and expected number of chassis read from the topology file specified in MNNVL_TOPOLOGY
CONFIG_ERROR_CHASSIS_ID_MAPPING_OUT_OF_RANGE - Detected chassis Id value outside of the allowed range. Allowed range is 1 to n, where n is the number of chassis in the NVLink Domain.
CONFIG_ERROR_DUPLICATE_CHASSIS_SERIAL_NUMBER - Detected duplicate chassis serial number in the chassis Id mapping
CONFIG_ERROR_MISSING_CHASSIS - Detected fewer than expected chassis serial numbers during the initial resource discovery.
CONFIG_ERROR_ADDITIONAL_CHASSIS_DETECTED - Detected more than expected chassis serial number(s) during the initial resource discovery
NMX_CONTROL_PLANE_STATE_DEGRADED - FM detected a misconfiguration. The NVLink Domain continues to work with limited capability. Fixing the misconfiguration restores full functionality.
CONFIG_ERROR_MISWIRED_TRUNK_PORTS - Incorrect/mis-wired trunk connection(s) detected. Use the GetConnInfoList() gRPC API to get details.
NMX_CONTROL_PLANE_STATE_CONFIGURED - FM completed configuration validation and initialization
CONFIG_DONE - FM completed configuration validation and initialization
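The states and status strings above lend themselves to a lookup table on the client. The sketch below mirrors the enum values listed earlier and groups the status strings by the state they appear under in the list; the Python names are hypothetical, and because no numeric value is given here for NMX_CONTROL_PLANE_STATE_DEGRADED, states in the lookup are kept as strings.

```python
from enum import IntEnum
from typing import Optional


class ControlPlaneState(IntEnum):
    NMX_CONTROL_PLANE_STATE_UNDEFINED = 0
    NMX_CONTROL_PLANE_STATE_OFFLINE = 1
    NMX_CONTROL_PLANE_STATE_STANDBY = 2
    NMX_CONTROL_PLANE_STATE_CONFIGURED = 3
    NMX_CONTROL_PLANE_STATE_TIMEOUT = 4
    NMX_CONTROL_PLANE_STATE_ERROR = 5
    NMX_CONTROL_PLANE_STATE_UNCONFIGURED = 6


# Status strings grouped by the state they are listed under (list order above).
STATUS_BY_STATE = {
    "NMX_CONTROL_PLANE_STATE_UNCONFIGURED": [
        "CONFIG_PENDING_UUID", "CONFIG_PENDING_TOPOLOGY",
        "CONFIG_PENDING_CHASSIS_ID_MAPPING", "CONFIG_RECEIVED",
    ],
    "NMX_CONTROL_PLANE_STATE_ERROR": [
        "CONFIG_ERROR_INCORRECT_TOPOLOGY_FILE",
        "CONFIG_ERROR_CHASSIS_ID_MAPPING_COUNT",
        "CONFIG_ERROR_CHASSIS_ID_MAPPING_OUT_OF_RANGE",
        "CONFIG_ERROR_DUPLICATE_CHASSIS_SERIAL_NUMBER",
        "CONFIG_ERROR_MISSING_CHASSIS",
        "CONFIG_ERROR_ADDITIONAL_CHASSIS_DETECTED",
    ],
    "NMX_CONTROL_PLANE_STATE_DEGRADED": ["CONFIG_ERROR_MISWIRED_TRUNK_PORTS"],
    "NMX_CONTROL_PLANE_STATE_CONFIGURED": ["CONFIG_DONE"],
}


def state_for_status(status: str) -> Optional[str]:
    """Return the state name a configStatusDescription string is listed under."""
    for state, statuses in STATUS_BY_STATE.items():
        if status in statuses:
            return state
    return None


def is_operational(state_value: int) -> bool:
    """Only the CONFIGURED state is described as operational in the enum comments."""
    return state_value == ControlPlaneState.NMX_CONTROL_PLANE_STATE_CONFIGURED
```

A monitoring client could use state_for_status to decide whether a reported string calls for waiting (pending configuration) or for operator action (configuration error).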
The following are the values of nmxControllerHealth:
NMX_CONTROLLER_HEALTH_UNKNOWN = 0;
NMX_CONTROLLER_HEALTH_HEALTHY = 1;
NMX_CONTROLLER_HEALTH_DEGRADED = 2;
NMX_CONTROLLER_HEALTH_UNHEALTHY = 3;
NMX_CONTROLLER_HEALTH_UNHEALTHY_DB_CORRUPTED = 4;
GetTopologyInfo returns (FmTopologyInfo)
The client calls this RPC to receive information on the currently discovered topology:
Devices (GPUs and Switches) and their properties and state (DeviceTopoInfo contains one of SwitchTopoInfo or GpuTopoInfo)
Device ports and their properties and state (PortTopoInfo)
Device connectivity information
Response:
Message/Parameter
Type
Description
Notes
FmTopologyInfo.serverHeader
ServerHeader
Base server info and return code
FmTopologyInfo.deviceTopoInfo
Array of DeviceTopoInfo
List of discovered devices
FmTopologyInfo.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
DeviceTopoInfo message:
Message/Parameter
Type
Description
Notes
switchTopoInfo
SwitchTopoInfo
Details of switch device
Either switchTopoInfo or gpuTopoInfo is provided, depending on the discovered device type
gpuTopoInfo
GpuTopoInfo
Details of GPU device
Either switchTopoInfo or gpuTopoInfo is provided, depending on the discovered device type
GpuTopoInfo message:
Message/Parameter
Type
Description
Notes
loc
LocationInfo
Node location
topologyId
uint64
Indicates switch tray model
A value of 128 is a scale-out GB200 NVL switch tray; 129 is a non-scale-out GB200 NVL switch tray
deviceUid
uint64
Device unique identifier
deviceId
uint32
Device enumeration within the node
From 1
numPorts
uint64
Total number of ports
systemUid
uint64
Node unique identifier
vendorId
uint32
Device vendor ID
devicePcieId
uint64
Device PCIe ID
description
string
Device description
partitionId
Array of PartitionId
List of partitions the device is associated with
deviceHealth
GpuHealth
NVLink Health of the GPU
portTopoInfo
Array of PortTopoInfo
List of device ports
aLids
Array of uint64
List of device labels for internal routing
SwitchTopoInfo message:
Message/Parameter
Type
Description
Notes
loc
LocationInfo
Node location
topologyId
uint64
Indicates switch tray model
deviceUid
uint64
Device unique identifier
deviceId
uint32
Device enumeration within the node
numPorts
uint64
Total number of ports
systemUid
uint64
Node unique identifier
vendorId
uint32
Device vendor ID
devicePcieId
uint64
Device PCIe ID
description
string
Device description
partitionId
Array of PartitionId
List of partitions the device is associated with
deviceHealth
SwitchHealth
NVLink Health of the Switch
portTopoInfo
Array of PortTopoInfo
List of device ports
PortTopoInfo message:
Message/Parameter
Type
Description
Notes
portType
PortType
Type of device port
portUid
uint64
Port identifier
portNum
uint64
Port number of device
From 1
peerPortDeviceUid
uint64
Peer device unique identifier
peerPortNum
uint64
Peer port number on peer device
From 1
physicalState
PhysicalPortState
NVLink port physical state
logicalState
LogicalPortState
NVLink port logical state
subnetPrefix
uint64
NVLink port routing subnet
isSdnPort
boolean
NVLink management port indicator
partitionIdList
Array of PartitionId
List of partitions the port is associated with
cageNum
uint32
Front panel cage number
Provided only when portType=PORT_TYPE_SWITCH_TRUNK
From 1
cagePortNum
uint32
Front panel port number in cage
Provided only when portType=PORT_TYPE_SWITCH_TRUNK
From 1
cageSplitPortNum
uint32
Front split port number in cage
Provided only when portType=PORT_TYPE_SWITCH_TRUNK
From 1
baseLid
uint64
Base label for routing
Provided only when portType=PORT_TYPE_GPU
systemPortNum
uint64
Port number per tray
Provided only when portType=PORT_TYPE_SWITCH_ACCESS
From 1
computePortNum
uint64
Port number per GPU
Provided only when portType=PORT_TYPE_GPU
From 0
containAndDrain
bool
Indication that the port is in contain and drain state
A value of TRUE indicates an active contain and drain state
rail
uint32
Rail of the port
plane
uint32
Plane of the port
linkRateMbps
uint32
Rate of the port link in Mbits/sec
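Because each PortTopoInfo entry names its peer device and peer port, the connectivity of the domain can be rebuilt from a GetTopologyInfo response. The sketch below uses a plain dataclass as a stand-in for the generated message class and collapses the two directions of each connection into one link. Treating a peerPortNum below 1 as "no peer" is an assumption based on port numbers starting from 1.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Set, Tuple

Endpoint = Tuple[int, int]  # (deviceUid, portNum)


@dataclass(frozen=True)
class Port:
    """Stand-in for the PortTopoInfo fields used here (not the generated class)."""
    port_num: int
    peer_port_device_uid: int
    peer_port_num: int


def unique_links(devices: Dict[int, List[Port]]) -> Set[FrozenSet[Endpoint]]:
    """Collect undirected links from per-device port lists keyed by deviceUid."""
    links: Set[FrozenSet[Endpoint]] = set()
    for device_uid, ports in devices.items():
        for port in ports:
            if port.peer_port_num < 1:
                continue  # assumed to mean the port has no peer
            local = (device_uid, port.port_num)
            remote = (port.peer_port_device_uid, port.peer_port_num)
            # A frozenset makes A->B and B->A compare equal.
            links.add(frozenset({local, remote}))
    return links
```

With real responses, the dictionary would be filled from deviceUid and portTopoInfo of each SwitchTopoInfo or GpuTopoInfo entry.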
GetComputeNodeCount(GetComputeNodeCountRequest) returns (GetComputeNodeCountResponse)
The client calls this RPC to get the number of compute nodes. It allows the client to filter the compute nodes based on Attribute, Health and Location.
Attribute reflects the allocation status of a compute node to a partition:
NMX_COMPUTE_NODE_ATTR_ALL = 1 - all compute nodes in the NVLink Domain, irrespective of allocation status
NMX_COMPUTE_NODE_ATTR_FREE = 2 - compute nodes in the NVLink Domain that are not allocated to any partition
NMX_COMPUTE_NODE_ATTR_FULLY_ALLOCATED = 3 - compute nodes in the NVLink Domain where all of their GPUs are allocated to partitions
NMX_COMPUTE_NODE_ATTR_PARTIALLY_ALLOCATED = 4 - compute nodes in the NVLink Domain where one or more (but not all) of their GPUs are allocated to partitions
Compute node health:
NMX_COMPUTE_NODE_HEALTH_HEALTHY = 1 - Fully healthy
NMX_COMPUTE_NODE_HEALTH_DEGRADED = 2 - Some GPUs are degraded to NO_NVLINK
NMX_COMPUTE_NODE_HEALTH_UNHEALTHY = 3 - Unable to participate in NVLink
Location and health are optional filters.
Request:
Message/Parameter
Type
Description
Notes
GetComputeNodeCountRequest.attr
ComputeNodeAttr
Filter based on partition allocation status of the compute node
GetComputeNodeCountRequest.chassisId
uint64
Chassis ID
[Optional] Set to 0 when not used
GetComputeNodeCountRequest.nodeHealth
ComputeNodeHealth
NVLink health of the compute node
[Optional] Set to COMPUTE_NODE_HEALTH_UNKNOWN when not used
Response message:
Message/Parameter
Type
Description
Notes
GetComputeNodeCountResponse.numNodes
uint32
Number of compute nodes matching the filters
GetComputeNodeCountResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
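The attribute filter can be understood as a classification of each compute node by how many of its GPUs are allocated to partitions. The sketch below reproduces that classification on the client side. The enum values are taken from the list above, while the function names and the reading of FREE as "zero allocated GPUs" are assumptions of this example.

```python
from enum import IntEnum


class ComputeNodeAttr(IntEnum):
    NMX_COMPUTE_NODE_ATTR_ALL = 1
    NMX_COMPUTE_NODE_ATTR_FREE = 2
    NMX_COMPUTE_NODE_ATTR_FULLY_ALLOCATED = 3
    NMX_COMPUTE_NODE_ATTR_PARTIALLY_ALLOCATED = 4


def allocation_status(num_gpus: int, num_allocated: int) -> ComputeNodeAttr:
    """Classify a node from its GPU count and the number of allocated GPUs."""
    if num_gpus < 1 or not 0 <= num_allocated <= num_gpus:
        raise ValueError("invalid GPU counts")
    if num_allocated == 0:
        return ComputeNodeAttr.NMX_COMPUTE_NODE_ATTR_FREE
    if num_allocated == num_gpus:
        return ComputeNodeAttr.NMX_COMPUTE_NODE_ATTR_FULLY_ALLOCATED
    return ComputeNodeAttr.NMX_COMPUTE_NODE_ATTR_PARTIALLY_ALLOCATED


def matches_attr(attr: ComputeNodeAttr, num_gpus: int, num_allocated: int) -> bool:
    """True if a node with these counts passes the given attribute filter."""
    if attr == ComputeNodeAttr.NMX_COMPUTE_NODE_ATTR_ALL:
        return True
    return attr == allocation_status(num_gpus, num_allocated)
```

Such a helper is mostly useful for cross-checking counts returned by the server against data gathered with GetGpuInfoList.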
GetComputeNodeLocationList(GetComputeNodeLocationListRequest) returns (GetComputeNodeLocationListResponse)
The client calls this RPC to get the locations of the compute nodes. It allows the client to filter the compute nodes based on Attribute, Health and Location.
Attribute reflects the allocation status of a compute node to a partition:
NMX_COMPUTE_NODE_ATTR_ALL = 1 - all compute nodes in the NVLink Domain, irrespective of allocation status
NMX_COMPUTE_NODE_ATTR_FREE = 2 - compute nodes in the NVLink Domain that are not allocated to any partition
NMX_COMPUTE_NODE_ATTR_FULLY_ALLOCATED = 3 - compute nodes in the NVLink Domain where all of their GPUs are allocated to partitions
NMX_COMPUTE_NODE_ATTR_PARTIALLY_ALLOCATED = 4 - compute nodes in the NVLink Domain where one or more (but not all) of their GPUs are allocated to partitions
Compute node health:
NMX_COMPUTE_NODE_HEALTH_HEALTHY = 1 - Fully healthy
NMX_COMPUTE_NODE_HEALTH_DEGRADED = 2 - Some GPUs are degraded to NO_NVLINK
NMX_COMPUTE_NODE_HEALTH_UNHEALTHY = 3 - Unable to participate in NVLink
Location and health are optional filters.
Request message:
Message/Parameter
Type
Description
Notes
GetComputeNodeLocationListRequest.attr
ComputeNodeAttr
Filter based on partition allocation status of the compute node
GetComputeNodeLocationListRequest.chassisId
uint64
Chassis ID
[Optional] Set to 0 when not used
GetComputeNodeLocationListRequest.nodeHealth
ComputeNodeHealth
NVLink health of the compute node
[Optional] Set to COMPUTE_NODE_HEALTH_UNKNOWN when not used
GetComputeNodeLocationListRequest.numNodes
uint32
Limit on number of nodes in the response
Response message:
Message/Parameter
Type
Description
Notes
GetComputeNodeLocationListResponse.locList
Array of Location
List of locations
GetComputeNodeLocationListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
GetComputeNodeInfoList(GetComputeNodeInfoListRequest) returns (GetComputeNodeInfoListResponse)
The client calls this RPC to get information on the compute nodes. The client uses it to get both static information (location and number of GPUs in the node) and dynamic information (partition IDs and health) about compute nodes.
Request message:
Message/Parameter
Type
Description
Notes
GetComputeNodeInfoListRequest.locList
Array of Location
List of locations
[Optional] If the list is empty, the response includes all nodes
Response message:
Message/Parameter
Type
Description
Notes
GetComputeNodeInfoListResponse.nodeInfoList
Array of ComputeNodeInfo
GetComputeNodeInfoListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
ComputeNodeInfo message:
Message/Parameter
Type
Description
Notes
ComputeNodeInfo.loc
LocationInfo
Node location
ComputeNodeInfo.numGPUs
uint32
Number of GPUs for the node
ComputeNodeInfo.nodeHealth
ComputeNodeHealth
NVLink health of the compute node
ComputeNodeInfo.partitionIdList
Array of PartitionId
List of partitions the device is associated with
GetGpuInfoList(GetGpuInfoListRequest) returns (GetGpuInfoListResponse)
The client calls this RPC to get information on the GPUs.
The client can filter GPUs based on location or partition by passing the attribute NMX_GPU_ATTR_LOCATION or NMX_GPU_ATTR_PARTITION_ID respectively.
Notes:
Specifying NMX_GPU_ATTR_PARTITION_ID and partitionId=0 provides info on GPUs that are not associated with any partition
Specifying NMX_GPU_ATTR_ALL ignores the location or partition values set in the request
Request message:
Message/Parameter
Type
Description
Notes
GetGpuInfoListRequest.attr
GpuAttr
Filter based on GPUs that belong to a partition or location
GetGpuInfoListRequest.numGpus
uint32
Limit on number of GPUs for response
[Optional] Set to 0 when not used, and response includes all GPUs matching the filter
GetGpuInfoListRequest.loc
Location
Location
Values used only when attr=GPU_ATTR_LOCATION_ID
GetGpuInfoListRequest.partitionId
PartitionId
Partition ID
Value used only when attr=GPU_ATTR_PARTITION_ID
GetGpuInfoListRequest.gpuHealth
GpuHealth
GPU health
[Optional] Set to 0 (GPU_HEALTH_UNKNOWN), when not used
GPU health values:
NMX_GPU_HEALTH_HEALTHY = 1 //!< Fully healthy
NMX_GPU_HEALTH_DEGRADED = 2 //!< One or more links are down
NMX_GPU_HEALTH_NO_NVLINK = 3 //!< Unable to participate in NVLink partition
NMX_GPU_HEALTH_DEGRADED_BW = 4 //!< GPU operates in degraded bandwidth
Note: To get all healthy GPUs that are not associated with any partition, set:
attr to GPU_PARTITION_ID
partitionId to 0
Response message:
Message/Parameter
Type
Description
Notes
GetGpuInfoListResponse.gpuInfoList
Array of GpuInfo
A list of GPU information
GetGpuInfoListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
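The interaction of the GetGpuInfoList request fields can be illustrated with a local model of the filter. This sketch is not the server implementation: the Gpu dataclass is a stand-in for GpuInfo, locations are simplified to strings, attribute values are passed by name, and applying the health filter together with NMX_GPU_ATTR_ALL is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Gpu:
    """Stand-in for GpuInfo with simplified field types."""
    location: str      # simplified Location
    partition_id: int  # 0 is used here for "not in any partition"
    health: int        # GpuHealth value, e.g. 1 = NMX_GPU_HEALTH_HEALTHY


def filter_gpus(gpus: List[Gpu], attr: str, loc: Optional[str] = None,
                partition_id: int = 0, gpu_health: int = 0,
                num_gpus: int = 0) -> List[Gpu]:
    """Apply the request fields the way the notes above describe them."""
    selected = []
    for gpu in gpus:
        # loc and partition_id are only consulted for their matching attr;
        # NMX_GPU_ATTR_ALL ignores both.
        if attr == "NMX_GPU_ATTR_LOCATION" and gpu.location != loc:
            continue
        if attr == "NMX_GPU_ATTR_PARTITION_ID" and gpu.partition_id != partition_id:
            continue
        # gpu_health == 0 means the health filter is not used.
        if gpu_health != 0 and gpu.health != gpu_health:
            continue
        selected.append(gpu)
    # num_gpus == 0 means no limit on the response size.
    return selected[:num_gpus] if num_gpus else selected
```

For instance, passing attr="NMX_GPU_ATTR_PARTITION_ID", partition_id=0 and gpu_health=1 to this model selects healthy GPUs that are not in any partition.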
GpuInfo message:
Message/Parameter
Type
Description
Notes
GpuInfo.loc
LocationInfo
Node location
GpuInfo.gpuId
uint32
GPU enumeration within the node
From 1
GpuInfo.gpuUid
uint64
GPU unique identifier
GpuInfo.gpuHealth
GpuHealth
NVLink health of the GPU
GpuInfo.partitionId
PartitionId
Partitions the device is associated with
GetSwitchNodeCount(GetSwitchNodeCountRequest) returns (GetSwitchNodeCountResponse)
The client calls this RPC to get the number of switch nodes.
The client can filter the switch nodes using attribute and health. Note that NVLink5 topologies do not have switch nodes of type NMX_SWITCH_NODE_ATTR_L2.
Request message:
Message/Parameter
Type
Description
Notes
GetSwitchNodeCountRequest.attr
SwitchNodeAttr
Filter based on the type of switch node
GetSwitchNodeCountRequest.nodeHealth
SwitchNodeHealth
NVLink health of the switch node
[Optional] Set to SWITCH_NODE_HEALTH_UNKNOWN when not used
GetSwitchNodeCountRequest.numNodes
uint32
Limit on number of nodes for the response
[Optional] Set to 0 for no limit
Response message:
Message/Parameter
Type
Description
Notes
GetSwitchNodeCountResponse.numNodes
uint32
Number of nodes matching the filter
GetSwitchNodeCountResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
GetSwitchNodeLocationList(GetSwitchNodeLocationListRequest) returns (GetSwitchNodeLocationListResponse)
The client calls this RPC to get the location of switch nodes.
The client can filter the switch nodes using attribute and health. Note that NVLink5 topologies do not have switch nodes of type NMX_SWITCH_NODE_ATTR_L2.
Request message:
Message/Parameter
Type
Description
Notes
GetSwitchNodeLocationListRequest.attr
SwitchNodeAttr
Filter based on the type of switch node
GetSwitchNodeLocationListRequest.nodeHealth
SwitchNodeHealth
NVLink health of the switch node
[Optional] Set to SWITCH_NODE_HEALTH_UNKNOWN when not used
GetSwitchNodeLocationListRequest.numNodes
uint32
Limit on number of nodes for the response
[Optional] Set to 0 for no limit
Response message:
Message/Parameter
Type
Description
Notes
GetSwitchNodeLocationListResponse.locList
Array of Location
List of nodes locations
GetSwitchNodeLocationListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
GetSwitchNodeInfoList(GetSwitchNodeInfoListRequest) returns (GetSwitchNodeInfoListResponse)
The client calls this RPC to get information on the switch nodes. The client uses it to get both static information (location and number of switches in the node) and dynamic information (partition IDs and health) about switch nodes.
Request message:
Message/Parameter
Type
Description
Notes
GetSwitchNodeInfoListRequest.locList
Array of Location
List of nodes locations
[Optional] If the list is empty, the response includes all nodes
Response message:
Message/Parameter
Type
Description
Notes
GetSwitchNodeInfoListResponse.nodeInfoList
Array of SwitchNodeInfo
List of switch nodes
GetSwitchNodeInfoListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
SwitchNodeInfo message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| SwitchNodeInfo.loc | LocationInfo | Node location | |
| SwitchNodeInfo.numSwitches | uint32 | Number of switches in the node | |
| SwitchNodeInfo.nodeHealth | SwitchNodeHealth | NVLink health of the switch node | |
| SwitchNodeInfo.partitionIdList | Array of PartitionId | List of partitions the node is associated with | |
GetSwitchInfoList(GetSwitchInfoListRequest) returns (GetSwitchInfoListResponse)
The client calls this RPC to get information on the switches.
Request message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| GetSwitchInfoListRequest.numSwitches | uint32 | Limit on number of switches for response | [Optional] Set to 0 when not used. The response includes all switches matching the filter |
| GetSwitchInfoListRequest.loc | Location | Location | |
Response message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| GetSwitchInfoListResponse.switchInfoList | Array of SwitchInfo | List of switch info | |
GetSwitchInfoListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
SwitchInfo message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| SwitchInfo.loc | LocationInfo | Node location | |
| SwitchInfo.switchId | uint32 | Switch enumeration within the node | Starts from 1 |
| SwitchInfo.switchUid | uint64 | Switch unique identifier | |
| SwitchInfo.health | SwitchHealth | NVLink health of the switch | |
| SwitchInfo.numPorts | uint32 | Number of ports | |
GetPartitionCount(GetPartitionCountRequest) returns (GetPartitionCountResponse)
The client calls this RPC to get the number of partitions. The number of partitions can be filtered using the numGpus, numNodes or health filters. The infoAttr value decides which filter to choose between numGpus and numNodes. If set to ATTR_ALL, numGpus and numNodes are ignored.
enum PartitionInfoAttr {
    NMX_PARTITION_INFO_ATTR_UNDEFINED = 0;
    NMX_PARTITION_INFO_ATTR_ALL = 1; //!< All Partitions
    NMX_PARTITION_INFO_ATTR_NUM_GPUS = 2; //!< Number of Partitions with a specific GPU size
    NMX_PARTITION_INFO_ATTR_NUM_COMPUTE_NODES = 3; //!< Number of Partitions with a specific number of Compute nodes
}
enum PartitionHealth {
    NMX_PARTITION_HEALTH_UNKNOWN = 0;
    NMX_PARTITION_HEALTH_HEALTHY = 1; //!< Partition is healthy
    NMX_PARTITION_HEALTH_DEGRADED_BANDWIDTH = 2; //!< Partition is in degraded bandwidth
    NMX_PARTITION_HEALTH_DEGRADED = 3; //!< One or more GPUs has routing disabled
    NMX_PARTITION_HEALTH_UNHEALTHY = 4; //!< Partition is unhealthy
}
Request message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| GetPartitionCountRequest.infoAttr | PartitionInfoAttr | Filter based on number of GPUs or compute nodes in a partition | |
| GetPartitionCountRequest.numGpus | uint32 | Number of GPUs in a partition | Value used only when infoAttr=PARTITION_INFO_ATTR_NUM_GPUS |
| GetPartitionCountRequest.numNodes | uint32 | Number of compute nodes in a partition | Value used only when infoAttr=PARTITION_INFO_ATTR_NUM_COMPUTE_NODES |
| GetPartitionCountRequest.health | PartitionHealth | NVLink health of the partition | [Optional] Set to PARTITION_HEALTH_UNKNOWN when not used |
Response message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| GetPartitionCountResponse.numPartitions | uint32 | Number of partitions matching the filter | |
GetPartitionCountResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
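The interaction between infoAttr, numGpus, numNodes and health described for this RPC can be illustrated with a local model. This is a sketch under stated assumptions, not server code: `LocalPartition` and `count_partitions` are hypothetical names, the enum members drop the NMX_ prefixes for brevity, and rejecting the UNDEFINED attribute is an assumption because this section does not say how the server treats it.

```python
from dataclasses import dataclass
from enum import IntEnum


class PartitionInfoAttr(IntEnum):
    UNDEFINED = 0
    ALL = 1
    NUM_GPUS = 2
    NUM_COMPUTE_NODES = 3


class PartitionHealth(IntEnum):
    UNKNOWN = 0
    HEALTHY = 1
    DEGRADED_BANDWIDTH = 2
    DEGRADED = 3
    UNHEALTHY = 4


@dataclass
class LocalPartition:
    """Hypothetical local record; not a message defined by the API."""
    num_gpus: int
    num_nodes: int
    health: PartitionHealth


def count_partitions(partitions, info_attr, num_gpus=0, num_nodes=0,
                     health=PartitionHealth.UNKNOWN):
    """Count partitions the way the request fields are described to filter."""
    if info_attr == PartitionInfoAttr.UNDEFINED:
        # Assumption: the source does not specify this case.
        raise ValueError("infoAttr must be set")

    def matches(p):
        # infoAttr selects which size filter applies; ALL ignores both.
        if info_attr == PartitionInfoAttr.NUM_GPUS and p.num_gpus != num_gpus:
            return False
        if (info_attr == PartitionInfoAttr.NUM_COMPUTE_NODES
                and p.num_nodes != num_nodes):
            return False
        # UNKNOWN means the health filter is not used.
        return health == PartitionHealth.UNKNOWN or p.health == health

    return sum(1 for p in partitions if matches(p))
```

With infoAttr set to ALL, any numGpus or numNodes value passed alongside it has no effect on the count, which mirrors the "ignored" wording in the description.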
GetPartitionIdList(GetPartitionIdListRequest) returns (GetPartitionIdListResponse)
The client calls this RPC to get the partition IDs. The partition IDs can be filtered using the numGpus, numNodes or health filters. The infoAttr value decides which filter to choose between numGpus and numNodes. If set to ATTR_ALL, numGpus and numNodes are ignored.
enum PartitionInfoAttr {
    NMX_PARTITION_INFO_ATTR_UNDEFINED = 0;
    NMX_PARTITION_INFO_ATTR_ALL = 1; //!< All Partitions
    NMX_PARTITION_INFO_ATTR_NUM_GPUS = 2; //!< Number of Partitions with a specific GPU size
    NMX_PARTITION_INFO_ATTR_NUM_COMPUTE_NODES = 3; //!< Number of Partitions with a specific number of Compute nodes
}
enum PartitionHealth {
    NMX_PARTITION_HEALTH_UNKNOWN = 0;
    NMX_PARTITION_HEALTH_HEALTHY = 1; //!< Partition is healthy
    NMX_PARTITION_HEALTH_DEGRADED_BANDWIDTH = 2; //!< Partition is in degraded bandwidth
    NMX_PARTITION_HEALTH_DEGRADED = 3; //!< One or more GPUs has routing disabled
    NMX_PARTITION_HEALTH_UNHEALTHY = 4; //!< Partition is unhealthy
}
Request message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| GetPartitionIdListRequest.infoAttr | PartitionInfoAttr | Filter based on number of GPUs or compute nodes in a partition | |
| GetPartitionIdListRequest.numGpus | uint32 | Number of GPUs in a partition | Value used only when infoAttr=PARTITION_INFO_ATTR_NUM_GPUS |
| GetPartitionIdListRequest.numNodes | uint32 | Number of compute nodes in a partition | Value used only when infoAttr=PARTITION_INFO_ATTR_NUM_COMPUTE_NODES |
| GetPartitionIdListRequest.health | PartitionHealth | NVLink health of the partition | [Optional] Set to PARTITION_HEALTH_UNKNOWN when not used |
| GetPartitionIdListRequest.numPartitions | uint32 | Number of partitions | |
Response message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| GetPartitionIdListResponse.partitionList | Array of Partition | List of partitions | |
GetPartitionIdListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
Partition message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| Partition.partitionId | uint32 | Partition ID | |
| Partition.numGpus | uint32 | Number of GPUs in partition | |
GetPartitionInfoList(GetPartitionInfoListRequest) returns (GetPartitionInfoListResponse)
The client calls this RPC to get partition information. The user can pass in a list of partitionIds, a list of partitionNames, or both. Only valid partitionIds or partitionNames are considered. If both lists are empty, information for all partitions is returned.
Request message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| GetPartitionInfoListRequest.partitionIdList | Array of PartitionId | List of partition IDs | [Optional] If IDs/names are not provided, the response includes all provisioned partitions |
| GetPartitionInfoListRequest.partitionNameList | Array of strings | List of partition names | [Optional] If IDs/names are not provided, the response includes all provisioned partitions |
Response message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| GetPartitionInfoListResponse.partitionInfoList | Array of PartitionInfo | List of partition info | |
GetPartitionInfoListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
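The selection rule for this RPC (IDs, names, or both; invalid entries are not considered; two empty lists select everything) can be sketched as follows. `select_partitions` and the `known` dictionary are hypothetical local constructs, and treating "both lists" as a union of matches is an assumption, since the description does not say whether the two lists are combined by union or intersection.

```python
def select_partitions(known, partition_ids=(), partition_names=()):
    """Return the partition IDs a request would select.

    known: dict mapping partition ID -> partition name (hypothetical local view).
    """
    if not partition_ids and not partition_names:
        # Both lists empty: information for all partitions is returned.
        return sorted(known)
    wanted_ids = set(partition_ids)
    wanted_names = set(partition_names)
    # Unknown IDs or names match nothing, mirroring "only valid ... are considered".
    # Assumption: a partition is selected if it matches either list.
    return sorted(
        pid for pid, name in known.items()
        if pid in wanted_ids or name in wanted_names
    )
```

For instance, with `known = {1: "a", 2: "b", 3: "c"}`, passing IDs `[2, 99]` selects only partition 2 in this model, because 99 does not correspond to any partition.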
PartitionInfo message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| partitionId | PartitionId | Partition ID | |
| name | string | Partition name | |
| numGpus | uint32 | Number of GPUs in partition | |
| gpuLocationList | Array of GpuLocation | GPU locations | |
| gpuUidList | Array of uint64 | GPU unique IDs | |
| health | PartitionHealth | NVLink health of the partition | |
| partitionType | PartitionType | Partition type | PARTITION_TYPE_LOCATION_BASED if GPUs are associated by location; PARTITION_TYPE_GPUUID_BASED if GPUs are associated by gpuUid |
| numAllocatedMulticastGroups | uint32 | Number of allocated multicast groups to the partition | |
| attr.resiliencyMode | ResiliencyMode | Resiliency mode | RESILIENCY_MODE_UNDEFINED, RESILIENCY_MODE_FULL_BANDWIDTH, RESILIENCY_MODE_ADAPTIVE_BANDWIDTH, RESILIENCY_MODE_USER_ACTION_REQUIRED |
| attr.MulticastGroupsLimit | uint32 | Limit on number of multicast groups in partition | |
The meanings of the ResiliencyMode values are:
Full Bandwidth: On a trunk link failure, partition will attempt an automatic recovery. If spare trunk links are not available, GPUs will be excluded from the fabric to maintain full bandwidth for the rest of the GPUs.
Adaptive Bandwidth: On a trunk link failure, partition will attempt an automatic recovery. If spare trunk links are not available, the partition's GPUs will operate with a lower bandwidth than optimal.
User Action Required: On a trunk link failure, partition will attempt an automatic recovery. If spare trunk links are not available, the partition will go into an unhealthy state which requires user action for recovery. Example actions would be providing additional trunk links or removing GPUs from the partition.
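The three resiliency modes differ only in what happens when automatic recovery finds no spare trunk links. The lookup below restates that difference in code form; it is an illustrative sketch, with `ResiliencyMode`, `NO_SPARE_OUTCOME`, `outcome_on_trunk_failure` and the outcome strings all being hypothetical paraphrases rather than API values.

```python
from enum import Enum


class ResiliencyMode(Enum):
    FULL_BANDWIDTH = "RESILIENCY_MODE_FULL_BANDWIDTH"
    ADAPTIVE_BANDWIDTH = "RESILIENCY_MODE_ADAPTIVE_BANDWIDTH"
    USER_ACTION_REQUIRED = "RESILIENCY_MODE_USER_ACTION_REQUIRED"


# Outcome when no spare trunk links are available, paraphrasing the list above.
NO_SPARE_OUTCOME = {
    ResiliencyMode.FULL_BANDWIDTH:
        "GPUs are excluded so the remaining GPUs keep full bandwidth",
    ResiliencyMode.ADAPTIVE_BANDWIDTH:
        "partition GPUs operate at lower than optimal bandwidth",
    ResiliencyMode.USER_ACTION_REQUIRED:
        "partition becomes unhealthy until the user intervenes",
}


def outcome_on_trunk_failure(mode, spare_trunk_links_available):
    """Describe the documented behaviour after a trunk link failure."""
    if spare_trunk_links_available:
        # Every mode first attempts an automatic recovery.
        return "automatic recovery is attempted"
    return NO_SPARE_OUTCOME[mode]
```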
CreatePartition(CreatePartitionRequest) returns (CreatePartitionResponse)
The client calls this RPC to create a partition. The user provides a list of GPU UIDs or a list of GPU locations. If both lists are empty, the RPC returns NMX_ST_BADPARAM. The user can specify the attributes for the partition to be created.
If the partition creation is requested with the resiliency mode NMX_RESILIENCY_MODE_UNDEFINED, the default resiliency mode specified by MNNVL_DEFAULT_RESILIENCY_MODE in the configuration file is used.
If the partition creation is requested with a multicastGroupsLimit that cannot be satisfied, the RPC returns NMX_ST_RESOURCE_EXHAUSTED. If the value is not a multiple of 4, the RPC returns NMX_ST_BADPARAM.
Partition IDs are allocated from 1 to 0x7FFD. Partition ID 0x7FFE is reserved for the Default Partition. The user can specify a partitionId as part of the creation request. If the specified ID is greater than 0x7FFD, the RPC returns NMX_ST_BADPARAM.
If a partition creation request succeeds, and a later request to create another partition with the same set of parameters is received, the RPC returns NMX_ST_PARTITION_EXISTS.
If a partition cannot be created owing to exhaustion of partition IDs, the RPC returns NMX_ST_RESOURCE_EXHAUSTED.
If a requested GPU is already part of another partition, the RPC returns NMX_ST_RESOURCE_IN_USE.
If a requested GPU does not have a valid UID or a valid location, the RPC returns NMX_ST_RESOURCE_BAD.
If the requested partitionId is already in use by another partition, the RPC returns NMX_ST_PARTITION_ID_IN_USE.
If the requested partitionName is already in use by another partition, the RPC returns NMX_ST_PARTITION_NAME_IN_USE.
If the creation fails due to any other error, the RPC returns NMX_ST_GENERIC_ERROR.
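Several of the CreatePartition failure conditions depend only on the request itself, so a client could check them before sending anything. The sketch below covers just those three conditions; `precheck_create_partition` is a hypothetical helper, not part of the API, and it cannot detect the server-side conditions (existing partitions, GPUs in use, resource exhaustion). How the server treats a multicastGroupsLimit of 0 is not stated in this section, so the sketch only applies the multiple-of-4 rule.

```python
MAX_PARTITION_ID = 0x7FFD  # 0x7FFE is reserved for the Default Partition


def precheck_create_partition(gpu_uids, gpu_locations,
                              multicast_groups_limit=0, partition_id=0):
    """Return the return code the request would earn locally, or None if it passes.

    partition_id == 0 asks the system to auto-generate an ID.
    """
    if not gpu_uids and not gpu_locations:
        return "NMX_ST_BADPARAM"  # both GPU lists empty
    if multicast_groups_limit % 4 != 0:
        return "NMX_ST_BADPARAM"  # limit must be a multiple of 4
    if partition_id > MAX_PARTITION_ID:
        return "NMX_ST_BADPARAM"  # outside the allocatable range 0x1-0x7FFD
    return None
```

A `None` result only means the request is not rejected by these local rules; the RPC can still fail with any of the other return codes.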
Request message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| | string | Partition name | [Optional] Must be unique in domain if provided |
| CreatePartitionRequest.gpuResourceId | Array of GpuResourceId | Either gpuLocation or gpuUid | The GPU can be allocated either by GPU location or GPU unique ID |
| CreatePartitionRequest.attr.resiliencyMode | ResiliencyMode | Resiliency mode | RESILIENCY_MODE_UNDEFINED, RESILIENCY_MODE_FULL_BANDWIDTH, RESILIENCY_MODE_ADAPTIVE_BANDWIDTH, RESILIENCY_MODE_USER_ACTION_REQUIRED |
| CreatePartitionRequest.attr.MulticastGroupsLimit | uint32 | Limit on number of multicast groups in partition | |
| CreatePartitionRequest.partitionId | PartitionId | Partition ID | [Optional] Set to 0 for system to auto-generate ID |
Response message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| CreatePartitionResponse.partitionId | PartitionId | Partition ID | User provided partition ID, or system auto-generated ID (if user did not specify) |
CreatePartitionResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully completed the call
NMX_ST_BADPARAM - Invalid input parameter, requested multicastGroupsLimit is not a multiple of 4, or requested partitionId is greater than 0x7FFD (allowed range is 0x1-0x7FFD)
NMX_ST_GENERIC_ERROR - Call to GFM API has failed due to an internal error
NMX_ST_RESOURCE_EXHAUSTED - Requested multicastGroupsLimit cannot be satisfied, partitionIds are exhausted
NMX_ST_PARTITION_EXISTS - Partition with requested parameters already exists
NMX_ST_RESOURCE_USED_IN_ANOTHER_PARTITION - Requested GPU is already member of another partition
NMX_ST_RESOURCE_USED_IN_THIS_PARTITION - Requested GPU is already member of the same partition
NMX_ST_RESOURCE_BAD - Requested resource does not have a valid UID or a location
NMX_ST_PARTITION_ID_IN_USE - Requested partitionId is already in use
NMX_ST_PARTITION_NAME_IN_USE - Requested partitionName is already in use
NMX_ST_PARTITION_MISWIRED_TRUNKS - miswired trunks are detected
NMX_ST_PARTITION_INSUFFICIENT_TRUNKS - insufficient trunk links to complete the operation
NMX_ST_PARTITION_MISSING_SWITCHES - missing switches in the NVL domain
NMX_ST_PARTITION_SUBNET_ERROR - error occurred in subnet manager
NMX_ST_PARTITION_ROUTE_PROGRAMING_ERROR - partition has route programming error
DeletePartition(DeletePartitionRequest) returns (DeletePartitionResponse)
The client calls this RPC to delete a partition. The user provides a valid partition name, a partition ID, or both as part of the request.
Request message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| DeletePartitionRequest.partitionId | PartitionId | Partition ID | Partition ID is optional if partition name is provided |
| | string | Partition name | Partition name is optional if partition ID is provided |
Response message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| DeletePartitionResponse.partitionId | PartitionId | | |
DeletePartitionResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully completed the call
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
NMX_ST_PARTITION_ID_NOT_IN_USE - Requested partitionId not in use
AddGpusToPartition(UpdatePartitionRequest) returns (UpdatePartitionResponse)
The client calls this RPC to add GPUs to a partition. In a partition that is contained within a chassis boundary, the reroute flag is ignored. In a partition that uses trunk links and hence crosses the chassis boundary, the reroute flag determines whether the trunk link routing is adjusted:
If reroute = true (default), the trunk link routing is adjusted when additional GPUs are added to the partition. This can disrupt applications running in the partition.
If reroute = false, the trunk link routing is not adjusted when additional GPUs are added to the partition.
The RPC also allows a special operation called "reroute" where both the locationList and gpuUidList are empty, and reroute is set to true. This allows the client to adjust the trunk link routing (i.e. "reroute") for the partition to use an optimal number of trunk links. This operation can cause traffic disruption and must be used with caution.
When one of the location list or the gpuUid list is populated and this does not match the type (location or GPU UID) with which the partition currently operates, the RPC returns NMX_ST_NOT_SUPPORTED. The type of the partition can be determined from the PartitionInfo message which is returned from the GetPartitionInfoList() RPC call.
If a requested GPU does not have a valid GUID or a valid location, the RPC returns NMX_ST_RESOURCE_BAD.
If the requested partition ID is not in use, the RPC returns NMX_ST_PARTITION_ID_NOT_IN_USE.
If the GPU to be added is already part of the partition, the RPC returns NMX_ST_RESOURCE_USED_IN_THIS_PARTITION.
If the GPU to be added is already part of another partition, the RPC returns NMX_ST_RESOURCE_USED_IN_ANOTHER_PARTITION.
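How the request fields combine (a reroute-only request versus a normal add, and the partition-type mismatch) can be sketched as a small classifier. `classify_update` and its result labels other than the NMX_ST_ code are hypothetical; the case of two empty lists with reroute set to false is not described in this section, so the sketch reports it as unspecified rather than guessing.

```python
LOCATION_BASED = "PARTITION_TYPE_LOCATION_BASED"
GPUUID_BASED = "PARTITION_TYPE_GPUUID_BASED"


def classify_update(partition_type, location_list=(), gpu_uid_list=(), reroute=True):
    """Classify an UpdatePartitionRequest against the partition's current type."""
    if not location_list and not gpu_uid_list:
        # Both lists empty with reroute=True is the special "reroute" operation.
        # The source does not describe empty lists with reroute=False.
        return "reroute-only" if reroute else "unspecified"
    if location_list and partition_type != LOCATION_BASED:
        return "NMX_ST_NOT_SUPPORTED"  # locations given to a GPU-UID-based partition
    if gpu_uid_list and partition_type != GPUUID_BASED:
        return "NMX_ST_NOT_SUPPORTED"  # GPU UIDs given to a location-based partition
    return "add-gpus"
```

The partition type passed in would come from the partitionType field of PartitionInfo, as noted above.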
Request message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| UpdatePartitionRequest.partitionId | PartitionId | | Partition ID is optional if partition name is provided |
| UpdatePartitionRequest.locationList | Array of GpuLocation | List of GPU locations | Provide only if PartitionType=PARTITION_TYPE_LOCATION_BASED |
| UpdatePartitionRequest.gpuUid | Array of gpuUid | | Provide only if PartitionType=PARTITION_TYPE_GPUUID_BASED |
| | string | Partition name | Partition name is optional if partition ID is provided |
| UpdatePartitionRequest.reroute | Boolean | Reroute partition on update | Default is true, will be deprecated and then removed in future releases. User can request partition reroute by setting locationList to empty array, gpuUid to empty array, and reroute to true |
Response message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| UpdatePartitionResponse.partitionId | PartitionId | | |
UpdatePartitionResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully completed the call
NMX_ST_BADPARAM - Invalid input parameter, requested partitionId does not exist
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
NMX_ST_NOT_SUPPORTED - Requested type(location or gpu UID) does not match the partition type
NMX_ST_RESOURCE_BAD - Requested resource does not have a valid UID or a location
NMX_ST_PARTITION_ID_NOT_IN_USE - Requested partitionId not in use
NMX_ST_RESOURCE_USED_IN_ANOTHER_PARTITION - Requested GPU is already member of another partition
NMX_ST_RESOURCE_USED_IN_THIS_PARTITION - Requested GPU is already member of the same partition
NMX_ST_PARTITION_MISWIRED_TRUNKS - miswired trunks are detected
NMX_ST_PARTITION_INSUFFICIENT_TRUNKS - insufficient trunk links to complete the operation
NMX_ST_PARTITION_MISSING_SWITCHES - missing switches in the NVL domain
NMX_ST_PARTITION_SUBNET_ERROR - error occurred in subnet manager
NMX_ST_PARTITION_ROUTE_PROGRAMING_ERROR - partition has route programming error
RemoveGpusFromPartition(UpdatePartitionRequest) returns (UpdatePartitionResponse)
The client calls this RPC to remove GPUs from a partition. UpdatePartitionRequest and UpdatePartitionResponse messages are the same as in AddGpusToPartition.
If the number of GPUs to be removed is equal to the number of GPUs in the partition, the RPC returns NMX_ST_NOT_SUPPORTED.
If the number of GPUs to be removed is greater than the number of GPUs in the partition, the RPC returns NMX_ST_BADPARAM.
If the GPU to be removed is not part of the partition, the RPC returns NMX_ST_RESOURCE_NOT_IN_USE.
UpdatePartitionResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully completed the call
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
NMX_ST_NOT_SUPPORTED - Requested number of gpus to be removed is equal to the number of gpus in the partition
NMX_ST_BADPARAM - Requested number of gpus to be removed is greater than the number of GPUs in the partition
NMX_ST_RESOURCE_NOT_IN_USE - Requested resource to be removed is not part of the partition
NMX_ST_PARTITION_MISWIRED_TRUNKS - miswired trunks are detected
NMX_ST_PARTITION_INSUFFICIENT_TRUNKS - insufficient trunk links to complete the operation
NMX_ST_PARTITION_MISSING_SWITCHES - missing switches in the NVL domain
NMX_ST_PARTITION_SUBNET_ERROR - error occurred in subnet manager
NMX_ST_PARTITION_ROUTE_PROGRAMING_ERROR - partition has route programming error
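The three removal-specific rules above can be checked on the client side before issuing the RPC. The helper below is a hedged sketch: `precheck_remove` is a hypothetical name, GPUs are represented by plain identifiers, and the order in which the checks run is an assumption, since this section does not say which code takes precedence when several conditions hold.

```python
def precheck_remove(member_gpus, gpus_to_remove):
    """Return the documented return code for an invalid removal, or None.

    member_gpus: GPUs currently in the partition (hypothetical local view).
    gpus_to_remove: GPUs named in the request.
    """
    if len(gpus_to_remove) > len(member_gpus):
        return "NMX_ST_BADPARAM"  # more GPUs requested than the partition holds
    if len(gpus_to_remove) == len(member_gpus):
        return "NMX_ST_NOT_SUPPORTED"  # the request would empty the partition
    members = set(member_gpus)
    if any(gpu not in members for gpu in gpus_to_remove):
        return "NMX_ST_RESOURCE_NOT_IN_USE"  # a GPU is not part of the partition
    return None
```

Because removing every GPU is rejected, a client that wants to free all GPUs of a partition would presumably use DeletePartition instead.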
GetConnCount(GetConnCountRequest) returns (GetConnCountResponse)
The client calls this RPC to get the number of fabric connections. It allows the client to filter the connections based on Attribute, Type and Location.
enum ConnAttr {
    NMX_NVLINK_CONN_ATTR_UNKNOWN = 0;
    NMX_NVLINK_CONN_ATTR_EXPECTED = 1; //!< All expected connections as per FM Topology
    NMX_NVLINK_CONN_ATTR_DISCOVERED = 2; //!< All discovered connections
    NMX_NVLINK_CONN_ATTR_EXPECTED_ACTIVE = 3; //!< All expected active connections as per FM Topology
    NMX_NVLINK_CONN_ATTR_EXPECTED_INACTIVE = 4; //!< All expected inactive connections
    NMX_NVLINK_CONN_ATTR_UNEXPECTED = 5; //!< All unexpected connections which are discovered
}
enum ConnType {
    NMX_NVLINK_CONN_TYPE_UNKNOWN = 0;
    NMX_NVLINK_CONN_TYPE_ALL = 1; //!< Dump the GPU and switch connections
    NMX_NVLINK_CONN_TYPE_GPU = 2; //!< Dump the GPU connections
    NMX_NVLINK_CONN_TYPE_SWITCH = 3; //!< Dump the Switch connections
}
The following combinations of attribute and type illustrate how connection information can be queried:
| Connection category | Connection Attribute | Connection Type |
| --- | --- | --- |
| Access discovered | CONN_ATTR_DISCOVERED | CONN_TYPE_GPU |
| Trunk discovered | CONN_ATTR_DISCOVERED | CONN_TYPE_SWITCH |
| All discovered | CONN_ATTR_DISCOVERED | CONN_TYPE_ALL |
| Access expected | CONN_ATTR_EXPECTED | CONN_TYPE_GPU |
| Access inactive | CONN_ATTR_EXPECTED_INACTIVE | CONN_TYPE_GPU |
| Trunk unexpected | CONN_ATTR_UNEXPECTED | CONN_TYPE_SWITCH |
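The category combinations listed above lend themselves to a lookup that turns a category name into the connAttr and connType filter values. This is a minimal sketch: `CONN_QUERIES` and `conn_filter` are hypothetical names, the dictionary it returns merely stands in for a request message, and expanding the abbreviated names with the NMX_NVLINK_ prefix follows the enum definitions shown for this RPC.

```python
PREFIX = "NMX_NVLINK_"  # prefix used by the ConnAttr and ConnType enum values

# Category -> (attribute, type), copied from the combinations listed above.
CONN_QUERIES = {
    "Access discovered": ("CONN_ATTR_DISCOVERED", "CONN_TYPE_GPU"),
    "Trunk discovered": ("CONN_ATTR_DISCOVERED", "CONN_TYPE_SWITCH"),
    "All discovered": ("CONN_ATTR_DISCOVERED", "CONN_TYPE_ALL"),
    "Access expected": ("CONN_ATTR_EXPECTED", "CONN_TYPE_GPU"),
    "Access inactive": ("CONN_ATTR_EXPECTED_INACTIVE", "CONN_TYPE_GPU"),
    "Trunk unexpected": ("CONN_ATTR_UNEXPECTED", "CONN_TYPE_SWITCH"),
}


def conn_filter(category):
    """Build the connAttr/connType pair for a named connection category."""
    attr, conn_type = CONN_QUERIES[category]
    return {"connAttr": PREFIX + attr, "connType": PREFIX + conn_type}
```

The same pair of filter fields appears in both GetConnCountRequest and GetConnInfoListRequest, so one lookup could serve either call.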
Request message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| GetConnCountRequest.connType | ConnType | Filter based on connection type | Connection Types are NVLINK_CONN_TYPE_GPU, NVLINK_CONN_TYPE_SWITCH. Specify NVLINK_CONN_TYPE_ALL to include both |
| GetConnCountRequest.connAttr | ConnAttr | Filter based on discovered and expected connections | Discovered connections can be Active/Unexpected. Expected connections can be Active/Inactive/Missing. |
| GetConnCountRequest.loc | Location | Filter connections for a specific location | |
Response message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| GetConnCountResponse.numConns | uint32 | Number of connections | |
| GetConnCountResponse.timestamp | string | Timestamp from when the connection database was last updated | |
GetConnCountResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
GetConnInfoList(GetConnInfoListRequest) returns (GetConnInfoListResponse)
The client calls this RPC to get information on the fabric connections. It allows the client to filter the connections based on Attribute, Type and Location.
enum ConnAttr {
    NMX_NVLINK_CONN_ATTR_UNKNOWN = 0;
    NMX_NVLINK_CONN_ATTR_EXPECTED = 1; //!< All expected connections as per FM Topology
    NMX_NVLINK_CONN_ATTR_DISCOVERED = 2; //!< All discovered connections
    NMX_NVLINK_CONN_ATTR_EXPECTED_ACTIVE = 3; //!< All expected active connections as per FM Topology
    NMX_NVLINK_CONN_ATTR_EXPECTED_INACTIVE = 4; //!< All expected inactive connections
    NMX_NVLINK_CONN_ATTR_UNEXPECTED = 5; //!< All unexpected connections which are discovered
}
enum ConnType {
    NMX_NVLINK_CONN_TYPE_UNKNOWN = 0;
    NMX_NVLINK_CONN_TYPE_ALL = 1; //!< Dump the GPU and switch connections
    NMX_NVLINK_CONN_TYPE_GPU = 2; //!< Dump the GPU connections
    NMX_NVLINK_CONN_TYPE_SWITCH = 3; //!< Dump the Switch connections
}
The API returns a list of connections and the state of each connection:
enum ConnState {
    NMX_NVLINK_CONN_STATE_UNKNOWN = 0;
    NMX_NVLINK_CONN_STATE_ACTIVE = 1; //!< Active link or connection state
    NMX_NVLINK_CONN_STATE_INACTIVE = 2; //!< Inactive link or connection state
}
GetConnInfoListRequest message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| GetConnInfoListRequest.connType | ConnType | Filter based on connection type | |
| GetConnInfoListRequest.connAttr | ConnAttr | Filter based on discovered and expected connections | |
| GetConnInfoListRequest.loc | Location | Filter connections for a specific location | |
| GetConnInfoListRequest.numConns | uint32 | Number of connections | |
Response message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| GetConnInfoListResponse.connInfoList | Array of ConnInfo | List of connection information | |
| GetConnInfoListResponse.timestamp | string | Timestamp from when the connection database was last updated | |
GetConnInfoListResponse.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
ConnInfo message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| endPointA | LinkEndPoint | One end of the connection (a device port) | |
| endPointB | LinkEndPoint | Another end of the connection (a device port) | |
| connType | ConnType | Connection type | |
| connState | ConnState | Connection state | NVLINK_CONN_STATE_ACTIVE or NVLINK_CONN_STATE_INACTIVE |
LinkEndPoint message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| loc | Location | Location of the node of the endpoint | |
| switchOrGpuId | uint32 | Location of the device (GPU/switch) of the endpoint within the node | |
| cageNum | uint32 | Cage Number of the endpoint | Valid only for NVLINK_CONN_TYPE_SWITCH, otherwise set to 0 |
| cagePortNum | uint32 | Cage Port Number of the endpoint | Valid only for NVLINK_CONN_TYPE_SWITCH, otherwise set to 0 |
| cagePortSplitNum | uint32 | Cage Split Port Number of the endpoint | Valid only for NVLINK_CONN_TYPE_SWITCH, otherwise set to 0 |
| portNum | uint32 | Port Number of the endpoint on the device (GPU/switch) | |
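Since the cage fields of a LinkEndPoint are meaningful only for switch connections, code that prints or compares endpoints should consult connType before reading them. The sketch below shows one way to do that; the `LinkEndPoint` dataclass is a hypothetical local mirror of the message (with `loc` simplified to a string rather than a Location message), and `describe_endpoint` and its output format are illustrative only.

```python
from dataclasses import dataclass

CONN_TYPE_SWITCH = "NMX_NVLINK_CONN_TYPE_SWITCH"


@dataclass
class LinkEndPoint:
    """Hypothetical local mirror of the LinkEndPoint message."""
    loc: str            # simplified: the real field is a Location message
    switchOrGpuId: int
    portNum: int
    cageNum: int = 0
    cagePortNum: int = 0
    cagePortSplitNum: int = 0


def describe_endpoint(ep, conn_type):
    """Format an endpoint, including cage fields only where they are valid."""
    base = f"{ep.loc} dev {ep.switchOrGpuId} port {ep.portNum}"
    if conn_type == CONN_TYPE_SWITCH:
        # Cage fields are valid only for switch connections; otherwise they are 0.
        return f"{base} cage {ep.cageNum}/{ep.cagePortNum}/{ep.cagePortSplitNum}"
    return base
```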
GetConnInfoCombined(GetConnInfoCombinedRequest) returns (ConnInfoCombined)
The client calls this RPC to get information on the fabric trunk connections that are mis-wired.
Response message:
| Message/Parameter | Type | Description | Notes |
| --- | --- | --- | --- |
| ConnInfoCombined.unexpectedConnList | Array of ConnInfo | List of mis-wired trunk connections | |
ConnInfoCombined.serverHeader.returnCode:
NMX_ST_SUCCESS - successfully completed the call
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call to GFM API has failed
FactoryReset(FactoryResetRequest) returns (ReturnCode)
The client calls this RPC to perform a factory reset of the NMX-Controller. After this call completes, the NMX-Controller configuration and state are as initially delivered from the factory.
ReturnCode:
NMX_ST_SUCCESS - successfully completed the call
NMX_ST_BADPARAM - Invalid input parameter
NMX_ST_GENERIC_ERROR - Call has failed