NMX-Controller (NMX-C) Documentation v1.0.0

gRPC API

Note

To ensure compatibility, use the NVIDIA-provided NMX-Controller gRPC proto file to implement the gRPC client.

Property

Type

Description

Notes

partitionId

uint32

The partition APIs accept partitionId to identify a specific partition instance.

Values of 1 to 32766

Note: The value of 32766 is reserved for default partition

partitionName

String

The partition APIs accept partitionName to identify a specific partition instance.

partitionName (if provided) must be unique in the domain.

Values up to 255 ASCII characters

Note: The value of "Default Partition" is reserved for default partition

returnCode

enum (ST_ReturnCode)

The value of NMX_ST_SUCCESS indicates the client request was processed successfully by the server. All other values indicate the reason the client request has failed.

In an unsolicited server notification, the returnCode value is always NMX_ST_SUCCESS.

GatewayId

String

Uniquely identifies the client establishing the gRPC connection.

ServerHeader

A message sent in every response and notification from the server. The message includes:

  • domain_uuid – string, unique identifier for this NVLink domain

  • app_uuid - string, unique identifier for the NMX-Controller application on this NVLink domain

  • app_ver - string, NMX-Controller version

  • returnCode

KeyValPair

A key (string) and its associated value (string).

ConfigKey

An identifier of a static configuration parameter. Includes configFileName (string) and key (string).

Context

String

A string that should be set to "" (empty string). This string is a placeholder for future usage.

Location

A message that provides details on a node location:

  • chassisId – chassis identifier (uint64) from 1, sequential to number of chassis in the domain topology

  • slotId – chassis physical slot number of the compute/switch tray, (uint64) from 1

  • hostId – node enumeration within the compute/switch tray, (uint64) from 1

LocationInfo

A message that provides details on a node location with additional physical properties:

  • location – chassisId, slotId and hostId

  • chassisSerialNumber – (string) 13-character chassis serial number from one of the backplane cable cartridges

  • trayIndex – index of the compute/switch tray within the compute/switch trays group in the chassis, (uint64) from 0

gpuLocation

A message that provides details on a GPU location:

  • loc - Node Location message

  • gpuId - GPU enumeration within the node, from 1
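The partitionId and partitionName constraints listed in the property table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the NMX-C proto; the function names and the `allow_default` flag are hypothetical, and the sketch treats an empty name as invalid.

```python
DEFAULT_PARTITION_ID = 32766                  # reserved for the default partition
DEFAULT_PARTITION_NAME = "Default Partition"  # reserved for the default partition
MAX_PARTITION_NAME_LEN = 255                  # documented limit, ASCII characters


def check_partition_id(partition_id: int, allow_default: bool = False) -> bool:
    """Return True if partition_id is in 1..32766 (32766 only when allow_default)."""
    if not 1 <= partition_id <= DEFAULT_PARTITION_ID:
        return False
    return allow_default or partition_id != DEFAULT_PARTITION_ID


def check_partition_name(name: str, existing: set) -> bool:
    """Return True if name is non-empty ASCII, short enough, not reserved, and unused."""
    if not name or len(name) > MAX_PARTITION_NAME_LEN or not name.isascii():
        return False
    return name != DEFAULT_PARTITION_NAME and name not in existing
```

Passing the set of names already used in the domain as `existing` covers the uniqueness requirement on the client side; the server remains the authority on whether a name is accepted.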

Hello(ClientHello) returns (ServerHello)

The first RPC that the client must send after the gRPC connection is established.

Request:

Message/Parameter

Type

Description

Notes

ClientHello.gatewayId

string

Client identifier

ClientHello.major_version

ProtoMsgMajorVersion

Major version

ClientHello.minor_version

ProtoMsgMinorVersion

Minor version

Response:

Message/Parameter

Type

Description

Notes

ServerHello.serverHeader

ServerHeader

Base server info and return code

ServerHello.components_ver

Array of KeyValPair

List of NMX-Controller components, and their version

ServerHello.capabilities

Array of string

List of NMX-Controller capabilities

ServerHello.host_os_details

String

NMX-Controller host OS details

ClientHello.major_version

ProtoMsgMajorVersion

Major version

ClientHello.minor_version

ProtoMsgMinorVersion

Minor version

ServerHello.serverHeader.returnCode:

  • NMX_ST_SUCCESS - Client hello was successful

  • NMX_ST_BADPARAM - Missing or invalid gatewayId (e.g. empty string)

  • NMX_ST_VERSION_MISMATCH - Major version of client and server protos do not match

Server closes the gRPC connection if:

  • Client ProtoMsgMajorVersion is not the same as server

  • Client attempts to call any RPC before successful completion of Hello() RPC
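The Hello() rules above can be modeled as a small decision function, which is handy for unit-testing client logic without a server. This is a sketch under assumptions: the function names are hypothetical, and the order in which the server checks gatewayId and the major version is not stated in this document.

```python
from enum import Enum


class HelloResult(Enum):
    NMX_ST_SUCCESS = "NMX_ST_SUCCESS"
    NMX_ST_BADPARAM = "NMX_ST_BADPARAM"
    NMX_ST_VERSION_MISMATCH = "NMX_ST_VERSION_MISMATCH"


def expected_hello_result(gateway_id: str, client_major: int, server_major: int) -> HelloResult:
    """Predict the returnCode implied by the documented Hello() rules."""
    if not gateway_id:                      # missing or empty gatewayId
        return HelloResult.NMX_ST_BADPARAM
    if client_major != server_major:        # major versions must match exactly
        return HelloResult.NMX_ST_VERSION_MISMATCH
    return HelloResult.NMX_ST_SUCCESS


def may_call_other_rpcs(result: HelloResult) -> bool:
    """Other RPCs are only safe after Hello() has completed successfully."""
    return result is HelloResult.NMX_ST_SUCCESS
```

A client would typically gate every other RPC on `may_call_other_rpcs`, since the server closes the connection when an RPC is attempted before a successful Hello().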

Subscribe(SubscribeRequest) returns (stream ServerNotification)

The client calls this RPC to subscribe to asynchronous push notifications. SubscribeRequest carries the gatewayId and the notifyOnSelfChange flag, which should be set to false.

Asynchronous push notifications:

Message/Parameter

Type

Description

Notes

ServerNotification.subscriptionResponse

SubscriptionResponse

Confirmation of subscription

Sent only to requesting client

ServerNotification.staticConfigResponse

StaticConfigResponse

Notification of static config change

Includes changed items

Sent to all subscribed clients except for the requesting client

ServerNotification.CreatePartitionResponse

CreatePartitionResponse

Notification of partition creation

includes partitionId

Sent to all subscribed clients except for the requesting client

ServerNotification.DeletePartitionResponse

DeletePartitionResponse

Notification of partition deletion

includes partitionId

Sent to all subscribed clients except for the requesting client

ServerNotification.UpdatePartitionResponse

UpdatePartitionResponse

Notification of partition configuration change

includes partitionId

Sent to all subscribed clients except for the requesting client

ServerNotification.fmEvent.fmEventControlPlaneStateChange

FmEventControlPlaneStateChange

Notification of change in control plane state

Sent to all subscribed clients

ServerNotification.fmEvent.fmEventTopologyChange

FmEventTopologyChange

Notification of change in discovered topology

Sent to all subscribed clients

ServerNotification.fmEvent.fmEventPartitionChange

FmEventPartitionChange

Notification of change in partition health

includes partitionId

Sent to all subscribed clients

ServerNotification.healthStateChanged

HealthStateChanged

Notification of change in NMX-C health

Sent to all subscribed clients

ServerNotification.serverHeader.returnCode:

  • NMX_ST_SUCCESS - Client subscription was successful

  • NMX_ST_BADPARAM - Missing or invalid gatewayId
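A subscriber usually routes each ServerNotification by the payload field that is populated. The sketch below illustrates that routing with plain dictionaries standing in for decoded messages; it does not use the generated proto classes, and `notification_kind` and `handle` are hypothetical helper names.

```python
def notification_kind(notification: dict) -> str:
    """Return the name of the payload field carried by a notification."""
    for field, payload in notification.items():
        if field == "serverHeader":
            continue                        # present in every notification
        if field == "fmEvent":
            (event_name,) = payload.keys()  # exactly one nested event is expected
            return f"fmEvent.{event_name}"
        return field
    raise ValueError("notification carries no payload")


def handle(notification: dict, handlers: dict):
    """Call the handler registered for this notification kind, if any."""
    handler = handlers.get(notification_kind(notification))
    return handler(notification) if handler else None
```

For example, registering a handler under `"fmEvent.fmEventTopologyChange"` lets the client refresh its cached topology when that event arrives, while unknown kinds are ignored.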

GetStaticConfig(GetStaticConfigRequest) returns (StaticConfigResponse)

The client calls this RPC to read the current static configuration. The client may request either full files or individual parameters from the files.

Request:

Message/Parameter

Type

Description

Notes

GetStaticConfigRequest.configKeys

Array of configKey

List of configuration parameters

Either configKeys or configFiles should be provided

GetStaticConfigRequest.configFiles

Array of configFileName

List of configuration files

Either configKeys or configFiles should be provided

Supported values for configFileName:

  • sm_config

  • fm_config

  • rdm_config

  • chassis_mapping

Response:

Message/Parameter

Type

Description

Notes

staticConfig.configKeyVals

Array of configKeyVals

List of configuration parameters and their values

Either configKeyVals or configFileContents is provided, depending on request

staticConfig.configFileContents

Array of ConfigFileContent

List of configuration files and their content

Either configKeyVals or configFileContents is provided, depending on request

StaticConfigResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - Client request for static configuration was successful

  • NMX_ST_BADPARAM - Missing or invalid input parameter (e.g. empty strings)

Example of fm_config file content:


    #    Description: Determine whether a default partition needs to be created
    #    Possible Values:
    #       0             - No partitions are created during GFM initialization. GFM disables routing until an API request
    #                     to create a partition is successful.
    #       1(default)    - Creates a default partition during GFM initialization. GFM creates the partition to include
    #                     all GPUs in the topology and enables routing so that all GPUs can communicate to each other.
    MNNVL_ENABLE_DEFAULT_PARTITION=1

    #    Description: Determine resiliency mode for default partition and when it is unspecified on a user partition
    #    Possible Values:
    #       1             - resiliency mode RESILIENCY_MODE_FULL_BANDWIDTH
    #       2(default)    - resiliency mode RESILIENCY_MODE_ADAPTIVE_BANDWIDTH
    #       3             - resiliency mode RESILIENCY_MODE_USER_ACTION_REQ
    MNNVL_DEFAULT_RESILIENCY_MODE=2

    #    Description: Set type of default partition (specified by location or gpuuid)
    #    Possible Values:
    #       1             - Creates a default partition using locations of GPUs
    #       2(default)    - Creates a default partition using GPU UIDs
    MNNVL_DEFAULT_PARTITION_TYPE=2

    MNNVL_TOPOLOGY=gb200_nvl36r1_c2g4_topology


SetStaticConfig(SetStaticConfigRequest) returns (ReturnCode)

The client calls this RPC to update the static configuration. The client may request to update either full files or individual parameters in the files.

Request:

Message/Parameter

Type

Description

Notes

SetStaticConfigRequest.staticConfig.configKeyVals

Array of configKey

List of configuration parameters and their values

Either configKeyVals or configFileContents should be provided

SetStaticConfigRequest.staticConfig.configFileContents

Array of configFileContent

List of configuration files and their content

Either configKeyVals or configFileContents should be provided

Supported values for configFileName:

  • sm_config

  • fm_config

  • rdm_config

  • chassis_mapping

All config files comply with the INI file format (https://en.wikipedia.org/wiki/INI_file), with the following restrictions:

  • format “key = value”

  • No support for sections

  • No support for hierarchy

  • Case insensitive

  • Support for comments

  • No support for duplicate names

  • Support for Quoted values

  • Support for Line continuation

  • Support for Escape characters
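To make the restrictions above concrete, here is a minimal sketch of a reader for this flat "key = value" format. It is an illustration, not the NMX-C parser: it assumes "#" starts a comment (as in the fm_config example), assumes case insensitivity applies to keys, and deliberately omits quoted values, line continuation, and escape characters. The function name is hypothetical.

```python
def parse_flat_config(text: str) -> dict:
    """Parse 'key = value' lines; keys are lower-cased and duplicates are rejected."""
    result = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                        # blank line or comment
        key, sep, value = line.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"line {line_no}: expected 'key = value'")
        key = key.strip().lower()           # keys compared case-insensitively
        if key in result:
            raise ValueError(f"line {line_no}: duplicate key {key!r}")
        result[key] = value.strip()
    return result
```

Running a file through such a check before calling SetStaticConfig() can catch duplicate keys or malformed lines early, before the server rejects the request.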

The fm_config file must include the MNNVL_TOPOLOGY parameter with one of the following values:

  • gb200_nvl36r1_c2g4_topology - 36 GPUs, Single chassis, 2 CPUs, 4 GPUs per compute tray

  • gb200_nvl36r1_c2g2_topology - 36 GPUs, Single chassis, 2 CPUs, 2 GPUs per compute tray

  • gb200_nvl72r1_c2g4_topology - 72 GPUs, Single chassis, 2 CPUs, 4 GPUs per compute tray

  • gb200_nvl72r2_c2g4_topology - 72 GPUs, Two chassis, 2 CPUs, 4 GPUs per compute tray

  • gb200_nvl72r2_c2g2_topology - 72 GPUs, Two chassis, 2 CPUs, 2 GPUs per compute tray
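The topology values above can be kept in a lookup table so that a client validates MNNVL_TOPOLOGY before sending an fm_config update. This is a sketch: the `TopologyShape` type and `check_topology` helper are hypothetical, and the numbers simply restate the list above.

```python
from typing import NamedTuple


class TopologyShape(NamedTuple):
    gpus: int            # total GPUs in the domain
    chassis: int         # number of chassis
    cpus: int            # CPUs, as listed for the topology
    gpus_per_tray: int   # GPUs per compute tray


TOPOLOGIES = {
    "gb200_nvl36r1_c2g4_topology": TopologyShape(36, 1, 2, 4),
    "gb200_nvl36r1_c2g2_topology": TopologyShape(36, 1, 2, 2),
    "gb200_nvl72r1_c2g4_topology": TopologyShape(72, 1, 2, 4),
    "gb200_nvl72r2_c2g4_topology": TopologyShape(72, 2, 2, 4),
    "gb200_nvl72r2_c2g2_topology": TopologyShape(72, 2, 2, 2),
}


def check_topology(config: dict) -> TopologyShape:
    """Return the shape for config['MNNVL_TOPOLOGY'], or raise ValueError."""
    name = config.get("MNNVL_TOPOLOGY")
    if name is None:
        raise ValueError("fm_config must include MNNVL_TOPOLOGY")
    if name not in TOPOLOGIES:
        raise ValueError(f"unsupported MNNVL_TOPOLOGY value: {name!r}")
    return TOPOLOGIES[name]
```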

The chassis_mapping file maps each chassis ID to a chassis serial number. The following example shows the file format:

chassisId1 ABC123ABC123A

chassisId2 XYZ987XYZ987X
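A client can sanity-check a chassis_mapping file before uploading it. The sketch below assumes that each line is a `chassisId<N>` token followed by whitespace and a 13-character serial number, as the example suggests; the exact grammar is not specified in this document, and `parse_chassis_mapping` is a hypothetical helper.

```python
import re

# Assumed line shape, inferred from the example: "chassisId<N> <serial>"
_LINE = re.compile(r"^chassisId(\d+)\s+(\S+)$")
SERIAL_LEN = 13  # chassis serial numbers are 13 characters long


def parse_chassis_mapping(text: str) -> dict:
    """Return {chassis_id: serial}, rejecting malformed or duplicate entries."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized line: {line!r}")
        chassis_id, serial = int(match.group(1)), match.group(2)
        if len(serial) != SERIAL_LEN:
            raise ValueError(f"serial {serial!r} is not {SERIAL_LEN} characters")
        if chassis_id in mapping or serial in mapping.values():
            raise ValueError(f"duplicate entry: {line!r}")
        mapping[chassis_id] = serial
    return mapping
```

Checking for duplicate serial numbers locally mirrors the CONFIG_ERROR_DUPLICATE_CHASSIS_SERIAL_NUMBER condition that the control plane reports.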

ReturnCode:

  • NMX_ST_SUCCESS - Client request for static configuration was successful

  • NMX_ST_BADPARAM - Missing or invalid input parameter

GetDomainProperties(GetDomainPropertiesRequest) returns (DomainProperties)

Domain properties provide an overview of the current topology that NMX-C manages, including the maximum number of expected resources (including but not limited to GPUs, switches, compute nodes, switch nodes, partitions, and NVLinks) in the domain.

The client calls this RPC to get static properties of the domain. The GetDomainPropertiesRequest has the context and gatewayId.

Response:

Message/Parameter

Type

Description

Notes

DomainProperties.serverHeader

ServerHeader

Base server info and return code

DomainProperties.maxComputeNodes

uint32

Maximum number of Compute Nodes in the NVLink Domain

DomainProperties.maxComputeNodesPerChassis

uint32

Maximum number of Compute Nodes in a chassis

Number of chassis in the NVLink Domain = DomainProperties.maxComputeNodes / DomainProperties.maxComputeNodesPerChassis

DomainProperties.maxGpusPerComputeNode

uint32

Maximum number of GPUs in a Compute Node

DomainProperties.maxGpuNvLinks

uint32

Maximum number of NVLinks in a GPU

DomainProperties.lineRateMBps

uint32

Maximum line rate in MBps of an NVLink

DomainProperties.maxSwitchNodes

uint32

Maximum number of Switch Nodes in the NVLink Domain

DomainProperties.maxSwitchNodesPerChassis

uint32

Maximum number of Switch Nodes in a chassis

DomainProperties.maxSwitchesPerSwitchNode

uint32

Maximum number of Switches in a Switch Node

DomainProperties.maxSwitchNvLinks

uint32

Maximum number of NVLinks in a Switch

DomainProperties.maxNumPartitions

uint32

Maximum number of partitions that can be created in the NVLink Domain

DomainProperties.maxNumAlids

uint32

Maximum number of Alids for a GPU

DomainProperties.maxMulticastGroups

uint32

Maximum number of Multicast Groups available in the NvLink Domain

DomainProperties.maxNumPorts

uint32

Total aggregate number of ports across switches and GPUs in the NVLink Domain

DomainProperties.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed
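The chassis count noted for maxComputeNodesPerChassis in the table above is a simple division of two DomainProperties fields. A minimal sketch, with a hypothetical helper name and made-up input values in the usage note:

```python
def chassis_count(max_compute_nodes: int, max_compute_nodes_per_chassis: int) -> int:
    """Number of chassis = maxComputeNodes / maxComputeNodesPerChassis."""
    if max_compute_nodes_per_chassis <= 0:
        raise ValueError("maxComputeNodesPerChassis must be positive")
    count, remainder = divmod(max_compute_nodes, max_compute_nodes_per_chassis)
    if remainder:
        # The documented formula implies a whole number of chassis.
        raise ValueError("maxComputeNodes is not a multiple of maxComputeNodesPerChassis")
    return count
```

With illustrative values of 36 compute nodes and 18 compute nodes per chassis, the helper returns 2.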

GetDomainStateInfo(GetDomainStateInfoRequest) returns (DomainStateInfo)

The client calls this RPC to get the dynamic properties of the domain. The GetDomainStateInfoRequest has the context and gatewayId.

Response:

Message/Parameter

Type

Description

Notes

DomainStateInfo.serverHeader

ServerHeader

Base server info and return code

DomainStateInfo.controlPlaneState

ControlPlaneState

State of the domain control plane

DomainStateInfo.availableMulticastGroups

uint32

Number of available multicast groups

DomainStateInfo.configStatusDescription

string

Additional details of the control plane state

DomainStateInfo.nmxControllerHealth

NmxControllerHealth

NMX-Controller service health

DomainStateInfo.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

The following are the values of controlPlaneState:


    NMX_CONTROL_PLANE_STATE_UNDEFINED = 0;     //!< Control plane state is undefined
    NMX_CONTROL_PLANE_STATE_OFFLINE = 1;       //!< Control plane state is offline
    NMX_CONTROL_PLANE_STATE_STANDBY = 2;       //!< Control plane state is standby, NvLink Domain is not operational
    NMX_CONTROL_PLANE_STATE_CONFIGURED = 3;    //!< Control plane state is configured, NvLink Domain is operational
    NMX_CONTROL_PLANE_STATE_TIMEOUT = 4;       //!< Control plane state is timeout
    NMX_CONTROL_PLANE_STATE_ERROR = 5;         //!< Control plane state is error, user provided an invalid configuration
    NMX_CONTROL_PLANE_STATE_UNCONFIGURED = 6;  //!< Control plane state is unconfigured, pending user provided configuration

The control plane states and their associated configuration status description strings are explained below:

  • NMX_CONTROL_PLANE_STATE_UNCONFIGURED - Pending required FM configuration.

    • CONFIG_PENDING_UUID - Pending NVLink Domain UUID. FM waits indefinitely until set.

    • CONFIG_PENDING_TOPOLOGY - Pending MNNVL_TOPOLOGY config. FM waits indefinitely until set.

    • CONFIG_PENDING_CHASSIS_ID_MAPPING - Pending chassis Id mapping. FM waits for GFM_WAIT_TIMEOUT_SEC until set. If not set, FM allocates the mapping during the initial resource discovery.

    • CONFIG_RECEIVED - FM received all the required configuration.

  • NMX_CONTROL_PLANE_STATE_ERROR - Error validating FM configuration. Restart NMX-C after the configuration error is fixed.

    • CONFIG_ERROR_INCORRECT_TOPOLOGY_FILE - Encountered error while processing the topology file specified in MNNVL_TOPOLOGY

    • CONFIG_ERROR_CHASSIS_ID_MAPPING_COUNT - Detected mismatch between number of entries in the chassis Id mapping specified and expected number of chassis read from the topology file specified in MNNVL_TOPOLOGY

    • CONFIG_ERROR_CHASSIS_ID_MAPPING_OUT_OF_RANGE - Detected chassis Id value outside of the allowed range. Allowed range is 1 to n, where n is the number of chassis in the NVLink Domain.

    • CONFIG_ERROR_DUPLICATE_CHASSIS_SERIAL_NUMBER - Detected duplicate chassis serial number in the chassis Id mapping

    • CONFIG_ERROR_MISSING_CHASSIS - Detected fewer than expected chassis serial numbers during the initial resource discovery.

    • CONFIG_ERROR_ADDITIONAL_CHASSIS_DETECTED - Detected more than expected chassis serial number(s) during the initial resource discovery

  • NMX_CONTROL_PLANE_STATE_DEGRADED - FM detected a misconfiguration. The NVLink Domain continues to work with limited capability. Fixing the misconfiguration restores full functionality.

    • CONFIG_ERROR_MISWIRED_TRUNK_PORTS - Incorrect/mis-wired trunk connection(s) detected. Use the GetConnInfoList() gRPC API to get details.

  • NMX_CONTROL_PLANE_STATE_CONFIGURED - FM completed configuration validation and initialization

    • CONFIG_DONE - FM completed configuration validation and initialization
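The state and status pairs listed above lend themselves to a lookup table on the client. The sketch below restates that list; it assumes configStatusDescription carries exactly these tokens, which this document does not guarantee, and the helper names are hypothetical.

```python
STATUSES_BY_STATE = {
    "NMX_CONTROL_PLANE_STATE_UNCONFIGURED": [
        "CONFIG_PENDING_UUID",
        "CONFIG_PENDING_TOPOLOGY",
        "CONFIG_PENDING_CHASSIS_ID_MAPPING",
        "CONFIG_RECEIVED",
    ],
    "NMX_CONTROL_PLANE_STATE_ERROR": [
        "CONFIG_ERROR_INCORRECT_TOPOLOGY_FILE",
        "CONFIG_ERROR_CHASSIS_ID_MAPPING_COUNT",
        "CONFIG_ERROR_CHASSIS_ID_MAPPING_OUT_OF_RANGE",
        "CONFIG_ERROR_DUPLICATE_CHASSIS_SERIAL_NUMBER",
        "CONFIG_ERROR_MISSING_CHASSIS",
        "CONFIG_ERROR_ADDITIONAL_CHASSIS_DETECTED",
    ],
    "NMX_CONTROL_PLANE_STATE_DEGRADED": ["CONFIG_ERROR_MISWIRED_TRUNK_PORTS"],
    "NMX_CONTROL_PLANE_STATE_CONFIGURED": ["CONFIG_DONE"],
}

# Reverse index: status string -> control plane state name
STATE_OF_STATUS = {
    status: state
    for state, statuses in STATUSES_BY_STATE.items()
    for status in statuses
}


def restart_required(status: str) -> bool:
    """ERROR-state statuses need an NMX-C restart after the configuration is fixed."""
    return STATE_OF_STATUS.get(status) == "NMX_CONTROL_PLANE_STATE_ERROR"
```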

The following are the values of nmxControllerHealth:


    NMX_CONTROLLER_HEALTH_UNKNOWN = 0;
    NMX_CONTROLLER_HEALTH_HEALTHY = 1;
    NMX_CONTROLLER_HEALTH_DEGRADED = 2;
    NMX_CONTROLLER_HEALTH_UNHEALTHY = 3;


GetTopologyInfo returns (FmTopologyInfo)

The client calls this RPC to receive information on the currently discovered topology:

  • Devices (GPUs and Switches) and their properties and state (DeviceTopoInfo contains one of SwitchTopoInfo or GpuTopoInfo)

  • Device ports and their properties and state (PortTopoInfo)

  • Devices connectivity information

Response:

Message/Parameter

Type

Description

Notes

FmTopologyInfo.serverHeader

ServerHeader

Base server info and return code

FmTopologyInfo.deviceTopoInfo

Array of DeviceTopoInfo

List of discovered devices

FmTopologyInfo.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

DeviceTopoInfo message:

Message/Parameter

Type

Description

Notes

switchTopoInfo

SwitchTopoInfo

Details of switch device

Either switchTopoInfo or gpuTopoInfo is provided, depending on the discovered device type

gpuTopoInfo

GpuTopoInfo

Details of GPU device

Either switchTopoInfo or gpuTopoInfo is provided, depending on the discovered device type

GpuTopoInfo message:

Message/Parameter

Type

Description

Notes

loc

LocationInfo

Node location

topologyId

uint64

Indicates switch tray model

A value of 128 indicates a scale-out GB200 NVL switch tray; a value of 129 indicates a non-scale-out GB200 NVL switch tray

deviceUid

uint64

Device unique identifier

deviceId

uint32

Device enumeration within the node

From 1

numPorts

uint64

Total number of ports

systemUid

uint64

Node unique identifier

vendorId

uint32

Device vendor ID

devicePcieId

uint64

Device PCIe ID

description

string

Device description

partitionId

Array of PartitionId

List of partitions the device is associated with

deviceHealth

GpuHealth

NVLink Health of the GPU

portTopoInfo

Array of PortTopoInfo

List of device ports

aLids

Array of uint64

List of device labels for internal routing

SwitchTopoInfo message:

Message/Parameter

Type

Description

Notes

loc

LocationInfo

Node location

topologyId

uint64

Indicates switch tray model

deviceUid

uint64

Device unique identifier

deviceId

uint32

Device enumeration within the node

numPorts

uint64

Total number of ports

systemUid

uint64

Node unique identifier

vendorId

uint32

Device vendor ID

devicePcieId

uint64

Device PCIe ID

description

string

Device description

partitionId

Array of PartitionId

List of partitions the device is associated with

deviceHealth

SwitchHealth

NVLink Health of the Switch

portTopoInfo

Array of PortTopoInfo

List of device ports

PortTopoInfo message:

Message/Parameter

Type

Description

Notes

portType

PortType

Type of device port

  • For GPU: PORT_TYPE_GPU

  • For a switch: PORT_TYPE_SWITCH_ACCESS (port towards GPU), PORT_TYPE_SWITCH_TRUNK (relevant only for scale-out switch trays, scale out port), PORT_TYPE_FNM (internal port, for internal management traffic)

portUid

uint64

Port identifier

  • For GPU: Unique port identifier

  • For a switch: Equal to device unique identifier

portNum

uint64

Port number of device

From 1

peerPortDeviceUid

uint64

Peer device unique identifier

peerPortNum

uint64

Peer port number on peer device

From 1

physicalState

PhysicalPortState

NVLink port physical state

logicalState

LogicalPortState

NVLink port logical state

subnetPrefix

uint64

NVLink port routing subnet

isSdnPort

boolean

NVLink management port indicator

partitionIdList

Array of PartitionId

List of partitions the port is associated with

cageNum

uint32

Front panel cage number

Provided only when portType=PORT_TYPE_SWITCH_TRUNK

From 1

cagePortNum

uint32

Front panel port number in cage

Provided only when portType=PORT_TYPE_SWITCH_TRUNK

From 1

cageSplitPortNum

uint32

Front split port number in cage

Provided only when portType=PORT_TYPE_SWITCH_TRUNK

From 1

baseLid

uint64

Base label for routing

Provided only when portType=PORT_TYPE_GPU

systemPortNum

uint64

Port number per tray

Provided only when portType=PORT_TYPE_SWITCH_ACCESS

From 1

computePortNum

uint64

Port number per GPU

Provided only when portType=PORT_TYPE_GPU

From 0

containAndDrain

bool

Indication that the port is in contain and drain state

A value of TRUE indicates an active contain and drain state

rail

uint32

Rail of the port

plane

uint32

Plane of the port

linkRateMbps

uint32

Rate of the port link in Mbits/sec
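Because each PortTopoInfo names its peer device and peer port, the device connectivity can be rebuilt from the port lists. The sketch below uses a small hypothetical `Port` record instead of the generated message classes, and assumes a link may be reported from both ends, so it normalizes each pair before de-duplicating. How unconnected ports are reported is not covered here; the caller is assumed to filter them out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    device_uid: int       # deviceUid of the owning device
    port_num: int         # portNum on the owning device
    peer_device_uid: int  # peerPortDeviceUid
    peer_port_num: int    # peerPortNum


def unique_links(ports):
    """Return sorted, de-duplicated links as ((uid, port), (uid, port)) pairs."""
    links = set()
    for port in ports:
        a = (port.device_uid, port.port_num)
        b = (port.peer_device_uid, port.peer_port_num)
        links.add((min(a, b), max(a, b)))  # same link seen from either end
    return sorted(links)
```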


GetComputeNodeCount(GetComputeNodeCountRequest) returns (GetComputeNodeCountResponse)

The client calls this RPC to get the number of compute nodes. It allows the client to filter the compute nodes based on Attribute, Health and Location.

Attribute reflects the allocation status of a compute node to a partition:

  • NMX_COMPUTE_NODE_ATTR_ALL = 1 - all compute nodes in the NvLink Domain irrespective of the allocation status

  • NMX_COMPUTE_NODE_ATTR_FREE = 2 - compute nodes in the NvLink Domain that are not allocated to any partitions

  • NMX_COMPUTE_NODE_ATTR_FULLY_ALLOCATED = 3 - compute nodes in the NvLink Domain where all of its GPUs are allocated to partitions

  • NMX_COMPUTE_NODE_ATTR_PARTIALLY_ALLOCATED = 4 - compute nodes in the NvLink Domain where one or more (but not all) of its GPUs are allocated to partitions

Compute node health:

  • NMX_COMPUTE_NODE_HEALTH_HEALTHY = 1 - Fully healthy

  • NMX_COMPUTE_NODE_HEALTH_DEGRADED = 2 - Some GPUs are degraded to NO_NVLINK

  • NMX_COMPUTE_NODE_HEALTH_UNHEALTHY = 3 - Unable to participate in NVLink

Location and health are optional filters.

Request:

Message/Parameter

Type

Description

Notes

GetComputeNodeCountRequest.attr

ComputeNodeAttr

Filter based on partition allocation status of the compute node

GetComputeNodeCountRequest.chassisId

uint64

Chassis ID

[Optional] Set to 0 when not used

GetComputeNodeCountRequest.nodeHealth

ComputeNodeHealth

NVLink health of the compute node

[Optional] Set to COMPUTE_NODE_HEALTH_UNKNOWN when not used

Response message:

Message/Parameter

Type

Description

Notes

GetComputeNodeCountResponse.numNodes

uint32

Number of compute nodes matching the filters

GetComputeNodeCountResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed
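The filter semantics of GetComputeNodeCount can be illustrated with an in-memory model. This is a sketch of how the documented filters combine, not the server implementation: the `Node` record is hypothetical, and the value 0 for COMPUTE_NODE_HEALTH_UNKNOWN is an assumption, since only values 1 to 3 are listed above.

```python
from dataclasses import dataclass

ATTR_ALL, ATTR_FREE, ATTR_FULLY_ALLOCATED, ATTR_PARTIALLY_ALLOCATED = 1, 2, 3, 4
HEALTH_UNKNOWN = 0  # assumed numeric value; means "no health filter"


@dataclass
class Node:
    chassis_id: int
    num_gpus: int
    allocated_gpus: int  # GPUs of this node currently assigned to partitions
    health: int


def node_attr(node: Node) -> int:
    """Classify a node by how many of its GPUs are allocated."""
    if node.allocated_gpus == 0:
        return ATTR_FREE
    if node.allocated_gpus == node.num_gpus:
        return ATTR_FULLY_ALLOCATED
    return ATTR_PARTIALLY_ALLOCATED


def count_nodes(nodes, attr, chassis_id=0, health=HEALTH_UNKNOWN) -> int:
    """Count nodes matching attr, plus the optional chassis and health filters."""
    return sum(
        1
        for n in nodes
        if (attr == ATTR_ALL or node_attr(n) == attr)
        and (chassis_id == 0 or n.chassis_id == chassis_id)
        and (health == HEALTH_UNKNOWN or n.health == health)
    )
```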

GetComputeNodeLocationList(GetComputeNodeLocationListRequest) returns (GetComputeNodeLocationListResponse)

The client calls this RPC to get the locations of the compute nodes. It allows the client to filter the compute nodes based on Attribute, Health and Location.

Attribute reflects the allocation status of a compute node to a partition:

  • NMX_COMPUTE_NODE_ATTR_ALL = 1 - all compute nodes in the NvLink Domain irrespective of the allocation status

  • NMX_COMPUTE_NODE_ATTR_FREE = 2 - compute nodes in the NvLink Domain that are not allocated to any partitions

  • NMX_COMPUTE_NODE_ATTR_FULLY_ALLOCATED = 3 - compute nodes in the NvLink Domain where all of its GPUs are allocated to partitions

  • NMX_COMPUTE_NODE_ATTR_PARTIALLY_ALLOCATED = 4 - compute nodes in the NvLink Domain where one or more (but not all) of its GPUs are allocated to partitions

Compute node health:

  • NMX_COMPUTE_NODE_HEALTH_HEALTHY = 1 - Fully healthy

  • NMX_COMPUTE_NODE_HEALTH_DEGRADED = 2 - Some GPUs are degraded to NO_NVLINK

  • NMX_COMPUTE_NODE_HEALTH_UNHEALTHY = 3 - Unable to participate in NVLink

Location and health are optional filters.

Request message:

Message/Parameter

Type

Description

Notes

GetComputeNodeLocationListRequest.attr

ComputeNodeAttr

Filter based on partition allocation status of the compute node

GetComputeNodeLocationListRequest.chassisId

uint64

Chassis ID

[Optional] Set to 0 when not used

GetComputeNodeLocationListRequest.nodeHealth

ComputeNodeHealth

NVLink health of the compute node

[Optional] Set to COMPUTE_NODE_HEALTH_UNKNOWN when not used

GetComputeNodeLocationListRequest.numNodes

uint32

Limit on number of nodes in the response

Response message:

Message/Parameter

Type

Description

Notes

GetComputeNodeLocationListResponse.locList

Array of Location

List of locations

GetComputeNodeLocationListResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

GetComputeNodeInfoList(GetComputeNodeInfoListRequest) returns (GetComputeNodeInfoListResponse)

The client calls this RPC to get information on the compute nodes. The client uses it to get both static information (location and number of GPUs in the node) and dynamic information (partition IDs and health) about compute nodes.

Request message:

Message/Parameter

Type

Description

Notes

GetComputeNodeInfoListRequest.locList

Array of Location

List of locations

[Optional] If the list is empty, the response includes all nodes

Response message:

Message/Parameter

Type

Description

Notes

GetComputeNodeInfoListResponse.nodeInfoList

Array of ComputeNodeInfo

GetComputeNodeInfoListResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

ComputeNodeInfo message:

Message/Parameter

Type

Description

Notes

ComputeNodeInfo.loc

LocationInfo

Node location

ComputeNodeInfo.numGPUs

uint32

Number of GPUs for the node

ComputeNodeInfo.nodeHealth

ComputeNodeHealth

NVLink health of the compute node

ComputeNodeInfo.partitionIdList

Array of PartitionId

List of partitions the device is associated with


GetGpuInfoList(GetGpuInfoListRequest) returns (GetGpuInfoListResponse)

The client calls this RPC to get information on the GPUs.

The client can filter GPUs based on location or partition by passing the attribute NMX_GPU_ATTR_LOCATION or NMX_GPU_ATTR_PARTITION_ID respectively.

Notes:

  • Specifying NMX_GPU_ATTR_PARTITION_ID and partitionId=0 provides info on GPUs that are not associated with any partition

  • Specifying NMX_GPU_ATTR_ALL ignores the location or partition values set in the request

Request message:

Message/Parameter

Type

Description

Notes

GetGpuInfoListRequest.attr

GpuAttr

Filter based on GPUs that belong to a partition or location

GetGpuInfoListRequest.numGpus

uint32

Limit on number of GPUs for response

[Optional] Set to 0 when not used, and response includes all GPUs matching the filter

GetGpuInfoListRequest.loc

Location

Location

Values used only when attr=NMX_GPU_ATTR_LOCATION

GetGpuInfoListRequest.partitionId

PartitionId

Partition ID

Value used only when attr=NMX_GPU_ATTR_PARTITION_ID

GetGpuInfoListRequest.gpuHealth

GpuHealth

GPU health

[Optional] Set to 0 (GPU_HEALTH_UNKNOWN), when not used

GPU health values:

  • NMX_GPU_HEALTH_HEALTHY = 1 //!< Fully healthy

  • NMX_GPU_HEALTH_DEGRADED = 2 //!< One or more links are down

  • NMX_GPU_HEALTH_NO_NVLINK = 3 //!< Unable to participate in NVLink partition

  • NMX_GPU_HEALTH_DEGRADED_BW = 4 //!< GPU operates in degraded bandwidth

Note: To get all healthy GPUs that are not associated with any partition, set:

  • attr to NMX_GPU_ATTR_PARTITION_ID

  • partitionId to 0
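The way the GetGpuInfoList request fields combine can be modeled with a small in-memory filter. This is a sketch for reasoning about the request, not the server logic: the `Gpu` record and `select_gpus` helper are hypothetical, attribute names are passed as strings because their numeric values are not listed here, and exact matching of the full location is an assumption.

```python
from dataclasses import dataclass

GPU_HEALTH_UNKNOWN = 0  # "not used" value for the gpuHealth filter
GPU_HEALTH_HEALTHY = 1


@dataclass
class Gpu:
    loc: tuple         # (chassisId, slotId, hostId)
    gpu_id: int
    health: int
    partition_id: int  # 0 when the GPU is not in any partition


def select_gpus(gpus, attr, loc=None, partition_id=0,
                gpu_health=GPU_HEALTH_UNKNOWN, num_gpus=0):
    """Apply the attr, health, and limit filters; NMX_GPU_ATTR_ALL ignores loc/partition."""
    def matches(gpu):
        if attr == "NMX_GPU_ATTR_LOCATION" and gpu.loc != loc:
            return False
        if attr == "NMX_GPU_ATTR_PARTITION_ID" and gpu.partition_id != partition_id:
            return False
        return gpu_health == GPU_HEALTH_UNKNOWN or gpu.health == gpu_health

    selected = [g for g in gpus if matches(g)]
    return selected[:num_gpus] if num_gpus else selected  # 0 means no limit
```

In this model, the note above corresponds to `select_gpus(gpus, "NMX_GPU_ATTR_PARTITION_ID", partition_id=0, gpu_health=GPU_HEALTH_HEALTHY)`.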

Response message:

Message/Parameter

Type

Description

Notes

GetGpuInfoListResponse.gpuInfoList

Array of GpuInfo

A list of GPU information

GetGpuInfoListResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

GpuInfo message:

Message/Parameter

Type

Description

Notes

GpuInfo.loc

LocationInfo

Node location

GpuInfo.gpuId

uint32

GPU enumeration within the node

From 1

GpuInfo.gpuUid

uint64

GPU unique identifier

GpuInfo.gpuHealth

GpuHealth

NVLink health of the GPU

GpuInfo.partitionId

PartitionId

Partitions the device is associated with


GetSwitchNodeCount(GetSwitchNodeCountRequest) returns (GetSwitchNodeCountResponse)

The client calls this RPC to get the number of switch nodes.

The client can filter the switch nodes using attribute and health. Note that NVLink5 topologies do not have switch nodes of type NMX_SWITCH_NODE_ATTR_L2.

Request message:

Message/Parameter

Type

Description

Notes

GetSwitchNodeCountRequest.attr

SwitchNodeAttr

Filter based on the type of switch node

GetSwitchNodeCountRequest.nodeHealth

SwitchNodeHealth

NVLink health of the switch node

[Optional] Set to SWITCH_NODE_HEALTH_UNKNOWN when not used

GetSwitchNodeCountRequest.numNodes

uint32

Limit on number of nodes for the response

[Optional] Set to 0 for no limit

Response message:

Message/Parameter

Type

Description

Notes

GetSwitchNodeCountResponse.numNodes

uint32

Number of nodes matching the filter

GetSwitchNodeCountResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

GetSwitchNodeLocationList(GetSwitchNodeLocationListRequest) returns (GetSwitchNodeLocationListResponse)

The client calls this RPC to get the location of switch nodes.

The client can filter the switch nodes using attribute and health. Note that NVLink5 topologies do not have switch nodes of type NMX_SWITCH_NODE_ATTR_L2.

Request message:

Message/Parameter

Type

Description

Notes

GetSwitchNodeLocationListRequest.attr

SwitchNodeAttr

Filter based on the type of switch node

GetSwitchNodeLocationListRequest.nodeHealth

SwitchNodeHealth

NVLink health of the switch node

[Optional] Set to SWITCH_NODE_HEALTH_UNKNOWN when not used

GetSwitchNodeLocationListRequest.numNodes

uint32

Limit on number of nodes for the response

[Optional] Set to 0 for no limit

Response message:

Message/Parameter

Type

Description

Notes

GetSwitchNodeLocationListResponse.locList

Array of Location

List of node locations

GetSwitchNodeLocationListResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

GetSwitchNodeInfoList(GetSwitchNodeInfoListRequest) returns (GetSwitchNodeInfoListResponse)

The client calls this RPC to get information on the switch nodes. The client uses it to get both static information (location and number of switches in the node) and dynamic information (partition IDs and health) about switch nodes.

Request message:

Message/Parameter

Type

Description

Notes

GetSwitchNodeInfoListRequest.locList

Array of Location

List of node locations

[Optional] If the list is empty, the response includes all nodes

Response message:

Message/Parameter

Type

Description

Notes

GetSwitchNodeInfoListResponse.nodeInfoList

Array of SwitchNodeInfo

List of switch nodes

GetSwitchNodeInfoListResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully populated the NVLink Domain Properties

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

SwitchNodeInfo message:

Message/Parameter

Type

Description

Notes

SwitchNodeInfo.loc

LocationInfo

Node location

SwitchNodeInfo.numSwitches

uint32

Number of switches in the node

SwitchNodeInfo.nodeHealth

SwitchNodeHealth

NVLink health of the switch node

SwitchNodeInfo.partitionIdList

Array of PartitionId

List of partitions the node is associated with


GetSwitchInfoList(GetSwitchInfoListRequest) returns (GetSwitchInfoListResponse)

The client calls this RPC to get information on the switches.

Request message:

Message/Parameter

Type

Description

Notes

GetSwitchInfoListRequest.numSwitches

uint32

Limit on number of switches for response

[Optional] Set to 0 when not used. The response includes all switches matching the filter

GetSwitchInfoListRequest.loc

Location

Location

Response message:

Message/Parameter

Type

Description

Notes

GetSwitchInfoListResponse.switchInfoList

Array of SwitchInfo

List of switch info

GetSwitchInfoListResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully completed the call

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

SwitchInfo message:

Message/Parameter

Type

Description

Notes

SwitchInfo.loc

LocationInfo

Node location

SwitchInfo.switchId

uint32

Switch enumeration within the node

From 1

SwitchInfo.switchUid

uint64

Switch unique identifier

SwitchInfo.health

SwitchHealth

NVLink health of the switch

SwitchInfo.numPorts

uint32

Number of ports


GetPartitionCount(GetPartitionCountRequest) returns (GetPartitionCountResponse)

The client calls this RPC to get the number of partitions. The number of partitions can be filtered using the numGpus, numNodes or health filters. The infoAttr value decides which filter to choose between numGpus and numNodes. If set to ATTR_ALL, numGpus and numNodes are ignored.


enum PartitionInfoAttr {
    NMX_PARTITION_INFO_ATTR_UNDEFINED = 0;
    NMX_PARTITION_INFO_ATTR_ALL = 1;               //!< All Partitions
    NMX_PARTITION_INFO_ATTR_NUM_GPUS = 2;          //!< Number of Partitions with a specific GPU size
    NMX_PARTITION_INFO_ATTR_NUM_COMPUTE_NODES = 3; //!< Number of Partitions with a specific number of Compute nodes
}

enum PartitionHealth {
    NMX_PARTITION_HEALTH_UNKNOWN = 0;
    NMX_PARTITION_HEALTH_HEALTHY = 1;            //!< Partition is healthy
    NMX_PARTITION_HEALTH_DEGRADED_BANDWIDTH = 2; //!< Partition is in degraded bandwidth
    NMX_PARTITION_HEALTH_DEGRADED = 3;           //!< One or more GPUs has routing disabled
    NMX_PARTITION_HEALTH_UNHEALTHY = 4;          //!< Partition is unhealthy
}

Request message:

Message/Parameter

Type

Description

Notes

GetPartitionCountRequest.infoAttr

PartitionInfoAttr

Filter based on number of GPUs or compute nodes in a partition

GetPartitionCountRequest.numGpus

uint32

Number of GPUs in a partition

Value used only when infoAttr=PARTITION_INFO_ATTR_NUM_GPUS

GetPartitionCountRequest.numNodes

uint32

Number of compute nodes in a partition

Value used only when infoAttr=PARTITION_INFO_ATTR_NUM_COMPUTE_NODES

GetPartitionCountRequest.health

PartitionHealth

NVLink health of the partition

[Optional] Set to PARTITION_HEALTH_UNKNOWN when not used

Response message:

Message/Parameter

Type

Description

Notes

GetPartitionCountResponse.numPartitions

uint32

Number of partitions matching the filter

GetPartitionCountResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully completed the call

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed
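The filter rules described for GetPartitionCount can be sketched locally in Python. This is an illustrative model only, not the NMX-Controller implementation: `PartitionRecord` and `count_partitions` are hypothetical names, and only the enum names and values are taken from the definitions in this section. The sketch assumes that the health filter is combined with the infoAttr filter (logical AND) and that an UNDEFINED infoAttr is rejected; neither is stated explicitly in the text.

```python
from dataclasses import dataclass
from enum import IntEnum


class PartitionInfoAttr(IntEnum):
    # Names shortened; values copied from the enum in this section.
    UNDEFINED = 0
    ALL = 1
    NUM_GPUS = 2
    NUM_COMPUTE_NODES = 3


class PartitionHealth(IntEnum):
    UNKNOWN = 0
    HEALTHY = 1
    DEGRADED_BANDWIDTH = 2
    DEGRADED = 3
    UNHEALTHY = 4


@dataclass
class PartitionRecord:
    """Hypothetical local stand-in for a partition; not an NMX-C message."""
    num_gpus: int
    num_nodes: int
    health: PartitionHealth


def count_partitions(partitions, info_attr, num_gpus=0, num_nodes=0,
                     health=PartitionHealth.UNKNOWN):
    """Count partitions following the filter description above."""
    if info_attr == PartitionInfoAttr.UNDEFINED:
        # Assumption: the sketch rejects UNDEFINED; the server's behaviour
        # for this value is not described in this section.
        raise ValueError("infoAttr must be ALL, NUM_GPUS or NUM_COMPUTE_NODES")

    def matches(p):
        # infoAttr selects which size filter applies; ALL ignores both.
        if info_attr == PartitionInfoAttr.NUM_GPUS and p.num_gpus != num_gpus:
            return False
        if (info_attr == PartitionInfoAttr.NUM_COMPUTE_NODES
                and p.num_nodes != num_nodes):
            return False
        # UNKNOWN means the health filter is not used.
        if health != PartitionHealth.UNKNOWN and p.health != health:
            return False
        return True

    return sum(1 for p in partitions if matches(p))
```

With infoAttr set to ALL, any numGpus or numNodes value passed alongside it has no effect on the count, which mirrors the rule that those fields are ignored in that case.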

GetPartitionIdList(GetPartitionIdListRequest) returns (GetPartitionIdListResponse)

The client calls this RPC to get the partition IDs. The partitionIds can be filtered using the numGpus, numNodes or health filters. The infoAttr value decides which filter to choose between numGpus and numNodes. If set to ATTR_ALL, numGpus and numNodes are ignored.


enum PartitionInfoAttr {
    NMX_PARTITION_INFO_ATTR_UNDEFINED = 0;
    NMX_PARTITION_INFO_ATTR_ALL = 1;               //!< All Partitions
    NMX_PARTITION_INFO_ATTR_NUM_GPUS = 2;          //!< Number of Partitions with a specific GPU size
    NMX_PARTITION_INFO_ATTR_NUM_COMPUTE_NODES = 3; //!< Number of Partitions with a specific number of Compute nodes
}

enum PartitionHealth {
    NMX_PARTITION_HEALTH_UNKNOWN = 0;
    NMX_PARTITION_HEALTH_HEALTHY = 1;            //!< Partition is healthy
    NMX_PARTITION_HEALTH_DEGRADED_BANDWIDTH = 2; //!< Partition is in degraded bandwidth
    NMX_PARTITION_HEALTH_DEGRADED = 3;           //!< One or more GPUs has routing disabled
    NMX_PARTITION_HEALTH_UNHEALTHY = 4;          //!< Partition is unhealthy
}

Request message:

Message/Parameter

Type

Description

Notes

GetPartitionIdListRequest.infoAttr

PartitionInfoAttr

Filter based on number of GPUs or compute nodes in a partition

GetPartitionIdListRequest.numGpus

uint32

Number of GPUs in a partition

Value used only when infoAttr=PARTITION_INFO_ATTR_NUM_GPUS

GetPartitionIdListRequest.numNodes

uint32

Number of compute nodes in a partition

Value used only when infoAttr=PARTITION_INFO_ATTR_NUM_COMPUTE_NODES

GetPartitionIdListRequest.health

PartitionHealth

NVLink health of the partition

[Optional] Set to PARTITION_HEALTH_UNKNOWN when not used

GetPartitionIdListRequest.numPartitions

uint32

Number of partitions

Response message:

Message/Parameter

Type

Description

Notes

GetPartitionIdListResponse.partitionList

Array of Partition

List of partitions

GetPartitionIdListResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully completed the call

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

Partition message:

Message/Parameter

Type

Description

Notes

Partition.partitionId

uint32

Partition ID

Partition.numGpus

uint32

Number of GPUs in partition


GetPartitionInfoList(GetPartitionInfoListRequest) returns (GetPartitionInfoListResponse)

The client calls this RPC to get partition information. The client can pass a list of partitionIds, a list of partitionNames, or both. Only valid partitionIds or partitionNames are considered. If both lists are empty, information for all partitions is returned.

Request message:

Message/Parameter

Type

Description

Notes

GetPartitionInfoListRequest.partitionIdList

Array of PartitionId

List of partition IDs

[Optional] If no IDs or names are provided, the response includes all provisioned partitions

GetPartitionInfoListRequest.partitionNameList

Array of strings

List of partition names

[Optional] If no IDs or names are provided, the response includes all provisioned partitions

Response message:

Message/Parameter

Type

Description

Notes

GetPartitionInfoListResponse.partitionInfoList

Array of PartitionInfo

List of partition info

GetPartitionInfoListResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully completed the call

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed
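The selection behaviour of GetPartitionInfoList can be illustrated with a small local model in Python. The helper `select_partitions` and the `known` mapping are hypothetical and exist only for this sketch. It assumes that when both lists are supplied the result is the union of the matches, and that invalid IDs or names are silently skipped; the text says only that valid entries are considered, so the union rule in particular is an assumption.

```python
def select_partitions(known, partition_ids=(), partition_names=()):
    """Return the sorted partition IDs a request would select.

    known maps partitionId -> partitionName (hypothetical local state).
    """
    # Both lists empty: every partition is selected.
    if not partition_ids and not partition_names:
        return sorted(known)

    names = set(partition_names)
    # Keep only IDs that exist; unknown IDs are ignored.
    selected = {pid for pid in partition_ids if pid in known}
    # Assumption: names add to the selection (union), unknown names ignored.
    selected |= {pid for pid, name in known.items() if name in names}
    return sorted(selected)
```

Note that a request naming only unknown partitions selects nothing in this model, which is different from passing two empty lists.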

PartitionInfo message:

Message/Parameter

Type

Description

Notes

partitionId

PartitionId

Partition ID

name

string

Partition name

numGpus

uint32

Number of GPUs in partition

gpuLocationList

Array of GpuLocation

GPUs location

gpuUidList

Array of uint64

GPUs unique IDs

health

PartitionHealth

NVLink health of the partition

partitionType

PartitionType

Partition type

PARTITION_TYPE_LOCATION_BASED if GPUs are associated by location

PARTITION_TYPE_GPUUID_BASED if GPUs are associated by gpuUid

numAllocatedMulticastGroups

uint32

Number of allocated multicast groups to the partition

attr.resiliencyMode

ResiliencyMode

Resiliency mode

RESILIENCY_MODE_UNDEFINED

RESILIENCY_MODE_FULL_BANDWIDTH

RESILIENCY_MODE_ADAPTIVE_BANDWIDTH

RESILIENCY_MODE_USER_ACTION_REQUIRED

attr.MulticastGroupsLimit

uint32

Limit on number of multicast groups in partition

The meanings of the ResiliencyMode values are:

  • Full Bandwidth: On a trunk link failure, partition will attempt an automatic recovery. If spare trunk links are not available, GPUs will be excluded from the fabric to maintain full bandwidth for the rest of the GPUs.

  • Adaptive Bandwidth: On a trunk link failure, partition will attempt an automatic recovery. If spare trunk links are not available, the partition's GPUs will operate with a lower bandwidth than optimal.

  • User Action Required: On a trunk link failure, partition will attempt an automatic recovery. If spare trunk links are not available, the partition will go into an unhealthy state which requires user action for recovery. Example actions would be providing additional trunk links or removing GPUs from the partition.
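The three resiliency modes differ only in what happens when automatic recovery finds no spare trunk links. The Python sketch below restates that as a lookup. It is a reading aid, not controller logic: the function name, the outcome strings, and the enum's string values are all placeholders invented for this example (the numeric ResiliencyMode values are not listed in this section), and it assumes that recovery succeeds whenever spare trunk links are available.

```python
from enum import Enum


class ResiliencyMode(Enum):
    # String values are placeholders; the numeric enum values are not
    # listed in this section.
    FULL_BANDWIDTH = "RESILIENCY_MODE_FULL_BANDWIDTH"
    ADAPTIVE_BANDWIDTH = "RESILIENCY_MODE_ADAPTIVE_BANDWIDTH"
    USER_ACTION_REQUIRED = "RESILIENCY_MODE_USER_ACTION_REQUIRED"


# Outcome per mode when automatic recovery finds no spare trunk links.
NO_SPARE_OUTCOME = {
    ResiliencyMode.FULL_BANDWIDTH:
        "GPUs excluded from the fabric; remaining GPUs keep full bandwidth",
    ResiliencyMode.ADAPTIVE_BANDWIDTH:
        "partition GPUs run with lower than optimal bandwidth",
    ResiliencyMode.USER_ACTION_REQUIRED:
        "partition unhealthy; user action required",
}


def outcome_on_trunk_failure(mode, spare_trunk_links_available):
    """Summarize the documented behaviour after a trunk link failure."""
    if spare_trunk_links_available:
        # Assumption: automatic recovery succeeds when spares exist.
        return "recovered using spare trunk links"
    return NO_SPARE_OUTCOME[mode]
```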

CreatePartition(CreatePartitionRequest) returns (CreatePartitionResponse)

The client calls this RPC to create a partition. The user provides a list of GPU UIDs or a list of GPU Locations. If both lists are empty, the RPC returns NMX_ST_BADPARAM. The user can specify the attributes for the partition to be created.

  • If the partition creation is requested with the resiliency mode NMX_RESILIENCY_MODE_UNDEFINED, the default resiliency mode specified by MNNVL_DEFAULT_RESILIENCY_MODE in the configuration file is used

  • If the partition creation is requested with a multicastGroupsLimit that cannot be satisfied, the RPC returns NMX_ST_RESOURCE_EXHAUSTED. If the value is not a multiple of 4, the RPC returns NMX_ST_BADPARAM

partitionIds are allocated from 1 to 0x7FFD. Partition ID 0x7FFE is reserved for the Default Partition. The user can specify a partitionId as part of the creation request. If the specified ID is greater than 0x7FFD, the RPC returns NMX_ST_BADPARAM.

  • If a partition creation request succeeds, and a later request to create another partition with the same set of parameters is received, the RPC returns NMX_ST_PARTITION_EXISTS

  • If a partition cannot be created owing to exhaustion of partitionIds, the RPC returns NMX_ST_RESOURCE_EXHAUSTED

  • If a requested GPU is already part of another partition, the RPC returns NMX_ST_RESOURCE_IN_USE

  • If a requested GPU does not have a valid UID or a valid location, the RPC returns NMX_ST_RESOURCE_BAD

  • If the requested partitionId is already in use by another partition, the RPC returns NMX_ST_PARTITION_ID_IN_USE

  • If the requested partitionName is already in use by another partition, the RPC returns NMX_ST_PARTITION_NAME_IN_USE

  • If the creation fails due to any other error, the RPC returns NMX_ST_GENERIC_ERROR
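Some of the CreatePartition rules above depend only on the request itself, so a client can check them before sending anything. The Python sketch below covers just those three rules (empty GPU lists, partitionId range, multicastGroupsLimit multiple of 4). The function name and its parameters are hypothetical; the remaining return codes depend on server state and are not modelled. A partition_id of 0 is treated as "auto-generate", as described in the request table.

```python
MAX_USER_PARTITION_ID = 0x7FFD   # highest ID a client may request
DEFAULT_PARTITION_ID = 0x7FFE    # reserved for the Default Partition


def precheck_create_partition(gpu_uids=(), gpu_locations=(),
                              partition_id=0, multicast_groups_limit=0):
    """Return the return code the rules above predict, or None.

    None means no problem was found locally; the server can still
    reject the request for reasons that depend on its state.
    """
    # Both GPU lists empty.
    if not gpu_uids and not gpu_locations:
        return "NMX_ST_BADPARAM"
    # 0 requests an auto-generated ID; anything above 0x7FFD is invalid.
    if partition_id > MAX_USER_PARTITION_ID:
        return "NMX_ST_BADPARAM"
    # The limit must be a multiple of 4. A value of 0 passes this check;
    # whether the server accepts 0 is not stated in this section.
    if multicast_groups_limit % 4 != 0:
        return "NMX_ST_BADPARAM"
    return None
```

Checking locally avoids a round trip for requests that are certain to fail, but the server's response remains the authority on whether a request is valid.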

Request message:

Message/Parameter

Type

Description

Notes

CreatePartitionRequest.name

string

Partition name

[Optional] Must be unique in domain if provided

CreatePartitionRequest.gpuResourceId

Array of GpuResourceId

Either gpuLocation or gpuUid

The GPU can be allocated either by GPU location or GPU unique ID

CreatePartitionRequest.attr.resiliencyMode

ResiliencyMode

Resiliency mode

RESILIENCY_MODE_UNDEFINED

RESILIENCY_MODE_FULL_BANDWIDTH

RESILIENCY_MODE_ADAPTIVE_BANDWIDTH

RESILIENCY_MODE_USER_ACTION_REQUIRED

CreatePartitionRequest.attr.MulticastGroupsLimit

uint32

Limit on number of multicast groups in partition

CreatePartitionRequest.partitionId

PartitionId

Partition ID

[Optional] Set to 0 for system to auto-generate ID

Response message:

Message/Parameter

Type

Description

Notes

CreatePartitionResponse.partitionId

PartitionId

Partition ID

User provided partition ID, or system auto-generated ID (if user did not specify)

CreatePartitionResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully completed the call

  • NMX_ST_BADPARAM - Invalid input parameter, requested multicastGroupsLimit is not a multiple of 4, or requested partitionId is greater than 0x7FFD (allowed range is 0x1-0x7FFD)

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed due to an internal error

  • NMX_ST_RESOURCE_EXHAUSTED - Requested multicastGroupsLimit cannot be satisfied, partitionIds are exhausted

  • NMX_ST_PARTITION_EXISTS - Partition with requested parameters already exists

  • NMX_ST_RESOURCE_IN_USE - Requested resource is already part of another partition

  • NMX_ST_RESOURCE_BAD - Requested resource does not have a valid UID or a location

  • NMX_ST_PARTITION_ID_IN_USE - Requested partitionId is already in use

  • NMX_ST_PARTITION_NAME_IN_USE - Requested partitionName is already in use

DeletePartition(DeletePartitionRequest) returns (DeletePartitionResponse)

The client calls this RPC to delete a partition. The client provides a valid partition name, a partition ID, or both as part of the request.

Request message:

Message/Parameter

Type

Description

Notes

DeletePartitionRequest.partitionId

PartitionId

Partition ID

Partition ID is optional if partition name is provided

DeletePartitionRequest.name

string

Partition name

Partition name is optional if partition ID is provided

Response message:

Message/Parameter

Type

Description

Notes

DeletePartitionResponse.partitionId

PartitionId

DeletePartitionResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully completed the call

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

  • NMX_ST_PARTITION_ID_NOT_IN_USE - Requested partitionId not in use

AddGpusToPartition(UpdatePartitionRequest) returns (UpdatePartitionResponse)

The client calls this RPC to add GPUs to a partition. In a partition that is contained within a chassis boundary, the reroute flag is ignored. In a partition that uses trunk links and hence crosses the chassis boundary, the reroute flag determines if the trunk link routing is adjusted:

  • If reroute = true (default), the trunk link routing is adjusted when additional GPUs are added to the partition. This can disrupt applications running in the partition.

  • If reroute = false, the trunk link routing is not adjusted when additional GPUs are added to the partition.

The RPC also allows a special operation called "reroute", where both the locationList and gpuUid lists are empty and reroute is set to true. This allows the client to adjust the trunk link routing (i.e. "reroute") for the partition to use an optimal number of trunk links. This operation can cause traffic disruption and must be used with caution.

  • When one of the location list or the gpuUid list is populated and this does not match the type (location or GPU UID) with which the partition currently operates, the RPC returns NMX_ST_NOT_SUPPORTED. The type of the partition can be determined from the PartitionInfo message which is returned from the GetPartitionInfoList() RPC call

  • If a requested GPU does not have a valid UID or a valid location, the RPC returns NMX_ST_RESOURCE_BAD

  • If the requested partition ID is not in use, the RPC returns NMX_ST_PARTITION_ID_NOT_IN_USE

  • If the GPU to be added is already part of a partition, the RPC returns NMX_ST_RESOURCE_IN_USE

Request message:

Message/Parameter

Type

Description

Notes

UpdatePartitionRequest.partitionId

PartitionId

Partition ID is optional if partition name is provided

UpdatePartitionRequest.locationList

Array of GpuLocation

List of GPU locations

Provide only if PartitionType=PARTITION_TYPE_LOCATION_BASED

UpdatePartitionRequest.gpuUid

Array of gpuUid

Provide only if PartitionType=PARTITION_TYPE_GPUUID_BASED

UpdatePartitionRequest.name

string

Partition name

Partition name is optional if partition ID is provided

UpdatePartitionRequest.reroute

Boolean

Reroute partition on update

Default is true. This flag will be deprecated and then removed in future releases

User can request partition reroute by setting:

  • locationList to empty array

  • gpuUid to empty array

  • reroute to true
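The reroute-only operation and the partition type rule can be sketched in Python as follows. Plain dictionaries stand in for the generated UpdatePartitionRequest message, and both helper names are hypothetical. The field names come from the request table above; how they map onto the generated client classes depends on the NVIDIA-provided proto file and is not shown here.

```python
def build_reroute_request(partition_id):
    """Build a reroute-only update: both GPU lists empty, reroute true."""
    return {
        "partitionId": partition_id,
        "locationList": [],
        "gpuUid": [],
        "reroute": True,
    }


def check_update_type(partition_type, location_list, gpu_uid_list):
    """Return NMX_ST_NOT_SUPPORTED on a type mismatch, else None.

    partition_type is the PartitionType string reported by
    GetPartitionInfoList for the target partition.
    """
    if location_list and partition_type != "PARTITION_TYPE_LOCATION_BASED":
        return "NMX_ST_NOT_SUPPORTED"
    if gpu_uid_list and partition_type != "PARTITION_TYPE_GPUUID_BASED":
        return "NMX_ST_NOT_SUPPORTED"
    return None
```

Because a reroute-only request carries no GPUs, `check_update_type` returns None for it regardless of the partition type.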

Response message:

Message/Parameter

Type

Description

Notes

UpdatePartitionResponse.partitionId

PartitionId

UpdatePartitionResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully completed the call

  • NMX_ST_BADPARAM - Invalid input parameter, requested partitionId does not exist

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

  • NMX_ST_NOT_SUPPORTED - Requested type (location or GPU UID) does not match the partition type

  • NMX_ST_RESOURCE_BAD - Requested resource does not have a valid UID or a location

  • NMX_ST_PARTITION_ID_NOT_IN_USE - Requested partitionId not in use

  • NMX_ST_RESOURCE_IN_USE - Requested resource is already part of another partition

RemoveGpusFromPartition(UpdatePartitionRequest) returns (UpdatePartitionResponse)

The client calls this RPC to remove GPUs from a partition. UpdatePartitionRequest and UpdatePartitionResponse messages are the same as in AddGpusToPartition.

  • If the number of GPUs to be removed is equal to the number of GPUs in the partition, the RPC returns NMX_ST_NOT_SUPPORTED

  • If the number of GPUs to be removed is greater than the number of GPUs in the partition, the RPC returns NMX_ST_BADPARAM

  • If the GPU to be removed is not part of the partition, the RPC returns NMX_ST_RESOURCE_NOT_IN_USE

UpdatePartitionResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully completed the call

  • NMX_ST_BADPARAM - Invalid input parameter, or the requested number of GPUs to be removed is greater than the number of GPUs in the partition

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

  • NMX_ST_NOT_SUPPORTED - Requested number of GPUs to be removed is equal to the number of GPUs in the partition

  • NMX_ST_RESOURCE_NOT_IN_USE - Requested resource to be removed is not part of the partition
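The RemoveGpusFromPartition rules can be restated as a small decision function. The Python sketch below is a local model with a hypothetical function name, useful for reasoning about which return code to expect. The order in which the checks are applied is an assumption, since the text does not say which rule wins when several apply, and the empty-list case is rejected locally because its behaviour for this RPC is not described.

```python
def expected_remove_result(partition_gpus, gpus_to_remove):
    """Predict the return code for removing GPUs from a partition.

    Both arguments are collections of GPU identifiers (UIDs or any
    hashable stand-in for a location).
    """
    current = set(partition_gpus)
    requested = set(gpus_to_remove)
    if not requested:
        # Not modelled: behaviour for an empty list is not described here.
        raise ValueError("gpus_to_remove must not be empty")
    # Assumption: the count checks are evaluated before membership.
    if len(requested) > len(current):
        return "NMX_ST_BADPARAM"
    if len(requested) == len(current):
        return "NMX_ST_NOT_SUPPORTED"
    if not requested <= current:
        return "NMX_ST_RESOURCE_NOT_IN_USE"
    return "NMX_ST_SUCCESS"
```

The NOT_SUPPORTED case means a partition cannot be emptied through this RPC; removing every GPU requires DeletePartition instead.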

GetConnCount(GetConnCountRequest) returns (GetConnCountResponse)

The client calls this RPC to get the number of fabric connections. It allows the client to filter the connections based on Attribute, Type and Location.


enum ConnAttr {
    NMX_NVLINK_CONN_ATTR_UNKNOWN = 0;
    NMX_NVLINK_CONN_ATTR_EXPECTED = 1;          //!< All expected connections as per FM Topology
    NMX_NVLINK_CONN_ATTR_DISCOVERED = 2;        //!< All discovered connections
    NMX_NVLINK_CONN_ATTR_EXPECTED_ACTIVE = 3;   //!< All expected active connections as per FM Topology
    NMX_NVLINK_CONN_ATTR_EXPECTED_INACTIVE = 4; //!< All expected inactive connections
    NMX_NVLINK_CONN_ATTR_UNEXPECTED = 5;        //!< All unexpected connections which are discovered
}


enum ConnType {
    NMX_NVLINK_CONN_TYPE_UNKNOWN = 0;
    NMX_NVLINK_CONN_TYPE_ALL = 1;    //!< Dump the GPU and switch connections
    NMX_NVLINK_CONN_TYPE_GPU = 2;    //!< Dump the GPU connections
    NMX_NVLINK_CONN_TYPE_SWITCH = 3; //!< Dump the Switch connections
}

The following combinations of attribute and type illustrate how to query specific categories of connections:

Connection category

Connection Attribute

Connection Type

Access discovered

CONN_ATTR_DISCOVERED

CONN_TYPE_GPU

Trunk discovered

CONN_ATTR_DISCOVERED

CONN_TYPE_SWITCH

All discovered

CONN_ATTR_DISCOVERED

CONN_TYPE_ALL

Access expected

CONN_ATTR_EXPECTED

CONN_TYPE_GPU

Access inactive

CONN_ATTR_EXPECTED_INACTIVE

CONN_TYPE_GPU

Trunk unexpected

CONN_ATTR_UNEXPECTED

CONN_TYPE_SWITCH
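The category table above maps directly onto a lookup from category name to a (ConnAttr, ConnType) pair. The Python sketch below encodes it, using the enum values from the definitions in this section with shortened member names. `CONN_FILTERS` and `conn_filter` are hypothetical helpers for illustration and are not part of the API.

```python
from enum import IntEnum


class ConnAttr(IntEnum):
    UNKNOWN = 0
    EXPECTED = 1
    DISCOVERED = 2
    EXPECTED_ACTIVE = 3
    EXPECTED_INACTIVE = 4
    UNEXPECTED = 5


class ConnType(IntEnum):
    UNKNOWN = 0
    ALL = 1
    GPU = 2      # access links (GPU connections)
    SWITCH = 3   # trunk links (switch connections)


# One entry per row of the connection category table.
CONN_FILTERS = {
    "access discovered": (ConnAttr.DISCOVERED, ConnType.GPU),
    "trunk discovered": (ConnAttr.DISCOVERED, ConnType.SWITCH),
    "all discovered": (ConnAttr.DISCOVERED, ConnType.ALL),
    "access expected": (ConnAttr.EXPECTED, ConnType.GPU),
    "access inactive": (ConnAttr.EXPECTED_INACTIVE, ConnType.GPU),
    "trunk unexpected": (ConnAttr.UNEXPECTED, ConnType.SWITCH),
}


def conn_filter(category):
    """Return the (connAttr, connType) pair for a category name."""
    return CONN_FILTERS[category.lower()]
```

For example, `conn_filter("Trunk unexpected")` yields the pair to place in connAttr and connType when looking for mis-wired trunk links.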

Request message:

Message/Parameter

Type

Description

Notes

GetConnCountRequest.connType

ConnType

Filter based on connection type

Connection Types are NVLINK_CONN_TYPE_GPU, NVLINK_CONN_TYPE_SWITCH.

Specify NVLINK_CONN_TYPE_ALL to include both

GetConnCountRequest.connAttr

ConnAttr

Filter based on discovered and expected connections

Discovered connections can be Active/Unexpected.

Expected connections can be Active/Inactive/Missing.

GetConnCountRequest.loc

Location

Filter connections for a specific location

Response message:

Message/Parameter

Type

Description

Notes

GetConnCountResponse.numConns

uint32

Number of connections

GetConnCountResponse.timestamp

string

Timestamp from when the connection database was last updated

GetConnCountResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully completed the call

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

GetConnInfoList(GetConnInfoListRequest) returns (GetConnInfoListResponse)

The client calls this RPC to get information on the fabric connections. It allows the client to filter the connections based on Attribute, Type and Location.


enum ConnAttr {
    NMX_NVLINK_CONN_ATTR_UNKNOWN = 0;
    NMX_NVLINK_CONN_ATTR_EXPECTED = 1;          //!< All expected connections as per FM Topology
    NMX_NVLINK_CONN_ATTR_DISCOVERED = 2;        //!< All discovered connections
    NMX_NVLINK_CONN_ATTR_EXPECTED_ACTIVE = 3;   //!< All expected active connections as per FM Topology
    NMX_NVLINK_CONN_ATTR_EXPECTED_INACTIVE = 4; //!< All expected inactive connections
    NMX_NVLINK_CONN_ATTR_UNEXPECTED = 5;        //!< All unexpected connections which are discovered
}


enum ConnType {
    NMX_NVLINK_CONN_TYPE_UNKNOWN = 0;
    NMX_NVLINK_CONN_TYPE_ALL = 1;    //!< Dump the GPU and switch connections
    NMX_NVLINK_CONN_TYPE_GPU = 2;    //!< Dump the GPU connections
    NMX_NVLINK_CONN_TYPE_SWITCH = 3; //!< Dump the Switch connections
}

The API returns a list of connections and the state of each connection:


enum ConnState {
    NMX_NVLINK_CONN_STATE_UNKNOWN = 0;
    NMX_NVLINK_CONN_STATE_ACTIVE = 1;   //!< Active link or connection state
    NMX_NVLINK_CONN_STATE_INACTIVE = 2; //!< Inactive link or connection state
}

GetConnInfoListRequest message:

Message/Parameter

Type

Description

Notes

GetConnInfoListRequest.connType

ConnType

Filter based on connection type

GetConnInfoListRequest.connAttr

ConnAttr

Filter based on discovered and expected connections

GetConnInfoListRequest.loc

Location

Filter connections for a specific location

GetConnInfoListRequest.numConns

uint32

Number of connections

Response message:

Message/Parameter

Type

Description

Notes

GetConnInfoListResponse.connInfoList

Array of ConnInfo

List of connection information

GetConnInfoListResponse.timestamp

string

Timestamp from when the connection database was last updated

GetConnInfoListResponse.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully completed the call

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

ConnInfo message:

Message/Parameter

Type

Description

Notes

endPointA

LinkEndPoint

One end of the connection (a device port)

endPointB

LinkEndPoint

The other end of the connection (a device port)

connType

ConnType

Connection type

connState

ConnState

Connection state

NVLINK_CONN_STATE_ACTIVE or NVLINK_CONN_STATE_INACTIVE

LinkEndPoint message:

Message/Parameter

Type

Description

Notes

loc

Location

Location of the node of the endpoint

switchOrGpuId

uint32

Identifier of the device (GPU/switch) within the node of the endpoint

cageNum

uint32

Cage Number of the endpoint

Valid only for NVLINK_CONN_TYPE_SWITCH, otherwise set to 0

cagePortNum

uint32

Cage Port Number of the endpoint

Valid only for NVLINK_CONN_TYPE_SWITCH, otherwise set to 0

cagePortSplitNum

uint32

Cage Split Port Number of the endpoint

Valid only for NVLINK_CONN_TYPE_SWITCH, otherwise set to 0

portNum

uint32

Port Number of the endpoint on the device (GPU/switch)
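As a reading aid for the LinkEndPoint fields, the Python sketch below models an endpoint as a dataclass and formats it for logging. The class is a hypothetical local mirror of the message with the Location fields (chassisId, slotId, hostId) flattened in; it is not the generated gRPC type. The cage fields are printed only for switch connections, since they are documented as 0 otherwise.

```python
from dataclasses import dataclass


@dataclass
class LinkEndPoint:
    """Hypothetical local mirror of the LinkEndPoint message."""
    chassis_id: int
    slot_id: int
    host_id: int
    switch_or_gpu_id: int
    port_num: int
    # Cage fields are meaningful only for switch (trunk) connections.
    cage_num: int = 0
    cage_port_num: int = 0
    cage_port_split_num: int = 0


def describe_endpoint(ep, is_switch_conn):
    """Format an endpoint; cage details are added for switch connections."""
    text = (f"chassis {ep.chassis_id} slot {ep.slot_id} host {ep.host_id} "
            f"device {ep.switch_or_gpu_id} port {ep.port_num}")
    if is_switch_conn:
        text += (f" cage {ep.cage_num}/{ep.cage_port_num}"
                 f"/{ep.cage_port_split_num}")
    return text
```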


GetConnInfoCombined(GetConnInfoCombinedRequest) returns (ConnInfoCombined)

The client calls this RPC to get information on the fabric trunk connections that are mis-wired.

Response message:

Message/Parameter

Type

Description

Notes

ConnInfoCombined.unexpectedConnList

Array of ConnInfo

List of mis-wired trunk connections

ConnInfoCombined.serverHeader.returnCode:

  • NMX_ST_SUCCESS - successfully completed the call

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call to GFM API has failed

FactoryReset(FactoryResetRequest) returns (ReturnCode)

The client calls this RPC to perform a factory reset of the NMX-Controller. After this call completes, the NMX-Controller configuration and state are as initially delivered from the factory.

ReturnCode:

  • NMX_ST_SUCCESS - successfully completed the call

  • NMX_ST_BADPARAM - Invalid input parameter

  • NMX_ST_GENERIC_ERROR - Call has failed

© Copyright 2025, NVIDIA. Last updated on Apr 30, 2025.