NVLink Plugin
Changes and New Features
Plugin Version | Feature | Description
1.2.2-0 | Partitions Management | Added the ability to manage partitions by creating, updating, and removing them. For more information, refer to Partitions View.
1.2.2-0 | Compute and Switch Nodes Views | Added the ability to view the available compute and switch nodes. For more information, refer to Switch Nodes View and Compute Nodes View.
Bug Fixes
Plugin Version | Bug Fix
1.2.2-0 | N/A
The NVLink plugin enables centralized monitoring and management of multiple NVLink domains through both the UFM UI and REST APIs. At its core is the NMX Aggregator (NMXAGGR), which connects to multiple NVLink domains, gathers data from their NMX Controllers (NMX-C), and consolidates information about monitored components. By default, the plugin includes a built-in NMXAGGR, but it can also be configured to connect to an external NMXAGGR instance—either on the same host or a different system. Communication with NVLink domains is performed via the NMX-C using a gRPC-based API.
Download the Plugin Image
Run the following command to download the NVLink plugin image:
docker pull mellanox/ufm-plugin-nvlink
Load the Plugin into UFM
After downloading, you can load the plugin into UFM using one of the following methods:
Via UFM UI:
Navigate to Settings → Plugins Management in the UFM web interface.
Via Command Line:
Execute the following command on the UFM server terminal:
/opt/ufm/scripts/manage_ufm_plugins.sh add -p nvlink
Container Volume Mapping
The UFM plugin management system creates the following mappings between the file system of the plugin's Docker container and the file system of the host machine:
Container Directory | Host Directory
Any file system path mentioned in this document refers to the container's file system, unless stated otherwise.
NVLink Domains Connection Security
The plugin, specifically its NMXAGGR component, interacts with NVLink domains over a gRPC connection. In this setup, the domain controller (NMX-C) acts as the server, while NMXAGGR functions as the client.
NMXAGGR supports three modes of gRPC communication:
Insecure – No encryption is used. This is the default mode.
Server-side TLS – Communication is encrypted. Only the server needs to present a certificate to the client. This mode is enabled by setting the cacert option (refer to the Configuration section).
Mutual TLS (mTLS) – Communication is encrypted, and both the client and server must authenticate each other using certificates. This mode requires setting the cacert, cert, and key options (refer to the Configuration section).
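The relationship between the three options and the resulting mode can be sketched in Python as follows. This is an illustration only, not the plugin's code: the function name `select_grpc_mode` and the file paths in the usage note are hypothetical, and rejecting incomplete option combinations is an assumption, since the source does not say how NMXAGGR handles them.

```python
from typing import Optional


def select_grpc_mode(cacert: Optional[str] = None,
                     cert: Optional[str] = None,
                     key: Optional[str] = None) -> str:
    """Return 'insecure', 'server-tls', or 'mtls' for the given option values."""
    # Assumption: a client certificate without its private key (or the
    # reverse) is treated as a configuration error.
    if bool(cert) != bool(key):
        raise ValueError("cert and key must be set together")
    # Assumption: mTLS without trusted root certificates is rejected.
    if cert and not cacert:
        raise ValueError("cert and key require cacert")
    if cacert and cert:
        return "mtls"
    if cacert:
        return "server-tls"
    # No cacert set: insecure mode, which the document states is the default.
    return "insecure"
```

For example, `select_grpc_mode(cacert="/config/ca.pem")` returns `"server-tls"`, while calling it with no arguments returns `"insecure"`.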
Managed Domains List
All NVLink domains that are managed or monitored by NMXAGGR are recorded in a list stored in the file <data_dir>/domains.csv (see the Configuration section for the definition of data_dir).
This file serves as an alternative to the Web UI and REST API for adding or removing managed domains. Each line in the file contains three comma-separated values:
host - a hostname or an IP address, either of which can include numerical ranges to define multiple hosts in one line; required
controller port - the port number of the gRPC endpoint of a domain controller (NMX-C); optional; default value is 9370
telemetry port - the port number of the management gRPC endpoint of the telemetry component (NMX-T); optional; default value is 9351
Examples:
10.222.16.333,9370,
nv-dmn-01,,6666
10.222.[16,17,20-28].[330-350],,
nv-dmn-[01-8],9371,9355
For any changes made directly to the file to take effect, the plugin must be restarted.
When NMXAGGR writes the file (as a result of changes to the managed domains list performed via UI or REST API), it expands addresses containing ranges and writes one address per line.
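To make the line format and the range expansion concrete, here is a minimal Python sketch that parses one line of domains.csv and expands bracketed ranges into individual hosts. It is not NMXAGGR's parser: all function names are hypothetical, and the treatment of zero-padded ranges such as [01-8] (keeping the width of the lower bound) is an assumption, because the source does not specify it.

```python
import itertools
import re

DEFAULT_CONTROLLER_PORT = 9370  # default stated in the document
DEFAULT_TELEMETRY_PORT = 9351   # default stated in the document

_BRACKET = re.compile(r"\[([^\]]+)\]")


def split_fields(line):
    """Split on commas that are outside of [...] groups."""
    fields, current, depth = [], [], 0
    for ch in line:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def _expand_group(group):
    """Expand '16,17,20-28' into ['16', '17', '20', ..., '28']."""
    values = []
    for item in group.split(","):
        item = item.strip()
        if "-" in item:
            low, high = item.split("-", 1)
            # Assumption: keep the zero-padded width of the lower bound.
            width = len(low) if low.startswith("0") else 0
            values.extend(str(n).zfill(width)
                          for n in range(int(low), int(high) + 1))
        else:
            values.append(item)
    return values


def expand_host(pattern):
    """Return every host described by a pattern with bracketed ranges."""
    # re.split with one capture group alternates literal text and group text.
    parts = _BRACKET.split(pattern)
    choices = [_expand_group(p) if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]


def parse_domain_line(line):
    """Return a list of (host, controller_port, telemetry_port) tuples."""
    fields = [f.strip() for f in split_fields(line.strip())]
    fields += [""] * (3 - len(fields))
    host, controller, telemetry = fields[:3]
    if not host:
        raise ValueError("host is required")
    controller_port = int(controller) if controller else DEFAULT_CONTROLLER_PORT
    telemetry_port = int(telemetry) if telemetry else DEFAULT_TELEMETRY_PORT
    return [(h, controller_port, telemetry_port) for h in expand_host(host)]
```

Under these assumptions, `parse_domain_line("nv-dmn-01,,6666")` yields a single entry with the default controller port, and a line containing ranges yields one entry per expanded host, similar to the one-address-per-line form that NMXAGGR writes back.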
Configuration
The plugin can be configured by editing the configuration file /config/nvlink_plugin.conf.
There are two sections in the config file:
nmxaggr
Option
Description
Default
If
The address of the NMXAGGR REST API server.
The path to a file containing trusted root certificates for verifying NMX-C servers. If not set, insecure gRPC connections will be used.
The path to a file containing client certificate to present to NMX-C servers. Must be used with
The path to a file containing client private key to present to NMX-C servers. Must be used with
The path to a directory where the internal NMXAGGR will store its persistent data.
/config
If the plugin fails to subscribe to domain change notifications, it fetches data from the domain periodically instead. This option specifies the delay between those periodic fetches as a duration string (see footnote 1).
Normally, after the initial data fetch, data is fetched from a domain only when a change notification is received from the domain controller. In addition, a supplementary fetch is initiated if a long time has passed since the last fetch. This option specifies that delay as a duration string (see footnote 1).
1 A duration string is a sequence of decimal numbers, each with an optional fraction and a unit suffix, such as 300ms, 1.5h, or 2h45m. Valid time units are ns, us, ms, s, m, and h.
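The duration string format described in the footnote can be illustrated with a small Python parser. This is a sketch under assumptions, not the plugin's implementation: the function name `parse_duration_ns` is hypothetical, and it covers only the unsigned form and the six units listed above. The format resembles the one accepted by Go's time.ParseDuration, though the source does not say which parser the plugin uses.

```python
import re
from decimal import Decimal

# Nanoseconds per unit, for the units listed in the footnote.
_UNIT_NS = {
    "ns": 1,
    "us": 1_000,
    "ms": 1_000_000,
    "s": 1_000_000_000,
    "m": 60 * 1_000_000_000,
    "h": 3600 * 1_000_000_000,
}

# A decimal number with an optional fraction, followed by a unit suffix.
# Two-letter units come first so that "ms" is not read as "m".
_TOKEN = re.compile(r"(\d+(?:\.\d*)?|\.\d+)(ns|us|ms|s|m|h)")


def parse_duration_ns(text: str) -> int:
    """Parse a duration string such as '2h45m' into whole nanoseconds."""
    text = text.strip()
    if not text:
        raise ValueError("empty duration string")
    total = Decimal(0)
    pos = 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration string: {text!r}")
        total += Decimal(match.group(1)) * _UNIT_NS[match.group(2)]
        pos = match.end()
    return int(total)
```

Decimal arithmetic is used so that fractional values such as 1.5h convert to nanoseconds without binary floating-point rounding.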
logging
Option
Description
Default
The path to the plugin log file.
The path to the internal NMXAGGR log file.
The log level. Possible values:
The maximum size a log file can reach before it is rotated.
The number of rotated log files to retain.
1 10 MB
After the plugin is activated, an "NVLink" section becomes available in the dashboards.
NVLink Dashboard View
This view presents an overview of inventory elements (domains, switches, GPUs, partitions, and compute node allocations) along with a filter for their health status.
Users can drill down from overall status indicators to specific elements, and further into the individual ports or links associated with each selected element.
The user can select a specific domain, upon which a list of associated switches and GPUs will be displayed, as illustrated in the example below.
If the selected domain has any health issues, a detailed breakdown of the affected devices will also be presented.
When an unhealthy device is selected, a list of all its ports and links will be displayed.
Additionally, the "Recent Events" notification panel on the right side of the screen is updated with the most recent health status changes of the devices.
Managed Elements View
The Managed Elements view is a tree-tabular display that shows all inventory elements, allowing users to browse through them. It also provides the option to add or remove domains.
Domains View
Add New Domain Model
Click the + icon in the upper dashboard to add a new domain.
Available Actions for the Selected Domain
The following actions are available when you right-click on the selected domain's row.
Action | Description
Remove | Removes the selected domain and its elements from the inventory.
Go To Switches | Redirects you to the switches of the selected domain.
Go To GPUs | Redirects you to the GPUs of the selected domain.
Go To Ports | Redirects you to all ports of the selected domain.
Go To Links | Redirects you to all links of the selected domain.
Switch Nodes View
This screen presents a table listing all the switch nodes, including key details.
Available Actions for the Selected Switch Node
The following actions are available when you right-click on the selected switch node's row.
Action | Description
Go To Domain | Redirects you to the parent domain of the selected switch node.
Go To Switches | Redirects you to the switches of the selected switch node.
Switches View
This screen presents a table listing all the switches, including key details.
Available Actions for the Selected Switch
The following actions are available when you right-click on the selected switch's row.
Action | Description
Go To Domain | Redirects you to the parent domain of the selected switch.
Go To Ports | Redirects you to the ports of the selected switch.
Compute Nodes View
This screen presents a table listing all the compute nodes, including key details.
Available Actions for the Selected Compute Node
The following actions are available when you right-click on the selected compute node's row.
Action | Description
Go To Domain | Redirects you to the parent domain of the selected compute node.
Go To GPUs | Redirects you to the GPUs of the selected compute node.
Go To Partitions | Redirects you to the partition assigned to the selected compute node.
GPUs View
This screen presents a table listing all the GPUs, including key details.
Available Actions for the Selected GPU
The following actions are available when you right-click on the selected GPU's row.
Action | Description
Go To Domain | Redirects you to the parent domain of the selected GPU.
Go To Ports | Redirects you to the ports of the selected GPU.
Ports View
This screen presents a table listing all the ports, including key details.
Available Actions for the Selected Port
The following actions are available when you right-click on the selected port's row.
Action | Description
Go To Domain | Redirects you to the parent domain of the selected port.
Links View
This screen presents a table listing all the links, including key details.
Available Actions for the Selected Link
The following actions are available when you right-click on the selected link's row.
Action | Description
Go To Domain | Redirects you to the parent domain of the selected link.
Partitions View
This screen presents a table that lists all available partitions, including key details, and lets you manage them.
Add New Partition
Click the + icon in the upper dashboard to add a new partition and assign compute nodes to that partition.
To create a new partition, complete the following fields in the two-step wizard:
Specify the partition ID in hex.
Specify the partition type (UID-based or location-based).
Specify the domain to which the partition should be assigned.
Select the compute nodes that are members of the new partition.
You cannot assign the same compute node to multiple partitions.
Available Actions for the Selected Partition
The following actions are available when you right-click on the selected partition's row.
Action | Description
Remove | Removes the selected partition.
Edit | Edits the compute node members of the selected partition.
Go To Domain | Redirects you to the assigned domain of the selected partition.
Go To Compute Nodes | Redirects you to all compute node members of the selected partition.
The REST API documentation is available separately (see NVLink REST API).
In addition, the API specification in the OpenAPI format can be accessed at the /ufmRestV2/plugin/nmxaggr/v1/app/swagger endpoint of the running plugin.
When operating with a standalone NMXAGGR instance, the /ufmRestV2/plugin/nmxaggr prefix is not required. In contrast, when operating with the NVLink plugin, the prefix must be used.
Therefore, adjust the API endpoint URLs to match your deployment scenario (plugin mode or standalone) to ensure proper communication.
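The prefix rule can be captured in a small helper that builds endpoint URLs for either deployment mode. This is a Python sketch only: the function name `api_url` and the host names in the usage note are hypothetical, and the standalone form of the swagger path is inferred from the rule above rather than stated in the source.

```python
# Prefix required in plugin mode, as stated in the document.
PLUGIN_PREFIX = "/ufmRestV2/plugin/nmxaggr"


def api_url(base_url: str, path: str, plugin_mode: bool = True) -> str:
    """Build an endpoint URL, adding the plugin prefix only in plugin mode."""
    prefix = PLUGIN_PREFIX if plugin_mode else ""
    return base_url.rstrip("/") + prefix + "/" + path.lstrip("/")
```

For example, `api_url("https://ufm.example.com", "/v1/app/swagger")` gives the plugin-mode URL of the OpenAPI specification, while passing `plugin_mode=False` with the address of a standalone NMXAGGR instance omits the prefix.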