NVIDIA UFM Enterprise User Manual v6.23.11 LTS (2025 LTS U1)
Date: 2025-12-15
NVLink Plugin

Plugin Release Notes

Changes and New Features

| Plugin Version | Feature | Description |
| --- | --- | --- |
| 1.2.2-0 | Partitions Management | Added the ability to create, update, and remove partitions. For more information, refer to Partitions View. |
| 1.2.2-0 | Compute and Switch Nodes Views | Added the ability to view the available compute and switch nodes. For more information, refer to Switch Nodes View and Compute Nodes View. |


Bug Fixes

| Plugin Version | Bug Fix |
| --- | --- |
| 1.2.2-0 | N/A |


Overview

The NVLink plugin enables centralized monitoring and management of multiple NVLink domains through both the UFM UI and REST APIs. At its core is the NMX Aggregator (NMXAGGR), which connects to multiple NVLink domains, gathers data from their NMX Controllers (NMX-C), and consolidates information about monitored components. By default, the plugin includes a built-in NMXAGGR, but it can also be configured to connect to an external NMXAGGR instance—either on the same host or a different system. Communication with NVLink domains is performed via the NMX-C using a gRPC-based API.

Deployment

  1. Download the Plugin Image

    Run the following command to download the NVLink plugin image:

    docker pull mellanox/ufm-plugin-nvlink

  2. Load the Plugin into UFM

    After downloading, you can load the plugin into UFM using one of the following methods:

    • Via UFM UI:

      Navigate to Settings → Plugins Management in the UFM web interface.

    • Via Command Line:

      Execute the following command on the UFM server terminal:

      /opt/ufm/scripts/manage_ufm_plugins.sh add -p nvlink

Usage

Container Volume Mapping

The UFM plugin management system creates the following mappings between the file system of the plugin's Docker container and the file system of the host machine:

| Container Directory | Host Directory |
| --- | --- |
| /config | /opt/ufm/files/conf/plugins/nvlink |
| /log | /opt/ufm/files/log/plugins/nvlink |

Any file system path mentioned in this document refers to the container's file system, unless stated otherwise.
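Because the rest of this document uses container paths, it can help to translate them to host paths when inspecting files from the UFM server. The following is a minimal sketch based only on the two mappings listed above; the function name `to_host_path` is hypothetical and is not part of the plugin.

```python
from pathlib import PurePosixPath

# Mappings taken from this section: container directory -> host directory.
VOLUME_MAP = {
    "/config": "/opt/ufm/files/conf/plugins/nvlink",
    "/log": "/opt/ufm/files/log/plugins/nvlink",
}


def to_host_path(container_path):
    """Translate a path inside the plugin container to the matching host path."""
    path = PurePosixPath(container_path)
    for container_dir, host_dir in VOLUME_MAP.items():
        try:
            relative = path.relative_to(container_dir)
        except ValueError:
            continue  # the path is not under this mapped directory
        return str(PurePosixPath(host_dir) / relative)
    raise ValueError(f"{container_path} is not under a mapped directory")


print(to_host_path("/config/nvlink_plugin.conf"))
print(to_host_path("/log/nmxaggr.log"))
```

Paths outside /config and /log exist only inside the container, so the sketch raises an error for them rather than guessing a host location.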

NVLink Domains Connection Security

The plugin, specifically its NMXAGGR component, interacts with NVLink domains over a gRPC connection. In this setup, the domain controller (NMX-C) acts as the server, while NMXAGGR functions as the client.

NMXAGGR supports three modes of gRPC communication:

  1. Insecure – No encryption is used. This is the default mode.

  2. Server-side TLS – Communication is encrypted. Only the server needs to present a certificate to the client. This mode is enabled by setting the cacert option (refer to the Configuration section).

  3. Mutual TLS (mTLS) – Communication is encrypted, and both the client and server must authenticate each other using certificates. This mode requires setting the cacert, cert, and key options (refer to the Configuration section).
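The relationship between the three options and the resulting mode can be summarized in a few lines. This is an illustrative sketch of the rules stated above, not the NMXAGGR implementation; the function name `grpc_mode` and the returned labels are hypothetical.

```python
def grpc_mode(cacert=None, cert=None, key=None):
    """Return the gRPC mode implied by the cacert, cert, and key options."""
    if cert or key:
        # cert and key are only valid together, and both require cacert.
        if not (cacert and cert and key):
            raise ValueError("cert and key must be set together with cacert")
        return "mutual-tls"
    if cacert:
        return "server-side-tls"
    return "insecure"  # default when no option is set


print(grpc_mode())
print(grpc_mode(cacert="/config/ca.pem"))
print(grpc_mode(cacert="/config/ca.pem", cert="/config/client.pem", key="/config/client.key"))
```

The certificate paths in the usage lines are invented for the example. How the plugin itself reacts to an incomplete combination (for example, cert without key) is not described here, so the error in the sketch is an assumption.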

Managed Domains List

All NVLink domains that are managed or monitored by NMXAGGR are recorded in a list stored in the file <data_dir>/domains.csv (see the Configuration section for the definition of data_dir).

This file provides an alternative to the Web UI and REST API for adding or removing managed domains. Each line in the file contains three comma-separated values:

  • host - a hostname or an IP address, either of which can include numerical ranges to define multiple hosts in one line; required

  • controller port - the port number of the gRPC endpoint of a domain controller (NMX-C); optional; default value is 9370

  • telemetry port - the port number of the management gRPC endpoint of the telemetry component (NMX-T); optional; default value is 9351

Examples:

  • 10.222.16.333,9370,

  • nv-dmn-01,,6666

  • 10.222.[16,17,20-28].[330-350],,

  • nv-dmn-[01-8],9371,9355

Note

For any changes made directly to the file to take effect, the plugin must be restarted.

Note

When NMXAGGR writes the file (as a result of changes to the managed domains list performed via UI or REST API), it expands addresses containing ranges and writes one address per line.
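To make the line format and the range expansion concrete, the sketch below parses one line of domains.csv into one (host, controller port, telemetry port) tuple per expanded host. It is inferred from the examples above and is not the NMXAGGR parser: the bracket grammar (comma-separated values and low-high ranges), the handling of zero padding, and all function names are assumptions. It does not validate that the result is a legal IP address.

```python
import itertools
import re

DEFAULT_CONTROLLER_PORT = 9370  # documented NMX-C default
DEFAULT_TELEMETRY_PORT = 9351   # documented NMX-T default

BRACKET = re.compile(r"\[([^\]]+)\]")
# Split on commas that are not inside a [...] range expression.
FIELD_SEPARATOR = re.compile(r",(?![^\[]*\])")


def expand_range_list(spec):
    """Expand '16,17,20-22' into ['16', '17', '20', '21', '22']."""
    values = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            # Assumption: a zero-padded lower bound sets the width (01-8 -> 01..08).
            width = len(low) if low.startswith("0") else 0
            values.extend(str(n).zfill(width) for n in range(int(low), int(high) + 1))
        else:
            values.append(part)
    return values


def expand_host(pattern):
    """Return every host described by a pattern that may contain ranges."""
    # With one capture group, split() alternates literal text and bracket contents.
    pieces = BRACKET.split(pattern)
    options = [expand_range_list(p) if i % 2 else [p] for i, p in enumerate(pieces)]
    return ["".join(combo) for combo in itertools.product(*options)]


def parse_domain_line(line):
    """Parse 'host,controller_port,telemetry_port' into expanded tuples."""
    fields = [f.strip() for f in FIELD_SEPARATOR.split(line.strip())]
    fields += [""] * (3 - len(fields))  # tolerate omitted trailing fields
    host, controller, telemetry = fields[:3]
    if not host:
        raise ValueError("host is required")
    controller_port = int(controller) if controller else DEFAULT_CONTROLLER_PORT
    telemetry_port = int(telemetry) if telemetry else DEFAULT_TELEMETRY_PORT
    return [(h, controller_port, telemetry_port) for h in expand_host(host)]


for entry in parse_domain_line("nv-dmn-[01-8],9371,9355"):
    print(entry)
```

Expanding every pattern to one host per entry mirrors what the note above describes NMXAGGR doing when it rewrites the file.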

Configuration

The plugin can be configured by editing the config file /config/nvlink_plugin.conf.

There are two sections in the config file:

nmxaggr

| Option | Description | Default |
| --- | --- | --- |
| use_internal | If true, the internal NMXAGGR is used; otherwise, an external one is used. | true |
| api_address | The address of the NMXAGGR REST API server. | http://127.0.0.1:8966 |
| cacert | The path to a file containing trusted root certificates for verifying NMX-C servers. If not set, insecure gRPC connections are used. | - |
| cert | The path to a file containing the client certificate to present to NMX-C servers. Must be used with the cacert and key options. | - |
| key | The path to a file containing the client private key for connections to NMX-C servers. Must be used with the cacert and cert options. | - |
| data_dir | The path to a directory where the internal NMXAGGR stores its persistent data. | /config |
| periodic_fetch_delay | If the plugin fails to subscribe to domain change notifications, it fetches data from the domain periodically. This option specifies the delay between those periodic fetches as a duration string [1]. | 30s |
| supplementary_fetch_delay | Normally, after the initial data fetch, data is fetched from a domain only when a change notification arrives from the domain controller. A supplementary fetch is also initiated if a long time has passed since the last fetch. This option specifies that delay as a duration string [1]. | 1m |

[1] A duration string is a sequence of decimal numbers, each with an optional fraction and a unit suffix, such as 300ms, 1.5h, or 2h45m. Valid time units are ns, us, ms, s, m, and h.
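As an illustration of the duration string format described above, the following sketch converts such a string to seconds. It is a simplified reading of the definition given here, not the parser used by the plugin; the function name `parse_duration` is hypothetical, and signs or other unit spellings are deliberately not handled.

```python
import re

# Seconds per unit, limited to the units listed in this document.
UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
# A decimal number with an optional fraction, followed by a unit suffix.
TOKEN = re.compile(r"(\d+(?:\.\d*)?|\.\d+)(ns|us|ms|s|m|h)")


def parse_duration(text):
    """Convert a duration string such as '2h45m' to a number of seconds."""
    position = 0
    total = 0.0
    while position < len(text):
        match = TOKEN.match(text, position)
        if match is None:
            raise ValueError(f"invalid duration string: {text!r}")
        total += float(match.group(1)) * UNIT_SECONDS[match.group(2)]
        position = match.end()
    if position == 0:
        raise ValueError("empty duration string")
    return total


print(parse_duration("30s"))    # default periodic_fetch_delay
print(parse_duration("1m"))     # default supplementary_fetch_delay
print(parse_duration("2h45m"))
```

A value without a unit suffix, such as 30, is rejected by this sketch because the definition requires a suffix on every number.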

logging

| Option | Description | Default |
| --- | --- | --- |
| nvlink_log | The path to the plugin log file. | /log/nvlink.log |
| nmxaggr_log | The path to the internal NMXAGGR log file. | /log/nmxaggr.log |
| log_level | The log level. Possible values: DEBUG, INFO, WARNING, ERROR, CRITICAL. | INFO |
| log_file_max_size | The maximum size of a log file, in bytes, before the file is rotated. | 10485760 (10 MB) |
| log_file_retain_count | The number of rotated log files to retain. | 5 |
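The sketch below shows how the two sections and their documented defaults fit together. It assumes that /config/nvlink_plugin.conf uses INI syntax with [nmxaggr] and [logging] section headers, which this document does not state explicitly; the sample content and the certificate path are invented for the example.

```python
import configparser

# Documented defaults; options without a default (cacert, cert, key) are omitted.
DEFAULTS = {
    "nmxaggr": {
        "use_internal": "true",
        "api_address": "http://127.0.0.1:8966",
        "data_dir": "/config",
        "periodic_fetch_delay": "30s",
        "supplementary_fetch_delay": "1m",
    },
    "logging": {
        "nvlink_log": "/log/nvlink.log",
        "nmxaggr_log": "/log/nmxaggr.log",
        "log_level": "INFO",
        "log_file_max_size": "10485760",
        "log_file_retain_count": "5",
    },
}

# Hypothetical file content (assumed INI syntax, invented certificate path).
SAMPLE = """
[nmxaggr]
cacert = /config/certs/ca.pem
periodic_fetch_delay = 45s

[logging]
log_level = DEBUG
"""


def load_config(text):
    """Overlay the options found in the file on top of the documented defaults."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_dict(DEFAULTS)
    parser.read_string(text)
    return parser


config = load_config(SAMPLE)
print(config["nmxaggr"]["periodic_fetch_delay"])
print(config["nmxaggr"].getboolean("use_internal"))
print(config["logging"].getint("log_file_max_size"))
```

Options that the sample leaves out, such as use_internal, fall back to the defaults from the tables above.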

UI Views

After the plugin is activated, an "NVLink" section becomes available in the dashboards.

NVLink Dashboard View

This view presents an overview of inventory elements, such as domains, switches, GPUs, partitions, and compute node allocations, along with a filter for their health status.

Users can drill down from overall status indicators to specific elements, and further into the individual ports or links associated with each selected element.

[Image: NVLink dashboard overview]

The user can select a specific domain, upon which a list of associated switches and GPUs will be displayed, as illustrated in the example below.

If the selected domain has any health issues, a detailed breakdown of the affected devices will also be presented.

[Image: dashboard showing unhealthy GPUs]

When an unhealthy device is selected, a list of all its ports and links will be displayed.

[Image: ports and links of a selected unhealthy device]

Additionally, the "Recent Events" notification panel on the right side of the screen is updated with the most recent health status changes of the devices.

[Image: Recent Events notification panel]

Managed Elements View

The Managed Elements view is a tree-tabular display that shows all inventory elements, allowing users to browse through them. It also provides the option to add or remove domains.

Domains View

[Image: Domains view]

Add New Domain Modal

Click the + icon in the upper dashboard to add a new domain.

[Image: Add New Domain dialog]

Available Actions for the Selected Domain

The following actions are available when you right-click on the selected domain's row.

| Action | Description |
| --- | --- |
| Remove | Removes the selected domain and its elements from the inventory. |
| Go To Switches | Redirects you to the switches of the selected domain. |
| Go To GPUs | Redirects you to the GPUs of the selected domain. |
| Go To Ports | Redirects you to all ports of the selected domain. |
| Go To Links | Redirects you to all links of the selected domain. |

Switch Nodes View

This screen presents a table listing all the switch nodes, including key details.

[Image: Switch Nodes view]

Available Actions for the Selected Switch Node

The following actions are available when you right-click on the selected switch node's row.

| Action | Description |
| --- | --- |
| Go To Domain | Redirects you to the parent domain of the selected switch node. |
| Go To Switches | Redirects you to the switches of the selected switch node. |

Switches View

This screen presents a table listing all the switches, including key details.

[Image: Switches view]

Available Actions for the Selected Switch

The following actions are available when you right-click on the selected switch's row.

| Action | Description |
| --- | --- |
| Go To Domain | Redirects you to the parent domain of the selected switch. |
| Go To Ports | Redirects you to the ports of the selected switch. |

Compute Nodes View

This screen presents a table listing all the compute nodes, including key details.

[Image: Compute Nodes view]

Available Actions for the Selected Compute Node

The following actions are available when you right-click on the selected compute node's row.

| Action | Description |
| --- | --- |
| Go To Domain | Redirects you to the parent domain of the selected compute node. |
| Go To GPUs | Redirects you to the GPUs of the selected compute node. |
| Go To Partitions | Redirects you to the partition assigned to the selected compute node. |

GPUs View

This screen presents a table listing all the GPUs, including key details.

[Image: GPUs view]

Available Actions for the Selected GPU

The following actions are available when you right-click on the selected GPU's row.

| Action | Description |
| --- | --- |
| Go To Domain | Redirects you to the parent domain of the selected GPU. |
| Go To Ports | Redirects you to the ports of the selected GPU. |

Ports View

This screen presents a table listing all the ports, including key details.

[Image: Ports view]

Available Actions for the Selected Port

The following actions are available when you right-click on the selected port's row.

| Action | Description |
| --- | --- |
| Go To Domain | Redirects you to the parent domain of the selected port. |

Links View

This screen presents a table listing all the links, including key details.

[Image: Links view]

The following actions are available when you right-click on the selected link's row.

| Action | Description |
| --- | --- |
| Go To Domain | Redirects you to the parent domain of the selected link. |

Partitions View

This screen presents a table that lists all available partitions, including key details, and allows you to manage them.

[Image: Partitions view]

Add New Partition

Click the + icon in the upper dashboard to add a new partition and assign compute nodes to that partition.

To create a new partition, complete the following fields in the two-step wizard:

  1. Specify the partition ID in hexadecimal.

  2. Specify the partition type (UID-based or location-based).

  3. Specify the domain to which the partition should be assigned.

  4. Select the compute nodes that are members of the new partition.

Note

The same compute node cannot be assigned to multiple partitions.

[Images: Add New Partition wizard]

Available Actions for the Selected Partition

The following actions are available when you right-click on the selected partition's row.

| Action | Description |
| --- | --- |
| Remove | Removes the selected partition. |
| Edit | Edits the compute node members of the selected partition. |
| Go To Domain | Redirects you to the assigned domain of the selected partition. |
| Go To Compute Nodes | Redirects you to all compute node members of the selected partition. |

REST API

The REST API documentation is available separately (see NVLink REST API).

In addition, the API specification in OpenAPI format can be accessed at the /ufmRestV2/plugin/nmxaggr/v1/app/swagger endpoint of the running plugin.

Info

When operating with a standalone NMXAGGR instance, the /ufmRestV2/plugin/nmxaggr prefix is not required. In contrast, when operating with the NVLink plugin, the prefix must be used.

Adjust the API endpoint URLs to match your deployment scenario (plugin mode or standalone) to ensure proper communication.
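The prefix rule can be captured in a small helper that builds the endpoint URL for either scenario. This is a sketch only: the function name `api_url` and the host name ufm.example.com are hypothetical, while the prefix, the swagger path, and the 127.0.0.1:8966 address come from this document.

```python
PLUGIN_PREFIX = "/ufmRestV2/plugin/nmxaggr"


def api_url(base_url, path, plugin_mode=True):
    """Build an API URL, adding the plugin prefix only in plugin mode."""
    prefix = PLUGIN_PREFIX if plugin_mode else ""
    return base_url.rstrip("/") + prefix + "/" + path.lstrip("/")


# Plugin mode: requests go through UFM, so the prefix is required.
print(api_url("https://ufm.example.com", "/v1/app/swagger"))
# Standalone NMXAGGR: the same path is used without the prefix.
print(api_url("http://127.0.0.1:8966", "/v1/app/swagger", plugin_mode=False))
```

Keeping the mode as a single flag means the rest of a client can use the same relative paths in both deployment scenarios.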
