NVIDIA UFM Enterprise User Manual v6.21.2

NVLink Plugin

The NVLink plugin enables centralized monitoring and management of multiple NVLink domains through both the UFM UI and REST APIs. At its core is the NMX Aggregator (NMXAGGR), which connects to multiple NVLink domains, gathers data from their NMX Controllers (NMX-C), and consolidates information about monitored components. By default, the plugin includes a built-in NMXAGGR, but it can also be configured to connect to an external NMXAGGR instance—either on the same host or a different system. Communication with NVLink domains is performed via the NMX-C using a gRPC-based API.

  1. Download the Plugin Image

    Run the following command to download the NVLink plugin image:

    docker pull mellanox/ufm-plugin-nvlink

  2. Load the Plugin into UFM

    After downloading, you can load the plugin into UFM using one of the following methods:

    • Via UFM UI:

      Navigate to Settings → Plugins Management in the UFM web interface.

    • Via Command Line:

      Execute the following command on the UFM server terminal:

      /opt/ufm/scripts/manage_ufm_plugins.sh add -p nvlink

Container Volume Mapping

The UFM plugin management system maps the following directories in the plugin's Docker container file system to directories on the host machine:

| Container Directory | Host Directory |
| --- | --- |
| /config | /opt/ufm/files/conf/plugins/nvlink |
| /log | /opt/ufm/files/log/plugins/nvlink |

Any file system path mentioned in this document refers to the container's file system, unless stated otherwise.
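
Because paths in this document are container paths, it can help to translate them to their host equivalents when inspecting files from the UFM server. The following is a minimal sketch based only on the two mappings listed above; the function name to_host_path is hypothetical and not part of UFM.

```python
from pathlib import PurePosixPath

# Container directory -> host directory, taken from the mapping table above.
VOLUME_MAP = {
    PurePosixPath("/config"): PurePosixPath("/opt/ufm/files/conf/plugins/nvlink"),
    PurePosixPath("/log"): PurePosixPath("/opt/ufm/files/log/plugins/nvlink"),
}


def to_host_path(container_path):
    """Translate a path inside the plugin container to the host path."""
    path = PurePosixPath(container_path)
    for mount, host_dir in VOLUME_MAP.items():
        try:
            relative = path.relative_to(mount)
        except ValueError:
            continue  # path is not under this mount point
        return str(host_dir / relative)
    raise ValueError(f"{container_path} is not under a mapped directory")


print(to_host_path("/config/nvlink_plugin.conf"))
```

Paths outside /config and /log raise ValueError, since the document lists no other mappings.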

NVLink Domains Connection Security

The plugin, specifically its NMXAGGR component, interacts with NVLink domains over a gRPC connection. In this setup, the domain controller (NMX-C) acts as the server, while NMXAGGR functions as the client.

NMXAGGR supports three modes of gRPC communication:

  1. Insecure – No encryption is used. This is the default mode.

  2. Server-side TLS – Communication is encrypted. Only the server needs to present a certificate to the client. This mode is enabled by setting the cacert option (refer to the Configuration section).

  3. Mutual TLS (mTLS) – Communication is encrypted, and both the client and server must authenticate each other using certificates. This mode requires setting the cacert, cert, and key options (refer to the Configuration section).
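
The relationship between the three options and the resulting mode can be summarized in a few lines. This is an illustrative sketch of the rules stated above, not NMXAGGR's actual implementation; the function name grpc_mode and the returned labels are hypothetical.

```python
def grpc_mode(cacert=None, cert=None, key=None):
    """Return the gRPC mode implied by the cacert, cert, and key options."""
    if bool(cert) != bool(key):
        raise ValueError("cert and key must be set together")
    if cert and not cacert:
        raise ValueError("cert and key must be used with cacert")
    if not cacert:
        return "insecure"  # default: no encryption
    # cacert alone enables server-side TLS; adding cert and key enables mTLS.
    return "mtls" if cert else "server-tls"


print(grpc_mode())
print(grpc_mode(cacert="/config/ca.pem"))
print(grpc_mode(cacert="/config/ca.pem", cert="/config/client.pem", key="/config/client.key"))
```

The file names under /config are placeholders chosen for the example.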

Managed Domains List

All NVLink domains that are managed or monitored by NMXAGGR are recorded in a list stored in the file <data_dir>/domains.txt (see the Configuration section for the definition of data_dir).

This file serves as an alternative method—alongside the Web UI and REST API—for adding or removing managed domains. Each line in the file represents the gRPC endpoint address of a domain controller (NMX-C), including its port number.

Addresses may include hostnames or IP addresses, and both can incorporate numerical ranges to define multiple addresses in one line.

Examples:

  • 10.222.16.333:9370

  • nv-dmn-01:6666

  • 10.222.[16,17,20-28].[330-350]:9346

  • nv-dmn-[01-8]:9371

Note

For any changes made directly to the file to take effect, the plugin must be restarted.

Note

When NMXAGGR writes the file (as a result of changes to the managed domains list performed via UI or REST API), it expands addresses containing ranges and writes one address per line.
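
To make the range notation concrete, the sketch below expands one line of domains.txt into individual addresses. It is an approximation written for illustration: the exact syntax rules NMXAGGR applies are not documented here, and the zero-padding behavior (keeping the width of a range start that begins with 0, as in [01-8]) is an assumption. The names expand_address and _expand_group are hypothetical.

```python
import itertools
import re

_GROUP = re.compile(r"\[([^\]]+)\]")


def _expand_group(spec):
    """Expand a bracket body such as '16,17,20-28' into a list of strings."""
    values = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            # Assumption: a leading zero in the range start requests zero-padding.
            width = len(start) if start.startswith("0") else 0
            values.extend(str(n).zfill(width) for n in range(int(start), int(end) + 1))
        else:
            values.append(part)
    return values


def expand_address(line):
    """Return every address described by one line, one entry per combination."""
    groups = [_expand_group(m.group(1)) for m in _GROUP.finditer(line)]
    template = _GROUP.sub("{}", line)
    return [template.format(*combo) for combo in itertools.product(*groups)]


for address in expand_address("nv-dmn-[01-8]:9371"):
    print(address)
```

A line without brackets is returned unchanged as a single-element list.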

Configuration

The plugin is configured by editing the config file /config/nvlink_plugin.conf.

The config file has two sections, nmxaggr and logging:

nmxaggr

| Option | Description | Default |
| --- | --- | --- |
| use_internal | If true, the internal NMXAGGR is used; otherwise, an external one is used. | true |
| api_address | The address of the NMXAGGR REST API server. | http://127.0.0.1:8966 |
| cacert | The path to a file containing trusted root certificates for verifying NMX-C servers. If not set, insecure gRPC connections are used. | - |
| cert | The path to a file containing the client certificate to present to NMX-C servers. Must be used with the cacert and key options. | - |
| key | The path to a file containing the client private key to present to NMX-C servers. Must be used with the cacert and cert options. | - |
| data_dir | The path to a directory where the internal NMXAGGR stores its persistent data. | /config |
| periodic_fetch_delay | If the plugin fails to subscribe to domain change notifications, it fetches data from the domain periodically. This option specifies the delay between those periodic fetches as a duration string [1]. | 30s |
| supplementary_fetch_delay | Normally, after the initial data fetch, data is fetched from a domain only when a change notification is received from its domain controller. A supplementary fetch is also initiated if a long time has passed since the last fetch. This option specifies that delay as a duration string [1]. | 30m |

[1] A duration string is a sequence of decimal numbers, each with an optional fraction and a unit suffix, such as 300ms, 1.5h, or 2h45m. Valid time units are ns, us, ms, s, m, and h.
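
The duration string format described in the note can be checked with a short parser. This is a sketch of the format as described here, not the parser NMXAGGR uses; the function name parse_duration is hypothetical, and returning seconds as a float is a choice made for the example.

```python
import re

# Seconds per unit, for the units listed in the note above.
_UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0, "m": 60.0, "h": 3600.0}
# One term: a decimal number with an optional fraction, then a unit suffix.
_TERM = r"(\d+(?:\.\d*)?|\.\d+)(ns|us|ms|s|m|h)"


def parse_duration(text):
    """Convert a duration string such as '2h45m' to seconds."""
    if not re.fullmatch(f"(?:{_TERM})+", text):
        raise ValueError(f"invalid duration string: {text!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in re.findall(_TERM, text))


print(parse_duration("30s"))    # default periodic_fetch_delay
print(parse_duration("30m"))    # default supplementary_fetch_delay
print(parse_duration("2h45m"))
```

A bare number without a unit, such as 30, is rejected because every number must carry a unit suffix.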

logging

| Option | Description | Default |
| --- | --- | --- |
| nvlink_log | The path to the plugin log file. | /log/nvlink.log |
| nmxaggr_log | The path to the internal NMXAGGR log file. | /log/nmxaggr.log |
| log_level | The log level. Possible values: DEBUG, INFO, WARNING, ERROR, CRITICAL. | INFO |
| log_file_max_size | The maximum size of a log file, after which the file is rotated. | 10485760 (10 MB) |
| log_file_retain_count | The number of rotated log files to retain. | 5 |
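
As an illustration of how the two sections and their defaults fit together, the sketch below overlays a sample config on the documented defaults. It assumes the config file uses INI-style syntax with [nmxaggr] and [logging] section headers, which this document does not confirm; the host name nmxaggr.example.com and the function name load_settings are hypothetical.

```python
import configparser

# Defaults copied from the option tables above (options without a default are omitted).
DEFAULTS = {
    "nmxaggr": {
        "use_internal": "true",
        "api_address": "http://127.0.0.1:8966",
        "data_dir": "/config",
        "periodic_fetch_delay": "30s",
        "supplementary_fetch_delay": "30m",
    },
    "logging": {
        "nvlink_log": "/log/nvlink.log",
        "nmxaggr_log": "/log/nmxaggr.log",
        "log_level": "INFO",
        "log_file_max_size": "10485760",
        "log_file_retain_count": "5",
    },
}

# Assumed INI-style content pointing the plugin at an external NMXAGGR.
SAMPLE = """
[nmxaggr]
use_internal = false
api_address = http://nmxaggr.example.com:8966

[logging]
log_level = DEBUG
"""


def load_settings(text):
    """Overlay the given config text on the documented defaults."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_dict(DEFAULTS)
    parser.read_string(text)
    return {section: dict(parser[section]) for section in DEFAULTS}


settings = load_settings(SAMPLE)
print(settings["nmxaggr"]["api_address"])
print(settings["logging"]["log_level"])
```

Options absent from the sample, such as data_dir, keep their documented defaults.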

After the plugin is activated, an "NVLink" section becomes available in the dashboards.

NVLink Dashboard View

This view presents an overview of inventory elements—such as domains, switches, and GPUs—along with a filter for their health status.

Users can drill down from overall status indicators to specific elements, and further into the individual ports or links associated with each selected element.

image-2025-5-7_8-48-49-version-1-modificationdate-1748450698030-api-v2.png

The user can select a specific domain, upon which a list of associated switches and GPUs will be displayed, as illustrated in the example below.

If the selected domain has any health issues, a detailed breakdown of the affected devices will also be presented.

image-2025-5-7_14-22-16-version-1-modificationdate-1748450694097-api-v2.png

When an unhealthy device is selected, a list of all its ports and links will be displayed.

image-2025-5-7_14-24-2-version-1-modificationdate-1748450693657-api-v2.png

Additionally, the "Recent Events" notification panel on the right side of the screen is updated with the most recent health status changes of the devices.

image-2025-5-7_14-24-33-version-1-modificationdate-1748450693240-api-v2.png

Managed Elements View

The Managed Elements view is a tree-tabular display that shows all inventory elements, allowing users to browse through them. It also provides the option to add or remove domains.

Domains View

image-2025-5-7_8-53-57-version-1-modificationdate-1748450697660-api-v2.png

Add New Domain Modal

Click the + icon in the upper dashboard to add a new domain.

image-2025-5-7_8-54-48-version-1-modificationdate-1748450697237-api-v2.png

Available Actions for the Selected Domain

The following actions are available when you right-click on the selected domain's row.

| Action | Description |
| --- | --- |
| Remove | Removes the selected domain and its elements from the inventory. |
| Go To Switches | Redirects you to the switches of the selected domain. |
| Go To GPUs | Redirects you to the GPUs of the selected domain. |
| Go To Ports | Redirects you to all ports of the selected domain. |
| Go To Links | Redirects you to all links of the selected domain. |

Switches View

This screen presents a table listing all the switches, including key details.

image-2025-5-7_9-5-34-version-1-modificationdate-1748450696140-api-v2.png

Available Actions for the Selected Switch

The following actions are available when you right-click on the selected switch's row.

| Action | Description |
| --- | --- |
| Go To Domain | Redirects you to the parent domain of the selected switch. |
| Go To Ports | Redirects you to the ports of the selected switch. |

GPUs View

This screen presents a table listing all the GPUs, including key details.

image-2025-5-7_9-9-23-version-1-modificationdate-1748450695673-api-v2.png

Available Actions for the Selected GPU

The following actions are available when you right-click on the selected GPU's row.

| Action | Description |
| --- | --- |
| Go To Domain | Redirects you to the parent domain of the selected GPU. |
| Go To Ports | Redirects you to the ports of the selected GPU. |

Ports View

This screen presents a table listing all the ports, including key details.

image-2025-5-7_9-10-19-version-1-modificationdate-1748450694847-api-v2.png

Available Actions for the Selected Port

The following actions are available when you right-click on the selected port's row.

| Action | Description |
| --- | --- |
| Go To Domain | Redirects you to the parent domain of the selected port. |

Links View

This screen presents a table listing all the links, including key details.

image-2025-5-7_9-12-5-version-1-modificationdate-1748450694467-api-v2.png

Available Actions for the Selected Link

The following actions are available when you right-click on the selected link's row.

| Action | Description |
| --- | --- |
| Go To Domain | Redirects you to the parent domain of the selected link. |

REST API

This section contains short descriptions of the available REST API endpoints. More detailed documentation is available at the /ufmRestV2/plugin/nmxaggr/v1/app/swagger endpoint of the running plugin.

Info

When operating with a standalone NMXAGGR instance, the /ufmRestV2/plugin/nmxaggr prefix is not required. In contrast, when operating with the NVLink plugin, the prefix must be used.

Adjust the endpoint URLs below to match your deployment, plugin mode or standalone.
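
The prefix rule can be captured in a small helper that builds the full URL for either deployment. This is a minimal sketch; the function name endpoint_url and the host ufm.example.com are hypothetical, while the prefix and endpoint paths come from this document.

```python
# Prefix required when the API is reached through the NVLink plugin.
PLUGIN_PREFIX = "/ufmRestV2/plugin/nmxaggr"


def endpoint_url(base_url, path, via_plugin):
    """Build the URL for an NMXAGGR endpoint in plugin or standalone mode."""
    prefix = PLUGIN_PREFIX if via_plugin else ""
    return f"{base_url.rstrip('/')}{prefix}/{path.lstrip('/')}"


# Plugin mode: the prefix is inserted before the endpoint path.
print(endpoint_url("https://ufm.example.com", "/v1/app/version", via_plugin=True))
# Standalone NMXAGGR: the endpoint path is used as-is.
print(endpoint_url("http://127.0.0.1:8966", "/v1/app/version", via_plugin=False))
```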

App Operations

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/app/version | GET | Gets the version of the NMXAGGR component |
| /v1/app/swagger | GET | Gets a Swagger UI for browsing the API documentation |


Managed Domains Operations

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/managed_domains | GET | Gets a list of managed domains |
| /v1/managed_domains/add | POST | Adds a new managed domain |
| /v1/managed_domains/remove | POST | Removes an existing managed domain |


Inventory Operations

| Endpoint | Method | Description |
| --- | --- | --- |
| /v1/inventory/domains | GET | Gets a list of domains |
| /v1/inventory/gpus | GET | Gets a list of GPUs |
| /v1/inventory/switches | GET | Gets a list of switches |
| /v1/inventory/ports | GET | Gets a list of ports |
| /v1/inventory/links | GET | Gets a list of links |
| /v1/inventory/stats | GET | Gets statistics about inventory elements |
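
As a rough sketch of how the inventory endpoints might be queried from a script, the example below builds a GET request for each inventory kind using only the Python standard library. Several things are assumptions rather than documented behavior: that responses are JSON, that no authentication is needed (a real UFM deployment will likely require credentials), and the function names themselves. When going through the plugin, base_url should already include the /ufmRestV2/plugin/nmxaggr prefix.

```python
import json
import urllib.request

# Inventory kinds taken from the endpoint table above.
INVENTORY_KINDS = ("domains", "gpus", "switches", "ports", "links", "stats")


def inventory_request(base_url, kind):
    """Build a GET request for one of the /v1/inventory endpoints."""
    if kind not in INVENTORY_KINDS:
        raise ValueError(f"unknown inventory kind: {kind}")
    url = f"{base_url.rstrip('/')}/v1/inventory/{kind}"
    return urllib.request.Request(url, method="GET", headers={"Accept": "application/json"})


def fetch_inventory(base_url, kind, timeout=10.0):
    """Send the request and decode the body, assuming a JSON response."""
    with urllib.request.urlopen(inventory_request(base_url, kind), timeout=timeout) as resp:
        return json.load(resp)


# Building the requests does not touch the network.
for kind in INVENTORY_KINDS:
    print(inventory_request("http://127.0.0.1:8966", kind).full_url)
```

Calling fetch_inventory requires a reachable NMXAGGR instance; the Swagger UI remains the authoritative source for response schemas.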

© Copyright 2025, NVIDIA. Last updated on Jun 3, 2025.