Install and Configure NMX Manager (NMX-M)#

NMX-M provides a single interface for management and telemetry collection of NVLink switches. NMX-M is deployed on Kubernetes, along with the other components that make up Mission Control.

NMX-M Kubernetes Setup#

Prerequisites#

NMX-M Permanent License Generation and Application Guide

Generating License File

Prerequisites

  • Prepare a list of servers with the MAC address of each server on which you plan to install the NMX-M software (see the example command after this list)

  • Access to the NVIDIA Licensing Portal (NLP) with valid credentials
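
As a hedged example, the MAC addresses can be collected in one pass with pdsh, assuming the NMX-M hosts are the nodes in the k8s-admin category and that eno1 is the license-registered interface (adjust both for your environment):

pdsh -g category=k8s-admin "cat /sys/class/net/eno1/address" | dshbak -c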

Steps to Generate License File

  1. Access the NVIDIA Licensing Portal
     • Go to the NVIDIA Licensing Portal (NLP)
     • Log in using your credentials

  2. Navigate to Network Entitlements
     • Click on the Network Entitlements tab
     • You’ll see a list with the serial licenses of all your software products, along with license information and status

  3. Select and Activate License
     • Select the license you want to activate
     • Click on the “Actions” button

  4. Configure MAC Addresses
     • In the MAC Address field, enter the MAC address of the delegated license-registered host
     • If applicable, in the HA MAC Address field, enter your High Availability (HA) server MAC address
     • Note: If you have more than one NIC installed on a UFM Server, use any of the MAC addresses

  5. Generate and Download License
     • Click on Generate License File to create the license key file for the software
     • Click on Download License File and save it on your local computer

Important Notes about License Regeneration

  • If you replace your NIC or server, repeat the process of generating the license to set new MAC addresses
  • You can only regenerate a license two times
  • To regenerate the license after that, contact NVIDIA Sales Administration at enterprisesupport@nvidia.com

NMX-M deployment in a shared Kubernetes environment requires shared storage for persistent volumes (PVs) and persistent volume claims (PVCs). This is accomplished using Longhorn, which provides distributed block storage. To enable Longhorn, the iSCSI client must be installed on the nodes used for Kubernetes.

Installation on software image:

cm-chroot-sw-img /cm/images/k8s-admin-image

root@k8s-admin-image:/# apt-get update; apt-get install -y open-iscsi
root@k8s-admin-image:/# systemctl enable iscsid open-iscsi
root@k8s-admin-image:/# echo "fs.inotify.max_user_instances = 1024" >> /etc/sysctl.d/60-local.conf
exit

Push changes to Kubernetes nodes:

cmsh -c "device; foreach -c k8s-admin (imageupdate -w)"

Reload and configure iSCSI service:

pdsh -g category=k8s-admin "systemctl daemon-reload; /sbin/iscsi-iname -p \"InitiatorName=iqn.2005-03.org.open-iscsi\" > /etc/iscsi/initiatorname.iscsi; chmod 0600 /etc/iscsi/initiatorname.iscsi"

Create a configuration overlay to persist:

cmsh -c "configurationoverlay; add open-iscsi; set categories k8s-admin; roles; assign generic::open-iscsi; set services open-iscsi; excludelistsnippets; add initiatorname.iscsi; append excludelist /etc/iscsi/initiatorname.iscsi; set modefull yes; set modegrab yes; set modegrabnew yes; commit"

Validate configuration overlay:

cmsh -c "configurationoverlay; use open-iscsi; use open-iscsi; roles; use generic::open-iscsi; excludelistsnippets; list"

Name (key)           Lines            Disabled     Mode sync      Mode full      Mode update    Mode grab      Mode grab new
-------------------- ---------------- ------------ -------------- -------------- -------------- -------------- --------------
initiatorname.iscsi  1                no           yes            yes            yes            yes            yes

Reboot nodes:

cmsh -c "device; reboot -c k8s-admin"

Validate that initiatorname file has persisted:

pdsh -g category=k8s-admin "cat /etc/iscsi/initiatorname.iscsi" | dshbak -c

Example output:

node001: InitiatorName=iqn.2005-03.org.open-iscsi:229655aa846
node002: InitiatorName=iqn.2005-03.org.open-iscsi:6ffb48fb233d
node003: InitiatorName=iqn.2005-03.org.open-iscsi:581ac1b2a151

Download the NMX-M install package#

Downloading NMX-M#

  1. Go to the NVIDIA Licensing Portal (NLP) and log in using your credentials.

  2. Click on Software Downloads, filter the product family to NMX-M, and select the relevant version of the software. Click on Download.

  3. Save the file on your local drive.

  4. Click Close.

Copy the .tar.gz file to the BCM head node:

rsync -azP NMX-MGR-85.1.2000.tar.gz root@bcm11-head-01:/root

Uncompress the package:

tar xvzf NMX-MGR-85.1.2000.tar.gz
cd NMX-M
find . -type f -name "*gz" -exec tar -xvzf {} \;
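
The extracted tree should contain the Installation, Infra, and Services directories referenced by the steps that follow; a quick listing confirms the layout:

ls -d Installation Infra Services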

Install Zarf components#

NMX-M is packaged using Zarf, which allows for air-gapped installation on Kubernetes. This works in part by running a local Docker registry and redirecting subsequent public container image pulls back to it. This behavior isn’t desired for non-NMX-M components in the shared Kubernetes cluster, so those namespaces are excluded from Zarf below.

Exclude existing namespaces from Zarf#

for i in $(kubectl get ns -o custom-columns=NAME:.metadata.name --no-headers | grep -Ev "infra|longhorn|kafka|nmx|zarf"); do kubectl label ns $i zarf.dev/agent=ignore; done
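
To confirm that the label was applied only to the intended namespaces, list the namespaces with the zarf.dev/agent label column:

kubectl get ns -L zarf.dev/agent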

Setup Zarf#

mv ./Installation/prerequisites/zarf /usr/local/bin/
chmod +x /usr/local/bin/zarf
cd Installation/Zarf_init
zarf init -a amd64 --storage-class local-path --confirm
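
Before continuing, it can be useful to verify that the Zarf bootstrap components (including its internal registry) are running in the zarf namespace:

kubectl get pods -n zarf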

Deploy local registry daemonset#

kubectl apply -f registry-image-spread-daemonset.yaml
zarf tools wait-for ds registry-image-spread '{.status.numberReady}'=3 \
-n zarf \
--timeout=300s

Pin registry containers on Kubernetes nodes#

pdsh -g category=k8s-admin 'PATH="/cm/local/apps/cmd/bin/:$PATH"; source /etc/profile; module load containerd; ctr --namespace k8s.io images label 127.0.0.1:31999/library/registry:2.8.3 io.cri-containerd.pinned=pinned' | dshbak -c
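
If desired, confirm the pin by listing the registry image and checking for the io.cri-containerd.pinned label (same environment setup as above):

pdsh -g category=k8s-admin 'PATH="/cm/local/apps/cmd/bin/:$PATH"; source /etc/profile; module load containerd; ctr --namespace k8s.io images ls | grep "127.0.0.1:31999/library/registry"' | dshbak -c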

Remove local registry daemonset#

kubectl delete -f registry-image-spread-daemonset.yaml

Install Longhorn#

Longhorn is a distributed, shared block storage solution that provides persistent storage for Kubernetes. The LONGHORN_DATA_PATH variable defines which local storage path Longhorn uses. On control nodes, this should be set to /local, as this is where a multi-terabyte software RAID is configured.
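
Before deploying, it is worth confirming that /local exists on each node and has sufficient free space for Longhorn volumes:

pdsh -g category=k8s-admin "df -h /local" | dshbak -c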

Deploy Longhorn via Zarf#

cd ../..
zarf package deploy Infra/zarf-package-infra-amd64-*.tar.zst \
--components="longhorn" \
--confirm \
--set LONGHORN_DATA_PATH=/local

Wait for this to complete:

zarf tools wait-for ds longhorn-csi-plugin '{.status.numberReady}'=3 \
-n longhorn-system \
--timeout=300s

Verify instance managers are all up:

while true; do
  count=$(kubectl -n longhorn-system get pods --field-selector=status.phase=Running -l longhorn.io/component=instance-manager --no-headers | wc -l)
  if [ "$count" -eq 3 ]; then
    echo "Found 3 running instance-manager pods"
    break
  else
    echo "Currently $count running pods, waiting for 3..."
    sleep 5
  fi
done

Remove Zarf from local storage and redeploy on Longhorn shared storage#

cd Installation/Zarf_init
zarf destroy --confirm --no-progress --no-color
zarf init --no-progress --no-color -a amd64 --storage-class longhorn --set REGISTRY_PVC_ENABLED=true --confirm
cd ../..
zarf package deploy Infra/zarf-package-infra-amd64-*.tar.zst \
--components="*" \
--set LONGHORN_DATA_PATH=/local \
--set ALERTMANAGER_WEBHOOK_URL=http://kube-prometheus-stack-alertmanager.prometheus:9093 \
--set RW_PASSWORD=rw-password \
--set RO_PASSWORD=ro-password \
--confirm

With Zarf and Longhorn installed, we can deploy the NMX-M Zarf packages.

Run the zarf package deploy to install NMX-M:

zarf package deploy ./Services/zarf-package-services-amd64-*.tar.zst --components="*" --confirm
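
Once the package deploy finishes, verify that the NMX-M service pods reach the Running state. Assuming the NMX-M services land in namespaces matching the nmx and kafka patterns excluded earlier, a quick check is:

kubectl get pods -A | grep -Ei "nmx|kafka"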

Configure Longhorn to not be a default storageclass#

After installation of NMX-M, Longhorn is set as a default storageclass. This is not desirable, because Longhorn should only be used by the NMX-M components.

kubectl get storageclass
NAME                      PROVISIONER                                      RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path (default)      cluster.local/local-path-provisioner             Delete          WaitForFirstConsumer   true                   25h
longhorn (default)        driver.longhorn.io                               Delete          Immediate              true                   24h
longhorn-no-replication   driver.longhorn.io                               Delete          Immediate              true                   24h
longhorn-static           driver.longhorn.io                               Delete          Immediate              true                   24h
shoreline-local-path-sc   cluster.local/shoreline-local-path-provisioner   Delete          WaitForFirstConsumer   true                   18h

We can run a patch to correct this behavior.

kubectl patch storageclass longhorn -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'


kubectl get storageclass
NAME                      PROVISIONER                                      RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path (default)      cluster.local/local-path-provisioner             Delete          WaitForFirstConsumer   true                   26h
longhorn                  driver.longhorn.io                               Delete          Immediate              true                   25h
longhorn-no-replication   driver.longhorn.io                               Delete          Immediate              true                   25h
longhorn-static           driver.longhorn.io                               Delete          Immediate              true                   25h
shoreline-local-path-sc   cluster.local/shoreline-local-path-provisioner   Delete          WaitForFirstConsumer   true                   19h

NMX-M configuration#