Install and Configure NMX Manager (NMX-M)#
NMX-M provides a single interface for management and telemetry collection of NV Link switches. NMX-M is deployed on Kubernetes, along with the other components that make up Mission Control.
NMX-M Kubernetes Setup#
Prerequisites#
NMX-M Permanent License Generation and Application Guide

Generating the License File

Prerequisites:

- A list of servers with the MAC address of each server on which you plan to install the NMX-M software (see the sketch after this list for one way to collect these)
- Access to the NVIDIA Licensing Portal (NLP) with valid credentials

Steps to generate the license file:

1. Access the NVIDIA Licensing Portal: go to the NVIDIA Licensing Portal (NLP) and log in using your credentials.
2. Navigate to Network Entitlements: click the Network Entitlements tab. You will see a list of the serial licenses for all your software products, along with the license information and status for each.
3. Select and activate the license: select the license you want to activate and click the “Actions” button.
4. Configure MAC addresses: in the MAC Address field, enter the MAC address of the delegated license-registered host. If applicable, enter your High Availability (HA) server MAC address in the HA MAC Address field. Note: if more than one NIC is installed on a UFM Server, use any of the MAC addresses.
5. Generate and download the license: click Generate License File to create the license key file for the software, then click Download License File and save it on your local computer.

Important notes about license regeneration: if you replace your NIC or server, repeat the license generation process to set the new MAC addresses. A license can only be regenerated two times; to regenerate the license after that, contact NVIDIA Sales Administration at enterprisesupport@nvidia.com.
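A minimal sketch for collecting the MAC addresses from candidate hosts, assuming pdsh access to the nodes and that the relevant interface is eth0 (a hypothetical interface name; adjust for your environment):

pdsh -g category=k8s-admin "cat /sys/class/net/eth0/address" | dshbak -c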
NMX-M deployment in a shared Kubernetes environment requires shared storage for persistence of PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs). This is accomplished using Longhorn, which provides distributed block storage. To enable this, the iSCSI client must be installed on the nodes used for Kubernetes.
Installation in the software image:
cm-chroot-sw-img /cm/images/k8s-admin-image
root@k8s-admin-image:/ apt-get update; apt-get install -y open-iscsi
root@k8s-admin-image:/ systemctl enable iscsid open-iscsi
root@k8s-admin-image:/ echo "fs.inotify.max_user_instances = 1024" >> /etc/sysctl.d/60-local.conf
exit
Push changes to Kubernetes nodes:
cmsh -c "device; foreach -c k8s-admin (imageupdate -w)"
Reload systemd and configure the iSCSI initiator name:
pdsh -g category=k8s-admin "systemctl daemon-reload; /sbin/iscsi-iname -p \"InitiatorName=iqn.2005-03.org.open-iscsi\" > /etc/iscsi/initiatorname.iscsi; chmod 0600 /etc/iscsi/initiatorname.iscsi"
Create a configuration overlay so these settings persist:
cmsh -c "configurationoverlay; add open-iscsi; set categories k8s-admin; roles; assign generic::open-iscsi; set services open-iscsi; excludelistsnippets; add initiatorname.iscsi; append excludelist /etc/iscsi/initiatorname.iscsi; set modefull yes; set modegrab yes; set modegrabnew yes; commit"
Validate configuration overlay:
cmsh -c "configurationoverlay; use open-iscsi; use open-iscsi; roles; use generic::open-iscsi; excludelistsnippets; list"
Name (key) Lines Disabled Mode sync Mode full Mode update Mode grab Mode grab new
-------------------- ---------------- ------------ -------------- -------------- -------------- -------------- --------------
initiatorname.iscsi 1 no yes yes yes yes yes
Reboot nodes:
cmsh -c "device; reboot -c k8s-admin"
Validate that the initiatorname file has persisted:
pdsh -g category=k8s-admin "cat /etc/iscsi/initiatorname.iscsi" | dshbak -c
Example output:
node001: InitiatorName=iqn.2005-03.org.open-iscsi:229655aa846
node002: InitiatorName=iqn.2005-03.org.open-iscsi:6ffb48fb233d
node003: InitiatorName=iqn.2005-03.org.open-iscsi:581ac1b2a151
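Optionally, confirm that the iSCSI services enabled earlier are active after the reboot (a quick check using the same pdsh pattern):

pdsh -g category=k8s-admin "systemctl is-active iscsid open-iscsi" | dshbak -c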
Download the NMX-M install package#
Downloading NMX-M#
Go to the NVIDIA Licensing Portal (NLP) and log in using your credentials.
Click on Software Downloads, filter the product family to NMX-M, and select the relevant version of the software. Click on Download.
Save the file on your local drive.
Click Close.
Copy the .tar.gz file to the BCM head node:
rsync -azP NMX-MGR-85.1.2000.tar.gz root@bcm11-head-01:/root
Uncompress the package:
tar xvzf NMX-MGR-85.1.2000.tar.gz
cd NMX-M
find . -type f -name "*gz" -exec tar -xvzf {} \;
Install Zarf components#
NMX-M is packaged using Zarf, which allows for air-gapped installation on Kubernetes. This works in part by deploying a local Docker registry and redirecting subsequent pulls of public container images back to that registry. This behavior isn’t desired for the non-NMX-M components in the shared Kubernetes cluster, so existing namespaces are excluded below.
Exclude existing namespaces from Zarf#
for i in $(kubectl get ns -o custom-columns=NAME:.metadata.name --no-headers | grep -Ev "infra|longhorn|kafka|nmx|zarf"); do kubectl label ns $i zarf.dev/agent=ignore; done
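To confirm the label was applied, list the namespaces with the label shown as a column (a quick check):

kubectl get ns -L zarf.dev/agent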
Set up Zarf#
mv ./Installation/prerequisites/zarf /usr/local/bin/
chmod +x /usr/local/bin/zarf
cd Installation/Zarf_init
zarf init -a amd64 --storage-class local-path --confirm
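Once the init completes, the Zarf components should be running in the zarf namespace; a quick check:

kubectl get pods -n zarf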
Deploy local registry daemonset#
kubectl apply -f registry-image-spread-daemonset.yaml
zarf tools wait-for ds registry-image-spread '{.status.numberReady}'=3 \
-n zarf \
--timeout=300s
Pin registry containers on Kubernetes nodes#
pdsh -g category=k8s-admin 'PATH="/cm/local/apps/cmd/bin/:$PATH"; source /etc/profile; module load containerd; ctr --namespace k8s.io images label 127.0.0.1:31999/library/registry:2.8.3 io.cri-containerd.pinned=pinned' | dshbak -c
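To verify the pin label was applied on each node, the image list can be inspected the same way; the LABELS column should include io.cri-containerd.pinned=pinned:

pdsh -g category=k8s-admin 'PATH="/cm/local/apps/cmd/bin/:$PATH"; source /etc/profile; module load containerd; ctr --namespace k8s.io images ls | grep 127.0.0.1:31999/library/registry' | dshbak -c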
Remove local registry daemonset#
kubectl delete -f registry-image-spread-daemonset.yaml
Install Longhorn#
Longhorn is a distributed, shared block storage solution that provides persistent storage for Kubernetes. The LONGHORN_DATA_PATH variable defines which local storage path Longhorn will use. On control nodes, this should be set to /local, as this is where a multi-terabyte software RAID is configured.
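Before deploying, it may be worth confirming that the chosen data path exists and has capacity on each Kubernetes node (a quick check, assuming the /local path described above):

pdsh -g category=k8s-admin "df -h /local" | dshbak -c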
Deploy Longhorn via Zarf#
cd ../..
zarf package deploy Infra/zarf-package-infra-amd64-*.tar.zst \
--components="longhorn" \
--confirm \
--set LONGHORN_DATA_PATH=/local
Wait for this to complete:
zarf tools wait-for ds longhorn-csi-plugin '{.status.numberReady}'=3 \
-n longhorn-system \
--timeout=300s
Verify instance managers are all up:
while true; do
count=$(kubectl -n longhorn-system get pods --field-selector=status.phase=Running -l longhorn.io/component=instance-manager --no-headers | wc -l)
if [ "$count" -eq 3 ]; then
echo "Found 3 running instance-manager pods"
break
else
echo "Currently $count running pods, waiting for 3..."
sleep 5
fi
done
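Optionally, Longhorn's view of each node and its disks can be checked through its custom resources (a quick check, assuming a standard Longhorn install):

kubectl -n longhorn-system get nodes.longhorn.io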
Configure Longhorn to not be a default storageclass#
After Longhorn is installed as part of the NMX-M package, it is set as a default storageclass. This is not desirable, as Longhorn is only to be used by the NMX-M components.
kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-path (default) cluster.local/local-path-provisioner Delete WaitForFirstConsumer true 25h
longhorn (default) driver.longhorn.io Delete Immediate true 24h
longhorn-no-replication driver.longhorn.io Delete Immediate true 24h
longhorn-static driver.longhorn.io Delete Immediate true 24h
shoreline-local-path-sc cluster.local/shoreline-local-path-provisioner Delete WaitForFirstConsumer true 18h
We can run a patch to correct this behavior.
kubectl patch storageclass longhorn -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-path (default) cluster.local/local-path-provisioner Delete WaitForFirstConsumer true 26h
longhorn driver.longhorn.io Delete Immediate true 25h
longhorn-no-replication driver.longhorn.io Delete Immediate true 25h
longhorn-static driver.longhorn.io Delete Immediate true 25h
shoreline-local-path-sc cluster.local/shoreline-local-path-provisioner Delete WaitForFirstConsumer true 19h
NMX-M configuration#
Installing Certs & Configuring NMX-C & NMX-T on NV Link switch#
We first need to identify which NV Link switches have been selected as leaders. We can do this through cmsh.
List the leaders (these are shown in the “Active” column):
cmsh -c "device; nvfabricinfo"
Example output:
Domain Active Switches
-------- --------------- -------------------------------
A05 a05-p1-nvsw-01 a05-p1-nvsw-01..a05-p1-nvsw-09
A06 a06-p1-nvsw-01 a06-p1-nvsw-01..a06-p1-nvsw-09
A07 a07-p1-nvsw-01 a07-p1-nvsw-01..a07-p1-nvsw-09
B05 b05-p1-nvsw-01 b05-p1-nvsw-01..b05-p1-nvsw-09
B06 b06-p1-nvsw-01 b06-p1-nvsw-01..b06-p1-nvsw-09
B07 b07-p1-nvsw-01 b07-p1-nvsw-01..b07-p1-nvsw-09
B08 b08-p1-nvsw-01 b08-p1-nvsw-01..b08-p1-nvsw-09
The installation package includes a script that generates certificates for mTLS authentication between the NV Link switch leader and the NMX-M deployment.
Generate certificates:
cd ../Ansible/tools
./create-certificate.sh a06-p1-nvsw-01
Example output:
certificate.cert-manager.io/a06-p1-nvsw-01-certificate created
Certificate is ready after 10 seconds.
Extracting secret data to local files...
Files created:
-rw-r--r-- 1 root root 1094 Jul 10 14:33 a06-p1-nvsw-01-ca.crt
-rw-r--r-- 1 root root 1432 Jul 10 14:33 a06-p1-nvsw-01-tls.crt
-rw-r--r-- 1 root root 3247 Jul 10 14:33 a06-p1-nvsw-01-tls.key
-rw------- 1 root root 3907 Jul 10 14:33 a06-p1-nvsw-01-tls.p12
Copy the generated certificates to the NV Link switch leader:
scp a06-p1-nvsw-01-ca.crt a06-p1-nvsw-01-tls.p12 admin@a06-p1-nvsw-01:/home/admin
SSH to the NV Link switch leader and enable the NMX-Controller (NMX-C) and NMX-Telemetry (NMX-T) apps:
ssh admin@a06-p1-nvsw-01 "nv action update cluster apps nmx-controller manager enabled; nv action update cluster apps nmx-telemetry manager enabled"
Example output:
NVOS switch
admin@a06-p1-nvsw-01's password:
Action executing ...
Cluster Manager Port updated successfully
Action succeeded
Action executing ...
Cluster Manager Port updated successfully
Action succeeded
Install the previously generated certificates by importing them on the NV Link switch leader:
ssh admin@a06-p1-nvsw-01 "nv action import system security certificate nmxm-cert uri-bundle file:///home/admin/a06-p1-nvsw-01-tls.p12; nv action import system security ca-certificate manager-ca-cert uri file:///home/admin/a06-p1-nvsw-01-ca.crt"
Example output:
NVOS switch
admin@a06-p1-nvsw-01's password:
Action executing ...
Succeeded in importing X.509 entity certificate `nmxm-cert`.
NOTE: Certificate `nmxm-cert` is self-signed.
Action succeeded
Action executing ...
Succeeded in importing X.509 CA certificate `manager-ca-cert`.
Action succeeded
Configure the imported certificates on both services and enable mTLS:
ssh admin@a06-p1-nvsw-01 "nv action update cluster apps nmx-controller manager ca-certificate manager-ca-cert; nv action update cluster apps nmx-telemetry manager ca-certificate manager-ca-cert; nv action update cluster apps nmx-controller manager certificate nmxm-cert; nv action update cluster apps nmx-telemetry manager certificate nmxm-cert; nv action update cluster apps nmx-controller manager encryption mtls; nv action update cluster apps nmx-telemetry manager encryption mtls"
Example output:
NVOS switch
admin@a06-p1-nvsw-01's password:
Action executing ...
Cluster Manager CA Cert updated successfully
Action succeeded
Action executing ...
Cluster Manager CA Cert updated successfully
Action succeeded
Action executing ...
Cluster Manager Cert updated successfully
Action succeeded
Action executing ...
Cluster Manager Cert updated successfully
Action succeeded
Action executing ...
Cluster Manager Encryption updated successfully
Action succeeded
Action executing ...
Cluster Manager Encryption updated successfully
Action succeeded
Restart NMX-C and NMX-T services:
ssh admin@a06-p1-nvsw-01 "nv action stop cluster apps nmx-telemetry; nv action start cluster apps nmx-telemetry; nv action stop cluster apps nmx-controller; nv action start cluster apps nmx-controller"
Example output:
NVOS switch
admin@a06-p1-nvsw-01's password:
Action executing ...
Running app stop command: nmx-telemetry
Action executing ...
App has been successfully stopped
Action succeeded
Action executing ...
Running app start command: nmx-telemetry
Action executing ...
App has been successfully started
Action succeeded
Action executing ...
Running app stop command: nmx-controller
Action executing ...
App has been successfully stopped
Action succeeded
Action executing ...
Running app start command: nmx-controller
Action executing ...
App has been successfully started
Action succeeded
NMX-M’s primary method of interaction is through its REST API. We’ll use curl for the next steps, which add the NV Link switch services.
With the Kubernetes cluster installed by BCM, the default behavior is for traffic to be directed to the head nodes. Nginx running on the head nodes proxies requests to the Kubernetes nodes running ingress-nginx. This is why the examples use curl against https://master/nmx.
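Before adding services, a quick way to confirm that the API is reachable through this path is to list the registered services (this assumes the /v1/services endpoint supports a GET without an ID and uses the same read/write credentials as the examples below; an empty list is expected at this point):

curl -sk 'https://master/nmx/v1/services' -u rw-user:rw-password --header 'Content-Type: application/json' | jq .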
In the payload we POST, we’ll need to include the IP address of the leader switch we’re configuring. This can be found via cmsh; for example, for the leader switch in rack A06:
cmsh -c "device; use a06-p1-nvsw-01; get ip"
7.241.3.31
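Since each rack’s leader must be added separately, the IP addresses for all leaders can be gathered in one pass; a sketch using the example leader names from the nvfabricinfo output above:

for sw in a05-p1-nvsw-01 a06-p1-nvsw-01 a07-p1-nvsw-01 \
          b05-p1-nvsw-01 b06-p1-nvsw-01 b07-p1-nvsw-01 b08-p1-nvsw-01; do
    echo -n "$sw: "
    cmsh -c "device; use $sw; get ip"
done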
Add the NV Link switch leader’s NMX-C to NMX-M by making a POST request to the /v1/services endpoint:
curl -sk -X POST 'https://master/nmx/v1/services' -u rw-user:rw-password --header 'Content-Type: application/json' \
--data '{
"Name": "a06-p1-nvsw-01",
"Description": "a06-p1-nvsw-01",
"ServiceType": "CONTROLLER",
"ServiceConnectionInformation": {
"Address": "7.241.3.31",
"PortNumber": 9370
}
}'
Example response:
{
"Address": "7.241.3.31",
"Description": "a06-p1-nvsw-01",
"ID": "68703777cf6f5852a7316906",
"Name": "a06-p1-nvsw-01",
"PortNumber": 9370,
"ServiceType": "CONTROLLER",
"Status": "IN_PROGRESS",
"StatusInfo": "",
"Version": ""
}
Verify that this was successful by making a GET request to the /v1/services endpoint, using the ID from the prior response:
curl -sk -X GET 'https://master/nmx/v1/services/68703777cf6f5852a7316906' -u rw-user:rw-password --header 'Content-Type: application/json' | jq .
Example response:
{
"Address": "7.241.3.31",
"ApplicationUUID": "c9bd7a13-ccb2-4a90-95ff-9dcf5e9038bc",
"ClusterDomainUUID": "c2b42a4c-e407-4f98-af6a-8c96823a807e",
"Description": "a06-p1-nvsw-01",
"ID": "68703777cf6f5852a7316906",
"Name": "a06-p1-nvsw-01",
"PortNumber": 9370,
"RegisteredAt": "2025-07-10T21:58:15.908Z",
"ServiceType": "CONTROLLER",
"Status": "UP",
"StatusInfo": "",
"UpSince": "2025-07-10T21:58:15.908Z",
"Version": "1.2.0_2025-06-07_10-33"
}
Add the NV Link switch leader’s NMX-T to NMX-M by making a POST request to the /v1/services endpoint:
curl -sk -X POST 'https://master/nmx/v1/services' -u rw-user:rw-password --header 'Content-Type: application/json' \
--data '{
"Name": "a06-p1-nvsw-01",
"Description": "a06-p1-nvsw-01",
"ServiceType": "TELEMETRY",
"ServiceConnectionInformation": {
"Address": "7.241.3.31",
"PortNumber": 9351
}
}'
Example response:
{
"Address": "7.241.3.31",
"Description": "a06-p1-nvsw-01",
"ID": "6870386b8c7b451eeafddfda",
"Name": "a06-p1-nvsw-01",
"PortNumber": 9351,
"ServiceType": "TELEMETRY",
"Status": "IN_PROGRESS",
"StatusInfo": "",
"Version": ""
}
Verify that this was successful by making a GET request to the /v1/services endpoint, using the ID from the prior response:
curl -sk -X GET 'https://master/nmx/v1/services/6870386b8c7b451eeafddfda' -u rw-user:rw-password --header 'Content-Type: application/json' | jq .
Example response:
{
"Address": "7.241.3.31",
"ApplicationUUID": "6c164aa5-2aa7-4789-9587-31b79dc62897",
"ClusterDomainUUID": "c2b42a4c-e407-4f98-af6a-8c96823a807e",
"Description": "a06-p1-nvsw-01",
"ID": "6870386b8c7b451eeafddfda",
"Name": "a06-p1-nvsw-01",
"PortNumber": 9351,
"RegisteredAt": "2025-07-10T22:02:19.865Z",
"ServiceType": "TELEMETRY",
"Status": "UP",
"StatusInfo": "",
"UpSince": "2025-07-10T22:02:19.865Z",
"Version": "1.1.3"
}
NMX-M provides a Prometheus exporter interface for collected metrics. We can use this to validate that NMX-M is working with the newly added NV Link switch leader:
curl -sk "https://master/nmx/v1/metrics?id=$(curl -sk -X GET 'https://master /nmx/v1/services/6870386b8c7b451eeafddfda' -u rw-user:rw-password --header 'Content-Type: application/json' | jq -r '.ClusterDomainUUID')" \
-u rw-user:rw-password \
| head -n 20
Example output:
diag_supply_voltage{domain_id="c2b42a4c-e407-4f98-af6a-8c96823a807e",node_guid="0x330aa4e54b8d4c2d",Port="11"} 0 1752185097534
diag_supply_voltage{Port="11",domain_id="c2b42a4c-e407-4f98-af6a-8c96823a807e",node_guid="0x2c5eab0300ca6700"} 0 1752185097534
diag_supply_voltage{domain_id="c2b42a4c-e407-4f98-af6a-8c96823a807e",node_guid="0x9f6028016bbe9123",Port="11"} 0 1752185097534
diag_supply_voltage{node_guid="0x2c5eab0300ca6720",Port="29",domain_id="c2b42a4c-e407-4f98-af6a-8c96823a807e"} 0 1752185097534
diag_supply_voltage{Port="5",domain_id="c2b42a4c-e407-4f98-af6a-8c96823a807e",node_guid="0x2bad4538ad47b824"} 0 1752185097534
Applying the Permanent NMX-M License

Steps to apply the license (a sketch of the first two steps follows this list):

1. Copy the license file to /opt/nvidia/nmx/licenses.
2. Run the license configuration script: /opt/nvidia/nmx/scripts/license-config.sh.
3. Select Option 1 to apply a new license.
4. Confirm the license details when prompted.
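A minimal sketch of the first two steps, assuming the license file downloaded earlier is named nmxm.lic (a hypothetical name) and has already been transferred to the node that holds the /opt/nvidia/nmx directory:

cp nmxm.lic /opt/nvidia/nmx/licenses/        # step 1: place the license file
/opt/nvidia/nmx/scripts/license-config.sh    # step 2: interactive script; choose Option 1, then confirm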