Install and Configure NMX Manager (NMX-M)#
NMX-M provides a single interface for management and telemetry collection of NV Link switches. NMX-M is deployed on Kubernetes, along with the other components that make up Mission Control.
NMX-M Kubernetes Setup#
NMX-M Permanent License Generation and Application Guide#
Generating a License File#
Before you generate the license file, you need to do the following:
Prepare a list of the servers on which you plan to install the NMX-M software, along with the MAC address of each server (see the sketch after this list).
Ensure you have access to the NVIDIA Licensing Portal (NLP) with valid credentials.
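If the servers are already running Linux, one way to collect the MAC addresses is to list the network interfaces on each host. The following is a minimal sketch; which interface's MAC address you register depends on which NIC is used for licensing in your environment:
# List every network interface on this host together with its MAC address
for dev in /sys/class/net/*; do
    printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/address")"
done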
To generate the license file, follow the steps below:
Access the NVIDIA Licensing Portal
Go to the NVIDIA Licensing Portal (NLP).
Log in using your credentials.
Navigate to Network Entitlements
Click on the Network Entitlements tab.
You’ll see a list of the serial licenses for all your software products, along with license information and status.
Select and Activate License
Select the license you want to activate.
Click on the “Actions” button.
Configure MAC Addresses
In the MAC Address field, enter the MAC address of the delegated license-registered host.
If applicable, in the HA MAC Address field, enter your High Availability (HA) server MAC address.
Note: If the server has more than one NIC installed, you can use any of the MAC addresses.
Generate and Download License
Click on Generate License File to create the license key file for the software.
Click on Download License File and save it on your local computer.
Important Notes about License Regeneration#
When you regenerate a license, you need to keep the following in mind:
If you replace your NIC or server, repeat the process of generating the license to set new MAC addresses.
You can only regenerate a license two times.
To regenerate the license after that, contact NVIDIA Sales Administration at enterprisesupport@nvidia.com.
NMX-M deployment in a shared Kubernetes environment requires shared storage for persistence of PVs and PVCs. This is accomplished using Longhorn, which provides distributed block storage. To enable this, the iSCSI client must be installed on the nodes used for Kubernetes. To install the iSCSI client, follow the steps below:
Enter the software image chroot:
cm-chroot-sw-img /cm/images/k8s-admin-image
root@k8s-admin-image:/ apt-get update; apt-get install -y open-iscsi
root@k8s-admin-image:/ systemctl enable iscsid open-iscsi
root@k8s-admin-image:/ echo "fs.inotify.max_user_instances = 1024" >> /etc/sysctl.d/60-local.conf
exit
Push changes to Kubernetes nodes:
cmsh -c "device; foreach -c k8s-admin (imageupdate -w)"
Reload and configure iSCSI service:
pdsh -g category=k8s-admin "systemctl daemon-reload; /sbin/iscsi-iname -p \"InitiatorName=iqn.2005-03.org.open-iscsi\" > /etc/iscsi/initiatorname.iscsi; chmod 0600 /etc/iscsi/initiatorname.iscsi"
Create a configuration overlay to persist:
cmsh -c "configurationoverlay; add open-iscsi; set categories k8s-admin; roles; assign generic::open-iscsi; set services open-iscsi; excludelistsnippets; add initiatorname.iscsi; append excludelist /etc/iscsi/initiatorname.iscsi; set modefull yes; set modegrab yes; set modegrabnew yes; commit"
Validate configuration overlay:
cmsh -c "configurationoverlay; use open-iscsi; use open-iscsi; roles; use generic::open-iscsi; excludelistsnippets; list" Name (key) Lines Disabled Mode sync Mode full Mode update Mode grab Mode grab new -------------------- ---------------- ------------ -------------- -------------- -------------- -------------- -------------- initiatorname.iscsi 1 no yes yes yes yes yes
Reboot nodes:
cmsh -c "device; reboot -c k8s-admin"
Validate that initiatorname file has persisted:
pdsh -g category=k8s-admin "cat /etc/iscsi/initiatorname.iscsi" | dshbak -c
Example output:
node001: InitiatorName=iqn.2005-03.org.open-iscsi:229655aa846
node002: InitiatorName=iqn.2005-03.org.open-iscsi:6ffb48fb233d
node003: InitiatorName=iqn.2005-03.org.open-iscsi:581ac1b2a151
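Optionally, confirm that the iSCSI services are enabled on every node before proceeding. A quick check using the same pdsh category as above:
pdsh -g category=k8s-admin "systemctl is-enabled iscsid open-iscsi" | dshbak -c
Every node should report enabled for both services.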
Download the NMX-M install package#
The NMX-M install package can be downloaded from the NVIDIA Licensing Portal (NLP).
Downloading NMX-M#
To download the package, follow the steps below:
Go to the NVIDIA Licensing Portal (NLP) and log in using your credentials.
Click on Software Downloads, filter the product family to NMX-M, and select the relevant version of the software. Click on Download.
Save the file on your local drive.
Click Close.
Copy the .tar.gz file to the BCM head node:
rsync -azP NMX-MGR-85.1.2000.tar.gz root@bcm11-head-01:/root
Uncompress the package:
tar xvzf NMX-MGR-85.1.2000.tar.gz
cd NMX-M
Uncompress the nested packages inside the NMX-M directory:
find . -type f -name "*gz" -exec tar -xvzf {} \;
Install Zarf components#
NMX-M is packaged using Zarf, which allows for air-gapped installation on Kubernetes. Zarf works in part by deploying a local container registry and redirecting subsequent public image pull requests back to it. This behavior isn’t desired for the non-NMX-M components in the shared Kubernetes cluster, so existing namespaces are excluded in the next step.
Exclude existing namespaces from Zarf#
To exclude existing namespaces from Zarf, use the following code:
for i in $(kubectl get ns -o custom-columns=NAME:.metadata.name --no-headers | grep -Ev "infra|longhorn|kafka|nmx|zarf"); do kubectl label ns $i zarf.dev/agent=ignore; done
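To confirm which namespaces Zarf will now ignore, the label can be displayed as a column; namespaces carrying zarf.dev/agent=ignore will be skipped by the Zarf agent:
kubectl get ns -L zarf.dev/agent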
Setup Zarf#
To set up Zarf, use the following code:
mv ./Installation/prerequisites/zarf /usr/local/bin/
chmod +x /usr/local/bin/zarf
cd Installation/Zarf_init
zarf init -a amd64 --storage-class local-path --confirm
Deploy local registry daemonset#
To deploy the local registry daemonset, use the following code:
kubectl apply -f registry-image-spread-daemonset.yaml
zarf tools wait-for ds registry-image-spread '{.status.numberReady}'=3 \
-n zarf \
--timeout=300s
Pin registry containers on Kubernetes nodes#
To pin registry containers on Kubernetes nodes, use the following code:
pdsh -g category=k8s-admin 'PATH="/cm/local/apps/cmd/bin/:$PATH"; source /etc/profile; module load containerd; ctr --namespace k8s.io images label 127.0.0.1:31999/library/registry:2.8.3 io.cri-containerd.pinned=pinned' | dshbak -c
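To confirm the pin took effect, you can list the registry image on each node and check for io.cri-containerd.pinned=pinned in the LABELS column. A sketch mirroring the command above:
pdsh -g category=k8s-admin 'PATH="/cm/local/apps/cmd/bin/:$PATH"; source /etc/profile; module load containerd; ctr --namespace k8s.io images ls | grep "library/registry:2.8.3"' | dshbak -c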
Remove local registry daemonset#
To remove the local registry daemonset, use the following code:
kubectl delete -f registry-image-spread-daemonset.yaml
Install Longhorn#
Longhorn is a distributed, shared block storage solution that provides persistent storage for Kubernetes. The LONGHORN_DATA_PATH variable defines the local path Longhorn uses for its storage. On control nodes, this should be set to /local, as that is where a multi-terabyte software RAID is configured.
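Before deploying, it is worth confirming that /local exists and has the expected capacity on each Kubernetes node. A quick check, using the same pdsh category as earlier:
pdsh -g category=k8s-admin "df -h /local" | dshbak -c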
Deploy Longhorn using Zarf#
To deploy Longhorn using Zarf, use the following code:
cd ../..
zarf package deploy Infra/zarf-package-infra-amd64-*.tar.zst \
--components="longhorn" \
--confirm \
--set LONGHORN_DATA_PATH=/local
Wait for this to complete:
zarf tools wait-for ds longhorn-csi-plugin '{.status.numberReady}'=3 \
-n longhorn-system \
--timeout=300s
Verify instance managers are all up:
while true; do
count=$(kubectl -n longhorn-system get pods --field-selector=status.phase=Running -l longhorn.io/component=instance-manager --no-headers | wc -l)
if [ "$count" -eq 3 ]; then
echo "Found 3 running instance-manager pods"
break
else
echo "Currently $count running pods, waiting for 3..."
sleep 5
fi
done
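Optionally, confirm that Longhorn has registered each Kubernetes node and reports it as ready and schedulable. A minimal check against the Longhorn node resource installed with the package:
kubectl -n longhorn-system get nodes.longhorn.io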
Configure Longhorn to not be a default storageclass#
After Longhorn is installed, the default behavior is for it to be set as a default storageclass. We only want to use Longhorn for the NMX-M components, not as the default storage class for other workloads.
To configure Longhorn to not be a default storageclass, use the following steps:
Get the current storageclasses:
kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-path (default) cluster.local/local-path-provisioner Delete WaitForFirstConsumer true 25h
longhorn (default) driver.longhorn.io Delete Immediate true 24h
longhorn-no-replication driver.longhorn.io Delete Immediate true 24h
longhorn-static driver.longhorn.io Delete Immediate true 24h
shoreline-local-path-sc cluster.local/shoreline-local-path-provisioner Delete WaitForFirstConsumer true 18h
Run the following code to patch the Longhorn storageclass:
kubectl patch storageclass longhorn -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'
kubectl get storageclass
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
local-path (default) cluster.local/local-path-provisioner Delete WaitForFirstConsumer true 26h
longhorn driver.longhorn.io Delete Immediate true 25h
longhorn-no-replication driver.longhorn.io Delete Immediate true 25h
longhorn-static driver.longhorn.io Delete Immediate true 25h
shoreline-local-path-sc cluster.local/shoreline-local-path-provisioner Delete WaitForFirstConsumer true 19h
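For a scripted check, the default-class annotation can also be read directly; after the patch this should print false:
kubectl get storageclass longhorn -o jsonpath='{.metadata.annotations.storageclass\.kubernetes\.io/is-default-class}'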
NMX-M configuration#
This section describes how to configure NMX-M.
Installing Certificates and Configuring NMX-C and NMX-T on NV Link Switch#
We’ll need to first validate which NV Link switches have been selected as leaders. We can do this through cmsh.
To list the active leaders (these are shown in the Active column), use the following code:
cmsh -c "device; nvfabricinfo"
Example output:
Domain Active Switches
-------- --------------- -------------------------------
A05 a05-p1-nvsw-01 a05-p1-nvsw-01..a05-p1-nvsw-09
A06 a06-p1-nvsw-01 a06-p1-nvsw-01..a06-p1-nvsw-09
A07 a07-p1-nvsw-01 a07-p1-nvsw-01..a07-p1-nvsw-09
B05 b05-p1-nvsw-01 b05-p1-nvsw-01..b05-p1-nvsw-09
B06 b06-p1-nvsw-01 b06-p1-nvsw-01..b06-p1-nvsw-09
B07 b07-p1-nvsw-01 b07-p1-nvsw-01..b07-p1-nvsw-09
B08 b08-p1-nvsw-01 b08-p1-nvsw-01..b08-p1-nvsw-09
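If you want the leader names programmatically, for example to loop over them in later steps, they can be extracted from this output. A minimal sketch that assumes the output format shown above:
# Print the Active column (the leader switch per domain), skipping the header rows
cmsh -c "device; nvfabricinfo" | awk 'NR>2 && NF {print $2}'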
A script is provided as part of the installation package which will generate certificates for mTLS authentication between the NV Link switch leader and the NMX-M deployment.
To generate certificates, use the following code:
cd ../Ansible/tools
./create-certificate.sh a06-p1-nvsw-01
Example output:
certificate.cert-manager.io/a06-p1-nvsw-01-certificate created
Certificate is ready after 10 seconds.
Extracting secret data to local files...
Files created:
-rw-r--r-- 1 root root 1094 Jul 10 14:33 a06-p1-nvsw-01-ca.crt
-rw-r--r-- 1 root root 1432 Jul 10 14:33 a06-p1-nvsw-01-tls.crt
-rw-r--r-- 1 root root 3247 Jul 10 14:33 a06-p1-nvsw-01-tls.key
-rw------- 1 root root 3907 Jul 10 14:33 a06-p1-nvsw-01-tls.p12
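The same script can be run once per leader when you have multiple NV Link domains. A minimal sketch, using leader names from the nvfabricinfo output as an example; substitute your own list:
for leader in a05-p1-nvsw-01 a06-p1-nvsw-01 a07-p1-nvsw-01; do
    ./create-certificate.sh "$leader"
done
The copy, import, and enable steps that follow must still be repeated for each leader.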
To copy the generated certificates to the NV Link switch leader, use the following code:
scp a06-p1-nvsw-01-ca.crt a06-p1-nvsw-01-tls.p12 admin@a06-p1-nvsw-01:/home/admin
Then, SSH onto the NV Link switch leader and enable the NMX-Controller (NMX-C) and NMX-Telemetry (NMX-T) apps:
ssh admin@a06-p1-nvsw-01 "nv action update cluster apps nmx-controller manager enabled; nv action update cluster apps nmx-telemetry manager enabled"
Example output:
NVOS switch
admin@a06-p1-nvsw-01's password:
Action executing ...
Cluster Manager Port updated successfully
Action succeeded
Action executing ...
Cluster Manager Port updated successfully
Action succeeded
To install previously generated certificates by importing through the NV Link switch leader, use the following code:
ssh admin@a06-p1-nvsw-01 "nv action import system security certificate nmxm-cert uri-bundle file:///home/admin/a06-p1-nvsw-01-tls.p12; nv action import system security ca-certificate manager-ca-cert uri file:///home/admin/a06-p1-nvsw-01-ca.crt"
Example output:
NVOS switch
admin@a06-p1-nvsw-01's password:
Action executing ...
Succeeded in importing X.509 entity certificate `nmxm-cert`.
NOTE: Certificate `nmxm-cert` is self-signed.
Action succeeded
Action executing ...
Succeeded in importing X.509 CA certificate `manager-ca-cert`.
Action succeeded
To enable services and enable mTLS, use the following code:
ssh admin@a06-p1-nvsw-01 "nv action update cluster apps nmx-controller manager ca-certificate manager-ca-cert; nv action update cluster apps nmx-telemetry manager ca-certificate manager-ca-cert; nv action update cluster apps nmx-controller manager certificate nmxm-cert; nv action update cluster apps nmx-telemetry manager certificate nmxm-cert; nv action update cluster apps nmx-controller manager encryption mtls; nv action update cluster apps nmx-telemetry manager encryption mtls"
Example output:
NVOS switch
admin@a06-p1-nvsw-01's password:
Action executing ...
Cluster Manager CA Cert updated successfully
Action succeeded
Action executing ...
Cluster Manager CA Cert updated successfully
Action succeeded
Action executing ...
Cluster Manager Cert updated successfully
Action succeeded
Action executing ...
Cluster Manager Cert updated successfully
Action succeeded
Action executing ...
Cluster Manager Encryption updated successfully
Action succeeded
Action executing ...
Cluster Manager Encryption updated successfully
Action succeeded
To restart NMX-C and NMX-T services, use the following code:
ssh admin@a06-p1-nvsw-01 "nv action stop cluster apps nmx-telemetry; nv action start cluster apps nmx-telemetry; nv action stop cluster apps nmx-controller; nv action start cluster apps nmx-controller"
Example output:
NVOS switch
admin@a06-p1-nvsw-01's password:
Action executing ...
Running app stop command: nmx-telemetry
Action executing ...
App has been successfully stopped
Action succeeded
Action executing ...
Running app start command: nmx-telemetry
Action executing ...
App has been successfully started
Action succeeded
Action executing ...
Running app stop command: nmx-controller
Action executing ...
App has been successfully stopped
Action succeeded
Action executing ...
Running app start command: nmx-controller
Action executing ...
App has been successfully started
Action succeeded
NMX-M’s primary method of interaction is through its REST API. We’ll use curl for the next steps, adding the NV Link switch services.
With the Kubernetes cluster installed by BCM, the default behavior is for ingress traffic to be directed to the head nodes. Nginx running on the head nodes proxies these requests to the Kubernetes nodes running ingress-nginx. This is why the examples below use curl against https://master/nmx.
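Before adding services, you can optionally confirm that the NMX-M API is reachable through this path. A minimal check that should return an HTTP 200, assuming the same rw-user credentials used in the examples below:
curl -sk -o /dev/null -w '%{http_code}\n' -u rw-user:rw-password 'https://master/nmx/v1/services'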
In the payload we POST, we’ll need to include the IP address of the leader switch we’re configuring. This can be found using cmsh; for example, for the leader switch in rack A06:
cmsh -c "device; use a06-p1-nvsw-01; get ip"
7.241.3.31
To add the NV Link switch leader’s NMX-C to NMX-M, make a POST request to the /v1/services endpoint:
curl -sk -X POST 'https://master/nmx/v1/services' -u rw-user:rw-password --header 'Content-Type: application/json' \
--data '{
"Name": "a06-p1-nvsw-01",
"Description": "a06-p1-nvsw-01",
"ServiceType": "CONTROLLER",
"ServiceConnectionInformation": {
"Address": "7.241.3.31",
"PortNumber": 9370
}
}'
Example response:
{
"Address": "7.241.3.31",
"Description": "a06-p1-nvsw-01",
"ID": "68703777cf6f5852a7316906",
"Name": "a06-p1-nvsw-01",
"PortNumber": 9370,
"ServiceType": "CONTROLLER",
"Status": "IN_PROGRESS",
"StatusInfo": "",
"Version": ""
}
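As an alternative to issuing the POST above and copying the returned ID by hand, the request can be piped through jq to capture the ID into a shell variable for the verification request that follows. A sketch, assuming jq is installed as in the later examples:
CONTROLLER_ID=$(curl -sk -X POST 'https://master/nmx/v1/services' -u rw-user:rw-password --header 'Content-Type: application/json' \
    --data '{"Name": "a06-p1-nvsw-01", "Description": "a06-p1-nvsw-01", "ServiceType": "CONTROLLER", "ServiceConnectionInformation": {"Address": "7.241.3.31", "PortNumber": 9370}}' | jq -r '.ID')
echo "$CONTROLLER_ID"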
Verify that this was successful by making a GET request to the /v1/services endpoint, using the ID from the prior response:
curl -sk -X GET 'https://master/nmx/v1/services/68703777cf6f5852a7316906' -u rw-user:rw-password --header 'Content-Type: application/json' | jq .
Example response:
{
"Address": "7.241.3.31",
"ApplicationUUID": "c9bd7a13-ccb2-4a90-95ff-9dcf5e9038bc",
"ClusterDomainUUID": "c2b42a4c-e407-4f98-af6a-8c96823a807e",
"Description": "a06-p1-nvsw-01",
"ID": "68703777cf6f5852a7316906",
"Name": "a06-p1-nvsw-01",
"PortNumber": 9370,
"RegisteredAt": "2025-07-10T21:58:15.908Z",
"ServiceType": "CONTROLLER",
"Status": "UP",
"StatusInfo": "",
"UpSince": "2025-07-10T21:58:15.908Z",
"Version": "1.2.0_2025-06-07_10-33"
}
To add the NV Link switch leader’s NMX-T to NMX-M, make a POST request to the /v1/services endpoint:
curl -sk -X POST 'https://master/nmx/v1/services' -u rw-user:rw-password --header 'Content-Type: application/json' \
--data '{
"Name": "a06-p1-nvsw-01",
"Description": "a06-p1-nvsw-01",
"ServiceType": "TELEMETRY",
"ServiceConnectionInformation": {
"Address": "7.241.3.31",
"PortNumber": 9351
}
}'
Example response:
{
"Address": "7.241.3.31",
"Description": "a06-p1-nvsw-01",
"ID": "6870386b8c7b451eeafddfda",
"Name": "a06-p1-nvsw-01",
"PortNumber": 9351,
"ServiceType": "TELEMETRY",
"Status": "IN_PROGRESS",
"StatusInfo": "",
"Version": ""
}
Verify that this was successful by making a GET request to the /v1/services endpoint, using the ID from the prior response:
curl -sk -X GET 'https://master/nmx/v1/services/6870386b8c7b451eeafddfda' -u rw-user:rw-password --header 'Content-Type: application/json' | jq .
Example response:
{
"Address": "7.241.3.31",
"ApplicationUUID": "6c164aa5-2aa7-4789-9587-31b79dc62897",
"ClusterDomainUUID": "c2b42a4c-e407-4f98-af6a-8c96823a807e",
"Description": "a06-p1-nvsw-01",
"ID": "6870386b8c7b451eeafddfda",
"Name": "a06-p1-nvsw-01",
"PortNumber": 9351,
"RegisteredAt": "2025-07-10T22:02:19.865Z",
"ServiceType": "TELEMETRY",
"Status": "UP",
"StatusInfo": "",
"UpSince": "2025-07-10T22:02:19.865Z",
"Version": "1.1.3"
}
NMX-M provides a Prometheus exporter interface for collected metrics. We can use this to validate that NMX-M is working with the newly added NV Link switch leader. Use the following code:
curl -sk "https://master/nmx/v1/metrics?id=$(curl -sk -X GET 'https://master /nmx/v1/services/6870386b8c7b451eeafddfda' -u rw-user:rw-password --header 'Content-Type: application/json' | jq -r '.ClusterDomainUUID')" \
-u rw-user:rw-password \
| head -n 20
Example output:
diag_supply_voltage{domain_id="c2b42a4c-e407-4f98-af6a-8c96823a807e",node_guid="0x330aa4e54b8d4c2d",Port="11"} 0 1752185097534
diag_supply_voltage{Port="11",domain_id="c2b42a4c-e407-4f98-af6a-8c96823a807e",node_guid="0x2c5eab0300ca6700"} 0 1752185097534
diag_supply_voltage{domain_id="c2b42a4c-e407-4f98-af6a-8c96823a807e",node_guid="0x9f6028016bbe9123",Port="11"} 0 1752185097534
diag_supply_voltage{node_guid="0x2c5eab0300ca6720",Port="29",domain_id="c2b42a4c-e407-4f98-af6a-8c96823a807e"} 0 1752185097534
diag_supply_voltage{Port="5",domain_id="c2b42a4c-e407-4f98-af6a-8c96823a807e",node_guid="0x2bad4538ad47b824"} 0 1752185097534
Apply a Permanent NMX-M License#
To apply a permanent NMX-M license, use the following steps (a combined sketch follows this list):
Copy the license file to the /opt/nvidia/nmx/licenses directory.
Run the License Configuration Script by executing the following code:
/opt/nvidia/nmx/scripts/license-config.sh
Apply the new license by selecting Option 1.
Then confirm the license details when prompted.
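Taken together, the copy and the script invocation might look like the following sketch; the license file name here is hypothetical, and the script prompts interactively for the option and the confirmation:
# Hypothetical license file name; use the file downloaded from the NVIDIA Licensing Portal
mkdir -p /opt/nvidia/nmx/licenses
cp ./nmx-m-license.lic /opt/nvidia/nmx/licenses/
/opt/nvidia/nmx/scripts/license-config.sh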