Install and Configure NMX Manager (NMX-M)#

NMX-M provides a single interface for management and telemetry collection of NVLink switches. NMX-M is deployed on Kubernetes, along with the other components that make up Mission Control.

NMX-M Kubernetes Setup#

Prerequisites#

NMX-M Permanent License Generation and Application Guide

Generating License File

Prerequisites

  • Prepare a list of servers with the MAC address of each server on which you plan to install the NMX-M software (see the example command after this list)

  • Access to the NVIDIA Licensing Portal (NLP) with valid credentials
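
As a hedged example, the MAC addresses can be collected in one pass with pdsh, assuming the NMX-M hosts are the nodes in the k8s-admin category and that eno1 is the license-registered interface (adjust both for your environment):

pdsh -g category=k8s-admin "cat /sys/class/net/eno1/address" | dshbak -c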

Steps to Generate License File

  1. Access the NVIDIA Licensing Portal
     • Go to the NVIDIA Licensing Portal (NLP)
     • Log in using your credentials

  2. Navigate to Network Entitlements
     • Click on the Network Entitlements tab
     • You’ll see a list with the serial licenses of all your software products, along with license information and status

  3. Select and Activate License
     • Select the license you want to activate
     • Click on the “Actions” button

  4. Configure MAC Addresses
     • In the MAC Address field, enter the MAC address of the delegated license-registered host
     • If applicable, in the HA MAC Address field, enter your High Availability (HA) server MAC address
     • Note: If you have more than one NIC installed on a UFM Server, use any of the MAC addresses

  5. Generate and Download License
     • Click on Generate License File to create the license key file for the software
     • Click on Download License File and save it on your local computer

Important Notes about License Regeneration

  • If you replace your NIC or server, repeat the process of generating the license to set new MAC addresses
  • You can only regenerate a license two times
  • To regenerate the license after that, contact NVIDIA Sales Administration at enterprisesupport@nvidia.com

NMX-M deployment in a shared Kubernetes environment requires shared storage for persistent volumes (PVs) and persistent volume claims (PVCs). This is accomplished using Longhorn, which provides distributed block storage. To enable Longhorn, the iSCSI client must be installed on the nodes used for Kubernetes.

Installation on software image:

cm-chroot-sw-img /cm/images/k8s-admin-image

root@k8s-admin-image:/# apt-get update; apt-get install -y open-iscsi
root@k8s-admin-image:/# systemctl enable iscsid open-iscsi
root@k8s-admin-image:/# echo "fs.inotify.max_user_instances = 1024" >> /etc/sysctl.d/60-local.conf
exit

Push changes to Kubernetes nodes:

cmsh -c "device; foreach -c k8s-admin (imageupdate -w)"

Reload and configure iSCSI service:

pdsh -g category=k8s-admin "systemctl daemon-reload; /sbin/iscsi-iname -p \"InitiatorName=iqn.2005-03.org.open-iscsi\" > /etc/iscsi/initiatorname.iscsi; chmod 0600 /etc/iscsi/initiatorname.iscsi"

Create a configuration overlay to persist:

cmsh -c "configurationoverlay; add open-iscsi; set categories k8s-admin; roles; assign generic::open-iscsi; set services open-iscsi; excludelistsnippets; add initiatorname.iscsi; append excludelist /etc/iscsi/initiatorname.iscsi; set modefull yes; set modegrab yes; set modegrabnew yes; commit"

Validate configuration overlay:

cmsh -c "configurationoverlay; use open-iscsi; use open-iscsi; roles; use generic::open-iscsi; excludelistsnippets; list"

Name (key)           Lines            Disabled     Mode sync      Mode full      Mode update    Mode grab      Mode grab new
-------------------- ---------------- ------------ -------------- -------------- -------------- -------------- --------------
initiatorname.iscsi  1                no           yes            yes            yes            yes            yes

Reboot nodes:

cmsh -c "device; reboot -c k8s-admin"

Validate that initiatorname file has persisted:

pdsh -g category=k8s-admin "cat /etc/iscsi/initiatorname.iscsi" | dshbak -c

Example output:

node001: InitiatorName=iqn.2005-03.org.open-iscsi:229655aa846
node002: InitiatorName=iqn.2005-03.org.open-iscsi:6ffb48fb233d
node003: InitiatorName=iqn.2005-03.org.open-iscsi:581ac1b2a151

Download the NMX-M install package#

Downloading NMX-M#

  1. Go to the NVIDIA Licensing Portal (NLP) and log in using your credentials.

  2. Click on Software Downloads, filter the product family to NMX-M, and select the relevant version of the software. Click on Download.

  3. Save the file on your local drive.

  4. Click Close.

Copy the .tar.gz file to the BCM head node:

rsync -azP NMX-MGR-85.1.2000.tar.gz root@bcm11-head-01:/root

Uncompress the package:

tar xvzf NMX-MGR-85.1.2000.tar.gz
cd NMX-M
find . -type f -name "*gz" -exec tar -xvzf {} \;
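
The extracted tree should contain the Installation, Infra, and Services directories referenced by the steps that follow; a quick listing confirms the layout:

ls -d Installation Infra Services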

Install Zarf components#

NMX-M is packaged using Zarf, which allows for air-gapped installation on Kubernetes. This works in part by running a local Docker registry and redirecting subsequent public container image pulls back to it. This behavior isn’t desired for non-NMX-M components in the shared Kubernetes cluster, so those namespaces are excluded from Zarf below.

Exclude existing namespaces from Zarf#

for i in $(kubectl get ns -o custom-columns=NAME:.metadata.name --no-headers | grep -Ev "infra|longhorn|kafka|nmx|zarf"); do kubectl label ns $i zarf.dev/agent=ignore; done
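
To confirm that the label was applied only to the intended namespaces, list the namespaces with the zarf.dev/agent label column:

kubectl get ns -L zarf.dev/agent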

Setup Zarf#

mv ./Installation/prerequisites/zarf /usr/local/bin/
chmod +x /usr/local/bin/zarf
cd Installation/Zarf_init
zarf init -a amd64 --storage-class local-path --confirm
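
Before continuing, it can be useful to verify that the Zarf bootstrap components (including its internal registry) are running in the zarf namespace:

kubectl get pods -n zarf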

Deploy local registry daemonset#

kubectl apply -f registry-image-spread-daemonset.yaml
zarf tools wait-for ds registry-image-spread '{.status.numberReady}'=3 \
-n zarf \
--timeout=300s

Pin registry containers on Kubernetes nodes#

pdsh -g category=k8s-admin 'PATH="/cm/local/apps/cmd/bin/:$PATH"; source /etc/profile; module load containerd; ctr --namespace k8s.io images label 127.0.0.1:31999/library/registry:2.8.3 io.cri-containerd.pinned=pinned' | dshbak -c
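
If desired, confirm the pin by listing the registry image and checking for the io.cri-containerd.pinned label (same environment setup as above):

pdsh -g category=k8s-admin 'PATH="/cm/local/apps/cmd/bin/:$PATH"; source /etc/profile; module load containerd; ctr --namespace k8s.io images ls | grep "127.0.0.1:31999/library/registry"' | dshbak -c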

Remove local registry daemonset#

kubectl delete -f registry-image-spread-daemonset.yaml

Install Longhorn#

Longhorn is a distributed, shared block storage solution that provides persistent storage for Kubernetes. The LONGHORN_DATA_PATH variable defines which local storage path Longhorn uses. On control nodes, this should be set to /local, as this is where a multi-terabyte software RAID is configured.
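
Before deploying, it is worth confirming that /local exists on each node and has sufficient free space for Longhorn volumes:

pdsh -g category=k8s-admin "df -h /local" | dshbak -c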

Deploy Longhorn via Zarf#

cd ../..
zarf package deploy Infra/zarf-package-infra-amd64-*.tar.zst \
--components="longhorn" \
--confirm \
--set LONGHORN_DATA_PATH=/local

Wait for this to complete:

zarf tools wait-for ds longhorn-csi-plugin '{.status.numberReady}'=3 \
-n longhorn-system \
--timeout=300s

Verify instance managers are all up:

while true; do
  count=$(kubectl -n longhorn-system get pods --field-selector=status.phase=Running -l longhorn.io/component=instance-manager --no-headers | wc -l)
  if [ "$count" -eq 3 ]; then
    echo "Found 3 running instance-manager pods"
    break
  else
    echo "Currently $count running pods, waiting for 3..."
    sleep 5
  fi
done

Remove Zarf from local storage and redeploy on Longhorn shared storage#

cd Installation/Zarf_init
zarf destroy --confirm --no-progress --no-color
zarf init --no-progress --no-color -a amd64 --storage-class longhorn --set REGISTRY_PVC_ENABLED=true --confirm
cd ../..
zarf package deploy Infra/zarf-package-infra-amd64-*.tar.zst \
--components="*" \
--set LONGHORN_DATA_PATH=/local \
--set ALERTMANAGER_WEBHOOK_URL=http://kube-prometheus-stack-alertmanager.prometheus:9093 \
--set RW_PASSWORD=rw-password \
--set RO_PASSWORD=ro-password \
--confirm

With Zarf and Longhorn installed, we can deploy the NMX-M Zarf packages.

Run the zarf package deploy to install NMX-M:

zarf package deploy ./Services/zarf-package-services-amd64-*.tar.zst --components="*" --confirm
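
Once the package deploy finishes, verify that the NMX-M service pods reach the Running state. Assuming the NMX-M services land in namespaces matching the nmx and kafka patterns excluded earlier, a quick check is:

kubectl get pods -A | grep -Ei "nmx|kafka"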

Configure Longhorn to not be a default storageclass#

After installation of NMX-M, Longhorn is set as a default storageclass. This is not desirable, because Longhorn should only be used by the NMX-M components.

kubectl get storageclass
NAME                      PROVISIONER                                      RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path (default)      cluster.local/local-path-provisioner             Delete          WaitForFirstConsumer   true                   25h
longhorn (default)        driver.longhorn.io                               Delete          Immediate              true                   24h
longhorn-no-replication   driver.longhorn.io                               Delete          Immediate              true                   24h
longhorn-static           driver.longhorn.io                               Delete          Immediate              true                   24h
shoreline-local-path-sc   cluster.local/shoreline-local-path-provisioner   Delete          WaitForFirstConsumer   true                   18h

We can run a patch to correct this behavior.

kubectl patch storageclass longhorn -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'


kubectl get storageclass
NAME                      PROVISIONER                                      RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path (default)      cluster.local/local-path-provisioner             Delete          WaitForFirstConsumer   true                   26h
longhorn                  driver.longhorn.io                               Delete          Immediate              true                   25h
longhorn-no-replication   driver.longhorn.io                               Delete          Immediate              true                   25h
longhorn-static           driver.longhorn.io                               Delete          Immediate              true                   25h
shoreline-local-path-sc   cluster.local/shoreline-local-path-provisioner   Delete          WaitForFirstConsumer   true                   19h

NMX-M configuration#