Install and Configure NMX Manager (NMX-M)#

NMX-M provides a single interface for managing NVLink switches and collecting their telemetry. NMX-M is deployed on Kubernetes alongside the other components that make up Mission Control.

NMX-M Kubernetes Setup#

NMX-M Permanent License Generation and Application Guide#

Generating a License File#

Before you generate the license file, you need to do the following:

  • Prepare a list of the servers on which you plan to install the NMX-M software, including the MAC address of each server (see the sketch after this list for one way to collect the addresses).

  • Ensure you have access to the NVIDIA Licensing Portal (NLP) with valid credentials.
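
If helpful, the MAC addresses can be gathered ahead of time from each candidate host. The following is a minimal sketch, assuming the hosts are reachable over SSH and that eno1 is a placeholder for the interface whose MAC address you will register:

    # Print the MAC address of the chosen interface on each candidate host.
    # node001-node003 and eno1 are placeholders; substitute your own hosts and interface.
    for host in node001 node002 node003; do
        echo -n "$host: "
        ssh "$host" cat /sys/class/net/eno1/address
    done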

To generate the license file, follow the steps below:

  1. Access the NVIDIA Licensing Portal

    • Go to the NVIDIA Licensing Portal (NLP).

    • Log in using your credentials.

  2. Navigate to Network Entitlements

    • Click on the Network Entitlements tab.

    • A list of the serial licenses for all your software products is displayed, along with each license’s information and status.

  3. Select and Activate License

    • Select the license you want to activate.

    • Click on the “Actions” button.

  4. Configure MAC Addresses

    • In the MAC Address field, enter the MAC address of the host on which the license will be registered.

    • If applicable, in the HA MAC Address field, enter your High Availability (HA) server MAC address.

    • Note: If the server has more than one NIC installed, you can use any of its MAC addresses.

  5. Generate and Download License

    • Click on Generate License File to create the license key file for the software.

    • Click on Download License File and save it on your local computer.

Important Notes about License Regeneration#

When you regenerate a license, you need to keep the following in mind:

  • If you replace your NIC or server, repeat the process of generating the license to set new MAC addresses.

  • You can only regenerate a license two times.

  • To regenerate the license after that, contact NVIDIA Sales Administration at enterprisesupport@nvidia.com.

NMX-M deployment in a shared Kubernetes environment requires shared storage for the persistence of PVs and PVCs. This is provided by Longhorn, which supplies distributed block storage. To enable Longhorn, the iSCSI client must be installed on the nodes used for Kubernetes. To install the iSCSI client, follow the steps below:

  1. Enter the software image chroot:

    cm-chroot-sw-img /cm/images/k8s-admin-image
    
    root@k8s-admin-image:/ apt-get update; apt-get install -y open-iscsi
    root@k8s-admin-image:/ systemctl enable iscsid open-iscsi
    root@k8s-admin-image:/ echo "fs.inotify.max_user_instances = 1024" >> /etc/sysctl.d/60-local.conf
    exit
    
  2. Push changes to Kubernetes nodes:

    cmsh -c "device; foreach -c k8s-admin (imageupdate -w)"
    
  3. Reload and configure iSCSI service:

    pdsh -g category=k8s-admin "systemctl daemon-reload; /sbin/iscsi-iname -p \"InitiatorName=iqn.2005-03.org.open-iscsi\" > /etc/iscsi/initiatorname.iscsi; chmod 0600 /etc/iscsi/initiatorname.iscsi"
    
  4. Create a configuration overlay to persist the iSCSI configuration:

    cmsh -c "configurationoverlay; add open-iscsi; set categories k8s-admin; roles; assign generic::open-iscsi; set services open-iscsi; excludelistsnippets; add initiatorname.iscsi; append excludelist /etc/iscsi/initiatorname.iscsi; set modefull yes; set modegrab yes; set modegrabnew yes; commit"
    
  5. Validate configuration overlay:

    cmsh -c "configurationoverlay; use open-iscsi; use open-iscsi; roles; use generic::open-iscsi; excludelistsnippets; list"
    
    Name (key)           Lines            Disabled     Mode sync      Mode full      Mode update    Mode grab      Mode grab new
    -------------------- ---------------- ------------ -------------- -------------- -------------- -------------- --------------
    initiatorname.iscsi  1                no           yes            yes            yes            yes            yes
    
  6. Reboot nodes:

    cmsh -c "device; reboot -c k8s-admin"
    
  7. Validate that the initiatorname file has persisted:

    pdsh -g category=k8s-admin "cat /etc/iscsi/initiatorname.iscsi" | dshbak -c
    

    Example output:

    node001: InitiatorName=iqn.2005-03.org.open-iscsi:229655aa846
    node002: InitiatorName=iqn.2005-03.org.open-iscsi:6ffb48fb233d
    node003: InitiatorName=iqn.2005-03.org.open-iscsi:581ac1b2a151
    

Download the NMX-M install package#

The NMX-M install package can be downloaded from the NVIDIA Licensing Portal (NLP).

Downloading NMX-M#

To download the package, follow the steps below:

  1. Go to the NVIDIA Licensing Portal (NLP) and log in using your credentials.

  2. Click on Software Downloads, filter the product family to NMX-M, and select the relevant version of the software. Click on Download.

  3. Save the file on your local drive.

  4. Click Close.

  5. Copy the .tar.gz file to the BCM head node:

    rsync -azP NMX-MGR-85.1.2000.tar.gz root@bcm11-head-01:/root
    
  6. Uncompress the package:

    tar xvzf NMX-MGR-85.1.2000.tar.gz
    cd NMX-M
    
  7. Uncompress the nested packages inside it:

    find . -type f -name "*gz" -exec tar -xvzf {} \;
    

Install Zarf components#

NMX-M is packaged with Zarf, which allows for air-gapped installation on Kubernetes. Zarf works in part by running a local container registry and redirecting subsequent pulls of public container images to that registry. This behavior is not desired for the non-NMX-M components in the shared Kubernetes cluster, so their namespaces must be excluded.

Exclude existing namespaces from Zarf#

To exclude existing namespaces from Zarf, use the following code:

for ns in $(kubectl get ns -o custom-columns=NAME:.metadata.name --no-headers | grep -Ev "infra|longhorn|kafka|nmx|zarf"); do
  kubectl label ns "$ns" zarf.dev/agent=ignore
done
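
To confirm which namespaces Zarf will now ignore, you can display the label values; namespaces labeled ignore will not have their image pulls redirected:

kubectl get ns -L zarf.dev/agent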

Set up Zarf#

To set up Zarf, use the following code:

mv ./Installation/prerequisites/zarf /usr/local/bin/
chmod +x /usr/local/bin/zarf
cd Installation/Zarf_init
zarf init -a amd64 --storage-class local-path --confirm

Deploy local registry daemonset#

To deploy the local registry daemonset, use the following code:

kubectl apply -f registry-image-spread-daemonset.yaml
zarf tools wait-for ds registry-image-spread '{.status.numberReady}'=3 \
-n zarf \
--timeout=300s

Pin registry containers on Kubernetes nodes#

To pin registry containers on Kubernetes nodes, use the following code:

pdsh -g category=k8s-admin 'PATH="/cm/local/apps/cmd/bin/:$PATH"; source /etc/profile; module load containerd; ctr --namespace k8s.io images label 127.0.0.1:31999/library/registry:2.8.3 io.cri-containerd.pinned=pinned' | dshbak -c
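
To verify that the pin took effect, list the image on each node and check its labels; the LABELS column reported by ctr should include io.cri-containerd.pinned=pinned for the registry image tag used above:

pdsh -g category=k8s-admin 'PATH="/cm/local/apps/cmd/bin/:$PATH"; source /etc/profile; module load containerd; ctr --namespace k8s.io images ls "name==127.0.0.1:31999/library/registry:2.8.3"' | dshbak -c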

Remove local registry daemonset#

To remove the local registry daemonset, use the following code:

kubectl delete -f registry-image-spread-daemonset.yaml

Install Longhorn#

Longhorn is a distributed, shared block storage solution that provides persistent storage for Kubernetes. The LONGHORN_DATA_PATH variable defines the local storage path that Longhorn will use. On control nodes, this should be set to /local, as this is where a multi-terabyte software RAID is configured.
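
Before deploying, it is worth confirming that the path you will pass as LONGHORN_DATA_PATH exists and has ample free space on every Kubernetes node. A quick check, assuming /local as used in the deployment below:

pdsh -g category=k8s-admin "df -h /local" | dshbak -c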

Deploy Longhorn using Zarf#

To deploy Longhorn using Zarf, use the following code:

cd ../..
zarf package deploy Infra/zarf-package-infra-amd64-*.tar.zst \
--components="longhorn" \
--confirm \
--set LONGHORN_DATA_PATH=/local

Wait for this to complete:

zarf tools wait-for ds longhorn-csi-plugin '{.status.numberReady}'=3 \
-n longhorn-system \
--timeout=300s

Verify instance managers are all up:

while true; do
  count=$(kubectl -n longhorn-system get pods --field-selector=status.phase=Running -l longhorn.io/component=instance-manager --no-headers | wc -l)
  if [ "$count" -eq 3 ]; then
    echo "Found 3 running instance-manager pods"
    break
  else
    echo "Currently $count running pods, waiting for 3..."
    sleep 5
  fi
done

Remove Zarf from local storage and redeploy on Longhorn shared storage#

To remove Zarf from local storage and redeploy on Longhorn shared storage, use the following code:

cd Installation/Zarf_init
zarf destroy --confirm --no-progress --no-color
zarf init --no-progress --no-color -a amd64 --storage-class longhorn --set REGISTRY_PVC_ENABLED=true --confirm
cd ../..
zarf package deploy Infra/zarf-package-infra-amd64-*.tar.zst \
--components="*" \
--set LONGHORN_DATA_PATH=/local \
--set ALERTMANAGER_WEBHOOK_URL=http://kube-prometheus-stack-alertmanager.prometheus:9093 \
--set RW_PASSWORD=rw-password \
--set RO_PASSWORD=ro-password \
--confirm

After you have installed Zarf and Longhorn, you can deploy the NMX-M Zarf packages.

Run the zarf package deploy to install NMX-M:

zarf package deploy ./Services/zarf-package-services-amd64-*.tar.zst --components="*" --confirm
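
Once the deployment completes, you can confirm that the NMX-M pods come up. This is a general check and assumes the services run in namespaces whose names contain nmx, as referenced in the earlier namespace-exclusion step:

kubectl get pods -A | grep -i nmx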

Configure Longhorn not to be a default storageclass#

After installation of NMX-M, Longhorn is configured as a default storageclass. Longhorn should be used only for the NMX-M components, not as the default storage class for other workloads in the shared cluster.

To configure Longhorn to not be a default storageclass, use the following steps:

  1. Get the current storageclasses:

kubectl get storageclass
NAME                      PROVISIONER                                      RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path (default)      cluster.local/local-path-provisioner             Delete          WaitForFirstConsumer   true                   25h
longhorn (default)        driver.longhorn.io                               Delete          Immediate              true                   24h
longhorn-no-replication   driver.longhorn.io                               Delete          Immediate              true                   24h
longhorn-static           driver.longhorn.io                               Delete          Immediate              true                   24h
shoreline-local-path-sc   cluster.local/shoreline-local-path-provisioner   Delete          WaitForFirstConsumer   true                   18h
  2. Run the following code to patch the Longhorn storageclass:

kubectl patch storageclass longhorn -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"false"}}}'


kubectl get storageclass
NAME                      PROVISIONER                                      RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path (default)      cluster.local/local-path-provisioner             Delete          WaitForFirstConsumer   true                   26h
longhorn                  driver.longhorn.io                               Delete          Immediate              true                   25h
longhorn-no-replication   driver.longhorn.io                               Delete          Immediate              true                   25h
longhorn-static           driver.longhorn.io                               Delete          Immediate              true                   25h
shoreline-local-path-sc   cluster.local/shoreline-local-path-provisioner   Delete          WaitForFirstConsumer   true                   19h

NMX-M configuration#

This section describes how to configure NMX-M.

Apply a Permanent NMX-M License#

To apply a permanent NMX-M license, use the following steps:

  1. Copy the license file to the /opt/nvidia/nmx/licenses directory (see the example after these steps).

  2. Run the License Configuration Script by executing the following code:

    /opt/nvidia/nmx/scripts/license-config.sh

  3. Apply the new license by selecting Option 1.

  4. Confirm the license details when prompted.
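
For step 1, a typical way to copy the downloaded license file from your workstation is shown below. The file name and destination host are placeholders; substitute the license file you downloaded and the host where NMX-M is installed:

    # Both the license file name and the target host below are examples only.
    rsync -azP ./nmx-m-license.lic root@<nmx-m-host>:/opt/nvidia/nmx/licenses/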