NVSM Multinode

NVSM Multinode monitors multiple compute nodes within a cluster. It provides a single management interface with active monitoring and alerting for all nodes in the cluster, and supports features such as cluster-wide health with drill-down capabilities and dump collection.

NVSM Multinode Overview

Aggregator Node - Acts as the central coordinator, running an MQTT server to communicate with compute nodes. It is deployed as a container on an external management server. On the aggregator, one NVSM instance runs for each connected compute node.

Compute Nodes - NVSM instances running on DGX systems connect to the aggregator node’s MQTT server.

Prerequisites

System Requirements

  • An Ubuntu-based management server, which can be an external server or a DGX system.

  • Network connectivity between the aggregator node and compute nodes.

Software Requirements

  • Ensure Docker (version 20.10.21 or later) and a compatible version of Docker Compose are installed on the system.

  • NVSM version 24.09.05 (the first multinode-capable release) or later must be installed on all compute nodes.

  • Ensure the clocks of the aggregator node and all compute nodes are synchronized via NTP. Because compute nodes and the aggregator node communicate over MQTT, the timestamps on MQTT messages must be in sync.

Note: If compute node time and aggregator time are not synchronized, compute node MQTT messages are treated as stale and NVSM CLI commands fail with ServiceUnavailable. The aggregator and all compute nodes must set up time synchronization using services such as ntpdate or chrony.
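
To verify synchronization on the aggregator and each compute node, standard system tools are sufficient; these are generic checks, not NVSM-specific commands:

$ timedatectl          # look for "System clock synchronized: yes"
$ chronyc tracking     # if chrony is the time service, shows the offset from the NTP source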

Setup Details

A cluster consists of an aggregator node (a container) running on any Ubuntu-based x86 server and multiple compute nodes (DGX systems). The aggregator node must have access to both the host network and the management (BMC) network. Compute nodes connect to the aggregator node over MQTT; the MQTT server is hosted by the aggregator node. NVSM running on compute nodes connects to NVSM running on the aggregator node. The aggregator node also hosts an NVSM exporters dashboard that serves as a single interface for viewing all connected nodes, sensor data, health, and more.

Packages

  • Aggregator - The aggregator image is a Docker container image pre-packaged with NVSM, an MQTT server, and the nvsm-exporter stack.

  • Node Provisioner - The node provisioner image contains an Ansible playbook that provisions the aggregator node and compute nodes in the cluster.

  • Prometheus/Grafana (Optional) - If you already have your own Prometheus/Grafana stack running, deploy only the aggregator container.

Setup Instructions

Ensure the latest NVSM (version 24.09.05, the first multinode-capable release, or later) is installed on all compute nodes. The aggregator container is pre-packaged with the latest NVSM.
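
To check which NVSM version is installed on a compute node, you can query the package manager; this assumes NVSM is installed as the nvsm Debian package, which is the standard packaging on DGX OS:

$ dpkg -s nvsm | grep -i version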

Installing Docker

To install Docker and docker-compose on the aggregator node, run the following commands:

$ sudo apt-get update
$ sudo apt-get install docker.io
$ sudo apt-get install docker-compose
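
After installation, confirm that the installed versions satisfy the minimum requirement (Docker 20.10.21 or later):

$ docker --version
$ docker-compose --version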

Provision Aggregator Node

Download Docker Images

Download all required containers from <TBD>.

Load Docker Images

Run the following commands to load the Docker images:

$ docker load -i nvsm-prometheus_24.09.05.tar.gz
$ docker load -i nvsm-grafana_24.09.05.tar.gz
$ docker load -i nvsm-provision_24.09.05.tar.gz
$ docker load -i nvsm-aggregator_24.09.05.tar.gz

Run the following command to ensure the container images are present:

$ docker images
REPOSITORY                                        TAG        IMAGE ID       CREATED       SIZE
nvcr.io/nvstaging/cloud-native/nvsm-grafana       24.09.05   12e4ebaee709   3 hours ago   541MB
nvcr.io/nvstaging/cloud-native/nvsm-prometheus    24.09.05   68518ef28efc   3 hours ago   376MB
nvcr.io/nvstaging/cloud-native/nvsm-provision     24.09.05   3869ef50cbbd   3 hours ago   518MB
nvcr.io/nvstaging/cloud-native/nvsm-aggregator    24.09.05   eb6068a414be   3 hours ago   1.55GB

Create Inventory YAML File

The inventory file stores information about all compute nodes, such as BMC IPs, host IPs, and passwords. Using the inventory file, the node provisioner container invokes an Ansible playbook to copy certificates from the aggregator node to all compute nodes and restart NVSM on the compute nodes so that they connect to the aggregator node’s NVSM.

Because the inventory.yaml file contains usernames and passwords, it must be readable by the admin user only.

Use the following sample to create an inventory.yaml file:

aggregator:
  hosts:
    # Add aggregator host here
    # Example:
    #
    # aggregator.example.com:
  vars:
    # Aggregator
    nvsm_exporter_port: 9123
    nvsm_aggregator_image: "nvcr.io/nvstaging/cloud-native/nvsm-aggregator:@PACKAGE_VERSION@"
    nvsm_aggregator_container: "nvsm-aggregator"   # Name of nvsm container
    nvsm_aggregator_network: "nvsm"                # Name of nvsm container network
    # Dashboard stack - Prometheus & Grafana
    nvsm_enable_dashboard: true
    nvsm_prometheus_image: "nvcr.io/nvstaging/cloud-native/nvsm-prometheus:@PACKAGE_VERSION@"
    nvsm_grafana_image: "nvcr.io/nvstaging/cloud-native/nvsm-grafana:@PACKAGE_VERSION@"
    nvsm_grafana_port: 3000
    nvsm_grafana_admin_user: "nvsm"
    nvsm_grafana_admin_password: "nvsm"
compute:
  hosts:
    # Add compute nodes here; each node should have an nvsm_id and bmc_ip
    # Examples:
    #
    # dgx01.example.com:
    #   nvsm_id: 1
    #   bmc_ip: "192.168.10.1"
    # dgx02.example.com:
    #   nvsm_id: 2
    #   bmc_ip: "192.168.10.2"
    #   # Override the group vars if required,
    #   # for example, if the host has a different user/password
    #   ansible_user: "sshuser02"
    #   ansible_ssh_pass: "sshpwd02"
    #   ansible_sudo_pass: "sshpwd02"
    #   bmc_pass: "bmcpwd02"
    # dgx03.example.com:
    #   nvsm_id: 3
    #   bmc_ip: "192.168.10.3"
    #   ansible_user: "sshuser03"
    #   # Use a literal block scalar if the password contains special characters like double quotes (")
    #   # The password here is "specialSSHPass" (including the double quotes)
    #   ansible_ssh_pass: |-
    #     "specialSSHPass"
    #   ansible_sudo_pass: |-
    #     "specialSSHPass"
    #   bmc_pass: |-
    #     "specialBMCPass"
  vars:
    # The BMC user/password applies to all compute hosts;
    # it can be overridden by host variables if some hosts have different passwords
    bmc_user: "admin"
    bmc_pass: "admin"
all:
  children:
    compute:
    aggregator:
  vars:
    # Aggregator variables
    nvsm_mqtt_host: "aggregator.example.com"
    nvsm_mqtt_port: 8883
    # The SSH user/password applies to all aggregator and compute hosts;
    # it can be overridden by host variables if some hosts have different passwords
    # The password here is sshpwd (without the double quotes)
    ansible_user: "sshuser"
    ansible_ssh_pass: "sshpwd"
    ansible_sudo_pass: "sshpwd"
    # Force reprovision
    nvsm_force_reprovision: false

Perform the following steps on the Aggregator node:

$ sudo mkdir -p /etc/nvsm/aggregator
$ sudo vim /etc/nvsm/aggregator/inventory.yaml
$ sudo chown root:root /etc/nvsm/aggregator/inventory.yaml
$ sudo chmod 0600 /etc/nvsm/aggregator/inventory.yaml
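
To confirm the restricted permissions, list the file; the mode should be -rw------- and the owner root:root:

$ ls -l /etc/nvsm/aggregator/inventory.yaml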

Update the inventory file /etc/nvsm/aggregator/inventory.yaml with the following information:

  • Add the aggregator host to the hosts section under aggregator; it can be the hostname or host IP address of the aggregator node.

  • Change the nvsm_mqtt_host variable (under vars of the all group in the sample) to the hostname or host IP address of the aggregator node.

  • Add hosts to the hosts section under compute; each can be the hostname or host IP address of a compute node.

  • Each compute node host must have a unique nvsm_id, which identifies the compute node when it connects to the aggregator node. Valid nvsm_id values range from 1 to 1023.

  • Each compute node host must have a unique bmc_ip, which specifies the BMC IP address associated with that node.

  • Specify the SSH user/password for each compute node by updating ansible_user, ansible_ssh_pass, and ansible_sudo_pass under the hosts section. Variables under vars of the all group apply to all hosts; variables defined under a host section override them. A minimal filled-in inventory is shown after this list.
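
For reference, here is a minimal filled-in inventory for one aggregator and two compute nodes. All hostnames, IPs, and credentials are placeholder values, and the aggregator vars block (images, ports, Grafana settings) is assumed to be kept exactly as in the sample above:

aggregator:
  hosts:
    aggregator.example.com:
  vars:
    # keep the aggregator vars from the sample above
compute:
  hosts:
    dgx01.example.com:
      nvsm_id: 1
      bmc_ip: "192.168.10.1"
    dgx02.example.com:
      nvsm_id: 2
      bmc_ip: "192.168.10.2"
  vars:
    bmc_user: "admin"
    bmc_pass: "admin"
all:
  children:
    compute:
    aggregator:
  vars:
    nvsm_mqtt_host: "aggregator.example.com"
    nvsm_mqtt_port: 8883
    ansible_user: "sshuser"
    ansible_ssh_pass: "sshpwd"
    ansible_sudo_pass: "sshpwd"
    nvsm_force_reprovision: false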

Inventory File Details

The inventory.yaml file is a standard Ansible inventory file; see the Ansible inventory documentation for details.

The inventory file has a group called “all”, which contains two sub-groups:

  • aggregator group - contains the list of aggregator hosts. Currently, only one aggregator host is supported.

  • compute group - contains all DGX nodes.

Vars defined in the “all” group apply to all hosts (both aggregator and compute). Vars defined in the “compute” group apply only to compute nodes, not to the aggregator group, and override values from the “all” group when the same var is defined in both. Vars defined on a host apply to that host only and override any value defined at the group level.

Examples: If all hosts share the same SSH user/password, define ansible_user, ansible_ssh_pass, and ansible_sudo_pass under all.vars. If a particular host has a different user/password, define it under that host, for example:

dgx03.example.com:
    nvsm_id: 3
    bmc_ip: "192.168.10.3"
    # overwrite the group vars if required
    ansible_user: "sshuser"
    ansible_ssh_pass: "sshpwd"
    ansible_sudo_pass: "sshpwd"

Run Docker Command to Provision Nodes

Start Aggregator Container and Provision Compute Nodes

Run the provision container with the configuration listed in the command below, using the image ID or repository:tag of the provision container as shown by the docker images command. The provisioner starts the aggregator, Grafana, and Prometheus containers using the configuration from the inventory file.

The inventory file stores each compute node’s BMC IP, host IP, and passwords. Using the inventory file, the provision container invokes an Ansible playbook that copies certificates from the aggregator node to the compute nodes and restarts NVSM on the compute nodes so they connect to the aggregator node’s NVSM.

$ docker run -it --rm -v /etc/nvsm/aggregator:/nvsm/mnconfig nvcr.io/nvstaging/cloud-native/nvsm-provision:24.09.05

Sample Output:

Validating inventory file...
Validation completed
PLAY [NVSM Aggregator node precheck] ***********************************************************************************************************************

TASK [Ping aggregator node] **********************************************************************************************************************************
ok: [aggregator.example.com]

TASK [Get inventory file stat] ***********************************************************************************************************************************
ok: [aggregator.example.com]

TASK [Check inventory file permission] *************************************************************************************************************************
ok: [aggregator.example.com] => {
    "changed": false,
    "msg": "All assertions passed"
}

TASK [Check if docker command is available] ******************************************************************************************************************
ok: [aggregator.example.com]

TASK [Check if docker compose plugin is available] ************************************************************************************************************
ok: [aggregator.example.com]

TASK [Using 'docker compose'] ********************************************************************************************************************************
ok: [aggregator.example.com]

TASK [Check if docker-compose command is available] ********************************************************************************************************
skipping: [aggregator.example.com]

TASK [Using 'docker-compose'] *******************************************************************************************************************************
skipping: [aggregator.example.com]

TASK [Show docker compose command] ********************************************************************************************************************
ok: [aggregator.example.com] => {
    "msg": "Using 'docker compose'"
}
PLAY [NVSM Worker node precheck] ************************************************************************************************************************
TASK [Ping compute node] **********************************************************************************************************************************
ok: [192.168.10.1]
TASK [Print NVSM Node ID] *********************************************************************************************************************************
ok: [192.168.10.1] => {
    "msg": "Precheck node ID 1"
}
TASK [Get Packages Facts] *********************************************************************************************************************************
ok: [192.168.10.1]
TASK [Install NVSM] ****************************************************************************************************************************************
skipping: [192.168.10.1]
TASK [Get Packages Facts] *********************************************************************************************************************************
skipping: [192.168.10.1]
TASK [Get current NVSM version] ***************************************************************************************************************************
ok: [192.168.10.1]
TASK [Check if NVSM version meet requirement] ***********************************************************************************************************
skipping: [192.168.10.1]
PLAY [NVSM Aggregator node provision] *******************************************************************************************************************
TASK [Check if aggregator was provisioned] ****************************************************************************************************************
ok: [aggregator.example.com]
TASK [Check if aggregator container exists] ******************************************************************************************************************
ok: [aggregator.example.com]
TASK [Remove existing aggregator container] *****************************************************************************************************************
changed: [aggregator.example.com]
TASK [Create Aggregator docker-compose.yml] ***************************************************************************************************************
ok: [aggregator.example.com]
TASK [Start aggregator container] *****************************************************************************************************************************
changed: [aggregator.example.com]
TASK [Wait for NVSM keyfiles to be created] *******************************************************************************************************************
ok: [aggregator.example.com] => (item=nvsm-ca.crt)
ok: [aggregator.example.com] => (item=nvsm-client.crt)
ok: [aggregator.example.com] => (item=nvsm-client.key)
PLAY [NVSM compute node provision] **************************************************************************************************************************
TASK [Check if node was provisioned] **************************************************************************************************************************
ok: [192.168.10.1]
TASK [Copy keyfiles] *********************************************************************************************************************************************
changed: [192.168.10.1] => (item=nvsm-ca.crt)
changed: [192.168.10.1] => (item=nvsm-client.crt)
changed: [192.168.10.1] => (item=nvsm-client.key)
TASK [Copy using inline content] *********************************************************************************************************************************
changed: [192.168.10.1]
TASK [Restart NVSM] *******************************************************************************************************************************************
changed: [192.168.10.1]
PLAY [NVSM Aggregator node - post config] *********************************************************************************************************************
TASK [Restart nvsm-exporter] ************************************************************************************************************************************
changed: [aggregator.example.com]
TASK [Reload nvsm-lifecycled] ************************************************************************************************************************************
changed: [aggregator.example.com]
PLAY [NVSM Aggregator node postcheck] ************************************************************************************************************************
TASK [Post check] ************************************************************************************************************************************************
ok: [aggregator.example.com] => {
    "msg": "NVSM Aggregator node postcheck"
}
PLAY [NVSM Worker node postcheck] *****************************************************************************************************************************
TASK [Post check] ************************************************************************************************************************************************
ok: [aggregator.example.com] => {
    "msg": "NVSM Aggregator node postcheck node ID 1"
}
PLAY RECAP *******************************************************************************************************************************************************
192.168.10.1                : ok=25   changed=7    unreachable=0    failed=0    skipped=5    rescued=0    ignored=0

Check that all containers are running on the aggregator node:

$ sudo docker ps
CONTAINER ID   IMAGE                                                     COMMAND                  CREATED             STATUS             NAMES             PORTS
e7f9a56e6ef4   nvcr.io/nvstaging/cloud-native/nvsm-grafana:24.09.05      "/bin/bash -c ./entr_"   About an hour ago   Up About an hour   nvsm-grafana
2fe6fb57ed2a   nvcr.io/nvstaging/cloud-native/nvsm-prometheus:24.09.05   "/bin/bash -c ./entr_"   About an hour ago   Up About an hour   nvsm-prometheus
fff3ed636853   nvcr.io/nvstaging/cloud-native/nvsm-aggregator:24.09.05   "/bin/bash -c /usr/b_"   About an hour ago   Up About an hour   nvsm-aggregator   0.0.0.0:3000->3000/tcp, :::3000->3000/tcp, 0.0.0.0:8883->8883/tcp, :::8883->8883/tcp, 0.0.0.0:9123->9123/tcp, :::9123->9123/tcp
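
If any of the three containers is missing or keeps restarting, inspect its logs as a first troubleshooting step; these are generic Docker commands using the container names shown above:

$ docker logs nvsm-aggregator
$ docker logs nvsm-prometheus
$ docker logs nvsm-grafana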

Connect to Multinode NVSM

After provisioning, the compute nodes should connect to the aggregator node’s NVSM within a few minutes (3-5 minutes). You can then inspect the NVSM multinode setup by logging in to the aggregator container.

Enter Aggregator Container

Log in to the aggregator container using the following command:

$ docker exec -it nvsm-aggregator /bin/bash

Examining the Aggregator NVSM Services

Run the following command on the Aggregator container to check running NVSM services:

$ nvsm status
SERVICE           ENABLED   ACTIVE   SUB       DESCRIPTION
================================================================================
nvsm-exporter     enabled   active   running   NVSM Exporter to provide DGX System Management Metrics
nvsm-lifecycled   enabled   active   running   NVSM aggregator lifecycle manager
nvsm-mqtt         enabled   active   running   MQTT broker for NVSM API for signaling within NVSM API components

Examine the nvsm_core instances running in the aggregator:

$ nvsm_lifecycle status
Hostname/IP   Node   Command                                      PID   Stat      StartCount   CrashReason   CrashCode
10.33.1.23    23     /usr/sbin/nvsm_core -mode server SERVE 23    99    Running   2            None          0
10.33.1.24    24     /usr/sbin/nvsm_core -mode server SERVE 24    105   Running   2            None          0

Examine the status for a given node:

$ nvsm show status --node 23
Node ID: 23
Service Status:
    SERVICE                         STATUS
    Aggregator node nvsm_core       Running
    Compute node nvsm_core          Inactive

Run NVSM CLI Commands for Cluster Nodes

All supported NVSM CLI show commands can be run against any cluster node, including show health and dump health.

Examples of NVSM show commands targeted at a specific compute node (the --node argument takes the node’s nvsm_id from the inventory file):

$ nvsm show power --node 1
$ nvsm show gpus --node 2
$ nvsm show storage --node 3
$ nvsm show network --node 4
$ nvsm show alerts --node 5
$ nvsm show health --node 6

Examples of running the NVSM health command on all compute nodes:

$ nvsm show health
$ nvsm show health --node all

Example of collecting an NVSM dump from a compute node and storing it on the aggregator node:

Dump collection runs on the compute node, and the resulting dump is copied back to the aggregator node at the output directory path.

$ nvsm dump health --node 1

Example of running NVSM CLI commands interactively against any compute node. From the NVSM interactive prompt, cd into a node’s ID and explore its resources (the nvsm-> prompt shown below is illustrative):

$ nvsm
nvsm-> show
nvsm-> cd 1
nvsm-> show
(iterate over any resource in the same way)

Accessing Aggregator Services

From the aggregator node, you can access the NVSM exporter, which exposes information for all cluster nodes.
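
For example, using the default ports from the sample inventory, the exporter and dashboard are reachable on the aggregator host. The /metrics path follows the usual Prometheus exporter convention, which nvsm-exporter is assumed to use; replace <aggregator-host> with your aggregator’s hostname or IP:

$ curl http://<aggregator-host>:9123/metrics

The Grafana dashboard is served at http://<aggregator-host>:3000 (default credentials nvsm/nvsm from the sample inventory).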

Manage Nodes

Add New Nodes

On the aggregator node, add the new compute nodes to the compute hosts section of /etc/nvsm/aggregator/inventory.yaml and rerun the provisioner container.

$ docker run -it --rm -v /etc/nvsm/aggregator:/nvsm/mnconfig nvcr.io/nvstaging/cloud-native/nvsm-provision:24.09.05

Removing Nodes

On the aggregator node, remove the compute nodes from the compute hosts section of /etc/nvsm/aggregator/inventory.yaml and rerun the provisioner container.

$ docker run -it --rm -v /etc/nvsm/aggregator:/nvsm/mnconfig nvcr.io/nvstaging/cloud-native/nvsm-provision:24.09.05

After removing the nodes from the aggregator, restart NVSM in single-node mode on the removed compute nodes. Perform the following steps on each compute node that was removed:

$ sudo rm -rf /etc/nvsm/mn
$ sudo systemctl restart nvsm
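
To confirm that the NVSM service restarted cleanly on the removed node, check its state with a standard systemd query:

$ sudo systemctl status nvsm --no-pager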

Features Impacted

NVSM Call Home will be supported in future releases.