NVSM Multinode
NVSM Multinode monitors multiple compute nodes within a cluster. It serves as a single management interface with active monitoring and alerting for a cluster of nodes. It also supports cluster-wide health with drill-down capabilities and dump collection.
NVSM Multinode Overview
Aggregator Node - Acts as the central coordinator, running an MQTT server to communicate with compute nodes. It is deployed as a container on an external management server, where an NVSM instance runs for each compute node.
Compute Nodes - NVSM instances running on DGX systems connect to the aggregator node’s MQTT server.
Prerequisites
System Requirements
Ubuntu-based management server, which can be an external server or a DGX system.
Network connectivity between the aggregator node and compute nodes.
Software Requirements
Ensure Docker (minimum version 20.10.21) and Docker Compose (compatible version) are installed on the system.
NVSM 24.09.05 (the first multinode-supported version) or later must be installed on the compute nodes.
Ensure the clocks of the aggregator node and all compute nodes are synchronized via NTP. Because compute nodes and the aggregator node communicate over MQTT, the timestamps on MQTT messages must be in sync.
Note: If the compute node and aggregator clocks are not synchronized, MQTT messages from compute nodes are treated as stale and NVSM CLI commands fail with a ServiceUnavailable error.
The aggregator and all compute nodes must set up time synchronization using services such as ntpdate or chrony.
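The effect of clock skew can be illustrated with a small sketch (not NVSM code): a message is treated as stale when its timestamp differs from local time by more than a tolerance. The 60-second tolerance below is an assumed value for illustration only; NVSM's actual threshold is internal.

```shell
# Illustration: why clock skew makes MQTT messages look stale.
is_stale() {  # usage: is_stale MSG_EPOCH NOW_EPOCH TOLERANCE_SECS
  skew=$(( $2 - $1 ))
  [ "${skew#-}" -gt "$3" ]  # strip the sign to compare the absolute skew
}

now=$(date +%s)
if is_stale "$((now - 300))" "$now" 60; then
  echo "5-minute skew: message treated as stale (ServiceUnavailable)"
fi
if ! is_stale "$now" "$now" 60; then
  echo "clocks in sync: message accepted"
fi
```

Note the check is symmetric: a compute node clock that runs ahead of the aggregator is just as problematic as one that lags behind.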
Setup Details
A cluster consists of an aggregator node (a container) running on any Ubuntu-based x86 server and multiple compute nodes (DGX systems). The aggregator node must have access to both the host network and the management (BMC) network. Compute nodes connect to the aggregator node over MQTT; the MQTT server is hosted by the aggregator node. NVSM running on the compute nodes connects to NVSM running on the aggregator node. The aggregator node hosts an NVSM exporter dashboard that serves as a single interface to view all connected nodes, sensor data, health, and more.
Packages
Aggregator - The aggregator image is a Docker container pre-packaged with NVSM, an MQTT server, and the nvsm-exporter stack.
Node Provisioner - The node provisioner image contains an Ansible playbook that provisions the aggregator node and compute nodes in the cluster.
Prometheus/Grafana (Optional) - If you already have your own Prometheus/Grafana running, deploy only the aggregator container.
Setup Instructions
Ensure the latest NVSM (multinode-supported version 24.09.05 or later) is installed on all compute nodes. The aggregator container is pre-packaged with the latest NVSM.
Installing Docker
To install Docker and Docker Compose on the aggregator node, run the following commands:
$ sudo apt-get install docker.io
$ sudo apt-get install docker-compose
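After installation, you can confirm the Docker version meets the 20.10.21 minimum. A minimal sketch using a version comparison built on `sort -V` (assumes GNU coreutils; the `installed` value is a placeholder, as the comment shows how to obtain the real one):

```shell
# version_ge A B: succeed when version A >= version B (GNU `sort -V` assumed).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

required="20.10.21"
# In practice, extract the installed version with:
#   docker --version | sed -E 's/Docker version ([0-9.]+).*/\1/'
installed="24.0.7"  # placeholder value for the sketch

if version_ge "$installed" "$required"; then
  echo "Docker $installed meets the $required minimum"
else
  echo "Docker $installed is older than $required; upgrade before provisioning" >&2
fi
```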
Provision Aggregator Node
Download Docker Images
Download all required containers from <TBD>.
Load Docker Images
Run the following commands to load the Docker images:
$ docker load -i nvsm-prometheus_24.09.05.tar.gz
$ docker load -i nvsm-grafana_24.09.05.tar.gz
$ docker load -i nvsm-provision_24.09.05.tar.gz
$ docker load -i nvsm-aggregator_24.09.05.tar.gz
Run the following command to ensure the container images are present:
$ docker images
REPOSITORY                                         TAG        IMAGE ID       CREATED       SIZE
nvcr.io/nvstaging/cloud-native/nvsm-grafana        24.09.05   12e4ebaee709   3 hours ago   541MB
nvcr.io/nvstaging/cloud-native/nvsm-prometheus     24.09.05   68518ef28efc   3 hours ago   376MB
nvcr.io/nvstaging/cloud-native/nvsm-provision      24.09.05   3869ef50cbbd   3 hours ago   518MB
nvcr.io/nvstaging/cloud-native/nvsm-aggregator     24.09.05   eb6068a414be   3 hours ago   1.55GB
Create Inventory YAML File
The inventory file stores information about all compute nodes, such as the BMC IP, host IP, and encrypted passwords. Using the inventory file, the node provisioner container invokes an Ansible playbook to copy certificates from the aggregator node to all compute nodes and restart each compute node's NVSM so that it connects to the aggregator node's NVSM.
Because the inventory.yaml file contains usernames and passwords, access to it must be restricted to the admin user only.
Use the following sample to create an inventory.yaml file:
aggregator:
  hosts:
    # Add aggregator host here
    # Example:
    #
    # aggregator.example.com:
  vars:
    # Aggregator
    nvsm_exporter_port: 9123
    nvsm_aggregator_image: "nvcr.io/nvstaging/cloud-native/nvsm-aggregator:@PACKAGE_VERSION@"
    nvsm_aggregator_container: "nvsm-aggregator" # Name of nvsm container
    nvsm_aggregator_network: "nvsm" # Name of nvsm container network
    # Dashboard stack - Prometheus & Grafana
    nvsm_enable_dashboard: true
    nvsm_prometheus_image: "nvcr.io/nvstaging/cloud-native/nvsm-prometheus:@PACKAGE_VERSION@"
    nvsm_grafana_image: "nvcr.io/nvstaging/cloud-native/nvsm-grafana:@PACKAGE_VERSION@"
    nvsm_grafana_port: 3000
    nvsm_grafana_admin_user: "nvsm"
    nvsm_grafana_admin_password: "nvsm"
compute:
  hosts:
    # Add compute nodes here, each node should have an nvsm_id and a bmc_ip
    # Examples:
    #
    # dgx01.example.com:
    #   nvsm_id: 1
    #   bmc_ip: "192.168.10.1"
    # dgx02.example.com:
    #   nvsm_id: 2
    #   bmc_ip: "192.168.10.2"
    #   # Overwrite the group vars if required,
    #   # for example, if the host has a different user/password
    #   ansible_user: "sshuser02"
    #   ansible_ssh_pass: "sshpwd02"
    #   ansible_sudo_pass: "sshpwd02"
    #   bmc_pass: "bmcpwd02"
    # dgx03.example.com:
    #   nvsm_id: 3
    #   bmc_ip: "192.168.10.3"
    #   ansible_user: "sshuser03"
    #   # Use a literal block scalar if the password contains special characters like double quotes (")
    #   # The password here is "specialSSHPass" (including the double quotes)
    #   ansible_ssh_pass: |-
    #     "specialSSHPass"
    #   ansible_sudo_pass: |-
    #     "specialSSHPass"
    #   bmc_pass: |-
    #     "specialBMCPass"
  vars:
    # The BMC user/password applies to all compute hosts;
    # it can be overridden by host variables if some hosts have different passwords
    bmc_user: "admin"
    bmc_pass: "admin"
all:
  children:
    compute:
    aggregator:
  vars:
    # Aggregator variables
    nvsm_mqtt_host: "aggregator.example.com"
    nvsm_mqtt_port: 8883
    # The SSH user/password applies to all aggregator and compute hosts;
    # it can be overridden by host variables if some hosts have different passwords
    # The password here is sshpwd (without the double quotes)
    ansible_user: "sshuser"
    ansible_ssh_pass: "sshpwd"
    ansible_sudo_pass: "sshpwd"
    # Force reprovision
    nvsm_force_reprovision: false
Perform the following steps on the Aggregator node:
$ mkdir -p /etc/nvsm/aggregator
$ vim /etc/nvsm/aggregator/inventory.yaml
$ sudo chown root:root /etc/nvsm/aggregator/inventory.yaml
$ sudo chmod 0600 /etc/nvsm/aggregator/inventory.yaml
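The provisioner's precheck asserts the inventory file permissions, so it helps to verify them up front. A small sketch (assumes GNU `stat`; demonstrated on a temporary file, so point it at /etc/nvsm/aggregator/inventory.yaml on a real system):

```shell
# mode_is_0600 FILE: succeed when FILE's permission bits are exactly 0600.
mode_is_0600() { [ "$(stat -c '%a' "$1")" = "600" ]; }

f=$(mktemp)   # stand-in for /etc/nvsm/aggregator/inventory.yaml
chmod 0600 "$f"
if mode_is_0600 "$f"; then
  echo "inventory permissions OK"
else
  echo "tighten permissions: sudo chmod 0600 <inventory file>" >&2
fi
rm -f "$f"
```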
Update the inventory file /etc/nvsm/aggregator/inventory.yaml
with the following information:
Add the aggregator host to the hosts section under aggregator; it can be the hostname or host IP address of the aggregator node.
Change the nvsm_mqtt_host variable under vars to the hostname or host IP address of the aggregator node.
Add hosts to the hosts section under compute; these can be the hostnames or host IP addresses of the compute nodes.
Each compute node host must have a unique nvsm_id, which identifies the compute node when it connects to the aggregator node. nvsm_id can range from 1 to 1023.
Each compute node host must have a unique bmc_ip, which specifies the BMC IP address associated with that node.
Specify the SSH user/password for each compute node by updating ansible_user, ansible_ssh_pass, and ansible_sudo_pass under the hosts section. The vars variables under all apply to all hosts; variables defined under a host section override vars variables.
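Since duplicate or out-of-range nvsm_id values cause provisioning problems, a quick sanity check can help. Below is a hypothetical helper (not part of NVSM) that scans an inventory for uncommented nvsm_id entries, shown against a throwaway file; point it at /etc/nvsm/aggregator/inventory.yaml in practice.

```shell
# check_nvsm_ids FILE: verify every uncommented nvsm_id is unique and in 1..1023.
check_nvsm_ids() {
  ids=$(grep -E '^[[:space:]]*nvsm_id:' "$1" | awk '{print $2}')
  if [ -n "$(printf '%s\n' $ids | sort | uniq -d)" ]; then
    echo "duplicate nvsm_id found" >&2; return 1
  fi
  for id in $ids; do
    if [ "$id" -lt 1 ] || [ "$id" -gt 1023 ]; then
      echo "nvsm_id $id out of range 1..1023" >&2; return 1
    fi
  done
  echo "nvsm_id values OK"
}

# Demo on a temporary file with two valid nodes.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
compute:
  hosts:
    dgx01.example.com:
      nvsm_id: 1
    dgx02.example.com:
      nvsm_id: 2
EOF
check_nvsm_ids "$tmp"
rm -f "$tmp"
```

The grep pattern deliberately skips commented-out sample hosts, since those lines begin with `#` rather than `nvsm_id:`.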
Inventory File Details
The inventory.yaml file is an Ansible inventory file; see the Ansible documentation for details.
The inventory file has a group called “all”, which contains two sub-groups:
aggregator group - contains the host list of aggregator hosts. Currently only one host is supported.
compute group - contains all DGX nodes.
Vars defined in the ‘all’ group apply to all hosts (both aggregator and compute). Vars defined in the ‘compute’ group apply only to compute nodes, not to the aggregator group, but they override the value from the “all” group if the var is also defined there. Vars defined on a host apply to that host only and override values defined in any group.
Examples: If all hosts share the same SSH user/password, define ansible_user, ansible_ssh_pass, and ansible_sudo_pass under all.vars. If a particular host has a different user/password, define it under that host, for example:
dgx03.example.com:
  nvsm_id: 3
  bmc_ip: "192.168.10.3"
  # overwrite the group vars if required
  ansible_user: "sshuser"
  ansible_ssh_pass: "sshpwd"
  ansible_sudo_pass: "sshpwd"
Run Docker Command to Provision Nodes
Start Aggregator Container and Provision Compute Nodes
Run the provision container with the configuration listed in the command below, using the image ID of the provision container found with the docker images command. The provisioner starts the aggregator, Grafana, and Prometheus containers using the configuration in the inventory file.
The inventory file stores each compute node's BMC IP, host IP, and passwords. Using the inventory file, the provision container invokes an Ansible playbook to copy certificates from the aggregator node to the compute nodes and restart each compute node's NVSM so that it connects to the aggregator node's NVSM.
$ docker run -it --rm -v /etc/nvsm/aggregator:/nvsm/mnconfig nvcr.io/nvstaging/cloud-native/nvsm-provision:24.09.05
Sample Output:
Validating inventory file...
Validation completed
PLAY [NVSM Aggregator node precheck] ***********************************************************************************************************************
TASK [Ping aggregator node] **********************************************************************************************************************************
ok: [aggregator.example.com]
TASK [Get inventory file stat] ***********************************************************************************************************************************
ok: [aggregator.example.com]
TASK [Check inventory file permission] *************************************************************************************************************************
ok: [aggregator.example.com] => {
"changed": false,
"msg": "All assertions passed"
}
TASK [Check if docker command is available] ******************************************************************************************************************
ok: [aggregator.example.com]
TASK [Check if docker compose plugin is available] ************************************************************************************************************
ok: [aggregator.example.com]
TASK [Using 'docker compose'] ********************************************************************************************************************************
ok: [aggregator.example.com]
TASK [Check if docker-compose command is available] ********************************************************************************************************
skipping: [aggregator.example.com]
TASK [Using 'docker-compose'] *******************************************************************************************************************************
skipping: [aggregator.example.com]
TASK [Show docker compose command] ********************************************************************************************************************
ok: [aggregator.example.com] => {
"msg": "Using 'docker compose'"
}
PLAY [NVSM Worker node precheck] ************************************************************************************************************************
TASK [Ping compute node] **********************************************************************************************************************************
ok: [192.168.10.1]
TASK [Print NVSM Node ID] *********************************************************************************************************************************
ok: [192.168.10.1] => {
"msg": "Precheck node ID 1"
}
TASK [Get Packages Facts] *********************************************************************************************************************************
ok: [192.168.10.1]
TASK [Install NVSM] ****************************************************************************************************************************************
skipping: [192.168.10.1]
TASK [Get Packages Facts] *********************************************************************************************************************************
skipping: [192.168.10.1]
TASK [Get current NVSM version] ***************************************************************************************************************************
ok: [192.168.10.1]
TASK [Check if NVSM version meet requirement] ***********************************************************************************************************
skipping: [192.168.10.1]
PLAY [NVSM Aggregator node provision] *******************************************************************************************************************
TASK [Check if aggregator was provisioned] ****************************************************************************************************************
ok: [aggregator.example.com]
TASK [Check if aggregator container exists] ******************************************************************************************************************
ok: [aggregator.example.com]
TASK [Remove existing aggregator container] *****************************************************************************************************************
changed: [aggregator.example.com]
TASK [Create Aggregator docker-compose.yml] ***************************************************************************************************************
ok: [aggregator.example.com]
TASK [Start aggregator container] *****************************************************************************************************************************
changed: [aggregator.example.com]
TASK [Wait for NVSM keyfiles to be created] *******************************************************************************************************************
ok: [aggregator.example.com] => (item=nvsm-ca.crt)
ok: [aggregator.example.com] => (item=nvsm-client.crt)
ok: [aggregator.example.com] => (item=nvsm-client.key)
PLAY [NVSM compute node provision] **************************************************************************************************************************
TASK [Check if node was provisioned] **************************************************************************************************************************
ok: [192.168.10.1]
TASK [Copy keyfiles] *********************************************************************************************************************************************
changed: [192.168.10.1] => (item=nvsm-ca.crt)
changed: [192.168.10.1] => (item=nvsm-client.crt)
changed: [192.168.10.1] => (item=nvsm-client.key)
TASK [Copy using inline content] *********************************************************************************************************************************
changed: [192.168.10.1]
TASK [Restart NVSM] *******************************************************************************************************************************************
changed: [192.168.10.1]
PLAY [NVSM Aggregator node - post config] *********************************************************************************************************************
TASK [Restart nvsm-exporter] ************************************************************************************************************************************
changed: [aggregator.example.com]
TASK [Reload nvsm-lifecycled] ************************************************************************************************************************************
changed: [aggregator.example.com]
PLAY [NVSM Aggregator node postcheck] ************************************************************************************************************************
TASK [Post check] ************************************************************************************************************************************************
ok: [aggregator.example.com] => {
"msg": "NVSM Aggregator node postcheck"
}
PLAY [NVSM Worker node postcheck] *****************************************************************************************************************************
TASK [Post check] ************************************************************************************************************************************************
ok: [aggregator.example.com] => {
"msg": "NVSM Aggregator node postcheck node ID 1"
}
PLAY RECAP *******************************************************************************************************************************************************
192.168.10.1 : ok=25 changed=7 unreachable=0 failed=0 skipped=5 rescued=0 ignored=0
Check all containers are running on the Aggregator node:
$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS NAMES PORTS
e7f9a56e6ef4 nvcr.io/nvstaging/cloud-native/nvsm-grafana:24.09.05 "/bin/bash -c ./entr_" About an hour ago Up About an hour nvsm-grafana
2fe6fb57ed2a nvcr.io/nvstaging/cloud-native/nvsm-prometheus:24.09.05 "/bin/bash -c ./entr_" About an hour ago Up About an hour nvsm-prometheus
fff3ed636853 nvcr.io/nvstaging/cloud-native/nvsm-aggregator:24.09.05 "/bin/bash -c /usr/b_" About an hour ago Up About an hour nvsm-aggregator 0.0.0.0:3000->3000/tcp, :::3000->3000/tcp, 0.0.0.0:8883->8883/tcp, :::8883->8883/tcp, 0.0.0.0:9123->9123/tcp, :::9123->9123/tcp
Connect to Multinode NVSM
After provisioning, the compute nodes should connect to the aggregator node's NVSM within a few minutes (3-5 minutes). You can then explore NVSM multinode by logging in to the aggregator container.
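While waiting, a generic retry helper can poll until nodes report in. This is a plain-shell sketch; the docker/nvsm command in the comment is the intended usage on the aggregator node, not something the helper depends on.

```shell
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds or ATTEMPTS run out.
retry() {
  attempts=$1; delay=$2; shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Intended usage (on the aggregator node), e.g. poll for up to 5 minutes:
#   retry 10 30 docker exec nvsm-aggregator nvsm show status --node 1
```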
Enter Aggregator Container
Log in to the aggregator container using the following command:
$ docker exec -it nvsm-aggregator /bin/bash
Examining the Aggregator NVSM Services
Run the following command on the Aggregator container to check running NVSM services:
$ nvsm status
SERVICE ENABLED ACTIVE SUB DESCRIPTION
================================================================================
nvsm-exporter enabled active running NVSM Exporter to provide DGX System Management Metrics
nvsm-lifecycled enabled active running NVSM aggregator lifecycle manager
nvsm-mqtt enabled active running MQTT broker for NVSM API for signaling within NVSM API components
Examine the nvsm_core instances running in the aggregator:
$ nvsm_lifecycle status
Hostname/IP Node Command PID Stat StartCount CrashReason CrashCode
10.33.1.23 23 /usr/sbin/nvsm_core -mode server SERVE 23 99 Running 2 None 0
10.33.1.24 24 /usr/sbin/nvsm_core -mode server SERVE 24 105 Running 2 None 0
Examine the status for a given node:
$ nvsm show status --node 23
Node ID: 23
Service Status:
SERVICE STATUS
Aggregator node nvsm_core Running
Compute node nvsm_core Inactive
Run NVSM CLI commands for Cluster Nodes
You can run all supported NVSM CLI show commands against any cluster node, including show health and dump health.
Examples of running NVSM show commands against a compute node:
$ nvsm show power --node 1
$ nvsm show gpus --node 2
$ nvsm show storage --node 3
$ nvsm show network --node 4
$ nvsm show alerts --node 5
$ nvsm show health --node 6
$ nvsm show health --node 7
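When iterating over specific nodes rather than all of them, a small loop helper keeps the commands compact. A sketch (the node IDs are assumptions taken from your inventory.yaml, and the nvsm example in the comment must run inside the aggregator container):

```shell
# for_each_node "IDS" CMD...: run CMD once per node ID, appending --node <id>.
for_each_node() {
  nodes=$1; shift
  for n in $nodes; do
    "$@" --node "$n" || echo "command failed for node $n" >&2
  done
}

# Intended usage inside the aggregator container:
#   for_each_node "1 2 3" nvsm show health
```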
Example to run the NVSM health command on all compute nodes:
$ nvsm show health
$ nvsm show health --node all
Example: collect an NVSM dump from a compute node and store it on the aggregator node.
The dump is collected on the compute node and copied back to the aggregator node at the output directory path.
$ nvsm dump health --node 1
Example: run NVSM CLI commands interactively on any compute node. Enter the NVSM shell, then issue the subsequent commands at its prompt:
$ nvsm
show
cd 1
show
... iterate over any resource
Accessing Aggregator Services
From the aggregator node, you can access the exporter, which carries information for all cluster nodes:
nvsm-exporter - http://[aggregator Hostname/IP]:9123/metrics
grafana - http://[aggregator Hostname/IP]:3000
Manage Nodes
Add New Nodes
On the aggregator node, add the new compute nodes to the compute hosts section in /etc/nvsm/aggregator/inventory.yaml
and rerun the provisioner container.
$ docker run -it --rm -v /etc/nvsm/aggregator:/nvsm/mnconfig nvcr.io/nvstaging/cloud-native/nvsm-provision:24.09.05
Removing Nodes
On the aggregator node, remove the compute nodes from the compute hosts section in /etc/nvsm/aggregator/inventory.yaml
and rerun the provisioner container.
$ docker run -it --rm -v /etc/nvsm/aggregator:/nvsm/mnconfig nvcr.io/nvstaging/cloud-native/nvsm-provision:24.09.05
After removing the nodes from the aggregator, restart NVSM in single-node mode on the removed compute nodes. Perform the following steps on each compute node that was removed:
$ rm -rf /etc/nvsm/mn
$ systemctl restart nvsm
Features Impacted
NVSM Call Home will be supported in future releases.