NVSM Multinode
NVSM Multinode monitors multiple compute nodes within a cluster. It serves as a single management interface with active monitoring and alerting for a cluster of nodes, and supports features including cluster-wide health with drill-down capabilities and dump collection.
NVSM Multinode Overview
Aggregator Node - Acts as the central coordinator, running an MQTT server to communicate with compute nodes. It is deployed as a container on an external management server and runs one NVSM instance for each compute node.
Compute Nodes - NVSM running on the DGX systems connects to the aggregator node's MQTT server.
Prerequisites
System Requirements
An Ubuntu-based management server, which can be an external server or a DGX system.
Network connectivity between the aggregator node and compute nodes.
Software Requirements
Ensure Docker (minimum version 20.10.21) and Docker Compose (compatible version) are installed on the system.
A multinode-capable NVSM version (24.03.05 or later) must be installed on the compute nodes.
Ensure that the clocks of the aggregator node and all compute nodes are synchronized via NTP. Because compute nodes and the aggregator node communicate over MQTT, the timestamps on MQTT messages must be in sync.
Note: If the compute node time and aggregator time are not synchronized, compute node MQTT messages will be assumed stale and NVSM CLI commands will error out with ServiceUnavailable.
The aggregator and all compute nodes must set up time synchronization using services such as ntpdate or chrony.
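As a sketch, assuming chrony on Ubuntu, time synchronization can be set up and verified as follows (point chrony at your site NTP server in /etc/chrony/chrony.conf if needed):
$ sudo apt-get install chrony
$ sudo systemctl enable --now chrony
$ # Verify that the clock is synchronized
$ chronyc tracking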
Setup Details
Each cluster includes an Aggregator Node (the container), which runs on x86 Ubuntu servers, and multiple Compute Nodes (DGX systems). The Aggregator Node system must have access to the host network and the management (BMC) network. Compute Nodes connect to the Aggregator Node over the MQTT server hosted by the Aggregator Node: NVSM running on the Compute Nodes connects to NVSM running on the Aggregator Node. The Aggregator Node also includes an NVSM exporters dashboard, serving as a single interface to view all connected nodes, sensor data, health, and so on.
Security Warning: Docker Group Access Risks
When deploying the multinode NVSM aggregator, it runs as the root user inside a Docker container. However, due to Docker's security model, any user who is a member of the Docker group on the host system can execute commands inside any running container, including the NVSM aggregator. This means that, unlike the single-node case where only the root user can run NVSM commands, non-root users who belong to the Docker group on the control plane (baremetal) node can also access and execute NVSM commands within the container. Administrators should be aware that this is a fundamental property of Docker: membership in the Docker group effectively grants root-level access to all containers. To maintain security, it is essential to restrict Docker group membership to trusted administrators only and to secure the control plane node accordingly, as this node becomes a critical point of control for the entire cluster.
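As a quick audit sketch, you can list the current Docker group members and remove any untrusted user (the username below is a placeholder):
$ # List members of the docker group
$ getent group docker
$ # Remove an untrusted user from the docker group
$ sudo gpasswd -d untrusted-user docker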
Packages
Aggregator - The aggregator image contains a Docker container pre-packaged with NVSM, an MQTT server, and the nvsm-exporter stack.
Node Provisioner - The node provisioner image contains an Ansible playbook that provisions the aggregator node and compute nodes in the cluster.
Prometheus/Grafana (Optional) - If clients already have their own Prometheus/Grafana running, deploy only the aggregator container.
Setup Instructions
Ensure the latest NVSM (multinode-supported version 24.03.05 or above) is installed on all compute nodes. The aggregator container is pre-packaged with the latest NVSM.
Installing Docker
To install Docker and Docker Compose on the Aggregator node, run the following commands:
$ sudo apt-get install docker.io
$ sudo apt-get install docker-compose
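Before proceeding, you can verify that the installed versions meet the minimum requirement (Docker 20.10.21); either the compose plugin or the standalone docker-compose binary satisfies the provisioner:
$ docker --version
$ docker compose version || docker-compose --version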
Provision Aggregator Node
Download Docker Images
Download the required containers listed below from https://catalog.ngc.nvidia.com/containers:
NVSM-Aggregator
NVSM-Provision
NVSM-Grafana
NVSM-Prometheus
Load Docker Images
Run the following command to load the docker images:
$ docker load -i nvsm-prometheus_25.03.05.tar.gz
$ docker load -i nvsm-grafana_25.03.05.tar.gz
$ docker load -i nvsm-provision_25.03.05.tar.gz
$ docker load -i nvsm-aggregator_25.03.05.tar.gz
Run the following command to ensure the container images are present:
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
nvcr.io/nvstaging/cloud-native/nvsm-grafana 25.03.05 12e4ebaee709 3 hours ago 541MB
nvcr.io/nvstaging/cloud-native/nvsm-prometheus 25.03.05 68518ef28efc 3 hours ago 376MB
nvcr.io/nvstaging/cloud-native/nvsm-provision 25.03.05 3869ef50cbbd 3 hours ago 518MB
nvcr.io/nvstaging/cloud-native/nvsm-aggregator 25.03.05 eb6068a414be 3 hours ago 1.55GB
Create Inventory YAML File
The inventory file stores information about all compute nodes, such as BMC IP, host IP, and encrypted passwords. Using the inventory file, the node provisioner container invokes an Ansible playbook that copies certificates from the aggregator node to all compute nodes and restarts NVSM on the compute nodes so that they connect to the aggregator node's NVSM.
Since the inventory.yaml file contains usernames and passwords, access to it must be restricted to the admin user only.
Use the below sample file to create an inventory.yaml file:
aggregator:
  hosts:
    # Add the aggregator host here
    # Example:
    #
    # aggregator.example.com:
  vars:
    # Aggregator
    nvsm_exporter_port: 9123
    nvsm_aggregator_image: "nvcr.io/nvstaging/nvsm/nvsm-aggregator:@PACKAGE_VERSION@"
    nvsm_aggregator_container: "nvsm-aggregator" # Name of nvsm container
    nvsm_aggregator_network: "nvsm" # Name of nvsm container network
    # Dashboard stack - Prometheus & Grafana
    nvsm_enable_dashboard: true
    nvsm_prometheus_image: "nvcr.io/nvstaging/nvsm/nvsm-prometheus:@PACKAGE_VERSION@"
    nvsm_grafana_image: "nvcr.io/nvstaging/nvsm/nvsm-grafana:@PACKAGE_VERSION@"
    nvsm_grafana_port: 3000
    nvsm_grafana_admin_user: "nvsm"
    nvsm_grafana_admin_password: "nvsm"
    # API Gateway
    nvsm_api_gateway_port: 273
compute:
  hosts:
    # Add compute nodes here; each node should have an nvsm_id and a bmc_ip
    # Examples:
    #
    # dgx01.example.com:
    #   nvsm_id: 1
    #   bmc_ip: "192.168.10.1"
    # dgx02.example.com:
    #   nvsm_id: 2
    #   bmc_ip: "192.168.10.2"
    #   # Overwrite the group vars if required,
    #   # for example, if the host has a different user/password
    #   ansible_user: "sshuser02"
    #   ansible_ssh_pass: "sshpwd02"
    #   ansible_sudo_pass: "sshpwd02"
    #   bmc_pass: "bmcpwd02"
    # dgx03.example.com:
    #   nvsm_id: 3
    #   bmc_ip: "192.168.10.3"
    #   ansible_user: "sshuser03"
    #   # Use a literal block scalar if the password contains special characters like double quotes (")
    #   # The password here is "specialSSHPass" (including the double quotes)
    #   ansible_ssh_pass: |-
    #     "specialSSHPass"
    #   ansible_sudo_pass: |-
    #     "specialSSHPass"
    #   bmc_pass: |-
    #     "specialBMCPass"
  vars:
    # The BMC user/password applies to all compute hosts;
    # it can be overridden by host variables if some hosts have different passwords
    bmc_user: "admin"
    bmc_pass: "admin"
all:
  children:
    compute:
    aggregator:
  vars:
    # Aggregator variables
    nvsm_mqtt_host: "aggregator.example.com"
    nvsm_mqtt_port: 8883
    # The SSH user/password applies to all aggregator and compute hosts;
    # it can be overridden by host variables if some hosts have different passwords
    # The password here is sshpwd (without the double quotes)
    ansible_user: "sshuser"
    ansible_ssh_pass: "sshpwd"
    ansible_sudo_pass: "sshpwd"
    # Force reprovision
    nvsm_force_reprovision: false
Perform the following steps on the Aggregator node:
$ mkdir -p /etc/nvsm/aggregator
$ vim /etc/nvsm/aggregator/inventory.yaml
$ sudo chown root:root /etc/nvsm/aggregator/inventory.yaml
$ sudo chmod 0600 /etc/nvsm/aggregator/inventory.yaml
Update the inventory file /etc/nvsm/aggregator/inventory.yaml with the following information (a minimal filled-in example follows the list):
Add a host to the hosts section under aggregator; it can be the hostname or host IP address of the aggregator node.
Change the nvsm_mqtt_host variable (in the Aggregator variables block under the vars of all) to the hostname or host IP address of the aggregator node.
Add hosts to the hosts section under compute; they can be hostnames or host IP addresses of the compute nodes.
Each compute node host must have a unique nvsm_id, which identifies the compute node when it connects to the aggregator node. nvsm_id can range from 1 to 1023.
Each compute node host must have a unique bmc_ip, which specifies the BMC IP associated with that node.
Specify the SSH user/password for each compute node by updating ansible_user, ansible_ssh_pass, and ansible_sudo_pass under the hosts section. The vars variables under all apply to all hosts, and variables defined under a host section override the vars variables.
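For illustration, the host-specific portions of a filled-in inventory for one aggregator and two compute nodes might look like the following sketch; all hostnames, IPs, and credentials are placeholders, and the remaining vars from the sample file above keep their values:
aggregator:
  hosts:
    aggregator.example.com:
compute:
  hosts:
    dgx01.example.com:
      nvsm_id: 1
      bmc_ip: "192.168.10.1"
    dgx02.example.com:
      nvsm_id: 2
      bmc_ip: "192.168.10.2"
  vars:
    bmc_user: "admin"
    bmc_pass: "admin"
all:
  children:
    compute:
    aggregator:
  vars:
    nvsm_mqtt_host: "aggregator.example.com"
    nvsm_mqtt_port: 8883
    ansible_user: "sshuser"
    ansible_ssh_pass: "sshpwd"
    ansible_sudo_pass: "sshpwd"
    nvsm_force_reprovision: false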
Inventory File Details
The inventory.yaml file is an Ansible inventory file; see the Ansible documentation for details.
The inventory file has a group called "all", which contains 2 sub-groups:
aggregator group - contains the host list of aggregator hosts. Currently only one host is supported.
compute group - contains all DGX nodes.
Vars defined in the 'all' group apply to all hosts (both aggregator and compute). Vars defined in the 'compute' group apply to the compute nodes but not to the aggregator group, and they override values from the 'all' group when the same var is defined in both. Vars defined on a host apply to that host only and override values defined in any group.
Examples: If all hosts share the same SSH user/password, define ansible_user, ansible_ssh_pass, and ansible_sudo_pass under all.vars. If a particular host has a different user/password, define it under that host, for example:
dgx03.example.com:
  nvsm_id: 3
  bmc_ip: "192.168.10.3"
  # overwrite the group vars if required
  ansible_user: "sshuser"
  ansible_ssh_pass: "sshpwd"
  ansible_sudo_pass: "sshpwd"
Run Docker Command to Provision Nodes
Start Aggregator Container and Provision Compute Nodes
Run the provision container with the configuration listed in the command below, using the image ID for the provision container found with the docker images command. The provisioner will start the Aggregator, Grafana, and Prometheus containers using the configuration specified in the inventory file.
The inventory file stores information about all compute nodes: BMC IP, host IP, and passwords. Using the inventory file, the provision container invokes an Ansible playbook that copies certificates from the aggregator node to the compute nodes and restarts NVSM on the compute nodes so that they connect to the aggregator node's NVSM.
$ docker run -it --rm -v /etc/nvsm/aggregator:/etc/nvsm/mn nvcr.io/nvstaging/nvsm/nvsm-provision:25.03.05
Sample Output:
Verifying multinode inventory file... Inventory file is not encrypted. Please encrypt it using 'nvsm MultinodeInventory encrypt'
Inventory validation completed successfully! Starting provisioning...
PLAY [NVSM Aggregator node precheck] ***********************************************************************************************************************
TASK [Ping aggregator node] **********************************************************************************************************************************
ok: [aggregator.example.com]
TASK [Get inventory file stat] ***********************************************************************************************************************************
ok: [aggregator.example.com]
TASK [Check inventory file permission] *************************************************************************************************************************
ok: [aggregator.example.com] => {
"changed": false,
"msg": "All assertions passed"
}
TASK [Check if docker command is available] ******************************************************************************************************************
ok: [aggregator.example.com]
TASK [Check if docker compose plugin is available] ************************************************************************************************************
ok: [aggregator.example.com]
TASK [Using 'docker compose'] ********************************************************************************************************************************
ok: [aggregator.example.com]
TASK [Check if docker-compose command is available] ********************************************************************************************************
skipping: [aggregator.example.com]
TASK [Using 'docker-compose'] *******************************************************************************************************************************
skipping: [aggregator.example.com]
TASK [Show docker compose command] ********************************************************************************************************************
ok: [aggregator.example.com] => {
"msg": "Using 'docker compose'"
}
PLAY [NVSM Worker node precheck] ************************************************************************************************************************
TASK [Ping compute node] **********************************************************************************************************************************
ok: [192.168.10.1]
TASK [Print NVSM Node ID] *********************************************************************************************************************************
ok: [192.168.10.1] => {
"msg": "Precheck node ID 1"
}
TASK [Get Packages Facts] *********************************************************************************************************************************
ok: [192.168.10.1]
TASK [Install NVSM] ****************************************************************************************************************************************
skipping: [192.168.10.1]
TASK [Get Packages Facts] *********************************************************************************************************************************
skipping: [192.168.10.1]
TASK [Get current NVSM version] ***************************************************************************************************************************
ok: [192.168.10.1]
TASK [Check if NVSM version meet requirement] ***********************************************************************************************************
skipping: [192.168.10.1]
PLAY [NVSM Aggregator node provision] *******************************************************************************************************************
TASK [Check if aggregator was provisioned] ****************************************************************************************************************
ok: [aggregator.example.com]
TASK [Check if aggregator container exists] ******************************************************************************************************************
ok: [aggregator.example.com]
TASK [Remove existing aggregator container] *****************************************************************************************************************
changed: [aggregator.example.com]
TASK [Create Aggregator docker-compose.yml] ***************************************************************************************************************
ok: [aggregator.example.com]
TASK [Check if aggregator container exists] ***************************************************************************************************************************
ok: [aggregator.example.com]
TASK [Start aggregator container] *****************************************************************************************************************************
changed: [aggregator.example.com]
TASK [Wait for NVSM keyfiles to be created] *******************************************************************************************************************
ok: [aggregator.example.com] => (item=nvsm-ca.crt)
ok: [aggregator.example.com] => (item=nvsm-client.crt)
ok: [aggregator.example.com] => (item=nvsm-client.key)
TASK [Make sure NVSM MQTT server is ready] ****************************************************************************************************************************
ok: [aggregator.example.com]
PLAY [NVSM compute node provision - phase2] **************************************************************************************************************************
TASK [Copy keyfiles] *********************************************************************************************************************************************
changed: [192.168.10.1] => (item=nvsm-ca.crt)
changed: [192.168.10.1] => (item=nvsm-client.crt)
changed: [192.168.10.1] => (item=nvsm-client.key)
TASK [Copy using inline content] *********************************************************************************************************************************
changed: [192.168.10.1]
TASK [Restart NVSM] *******************************************************************************************************************************************
changed: [192.168.10.1]
PLAY [NVSM Aggregator node - post config] *********************************************************************************************************************
TASK [Restart nvsm-exporter] ************************************************************************************************************************************
changed: [aggregator.example.com]
TASK [Reload nvsm-lifecycled] ************************************************************************************************************************************
changed: [aggregator.example.com]
PLAY [NVSM Aggregator node postcheck] ************************************************************************************************************************
TASK [Post check] ************************************************************************************************************************************************
ok: [aggregator.example.com] => {
"msg": "NVSM Aggregator node postcheck"
}
PLAY [NVSM Worker node postcheck] *****************************************************************************************************************************
TASK [Post check] ************************************************************************************************************************************************
ok: [aggregator.example.com] => {
"msg": "NVSM Aggregator node postcheck node ID 1"
}
PLAY RECAP *******************************************************************************************************************************************************
192.168.10.1 : ok=25 changed=7 unreachable=0 failed=0 skipped=5 rescued=0 ignored=0
Check that all containers are running on the Aggregator node:
$ sudo docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS NAMES PORTS
e7f9a56e6ef4 nvcr.io/nvstaging/cloud-native/nvsm-grafana:25.03.05 "/bin/bash -c ./entr_" About an hour ago Up About an hour nvsm-grafana
2fe6fb57ed2a nvcr.io/nvstaging/cloud-native/nvsm-prometheus:25.03.05 "/bin/bash -c ./entr_" About an hour ago Up About an hour nvsm-prometheus
fff3ed636853 nvcr.io/nvstaging/cloud-native/nvsm-aggregator:25.03.05 "/bin/bash -c /usr/b_" About an hour ago Up About an hour nvsm-aggregator 0.0.0.0:3000->3000/tcp, :::3000->3000/tcp, 0.0.0.0:8883->8883/tcp, :::8883->8883/tcp, 0.0.0.0:9123->9123/tcp, :::9123->9123/tcp
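As a quick sanity check that the aggregator's published ports respond (port numbers follow the inventory defaults above):
$ # Exporter metrics endpoint
$ curl -s http://localhost:9123/metrics | head
$ # Grafana (3000), MQTT (8883), and exporter (9123) listeners
$ ss -tln | grep -E ':(3000|8883|9123)'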
Connect to Multinode NVSM
After provisioning, the compute nodes should connect to the aggregator node's NVSM within a few minutes (3-5 minutes). You can then inspect NVSM multinode by logging in to the aggregator container.
Enter Aggregator Container
Log in to the Aggregator container using the following command:
$ docker exec -it nvsm-aggregator /bin/bash
Examining the Aggregator NVSM Services
Run the following command on the Aggregator container to check running NVSM services:
$ nvsm status
SERVICE ENABLED ACTIVE SUB DESCRIPTION
================================================================================
nvsm-exporter enabled active running NVSM Exporter to provide DGX System Management Metrics
nvsm-lifecycled enabled active running NVSM aggregator lifecycle manager
nvsm-mqtt enabled active running MQTT broker for NVSM API for signaling within NVSM API components
Examine the nvsm_core instances running in the aggregator:
$ nvsm_lifecycle status
Hostname/IP Node Command PID Stat StartCount CrashReason CrashCode
10.33.1.23 23 /usr/sbin/nvsm_core -mode server SERVE 23 99 Running 2 None 0
10.33.1.24 24 /usr/sbin/nvsm_core -mode server SERVE 24 105 Running 2 None 0
Examine the status for a given node:
$ nvsm show status --node 23
Node ID: 23
Service Status:
SERVICE STATUS
Aggregator node nvsm_core Running
Compute node nvsm_core Inactive
Run NVSM CLI commands for Cluster Nodes
You can run all supported NVSM CLI/show commands targeting any cluster node, including show/dump health.
Examples of running NVSM show commands targeted at a compute node:
$ nvsm show power --node 1
$ nvsm show gpus --node 2
$ nvsm show storage --node 3
$ nvsm show network --node 4
$ nvsm show alerts --node 5
$ nvsm show health --node 6
$ nvsm show health --node 7
Example of running the NVSM health command on all compute nodes:
$ nvsm show health
$ nvsm show health --node all
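To capture per-node health reports into separate files, a simple shell loop over node IDs also works; the IDs and file names below are examples:
$ for n in 1 2 3; do nvsm show health --node "$n" > "health_node_${n}.txt"; done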
Example of collecting an NVSM dump from a compute node and storing it on the aggregator node:
Dump collection is executed on the compute node, and the dump is copied back to the aggregator node at the output directory path.
$ nvsm dump health --node 1
Example of running NVSM CLI commands interactively on any compute node:
$ nvsm
nvsm-> show
nvsm-> cd 1
nvsm-> show
(iterate over any resource)
Accessing Aggregator Services
From the aggregator node, you can access the exporter, which has information for all cluster nodes:
nvsm-exporter - http://[aggregator Hostname/IP]:9123/metrics
grafana - http://[aggregator Hostname/IP]:3000
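If you run your own Prometheus instead of the bundled one (see Packages above), a scrape job against the exporter might look like this sketch; the job name and target are placeholders:
scrape_configs:
  - job_name: "nvsm-exporter"
    static_configs:
      - targets: ["aggregator.example.com:9123"]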
Admin Tasks
Securing Multinode Inventory File
The multinode inventory file contains the credentials of all systems and BMCs. NVSM provides a set of commands to encrypt and manage the file.
A user can encrypt or decrypt the multinode inventory file at any time without impacting running NVSM services.
Encrypting Multinode Inventory File
The following command encrypts the multinode inventory file:
$ nvsm MultinodeInventory encrypt
Enter to set keystore password:
Confirm keystore password:
Initializing keystore...
Keystore initialized successfully.
Encrypting inventory file...
Inventory file encrypted successfully.
When running the command for the first time, you are prompted to set a keystore password. NVSM then generates a random encryption key, using it to encrypt the multinode inventory file.
Note: Ensure the Admin remembers this password.
Enter this same keystore password for all subsequent NVSM MultinodeInventory commands.
Viewing Multinode Inventory File
The following command decrypts the multinode inventory file and prints it to the console:
nvsm MultinodeInventory view
Enter keystore password:
aggregator:
hosts:
....
Editing Multinode Inventory File
The following command edits the multinode inventory file:
nvsm MultinodeInventory edit
Enter the keystore password:
-> Edit your inventory file
The inventory file was updated successfully.
This decrypts the multinode inventory into a temporary file and opens it in the vi editor. After you save the temporary file, NVSM validates it and writes it back encrypted.
Decrypting Multinode Inventory File
The following command decrypts the multinode inventory file, although doing so is strongly discouraged:
nvsm MultinodeInventory decrypt
Enter keystore password:
Decryption successful
Inventory file decrypted successfully
Using a New Encryption Key
NVSM does not support setting the encryption key directly. If the encryption key is leaked, however, a new encryption key can be generated by decrypting and re-encrypting the multinode inventory:
nvsm MultinodeInventory decrypt
Enter keystore password:
Decryption successful
Inventory file decrypted successfully
nvsm MultinodeInventory encrypt
Enter to set keystore password:
Confirm keystore password:
Initializing keystore...
Keystore initialized successfully.
Encrypting inventory file...
Inventory file encrypted successfully.
Ensure you change your keystore password in this scenario.
Changing the Keystore Password
The following command changes the keystore password:
nvsm MultinodeInventory passwd
Enter current keystore password:
Enter new password:
Confirm new password:
Keystore password changed successfully.
Add New Nodes
On the Aggregator node, edit the multinode inventory file. Add new compute nodes to the compute hosts section, and rerun the multinode provisioning.
nvsm MultinodeInventory edit
docker run -it --rm -v /etc/nvsm/aggregator:/etc/nvsm/mn nvcr.io/nvstaging/nvsm/nvsm-provision:25.03.05
Removing Nodes
On the aggregator node, remove compute nodes with the MultinodeInventory command:
nvsm MultinodeInventory remove [comma-separated hostnames]
This removes the multinode configs on the compute nodes and restarts the NVSM service on each compute node as a standalone node.
If a compute node is not accessible during removal but comes back later, the admin is responsible for removing the directory /etc/nvsm/mn and restarting the NVSM service manually on that compute node:
rm -rf /etc/nvsm/mn
systemctl restart nvsm
The compute node is then removed from the inventory file, regardless of whether the above steps were performed:
$ # removing 1 node
$ nvsm MultinodeInventory remove dgx01.example.com
$ # removing multiple nodes with a comma-separated list
$ nvsm MultinodeInventory remove dgx01.example.com,dgx02.example.com
$ # It's recommended to back up the inventory file with the "-b" option
$ nvsm MultinodeInventory remove -b dgx01.example.com,dgx02.example.com
NVSM OOB Telemetry Collection
Overview
The NVSM Telemetry feature collects system metrics from the BMC and exports them to InfluxDB for analysis and visualization. The OOB telemetry feature is supported on Hopper and Blackwell DGX systems onwards.
Note: Telemetry collection is based on BMC Redfish URIs. This feature is functional only from the multinode Aggregator node.
Enabling Telemetry
The following command enables telemetry:
nvsm enable telemetry
Enable telemetry collection and export to InfluxDB? [yes/no]: yes
Telemetry collection enabled.
After enabling:
NVSM starts metrics collection based on platform configuration.
NVSM initiates InfluxDB export if configured.
NVSM enables periodic collection of Redfish URI metrics at specified intervals.
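To confirm that data is flowing after enablement, you could query a node's bucket for recent points; this is a sketch using the bucket naming convention described under Exporting Data from InfluxDB below:
$ influx query 'from(bucket:"nvsm-1") |> range(start: -5m) |> limit(n: 5)' --org nvidia --token <token-id>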
Disabling Telemetry
The following command disables telemetry:
nvsm disable telemetry
Disable telemetry collection and export to InfluxDB? [yes/no]: yes
Telemetry collection disabled.
After disabling:
NVSM stops metric collection and InfluxDB export.
NVSM halts ongoing telemetry operations.
Exporting Data from InfluxDB
InfluxDB is a time-series database used by NVSM to store telemetry metrics. It is optimized for high write and query loads, making it ideal for storing telemetry metrics data.
In the NVSM telemetry implementation:
Data is stored in the "nvidia" organization; the default authentication token is pre-configured.
The bucket for each compute node corresponds to nvsm-<node_id>.
Data is stored with timestamps, allowing historical analysis.
The aggregator container includes the required packages, including InfluxDB and the influx CLI.
$ # Export all data
$ influx query 'from(bucket:"nvsm-1") |> range(start: 0)' --org nvidia --token <token-id> --raw > all_data.csv
$ # Export the last 24 hours of data
$ influx query 'from(bucket:"nvsm-1") |> range(start: -24h)' --org nvidia --token <token-id> --raw > 24hours_data.csv
$ # Export filtered metrics
$ influx query 'from(bucket:"nvsm-1") |> range(start: 0) |> filter(fn: (r) => r._measurement == "thermal")' --org nvidia --token <token-id> --raw > thermal_data.csv
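To discover which per-node buckets exist before querying, the bundled influx CLI can list them; the token is the pre-configured one mentioned above:
$ influx bucket list --org nvidia --token <token-id>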