GB200 Rack Power On and Bring Up#
The GB200 rack bring up process can be summarized as follows. While this can be done in any order, it is advised that the NVLink Switches be brought up first so that the NVLink domain is already up and configured when the GB200 compute trays are powered on. Otherwise, all the compute nodes will have to be restarted to ensure they are able to communicate on the NVLink fabric correctly.
GB200 Compute Tray Bring Up Summary
Establish and confirm power control of the devices in the rack being brought up.
Power on the compute nodes to provision them, either physically or through the BMC (using its KVM software or ipmitool).
After provisioning, log in to the nodes from the head node and confirm the state of the NICs (all bonds up and all connections up, at least for the north-south networking).
After successful provisioning and bring up, assess the firmware status:
Check if the firmware levels match those reported in the SBOM.
Update the components if necessary.
NVLink Switch Tray Bring Up Summary
Check if the NVLink Switch is reachable via SSH (via the admin user on the COMe0 network)
Check if the NVLink Switch BMCs are reachable via:
BCM device power status.
SSH directly into the BMC.
ipmitool.
Install cm-lite-daemon
- If successful, the NVLink Switches will show as UP under the device list. It will then configure NMX-C and NMX-T automatically.
- If cm-lite-daemon does not install successfully and the installer needs the NVLink domain up in order to proceed, select an NVLink Switch to serve as the master and configure NMX-C and NMX-T manually.
- After connectivity has been established on the COMe0 network and the BMC, assess the firmware status.
- Check if the firmware levels match those reported in the SBOM. Update the components if necessary.
- Check the version of NVOS and update the OS either on each switch individually or through ZTP.
Power Shelves Bring Up Summary
Confirm that the power shelves report as on. If they do not, try bouncing the ports from the switch side until their status shows as up.
Perform firmware updates.
NIC Firmware Update Bring Up Summary
Verify NIC firmware versions and update them if needed.
Initial PowerOn and Provisioning#
Upon completion of the rack import process or manual configuration of the cluster compute and control node entries, the GB200 racks are ready to be brought up. With the rf0 (redfish 0) ports configured within BCM with a MAC-to-IP mapping, all the GB200 compute trays, NVLink Switch devices, and power shelves will get their IP addresses when the respective ipminet is up.
GB200 Compute Tray PowerOn and Provisioning#
Ensure that OOB power control of the GB200 compute trays is configured.
Check power status:
Individual Node:
cmsh -c "device use <DGX GB200 compute tray>; power status"
All devices in a rack:
cmsh -c "device; power -r <rack number> status"
If the output says Skipped, that likely means the power control is not set.
Set power control settings:
One node:
cmsh -c "device use <rack location>-<pod number>-dgx-<rack number>-c<node number>;set powercontrol rf0; commit"
All nodes in a rack:
cmsh -c "foreach -n <rack location>-<pod number>-dgx-<rack number>-c[01-18] (set powercontrol rf0; commit)"
If the output says failed, BCM can reach the BMC/rf0 but the credentials are incorrect. Check the bmcsettings at the category level.
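The two failure cases described above (Skipped and failed) can be picked out of a rack-wide status check with a small filter. This is a minimal sketch: the function name triage_power_status is hypothetical, and it assumes only that each output line names one device and contains the word Skipped or failed, since the exact cmsh output layout is not shown in this guide.

```shell
# Triage `power status` output read from stdin.
# ASSUMPTION: each line names one device and contains the word
# "Skipped" or "failed" when the status query did not succeed.
triage_power_status() {
    while IFS= read -r line; do
        case "$line" in
            *Skipped*)   printf 'power control not set: %s\n' "$line" ;;
            *[Ff]ailed*) printf 'check BMC credentials: %s\n' "$line" ;;
        esac
    done
}
```

It could be used as: cmsh -c "device; power -r <rack number> status" | triage_power_status. Lines for devices that report a normal power state are not printed.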
Confirm that web GUI access to the BMC of a GB200 node is available.
- Depending on the network configuration, the head node may need to be used as a jump point to reach the GB200 compute tray BMC web GUI.
Open a web browser such as Firefox and configure its proxy settings.
Open the BMC web GUI at https://<bmc ip>
Power on one node and watch the boot and provisioning process
Power on with BCM:
cmsh -c "device use <compute node under test>; power on"
Alternatively, power on through the server power control in the BMC web GUI.
Note
For the GB200 Compute trays to reset properly through the “power” command, a delay needs to be set in the partition settings.
Configure power reset delay:
cmsh -c "partition; show bmcsettings"
[a03-p1-head-01->partition[base]->bmcsettings]% show
Parameter                        Value
-------------------------------- ------------------------------------------------
Revision
User name                        bright
Password                         ********
User ID                          4
Power reset delay                5s <-- set this to do : power off + sleep(5) + power on
During the boot up of a GB200 compute tray:
Watch the node installer log to look for any issues during the provisioning.
tail -f /var/log/node-installer
Watch the syslog for any issues/errors
tail -f /var/log/syslog | grep -i cmd
Check cmsh to confirm the nodes are in an UP state.
Once one of the GB200 compute trays provisions successfully, proceed to power on and provision the rest of the nodes.
NVLink Switch Tray PowerOn and Configuration#
Confirm OOB power control of the NVLink Switches using a power status check in BCM. This is very similar to what was done for a GB200 compute tray.
Verify that an NVLink Switch can be reached through ssh from the head node using the admin user. The password may have been set initially during the bcm-netautogen process or when the switch completed the ZTP process.
SSH to NVLink Switch:
ssh admin@<rack location>-NVSW-01
If the NVLink Switch password for the admin user does not work, it is likely in a default (factory reset) state; try the password admin.
Note
If that login works, it will require a change of password. Set the password to match what has been put into BCM for the NVLink Switch entry under set net access settings.
If the NVLink Switch is powered on, install cm-lite-daemon on the switches. If ZTP is enabled, this is intended to happen automatically; however, the automatic installation currently has a known bug, so the manual process must be used.
Install cm-lite-daemon:
cmsh -c "device; litedaemon download"
Prepare files
cmsh -c "device; litedaemon download <NVLink Switch>"
A single switch install
cmsh -c "device; litedaemon install <NVLink Switch>"
Installs in parallel
cmsh -c "device; litedaemon install -n switch[01-09]"
Note
If the NVLink Switch gives a permission denied error, check in the access settings submenu that the username and password are correct and match what was given to the bcm-netautogen tool (if used) or what was set manually when the rack imports were created.
Example litedaemon install output:
[T06-HEAD-01->device]% litedaemon install -n s03-p1-nvsw-[01-09]
<single switch output>
*** S03-P1-NVSW-01 ***
success: yes
--- stdout ---
* Download: https://10.78.192.25:8081/switch/packages/2004/cm-config-cm_11.0_all.deb *
* Download: https://10.78.192.25:8081/switch/packages/2004/cm-lite-daemon_11.0-100683-cm11.0-e39d6d5b1f_all.deb *
* Download: https://10.78.192.25:8081/switch/packages/2004/cm-openssl_3.1.8-100198-cm11.0-026d50d170_amd64.deb *
* Download: https://10.78.192.25:8081/switch/packages/2004/cm-python312_3.12.9-100029-cm11.0-b1b8aeec10_amd64.deb *
* Download: https://10.78.192.25:8081/switch/packages/2004/cm-python3_11.0-100147-cm11.0-f01da5dcae_amd64.deb *
* Install *
Selecting previously unselected package cm-config-cm.
(Reading database ... 49971 files and directories currently installed.)
Preparing to unpack cm-config-cm_11.0_all.deb ...
Unpacking cm-config-cm (11.0) ...
Selecting previously unselected package cm-lite-daemon.
Preparing to unpack cm-lite-daemon_11.0-100683-cm11.0-e39d6d5b1f_all.deb ...
Unpacking cm-lite-daemon (11.0-100683-cm11.0-e39d6d5b1f) ...
Selecting previously unselected package cm-openssl.
Preparing to unpack cm-openssl_3.1.8-100198-cm11.0-026d50d170_amd64.deb ...
Unpacking cm-openssl (3.1.8-100198-cm11.0-026d50d170) ...
Selecting previously unselected package cm-python312.
Preparing to unpack cm-python312_3.12.9-100029-cm11.0-b1b8aeec10_amd64.deb ...
Unpacking cm-python312 (3.12.9-100029-cm11.0-b1b8aeec10) ...
Selecting previously unselected package cm-python3.
Preparing to unpack cm-python3_11.0-100147-cm11.0-f01da5dcae_amd64.deb ...
Unpacking cm-python3 (11.0-100147-cm11.0-f01da5dcae) ...
Setting up cm-config-cm (11.0) ...
Setting up cm-openssl (3.1.8-100198-cm11.0-026d50d170) ...
Setting up cm-python312 (3.12.9-100029-cm11.0-b1b8aeec10) ...
Setting up cm-python3 (11.0-100147-cm11.0-f01da5dcae) ...
Setting up cm-lite-daemon (11.0-100683-cm11.0-e39d6d5b1f) ...
* Setup *
* Setup from /home/admin *
* Certificates *
- cluster.pem
- bootstrap.pem
- bootstrap.key
* Register *
* Done *
NVLink Switch ZTP Process Execution#
If ZTP settings have been configured per the instructions in ZTP Settings Configuration in BCM (Optional), follow these steps to execute the ZTP process on NVLink Switches.
Restart the NVLink Switch, or reset it if it was previously configured, to start the ZTP process.
Option 1: Restart switch using BMC API
curl -k -u <user>:<password> -H "Content-Type: application/json" -X POST https://<bmc_ip>/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset -d '{"ResetType": "GracefulRestart"}'
Option 2: Reset to factory setting from NVUE (if accessible)
nv action reset system factory-default
Monitor ZTP status from BCM.
tail -f /var/log/cmdaemon | grep <switch_name>
Monitor the log output for ZTP completion messages and status updates.
Once ZTP is finished, validate the username and password are working and confirm ZTP status within BCM.
cmsh -c "device; use <switch_name>; ssh"
Log in to the NVLink Switch NVOS:
sudo ztp status
exit
Expected output:
admin@nvsw01:~$ sudo ztp status
ZTP Admin Mode : True
ZTP Mode : True
ZTP Service : Inactive
ZTP Status : SUCCESS
ZTP Source : dhcp-opt67 (eth0)
Runtime : 01m 51s
Timestamp : 2025-07-30 23:12:10 UTC
ZTP Service is not running
01-connectivity-check: SUCCESS
02-image: SUCCESS
05-commands-list: SUCCESS
06-startup-file: SUCCESS
09-provisioning-script: SUCCESS
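When several switches go through ZTP, the status check above can be reduced to a pass/fail result. The following is a minimal sketch, assuming the switch prints one field per line (for example "ZTP Status : SUCCESS" and "02-image: SUCCESS"); the function name ztp_succeeded is hypothetical and not part of NVOS or BCM.

```shell
# Return 0 only if the `sudo ztp status` text on stdin reports
# "ZTP Status : SUCCESS" and every numbered stage line
# (for example "02-image: SUCCESS") also ends in SUCCESS.
# ASSUMPTION: one field per line, as printed on the switch.
ztp_succeeded() {
    awk '
        /^ZTP Status[ \t]*:/ { if ($NF == "SUCCESS") overall = 1 }
        /^[0-9][0-9]-[^:]*:/ { if ($NF != "SUCCESS") bad = 1 }
        END { exit (overall && !bad) ? 0 : 1 }
    '
}
```

For example, saving the status text to a file and running ztp_succeeded < ztp_status.txt && echo "ZTP complete" prints the message only when the overall status and all stages succeeded.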
Initiate a lite daemon download on BCM and install it on the remote device. For detailed steps, see the cm-lite-daemon installation instructions in the NVLink Switch Tray PowerOn and Configuration section above.
Note
If the automatic process fails for cm-lite-daemon registration, proceed to manual registration procedures.
Validate the services are running successfully.
cmsh -c "device; use <switch_hostname>; latesthealthdata"
cmsh -c "device; use <switch_hostname>; latestmonitoringdata"
Example health data output:
[bcm11-headnode->device[nvsw01]]% latesthealthdata
Measurable         Parameter    Type         Value      Age        State      Info
------------------ ------------ ------------ ---------- ---------- ---------- ------------------------------------------------
ManagedServicesOk               Internal     PASS       2m 4s
diskspace                       Disk         PASS       2m 4s
dmesg                           OS           PASS       2m 4s
ntp                             Internal     FAIL       2m 4s                 time daemon not synchronized to a time server
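To spot failing checks quickly in the health data, the rows can be filtered on the Value column. This is a rough sketch that assumes each check is printed on its own row with FAIL as a standalone field, as in the example output; the function name failed_health_checks is hypothetical.

```shell
# Print the name (first column) of every check whose row contains the
# standalone field FAIL. Input: `latesthealthdata` text on stdin.
# ASSUMPTION: one check per row, with the check name in the first column.
failed_health_checks() {
    awk '{ for (i = 2; i <= NF; i++) if ($i == "FAIL") { print $1; break } }'
}
```

It could be used as: cmsh -c "device; use <switch_hostname>; latesthealthdata" | failed_health_checks. With the example output, the only name printed is ntp.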
Manual Registration of NVLink Switch#
If the NVLink Switch is not registered automatically (for example, if ZTP fails), manually register the switch using the following procedure without the litedaemon install command.
Note
This should only be used as a last resort if for some reason the litedaemon install command fails.
Prerequisites: Ensure the NVLink Switches are properly provisioned in BCM as described in the “Adding NVLink Switch to BCM” section.
Step 1: Download the cm-lite-daemon package and bootstrap files on the BCM head node
cd /root
apt update
apt install cm-lite-daemon
# List the cm-lite-daemon package and bootstrap files
ls -l /cm/shared/apps/cm-lite-daemon-dist/
# Example output:
# -rw------- 1 root root 1704 Jul 28 23:50 bootstrap.key
# -rw------- 1 root root 1285 Jul 28 23:50 bootstrap.pem
# -rw-r--r-- 1 root root 159540 Oct 31 2017 cm-lite-daemon.zip
Step 2: Copy the required files to the NVLink Switch
Copy the cm-lite-daemon package to the NVLink Switch.
scp /cm/shared/apps/cm-lite-daemon-dist/cm-lite-daemon.zip admin@<switch_ip>:/home/admin
Copy the bootstrap keys to the NVLink Switch.
scp /cm/shared/apps/cm-lite-daemon-dist/bootstrap.* admin@<switch_ip>:/home/admin
Step 3: Determine the active BCM head node IP
cmsh -c "device; use master; get ip"
Step 4: Register the NVLink Switch
SSH into the NVLink Switch:
ssh admin@<switch_ip>
Unzip the cm-lite-daemon package:
sudo unzip cm-lite-daemon.zip
# Example output:
# -rw------- 1 admin admin 1704 Jul 29 20:18 bootstrap.key
# -rw------- 1 admin admin 1285 Jul 29 20:18 bootstrap.pem
# drwxr-xr-x 7 root root 4096 Oct 31 2017 cm-lite-daemon
# -rw-r--r-- 1 admin admin 159540 Jul 29 20:17 cm-lite-daemon.zip
Copy the cm-lite-daemon directory and bootstrap keys to the appropriate locations:
sudo cp -r /home/admin/cm-lite-daemon /opt
sudo cp bootstrap.* /opt/cm-lite-daemon/etc/
Install required Python dependencies:
cd /opt/cm-lite-daemon/
sudo pip install -r requirements.txt
Register the switch with BCM:
sudo ./register_node --host <active_bcm_ip> --disable-cert-check
Note
Replace <switch_ip> and <active_bcm_ip> with the actual IP addresses for the specific environment.
This process manually registers the NVLink Switch with BCM if the automatic registration fails.
Configure NMX-C leader#
After a rack leaves the factory, the NVLink Switch configurations are generally wiped. If BCM configures the NVLink Switches via bcm-netautogen, an NMX-C leader is configured automatically when ZTP runs. However, if the NVLink Switches are configured manually, an NMX-C leader (master) needs to be assigned. Typically, this is NVSW-01, the physically lowest switch in the rack. A fabric manager topology configuration file (fm_config.cfg) needs to be created and applied.
Method 1—generate an fm_config.cfg file locally on an NVLink Switch
Enable cluster apps.
nv set cluster state enabled
nv config apply
nv config save
Show the status of the NMX-C and NMX-T applications with nv show cluster apps. If the nmx-controller status shows not ok, the fm_config file needs to be generated.
admin@a18-p1-nvsw-01:~$ nv show cluster apps
Name ID Version Capabilities Components Version Status Reason Additional Information Summary
-------------- ------------- ---------------------- --------------------------------------------------- ---------------------------------------------------------------- ------ -------- -------------------------------- -------
nmx-controller nmx-c-nvos 0.9.0_2025-02-25_16-53 sm, gfm, fib, gw-api sm:2025.01.6, gfm:R570.124.02, fib-fe:0.9.0 not ok NMXC: OK CONTROL_PLANE_STATE_UNCONFIGURED
Generate the fm_config file from the live settings. If this is the first time the file is generated after a fresh installation or a factory reset, it contains the factory default settings shipped with the corresponding version of NVOS.
nv action generate sdn config app nmx-controller type fm_config
Action executing ...
App config file nmx-controller_fm_config_20241029_042454 is successfully generated
Action succeeded
The generated file is placed in the following directory:
/host/cluster_infra/app_config/nmx-controller/fm_config/<filename>
Default fm_config file contents
STATE_FILE_NAME=/tmp/fabricmanager.state
LOG_FILE_NAME=/logs/fabricmanager.log
LOG_FILE_MAX_SIZE=100
MNNVL_API_BACKEND_ENABLED=1
FABRIC_MODE_RESTART=1
USE_RPC=1
LOG_LEVEL=4
TRUNK_LINK_FAILURE_MODE=0
FM_STAY_RESIDENT_ON_FAILURES=0
MNNVL_RESILIENCY_MODE=0
FM_CMD_PORT_NUMBER=6666
STARTING_TCP_PORT=16000
NVSWITCH_FAILURE_MODE=0
BIND_INTERFACE_IP=127.0.0.1
FABRIC_MODE=0
MNNVL_ENABLED=1
LOG_USE_SYSLOG=0
DAEMONIZE=0
ABORT_CUDA_JOBS_ON_FM_EXIT=1
FM_CMD_BIND_INTERFACE=127.0.0.1
ACCESS_LINK_FAILURE_MODE=0
TOPOLOGY_FILE_PATH=/usr/share/nvidia/nvswitch
LOG_APPEND_TO_LOG=1
MNNVL_ENABLE_DEFAULT_PARTITION=1
LOG_MAX_ROTATE_COUNT=5
ENABLE_AUTH_ENCRYPTION=0
Note
For GB200 NVL72 systems, add the following to the fm_config.cfg file.
MNNVL_TOPOLOGY=gb200_nvl72r1_c2g4_topology
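Adding the topology line by hand is easy to repeat by accident. The following is a minimal sketch of an idempotent append, assuming the generated file is a plain KEY=VALUE text file as shown in the default contents; the function name ensure_mnnvl_topology is hypothetical, and it does not change a MNNVL_TOPOLOGY line that is already present.

```shell
# Append MNNVL_TOPOLOGY=<value> to an fm_config file unless a line for
# that key already exists. $1 = file path, $2 = topology value.
# An existing MNNVL_TOPOLOGY line is left untouched, even if it differs.
ensure_mnnvl_topology() {
    cfg=$1
    topo=$2
    if grep -q '^MNNVL_TOPOLOGY=' "$cfg"; then
        return 0
    fi
    # Add a newline first if the file does not already end with one.
    if [ -n "$(tail -c 1 "$cfg")" ]; then
        printf '\n' >> "$cfg"
    fi
    printf 'MNNVL_TOPOLOGY=%s\n' "$topo" >> "$cfg"
}
```

For example: ensure_mnnvl_topology /host/cluster_infra/app_config/nmx-controller/fm_config/<filename> gb200_nvl72r1_c2g4_topology. Writing to that directory may require elevated privileges, which is an assumption about the switch environment rather than something this guide states.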
Install the configuration.
$ nv action install sdn config app nmx-controller type fm_config files <fm_config filename>
Restart NMX-C.
$ nv action stop cluster apps nmx-controller
$ nv action start cluster apps nmx-controller
Confirm that the status of nmx-controller is shown as ok.
$ nv show cluster apps
admin@a18-p1-nvsw-01:/host/cluster_infra/app_config/nmx-controller/fm_config$ nv show cluster apps
Name ID Version Capabilities Components Version Status Reason Additional Information Summary
-------------- ------------- ---------------------- --------------------------------------------------- ------ ------- --------------------------- -------
nmx-controller nmx-c-nvos 0.9.0_2025-02-25_16-53 sm, gfm, fib, gw-api sm:2025.01.6, gfm:R570.124.02, not ok NMXC: OK, CONTROL_PLANE_STATE_OFFLINE
fib-fe:0.9.0
nmx-telemetry nmx-telemetry 0.9.5 nvl telemetry, gnmi nvl-telemetry:1.20.1, ok
aggregation, syslog gnmi-aggregator:1.0.1,
aggregation nmx-connector:1.0.1
Note
If the nmx-controller is not yet shown as ok, as in the example output above, wait a few minutes and check again.
Last Resort: If the nmx-controller status does not show ok after five minutes, try rebooting the switch tray and check the status again.
Reboot system:
$ nv action reboot system
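The wait-and-recheck step above can be scripted on the switch. This is a sketch under stated assumptions: the function names controller_status and wait_for_controller are hypothetical, the 30-second polling interval is an arbitrary choice, and the parser assumes the nmx-controller row carries its status as the lowercase words ok or not ok on the same line, as in the example output.

```shell
# Print "ok", "not ok", or "unknown" for the nmx-controller row of
# `nv show cluster apps` output read from stdin.
# The match is case-sensitive, so "NMXC: OK" is not mistaken for the status.
controller_status() {
    awk '
        $1 == "nmx-controller" {
            if ($0 ~ / not ok( |$)/)  { print "not ok"; found = 1; exit }
            else if ($0 ~ / ok( |$)/) { print "ok";     found = 1; exit }
        }
        END { if (!found) print "unknown" }
    '
}

# Poll every 30 seconds for up to five minutes (10 attempts).
# Returns 0 once the status is ok, 1 if it never becomes ok.
wait_for_controller() {
    attempts=10
    while [ "$attempts" -gt 0 ]; do
        if [ "$(nv show cluster apps | controller_status)" = "ok" ]; then
            return 0
        fi
        attempts=$((attempts - 1))
        sleep 30
    done
    return 1
}
```

If wait_for_controller returns non-zero, the last-resort reboot described above is the next step.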
Method 2—Import NVLink Switch fm_config.cfg file
An NVLink Switch can import an fm_config.cfg file from other NVLink Switch devices or other file locations on the network.
Enable cluster apps.
nv set cluster state enabled
nv config apply
nv config save
Show the status of the NMX-C and NMX-T applications with nv show cluster apps.
Import the configuration file using the following command:
$ nv action fetch sdn config app nmx-controller type fm_config scp://<user>:<password>@<IP_address>/path/to/fm_config.cfg
Note
The fm_config file can be fetched from a <remote-url>; the scp, https, ftp, and sftp protocols are supported.
BCM keeps a default fm_config.cfg at
/cm/local/apps/cmd/etc/htdocs/switch/fm_config/gb200_nvl72r1_c2g4.cfg
In this example, the BCM-maintained version is used, but any configuration file created for use as the fm_config.cfg file would work.
nv action fetch sdn config app nmx-controller type fm_config scp://<user>:<password>@<IP_address>/cm/local/apps/cmd/etc/htdocs/switch/fm_config/gb200_nvl72r1_c2g4.cfg
# Results
admin@a18-p1-nvsw-02:~$ nv action fetch sdn config app nmx-controller type fm_config scp://root:nvidia123@172.16.6.31/cm/local/apps/cmd/etc/htdocs/switch/fm_config/gb200_nvl72r1_c2g4.cfg
Action executing ...
Fetching file ...
Action executing ...
File fetched successfully
Action succeeded
Note
To reach the NVLink Switch devices, the head node is configured with bond1 to reach the OOB network. Use the IP address assigned to bond1 in the example above.
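Because only the scp, https, ftp, and sftp protocols are listed as supported for the fetch, a small guard can catch an unsupported URL before it is pasted onto a switch. This is an illustrative sketch: the function name fm_config_fetch_cmd is hypothetical, and it only prints the NVUE command shown in this section rather than running it.

```shell
# Print the NVUE fetch command for a remote fm_config URL, rejecting
# any scheme other than scp, https, ftp, or sftp.
# $1 = remote URL, for example scp://<user>:<password>@<IP_address>/path
fm_config_fetch_cmd() {
    case "$1" in
        scp://*|https://*|ftp://*|sftp://*)
            printf 'nv action fetch sdn config app nmx-controller type fm_config %s\n' "$1"
            ;;
        *)
            printf 'unsupported URL scheme: %s\n' "$1" >&2
            return 1
            ;;
    esac
}
```

Printing the command instead of executing it keeps the helper usable on the head node, where the nv command is not expected to be available.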
Verify that the file was fetched successfully.
$ nv show sdn config app nmx-controller type fm_config files
admin@a18-p1-nvsw-02:~$ nv show sdn config app nmx-controller type fm_config files
Available config file File path
---------------------- ------------------------------------------------------------------------------
gb200_nvl72r1_c2g4.cfg /host/cluster_infra/app_config/nmx-controller/fm_config/gb200_nvl72r1_c2g4.cfg
Install the configuration.
$ nv action install sdn config app nmx-controller type fm_config files <fm_config filename>
Restart NMX-C.
$ nv action stop cluster apps nmx-controller
$ nv action start cluster apps nmx-controller
Confirm that the status of nmx-controller is shown as ok.
$ nv show cluster apps
Configure NMX-Telemetry#
Check if NMX-T is running:
$ nv show cluster apps
Start the telemetry service:
$ nv action start cluster apps nmx-telemetry
Confirm that the status of nmx-telemetry is shown as ok:
$ nv show cluster apps