GB200/GB300 Rack Power On and Bring Up
The GB200/GB300 rack bring up process can be summarized as follows. While this can be done in any order, it is advised that the NVLink Switches be brought up first so that the NVLink domain is already up and configured when the GB200 compute trays are powered on. Otherwise, all the compute nodes will have to be restarted to ensure they are able to communicate on the NVLink fabric correctly.
GB200/GB300 Compute Tray Bring Up Summary
Establish and confirm power control of the rack devices in the rack that is being brought up.
Power on the compute nodes to provision them (physically, or through the BMC using either its KVM software or ipmitool).
After provisioning, log in to the nodes from the head node and confirm the state of the NICs (all bonds are up and all connections are up, at least for the north-south network).
After successful provisioning and bring up, assess the firmware status:
Check if the firmware levels match those reported in the SBOM.
Update the components if necessary.
NVLink Switch Tray Bring Up Summary
Check if the NVLink Switch is reachable using SSH (via the admin user on the COMe0 network).
Check if the NVLink Switch BMCs are reachable using the following methods:
BCM device power status.
SSH directly into the BMC.
ipmitool.
Install cm-lite-daemon:
If successful, the NVLink Switches will show as UP under the device list. It will then configure NMX-C and NMX-T automatically.
If cm-lite-daemon does not install successfully and the installer needs to get the NVLink domain up to progress, select an NVLink Switch to serve as the master and configure NMX-C and NMX-T manually.
If Zero Touch Provisioning (ZTP) is enabled/configured by bcm-netautogen, cm-lite-daemon will be installed and NMX-C and NMX-T will be configured automatically.
After successful connectivity has been established on the COMe0/1 network and the BMC, assess the firmware status:
Check if the firmware levels match those reported in the SBOM. Update the components if necessary.
Check the NVOS version and update the OS either on each switch individually or through ZTP.
Power Shelves Bring Up Summary
Confirm that the power shelves report as on. If they do not, try bouncing the ports from the switch side so that they show as up.
Perform firmware updates.
NIC Firmware Update Bring Up Summary
Verify NIC firmware versions and update them if needed.
Initial Power On and Provisioning
Upon completion of the rack import process or manual configuration of the cluster compute and control node entries, the GB200/GB300 racks are ready to be brought up. With the rf0 (Redfish 0) ports configured in BCM with a MAC-to-IP mapping, all the GB200/GB300 compute trays, NVLink Switch devices, and power shelves receive their IP addresses once their respective BMCs are up.
GB200/GB300 Compute Tray Power On and Provisioning
Ensure that OOB power control of the GB200/GB300 compute trays is configured.
Check power status:
Individual Node:
cmsh -c "device use <DGX GB200/GB300 compute tray>; power status"
All devices in a rack:
cmsh -c "device; power -r <rack number> status"
If the output says Skipped, that likely means the power control is not set.
Set power control settings:
One node:
cmsh -c "device use <rack location>-<pod number>-dgx-<rack number>-c<node number>; set powercontrol rf0; commit"
All nodes in a rack:
cmsh -c "device; foreach -n <rack location>-<pod number>-dgx-<rack number>-c[01-18] (set powercontrol rf0; commit)"
If the output says failed, BCM can reach the BMC over rf0 but the credentials are incorrect. Check the bmcsettings at the category level.
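The two outcomes described above (Skipped and failed) can be picked out of a rack-wide status check with a small filter. This is a minimal sketch, assuming the words "Skipped" and "failed" appear literally on each affected node's line; the function name triage_power_status is hypothetical and not part of BCM.

```shell
# Read `power status` output on stdin and print only the lines that need action.
# Assumption: "Skipped" / "failed" appear literally on the affected node's line.
triage_power_status() {
    awk '
        # Skipped: power control is likely not set for this device
        tolower($0) ~ /skipped/ { print "set powercontrol: " $0; next }
        # failed: BMC reachable, but credentials are likely wrong
        tolower($0) ~ /failed/  { print "check bmcsettings: " $0; next }
    '
}

# Usage on the head node:
#   cmsh -c "device; power -r <rack number> status" | triage_power_status
```

Lines that match neither word are dropped, so an empty result suggests power control is working for every device in the rack.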
Confirm that the BMC web UI of a GB200 node is accessible.
Depending on the network configuration, the head node may need to be used as a jump host to reach the GB200 compute tray web UI.
Open a web browser such as Firefox and configure its proxy settings.
Open the BMC web UI at the following URL:
https://<BMC_IP>.
Power on one node and watch the boot and provisioning process.
Power on with BCM:
cmsh -c "device use <compute node under test>; power on"
Alternatively, power on through the server power control in the BMC web UI.
Note
For the GB200/GB300 Compute trays to reset properly through the “power” command, a delay needs to be set in the partition settings.
Configure power reset delay:
cmsh -c "partition; show bmcsettings"
[a03-p1-head-01->partition[base]->bmcsettings]% show
Parameter                        Value
-------------------------------- ------------------------------------------------
Revision
User name                        bright
Password                         ********
User ID                          4
Power reset delay                5s <-- set this to do : power off + sleep(5) + power on
During the boot of a GB200/GB300 compute tray:
Watch the node installer log for any issues during provisioning.
tail -f /var/log/node-installer
Watch the syslog for any issues or errors.
tail -f /var/log/syslog | grep -i cmd
Check cmsh to confirm the nodes are in an UP state.
Once one of the GB200/GB300 compute trays provisions successfully, proceed to power on and provision the rest of the nodes.
NVLink Switch Tray Power On and Configuration
Confirm OOB power control of the NVLink Switches using a power status check in BCM. This is very similar to what was done for a GB200/GB300 compute tray.
Verify that an NVLink Switch can be reached through SSH from the head node using the admin user. The password may have been set initially during the bcm-netautogen process or when the switch completed the ZTP process.
SSH to NVLink Switch:
ssh admin@<rack location>-NVSW-01
If the NVLink Switch password for the admin user does not work, it is likely in a default (factory reset) state; try the password admin.
Note
If that login works, it will require a change of password. Set the password to match what has been put into BCM for the NVLink Switch entry under set net access settings.
Once the NVLink Switch is powered on, install cm-lite-daemon on the switches. If ZTP is enabled and was configured by bcm-netautogen, this is done automatically. If ZTP was configured manually, enable cm-lite-daemon installation in the ztpsettings with set installlitedaemon yes. For detailed instructions on manually installing cm-lite-daemon, see the section Install cm-lite-daemon.
Install cm-lite-daemon
cmsh -c "device; litedaemon download"
Prepare files:
cmsh -c "device; litedaemon download <NVLink Switch>"
A single switch install:
cmsh -c "device; litedaemon install <NVLink Switch>"
Installs in parallel:
cmsh -c "device; litedaemon install -n switch[01-09]"
Note
If the NVLink Switch returns a permission denied error, check in the access settings submenu that the username and password are correct and match what was given to the bcm-netautogen tool (if used) or what was set manually when the rack imports were created.
Example output from litedaemon installation:
[T06-HEAD-01->device]% litedaemon install -n s03-p1-nvsw-[01-09]
<single switch output>
*** S03-P1-NVSW-01 ***
success: yes
--- stdout ---
* Download: https://10.78.192.25:8081/switch/packages/2004/cm-config-cm_11.0_all.deb *
* Download: https://10.78.192.25:8081/switch/packages/2004/cm-lite-daemon_11.0-100683-cm11.0-e39d6d5b1f_all.deb *
* Download: https://10.78.192.25:8081/switch/packages/2004/cm-openssl_3.1.8-100198-cm11.0-026d50d170_amd64.deb *
* Download: https://10.78.192.25:8081/switch/packages/2004/cm-python312_3.12.9-100029-cm11.0-b1b8aeec10_amd64.deb *
* Download: https://10.78.192.25:8081/switch/packages/2004/cm-python3_11.0-100147-cm11.0-f01da5dcae_amd64.deb *
* Install *
Selecting previously unselected package cm-config-cm.
(Reading database ... 49971 files and directories currently installed.)
Preparing to unpack cm-config-cm_11.0_all.deb ...
Unpacking cm-config-cm (11.0) ...
Selecting previously unselected package cm-lite-daemon.
Preparing to unpack cm-lite-daemon_11.0-100683-cm11.0-e39d6d5b1f_all.deb
...
Unpacking cm-lite-daemon (11.0-100683-cm11.0-e39d6d5b1f) ...
Selecting previously unselected package cm-openssl.
Preparing to unpack cm-openssl_3.1.8-100198-cm11.0-026d50d170_amd64.deb
...
Unpacking cm-openssl (3.1.8-100198-cm11.0-026d50d170) ...
Selecting previously unselected package cm-python312.
Preparing to unpack
cm-python312_3.12.9-100029-cm11.0-b1b8aeec10_amd64.deb ...
Unpacking cm-python312 (3.12.9-100029-cm11.0-b1b8aeec10) ...
Selecting previously unselected package cm-python3.
Preparing to unpack cm-python3_11.0-100147-cm11.0-f01da5dcae_amd64.deb
...
Unpacking cm-python3 (11.0-100147-cm11.0-f01da5dcae) ...
Setting up cm-config-cm (11.0) ...
Setting up cm-openssl (3.1.8-100198-cm11.0-026d50d170) ...
Setting up cm-python312 (3.12.9-100029-cm11.0-b1b8aeec10) ...
Setting up cm-python3 (11.0-100147-cm11.0-f01da5dcae) ...
Setting up cm-lite-daemon (11.0-100683-cm11.0-e39d6d5b1f) ...
* Setup *
* Setup from /home/admin *
* Certificates *
- cluster.pem
- bootstrap.pem
- bootstrap.key
* Register *
* Done *
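When installing on many switches in parallel, the per-switch blocks in the output can be scanned for failures instead of being read by eye. This is a minimal sketch, assuming each block begins with a header line of the form *** NAME *** followed by a success: line as in the sample above; the function name litedaemon_failures and the value "no" used for a failed install are assumptions.

```shell
# Read `litedaemon install` output on stdin and print the name of every switch
# whose block does not report "success: yes".
# Assumption: blocks start with "*** <NAME> ***", as in the sample output.
litedaemon_failures() {
    awk '
        /^\*\*\* /  { name = $2; next }               # remember the current switch
        /^success:/ { if ($2 != "yes") print name }   # report anything but "yes"
    '
}

# Usage on the head node:
#   cmsh -c "device; litedaemon install -n switch[01-09]" | litedaemon_failures
```

No output means every switch block reported success.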
NVLink Switch ZTP Process Execution
If ZTP settings have been configured per the instructions in Manual ZTP Settings Configuration in BCM, follow these steps to execute the ZTP process on NVLink Switches.
To start the ZTP process, restart the NVLink Switch, or reset it if it was previously configured.
Option 1: Restart switch using BMC API
curl -k -u <user>:<password> -H "Content-Type: application/json" -X POST https://<bmc_ip>/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset -d '{"ResetType": "GracefulRestart"}'
Option 2: Reset to factory setting from NVUE (if accessible)
nv action reset system factory-default
Monitor ZTP status from BCM.
tail -f /var/log/cmdaemon | grep <switch_name>
Monitor the log output for ZTP completion messages and status updates.
Once ZTP is finished, validate that the username and password are working and confirm ZTP status either within BCM or through SSH to the NVLink Switch directly.
Log in to the NVLink Switch NVOS:
cmsh -c "device; use <switch_name>; ssh"
From the NVLink Switch NVOS:
sudo ztp status
exit
Expected output:
admin@nvsw01:~$ sudo ztp status
ZTP Admin Mode : True
ZTP Mode : True
ZTP Service : Inactive
ZTP Status : SUCCESS
ZTP Source : dhcp-opt67 (eth0)
Runtime : 01m 51s
Timestamp : 2025-07-30 23:12:10 UTC
ZTP Service is not running
01-connectivity-check: SUCCESS
02-image: SUCCESS
05-commands-list: SUCCESS
06-startup-file: SUCCESS
09-provisioning-script: SUCCESS
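The expected output above can be checked mechanically: the overall ZTP Status line and every numbered stage should read SUCCESS. This is a minimal sketch, assuming the "key : value" layout shown in the sample; the function name ztp_ok is hypothetical, and any value other than SUCCESS is treated as a failure because the failure strings are not documented here.

```shell
# Read `ztp status` output on stdin. Exit 0 only if "ZTP Status" is SUCCESS and
# every numbered stage (for example "02-image") is SUCCESS; otherwise exit 1
# and print the stages that did not succeed.
ztp_ok() {
    awk -F ' *: *' '
        { sub(/[ \t\r]+$/, "") }                        # drop trailing whitespace
        $1 == "ZTP Status" { overall = $2 }
        /^[0-9][0-9]-/ && $2 != "SUCCESS" { bad = bad " " $1 }
        END {
            if (bad != "") print "failed stages:" bad
            exit !(overall == "SUCCESS" && bad == "")
        }
    '
}

# Usage, with the output captured from the switch:
#   sudo ztp status | ztp_ok && echo "ZTP completed"
```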
If the NVLink Switches do not appear as Up in BCM, check whether the cm-lite-daemon service is running on the NVLink Switch. If it is not, try restarting the service. If the service is not installed, initiate a lite daemon download in BCM and install it on the switch. For detailed steps, see the earlier section: Install cm-lite-daemon.
Note
If automatic cm-lite-daemon registration fails and installing from BCM also fails, proceed to the manual registration procedure.
Validate the services are running successfully.
cmsh -c "device; use <switch_hostname>; latesthealthdata"
cmsh -c "device; use <switch_hostname>; latestmonitoringdata"
Example health data output:
[bcm11-headnode->device[nvsw01]]% latesthealthdata
Measurable         Parameter    Type         Value      Age        State      Info
------------------ ------------ ------------ ---------- ---------- ---------- ------------------------------------------------
ManagedServicesOk               Internal     PASS       2m 4s
diskspace                       Disk         PASS       2m 4s
dmesg                           OS           PASS       2m 4s
ntp                             Internal     FAIL       2m 4s                 time daemon not synchronized to a time server
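Failing checks such as the ntp row above can be listed directly from the health data. This is a minimal sketch, assuming whitespace-separated rows in which the Value column falls within the first four fields (as in the sample, where the Parameter column is empty); the function name failed_health_checks is hypothetical.

```shell
# Read `latesthealthdata` output on stdin and print the Measurable name of each
# row whose Value is FAIL. Only fields 2-4 are inspected so that the free-text
# Info column cannot trigger a match.
failed_health_checks() {
    awk '{
        for (i = 2; i <= NF && i <= 4; i++)
            if ($i == "FAIL") { print $1; break }
    }'
}

# Usage on the head node:
#   cmsh -c "device; use <switch_hostname>; latesthealthdata" | failed_health_checks
```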
Manual Registration of NVLink Switch (without litedaemon installer)
If the NVLink Switch is not registered automatically (for example, if ZTP fails), manually register the switch using the following procedure without the litedaemon install command.
Note
This should only be used as a last resort if for some reason the litedaemon install command fails.
Prerequisites: Ensure the NVLink Switches are properly set up in BCM as described in the Manual Addition of NVLink Switch Rack Entries section.
Step 1: Download the cm-lite-daemon package and bootstrap files on the BCM head node
cd /root
apt update
apt install cm-lite-daemon
# List the cm-lite-daemon package and bootstrap files
ls -l /cm/shared/apps/cm-lite-daemon-dist/
# Example output:
# -rw------- 1 root root 1704 Jul 28 23:50 bootstrap.key
# -rw------- 1 root root 1285 Jul 28 23:50 bootstrap.pem
# -rw-r--r-- 1 root root 159540 Oct 31 2017 cm-lite-daemon.zip
Step 2: Copy the required files to the NVLink Switch
Copy the cm-lite-daemon package to the NVLink Switch.
scp /cm/shared/apps/cm-lite-daemon-dist/cm-lite-daemon.zip admin@<switch_ip>:/home/admin
Copy the bootstrap keys to the NVLink Switch.
scp /cm/shared/apps/cm-lite-daemon-dist/bootstrap.* admin@<switch_ip>:/home/admin
Step 3: Determine the active BCM head node IP
cmsh -c "device; use master; get ip"
Step 4: Register the NVLink Switch
SSH into the NVLink Switch:
ssh admin@<switch_ip>
Unzip the cm-lite-daemon package:
sudo unzip cm-lite-daemon.zip
# Example directory contents after unzipping:
# -rw------- 1 admin admin 1704 Jul 29 20:18 bootstrap.key
# -rw------- 1 admin admin 1285 Jul 29 20:18 bootstrap.pem
# drwxr-xr-x 7 root root 4096 Oct 31 2017 cm-lite-daemon
# -rw-r--r-- 1 admin admin 159540 Jul 29 20:17 cm-lite-daemon.zip
Copy the cm-lite-daemon directory and bootstrap keys to the appropriate locations:
sudo cp -r /home/admin/cm-lite-daemon /opt
sudo cp bootstrap.* /opt/cm-lite-daemon/etc/
Install required Python dependencies:
cd /opt/cm-lite-daemon/
sudo pip install -r requirements.txt
Register the switch with BCM:
sudo ./register_node --host <active_bcm_ip> --disable-cert-check
Note
Replace <switch_ip> and <active_bcm_ip> with the actual IP addresses for the specific environment.
This process manually registers the NVLink Switch with BCM if the automatic registration fails.
Configure NMX-C leader
After a rack leaves the factory, the NVLink Switch configurations are generally wiped. If BCM configures the NVLink Switches using bcm-netautogen, an NMX-C leader is configured automatically when ZTP runs. However, if the NVLink Switches are configured manually, an NMX-C leader needs to be assigned. Typically, this is the first switch, NVSW-01 (physically the lowest switch in the rack). A fabric manager topology configuration file (fm_config.cfg) needs to be created and applied.
Method 1—generate an fm_config.cfg file locally on an NVLink Switch
Enable cluster apps.
nv set cluster state enabled
nv config apply
nv config save
Show the status of the NMX-C and NMX-T applications with nv show cluster apps. If the status is not ok, the fm_config file needs to be generated.
admin@a18-p1-nvsw-01:~$ nv show cluster apps
Name ID Version Capabilities Components Version Status Reason Additional Information Summary
-------------- ------------- ---------------------- --------------------------------------------------- ---------------------------------------------------------------- ------ -------- -------------------------------- -------
nmx-controller nmx-c-nvos 0.9.0_2025-02-25_16-53 sm, gfm, fib, gw-api sm:2025.01.6, gfm:R570.124.02, fib-fe:0.9.0 not ok NMXC: OK CONTROL_PLANE_STATE_UNCONFIGURED
Generate the fm_config file from the live settings. If this is the first time generating this file after a fresh installation or after a factory reset, it has the factory default settings shipped with the corresponding version of NVOS.
nv action generate sdn config app nmx-controller type fm_config
Action executing ...
App config file nmx-controller_fm_config_20241029_042454 is successfully
generated
Action succeeded
The generated file is placed in the following directory: /host/cluster_infra/app_config/nmx-controller/fm_config/<FILENAME>
Default fm_config file contents:
STATE_FILE_NAME=/tmp/fabricmanager.state
LOG_FILE_NAME=/logs/fabricmanager.log
LOG_FILE_MAX_SIZE=100
MNNVL_API_BACKEND_ENABLED=1
FABRIC_MODE_RESTART=1
USE_RPC=1
LOG_LEVEL=4
TRUNK_LINK_FAILURE_MODE=0
FM_STAY_RESIDENT_ON_FAILURES=0
MNNVL_RESILIENCY_MODE=0
FM_CMD_PORT_NUMBER=6666
STARTING_TCP_PORT=16000
NVSWITCH_FAILURE_MODE=0
BIND_INTERFACE_IP=127.0.0.1
FABRIC_MODE=0
MNNVL_ENABLED=1
LOG_USE_SYSLOG=0
DAEMONIZE=0
ABORT_CUDA_JOBS_ON_FM_EXIT=1
FM_CMD_BIND_INTERFACE=127.0.0.1
ACCESS_LINK_FAILURE_MODE=0
TOPOLOGY_FILE_PATH=/usr/share/nvidia/nvswitch
LOG_APPEND_TO_LOG=1
MNNVL_ENABLE_DEFAULT_PARTITION=1
LOG_MAX_ROTATE_COUNT=5
ENABLE_AUTH_ENCRYPTION=0
Note
For GB200/GB300 NVL72 systems, add the following to the fm_config.cfg file. The same topology name applies to GB300, even though GB300 does not appear in the MNNVL_TOPOLOGY value.
MNNVL_TOPOLOGY=gb200_nvl72r1_c2g4_topology
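Adding the setting by hand risks duplicate entries if the step is repeated. This is a minimal sketch of an idempotent edit, assuming the file uses plain KEY=VALUE lines as in the default contents shown above; the function name set_fm_option is hypothetical, and the generated file may need to be edited with elevated privileges.

```shell
# Set KEY=VALUE in a KEY=VALUE style config file: replace the existing line if
# the key is already present, append a new line otherwise.
# Assumption: the value contains no "|" or "&" characters (true for the
# topology name used here).
set_fm_option() {
    file=$1 key=$2 value=$3
    if grep -q "^${key}=" "$file"; then
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Usage, with the path of the generated file:
#   set_fm_option <path to generated fm_config file> MNNVL_TOPOLOGY gb200_nvl72r1_c2g4_topology
```

Running it twice leaves a single MNNVL_TOPOLOGY line in the file.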
Install the configuration.
$ nv action install sdn config app nmx-controller type fm_config files <FM_CONFIG_FILENAME>
Restart NMX-C.
$ nv action stop cluster apps nmx-controller
$ nv action start cluster apps nmx-controller
Confirm that the status of nmx-controller is shown as ok.
$ nv show cluster apps
admin@a18-p1-nvsw-01:/host/cluster_infra/app_config/nmx-controller/fm_config$ nv show cluster apps
Name ID Version Capabilities Components Version Status Reason Additional Information Summary
-------------- ------------- ---------------------- --------------------------------------------------- ---------------------------------------------------------------- ------ -------- -------------------------------- -------
nmx-controller nmx-c-nvos 0.9.0_2025-02-25_16-53 sm, gfm, fib, gw-api sm:2025.01.6, gfm:R570.124.02, fib-fe:0.9.0 not ok NMXC: OK, CONTROL_PLANE_STATE_OFFLINE
nmx-telemetry nmx-telemetry 0.9.5 nvl telemetry, gnmi aggregation, syslog aggregation nvl-telemetry:1.20.1, gnmi-aggregator:1.0.1, nmx-connector:1.0.1 ok
Note
If nmx-controller still shows not ok, as in the example output above, wait a few minutes and check again.
Last Resort: If the nmx-controller status does not show ok after five minutes, try rebooting the switch tray and check the status again. To reboot the switch tray, use the following command:
$ nv action reboot system
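The wait-and-recheck guidance above can be scripted as a polling loop with a timeout. This is a minimal sketch, assuming the Status column reads exactly "ok" or "not ok" on the same line as the app name, as in the sample output; the function names app_is_ok and wait_for_app and the NV_SHOW override variable are hypothetical helpers.

```shell
# Read `nv show cluster apps` output on stdin. Exit 0 if the row for app $1
# reports "ok", 1 otherwise. The match is case-sensitive so that the reason
# text "NMXC: OK" is not mistaken for the status.
app_is_ok() {
    awk -v app="$1" '
        $1 == app {
            found = 1
            if ($0 ~ / not ok( |$)/)  bad = 1
            else if ($0 !~ / ok( |$)/) bad = 1
        }
        END { exit !(found && !bad) }
    '
}

# Poll every $3 seconds until app $1 is ok or $2 seconds have elapsed.
# NV_SHOW can be set to substitute another command for `nv show cluster apps`.
wait_for_app() {
    app=$1 timeout=$2 interval=$3 waited=0
    while ! ${NV_SHOW:-nv show cluster apps} | app_is_ok "$app"; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
}

# Usage on the switch (300 seconds matches the five-minute guidance above):
#   wait_for_app nmx-controller 300 15 || echo "still not ok; consider a reboot"
```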
Method 2—Import NVLink Switch fm_config.cfg file
An NVLink Switch can import an fm_config.cfg file from other NVLink Switch devices or other file locations on the network.
Enable cluster apps.
nv set cluster state enabled
nv config apply
nv config save
Show the status of the NMX-C and NMX-T applications with nv show cluster apps.
Import the configuration file using the following command:
$ nv action fetch sdn config app nmx-controller type fm_config scp://<user>:<password>@<IP_address>/path/to/fm_config.cfg
Note
The <REMOTE_URL> used to fetch the fm_config.cfg file supports the scp, https, ftp, and sftp protocols.
BCM keeps a default fm_config.cfg at /cm/local/apps/cmd/etc/htdocs/switch/fm_config/gb200_nvl72r1_c2g4.cfg
This example uses the BCM-maintained version, but any config file created to serve as the fm_config.cfg file can be used. For example:
nv action fetch sdn config app nmx-controller type fm_config scp://<user>:<password>@<IP_address>/cm/local/apps/cmd/etc/htdocs/switch/fm_config/gb200_nvl72r1_c2g4.cfg
# Results
admin@a18-p1-nvsw-02:~$ nv action fetch sdn config app nmx-controller type fm_config scp://root:nvidia123@172.16.6.31/cm/local/apps/cmd/etc/htdocs/switch/fm_config/gb200_nvl72r1_c2g4.cfg
Action executing ...
Fetching file ...
Action executing ...
File fetched successfully
Action succeeded
Note
The head node reaches the NVLink Switch devices over the OOB network through bond1. Use the IP address assigned to bond1 in the example above.
Verify that the file was fetched successfully.
$ nv show sdn config app nmx-controller type fm_config files
admin@a18-p1-nvsw-02:~$ nv show sdn config app nmx-controller type fm_config files
Available config file  File path
---------------------- ------------------------------------------------------------------------------
gb200_nvl72r1_c2g4.cfg /host/cluster_infra/app_config/nmx-controller/fm_config/gb200_nvl72r1_c2g4.cfg
Install the configuration.
$ nv action install sdn config app nmx-controller type fm_config files <FM_CONFIG_FILENAME>
Restart NMX-C.
$ nv action stop cluster apps nmx-controller
$ nv action start cluster apps nmx-controller
Confirm that the status of nmx-controller is shown as ok.
$ nv show cluster apps
Configure NMX-Telemetry
To collect telemetry data from the NVLink Switches, NMX-Telemetry needs to be configured. To do this, use the following steps:
Check if NMX-T is running:
$ nv show cluster apps
Start the telemetry service:
$ nv action start cluster apps nmx-telemetry
Confirm that the status of nmx-telemetry is shown as ok:
$ nv show cluster apps