GB200 Rack Power On and Bring Up

The GB200 rack bring-up process is summarized below. The steps can be done in any order, but it is advised that the NVLink Switches be brought up first so that the NVLink domain is already up and configured when the GB200 compute trays are powered on. Otherwise, all the compute nodes will have to be restarted to ensure that they can communicate correctly on the NVLink fabric.

GB200 Compute Tray Bring Up Summary

  1. Establish and confirm power control of the rack devices in the rack that is being brought up.

  2. Power on the compute nodes to provision them, either physically or through the BMC (using its KVM software or ipmitool).

  3. After provisioning, log in to the nodes from the head node and confirm the state of the NICs (all bonds are up, and all connections are up at least for the north-south networking).

  4. After successful provisioning and bring up, assess the firmware status:

    • Check whether the firmware levels match those reported in the SBOM.

    • Update the components if necessary.
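
The bond check in step 3 can be scripted on a node. The following is a minimal sketch, assuming the bonds use the standard Linux bonding driver, which exposes one status file per bond under /proc/net/bonding. The function names and the BOND_DIR override are hypothetical helpers introduced here for illustration. They are not part of BCM.

```shell
# Print the bond-level MII status from a Linux bonding status file.
# The first "MII Status:" line describes the bond itself; later
# occurrences describe the individual slave interfaces.
bond_mii_status() {
    awk -F': ' '/^MII Status:/ { print $2; exit }' "$1"
}

# Report every bond found on this node and return non-zero if any
# bond is not up. BOND_DIR can be overridden (hypothetical knob) so
# the function can be exercised against sample files.
check_bonds() {
    dir="${BOND_DIR:-/proc/net/bonding}"
    rc=0
    for f in "$dir"/*; do
        [ -e "$f" ] || continue          # no bonds configured
        status=$(bond_mii_status "$f")
        printf '%s: %s\n' "$(basename "$f")" "$status"
        [ "$status" = "up" ] || rc=1
    done
    return "$rc"
}

# Usage on a provisioned node:
#   check_bonds || echo "at least one bond is not up"
```

A node with no bonds configured prints nothing and returns success, so an empty result should be read as "nothing to check" rather than "all bonds up".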

NVLink Switch Tray Bring Up Summary

Check that the NVLink Switch is reachable via SSH (as the admin user on the COMe0 network).

  1. Check if the NVLink Switch BMCs are reachable via:

    • BCM device power status.

    • SSH directly into the BMC.

    • ipmitool.

  2. Install cm-lite-daemon

    • If successful, the NVLink Switches will show as UP under the device list. It will then configure NMX-C and NMX-T automatically.

    • If cm-lite-daemon does not install successfully and the installer needs to get the NVLink domain up to progress, select an NVLink Switch to serve as the master and configure NMX-C and NMX-T manually.

  3. After successful connectivity has been established on the COMe0 network and the BMC, assess the firmware status:

    • Check whether the firmware levels match those reported in the SBOM. Update the components if necessary.

    • Check the version of NVOS and update the OS either on each switch individually or with ZTP.
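
The SSH reachability check for the NVLink Switches can be looped over all switches in a rack. This is a rough sketch only: it assumes key-based SSH login for the admin user (BatchMode disables password prompts, so a switch that only accepts passwords would be reported as unreachable), and the function names and the PROBE override are hypothetical helpers, not BCM or NVOS commands.

```shell
# Default probe: try a non-interactive SSH login as the admin user.
# ASSUMPTION: key-based authentication is set up for admin.
probe_ssh() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "admin@$1" true
}

# Probe each host given as an argument and print one result line per
# host. Returns non-zero if any host could not be reached.
# PROBE may name a different probe function (hypothetical knob), for
# example one that queries the switch BMC instead.
check_switches() {
    probe="${PROBE:-probe_ssh}"
    rc=0
    for host in "$@"; do
        if "$probe" "$host" >/dev/null 2>&1; then
            printf '%s reachable\n' "$host"
        else
            printf '%s UNREACHABLE\n' "$host"
            rc=1
        fi
    done
    return "$rc"
}

# Usage, with the switch names or COMe0 addresses of the rack:
#   check_switches <switch 1> <switch 2>
```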

Power Shelves Bring Up Summary

  1. Confirm that the power shelves report as on. If they do not, try bouncing the ports from the switch side to get them to show their status as up.

  2. Apply firmware updates.

NIC Firmware Update Bring Up Summary

  1. Verify NIC firmware versions and update them if needed.
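
One way to verify a NIC firmware version from the node is to read the firmware-version field that `ethtool -i` reports for an interface and compare it with the expected value. The sketch below assumes ethtool is installed on the node and that the expected version string is taken from the SBOM by hand. The function names are hypothetical helpers, and other tools may be preferred for the actual update.

```shell
# Extract the firmware version from `ethtool -i` output on stdin.
# Only the first field after the label is kept, so a trailing
# board identifier in parentheses is dropped.
fw_from_ethtool() {
    awk '/^firmware-version:/ { print $2; exit }'
}

# Compare the firmware of one interface with an expected version.
# usage: nic_fw_matches <interface> <expected version>
nic_fw_matches() {
    actual=$(ethtool -i "$1" | fw_from_ethtool)
    if [ "$actual" = "$2" ]; then
        printf '%s: %s OK\n' "$1" "$actual"
    else
        printf '%s: %s (expected %s)\n' "$1" "${actual:-unknown}" "$2"
        return 1
    fi
}

# Usage:
#   nic_fw_matches <interface> <version from SBOM>
```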

Initial PowerOn and Provisioning

Once the rack import process, or the manual configuration of the cluster compute and control node entries, is complete, the GB200 racks are ready to be brought up. With the rf0 (Redfish 0) ports configured in BCM with a MAC-to-IP mapping, all the GB200 compute trays, NVLink Switch devices, and power shelves will get their IP addresses when the respective ipminet is up.

GB200 Compute Tray PowerOn and Provisioning

  1. Ensure that OOB power control of the GB200 compute trays is configured.

    Check power status:

    Individual Node:

    cmsh -c "device use <DGX GB200 compute tray>; power status"
    

    All devices in a rack:

    cmsh -c "device; power -r <rack number> status"
    
    • If the output says Skipped, that likely means the power control is not set.

    Set power control settings:

    One node:

    cmsh -c "device use <rack location>-<pod number>-dgx-<rack number>-c<node number>;set powercontrol rf0; commit"
    

    All nodes in a rack:

    cmsh -c "device; foreach -n <rack location>-<pod number>-dgx-<rack number>-c[01-18] (set powercontrol rf0; commit)"
    
    • If the power status output says failed, BCM can reach the BMC (rf0) but the credentials are incorrect. Check the bmcsettings at the category level.
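
For a full rack, the two hints in step 1 (Skipped and failed) can be applied to the power status output with a small filter. This is a sketch under assumptions: the exact layout of the `power status` output is not specified here, so the filter only looks for the two keywords, case-insensitively, anywhere in a line. The function name is a hypothetical helper.

```shell
# Read `power status` output on stdin and print a hint for each line
# that reports a skipped or failed power operation. Lines without
# either keyword (for example, nodes reporting ON or OFF) are ignored.
triage_power_status() {
    awk '
        { l = tolower($0) }
        l ~ /skipped/ { print "power control probably not set: " $0; next }
        l ~ /failed/  { print "BMC reachable, check credentials: " $0 }
    '
}

# Usage on the head node:
#   cmsh -c "device; power -r <rack number> status" | triage_power_status
```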

  2. Confirm that the BMC web GUI of a GB200 node is accessible.

    • Depending on the network configuration, the head node may need to be used as a jump point to reach the GB200 compute tray web GUI.

    • Open a web browser such as Firefox and configure its proxy settings.

    • Open the BMC web GUI at https://<bmc ip>

  3. Power on one node and watch the boot and provisioning process

    Power on with BCM:

    cmsh -c "device use <compute node under test>; power on"
    
    • Alternatively, power on through the server power control in the BMC web GUI.

    Note

    For the GB200 Compute trays to reset properly through the “power” command, a delay needs to be set in the partition settings.

    Configure power reset delay:

    cmsh -c "partition; show bmcsettings"
    
    [a03-p1-head-01->partition[base]->bmcsettings]% show
    Parameter                        Value
    -------------------------------- ------------------------------------------------
    Revision
    User name                        bright
    Password                         ********
    User ID                          4
    Power reset delay                5s  <-- set this to do : power off + sleep(5) + power on
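
The listing above shows the current value but not how to change it. A possible way to set the delay from the shell is sketched below. The path partition, base, bmcsettings follows the prompt shown in the listing, but the parameter name `powerresetdelay` is an assumption derived from the displayed label "Power reset delay" and should be confirmed in cmsh before use. The helper only builds and prints the command string so that it can be reviewed first.

```shell
# Build the cmsh command string that sets the BMC power reset delay
# on a partition.
# ASSUMPTION: the parameter is named "powerresetdelay"; verify the
# exact name in cmsh before running the command.
# usage: power_reset_delay_cmd [partition] [delay]
power_reset_delay_cmd() {
    partition="${1:-base}"
    delay="${2:-5s}"
    printf 'partition use %s; bmcsettings; set powerresetdelay %s; commit' \
        "$partition" "$delay"
}

# Print the command for review. Once it looks right, run it with:
#   cmsh -c "$(power_reset_delay_cmd base 5s)"
power_reset_delay_cmd base 5s
echo
```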
    
  4. During the boot up of a GB200 compute tray:

    • Watch the node installer log to look for any issues during the provisioning.

    tail -f /var/log/node-installer
    
    • Watch the syslog for any issues/errors

    tail -f /var/log/syslog | grep -i cmd
    
    • Check cmsh to confirm the nodes are in an UP state.

  5. Once one of the GB200 compute trays provisions successfully, proceed to power on and provision the rest of the nodes.
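
After the remaining nodes are powered on, the UP-state check from step 4 can be narrowed to the nodes that still need attention. The filter below is a sketch that assumes the status listing prints one device per line with the state in square brackets, such as `[   UP   ]`. Adjust the pattern if the output differs. The function name is a hypothetical helper.

```shell
# Read a device status listing on stdin and print every non-empty
# line that does not report an UP state in square brackets.
# Exit status is 0 only when all listed devices are UP.
nodes_not_up() {
    awk 'NF && !/\[ *UP *\]/ { print; bad = 1 } END { exit bad }'
}

# Usage on the head node (power on the rack, then check the states):
#   cmsh -c "device; power -r <rack number> on"
#   cmsh -c "device status" | nodes_not_up
```

An empty result with a zero exit status means every listed device reported UP.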