Rack Reboot Sequence#

To ensure that the software is in a known-correct state, a cold or warm reboot sequence may be required. This topic describes the general reboot and verification process. However, to safeguard system integrity, it is recommended that you use Mission Control to perform the reboot sequence. Refer to the Mission Control documentation for more information.

Cold Reboot Sequence#

The cold rack reboot sequence should be executed for the following use cases:

  • After the initial software installation and setup, and before running a workload for the first time.

  • After firmware upgrades and before re-running workloads.

  • Before running the full rack diags.

  • After maintenance or service operations, including but not limited to compute tray, switch tray, trunk link, or cable cartridge replacement.

Use the method that is most appropriate for your environment to complete the procedure below. You need the IP addresses of all BMCs, hosts, and power shelves.

Note

The sequence diverges at step 9 (switch tray provisioning) for NMX-C, depending on whether you plan to run a multi-node workload or an L11 diag. For the diags use case, the NMX-C services are started by the diags as part of their deployment; for regular software workloads, these services have to be started manually after a cold reboot.

  1. AC power off all the power shelves in the rack, which turns off all PSUs.

    The shelves can be powered off in parallel.

  2. Wait three to five minutes for the capacitors to discharge from all the PSUs.

  3. Ping all compute/switch tray BMC and OS IPs to confirm that they are no longer reachable on the network (a reachability-check sketch follows this procedure).

    Note

    Steps 1-3 are not applicable when you power up the rack for the first time.

  4. AC power on all the power shelves in the rack, which will turn on all PSUs.

    The shelves can be powered on in parallel.

  5. Wait for two minutes for the compute tray and switch tray BMCs to power up.

  6. Ping all compute/switch tray BMC IPs to ensure they are reachable on the network.

  7. The switch trays continue to boot to NVOS.

  8. Wait for two more minutes for the switch trays to boot to NVOS.

  9. Provision the switch trays for your use case: for regular software workloads, start the NMX-C services manually (a sketch follows this procedure); for the L11 diag use case, the diags start the NMX-C services as part of their deployment.

  10. Power on all compute nodes from the compute tray BMCs (a power-on sketch follows this procedure).

  11. Wait five minutes.

  12. Ping the compute nodes to ensure that they are reachable.

  13. Run the verification checks for the compute tray (refer to Post-reboot Verification for more information).
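
For the ping checks in steps 3, 6, and 12, a reachability loop such as the following can be used. This is a minimal sketch: the IP addresses are placeholders and the ping options are only one reasonable choice; substitute the BMC and host OS IPs for your rack.

    #!/usr/bin/env bash
    # Placeholder address lists -- replace with the BMC and OS IPs for your rack.
    BMC_IPS="192.0.2.11 192.0.2.12"
    OS_IPS="192.0.2.111 192.0.2.112"

    for ip in $BMC_IPS $OS_IPS; do
        # -c 2: send two probes; -W 2: wait up to two seconds for each reply
        if ping -c 2 -W 2 "$ip" > /dev/null 2>&1; then
            echo "$ip reachable"
        else
            echo "$ip not reachable"
        fi
    done

After step 3 every address should report not reachable; after steps 6 and 12 the relevant addresses should report reachable.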
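
For the regular software workload path of step 9, the NMX-C services have to be started manually after the cold reboot (see the Note above). The following is a minimal sketch, run on the switch node selected to host NMX-C and using the same NVOS commands shown in the warm reboot sequence below:

    $ nv action start cluster app nmx-controller
    $ nv show cluster apps running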
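
For step 10, the compute nodes can be powered on out-of-band through their BMCs. The following is a minimal sketch using IPMI over LAN from a management host; the BMC addresses and credentials are placeholders, and Redfish or other supported APIs can be used instead.

    #!/usr/bin/env bash
    # Placeholder BMC addresses and credentials -- replace with your own.
    COMPUTE_BMC_IPS="192.0.2.11 192.0.2.12"
    BMC_USER="admin"
    BMC_PASSWORD="password"

    for bmc in $COMPUTE_BMC_IPS; do
        # Power on each compute node through its BMC
        ipmitool -I lanplus -H "$bmc" -U "$BMC_USER" -P "$BMC_PASSWORD" chassis power on
    done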

Warm Reboot Sequence#

The cold rack reboot sequence takes a long time. As a workaround for certain issues, which are usually documented in the associated release notes, the following warm rack reboot procedure ensures that the GPUs, NVLinks, and switch ASICs come up in the correct state.

  1. Power off all compute trays through the BMC using IPMI, Redfish APIs, or other supported APIs. Here is an example using IPMI commands (a loop over all trays is sketched after this procedure):

    $ ipmitool chassis power off
    $ ipmitool power off
    
  2. Stop the NMX-C service on the switch node that is running the service.

    $ nv action stop cluster app nmx-controller
    
  3. Start NMX-C on the selected switch node.

    $ nv action start cluster app nmx-controller
    

    The cluster must already be in the enabled state before you restart NMX-C.

  4. Verify that NMX-C is running. The status should show ok.

    $ nv show cluster apps running
    Name            Status  Reason  Additional Information
    --------------  ------  ------  ------------------------------
    nmx-controller  ok              CONTROL_PLANE_STATE_CONFIGURED
    nmx-telemetry   ok
    
  5. Power on all the compute trays through the BMC using IPMI, Redfish APIs, or other supported APIs. Here is an example using IPMI commands:

    $ ipmitool chassis power on
    $ ipmitool power on
    
  6. Run all the verification checks for the compute tray (refer to Post-reboot Verification for more information).
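
The IPMI examples in steps 1 and 5 act on a single tray. To apply them to all compute trays through their BMCs from a management host, a loop such as the following can be used. This is a minimal sketch with placeholder BMC addresses and credentials; Redfish or other supported APIs work equally well.

    #!/usr/bin/env bash
    # Placeholder BMC addresses and credentials -- replace with your own.
    COMPUTE_BMC_IPS="192.0.2.11 192.0.2.12"
    BMC_USER="admin"
    BMC_PASSWORD="password"

    # Step 1: power off every compute tray through its BMC.
    # For step 5, rerun the loop with "chassis power on" after NMX-C is running again.
    for bmc in $COMPUTE_BMC_IPS; do
        ipmitool -I lanplus -H "$bmc" -U "$BMC_USER" -P "$BMC_PASSWORD" chassis power off
    done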

Post-reboot Verification#

Complete the following verification checks when a GPU is reset, the compute tray is rebooted, or the entire rack is cold or warm rebooted. For detailed instructions on performing each verification step, refer to the MNNVL User Guide Verification section; a brief command sketch follows this list.

  1. Verify that the nvidia-persistenced service is running and in a good state.

    Refer to Checking the nvidia-persistenced Service for more information.

  2. Verify that the nvidia-imex service is active and running.

    Refer to Checking the nvidia-imex Service for more information.

  3. Verify that all the links are active.

    Refer to Checking the NVLink status for more information.

  4. Verify the fabric state to make sure that all GPUs report a Completed state and a Success status.

    Refer to Checking the fabric health for more information.

  5. Verify the peer-to-peer topology to ensure that all GPUs show OK.

    Refer to Checking the p2p topology for details.

  6. If a GPU does not show OK, reset the GPU and return to step 1.
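
For the authoritative instructions, use the MNNVL User Guide Verification section referenced above. As a quick orientation only, the following sketch maps common commands to the checks in this list; the systemd unit names and nvidia-smi options shown here are assumptions about a typical deployment, not taken from the guide.

    # Steps 1-2: service checks (assumed systemd unit names)
    $ systemctl status nvidia-persistenced
    $ systemctl status nvidia-imex

    # Step 3: NVLink link status for each GPU
    $ nvidia-smi nvlink --status

    # Step 4: fabric registration state and status for each GPU
    $ nvidia-smi -q | grep -A 4 "Fabric"

    # Steps 5-6: peer-to-peer topology; all GPU pairs should report OK
    $ nvidia-smi topo -p2p r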