Rack Reboot Sequence

This section describes how to “cold reboot” or “warm reboot” an MNNVL system.

Note

These terms are defined as follows:

A cold reboot powers the rack off and then back on.
A warm reboot restarts the rack without removing power.

Important

Always use a cold reboot unless directed by the release notes to use a warm reboot.

Cold Reboot Sequence

To cold reboot a rack:

Note

If the rack will run L11 diagnostics, don’t power on the compute nodes. Stop after step 8 (verifying the health of each switch tray).

Run the halt -p command on all compute nodes.
Run the nv action reboot system halt command on all NVLink switches.
Remove AC power from all the power shelves in the rack.
Wait at least three minutes for the PSU capacitors to discharge.
Power on all power shelves in the rack.
Wait at least two minutes for the compute tray and switch tray BMCs to power on.
Wait at least two more minutes for the switch trays to boot NVOS.
Verify the health of each switch tray with the switch tray verification steps.
Access the BMC of each compute tray and power on the node.

Tip

Use the Redfish API to power on the compute trays.

curl -si -u <BMC_USERNAME>:<BMC_PASSWORD> -k -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{"ResetType": "On"}' https://<BMC_IP>/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset

Run the compute tray verification steps.
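
On a rack with many compute trays, the Redfish power-on request can be repeated for each BMC. The following is a minimal sketch, not an official tool: it reuses the endpoint and curl options from the tip above, and it assumes a list of BMC addresses on standard input (one per line) plus the environment variables BMC_USERNAME, BMC_PASSWORD, and DRY_RUN, all of which are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch: send the Redfish "On" reset request to every BMC address read from stdin.
# Assumptions (not from the documentation): the BMC_USERNAME, BMC_PASSWORD, and
# DRY_RUN environment variables, and an input list with one BMC address per line.

# Build the reset URL for one BMC (path copied from the curl example in the tip).
redfish_reset_url() {
    printf 'https://%s/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset\n' "$1"
}

# Power on one compute tray, or only print the request when DRY_RUN=1.
power_on_tray() {
    url=$(redfish_reset_url "$1")
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'POST %s {"ResetType": "On"}\n' "$url"
        return 0
    fi
    # curl's exit status reflects transport errors only; read the printed
    # response headers (-i) to confirm the HTTP status of each request.
    curl -si -k -u "${BMC_USERNAME}:${BMC_PASSWORD}" -X POST \
        --header 'Content-Type: application/json' \
        --header 'Accept: application/json' \
        -d '{"ResetType": "On"}' "$url" </dev/null
}

# Read BMC addresses from stdin, skipping blank lines and lines starting with "#".
power_on_all() {
    while IFS= read -r bmc_ip; do
        case "$bmc_ip" in
            ''|'#'*) continue ;;
        esac
        power_on_tray "$bmc_ip" || printf 'power-on request failed for %s\n' "$bmc_ip" >&2
    done
}
```

For example, with the addresses stored in a hypothetical file named bmc_ips.txt, setting DRY_RUN=1 and running power_on_all < bmc_ips.txt prints each request without sending it, which is a convenient way to review the target list before powering on the trays.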

Warm Reboot Sequence

Important

Only use a warm reboot when directed by the release notes.

To warm reboot a rack:

Run the halt -p command on all compute nodes.
Stop the NMX-C service on the switch node running NMX-C.

nv action stop cluster app nmx-controller

From the switch node, verify that the cluster is enabled with the nv show cluster command.

nv show cluster
           operational  applied
---------  -----------  -------
state      enabled      enabled

Start the NMX-C service on the switch node.

nv action start cluster app nmx-controller

Verify that NMX-C reports an ok status with the nv show cluster apps running command.

nv show cluster apps running
Name            Status  Reason  Additional Information
--------------  ------  ------  ------------------------------
nmx-controller  ok              CONTROL_PLANE_STATE_CONFIGURED
nmx-telemetry   ok

Power on all the compute trays through their BMCs.

Tip

Use the Redfish API to power on the compute trays.

curl -si -u <BMC_USERNAME>:<BMC_PASSWORD> -k -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{"ResetType": "On"}' https://<BMC_IP>/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset

Run all the verification checks for the compute tray (refer to Post Reboot Verification).
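
The two status checks in the warm reboot sequence can be scripted instead of read by eye. The sketch below assumes the tabular output layout shown in the sample outputs above and simply parses text piped into it; the function names cluster_is_enabled and app_is_ok are hypothetical helpers, not part of NVOS.

```shell
#!/bin/sh
# Sketch: parse the text output of the nv show commands used in the warm reboot steps.
# The helper names are illustrative; only the column layout comes from the
# sample outputs in this section.

# Read `nv show cluster` output on stdin. Succeed only if the "state" row
# shows "enabled" in both the operational and applied columns.
cluster_is_enabled() {
    awk '$1 == "state" { found = 1; ok = ($2 == "enabled" && $3 == "enabled") }
         END { exit !(found && ok) }'
}

# Read `nv show cluster apps running` output on stdin. Succeed only if the
# row for the app named in $1 has the status "ok". The Status column is the
# second whitespace-separated field even when the Reason column is empty.
app_is_ok() {
    awk -v app="$1" '$1 == app { found = 1; ok = ($2 == "ok") }
         END { exit !(found && ok) }'
}
```

On the switch node, the helpers would be used as nv show cluster | cluster_is_enabled and nv show cluster apps running | app_is_ok nmx-controller. Each returns a nonzero exit status when the expected row is missing or reports a different value, so the result can gate the next step of a script.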

Post Reboot Verification

Complete the following verification checks when you reset a GPU, reboot the compute tray, or reboot the MNNVL rack.

Verify that the nvidia-persistenced service is running.

systemctl status nvidia-persistenced

Verify that the nvidia-imex service is running.

systemctl status nvidia-imex

Verify that all the NVLink connections are active.

nvidia-smi nvlink --status

Verify fabric state to make sure all GPUs have a Completed state and a Success status.

nvidia-smi -q | grep 'Fabric' -A 4

Verify peer-to-peer topology to ensure all GPUs show OK.

nvidia-smi topo -p2p n
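
Some of the checks above lend themselves to a pass/fail script. The following sketch covers the two service checks and the fabric state check. It rests on two assumptions that are not shown in this section: that the nvidia-smi -q output contains a "Fabric" line followed by indented "State : <value>" and "Status : <value>" lines, and that overriding the command through a SYSTEMCTL variable (a hypothetical hook, useful for testing) is acceptable. The NVLink and peer-to-peer topology checks are left as manual steps.

```shell
#!/bin/sh
# Sketch: scripted versions of the service and fabric state checks.
# Assumptions: the nvidia-smi -q layout described in the comments below, and
# the SYSTEMCTL override variable, which is a hypothetical testing hook.

# Report whether each required service is active; return nonzero if any is not.
check_services() {
    failed=0
    for svc in nvidia-persistenced nvidia-imex; do
        if ${SYSTEMCTL:-systemctl} is-active --quiet "$svc"; then
            printf 'ok: %s\n' "$svc"
        else
            printf 'NOT ACTIVE: %s\n' "$svc"
            failed=1
        fi
    done
    return "$failed"
}

# Read `nvidia-smi -q` output on stdin. Succeed only if at least one Fabric
# section exists and every section reports State "Completed" and Status "Success".
# Assumed layout: a "Fabric" line, then indented "State : ..." and "Status : ..." lines.
fabric_all_ok() {
    awk -F': *' '
        /^[ \t]*Fabric[ \t]*$/ { sections++; in_fabric = 1; next }
        in_fabric && $1 ~ /^[ \t]*State[ \t]*$/  { if ($2 == "Completed") completed++ }
        in_fabric && $1 ~ /^[ \t]*Status[ \t]*$/ { if ($2 == "Success") success++; in_fabric = 0 }
        END { exit !(sections > 0 && completed == sections && success == sections) }'
}
```

On a compute tray, the checks would be run as check_services and nvidia-smi -q | fabric_all_ok. If the actual nvidia-smi output differs from the assumed layout, fabric_all_ok fails rather than passing silently, so a failure should be confirmed against the raw command output.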