Rack Reboot Sequence#
This section describes how to “cold reboot” or “warm reboot” an MNNVL system.
Note
These terms are defined as follows:
A cold reboot powers the rack off and then back on.
In a warm reboot, the rack never loses power.
Important
Always use a cold reboot unless directed by the release notes to use a warm reboot.
Cold Reboot Sequence#
To cold reboot a rack:
Note
If the rack will run L11 diagnostics, don’t power on the compute nodes. Stop at step 8.
1. On all compute nodes, run:
   halt -p
2. On all NVLink switches, run:
   nv action reboot system halt
3. Turn off AC power to all the power shelves in the rack.
4. Wait at least three minutes for the PSU capacitors to discharge.
5. Power on all power shelves in the rack.
6. Wait at least two minutes for the compute tray and switch tray BMCs to power on.
7. Wait at least two more minutes for the switch trays to boot NVOS.
8. Verify the health of each switch tray with the switch tray verification steps.
9. Access the BMC of each compute tray and power on the node.
   Tip
   Use the Redfish API to power on the compute trays.
   curl -si -u <BMC_USERNAME>:<BMC_PASSWORD> -k -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{"ResetType": "On"}' https://<BMC_IP>/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset
10. Run the compute tray verification steps.
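Sending the Redfish power-on request to every compute tray by hand is repetitive, so the request can be wrapped in a loop. The following is a minimal sketch, not part of the official procedure. It assumes that BMC_USERNAME and BMC_PASSWORD are exported in the environment and that a hypothetical file (here called bmc_ips.txt) lists one BMC IP address per line. The function names are illustrative only.

```shell
# Sketch: power on every compute tray through its BMC.
# Assumptions (not from the procedure above): BMC_USERNAME and BMC_PASSWORD
# are set in the environment; the input file holds one BMC IP per line.

# Build the Redfish reset URL for one BMC IP.
redfish_reset_url() {
  printf 'https://%s/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset\n' "$1"
}

# Send the same "On" reset request as the curl command in the tip.
power_on_tray() {
  curl -si -u "${BMC_USERNAME:?set BMC_USERNAME}:${BMC_PASSWORD:?set BMC_PASSWORD}" \
    -k -X POST \
    --header 'Content-Type: application/json' \
    --header 'Accept: application/json' \
    -d '{"ResetType": "On"}' \
    "$(redfish_reset_url "$1")"
}

# Power on every BMC listed in the given file, skipping blank and comment lines.
power_on_all() {
  local ip
  while IFS= read -r ip || [ -n "$ip" ]; do
    case "$ip" in
      ''|'#'*) continue ;;
    esac
    echo "Powering on compute tray via BMC ${ip}"
    power_on_tray "$ip" || echo "Request to ${ip} failed" >&2
  done < "$1"
}
```

After the switch trays pass verification, the loop could be started with power_on_all bmc_ips.txt. Skip it if the rack will run L11 diagnostics, because the compute nodes must stay powered off in that case.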
Warm Reboot Sequence#
Important
Only use a warm reboot when directed by the release notes.
To warm reboot a rack:
1. On all compute nodes, run:
   halt -p
2. Stop the NMX-C service on the switch node running NMX-C:
   nv action stop cluster app nmx-controller
3. From the switch node, verify that the cluster state is enabled:
   nv show cluster
              operational  applied
   ---------  -----------  -------
   state      enabled      enabled
4. Start the NMX-C service on the switch node:
   nv action start cluster app nmx-controller
5. Verify that the NMX-C status is ok:
   nv show cluster apps running
   Name            Status  Reason  Additional Information
   --------------  ------  ------  ------------------------------
   nmx-controller  ok              CONTROL_PLANE_STATE_CONFIGURED
   nmx-telemetry   ok
6. Power on all the compute trays through their BMCs.
   Tip
   Use the Redfish API to power on the compute trays.
   curl -si -u <BMC_USERNAME>:<BMC_PASSWORD> -k -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{"ResetType": "On"}' https://<BMC_IP>/redfish/v1/Systems/System_0/Actions/ComputerSystem.Reset
7. Run all the verification checks for the compute tray (refer to Post Reboot Verification).
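The two NVOS checks in this sequence can be turned into pass/fail tests instead of being read by eye. The sketch below assumes the command output has the column layout shown in the examples above. The function names cluster_is_enabled and cluster_app_is_ok are hypothetical helpers, not NVOS commands. Both read the command output on standard input.

```shell
# Sketch: pass/fail checks for the NVOS output shown in the warm reboot steps.
# Assumption: the output keeps the column layout shown in this section.

# Succeed only if the "state" row of `nv show cluster` output reports
# "enabled" in both the operational and applied columns.
cluster_is_enabled() {
  awk '$1 == "state" { found = 1; ok = ($2 == "enabled" && $3 == "enabled") }
       END { exit !(found && ok) }'
}

# Succeed only if the named app is listed with status "ok" in
# `nv show cluster apps running` output.
cluster_app_is_ok() {
  awk -v app="$1" '$1 == app { found = 1; ok = ($2 == "ok") }
       END { exit !(found && ok) }'
}
```

On the switch node, these could be used as nv show cluster | cluster_is_enabled and nv show cluster apps running | cluster_app_is_ok nmx-controller. Each command returns a nonzero exit status when the expected value is missing.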
Post Reboot Verification#
Complete the following verification checks after you reset a GPU, reboot a compute tray, or reboot the MNNVL rack.
1. Verify that the nvidia-persistenced service is running:
   systemctl status nvidia-persistenced
2. Verify that the nvidia-imex service is running:
   systemctl status nvidia-imex
3. Verify that all the NVLink connections are active:
   nvidia-smi nvlink --status
4. Verify the fabric state to make sure all GPUs have a Completed state and a Success status:
   nvidia-smi -q | grep 'Fabric' -A 4
5. Verify the peer-to-peer topology to ensure all GPUs show OK:
   nvidia-smi topo -p2p n
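The service and fabric checks above lend themselves to a scripted pass/fail result. The following sketch assumes that nvidia-smi -q prints each GPU's fabric block as a "Fabric" line followed by "State : <value>" and "Status : <value>" lines. That layout is an assumption about the output format, and the function names are hypothetical. The NVLink and peer-to-peer checks are left for manual review.

```shell
# Sketch: scripted versions of the service and fabric-state checks.
# Assumption: `nvidia-smi -q` prints "Fabric", then "State : ..." and
# "Status : ..." lines for each GPU.

# Report whether each named systemd unit is active; fail if any is not.
services_active() {
  local unit rc=0
  for unit in "$@"; do
    if systemctl is-active --quiet "$unit"; then
      echo "OK   $unit"
    else
      echo "FAIL $unit"
      rc=1
    fi
  done
  return "$rc"
}

# Succeed only if every fabric block on stdin shows State "Completed"
# and Status "Success", and at least one block is present.
fabric_all_ok() {
  awk '
    NF == 1 && $1 == "Fabric" { in_fabric = 1; blocks++; next }
    in_fabric && $1 == "State" && $2 == ":" && $3 == "Completed" { state_ok++ }
    in_fabric && $1 == "Status" && $2 == ":" {
      if ($3 == "Success") status_ok++
      in_fabric = 0
    }
    END { exit !(blocks > 0 && state_ok == blocks && status_ok == blocks) }'
}
```

On a compute node, a combined check might look like services_active nvidia-persistenced nvidia-imex && nvidia-smi -q | fabric_all_ok. A nonzero exit status means at least one service is inactive or at least one GPU has not reached the expected fabric state.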