M.2 NVMe Boot Drive Replacement#
This topic describes how to replace the boot drive in the NVIDIA DGX™ B300 system.
Caution
Static Sensitive Devices: Observe best practices for electrostatic discharge (ESD) protection. Ensure that personnel and equipment are connected to a common ground, for example by wearing a wrist strap connected to the chassis ground, and place components on static-free work surfaces.
M.2 NVMe Boot Drive Replacement Overview#
This is a high-level overview of the procedure to replace a boot drive.
Identify the failed M.2 drive.
Get a replacement M.2 from NVIDIA Enterprise Support.
Power off the system.
Unplug all motherboard cables.
Pull out the motherboard.
Remove the lid.
Remove the left BlueField-3 I/O bay.
Pull out the M.2 bay.
Replace the failed M.2 drive.
Install the M.2 bay.
Install the left BlueField-3 I/O bay.
Install the motherboard lid.
Slide the motherboard tray into the system.
Connect all the cables.
Power on the system.
Rebuild the RAID volume and mount the filesystem.
Send the failed unit to NVIDIA Enterprise Support using the packaging provided.
Prepare for Replacement#
The NVIDIA DGX™ B300 system automatically sets the failed M.2 drive offline when it detects the failure.
The boot drives are mirrored, so the mdadm command-line utility can identify the drive to replace.
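As a quick cross-check before running the commands in this section, the kernel's own view of the mirror can be read from /proc/mdstat. The following is a minimal sketch, assuming the boot mirror is a standard Linux md array (as the use of mdadm suggests) and that /proc/mdstat is readable. The function name degraded_md_arrays is hypothetical and is not part of any NVIDIA tool.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every md array whose member status
# field (for example [U_] or [_U]) contains "_", meaning a member is missing.
# Reads /proc/mdstat unless another mdstat-format file is given as $1
# ("-" reads standard input).
degraded_md_arrays() {
    awk '
        # Array header line, e.g. "md0 : active raid1 ..."
        /^md[^ ]* :/ { name = $1 }
        # Status field made only of U and _ with at least one _
        /\[[U_]*_[U_]*\]/ && name != "" { print name; name = "" }
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_md_arrays with no argument prints nothing when every array shows all members up. Any name it prints, such as md0, can then be inspected with sudo mdadm -D as described in the steps that follow.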
Caution
Wear an ESD strap during any procedure that involves touching electronic components.
Identify the failed M.2 drive using the OS tools or the nvsm command.
sudo nvsm show health
The command output indicates the drive name, nvme0n1 or nvme1n1.
Confirm the drive name using the mdadm command:
sudo mdadm -D /dev/md0
The command output indicates the drive names and the drive state.
Contact NVIDIA Enterprise Support to request a replacement M.2 drive.
Back up any critical data to a network shared volume or other backup option.
When the new drive arrives, remove the failed drive from the mirrored volume.
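Run the following commands to mark the drive as failed and to remove the drive from the array. In each command, nvme[0/1]n1 is a placeholder: substitute the name of the failed drive, nvme0n1 or nvme1n1, as reported by mdadm.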
Mark the disk as failed, if it is not already marked as failed:
sudo mdadm --manage /dev/md0 --fail /dev/nvme[0/1]n1
Remove the failed disk from the array:
sudo mdadm --manage /dev/md0 --remove /dev/nvme[0/1]n1
Power off the system.
Remove the left BlueField-3 I/O bay to access the M.2 NVMe boot drives below.
Note
Each cable is labeled to ensure it is connected to the correct position after the procedure.
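The two mdadm steps above (mark as failed, then remove) can be wrapped so that the device name is typed once and checked before anything is changed. This is a sketch only, not part of the NVIDIA procedure: the function name remove_failed_member and the DRY_RUN variable are hypothetical, and the device pattern assumes the failed member is nvme0n1 or nvme1n1 (or a partition on one of them) as reported by mdadm -D. The mdadm options are the ones used in the steps above.

```shell
#!/bin/sh
# Hypothetical helper: mark one member of an md array as failed, then remove
# it from the array. By default it only prints the commands; set DRY_RUN=0
# to actually run them.
remove_failed_member() {
    array=$1    # for example /dev/md0
    member=$2   # for example /dev/nvme1n1, as reported by mdadm -D

    # Accept only the two boot drives (assumption based on this document).
    case $member in
        /dev/nvme[01]n1*) ;;
        *) echo "unexpected device: $member" >&2; return 1 ;;
    esac

    for action in --fail --remove; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "sudo mdadm --manage $array $action $member"
        else
            sudo mdadm --manage "$array" "$action" "$member" || return 1
        fi
    done
}
```

For example, remove_failed_member /dev/md0 /dev/nvme1n1 prints the two commands for review, and DRY_RUN=0 remove_failed_member /dev/md0 /dev/nvme1n1 runs them. Defaulting to a dry run is a deliberate choice here, since removing the wrong member would take the healthy half of the mirror offline.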
Remove the BlueField-3 I/O Bay#
After the four cables have been unplugged, press the left release tab and push the bay towards the front.
Carefully route the cables through the opening as the I/O bay is moved out of the motherboard tray.
Finish pulling the I/O bay out of the motherboard tray.
Ensure the motherboard tray levers remain fully extended, as shown in the illustration, so the M.2 bay can be pulled out.
Release the M.2 Bay#
Remove the M.2 Drive#
Before attempting to remove one of the M.2 NVMe drives, perform the following prerequisites:
Determine the location ID of the faulty M.2 drive.
Obtain the replacement M.2 drive and save the packaging for returning the faulty drive.
Identify the M.2 drive that needs to be replaced.
Note
The PCIe bus number corresponds to a specific drive on the board.
Remove the screw on the M.2 drive that needs to be replaced.
The screw is not captive and might fall and become lost.
Tilt the M.2 drive slightly so it can be ejected.
Pull the M.2 drive out of the connector and off the bay.
Install the New M.2 Drive#
Insert the M.2 Drive Bay and Reconnect the I/O Bay#
After the replacement is complete, ensure the ejection levers are completely open. Insert the M.2 drive bay into the corresponding lower slot until it locks in place.
Carefully route all the cables through the opening in the motherboard tray slot.
After inserting the I/O bay into the tray, check that the release tab has locked the bay in place.
Connect the two power cables and the two PCIe cables to their correct connectors on the switchboard, following the labels on each cable end.
To identify the correct connections, refer to this table that maps BlueField-3 card connectors to their corresponding board connectors.
BlueField-3 I/O Board | Left Slot Installation | Right Slot Installation
Cable Label P2        | Board connector J9     | Board connector J3
Cable Label P3        | Board connector J10    | Board connector J4
Integrate the New Drive and Complete the Installation#
Insert the motherboard following the instructions in Motherboard Tray - Opening and Closing.
Power on the system and log in.
Rebuild the RAID1 (mirror) volume on the boot drives.
Confirm the system is healthy by running the nvsm command.
sudo nvsm show health
Send the failed M.2 drive to NVIDIA Enterprise Support using the packaging provided.
Note
If your organization purchased a media retention policy, you might be able to keep the failed drives for destruction. Check with NVIDIA Enterprise Support on the status of the policy for specifics.
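While the mirror is rebuilding, the health check may still report the volume as degraded, so it can help to see how far the rebuild has progressed first. The sketch below assumes the boot mirror is a standard Linux md array whose state is exposed in /proc/mdstat. The function name md_rebuild_progress is hypothetical and is not part of nvsm or mdadm.

```shell
#!/bin/sh
# Hypothetical helper: print the progress figure (for example "12.6%") of the
# first recovery or resync reported in an mdstat-format file. Prints nothing
# when no rebuild is running. Reads /proc/mdstat unless another file is given
# as $1 ("-" reads standard input).
md_rebuild_progress() {
    awk '/(recovery|resync) *=/ {
        # The percentage is the first field on the line that ends in "%".
        for (i = 1; i <= NF; i++) if ($i ~ /%$/) { print $i; exit }
    }' "${1:-/proc/mdstat}"
}
```

Running md_rebuild_progress repeatedly (or under watch) shows the percentage climbing. Once it prints nothing, sudo mdadm -D /dev/md0 and sudo nvsm show health can be used to confirm the final state.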