M.2 NVMe Boot Drive Replacement#
This topic describes how to replace the boot drive in the NVIDIA DGX™ B200 system.
Caution
Static Sensitive Devices: Be sure to observe best practices for electrostatic discharge (ESD) protection. This includes ensuring personnel and equipment are connected to a common ground, such as by wearing a wrist strap connected to the chassis ground and placing components on static-free work surfaces.
M.2 NVMe Boot Drive Replacement Overview#
This is a high-level overview of the procedure to replace a boot drive.
Determine which M.2 device needs to be replaced with the help of NVIDIA Enterprise Support.
Get a replacement M.2 disk from NVIDIA Enterprise Support.
Ensure the system is shut down.
If cables do not reach, label all the cables and unplug them from the motherboard tray.
Slide the motherboard out until it locks in place.
Open the rear compartment.
Pull out the M.2 riser card with both M.2 disks attached.
Replace the failed M.2 device on the riser card.
Install the M.2 riser card with both M.2 disks.
Close the rear motherboard compartment.
Slide the motherboard back into the system.
Plug in all the cables using the labels as a reference.
Power on the system.
Confirm the M.2 RAID 1 mirror is synchronizing.
Send the failed unit to NVIDIA Enterprise Support using the packaging provided.
Identify the Failed M.2 Drive#
The NVIDIA DGX™ B200 system automatically sets the failed M.2 drive offline when it detects the failure.
The boot drives are mirrored, so the mdadm command-line utility can identify the drive to replace.
Determine which drive failed:
sudo nvsm show health
The command output indicates the drive name, nvme0n1 or nvme1n1.
Confirm the drive name by using the mdadm command:
sudo mdadm -D /dev/md0
The command output indicates the drive names and the drive state.
Contact NVIDIA Enterprise Support to request a replacement M.2 drive.
Back up any critical data to a network shared volume or other backup option.
When the new drive arrives, remove the failed drive from the mirrored volume.
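Run the following commands to mark the drive as failed and to remove it from the array. In these commands, /dev/nvme[0/1]n1 is a placeholder for the failed drive, either /dev/nvme0n1 or /dev/nvme1n1.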
Mark the disk as failed, if it is not already marked as failed:
sudo mdadm --manage /dev/md0 --fail /dev/nvme[0/1]n1
Remove the failed disk from the array:
sudo mdadm --manage /dev/md0 --remove /dev/nvme[0/1]n1
Power off the system.
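The identification and removal steps above can be sketched as two small shell helpers. This is a minimal sketch and not part of the documented procedure. The function names failed_members and drop_from_mirror are hypothetical. The first assumes the standard Linux /proc/mdstat format, in which the kernel tags a faulty array member with (F); on some systems the member names it prints may carry a partition suffix. The second runs nothing: it only prints the two mdadm commands from this section for a validated drive name, so they can be reviewed before use.

```shell
# Hypothetical helper: list md0 members that the kernel has flagged "(F)" (faulty).
# Reads /proc/mdstat-style text on stdin, so it also works on saved output.
failed_members() {
    awk '$1 == "md0" {
        for (i = 1; i <= NF; i++)
            if ($i ~ /\[[0-9]+\]\(F\)/) {
                sub(/\[.*/, "", $i)   # strip the "[n](F)" suffix
                print $i
            }
    }'
}

# Hypothetical helper: print (do not run) the fail and remove commands
# from this section for one of the two boot drive names.
drop_from_mirror() {
    drive="$1"
    case "$drive" in
        nvme0n1|nvme1n1) ;;   # the only names this procedure expects
        *) printf 'unexpected drive name: %s\n' "$drive" >&2; return 1 ;;
    esac
    printf '%s\n' "sudo mdadm --manage /dev/md0 --fail /dev/$drive"
    printf '%s\n' "sudo mdadm --manage /dev/md0 --remove /dev/$drive"
}
```

On a running system, failed_members would typically be fed with /proc/mdstat as standard input, and the name it reports should agree with the output of sudo mdadm -D /dev/md0 before any drive is removed.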
Remove the M.2 Boot Drive Carrier#
Before attempting to remove the M.2 boot drive carrier, perform the following prerequisites:
Label all network, monitor, and USB cables connected to the motherboard tray for easy identification when reconnecting.
Unplug all power cords, network, monitor, and USB cables.
For more information, refer to Motherboard Tray - Opening and Closing the I/O Door.
After the I/O section of the motherboard is open, loosen the black captive thumbscrew on the right side of the motherboard for the PCI card locking mechanism.
Rotate the locking mechanism for the PCI carrier out of the way.
Loosen the captive screw on the support bracket of the M.2 riser card.
Pull the M.2 riser card from the slot.
Lift the M.2 riser card to remove it from the system.
Remove the M.2 Drive#
Before attempting to remove one of the M.2 NVMe drives, perform the following prerequisites:
Determine the location ID of the faulty M.2 drive.
Obtain the replacement M.2 drive and save the packaging for returning the faulty drive.
Replace the M.2 Drive#
Install the M.2 Boot Drive Carrier and Close the System#
Lower the M.2 riser card into the slot.
Install the M.2 carrier card into the PCI riser by aligning it with the slot and then pressing it against the PCI slot riser.
Tighten the captive screw on the support bracket of the M.2 PCI riser card.
Close the latch to secure the M.2 carrier card in place.
Tighten the thumbscrew to ensure the locking mechanism stays in place.
Integrate the New Drive and Complete the Installation#
Return the motherboard to its regular position and power on the system.
For more information, refer to Motherboard Tray - Opening and Closing the I/O Door.
Boot the operating system.
Run the following command to rebuild the boot drive mirror:
sudo nvsm start /systems/localhost/storage/volumes/md0/rebuild/
When prompted, enter the device name of the spare (replaced) drive, nvme0n1 or nvme1n1.
PROMPT: In order to rebuild this volume, a spare drive is required.
Please specify the spare drive to use to rebuild md0.
Name of spare drive for md0 rebuild (CTRL-C to cancel): nvmeXn1
WARNING: Once the volume rebuild process is started, the process cannot be stopped.
Start RAID-1 rebuild on md0? [y/n] y
After entering y at the prompt to start the RAID 1 rebuild, the Initiating rebuild ... message appears.
After about 30 seconds, the Rebuilding RAID-1 ... message should appear.
/systems/localhost/storage/volumes/md0/rebuild started at 2018-10-12 15:27:26.525187
Rebuilding RAID-1 rebuild on volume md0...
31.0% [=============/ ]
If this message remains at Initiating RAID-1 rebuild for more than 30 seconds, the rebuild process cannot be completed successfully. In this case, ensure the name of the replacement drive is correct and try again.
Use the packaging from the new drive to send the failed drive to NVIDIA Enterprise Support.
Note
If your organization purchased a media retention policy, you might be able to keep the failed drives for destruction. Check with NVIDIA Enterprise Support on the status of the policy for specifics.
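To confirm that the RAID 1 mirror is synchronizing after the rebuild has started, progress can also be read from the kernel's md status file. The following is a minimal sketch and not part of the documented procedure. The function name rebuild_progress is hypothetical, and it assumes the standard Linux /proc/mdstat layout, in which a rebuilding array shows a line containing "recovery = NN.N%" or "resync = NN.N%".

```shell
# Hypothetical helper: print the md0 rebuild percentage from /proc/mdstat-style
# text on stdin; prints nothing when no recovery or resync is in progress.
rebuild_progress() {
    awk '
        /^md0 /      { in_md0 = 1; next }   # start of the md0 stanza
        /^md[0-9]+ / { in_md0 = 0 }         # any other array ends it
        in_md0 && /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) print $i     # the field that ends in "%"
        }
    '
}
```

On the system itself, the function would be run with /proc/mdstat as standard input. Empty output means no rebuild is currently in progress for md0, either because it has finished or because it never started.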