M.2 NVMe Boot Drive Replacement

Caution

Static Sensitive Devices: Be sure to observe best practices for electrostatic discharge (ESD) protection. This includes making sure personnel and equipment are connected to a common ground, such as by wearing a wrist strap connected to the chassis ground, and placing components on static-free work surfaces.

M.2 NVMe Boot Drive Replacement Overview

This is a high-level overview of the procedure to replace a boot drive.

  1. Determine which M.2 device needs to be replaced with the help of NVIDIA Enterprise Support

  2. Get a replacement M.2 disk from NVIDIA Enterprise Support

  3. Make sure the system is shut down

  4. If cables don’t reach, label all cables and unplug them from the motherboard tray

  5. Slide the motherboard out until it locks in place

  6. Open the rear motherboard compartment

  7. Pull out the M.2 riser card with both M.2 disks attached

  8. Replace the failed M.2 device on the riser card

  9. Install the M.2 riser card with both M.2 disks

  10. Close the rear motherboard compartment

  11. Slide the motherboard back into the system

  12. Plug in all cables using the labels as a reference

  13. Power on the system

  14. Confirm the M.2 RAID 1 mirror is synchronizing

  15. Ship back the failed unit to NVIDIA Enterprise Support using the packaging provided

Identify the Failed M.2 NVMe

The DGX H100 system automatically sets the failed M.2 drive offline when it detects the failure. The boot drives are mirrored, so the mdadm command-line utility can identify the drive to replace.

  1. Determine which drive failed:

    sudo nvsm show health
    

    The command output indicates the drive name, nvme0n1 or nvme1n1.

  2. Confirm the drive name by using the mdadm command:

    sudo mdadm -D /dev/md0
    

    The command output indicates the drive names and the drive state.

  3. Contact NVIDIA Enterprise Support to request a replacement M.2 drive.

  4. When the new drive arrives, you must remove the failed drive from the RAID volume. Run the following commands to mark the drive as failed and to remove the drive from the array.

    1. Mark the disk as failed, if it is not already marked as failed:

      sudo mdadm --manage /dev/md0 --fail /dev/nvmeXn1
      
    2. Remove the failed disk from the array:

      sudo mdadm --manage /dev/md0 --remove /dev/nvmeXn1
      

    Replace X in the preceding commands with the ID of the failed drive.

  5. Back up any critical data to a network share or another backup location.

  6. Power down the system.
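
The identification steps above can be sketched as a small shell helper. This is a minimal sketch and not part of the NVIDIA procedure: the function name faulty_members is hypothetical, and it assumes that the mdadm -D output lists each array member on its own line with the device path as the last field, and that a failed member carries the word faulty in its state.

```shell
# Hypothetical helper (not an NVIDIA tool): read `mdadm -D` style output
# on stdin and print the device path of every member marked "faulty".
# Assumption: the device path is the last field on each member line.
faulty_members() {
    awk '/faulty/ && $NF ~ /^\/dev\// { print $NF }'
}

# Example use on a live system (commented out so that sourcing this
# sketch has no side effects):
#   sudo mdadm -D /dev/md0 | faulty_members
```

If the helper prints nothing, the array has no member in the faulty state, and the drive reported by nvsm show health may need to be marked as failed manually, as described in step 4.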

Remove the M.2 Boot Drive Carrier

Before attempting to remove the M.2 boot drive carrier, make sure that you performed the following prerequisites:

  • Label all network, monitor, and USB cables connected to the motherboard tray for easy identification when reconnecting.

  • Unplug all power cords, and all network, monitor, and USB cables.

Refer to Motherboard Tray - Opening and Closing the IO door for more information.

  1. After the IO section of the motherboard is open, unlock the M.2 drive carrier by loosening the black captive thumbscrew of the PCI card locking mechanism on the right side of the motherboard:

    _images/dgx-h100-unlock-m2-carrier.png
  2. Rotate the locking mechanism for the PCI carrier out of the way:

    _images/dgx-h100-lock-remove.png
  3. Loosen the captive screw on the support bracket of the M.2 riser card:

    _images/dgx-h100-pci-riser-loosen.png
  4. Pull the M.2 riser card from the slot:

    _images/card-remove-pci.png
  5. Lift the M.2 riser card to remove it from the system:

    _images/dgx-h100-pci-riser-lift.png

Remove the M.2 Drive

Before attempting to remove one of the M.2 NVMe drives, make sure that you performed the following prerequisites:

  • Determined the location ID of the faulty M.2 drive.

  • Obtained the replacement M.2 drive and saved the packaging for use when returning the faulty drive.

  1. Identify the M.2 NVMe drive that needs to be replaced:

    _images/nvme-card.png
  2. Loosen the screw of the identified M.2 drive:

    _images/nvme-card-2.png
  3. Pull the left end of the M.2 drive up about 30˚:

    _images/dgx-h100-nvme-lift-30.png
  4. To remove the M.2 drive, hold it at the raised angle (up to 30˚) and pull it out of the socket, as shown in the following figure:

    _images/dgx-h100-nvme-remove.png

Replace the M.2 Drive

  1. To insert the M.2 drive, set it at an angle and insert it into the connector:

    _images/nvme-drive-1.png
  2. Lower the M.2 drive and align it with the screw post:

    _images/nvme-drive-2.png
  3. Install and tighten the screw to secure the drive to the riser:

    _images/nvme-drive-3.png

Install the M.2 Boot Drive Carrier and Close the System

  1. Position the M.2 riser card into the system:

    _images/card-m2-riser.png
  2. Install the M.2 carrier card into the PCI riser by aligning it with the slot and then pressing it against the riser:

    _images/card-m2-riser-2.png
  3. Tighten the captive screw on the support bracket of the M.2 riser card:

    _images/dgx-h100-pci-riser-tighten.png
  4. Close the latch to secure the M.2 carrier in place:

    _images/pci-carrier-lock.png
  5. Tighten the thumbscrew to make sure the locking mechanism stays in place:

    _images/rear-captive-lock.png

Integrate the New Drive and Complete Installation

  1. Return the motherboard to its regular position and power on the system.

    Refer to Motherboard Tray - Opening and Closing the IO door for more information.

  2. Boot the operating system.

  3. Run the following command to rebuild the boot drive mirror:

    sudo nvsm start /systems/localhost/storage/volumes/md0/rebuild/
    
  4. At the Type of volume rebuild prompt, enter raid-1 and press Enter:

    PROMPT: In order to rebuild volume, volume type is required. Please
         specify the volume type to rebuild from options below.
         raid-0: create raid-0 data volume
         raid-1: rebuild OS boot and root volumes
         esp:    find and replicate an empty EFI system partition
    
    Type of volume rebuild (CTRL-C to cancel): raid-1
    
  5. At the Name of spare drive prompt, enter the replacement drive name, nvme0n1 or nvme1n1, and press Enter:

    PROMPT: In order to rebuild this volume, a spare drive
         is required. Please specify the spare drive to
         use to rebuild RAID-1.
    
    Name of spare drive for RAID-1 rebuild (CTRL-C to cancel): nvmeXn1
    
  6. At the warning prompt, enter y and press Enter:

    WARNING: Once the volume rebuild process is started, the process cannot be stopped.
    Start RAID-1 rebuild on md0? [y/n] y
    

    Example Output

    Initializing rebuild ...
    
  7. Monitor the progress. After approximately 30 seconds, the following message appears:

    /systems/localhost/storage/volumes/md0/rebuild started at 2018-10-12
              15:27:26.525187
    Rebuilding RAID-1 rebuild on volume md0…
    31.0% [=============/ ]
    

    If this message remains at Initiating RAID-1 rebuild for more than 30 seconds, then there is a problem with the rebuild process. In this case, make sure the name of the replacement drive is correct and try again.

  8. Use the packaging from the new drive to ship the failed drive back to NVIDIA Enterprise Support.
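
Besides the NVSM progress display, the state of the mirror can also be read from the Linux md status file, /proc/mdstat. The following is a minimal sketch under stated assumptions, not part of the NVIDIA procedure: the function name rebuild_percent is hypothetical, and it assumes that the kernel reports a rebuild onto the replacement drive with a line of the form "recovery = NN.N%".

```shell
# Hypothetical helper (not an NVIDIA tool): read /proc/mdstat style text
# on stdin and print the recovery percentage of any rebuilding array.
# Prints nothing when no "recovery = NN.N%" line is present.
rebuild_percent() {
    sed -n 's/.*recovery *= *\([0-9.]*\)%.*/\1/p'
}

# Example use on a live system (commented out so that sourcing this
# sketch has no side effects):
#   rebuild_percent < /proc/mdstat
#   sudo mdadm -D /dev/md0
```

An empty result means that no recovery line was found, which is expected both before the rebuild starts and after it completes. The mdadm -D output shows the array and member states in either case.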

Note

If your organization purchased a media retention policy, you might be able to keep failed drives for destruction. Check with NVIDIA Enterprise Support on the status of the policy for specifics.