M.2 NVMe Boot Drive Replacement

M.2 NVMe Boot Drive Replacement Overview

This is a high-level overview of the procedure to replace a boot drive.
  1. With the help of NVIDIA Enterprise Support, determine which M.2 drive needs to be replaced.
  2. Obtain the replacement drive from NVIDIA Enterprise Support.
  3. Power down the system.
  4. Label all cables and unplug them from the motherboard tray.
  5. Remove the motherboard tray and place it on a solid, flat surface.
  6. Remove the motherboard tray lid.
  7. Pull out the M.2 riser card with both M.2 disks attached.
  8. Replace the failed M.2 device on the riser card.
  9. Install the M.2 riser card with both M.2 disks.
  10. Close the lid on the motherboard tray.
  11. Insert the motherboard tray into the system.
  12. Plug in all cables using the labels as a reference.
  13. Power on the system.
  14. Confirm the RAID 1 array is being rebuilt.

Identifying the Failed M.2 NVMe

The DGX-2 System automatically sets the failed M.2 drive offline when it detects the failure.
  1. From the console, run the following command to identify the failed drive.
    $ sudo mdadm -D /dev/md0
    Normally, the output would show both drives (nvme0 and nvme1) in an active sync state. The following example output shows only nvme1 in active sync, indicating that nvme0 is the failed drive.
    Number   Major   Minor  RaidDevice  State
       0     259       2       0      active sync  /dev/nvme1n1p2
       -       0       0       1      removed 
  2. Make a note of the device names of both the failed drive and the good drive (each is either nvme0 or nvme1). You will need this information when rebuilding the RAID 1 array after replacing the drive.
  3. Run the following command to determine the location of the failed boot drive, replacing X with the number corresponding to the device name of the failed drive.
    $ ls -l /dev/disk/by-path |grep nvmeX |cut -d':' -f3
    The output will be either '01' or '05'. Be sure to note this number as you will need it when performing the replacement.
  4. Identify the manufacturer and model of the M.2 drive by running the following command on the healthy drive, where X corresponds to the healthy drive, and inspecting the Manufacturer = and Model = lines.
    $ sudo nvsm show /systems/localhost/storage/drives/nvmeXn1
  5. Provide the vendor name of the drive (the Manufacturer value from the previous step) when ordering the replacement, and then obtain the replacement from NVIDIA Enterprise Support.
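
The check in steps 1 and 2 can also be scripted. The following Python sketch is illustrative only and is not part of the DGX-2 tooling: it assumes the device table printed by mdadm -D has the column layout shown in the example above, and the function names parse_md_members and good_and_failed are hypothetical.

```python
import re

def parse_md_members(mdadm_detail):
    """Return (raid_slot, state, device_or_None) for each device-table row."""
    members = []
    for line in mdadm_detail.splitlines():
        tokens = line.split()
        # Rows look like: Number Major Minor RaidDevice State [/dev/...]
        if len(tokens) < 5:
            continue
        if not (tokens[0] == "-" or tokens[0].isdigit()):
            continue
        if not all(t.isdigit() for t in tokens[1:4]):
            continue
        if tokens[-1].startswith("/dev/"):
            device, state = tokens[-1], " ".join(tokens[4:-1])
        else:
            device, state = None, " ".join(tokens[4:])
        members.append((int(tokens[3]), state, device))
    return members

def good_and_failed(members, controllers=("nvme0", "nvme1")):
    """Return (good, failed) controller names, or None unless exactly one
    of the two expected drives is in the 'active sync' state."""
    active = set()
    for _slot, state, device in members:
        match = re.search(r"nvme\d+", device or "")
        if match and state == "active sync":
            active.add(match.group(0))
    if len(active) != 1:
        return None
    good = active.pop()
    failed = [name for name in controllers if name != good]
    if good not in controllers or len(failed) != 1:
        return None
    return good, failed[0]

# Device table copied from the example output above.
SAMPLE = """\
    Number   Major   Minor  RaidDevice  State
       0     259       2       0      active sync  /dev/nvme1n1p2
       -       0       0       1      removed
"""

if __name__ == "__main__":
    print(good_and_failed(parse_md_members(SAMPLE)))  # -> ('nvme1', 'nvme0')
```

In practice the text passed to parse_md_members would be the captured output of sudo mdadm -D /dev/md0. A result of None means the table did not show exactly one healthy drive, in which case inspect the output manually.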

Replacing the M.2 NVMe Drive

Before attempting to replace one of the M.2 NVMe drives, be sure to have performed the following:
  • Determined the location ID of the faulty M.2 NVMe drive.
  • Obtained the replacement M.2 NVMe drive and saved the packaging for use when returning the faulty drive.
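
Because the location ID cannot be looked up once the system is powered down, it may be worth rechecking it before continuing. The Python sketch below reproduces the ls, grep, and cut pipeline from Identifying the Failed M.2 NVMe. It rests on assumptions: that entries in /dev/disk/by-path are named in the form pci-<domain>:<bus>:<device>.<function>-nvme-<n>, and that the whole-disk device is named nvmeXn1. The function names and the sample link names in the comments are hypothetical.

```python
import os
import re

# Assumed by-path naming for whole-disk NVMe entries, for example
# "pci-0000:01:00.0-nvme-1". Partition entries carry a "-partN" suffix
# and deliberately do not match.
BY_PATH = re.compile(
    r"^pci-[0-9a-fA-F]{4}:([0-9a-fA-F]{2}):[0-9a-fA-F]{2}\.[0-7]-nvme-\d+$"
)

def read_by_path(directory="/dev/disk/by-path"):
    """Map each by-path entry name to the target of its symlink."""
    return {
        name: os.readlink(os.path.join(directory, name))
        for name in os.listdir(directory)
    }

def location_id(links, controller):
    """Return the PCI bus field (the third ':'-separated field that the
    documented cut command prints) for the given controller, or None."""
    for name, target in links.items():
        match = BY_PATH.match(name)
        if match and target.rsplit("/", 1)[-1] == controller + "n1":
            return match.group(1)
    return None

if __name__ == "__main__":
    # Replace "nvme0" with the device name of the failed drive.
    print(location_id(read_by_path(), "nvme0"))
```

The result should be '01' or '05', as described earlier. If the function returns None, fall back to the documented command and read the value from its output.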

CAUTION: Static Sensitive Devices. Be sure to observe best practices for electrostatic discharge (ESD) protection. This includes making sure personnel and equipment are connected to a common ground, such as by wearing a wrist strap connected to the chassis ground, and placing components on static-free work surfaces.

  1. Back up any critical data to a network shared volume or some other means of backup.
  2. Power down the system.
  3. Label all cables connected to the motherboard tray for easy identification when reconnecting.
  4. Remove the motherboard tray.

    Refer to the instructions in the section Removing the Motherboard Tray.

  5. Remove the M.2 modules and the riser card from the motherboard tray by pushing on the clip to release the riser.

  6. Identify the failed M.2 module and remove it from the riser card by loosening the screw with a Phillips #2 screwdriver.

    Use the label on the motherboard tray lid to help identify the M.2_0 module and the M.2_1 module.

  7. Insert the new M.2 module and secure it with the screw to the riser card.
  8. Install the assembled module on the motherboard by inserting the riser card in its slot.

  9. Install the motherboard tray lid and then install the motherboard tray.

    Refer to the instructions in the section Installing the Motherboard Tray.

  10. Connect all the cables to the motherboard tray.
  11. Rebuild the RAID 1 array according to the instructions in the section Rebuilding the Boot Drive RAID 1 Volume.

Rebuilding the Boot Drive RAID 1 Volume

After replacing a faulty M.2 OS drive, you must rebuild the RAID 1 array.
  1. Turn the DGX-2 System on. The rebuilding process should begin automatically upon system boot.
  2. Log in and then confirm that the RAID 1 array is being rebuilt.
    $ sudo mdadm -D /dev/md0 
    • If the RAID 1 array is still in the process of being rebuilt, the output will include the following line.
      Rebuild Status  :   XX% complete
    • If the RAID 1 array rebuilding process is completed, the output will show both drives in 'active sync' state and you can skip the remaining steps.
  3. If the rebuilding process did not start automatically, then rebuild the array manually. In the following steps, replace X with the number that corresponds to the replaced drive, and Y with the number that corresponds to the drive that was not replaced (the surviving drive). If you did not note this information when identifying the failed drive, then follow the instructions in the first step of Identifying the Failed M.2 NVMe.
    1. Start an NVSM CLI interactive session and switch to the storage target.
      $ sudo nvsm
      nvsm-> cd /systems/localhost/storage
    2. Start the rebuilding process and be ready to enter the device name of the replaced drive.
      nvsm(/systems/localhost/storage)-> start volumes/md0/rebuild
      PROMPT: In order to rebuild this volume, a spare drive
              is required. Please specify the spare drive to
              use to rebuild md0.
      Name of spare drive for md0 rebuild (CTRL-C to cancel): nvmeXn1
      WARNING: Once the volume rebuild process is started, the
               process cannot be stopped.
      Start RAID-1 rebuild on md0? [y/n] y
      
      After entering y at the prompt to start the RAID 1 rebuild, the "Initiating RAID-1 rebuild ..." message appears.
      /systems/localhost/storage/volumes/md0/rebuild started at 2018-10-12
      15:27:26.525187
      Initiating RAID-1 rebuild on volume md0...
        0.0% [\                              ]  
      After about 30 seconds, the "Rebuilding RAID-1 ..." message should appear.
      /systems/localhost/storage/volumes/md0/rebuild started at 2018-10-12
      15:27:26.525187
      Rebuilding RAID-1 rebuild on volume md0...
        31.0% [=============/                         ]  

      If this message remains at "Initiating RAID-1 rebuild" for more than 30 seconds, then there is a problem with the rebuild process. In this case, make sure the name of the replacement drive is correct and try again.

      The RAID 1 rebuild process should take about 1 hour to complete.
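
Since the rebuild can take about an hour, the status check from step 2 can be scripted rather than read by eye. The Python sketch below is illustrative only. It assumes that mdadm -D reports progress on a status line ending in "NN% complete" and lists each healthy member as "active sync" followed by its device path, as in the examples in this document. The function names are hypothetical.

```python
import re
import subprocess

def read_mdadm_detail(array="/dev/md0"):
    """Capture the output of 'sudo mdadm -D <array>' as text."""
    result = subprocess.run(
        ["sudo", "mdadm", "-D", array],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def rebuild_state(mdadm_detail):
    """Classify mdadm -D output as one of:
    ('rebuilding', percent), ('synced', None) or ('degraded', None)."""
    # Accept either spelling of the progress label.
    progress = re.search(
        r"Rebuil[dt] Status\s*:\s*(\d+)%\s*complete", mdadm_detail
    )
    if progress:
        return ("rebuilding", int(progress.group(1)))
    # Count members reported as 'active sync' with a device path.
    active = len(re.findall(r"active sync\s+/dev/\S+", mdadm_detail))
    if active >= 2:
        return ("synced", None)
    return ("degraded", None)

if __name__ == "__main__":
    print(rebuild_state(read_mdadm_detail()))
```

A 'degraded' result after the replacement means that no rebuild is in progress and only one member is active, which is the situation in which the manual rebuild in step 3 applies.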

Returning the NVMe Drive/Riser Assembly

Use the packaging from the new drive/riser assembly and follow the instructions that came with the package to ship the old drive/riser assembly back to NVIDIA Enterprise Support.
Note: If your organization has purchased a media retention policy, you may be able to keep failed drives for destruction. Check with NVIDIA Enterprise Support on the status of the policy for specifics.