U.2 NVMe Cache Drive Replacement

U.2 NVMe Cache Drive Replacement Overview

This is a high-level overview of the procedure to replace a cache drive.
CAUTION: Hot-swapping NVMe drives is not supported. Power off the system before replacing a failed drive.
  1. Identify the failed Non-Volatile Memory Express (NVMe) drive.
  2. Obtain a replacement drive from NVIDIA Enterprise Support.
  3. Power down the system and then remove the failed NVMe drive.
  4. Insert the new NVMe drive.
  5. Power on the DGX-2 System.
  6. Rebuild the RAID volume and remount the /raid partition.

Identifying the Failed U.2 NVMe

Identifying the Failed NVMe from the Front

If physical access to the system is available, you can identify a failed drive by the blinking red LED as illustrated in the following example.

Identifying the Failed NVMe from the Console

To identify the failed NVMe drive from the DGX-2 console, enter the following command and then look for an entry that is missing from the output.

$ sudo mdadm -D /dev/md1
 
    Number   Major   Minor   RaidDevice State
       0     259        8        0      active sync   /dev/nvme9n1
       1     259       13        1      active sync   /dev/nvme5n1
       2     259        7        2      active sync   /dev/nvme6n1
       3     259       10        3      active sync   /dev/nvme3n1
       4     259       12        4      active sync   /dev/nvme2n1
       5     259       11        5      active sync   /dev/nvme7n1
       6     259        9        6      active sync   /dev/nvme8n1
       7     259        6        7      active sync   /dev/nvme4n1

The list should include device names from nvme2n1 through nvme9n1 for systems with 8 NVMe drives, and from nvme0n1 through nvme15n1 for systems with 16 NVMe drives.
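The comparison against the expected device names can be scripted. The following is a minimal sketch, not an NVIDIA-provided tool: the function name `missing_nvme` is hypothetical, and it assumes the device names follow the nvmeNn1 pattern described above. It reads `mdadm -D` output on standard input and prints each expected name that does not appear.

```shell
# Hypothetical helper (not part of any NVIDIA tool): prints every expected
# NVMe device name that does not appear in `mdadm -D` output read from stdin.
# Arguments: first and last device index (2 9 for 8 drives, 0 15 for 16).
missing_nvme() {
    mn_listing=$(cat)                  # capture the mdadm output
    mn_i=$1
    while [ "$mn_i" -le "$2" ]; do
        # -w matches whole words only; -q suppresses grep's own output
        if ! printf '%s\n' "$mn_listing" | grep -qw "nvme${mn_i}n1"; then
            echo "nvme${mn_i}n1"
        fi
        mn_i=$((mn_i + 1))
    done
}
```

On a system with 8 NVMe drives, it could be run as `sudo mdadm -D /dev/md1 | missing_nvme 2 9`. No output means every expected device is listed.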

To map the device name to the physical slot ID, enter the following, where X is the number in the missing device name (for example, 5 for nvme5n1).

$ ls -l /dev/disk/by-path | grep nvmeXn1 | cut -d':' -f3

The command returns the PCIe bus ID. Refer to the following figure to find the slot ID that corresponds to the PCIe bus ID for the faulty drive.
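As an alternative that does not depend on the column layout of `ls -l`, the lookup can be done by reading the symlinks directly. This is a sketch under assumptions: the function name `nvme_bus_id` is hypothetical, and the link names in /dev/disk/by-path are assumed to have the form pci-DDDD:BB:DD.F-nvme-1, where BB is the PCIe bus ID.

```shell
# Hypothetical helper: print the PCIe bus ID for an NVMe device name by
# scanning the symlinks in a by-path directory (default /dev/disk/by-path).
# Assumes link names of the form pci-DDDD:BB:DD.F-nvme-1.
nvme_bus_id() {
    nb_dev=$1
    nb_dir=${2:-/dev/disk/by-path}
    for nb_link in "$nb_dir"/pci-*-nvme-*; do
        [ -L "$nb_link" ] || continue
        # The link target ends in the kernel device name, e.g. ../../nvme5n1
        if [ "$(basename "$(readlink "$nb_link")")" = "$nb_dev" ]; then
            # Second colon-separated field of the link name is the bus ID
            basename "$nb_link" | cut -d: -f2
        fi
    done
}
```

For example, `nvme_bus_id nvme5n1` prints the bus ID to look up in the slot mapping figure. If nothing is printed, no link for that device name was found.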

Figure 1. NVMe Drives: PCIe to Slot Mapping

Identifying the NVMe Manufacturer and Model

Enter the following, replacing X with the number corresponding to the Linux device name for the failed drive.

$ sudo nvsm show /systems/localhost/storage/drives/nvmeXn1 

Example output:

/systems/localhost/storage/drives/nvme5n1
Properties:
    Capacity = 3840755982336
    BlockSizeBytes = 7501476528
    SerialNumber = 174719FCF9F1
    PartNumber = N/A
    Model = Micron_9200_MTFDHAL3T8TCT
    Revision = 100007H0
    Manufacturer = Micron Technology Inc
    Status_State = Enabled
    Status_Health = OK
    Name = Non-Volatile Memory Express
    MediaType = SSD
    IndicatorLED = N/A
    EncryptionStatus = N/A
    HotSpareType = N/A
    Protocol = NVMe
    NegotiatedSpeedsGbs = 0
    Id = 5

Determine the manufacturer and model from the 'Model' entry in the output, and then request a replacement NVMe drive from NVIDIA Enterprise Support, specifying this information.
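To pull only the relevant fields out of the output, a small filter can help. This is a minimal sketch, assuming the "Key = Value" line format shown in the example output above. The function name `nvsm_prop` is hypothetical and is not part of nvsm.

```shell
# Hypothetical helper: print the value of one property from
# `nvsm show` output read on stdin (lines of the form "    Key = Value").
nvsm_prop() {
    awk -F' = ' -v key="$1" '
        { name = $1; sub(/^[ \t]+/, "", name) }   # strip the leading indent
        name == key { print $2 }                  # print the matching value
    '
}
```

For example, `sudo nvsm show /systems/localhost/storage/drives/nvme5n1 | nvsm_prop Model` prints only the model string. Pass `Manufacturer` instead of `Model` to print the manufacturer.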

Replacing the U.2 NVMe Drive

  1. Be sure you have obtained the replacement drive.
  2. Back up any critical data to a network shared volume or another backup location.
  3. Power off the system using the power button.
  4. Remove the NVMe drive by squeezing the levers on the handle and pulling the drive out.
  5. Install the new NVMe drive in the same slot, inserting it fully until it clicks into place.
  6. Power on the system.
After the system powers on, perform the tasks described in the chapter U.2 NVMe Cache Drive Post-Installation Tasks.