U.2 NVMe Cache Drive Replacement

U.2 NVMe Cache Drive Replacement Overview

This is a high-level overview of the procedure to replace a cache Non-Volatile Memory Express (NVMe) drive.

  1. Identify failed SSD

  2. Get replacement SSD from NVIDIA Enterprise Support

  3. Power off the system

  4. Remove failed SSD identified earlier

  5. Insert new SSD

  6. Power on the system

  7. Rebuild the RAID volume and mount the filesystem

  8. Ship the failed unit back to NVIDIA Enterprise Support using the packaging provided

Identifying the Failed U.2 NVMe SSD

Identifying the Failed NVMe from the Front

If physical access to the system is available, you can identify a failed drive by the illuminated amber LED.

_images/u2-nvme-mapping-h100.png

Identifying the Failed NVMe from the Console

  • To identify the failed data drive, you can use the nvsm command:

    sudo nvsm show health
    

    View the command output and look for drive alerts to identify the failed drive.

Alternatively, you can use the BMC web user interface to access the Sensor screen, the IPMI event log, and the System log to identify issues with the U.2 drives.
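When the health output from the console check above is long, it can help to narrow it to the lines that mention drives. The following is a minimal sketch only: `drive_alerts` is a hypothetical helper name, and the keywords it matches (`nvme`, `drive`, `alert`) are assumptions, because the exact wording of the nvsm health output is not shown in this section.

```shell
# Hypothetical helper: read health output on stdin and keep only the lines
# that mention NVMe devices, drives, or alerts (case-insensitive).
# The keyword list is an assumption; adjust it to match the real output.
drive_alerts() {
    # "|| true" keeps the exit status at 0 when no line matches.
    grep -i -E 'nvme|drive|alert' || true
}

# Usage on the system:
#   sudo nvsm show health | drive_alerts
```

An empty result means only that none of the assumed keywords appeared. Review the full output before concluding that the drives are healthy.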

Identifying the NVMe Manufacturer and Model

  • Use the nvsm command to display the drive information:

    sudo nvsm show /systems/localhost/storage/drives/nvmeXn1
    

    Replace X in the preceding command with the number that corresponds to the Linux device name for the failed drive.

    Example Output

     /systems/localhost/storage/drives/nvme5n1
     Properties:
         PhysicalLocation_Info = SlotU.2_Slot3
         BlockSizeBytes = 512
         SerialNumber = 22L0A01WT2N8
         Model = KCM6DRUL3T84
         Revision = 0107
         Manufacturer = KIOXIA Corporation
         Status_State = Enabled
         Status_Health = OK
         Name = nvme5n1
         MediaType = SSD
         EncryptionStatus = Unlocked
         CapacityBytes = 3840755982336
         Id = nvme5n1
     Targets:
     Verbs:
         cd
         set
         show
    

    Refer to the Manufacturer and Model fields in the output. Request a replacement NVMe from NVIDIA Enterprise Support, specifying this information.
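    To pull only the two fields needed for the support request, the output can be filtered. This is a sketch that assumes the `Name = Value` layout shown in the example output above; `drive_identity` is a hypothetical helper name, not part of nvsm.

```shell
# Hypothetical helper: read "nvsm show" drive output on stdin and print only
# the Manufacturer and Model properties as "Name: Value".
# Assumes each property is printed as "<spaces>Name = Value", as in the example.
drive_identity() {
    sed -n -E 's/^[[:space:]]*(Manufacturer|Model) = /\1: /p'
}

# Usage on the system (nvme5n1 is the device from the example output):
#   sudo nvsm show /systems/localhost/storage/drives/nvme5n1 | drive_identity
```

    The fields are printed in the order they appear in the output, so the example above would give the Model line first and the Manufacturer line second.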

Replacing the U.2 NVMe Drive

  1. Make sure that you requested and obtained the replacement drive from NVIDIA Enterprise Support.

  2. Back up any critical data to a network shared volume or another backup location.

  3. Power off the system using the power button.

  4. Remove the bezel. Refer to Removing and Attaching the Bezel for more information.

  5. After the system powers off, use the following figure to identify the drive to replace on the chassis.

    The figures in the following procedures show replacing drive number 7 at PCI address ae.

    _images/u2-nvme-mapping-h100.png
  6. Remove the NVMe drive.

    1. Press the tab on the right side of the drive to release the lever:

      _images/dgx-h100-nvme-lever.png
    2. Pull the drive out by using the lever:

      _images/dgx-h100-nvme-lever-remove.png
    3. Remove the drive:

      _images/dgx-h100-u2-nvme-remove.png

Inserting the U.2 NVMe Drive

  1. Open the lever on the drive and insert the replacement drive in the same slot:

    _images/dgx-h100-nvme-install.png
  2. Close the lever and secure it in place:

    _images/dgx-h100-nvme-lever-close.png
  3. Confirm the drive is flush with the system:

    _images/dgx-h100-nvme-flush.png
  4. Install the bezel after the drive replacement is complete.

  5. Power on the system.
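After the system is powered on, one way to confirm that the operating system sees the replacement drive is to list the NVMe block devices. This is a sketch, not part of the documented procedure: `list_nvme` is a hypothetical helper, and it assumes the standard `lsblk` utility is available on the system.

```shell
# Hypothetical helper: read "lsblk -d -n -o NAME,MODEL,SERIAL,SIZE" output on
# stdin and keep only whole NVMe namespaces (names such as nvme5n1).
list_nvme() {
    awk '$1 ~ /^nvme[0-9]+n[0-9]+$/'
}

# Usage on the system:
#   lsblk -d -n -o NAME,MODEL,SERIAL,SIZE | list_nvme
```

The replacement drive should appear with a serial number that differs from the one recorded for the failed drive.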

Next Steps