U.2 NVMe Cache Drive Replacement

Important:

Replace only with U.2 NVMe drives of the same manufacturer, model, and density (capacity) as the existing or replaced drives.

U.2 NVMe Cache Drive Replacement Overview

This is a high-level overview of the procedure to replace a Non-Volatile Memory Express (NVMe) cache drive.
  1. Identify the failed U.2 NVMe drive.
  2. Order a replacement from NVIDIA Enterprise Support.
  3. Use nvsm to prepare the drive for removal - look for the white LED.
  4. Replace the failed NVMe drive.
  5. Rebuild the RAID volume and remount the /raid partition.
  6. Confirm the system is healthy by running nvsm show health.
  7. Ship the failed unit back to NVIDIA Enterprise Support using the provided packaging.

Identifying the Failed U.2 NVMe

Identifying the Failed NVMe from the Front

If physical access to the system is available, you can identify a failed drive by its illuminated amber LED.

Identifying the Failed NVMe from the Console

To identify the failed NVMe drive from the DGX A100 console, enter the following command and look for drive alerts in the output.

$ sudo nvsm show health

The command returns the PCIe bus ID. Refer to the following figure to find the slot ID that corresponds to the PCIe bus ID for the faulty drive.

Figure 1. NVMe Drives: PCIe to Slot Mapping

Alternatively, you can use the BMC dashboard to access the Sensor screen, the IPMI event log, and the System log to identify issues with the U.2 drives.

Note: The PCIe bus IDs for slots 6 and 7 depend on the firmware version.
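The next section needs the Linux device name (nvmeX) of the failed drive, while nvsm show health reports a PCIe bus ID. The following is a minimal sketch, not part of the NVIDIA procedure, that lists each NVMe controller next to the PCIe address it is attached to by reading the standard Linux sysfs tree. The function name nvme_pci_map and its optional directory argument are hypothetical and used only for illustration. If the drive has failed so badly that the kernel no longer enumerates it, it does not appear in this list.

```shell
#!/usr/bin/env bash
# Sketch: print "<controller> <PCIe address>" for every NVMe controller.
# nvme_pci_map is a hypothetical helper name; the optional argument lets
# you point it at a different directory than the real sysfs location.
nvme_pci_map() {
    local root="${1:-/sys/class/nvme}"
    local ctrl
    for ctrl in "$root"/nvme*; do
        # Skip entries without a backing device (also covers an unmatched glob).
        [ -e "$ctrl/device" ] || continue
        # The "device" symlink resolves to the PCI device directory,
        # whose name is the PCIe address (for example 0000:08:00.0).
        printf '%s %s\n' "$(basename "$ctrl")" \
            "$(basename "$(readlink -f "$ctrl/device")")"
    done
}

nvme_pci_map
```

Match the PCIe bus ID from the nvsm output against the second column; the first column gives the controller name whose number replaces X in nvmeXn1.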

Identifying the NVMe Manufacturer and Model

Enter the following, replacing X with the number corresponding to the Linux device name for the failed drive.

$ sudo nvsm show /systems/localhost/storage/drives/nvmeXn1 

Example output:

/systems/localhost/storage/drives/nvme5n1
Properties:
    Capacity = 3840755982336
    BlockSizeBytes = 7501476528
    SerialNumber = 174719FCF9F1
    PartNumber = N/A
    Model = Micron_9200_MTFDHAL3T8TCT
    Revision = 100007H0
    Manufacturer = Micron Technology Inc
    Status_State = Enabled
    Status_Health = OK
    Name = Non-Volatile Memory Express
    MediaType = SSD
    IndicatorLED = N/A
    EncryptionStatus = N/A
    HotSpareType = N/A
    Protocol = NVMe
    NegotiatedSpeedsGbs = 0
    Id = 5

Determine the manufacturer and model from the 'Model' entry in the output (the 'Manufacturer' entry confirms the vendor), and then request a replacement NVMe drive from NVIDIA Enterprise Support, specifying this information.
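To copy these values into a support request without retyping them, you can filter the command output. The sketch below assumes the "Name = Value" layout shown in the example output above; nvsm_prop is a hypothetical helper name, not an nvsm feature.

```shell
#!/usr/bin/env bash
# Sketch: print the value of one property from "nvsm show" drive output
# read on standard input. Assumes lines of the form "    Name = Value".
# Usage on the DGX system (nvme5n1 is the drive from the example output):
#   sudo nvsm show /systems/localhost/storage/drives/nvme5n1 | nvsm_prop Model
nvsm_prop() {
    awk -F' = ' -v key="$1" '
        # Strip the indentation from the property name before comparing.
        { name = $1; gsub(/^[ \t]+|[ \t]+$/, "", name) }
        name == key { print $2; exit }
    '
}
```

Run against the example output above, nvsm_prop Model prints Micron_9200_MTFDHAL3T8TCT, and nvsm_prop Capacity prints 3840755982336. The same values can be checked against the replacement drive to confirm that it matches the manufacturer, model, and capacity of the original.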

Replacing the U.2 NVMe Drive

  1. Be sure you have requested and obtained the replacement drive from NVIDIA Enterprise Support.
  2. Back up any critical data to a network shared volume or another backup location.
  3. Power off the system using the power button.
  4. Remove the NVMe drive.
    1. Push the lever release button (on the right side of the lever) to unlock the lever.

    2. Pull the lever to remove the module.

  5. Install the new NVMe drive in the same slot.
    1. Unlock the release lever and then slide the drive into the slot until the front face is flush with the other drives.
    2. Close the lever and lock it in place.

  6. Power on the system.
Perform the tasks described in the chapter U.2 NVMe Cache Drive Post-Installation Tasks.