E1.S Cache Drive Replacement

This topic describes how to replace an E1.S cache drive in the compute tray of the NVIDIA DGX™ GB200 system.

E1.S Cache Drive Replacement Overview

This is a high-level overview of the steps needed to replace a cache drive.

  1. Identify the failed cache drive

  2. Power down the compute tray being serviced

  3. Replace the drive

  4. Power up the compute tray

  5. Rebuild the RAID cache volume

  6. Confirm system health with sudo nvsm show health
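The software portion of this overview (the steps that follow power-up) can be laid out as an ordered command list. The sketch below only builds that list and runs nothing. The function name `finalize_commands` and its `encryption_enabled` flag are hypothetical, chosen for illustration; the command strings are the ones named in this topic. Re-enabling encryption is left out because this topic defers that step to the DGX OS user guide.

```python
def finalize_commands(encryption_enabled):
    """Return the post-replacement commands, in order, as plain strings.

    Hypothetical helper: it does not execute anything on the system.
    """
    commands = ["sudo nvme list"]  # confirm the new drive is recognized
    if encryption_enabled:
        # Encryption must be off before the RAID array is rebuilt.
        commands.append("sudo nv-disk-encrypt disable")
    commands.append("configure_raid_array.py -c -f")  # prompts for y
    commands.append("sudo nvsm show volumes")  # confirm the RAID volume
    commands.append("sudo nvsm show health")  # confirm overall health
    return commands


if __name__ == "__main__":
    for number, command in enumerate(finalize_commands(True), start=1):
        print(f"{number}. {command}")
```

Keeping the plan as data rather than executing it lets an operator review the sequence before running each command by hand at the console.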

Identify the Failed Cache Module

This diagram shows the physical location of each cache drive module slot. Only odd-numbered slots are used for NVMe E1.S storage devices.

_images/cache-identify.png

Identify a failed cache module using any of the following methods:

  • Run sudo nvsm show health from a terminal session and look for drive alerts

  • Use the BMC web interface to view the IPMI events log and look for drive alerts
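Either method comes down to scanning a report for drive-related entries. As a rough sketch, the Python below filters saved report text for lines that mention a drive keyword. The function name, the keyword list, and the placeholder report are all assumptions made for illustration. In particular, the placeholder text is invented and does not reproduce the real output format of nvsm or the BMC event log.

```python
def find_drive_alerts(report, keywords=("drive", "nvme", "disk")):
    """Return the lines of `report` that mention any keyword (case-insensitive)."""
    lowered = tuple(keyword.lower() for keyword in keywords)
    return [
        line.strip()
        for line in report.splitlines()
        if any(keyword in line.lower() for keyword in lowered)
    ]


# Invented placeholder text, NOT real nvsm or BMC output.
PLACEHOLDER_REPORT = """\
check A: healthy
check B: alert on drive in slot 1
check C: healthy
"""

if __name__ == "__main__":
    for line in find_drive_alerts(PLACEHOLDER_REPORT):
        print(line)
```

A keyword filter like this only narrows down what to read; the matching lines still need to be checked against the slot diagram to find the physical drive.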

Replace the Failed Cache Drive Module

  1. Power down the compute tray being serviced.

  2. Locate the NVMe E1.S drive to be replaced. Press the button at the top of the drive to release the lever and eject the drive.

    _images/cache-replace-1-open.png
  3. Use the lever to remove the failed drive module, and then insert the new one. As you insert the new module, press the latch button to ensure the lever stays in the open position.

    _images/cache-replace-2-exchange.png
  4. Fully insert the drive module and close the lever to lock it in place.

    _images/cache-replace-3-close.png

Finalize the Replacement Procedure

  1. Power up the compute tray and log in to the console.

  2. Confirm the new drive module is recognized by running sudo nvme list. The output resembles the following. One boot drive and four cache drives should be listed, although the device names and models may differ:

    Node          SN         Model    Namespace Usage                Format       FW Rev
    ------------- ---------- -------- --------- -------------------- ------------ --------
    /dev/nvme0n1  S4YPNE0N3  SAMSUNG  1           3.84 TB / 3.84 TB  512 B + 0 B  EPK9CB5Q
    /dev/nvme1n1  S4YPNE0N0  SAMSUNG  1           3.84 TB / 3.84 TB  512 B + 0 B  EPK9CB5Q
    /dev/nvme2n1  S436NA0N4  SAMSUNG  1          44.44 GB / 1.92 TB  512 B + 0 B  EDA7602Q
    /dev/nvme4n1  S4YPNE0N2  SAMSUNG  1           3.84 TB / 3.84 TB  512 B + 0 B  EPK9CB5Q
    /dev/nvme5n1  S4YPNE0N1  SAMSUNG  1           3.84 TB / 3.84 TB  512 B + 0 B  EPK9CB5Q
    
  3. If disk encryption is enabled, disable it with the sudo nv-disk-encrypt disable command before rebuilding the RAID array.

  4. Rebuild the RAID cache volume using the configure_raid_array.py -c -f command. Enter y when prompted to confirm the operation.

  5. If disk encryption is desired, enable it using the instructions in the DGX OS user guide.

  6. Confirm the RAID volume is healthy by running the sudo nvsm show volumes command.

  7. Return the failed cache module to NVIDIA Enterprise Support using the packaging provided.
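The check in step 2 (one boot drive and four cache drives listed) can be sketched as a small parser over captured sudo nvme list output. The Python below is a minimal illustration, not an NVIDIA tool: the names `parse_nvme_list` and `capacity_counts` are hypothetical, the sample text is the listing from step 2, and the regular expression assumes each device row contains a usage field of the form "used / total". Grouping drives by total capacity is only a heuristic for telling the boot drive from the cache drives, since models and sizes may differ between systems.

```python
import re
from collections import Counter

# A device row: node path, then later "<used> <unit> / <total> <unit>".
ROW = re.compile(
    r"^\s*(?P<node>/dev/nvme\S+)\s.*?"
    r"(?P<used>\d+(?:\.\d+)?)\s+(?P<used_unit>[KMGT]?B)\s+/\s+"
    r"(?P<total>\d+(?:\.\d+)?)\s+(?P<total_unit>[KMGT]?B)\b"
)

# Sample listing copied from step 2 of this procedure.
SAMPLE = """\
Node          SN         Model    Namespace Usage                Format       FW Rev
------------- ---------- -------- --------- -------------------- ------------ --------
/dev/nvme0n1  S4YPNE0N3  SAMSUNG  1           3.84 TB / 3.84 TB  512 B + 0 B  EPK9CB5Q
/dev/nvme1n1  S4YPNE0N0  SAMSUNG  1           3.84 TB / 3.84 TB  512 B + 0 B  EPK9CB5Q
/dev/nvme2n1  S436NA0N4  SAMSUNG  1          44.44 GB / 1.92 TB  512 B + 0 B  EDA7602Q
/dev/nvme4n1  S4YPNE0N2  SAMSUNG  1           3.84 TB / 3.84 TB  512 B + 0 B  EPK9CB5Q
/dev/nvme5n1  S4YPNE0N1  SAMSUNG  1           3.84 TB / 3.84 TB  512 B + 0 B  EPK9CB5Q
"""


def parse_nvme_list(text):
    """Return (node, total_capacity) pairs for each device row in `text`."""
    devices = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # header and separator lines do not match
            capacity = f'{match["total"]} {match["total_unit"]}'
            devices.append((match["node"], capacity))
    return devices


def capacity_counts(devices):
    """Count how many devices share each total capacity."""
    return dict(Counter(capacity for _, capacity in devices))


if __name__ == "__main__":
    found = parse_nvme_list(SAMPLE)
    print(f"{len(found)} NVMe devices listed")
    for capacity, count in capacity_counts(found).items():
        print(f"{count} x {capacity}")
```

If fewer than five devices are parsed after a replacement, the new module is probably not seated or not recognized, and the RAID rebuild should wait until it appears in the listing.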