U.2 NVMe Cache Drive Replacement
Important
Replace only with U.2 NVMe drives of the same manufacturer, model, and density (capacity) as the existing or replaced drives.
U.2 NVMe Cache Drive Replacement Overview
This is a high-level overview of the procedure to replace a cache Non-Volatile Memory Express (NVMe) drive.
Identify the failed U.2 NVMe drive.
Order a replacement from NVIDIA Enterprise Support.
Use nvsm to prepare the drive for removal, and then look for the white LED.
Replace the failed NVMe drive.
Rebuild the RAID volume and remount the /raid partition.
Confirm the system is healthy by running nvsm show health.
Ship the failed unit back to NVIDIA Enterprise Support using the provided packaging.
Identifying the Failed U.2 NVMe
Identifying the Failed NVMe from the Front
If physical access to the system is available, you can identify a failed drive by its illuminated amber LED.
Identifying the Failed NVMe from the Console
To identify the failed NVMe drive from the DGX A100 console, enter the following command and look for drive alerts in the output.
$ sudo nvsm show health
The command returns the PCIe bus ID. Refer to the following figure to find the slot ID that corresponds to the PCIe bus ID for the faulty drive.
Alternatively, you can use the BMC dashboard to access the Sensor screen, the IPMI event log, and the System log to identify issues with the U.2 drives.
Note
The PCIe bus IDs for slots 6 and 7 depend on the firmware version.
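As a cross-check when matching a PCIe bus ID to a Linux device name, the sketch below reads the PCI address of an NVMe controller from sysfs. It is not part of the NVIDIA procedure: it assumes the standard Linux layout in which /sys/class/nvme/nvmeX/device links to the controller's PCI device, and nvme_pci_addr is a hypothetical helper name.

```shell
# Print the PCI address (for example 0000:01:00.0) of an NVMe controller.
# Assumption: standard Linux sysfs layout; nvme_pci_addr is a hypothetical
# helper, not an nvsm or DGX OS command.
nvme_pci_addr() {
    ctrl="$1"              # controller name without namespace, e.g. nvme5
    sysfs="${2:-/sys}"     # sysfs root; override only for testing
    link="$sysfs/class/nvme/$ctrl/device"
    if [ ! -e "$link" ]; then
        echo "no such NVMe controller: $ctrl" >&2
        return 1
    fi
    # Resolve the symlink and keep only the last path component,
    # which is the PCI address of the controller.
    basename "$(readlink -f "$link")"
}
```

For example, nvme_pci_addr nvme5 prints the address of the controller behind /dev/nvme5n1, which can then be compared with the bus ID reported by nvsm show health.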
Identifying the NVMe Manufacturer and Model
Enter the following, replacing X with the number corresponding to the Linux device name for the failed drive.
$ sudo nvsm show /systems/localhost/storage/drives/nvmeXn1
Example output:
/systems/localhost/storage/drives/nvme5n1
Properties:
Capacity = 3840755982336
BlockSizeBytes = 7501476528
SerialNumber = 174719FCF9F1
PartNumber = N/A
Model = Micron_9200_MTFDHAL3T8TCT
Revision = 100007H0
Manufacturer = Micron Technology Inc
Status_State = Enabled
Status_Health = OK
Name = Non-Volatile Memory Express
MediaType = SSD
IndicatorLED = N/A
EncryptionStatus = N/A
HotSpareType = N/A
Protocol = NVMe
NegotiatedSpeedsGbs = 0
Id = 5
Determine the manufacturer and model from the ‘Model’ entry in the output, and then request a replacement NVMe from NVIDIA Enterprise Support, specifying this information.
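To avoid copying these values by hand, a small filter can pull a single property out of the command output. This is a minimal sketch that assumes the output keeps the "Key = Value" layout shown in the example above; nvsm_prop is a hypothetical helper name and not part of nvsm.

```shell
# Print the value of one "Key = Value" property read from stdin.
# Assumption: the input uses the layout of the example output above.
# nvsm_prop is a hypothetical helper, not an nvsm subcommand.
nvsm_prop() {
    awk -v key="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)      # drop leading indentation
            sub(/[ \t\r]+$/, "", line)    # drop trailing whitespace
            n = index(line, " = ")        # position of the separator
            if (n > 0 && substr(line, 1, n - 1) == key) {
                print substr(line, n + 3) # everything after " = "
                exit
            }
        }
    '
}
```

Typical use is to pipe the command output into the helper, for example: sudo nvsm show /systems/localhost/storage/drives/nvme5n1 | nvsm_prop Model. Repeating this with Manufacturer and Capacity gives the remaining details needed to match the replacement drive.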
Replacing the U.2 NVMe Drive
Be sure you have requested and obtained the replacement drive from NVIDIA Enterprise Support.
Back up any critical data to a network shared volume or some other means of backup.
Power off the system using the power button.
Remove the NVMe drive.
Push the lever release button (on the right side of the lever) to unlock the lever.
Pull the lever to remove the module.
Install the new NVMe drive in the same slot.
Unlock the release lever and then slide the drive into the slot until the front face is flush with the other drives.
Close the lever and lock it in place.
Power on the system.
Perform the tasks described in the chapter U.2 NVMe Cache Drive Post-Installation Tasks.