DGX-2 Service Manual :: DGX Systems Documentation

U.2 NVMe Cache Drive Post-Installation Tasks

This chapter describes the tasks typically required after replacing a U.2 NVMe drive or upgrading from 8 to 16 drives.

Recreating the Cache RAID 0 Volume

Stop cachefilesd.
```
$ sudo systemctl stop cachefilesd 
```

Unmount /raid and stop the RAID 0 array.

$ sudo umount -f /raid

$ sudo mdadm --stop /dev/md1

Run the script to rebuild the RAID volume.
```
$ sudo /usr/bin/configure_raid_array.py -c -f
```
Answer Y to any prompts.
When the script completes, confirm that the /raid volume is mounted.
```
$ df -hl /raid
```
The /dev/md1 filesystem should be mounted on /raid with size 28 TB or 56 TB, depending on whether 8 or 16 drives are installed.
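
The mount check above can also be scripted. The following is a minimal sketch, not part of the NVIDIA tooling: the function name raid_mount_ok is hypothetical, and it assumes the POSIX output format of df -P, where the filesystem is in the first column and the mount point is in the sixth.

```shell
# raid_mount_ok (hypothetical helper): reads `df -P /raid` output on stdin
# and succeeds only if /dev/md1 is the filesystem mounted on /raid.
raid_mount_ok() {
    awk 'NR > 1 && $1 == "/dev/md1" && $6 == "/raid" { found = 1 }
         END { exit !found }'
}

# Usage on the system (not executed by this block):
#   df -P /raid | raid_mount_ok && echo "/raid is mounted from /dev/md1"
```

The helper only checks the device and mount point; compare the reported size against the expected 28 TB or 56 TB separately.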

Confirming the Volume is Ready

Confirm the storage devices and volumes in the system are healthy using the following command.
```
$ sudo nvsm show systems/localhost/storage/volumes/md1 
```
Verify that Status_Health = OK and that the number of drives listed after Drives = matches the number installed (8 or 16).
Confirm that the drives are now available.
```
$ sudo mdadm -D /dev/md1  
```
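
When checking the mdadm -D output, the fields of most interest are Raid Level, Raid Devices, and State. As a rough sketch, the hypothetical helper md_summary below extracts those three fields from the output on stdin; it assumes the usual "key : value" layout that mdadm prints and is not an NVIDIA-provided tool.

```shell
# md_summary (hypothetical helper): reads `mdadm -D /dev/md1` output on
# stdin and prints the RAID level, device count, and array state on one line.
md_summary() {
    awk -F' : ' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        { key = trim($1); val = trim($2) }
        key == "Raid Level"   { level = val }
        key == "Raid Devices" { devices = val }
        key == "State"        { state = val }
        END { printf "level=%s devices=%s state=%s\n", level, devices, state }
    '
}

# Usage on the system (not executed by this block):
#   sudo mdadm -D /dev/md1 | md_summary
```

For a rebuilt cache volume, the level should be raid0 and the device count should match the number of installed drives.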

If the drive manufacturer is Micron, perform the steps in Enabling the Temperature Sensor.

Enabling the Temperature Sensor

Follow the steps in this section only for Micron NVMe drives.

Verify the need to enable temperature reading for the installed NVMe drives by running ipmitool.
```
$ sudo ipmitool sdr | grep -i temp | grep -i -e nvme*temp -e temp_u2
```
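
With 8 or 16 drives, a missing reading is easy to overlook in the ipmitool output. The sketch below filters for sensors that report no value. It assumes the pipe-separated "name | value | status" layout of ipmitool sdr and that a missing value is printed as "no reading"; the exact wording may vary between ipmitool versions, and the function name sensors_without_reading is hypothetical.

```shell
# sensors_without_reading (hypothetical helper): reads `ipmitool sdr`
# output on stdin and prints the name of each sensor whose value column
# is "no reading" (assumed wording; check it against your own output).
sensors_without_reading() {
    awk -F'|' '
        function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
        tolower(trim($2)) == "no reading" { print trim($1) }
    '
}

# Usage on the system (not executed by this block):
#   sudo ipmitool sdr | grep -i temp | sensors_without_reading
```

If the helper prints nothing, every listed temperature sensor is reporting a value.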
If any NVMe drive does not show a temperature reading, enable SMBus on all the drives.
1. Switch to the root user before running the script.
```
$ sudo su
```
2. Run the following script.
```
# for drive in $(nvme list | grep Micron | cut -d' ' -f1 | sed 's/..$//'); do
    /opt/MicronTechnology/MicronMSECLI/msecli -M -k 1 -n "$drive"
  done
```
3. Exit the root shell.
```
# exit
```
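
The loop in step 2 uses sed 's/..$//' to strip the last two characters from each device name, turning a namespace device such as /dev/nvme0n1 into its controller device /dev/nvme0. That works when the namespace suffix is exactly two characters long (n1 through n9). As an illustration only, the hypothetical helper below does the same conversion with a shell parameter expansion, assuming device names of the form /dev/nvmeXnY.

```shell
# controller_of (hypothetical helper): prints the controller device for an
# NVMe namespace device by removing the trailing "n<digits>" suffix.
controller_of() {
    printf '%s\n' "${1%n[0-9]*}"
}

# Examples:
#   controller_of /dev/nvme0n1    prints /dev/nvme0
#   controller_of /dev/nvme10n1   prints /dev/nvme10
```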
Run ipmitool again to confirm that a temperature reading is now reported for the replaced drive.
```
$ sudo ipmitool sdr | grep -i temp | grep -i -e nvme*temp -e temp_u2
```

Returning the NVMe Drive/Riser Assembly

Use the packaging from the new drive/riser assembly and follow the instructions that came with the package to ship the old drive/riser assembly back to NVIDIA Enterprise Support.

Note: If your organization has purchased a media retention policy, you may be able to keep failed drives for destruction. Check with NVIDIA Enterprise Support on the status of the policy for specifics.