M.2 NVMe Boot Drive Replacement#

This topic describes how to replace the boot drive in the NVIDIA DGX™ B300 system.

Caution

Static Sensitive Devices: Observe best practices for electrostatic discharge (ESD) protection. Ensure that personnel and equipment are connected to a common ground, for example by wearing a wrist strap connected to the chassis ground and by placing components on static-free work surfaces.

M.2 NVMe Boot Drive Replacement Overview#

This is a high-level overview of the procedure to replace a boot drive.

  1. Identify the failed M.2 drive.

  2. Get a replacement M.2 from NVIDIA Enterprise Support.

  3. Power off the system.

  4. Unplug all motherboard cables.

  5. Pull out the motherboard tray.

  6. Remove the motherboard lid.

  7. Remove the left BlueField-3 I/O bay.

  8. Pull out the M.2 bay.

  9. Replace the failed M.2 drive.

  10. Install the M.2 bay.

  11. Install the left BlueField-3 I/O bay.

  12. Install the motherboard lid.

  13. Slide the motherboard tray into the system.

  14. Connect all the cables.

  15. Power on the system.

  16. Rebuild the RAID volume and mount the filesystem.

  17. Send the failed unit to NVIDIA Enterprise Support using the packaging provided.

Prepare for Replacement#

The NVIDIA DGX™ B300 system automatically sets the failed M.2 drive offline when it detects the failure. The boot drives are mirrored, so the mdadm command-line utility can identify the drive to replace.
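
As a rough illustration of how the mirror state points at the drive to replace, the following sketch filters `mdadm -D` output for members that are reported as faulty. The function name `faulty_members` is hypothetical and is not part of DGX OS or mdadm. It assumes that the device table ends each row with the member path, which might be a partition rather than the whole drive. A drive that is no longer detected is listed as removed without a path, so the function prints nothing in that case.

```shell
# Hypothetical helper: print array members that mdadm reports as faulty.
# Reads `mdadm -D` output on stdin, so it can be tried without root access.
faulty_members() {
    # Device-table rows end with the member path, for example /dev/nvme1n1.
    awk '$0 ~ /faulty/ && $NF ~ /^\/dev\// { print $NF }'
}

# Usage on a live system (requires root):
#   sudo mdadm -D /dev/md0 | faulty_members
```

Treat the result as a cross-check against the output of `sudo nvsm show health`, not as a replacement for it.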

Caution

Wear an ESD strap during any procedure that involves touching electronic components.

  1. Identify the failed M.2 drive using the OS tools or the nvsm command.

    sudo nvsm show health
    

    The command output indicates the drive name, nvme0n1 or nvme1n1.

  2. Confirm the drive name using the mdadm command:

    sudo mdadm -D /dev/md0
    

    The command output indicates the drive names and the drive state.

  3. Contact NVIDIA Enterprise Support to request a replacement M.2 drive.

  4. Back up any critical data to a network shared volume or other backup option.

  5. When the new drive arrives, remove the failed drive from the mirrored volume.

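    Run the following commands to mark the drive as failed and to remove it from the array. In both commands, nvme[0/1]n1 is a placeholder: substitute the name of the failed drive, nvme0n1 or nvme1n1, and do not type the brackets.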

    1. Mark the disk as failed, if it is not already marked as failed:

      sudo mdadm --manage /dev/md0 --fail /dev/nvme[0/1]n1
      
    2. Remove the failed disk from the array:

      sudo mdadm --manage /dev/md0 --remove /dev/nvme[0/1]n1
      
  6. Power off the system.

  7. Remove the left BlueField-3 I/O bay to access the M.2 NVMe boot drives below.

    Note

    Each cable is labeled to ensure it is connected to the correct position after the procedure.

    _images/dgx-b300-bf3-left-bay.png
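
Because the fail and remove commands in step 5 act on whichever drive is named, a small guard can help avoid a typing mistake. The following sketch is a hypothetical helper, `md_remove_cmds`, that only prints the two commands from step 5 for review and does not run them. It accepts exactly the two drive names given in this procedure and rejects anything else, including the literal placeholder text.

```shell
# Hypothetical helper: print (do not run) the mdadm commands from step 5
# for one boot drive. Accepts only nvme0n1 or nvme1n1.
md_remove_cmds() {
    case "$1" in
        nvme0n1|nvme1n1)
            printf 'sudo mdadm --manage /dev/md0 --fail /dev/%s\n' "$1"
            printf 'sudo mdadm --manage /dev/md0 --remove /dev/%s\n' "$1"
            ;;
        *)
            printf 'error: expected nvme0n1 or nvme1n1, got "%s"\n' "$1" >&2
            return 1
            ;;
    esac
}

# Usage: review the printed commands, then run them by hand.
#   md_remove_cmds nvme1n1
```

Printing instead of executing keeps the operator in control of the step that takes a drive out of the mirror.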

Remove the BlueField-3 I/O Bay#

  1. After the four cables have been unplugged, press the left release tab and push the bay towards the front.

    _images/dgx-b300-bf3-left-release-tab.png
  2. Carefully route the cables through the opening as the I/O bay is moved out of the motherboard tray.

  3. Finish pulling the I/O bay out of the motherboard tray.

  4. Ensure the motherboard tray levers remain fully extended, as shown in the illustration, so the M.2 bay can be pulled out.

    _images/dgx-b300-bf3-left-bay-out.png

Release the M.2 Bay#

  1. Ensure the ejection levers are fully extended to prevent obstruction before removing the M.2 bay from the motherboard.

    _images/dgx-b300-m2-levers.png
  2. Release the latch on the M.2 bay as shown in the illustration, and then pull the bay out through the front.

    _images/dgx-b300-m2-bay-eject.png

Remove the M.2 Drive#

Before you remove one of the M.2 NVMe drives, complete the following prerequisites:

  • Determine the location ID of the faulty M.2 drive.

  • Obtain the replacement M.2 drive and save the packaging for returning the faulty drive.

  1. Identify the M.2 drive that needs to be replaced.

    Note

    The PCIe bus number corresponds to a specific drive on the board.

    _images/dgx-b300-nvme-card.png
  2. Remove the screw on the M.2 drive that needs to be replaced.

    The screw is not captive, so take care that it does not fall and get lost.

    _images/dgx-b300-nvme-card-2.png
  3. Tilt the M.2 drive slightly so it can be ejected.

    _images/dgx-b300-nvme-tilt.png
  4. Pull the M.2 drive out of the connector and off the bay.

    _images/dgx-b300-nvme-remove.png

Install the New M.2 Drive#

  1. Insert the new M.2 drive into the connector.

    _images/dgx-b300-nvme-drive-1.png
  2. Rest the M.2 drive on the bay.

    _images/dgx-b300-nvme-drive-2.png
  3. Carefully attach the screw to secure the M.2 drive.

    _images/dgx-b300-nvme-drive-3.png

Insert the M.2 Drive Bay and Reconnect the I/O Bay#

  1. After the replacement is complete, ensure the ejection levers are completely open. Insert the M.2 drive bay into the corresponding lower slot until it locks in place.

    _images/dgx-b300-nvme-m2-bay-insert.png
  2. Carefully route all the cables through the opening in the motherboard tray slot.

    Insert the I/O bay into the tray and confirm that the release tab has locked it in place.

    _images/dgx-b300-nvme-io-bay-insert.png
  3. Connect the two power cables and the two PCIe cables to their correct connectors on the switchboard, following the labels on each cable end.

    _images/dgx-b300-nvme-io-bay-connect.png

    To identify the correct connections, refer to the following table, which maps the BlueField-3 cable labels to the corresponding board connectors.

    BlueField-3 I/O Board    Left Slot Installation    Right Slot Installation
    Cable Label P2           Board connector J9        Board connector J3
    Cable Label P3           Board connector J10       Board connector J4

Integrate the New Drive and Complete the Installation#

  1. Insert the motherboard tray by following the instructions in Motherboard Tray - Opening and Closing.

  2. Power on the system and log in.

  3. Rebuild the RAID1 (mirror) volume on the boot drives.

  4. Confirm the system is healthy by running the nvsm command.

    sudo nvsm show health
    
  5. Send the failed M.2 drive to NVIDIA Enterprise Support using the packaging provided.

Note

If your organization purchased a media retention policy, you might be able to keep the failed drives for destruction. Check with NVIDIA Enterprise Support on the status of the policy for specifics.
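
This procedure does not spell out the rebuild command, and the sketch below does not supply one. It only illustrates one way to watch the mirror while it rebuilds, by classifying the text of /proc/mdstat, the standard Linux software RAID status file. The function name `md_state` is hypothetical. It considers every md array listed in the input, not only md0, and it does not check that an array exists.

```shell
# Hypothetical helper: classify RAID state from /proc/mdstat-style text
# read on stdin. Prints one of: rebuilding, degraded, healthy.
md_state() {
    awk '
        /recovery|resync/     { rebuilding = 1 }   # progress line is present
        /\[[U_]*_[U_]*\]/     { degraded = 1 }     # for example [U_] or [_U]
        END {
            if (rebuilding)    print "rebuilding"
            else if (degraded) print "degraded"
            else               print "healthy"
        }'
}

# Usage on a live system:
#   md_state < /proc/mdstat
```

A result of healthy here is only a quick indication. Confirm the final state with `sudo nvsm show health` as described in step 4.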