DIMM Replacement

DIMM Replacement Overview

This is a high-level overview of the procedure to replace a dual inline memory module (DIMM) on the DGX A100 system.

  1. Use the nvsm health commands to identify the failed DIMM.

  2. Get a replacement DIMM from NVIDIA Enterprise Support.

  3. Shut down the system.

  4. Label all motherboard tray cables and unplug them.

  5. Remove the motherboard tray and place on a solid flat surface.

  6. Remove the motherboard tray lid.

  7. Use the reference diagram on the lid of the motherboard tray to identify the failed DIMM.

  8. Replace the bad DIMM with the new one.

  9. Close the lid on the motherboard tray.

  10. Insert the motherboard tray into the system.

  11. Plug in all cables using the labels as a reference.

  12. Power on the system.

  13. Use nvsm to verify that all DIMMs are now healthy.

Identifying the Failed DIMM

  1. From the console, run the following nvsm command to identify memory alerts.

    $ sudo nvsm show /systems/localhost/memory/alerts
    

    Alerts appear under the Targets section, for example:

    Targets:
              alert0
    
  2. Get specific information about the memory alert.

    The following example obtains information for alert0.

    $ sudo nvsm show /systems/localhost/memory/alerts/alert0
    

    Inspect the component_id = line to determine the DIMM ID. The following example shows a DIMM ID of A1.

    Properties:
    system_name = ....
    component_id = CPU1_DIMM_A1
    ...
    

    The output also includes other details about the alert that you can pass on to NVIDIA Enterprise Support.

  3. Determine the DIMM manufacturer.

    $ sudo nvsm show memory
    
  4. Request the replacement DIMM from NVIDIA Enterprise Support, specifying the manufacturer.
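
The lookup in step 2 can be scripted when several alerts need to be checked. The following is a minimal sketch, not part of the nvsm tooling: the function names component_id_from_alert and dimm_slot_from_component_id are hypothetical, and the parsing assumes the "key = value" layout and the CPU1_DIMM_A1 naming shown in the example output above. Actual output on a given system may differ.

```shell
#!/bin/sh
# Hypothetical helpers; both read text on stdin and print to stdout.

# Print the value of every "component_id = ..." line.
# Assumption: properties are printed one per line as "key = value".
component_id_from_alert() {
    awk -F'=' '/^[ \t]*component_id[ \t]*=/ {
        value = $2
        gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim surrounding blanks
        print value
    }'
}

# Reduce a component ID such as CPU1_DIMM_A1 to its slot label (A1).
# Assumption: the slot label is the text after the last underscore.
dimm_slot_from_component_id() {
    sed 's/.*_//'
}
```

With these functions loaded in the shell, piping the output of sudo nvsm show /systems/localhost/memory/alerts/alert0 through component_id_from_alert and then dimm_slot_from_component_id prints the slot label to match against the diagram on the motherboard tray lid.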

Replacing the DIMM

Before attempting to replace any of the dual inline memory modules (DIMMs), make sure that you have done the following:

  • Determined the location ID of the faulty DIMM needing replacement as explained in Identifying the Failed DIMM. The location ID is an alphanumeric designator, such as A0, A1, B0, B1, etc.

  • Obtained the replacement DIMM and saved the packaging for use when returning the faulty DIMM.

Caution

Static Sensitive Devices: Be sure to observe best practices for electrostatic discharge (ESD) protection. This includes making sure personnel and equipment are connected to a common ground, such as by wearing a wrist strap connected to the chassis ground, and placing components on static-free work surfaces.

  1. Power down the system.

  2. Label all cables connected to the motherboard tray for easy identification when reconnecting.

  3. Remove the motherboard tray.

    Refer to the instructions in the section Removing the Motherboard Tray.

  4. Using the diagram label on the lid as a guide, locate the faulty DIMM to be replaced.

    _images/mb-tray-lid-label-dimms.png
  5. Remove the DIMM.

    1. Press down on the side latches at both ends of the DIMM socket to push them away from the DIMM. This should unseat the DIMM from the socket.

      _images/dimm-remove.png
    2. Pull the DIMM straight up to remove it from the socket.

      _images/dimm-install.png
  6. Carefully insert the replacement DIMM.

    1. Make sure the socket latches are open.

    2. Position the DIMM over the socket, making sure that the notch on the DIMM lines up with the key in the slot, then press the DIMM down into the socket until the side latches click in place.

    3. Make sure that the latches are up and locked in place.

    _images/dimm-install.png
  7. Install the three air baffles, replace the motherboard tray lid, and then install the motherboard tray.

    Refer to the instructions in the section Reinstalling the Motherboard Tray.

  8. Connect all the cables to the motherboard tray.

  9. Install all the power cords.

  10. Power on the system and log in.

  11. Confirm that the system is healthy.

    $ sudo nvsm show health
    $ sudo nvsm show /systems/localhost/memory/alerts
    

    There should be no new alerts listed.

  12. Ship the bad DIMM back to NVIDIA Enterprise Support.
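
The "no new alerts" check in step 11 can be approximated by comparing alert counts taken before and after the replacement. The following is a rough sketch under stated assumptions: count_alert_targets and no_new_alerts are hypothetical helper names, and the pattern assumes alert targets are listed one per line as alert0, alert1, and so on, as in the example output in Identifying the Failed DIMM. A count comparison is only a coarse signal, so inspect any remaining alerts individually.

```shell
#!/bin/sh
# Hypothetical helpers for comparing memory alert counts.

# Count alert targets in the alerts listing read from stdin.
# Assumption: each target appears alone on a line as "alertN".
count_alert_targets() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so the exit status is ignored here.
    grep -c -E '^[[:space:]]*alert[0-9]+[[:space:]]*$' || true
}

# Succeed when the count after the replacement does not exceed
# the count recorded before it.
no_new_alerts() {
    before_count=$1
    after_count=$2
    [ "$after_count" -le "$before_count" ]
}
```

For example, record the count before shutting down by piping sudo nvsm show /systems/localhost/memory/alerts through count_alert_targets, repeat after powering on, and pass both numbers to no_new_alerts.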