DGX A100 Firmware Known Issues

Random Retry Error Messages in the Log File or the Output of show_version

Issue

After running the firmware update, you might see the Too many retries message in the nvidia-fw.log file or the Err:retry message in the PSU section of the show_version output.

Workaround

To resolve the issue, run the command that produced the message again.
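
A check for the two messages can be scripted. The following is a minimal sketch, not part of the firmware tooling: the function name has_retry_error is hypothetical, and the log path in the usage comment is an assumption because this section does not state where nvidia-fw.log is written.

```shell
# Hypothetical helper: succeeds (exit status 0) when the text on stdin
# contains either of the retry messages named in this section.
has_retry_error() {
    grep -q -F -e 'Too many retries' -e 'Err:retry'
}

# Example usage (the log path is an assumption; adjust to your system):
#   if has_retry_error < ./nvidia-fw.log; then
#       echo "Retry message found; run the command again."
#   fi
```

The function only detects the messages. Rerunning the command is left to the operator.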

Running Another Firmware Update Container Using Docker or Podman Causes the First Container to Abort

Issue

When you run the Firmware Update Container through Podman, any attempt to start a second Firmware Update Container through Docker or Podman causes the first instance to halt. Running multiple instances of the Firmware Update Container concurrently on the same system is not supported and can lead to system issues.

Unable to Run Firmware Update Container in Containerless Mode When Podman Is Installed

Issue

When you run the Firmware Update Container with either the -docker or -podman argument, but the specified container environment is not installed, you might see a docker: command not found or podman: command not found error message.

Workaround

This is expected behavior. The container can run in containerless mode only if neither Docker nor Podman is installed. When you specify an environment with the -docker or -podman argument, the container uses that environment exclusively and does not fall back to containerless mode. To run in containerless mode, uninstall both Docker and Podman, and then run the container again without the -docker or -podman argument.
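
Before choosing a mode, you can check which container environments are present on the host. The following is a minimal sketch using the standard command -v lookup. The function names runtime_present and containerless_possible are hypothetical, and the sketch only checks whether the docker and podman commands are found on the PATH, which might differ from the check that the container itself performs.

```shell
# Hypothetical helper: succeeds if the named command is found on the PATH.
runtime_present() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds only when neither docker nor podman is found, which is the
# condition this section gives for containerless mode.
containerless_possible() {
    ! runtime_present docker && ! runtime_present podman
}

if containerless_possible; then
    echo "Neither Docker nor Podman found: containerless mode is possible."
else
    echo "Docker or Podman found: containerless mode is not available."
fi
```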

BMC Web User Interface - Backed-up Username and Credentials not Working

Issue

After you restore the configuration using the Maintenance > Restore configuration feature, your backed-up username and credentials might not work when you log in to the BMC web user interface. This problem occurs specifically when the configuration is restored to a different motherboard tray, such as after a motherboard tray replacement.

Workaround

To resolve this issue, create a new user from the host operating system, log in to the web user interface with the newly created credentials, and then delete the old users. To create a new user with administrator privileges, perform the following steps.

  1. List users from the host operating system using the IPMItool command:

    sudo ipmitool user list 1
    
  2. Create a new user in an empty user ID slot and set administrator privilege using the following commands, where <userID> is the same slot number as <empty-userID-slot>:

    sudo ipmitool user set name <empty-userID-slot> <username>
    sudo ipmitool user set password <userID> <password>
    sudo ipmitool user enable <userID>
    sudo ipmitool user priv <userID> 0x4 1
    sudo ipmitool channel setaccess 1 <userID> callin=on ipmi=on link=on
    
  3. Verify that the new user is created successfully by listing the users again using the IPMItool command:

    sudo ipmitool user list 1
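
The commands in step 2 can be grouped into one function so that the same slot number is used throughout. The following is a minimal sketch: the function name create_bmc_admin and the IPMITOOL override variable are hypothetical, and the ipmitool subcommands are exactly those listed above. Note that a password passed as an argument can be visible to other users in the process list.

```shell
# Command prefix used to call ipmitool. Set IPMITOOL=echo for a dry run
# that prints the subcommands instead of executing them (hypothetical knob).
IPMITOOL=${IPMITOOL:-sudo ipmitool}

# Usage: create_bmc_admin <empty-userID-slot> <username> <password>
# Runs the five commands from step 2 in order and stops at the first failure.
create_bmc_admin() {
    slot=$1
    name=$2
    password=$3
    $IPMITOOL user set name "$slot" "$name" &&
    $IPMITOOL user set password "$slot" "$password" &&
    $IPMITOOL user enable "$slot" &&
    $IPMITOOL user priv "$slot" 0x4 1 &&
    $IPMITOOL channel setaccess 1 "$slot" callin=on ipmi=on link=on
}

# Example (slot 5 and the user name are placeholders):
#   create_bmc_admin 5 newadmin 'choose-a-password'
```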
    

BCM users only: Firmware Update Completes with Error on Base Command Manager

Issue

When attempting to update the -0R4 CPU trays, a failure occurs during the update process where the firmware update container fails to list services.

The failure messages can include the following:

  • Failed to install DGX 88064_Retimer dev 91 3.1.o

  • Unable to unload NVIDIA drivers. The following process(es)/service(s) need to be stopped in order for switch firmware update to occur:

  • <blank>

Workaround

  1. Drain the node so that no new jobs are scheduled on it, replacing hostname with the name of the node:

    scontrol update NodeName=hostname State=drain Reason="FW update"
    

    Wait for jobs on the host to complete and the status of the node to report drained.

    If the following command reports the node as draining, jobs are still running on the node and it is not ready for the firmware update. Proceed to the next step only when the node status is drained.

    sinfo --state=drained | grep hostname
    
  2. Stop the slurmd service on the compute node:

    ansible -i /opt/provisioning/inventory/ --become -m shell -a 'systemctl stop slurmd.service' 'hostname'
    

If the host was rebooted after the firmware update, change the host state to resume:

    scontrol update NodeName=hostname State=resume
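
The wait in step 1 of the workaround can be automated by polling the node state. The following is a minimal sketch, not part of the documented procedure: the function name wait_for_drained and the 30-second default interval are assumptions, and the sketch queries the state with sinfo -h -n <node> -o '%T' instead of the grep shown in step 1.

```shell
# Usage: wait_for_drained <node> [poll-interval-seconds]
# Returns 0 once the node reports drained, keeps polling while it reports
# draining, and returns 1 on any other state.
wait_for_drained() {
    node=$1
    interval=${2:-30}
    while :; do
        # -h suppresses the header; %T prints the node state in extended form.
        # A node in several partitions yields several lines; use the first.
        state=$(sinfo -h -n "$node" -o '%T' | head -n 1)
        case $state in
            drained*)  return 0 ;;
            draining*) sleep "$interval" ;;
            *)
                echo "Unexpected state for $node: '$state'" >&2
                return 1
                ;;
        esac
    done
}

# Example (node01 is a placeholder node name):
#   scontrol update NodeName=node01 State=drain Reason="FW update"
#   wait_for_drained node01 && echo "node01 is drained"
```

The patterns end in a wildcard because Slurm can append a suffix such as * to the state when a node is not responding.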