Additional Features and Instructions

This chapter describes specific features of the DGX A100 server to consider during setup and operation.

Managing the DGX Crash Dump Feature

The DGX OS includes a script, /usr/sbin/nvidia-kdump-config, to manage the crash dump feature.

Using the Script

This section describes how to use the script to enable or disable DGX crash dumps.

  • To enable only dmesg crash dumps, enter the following command:

    $ sudo /usr/sbin/nvidia-kdump-config enable-dmesg-dump
    

    This option reserves memory for the crash kernel.

  • To enable both dmesg and vmcore crash dumps, enter the following command:

    $ sudo /usr/sbin/nvidia-kdump-config enable-vmcore-dump
    

    This option reserves memory for the crash kernel.

  • To disable crash dumps, enter the following command:

    $ sudo /usr/sbin/nvidia-kdump-config disable
    

    This option disables kdump and ensures that no memory is reserved for the crash kernel.
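After running one of the commands above, it can be useful to check whether memory is actually reserved for the crash kernel. The following is a minimal sketch, not part of the DGX tooling. It assumes the script applies the reservation through the standard Linux `crashkernel=` kernel parameter, which the source does not state. The helper name `crashkernel_param` is hypothetical. If the reservation is made on the kernel command line, a change normally shows up only after a reboot.

```shell
# Hypothetical helper: print the crashkernel= parameter found in a kernel
# command line string, or a short notice if none is present.
crashkernel_param() {
    # Put each parameter on its own line, then keep only crashkernel= entries.
    printf '%s\n' "$1" | tr ' ' '\n' | grep '^crashkernel=' || echo "crashkernel not set"
}

# Inspect the command line of the running kernel.
if [ -r /proc/cmdline ]; then
    crashkernel_param "$(cat /proc/cmdline)"
fi

# Size in bytes of the memory currently reserved for the crash kernel.
# A value of 0 means that nothing is reserved.
if [ -r /sys/kernel/kexec_crash_size ]; then
    cat /sys/kernel/kexec_crash_size
fi
```

When crash dumps are disabled, the expected result under this assumption is the "crashkernel not set" notice and a reserved size of 0.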

Connecting to Serial Over LAN to View the Console

While dumping vmcore, the BMC screen console goes blank approximately 11 minutes after the crash dump is started. To view the console output during the crash dump, connect to serial over LAN as follows:

$ ipmitool -I lanplus -H <bmc-ip-address> -U <bmc-username> -P <bmc-password> \
    sol activate
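The command above places the BMC password on the command line, where it can end up in shell history. As a hedged alternative, ipmitool's `-E` option reads the password from the `IPMI_PASSWORD` environment variable instead. The sketch below only prints the commands to run. The helper name `sol_command` and the address and username passed to it are hypothetical placeholders.

```shell
# Hypothetical helper: print an ipmitool serial-over-LAN command line.
#   $1 = BMC IP address, $2 = BMC username, $3 = SOL action
# -E tells ipmitool to take the password from the IPMI_PASSWORD variable.
sol_command() {
    printf 'ipmitool -I lanplus -H %s -U %s -E sol %s\n' "$1" "$2" "$3"
}

# Placeholder address and username; substitute the values for your BMC.
sol_command 192.0.2.10 admin activate

# Closes a session that was left open, for example after a dropped connection.
sol_command 192.0.2.10 admin deactivate
```

Before running the printed commands, export `IPMI_PASSWORD` in the shell. To leave an active session, type `~.` which is the default ipmitool escape sequence.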