NVIDIA UFM Enterprise Appliance Software User Manual v1.9.1

Appendix - GRUB and Kernel Behavior

This configuration aims to improve system reliability by ensuring quick recovery from critical errors while preserving valuable diagnostic data for troubleshooting.

The system is configured to treat certain critical kernel events as panics so that it responds to them promptly and automatically. When a kernel panic occurs, the system saves crash data for later analysis and then reboots automatically.

  1. Kernel Oops Behavior:

    • Any kernel oops is treated as a kernel panic.

  2. CPU Soft Lockup:

    • Any CPU soft lockup is treated as a kernel panic.

  3. CPU Hard Lockup:

    • Any CPU hard lockup is treated as a kernel panic.

  4. Automatic Reboot on Kernel Panic:

    • On any kernel panic, the system is configured to automatically reboot after 10 seconds.

    • A kernel dump is generated and saved in /var/crash.

  5. Kernel Crash Dump Management:

    • The system retains a maximum of 5 kernel crash dumps in /var/crash.

    • When a new crash dump would exceed this limit, the oldest dump is deleted automatically so that only the 5 most recent dumps are kept.
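
The retention rule in item 5 can be illustrated with a short Python sketch. This is not the appliance's own cleanup mechanism, which the manual does not describe. The function name prune_crash_dumps is hypothetical, and the sketch assumes that each dump is a single file or directory directly under the crash directory and that modification time reflects dump age.

```python
import shutil
from pathlib import Path

MAX_DUMPS = 5  # retention limit stated in item 5


def prune_crash_dumps(crash_dir, keep=MAX_DUMPS):
    """Remove the oldest entries in crash_dir until at most `keep` remain.

    Assumption: every crash dump is one file or directory directly under
    crash_dir, and its modification time reflects when it was written.
    Returns the removed paths, oldest first.
    """
    # Sort oldest first so the excess entries sit at the front of the list.
    entries = sorted(Path(crash_dir).iterdir(), key=lambda p: p.stat().st_mtime)
    excess = entries[: max(0, len(entries) - keep)]
    for path in excess:
        if path.is_dir() and not path.is_symlink():
            shutil.rmtree(path)  # dump stored as a directory
        else:
            path.unlink()  # dump stored as a single file
    return excess
```

Because the function deletes data, it is best tried on a scratch directory rather than on /var/crash itself.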

  • The kernel crash dumps stored in /var/crash can be analyzed to diagnose the cause of a panic. Review these dumps after the system reboots following a kernel panic.

  • The 10-second reboot delay allows time for crash dump generation and any additional logging.
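
To check the panic behavior from items 1 through 4 on a running system, a sketch such as the following could be used. The manual does not name the settings involved, so the mapping to the standard Linux sysctls kernel.panic_on_oops, kernel.softlockup_panic, kernel.hardlockup_panic, and kernel.panic is an assumption, as are the expected values of 1, 1, 1, and 10. The function names are hypothetical.

```python
from pathlib import Path

# Assumed mapping of the documented behavior to standard Linux sysctls.
# The manual itself does not name these settings.
EXPECTED = {
    "kernel/panic_on_oops": 1,     # item 1: oops is treated as a panic
    "kernel/softlockup_panic": 1,  # item 2: soft lockup is treated as a panic
    "kernel/hardlockup_panic": 1,  # item 3: hard lockup is treated as a panic
    "kernel/panic": 10,            # item 4: reboot 10 seconds after a panic
}


def read_settings(root="/proc/sys"):
    """Read each expected sysctl under root; missing or unreadable gives None."""
    values = {}
    for key in EXPECTED:
        try:
            values[key] = int((Path(root) / key).read_text().strip())
        except (OSError, ValueError):
            values[key] = None
    return values


def find_mismatches(actual, expected=EXPECTED):
    """Return {key: (found, wanted)} for every setting that differs."""
    return {
        key: (actual.get(key), wanted)
        for key, wanted in expected.items()
        if actual.get(key) != wanted
    }


if __name__ == "__main__":
    for key, (found, wanted) in find_mismatches(read_settings()).items():
        print(f"{key}: found {found}, expected {wanted}")
```

A setting reported as None means the corresponding file was not present or not readable, for example when the kernel was built without the relevant lockup detector.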

© Copyright 2024, NVIDIA. Last updated on Sep 5, 2024.