Error Recovery and Response Flags

The following flags are required for client error recovery and response:

  • Row-remapping pending flag

    • This flag indicates that row-remapping will happen at the next GPU reset.

    • Even with the flag set, unaffected applications can continue running with no impact on accuracy or performance, and new workloads can be launched.

    • This flag is useful for determining whether the GPU is ready to receive a live virtual machine (VM) migration. Reset the GPU before a live VM is migrated to it.

  • Row-remapping failure flag

    • For the definition of row-remapping failure, see RMA Policy Thresholds for Row-Remapping.

  • Recovery action flag

    • The recovery action flag is set for uncorrectable errors. When set, this flag indicates one of the following user actions:

      • None (NVML_GPU_RECOVERY_ACTION_NONE)

        • No recovery action is needed. The GPU can be immediately used again by starting a new GPU process.

      • Reset (NVML_GPU_RECOVERY_ACTION_GPU_RESET)

        • The GPU has encountered a fault that requires a reset to recover. Terminate all GPU processes and reset the GPU using “nvidia-smi -r”. The GPU can then be used again by starting new GPU processes.

      • Reboot (NVML_GPU_RECOVERY_ACTION_NODE_REBOOT)

        • The GPU has encountered a fault that may have left the OS in an inconsistent state. Reboot the operating system to restore it to a consistent state.

      • Drain P2P (NVML_GPU_RECOVERY_ACTION_DRAIN_P2P)

        • The GPU has encountered a fault that requires all peer-to-peer traffic to be quiesced. Terminate all GPU processes that conduct peer-to-peer traffic and disable UVM persistence mode. Once all peer-to-peer traffic has drained, query NVML_FI_DEV_GET_GPU_RECOVERY_ACTION again, which will return one of the other actions.

  • Drain and Reset Flag (NVML_GPU_RECOVERY_ACTION_DRAIN_AND_RESET)

    • The GPU has encountered a fault that causes it to temporarily operate at reduced capacity, such as part of its frame buffer memory being offlined or some of its MIG partitions going down. No new work should be scheduled on the GPU, but existing work that was not affected can safely continue until it finishes or reaches a good checkpoint. After all existing work has drained, reset the GPU to regain its full capacity.
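The recovery actions above can be sketched as a small lookup that a node-management client might use. This is a minimal illustration, not the NVML API: the `RecoveryAction` enum, the `STEPS` table, and the `resolve` helper are hypothetical names, and the `query` and `drain_p2p` callables are placeholders for however a given client reads NVML_FI_DEV_GET_GPU_RECOVERY_ACTION and quiesces peer-to-peer traffic. The step lists only restate the guidance in the list above.

```python
from enum import Enum
from typing import Callable


class RecoveryAction(Enum):
    """Hypothetical local mirror of the recovery actions named above."""
    NONE = "NVML_GPU_RECOVERY_ACTION_NONE"
    GPU_RESET = "NVML_GPU_RECOVERY_ACTION_GPU_RESET"
    NODE_REBOOT = "NVML_GPU_RECOVERY_ACTION_NODE_REBOOT"
    DRAIN_P2P = "NVML_GPU_RECOVERY_ACTION_DRAIN_P2P"
    DRAIN_AND_RESET = "NVML_GPU_RECOVERY_ACTION_DRAIN_AND_RESET"


# Operator steps per action, paraphrased from the list above.
STEPS = {
    RecoveryAction.NONE: [
        "start new GPU processes",
    ],
    RecoveryAction.GPU_RESET: [
        "terminate all GPU processes",
        "reset the GPU (nvidia-smi -r)",
        "start new GPU processes",
    ],
    RecoveryAction.NODE_REBOOT: [
        "reboot the operating system",
    ],
    RecoveryAction.DRAIN_P2P: [
        "terminate GPU processes that conduct peer-to-peer traffic",
        "disable UVM persistence mode",
        "query the recovery action again",
    ],
    RecoveryAction.DRAIN_AND_RESET: [
        "stop scheduling new work on the GPU",
        "let unaffected work finish or reach a checkpoint",
        "reset the GPU",
    ],
}


def resolve(query: Callable[[], RecoveryAction],
            drain_p2p: Callable[[], None],
            max_queries: int = 5) -> RecoveryAction:
    """Query the action; while it is DRAIN_P2P, drain and query again.

    `query` and `drain_p2p` are caller-supplied placeholders (assumptions),
    and `max_queries` is an arbitrary safety bound for this sketch.
    """
    for _ in range(max_queries):
        action = query()
        if action is not RecoveryAction.DRAIN_P2P:
            return action
        drain_p2p()
    raise RuntimeError("recovery action is still DRAIN_P2P after max_queries")
```

A caller would pass its own query and drain functions to `resolve`, then look up `STEPS[action]` for the returned action. Keeping the steps in a table rather than in branching code makes it easy to compare them against the documented guidance.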

Note

These client flags are currently exposed through SMBPBI.