CUDA Instrumentation Methods#

Beginning with the CUDA 12.9 release, the CUDA driver introduces lightweight instrumentation methodologies designed for debugging and development when standard developer tools are not suitable.

This user guide provides CUDA developers with an understanding of these instrumentation methods and their applications for debugging on Jetson platforms. These lightweight instrumentation methods are designed to be left “always on”, enabling you to debug CUDA issues without affecting the reproduction rate or performance of the application. They are particularly useful for diagnosing GPU crashes or hangs.

Note

All instrumentation methodologies are provided through the new libcuda_instrumentation.so library.

Prerequisites#

Complete the following prerequisites before using these instrumentation methods:

  • Verify the availability of the CUDA instrumented binaries on the target system:

    ls /usr/lib/aarch64-linux-gnu/nvidia/libcuda_*

    Expected output: /usr/lib/aarch64-linux-gnu/nvidia/libcuda_instrumentation.so.

  • Set up the instrumented libcuda.so.1 for use:

    1. Copy libcuda_instrumentation.so to libcuda.so.1:

      cp libcuda_instrumentation.so libcuda.so.1

    2. Set LD_LIBRARY_PATH to use the instrumented libcuda.so.1:

      export LD_LIBRARY_PATH=<path/to/instrumented_libcuda>:$LD_LIBRARY_PATH

    These changes ensure that the application correctly links to the instrumented CUDA library during execution.
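
To confirm that the environment picks up a CUDA driver at the expected version (CUDA 12.9, reported as 12090), you can run a small check such as the following sketch. This program is illustrative and not part of the instrumentation package; it only verifies that a libcuda.so.1 resolves and reports its version, so use ldd on your application binary if you need to confirm which copy is actually loaded.

    // version_check.cu -- illustrative sanity check, not part of the instrumentation package.
    // Build with: nvcc version_check.cu -o version_check
    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int driverVersion = 0, runtimeVersion = 0;

        // Reports the version of the CUDA driver (libcuda.so.1) loaded by this process.
        cudaError_t err = cudaDriverGetVersion(&driverVersion);
        if (err != cudaSuccess) {
            printf("cudaDriverGetVersion failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        cudaRuntimeGetVersion(&runtimeVersion);

        // With the CUDA 12.9 setup described above, the driver version should print as 12090.
        printf("Driver version: %d\n", driverVersion);
        printf("Runtime version: %d\n", runtimeVersion);
        return 0;
    }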

GPU Task Tracker#

The GPU Task Tracker is an instrumentation methodology that helps CUDA users identify faulty CUDA kernels causing application hangs or crashes. It tracks all job submissions made by applications and libraries, helping you to pinpoint the kernel responsible for the issue.

When to Use#

Use this method when a crash or hang occurs in submitted GPU work and you need to identify the specific workload responsible for the issue.

How to Use#

  1. Run the application with the following environment variable set to enable the GPU Task Tracker:

    export CUDA_DIAG_TASK_TRACKER=1

  2. If the issue (kernel hang or GPU error) is reproduced, the framework identifies the kernel causing the hang or error, along with its stream/channel ID and program counter (PC). The following two examples show what you might see.

    Case 1: Hang Detected

    • _Z14infiniteKerneli is the name of the kernel that has hung (a sketch of a kernel of this kind follows the logs below).

    Logs:

    $ cp libcuda_instrumentation.so libcuda.so.1
    $ export LD_LIBRARY_PATH=<path to lib>:$LD_LIBRARY_PATH
    $ export CUDA_DIAG_TASK_TRACKER=1
    $ echo $CUDA_DIAG_TASK_TRACKER
    1
    $ ./cudaDiagnostics -t diag_kernel_launch_hang_tests
    Device 0:
    Driver version: 12090
    Runtime version: 12090
    Dispatcher pid: 107065
    Running test diag_kernel_launch_hang_tests (pid: 107067)
    [run_launch_tests():100] numKernels : 1021, threads : 4, hangKernelIndex : 0, hangThreadIndex : 3
    [GPU Task Tracker]: Possible hang detected.. Printing pending jobs in GPU..
    [GPU Task Tracker]: Last pending job in stream : [0x0xfffd68007500], hwChannelId : [227], is [_Z14infiniteKerneli], launchPC [0x104b91700], launchCount [0x1]
    
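
    For reference, _Z14infiniteKerneli demangles to infiniteKernel(int). The source of this test is not shown here, but a kernel that produces this kind of hang might look like the following sketch (hypothetical; the kernel name and logic are assumptions based on the mangled name):

      // infinite_hang.cu -- hypothetical sketch of a kernel that never terminates.
      // Reconstructed for illustration from the mangled name _Z14infiniteKerneli.
      #include <cuda_runtime.h>

      __global__ void infiniteKernel(int value)
      {
          // volatile forces a re-read on every iteration so the loop is not optimized away;
          // the kernel never returns, so the stream never drains.
          volatile int v = value;
          while (v != -1) {
          }
      }

      int main()
      {
          infiniteKernel<<<1, 1>>>(0);
          // This call never returns; with CUDA_DIAG_TASK_TRACKER=1 the tracker prints the
          // pending job (kernel name, hwChannelId, launchPC) while the hang is in progress.
          cudaDeviceSynchronize();
          return 0;
      }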

    Case 2: Error Detected

    • Compare the PC and hwChannelId values printed by the tracker with the dmesg logs for reference.

    • You can extract the following details from the logs:
      • hwChannelId (from stdout) corresponds to ch id:508 in dmesg.

      • Error kernel information is in the stdout logs. For example: _Z18triggerFaultKernelv (a sketch of a kernel of this kind follows the logs below).

      • CUDA prints the PC value of the kernel for comparison: launchPC [0x101b0e500].

    Logs:

    $ cp libcuda_instrumentation.so libcuda.so.1
    $ export LD_LIBRARY_PATH=<path to lib>:$LD_LIBRARY_PATH
    $ export CUDA_DIAG_TASK_TRACKER=1
    $ echo $CUDA_DIAG_TASK_TRACKER
    1
    $ sudo dmesg -C
    $ ./cudaDiagnostics -t diag_kernel_trigger_fault --triggerFault
    Device 0:
    Driver version: 12090
    Runtime version: 12090
    Dispatcher pid: 7075
    Running test diag_kernel_trigger_fault (pid: 7077)
    [GPU Task Tracker]: Possible error detected.. Printing pending jobs in GPU..
    [GPU Task Tracker]: Last pending job in stream : [0x0xaaaadf592b40], hwChannelId : [508], is [_Z18triggerFaultKernelv], launchPC [0x101b0e500], launchCount [0x1]
    [GPU Task Tracker]: Possible error detected.. Printing pending jobs in GPU..
    [GPU Task Tracker]: Last pending job in stream : [0x0xaaaadf592b40], hwChannelId : [508], is [_Z18triggerFaultKernelv], launchPC [0x101b0e500], launchCount [0x1]
    [GPU Task Tracker]: Possible error detected.. Printing pending jobs in GPU..
    [GPU Task Tracker]: Last pending job in stream : [0x0xaaaadf592b40], hwChannelId : [508], is [_Z18triggerFaultKernelv], launchPC [0x101b0e500], launchCount [0x1]
    [GPU Task Tracker]: Possible error detected.. Printing pending jobs in GPU..
    [GPU Task Tracker]: Last pending job in stream : [0x0xaaaadf592b40], hwChannelId : [508], is [_Z18triggerFaultKernelv], launchPC [0x101b0e500], launchCount [0x1]
    [GPU Task Tracker]: Possible error detected.. Printing pending jobs in GPU..
    ^^^^ PASS: diag_kernel_trigger_fault (1707.6ms)
    Total time: 1708ms
    
    $ sudo dmesg
    [ 6818.920799] nvgpu: 17000000.gpu nvgpu_cic_mon_report_err_safety_services:60   [ERR]  Error reporting is not supported in this platform
    [ 6818.935774] nvgpu: 17000000.gpu gv11b_mm_mmu_fault_handle_buf_valid_entry:554  [ERR]  page fault error: err_type = 0x8, fault_status = 0x200
    [ 6818.948751] nvgpu: 17000000.gpu      gv11b_fb_mmu_fault_info_dump:297  [ERR]  [MMU FAULT] mmu engine id:  65, ch id:  508, fault addr: 0x1000, fault addr aperture: 0, fault type: invalid pde, access type: virt write,
    [ 6818.968603] nvgpu: 17000000.gpu      gv11b_fb_mmu_fault_info_dump:310  [ERR]  [MMU FAULT] protected mode: 0, client type: gpc, client id:  t1_6, gpc id if client type is gpc: 1,
    [ 6818.984953] nvgpu: 17000000.gpu      gv11b_fb_mmu_fault_info_dump:320  [ERR]  [MMU FAULT] faulted act eng id if any: 0x0, faulted veid if any: 0x1, faulted pbdma id if any: 0xffffffff,
    [ 6819.001963] nvgpu: 17000000.gpu gv11b_mm_mmu_fault_set_mmu_nack_pending:362  [ERR]  chid: 508 is referenceable but not bound to tsg
    [ 6819.014152] nvgpu: 17000000.gpu gv11b_mm_mmu_fault_handle_mmu_fault_refch:399  [ERR]  chid: 508 is referenceable but not bound to tsg
    [ 6819.026511] nvgpu: 17000000.gpu       nvgpu_rc_mmu_fault_recovery:383  [ERR]  mmu fault id=508 id_type=0 act_eng_bitmask=00000001
    [ 6819.054690] nvgpu: 17000000.gpu gv11b_fifo_locked_abort_runlist_active_tsgs:48   [ERR]  abort active tsgs of runlists set in runlists_mask: 0x00000001
    [ 6819.068584] nvgpu: 17000000.gpu       nvgpu_tsg_set_ctx_mmu_error:1375 [ERR]  TSG 1 generated a mmu fault
    [ 6819.078437] nvgpu: 17000000.gpu     nvgpu_set_err_notifier_locked:144  [ERR]  error notifier set to 31 for ch 510 owned by Xorg
    [ 6819.090253] nvgpu: 17000000.gpu       nvgpu_tsg_set_ctx_mmu_error:1375 [ERR]  TSG 2 generated a mmu fault
    [ 6819.100109] nvgpu: 17000000.gpu     nvgpu_set_err_notifier_locked:144  [ERR]  error notifier set to 31 for ch 509 owned by Xorg
    [ 6819.111932] nvgpu: 17000000.gpu       nvgpu_tsg_set_ctx_mmu_error:1375 [ERR]  TSG 3 generated a mmu fault
    [ 6819.127384] nvgpu: 17000000.gpu       nvgpu_tsg_set_ctx_mmu_error:1375 [ERR]  TSG 5 generated a mmu fault
    [ 6819.137233] nvgpu: 17000000.gpu     nvgpu_set_err_notifier_locked:144  [ERR]  error notifier set to 31 for ch 507 owned by gnome-initial-s
    [ 6819.150460] nvgpu: 17000000.gpu nvgpu_cic_mon_report_err_safety_services:60   [ERR]  Error reporting is not supported in this platform
    [ 6819.162903] nvgpu: 17000000.gpu      ga10b_fifo_ctxsw_timeout_isr:349  [ERR]  Host pfifo ctxsw timeout error
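
    For reference, _Z18triggerFaultKernelv demangles to triggerFaultKernel(). The source of this test is not shown here, but a kernel that produces an MMU fault like the one above might look like the following sketch (hypothetical; the kernel name and faulting address are assumptions, chosen to match the mangled name and the fault addr: 0x1000 / virt write entries in the dmesg output):

      // trigger_fault.cu -- hypothetical sketch of a kernel that triggers a GPU page fault.
      // Reconstructed for illustration from the mangled name _Z18triggerFaultKernelv.
      #include <cuda_runtime.h>

      __global__ void triggerFaultKernel()
      {
          // Writing through an unmapped address causes an MMU fault on the GPU.
          // The tracker then prints the pending job's name, hwChannelId, and launchPC,
          // which can be correlated with the ch id and fault details in dmesg.
          int *invalid = reinterpret_cast<int *>(0x1000);
          *invalid = 42;
      }

      int main()
      {
          triggerFaultKernel<<<1, 1>>>();
          cudaDeviceSynchronize();
          return 0;
      }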