Troubleshooting

Use the pages below to narrow the problem down before changing NCCL settings.

  • GPU troubleshooting covers GPU-to-GPU, GPU-to-NIC, ACS, topology, and multi-node NVLink issues.

  • Networking troubleshooting covers interface selection, low-level fabric checks, latency and bandwidth tests, and InfiniBand or RoCE diagnostics.

  • Runtime and MPI issues covers basic error handling, shared-memory and runtime problems, and MPI startup validation.

  • Performance and tuning covers baseline performance triage and the NCCL tuning knobs to try once the system checks pass.

  • Logging covers NCCL logging levels, subsystem filters, output files, and timestamps.

  • RAS covers NCCL’s built-in reliability, availability, and serviceability (RAS) subsystem for diagnosing hangs and crashes.
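
Capturing a log is often a useful first step before opening any of the pages above. The following is a minimal sketch, not a prescribed procedure: it builds an environment with NCCL's documented logging variables (NCCL_DEBUG, NCCL_DEBUG_SUBSYS, NCCL_DEBUG_FILE) set. The helper name nccl_logging_env, the chosen subsystems, and the log path are assumptions made for illustration.

```python
import os


def nccl_logging_env(level="INFO", subsystems=("INIT", "NET"),
                     log_file=None, base_env=None):
    """Return a copy of the environment with NCCL logging variables set.

    Hypothetical helper for illustration; it is not part of NCCL.
    """
    # Start from the caller's environment unless an explicit base is given.
    env = dict(os.environ if base_env is None else base_env)
    # Logging level, for example WARN, INFO, or TRACE.
    env["NCCL_DEBUG"] = level
    # Optional comma-separated subsystem filter.
    if subsystems:
        env["NCCL_DEBUG_SUBSYS"] = ",".join(subsystems)
    # Optional output file; NCCL expands %h to the hostname and %p to the PID.
    if log_file:
        env["NCCL_DEBUG_FILE"] = log_file
    return env


# Illustrative path only; an empty base_env keeps the printed output small.
env = nccl_logging_env(log_file="/tmp/nccl.%h.%p.log", base_env={})
for name in sorted(env):
    print(f"{name}={env[name]}")
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the job, so the logging settings apply to that run only and the parent shell stays unchanged.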