Troubleshooting
Use the pages below to narrow the problem down before changing NCCL settings.
GPU troubleshooting covers GPU-to-GPU, GPU-to-NIC, ACS, topology, and multi-node NVLink issues.
Networking troubleshooting covers interface selection, low-level fabric checks, latency and bandwidth tests, and InfiniBand or RoCE diagnostics.
Runtime and MPI issues covers basic error handling, shared-memory and runtime problems, and MPI startup validation.
Performance and tuning covers baseline performance triage and NCCL tuning knobs to try after system checks look healthy.
Logging covers NCCL logging levels, subsystem filters, output files, and timestamps.
RAS covers NCCL’s built-in RAS (Reliability, Availability, and Serviceability) subsystem for diagnosing hangs and crashes.
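
Because most of the pages above start from NCCL's log output, it can help to turn logging on before anything else. The following is a minimal sketch, not a recommended configuration: it builds the environment overrides for the `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`, and `NCCL_DEBUG_FILE` variables described on the Logging page. The helper name `nccl_debug_env`, the `/tmp` log directory, and the choice of the `INIT` and `NET` subsystems are assumptions made for illustration.

```python
import os


def nccl_debug_env(log_dir="/tmp", subsystems=("INIT", "NET")):
    """Return environment overrides that enable NCCL logging.

    Hypothetical helper: the name, the default directory, and the default
    subsystem list are illustrative assumptions, not part of NCCL.
    """
    return {
        # INFO prints NCCL's informational messages as well as warnings.
        "NCCL_DEBUG": "INFO",
        # Comma-separated subsystem filter applied to the INFO output.
        "NCCL_DEBUG_SUBSYS": ",".join(subsystems),
        # %h expands to the hostname and %p to the process ID, so each
        # rank writes to its own file.
        "NCCL_DEBUG_FILE": os.path.join(log_dir, "nccl.%h.%p.log"),
    }


# Merge the overrides into a copy of the current environment; pass the result
# as `env=` to subprocess.run when launching the job under investigation.
launch_env = {**os.environ, **nccl_debug_env()}
```

Returning a dictionary instead of modifying `os.environ` keeps the logging settings scoped to the launched job, so they do not leak into other processes started from the same script.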