Troubleshooting

Hints

Known Issues

  • Due to a bug in NCCL 2.30.4, NCCL does not initialize its internal RMA signal buffers until the first ncclPutSignal/ncclWaitSignal operation. When using CUDA Graphs, users must run at least one AllGather + Matmul or Matmul + ReduceScatter operation outside of graph capture, so that the buffers are initialized, before starting a CUDA Graph capture that contains these operations. This issue is fixed in NCCL 2.30.6.

  • Some users on multi-node InfiniBand configurations may experience hangs with NCCL 2.29.7 and later. To work around this issue, set the NCCL_PXN_DISABLE environment variable to 1. This issue is fixed in NCCL 2.30.6.
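The two NCCL items above can be turned into a small start-up check. The following is a minimal sketch in C, not part of cuBLASMp or NCCL: the helper names (nccl_pxn_hang_possible, nccl_needs_signal_warmup, apply_pxn_workaround) and the NCCL_VER macro are hypothetical. It assumes the version integer comes from ncclGetVersion and uses the major*10000 + minor*100 + patch encoding that NCCL uses for 2.9 and later. Treating every version from 2.30.4 up to 2.30.6 as needing the warm-up is a conservative assumption, since the text above names only 2.30.4.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for setenv() under strict -std= modes */
#endif
#include <assert.h>
#include <stdlib.h>

/* Same encoding NCCL uses for versions 2.9 and later (hypothetical macro name). */
#define NCCL_VER(major, minor, patch) ((major) * 10000 + (minor) * 100 + (patch))

/* Nonzero if the multi-node InfiniBand hang may apply:
 * NCCL 2.29.7 and later, up to but not including the fix in 2.30.6. */
static int nccl_pxn_hang_possible(int version) {
    return version >= NCCL_VER(2, 29, 7) && version < NCCL_VER(2, 30, 6);
}

/* Nonzero if one eager AllGather + Matmul or Matmul + ReduceScatter should be
 * run before CUDA Graph capture. Assumption: everything from 2.30.4 up to the
 * fix in 2.30.6 is treated as affected; the source names only 2.30.4. */
static int nccl_needs_signal_warmup(int version) {
    return version >= NCCL_VER(2, 30, 4) && version < NCCL_VER(2, 30, 6);
}

/* Sets NCCL_PXN_DISABLE=1 for affected versions. An existing value is left
 * untouched (overwrite = 0). Returns 0 on success or no-op, -1 on failure. */
static int apply_pxn_workaround(int version) {
    assert(version > 0);  /* precondition: a real NCCL version code */
    if (!nccl_pxn_hang_possible(version)) {
        return 0;
    }
    return setenv("NCCL_PXN_DISABLE", "1", 0);
}
```

In a real program, apply_pxn_workaround would need to run before the first NCCL communicator is created, since a value set after initialization is unlikely to take effect. Exporting the variable in the job launch environment is usually simpler. If nccl_needs_signal_warmup returns nonzero, the application would run the operation once on the stream and synchronize before calling cudaStreamBeginCapture.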

The issues below are relevant only when maintaining older cuBLASMp releases.

  • cuBLASMp versions older than 0.8.0: During cublasMpGridCreate, NVSHMEM 3.4.5 and older may call exit() on unsupported platforms (e.g. arm64-sbsa + PCIe GPUs). For cuBLASMp 0.7.0, users can set the CUBLASMP_DISABLE_NVSHMEM environment variable to 1 to work around this issue. cuBLASMp 0.8.0 and newer no longer use NVSHMEM.

  • cuBLASMp versions older than 0.5.0: Some users may face hangs due to lazy initialization of NCCL in UCC. To disable the lazy NCCL initialization, please set the UCC_TL_NCCL_LAZY_INIT environment variable to no.

  • cuBLASMp versions older than 0.5.0: Some users may see errors with HPC-X v2.18 caused by a clash between the UCC initialization in OMPI and the one in cuBLASMp. To disable UCC initialization in OMPI, please set the OMPI_MCA_coll_ucc_enable environment variable to 0.
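For projects pinned to an older release, the three variables above could be collected in one table and applied at start-up. This is a minimal sketch in C under stated assumptions: the struct, the table, the apply_legacy_workarounds function, and the MP_VER version encoding are all hypothetical and exist only for this example. How the cuBLASMp version is obtained is left to the caller. The sketch sets every variable that matches the version, although the text above describes each workaround as needed only by some users, so a real application may prefer to make each entry opt-in.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for setenv() under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical encoding for this sketch only: major*10000 + minor*100 + patch. */
#define MP_VER(major, minor, patch) ((major) * 10000 + (minor) * 100 + (patch))

struct env_workaround {
    const char *name;   /* environment variable to set */
    const char *value;  /* value named in the known-issues list */
    int first_version;  /* first affected cuBLASMp version, inclusive */
    int end_version;    /* end of the affected range, exclusive */
};

static const struct env_workaround LEGACY_WORKAROUNDS[] = {
    /* The source names this variable for cuBLASMp 0.7.0 only. */
    {"CUBLASMP_DISABLE_NVSHMEM", "1", MP_VER(0, 7, 0), MP_VER(0, 7, 1)},
    /* Both of these apply to releases older than 0.5.0. */
    {"UCC_TL_NCCL_LAZY_INIT", "no", 0, MP_VER(0, 5, 0)},
    {"OMPI_MCA_coll_ucc_enable", "0", 0, MP_VER(0, 5, 0)},
};

/* Sets every variable whose range contains `version`, leaving values that are
 * already present in the environment untouched (overwrite = 0). Returns the
 * number of matching entries, or -1 if setenv() fails. */
static int apply_legacy_workarounds(int version) {
    size_t count = sizeof(LEGACY_WORKAROUNDS) / sizeof(LEGACY_WORKAROUNDS[0]);
    int matched = 0;
    for (size_t i = 0; i < count; ++i) {
        const struct env_workaround *w = &LEGACY_WORKAROUNDS[i];
        assert(w->name != NULL && w->value != NULL);
        if (version < w->first_version || version >= w->end_version) {
            continue;
        }
        if (setenv(w->name, w->value, 0) != 0) {
            return -1;
        }
        ++matched;
    }
    return matched;
}
```

For the variables to have a chance of taking effect, the call would need to happen before MPI_Init and before cublasMpGridCreate. Exporting the same variables in the job launch script achieves the same result without code changes.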