Troubleshooting

Hints

  • The following environment variables are useful when debugging:

    • CUBLASMP_LOG_LEVEL: enables detailed logging of cuBLASMp operations.

    • NCCL_DEBUG: one of several environment variables through which NCCL provides extensive debugging capabilities. For a comprehensive list of NCCL debugging options, refer to the NCCL Environment Variables documentation.

    • CUBLASMP_DISABLE_NVSHMEM: disables NVSHMEM usage, which can help debug NVSHMEM-related issues or work around compatibility problems.
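The variables above can be applied to a single run without changing the surrounding shell. The following is a minimal sketch of a launcher, not part of cuBLASMp: the helper names `debug_env` and `run_with_debug` are hypothetical, and the value passed for CUBLASMP_LOG_LEVEL is left to the caller because the accepted levels are not listed here. `INFO` is one of the standard NCCL_DEBUG settings.

```python
import os
import subprocess
import sys


def debug_env(log_level, nccl_debug="INFO", disable_nvshmem=False, base=None):
    """Return a copy of the environment with the debugging variables set.

    log_level is passed through unchanged; check the cuBLASMp documentation
    for the values CUBLASMP_LOG_LEVEL accepts (assumption: not validated here).
    """
    env = dict(os.environ if base is None else base)
    env["CUBLASMP_LOG_LEVEL"] = str(log_level)
    env["NCCL_DEBUG"] = nccl_debug
    if disable_nvshmem:
        # Value 1 disables NVSHMEM usage in cuBLASMp.
        env["CUBLASMP_DISABLE_NVSHMEM"] = "1"
    return env


def run_with_debug(cmd, **kwargs):
    """Run cmd (a list of arguments) with the debugging environment applied."""
    return subprocess.run(cmd, env=debug_env(**kwargs), check=False)
```

For example, `run_with_debug(["mpirun", "-n", "2", "./my_app"], log_level=level)` would start a hypothetical `./my_app` under `mpirun` with logging enabled only for that process tree.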

Known Issues

  • cuBLASMp versions 0.6.0 and older: During cublasMpGridCreate, NVSHMEM 3.4.5 and older may call exit() on unsupported platforms (e.g. arm64-sbsa with PCIe GPUs). Starting with cuBLASMp 0.7.0, users can set the CUBLASMP_DISABLE_NVSHMEM environment variable to 1 to work around this issue.

  • cuBLASMp versions older than 0.5.0: Some users may see hangs caused by lazy initialization of NCCL in UCC. To disable lazy NCCL initialization, set the UCC_TL_NCCL_LAZY_INIT environment variable to no.

  • cuBLASMp versions older than 0.5.0: Some users may see errors with HPC-X v2.18 caused by UCC being initialized in both OMPI and cuBLASMp. To disable UCC initialization in OMPI, set the OMPI_MCA_coll_ucc_enable environment variable to 0.
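The version conditions in the list above can be encoded in a small helper. This is a sketch only: `workaround_env` and its flag names are hypothetical, the caller is assumed to know which symptom applies, and the version string is assumed to be plain dotted integers such as `0.7.0`.

```python
def parse_version(text):
    """Parse a dotted version such as '0.7.0' into a 3-tuple of ints."""
    parts = [int(p) for p in text.split(".")]
    parts += [0] * (3 - len(parts))  # pad '0.7' to (0, 7, 0)
    return tuple(parts[:3])


def workaround_env(cublasmp_version,
                   nvshmem_unsupported_platform=False,
                   nccl_lazy_init_hang=False,
                   hpcx_ucc_clash=False):
    """Return the environment variables for the known issues that apply."""
    version = parse_version(cublasmp_version)
    env = {}
    # CUBLASMP_DISABLE_NVSHMEM is only available from cuBLASMp 0.7.0 on.
    if nvshmem_unsupported_platform and version >= (0, 7, 0):
        env["CUBLASMP_DISABLE_NVSHMEM"] = "1"
    # Both UCC-related issues are listed for versions older than 0.5.0.
    if version < (0, 5, 0):
        if nccl_lazy_init_hang:
            env["UCC_TL_NCCL_LAZY_INIT"] = "no"
        if hpcx_ucc_clash:
            env["OMPI_MCA_coll_ucc_enable"] = "0"
    return env
```

The returned dictionary can be merged into `os.environ` before the application starts. For 0.6.0 and older the helper returns no NVSHMEM setting, since the variable is only available from 0.7.0.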