Troubleshooting
Hints
When debugging an issue, the following environment variables are useful:
CUBLASMP_LOG_LEVEL: enables detailed logging of cuBLASMp operations.
NCCL_DEBUG: enables NCCL debug output. For the broader set of NCCL diagnostics, refer to the NCCL environment variable documentation.
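As a minimal sketch of how these two variables might be applied to a single run, the Python launcher below copies the current environment, sets both variables, and starts a command with them. The helper names debug_env and run_with_debug are hypothetical, and the CUBLASMP_LOG_LEVEL value of "5" is an assumption; check the cuBLASMp logging documentation for the levels your release accepts. NCCL_DEBUG=INFO is a standard NCCL setting.

```python
import os
import subprocess


def debug_env(base=None, cublasmp_log_level="5", nccl_debug="INFO"):
    """Return a copy of the environment with both logging variables set."""
    env = dict(os.environ if base is None else base)
    # Assumed value: consult the cuBLASMp logging documentation for valid levels.
    env["CUBLASMP_LOG_LEVEL"] = str(cublasmp_log_level)
    # NCCL_DEBUG accepts VERSION, WARN, INFO, or TRACE.
    env["NCCL_DEBUG"] = nccl_debug
    return env


def run_with_debug(command):
    """Run `command` (a list of arguments) with logging enabled; return its exit code."""
    return subprocess.run(command, env=debug_env()).returncode
```

Building a copy of the environment keeps the verbose logging confined to the launched process. For multi-node runs, the MPI launcher may need to be told to forward these variables to remote ranks (for example, mpirun -x NAME in Open MPI).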
Known Issues
The issues below are relevant only when maintaining older cuBLASMp releases.
cuBLASMp versions older than 0.8.0: During cublasMpGridCreate, NVSHMEM 3.4.5 and older may call exit() on unsupported platforms (e.g. arm64-sbsa + PCIe GPUs). For cuBLASMp 0.7.0, users can set the CUBLASMP_DISABLE_NVSHMEM environment variable to 1 to work around this issue. cuBLASMp 0.8.0 and newer no longer use NVSHMEM.
cuBLASMp versions older than 0.5.0: Some users may face hangs due to lazy initialization of NCCL in UCC. To disable the lazy NCCL initialization, set the UCC_TL_NCCL_LAZY_INIT environment variable to no.
cuBLASMp versions older than 0.5.0: Some users may see errors with HPC-X v2.18 caused by a clash between UCC being initialized in OMPI and in cuBLASMp. To disable UCC initialization in OMPI, set the OMPI_MCA_coll_ucc_enable environment variable to 0.
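The version-to-workaround mapping above can be sketched as a small lookup. This is only an illustration: the function names parse_version and candidate_workarounds are hypothetical, and the result lists candidate settings that make sense only if the corresponding symptom (an exit() during grid creation, a hang, or HPC-X v2.18 errors) is actually observed. Applying the NVSHMEM workaround to every 0.7.x release is an assumption, since only 0.7.0 is named.

```python
def parse_version(text):
    """Turn a string such as "0.7.0" into a comparable tuple (0, 7, 0)."""
    parts = [int(part) for part in text.split(".")]
    parts += [0] * (3 - len(parts))  # pad "0.5" to (0, 5, 0)
    return tuple(parts[:3])


def candidate_workarounds(cublasmp_version):
    """Map a cuBLASMp version string to the workaround variables listed above."""
    version = parse_version(cublasmp_version)
    env = {}
    if (0, 7, 0) <= version < (0, 8, 0):
        # Only 0.7.0 is named in the known issues; covering later
        # 0.7.x patch releases is an assumption.
        env["CUBLASMP_DISABLE_NVSHMEM"] = "1"
    if version < (0, 5, 0):
        # Hangs caused by lazy NCCL initialization in UCC.
        env["UCC_TL_NCCL_LAZY_INIT"] = "no"
        # UCC initialization clash between OMPI and cuBLASMp (HPC-X v2.18).
        env["OMPI_MCA_coll_ucc_enable"] = "0"
    return env
```

For example, candidate_workarounds("0.7.0") yields only the NVSHMEM setting, while any version from 0.8.0 onward yields an empty mapping.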