NCCL
2.20
Overview of NCCL
Using NCCL
Creating a Communicator
Creating a communicator with options
Creating more communicators
Using multiple NCCL communicators concurrently
Finalizing a communicator
Destroying a communicator
Error handling and communicator abort
Asynchronous errors and error handling
Fault Tolerance
Collective Operations
AllReduce
Broadcast
Reduce
AllGather
ReduceScatter
Data Pointers
CUDA Stream Semantics
Mixing Multiple Streams within the same ncclGroupStart/End() group
Group Calls
Management Of Multiple GPUs From One Thread
Aggregated Operations (2.2 and later)
Nonblocking Group Operation
Point-to-point communication
Sendrecv
One-to-all (scatter)
All-to-one (gather)
All-to-all
Neighbor exchange
Thread Safety
In-place Operations
Using NCCL with CUDA Graphs
User Buffer Registration
Memory Allocator
NCCL API
Communicator Creation and Management Functions
ncclGetLastError
ncclGetErrorString
ncclGetVersion
ncclGetUniqueId
ncclCommInitRank
ncclCommInitAll
ncclCommInitRankConfig
ncclCommSplit
ncclCommFinalize
ncclCommDestroy
ncclCommAbort
ncclCommGetAsyncError
ncclCommCount
ncclCommCuDevice
ncclCommUserRank
ncclCommRegister
ncclCommDeregister
ncclMemAlloc
ncclMemFree
Collective Communication Functions
ncclAllReduce
ncclBroadcast
ncclReduce
ncclAllGather
ncclReduceScatter
Group Calls
ncclGroupStart
ncclGroupEnd
Point To Point Communication Functions
ncclSend
ncclRecv
Types
ncclComm_t
ncclResult_t
ncclDataType_t
ncclRedOp_t
ncclScalarResidence_t
ncclConfig_t
User Defined Reduction Operators
ncclRedOpCreatePreMulSum
ncclRedOpDestroy
Migrating from NCCL 1 to NCCL 2
Initialization
Communication
Counts
In-place usage for AllGather and ReduceScatter
AllGather arguments order
Datatypes
Error codes
Examples
Communicator Creation and Destruction Examples
Example 1: Single Process, Single Thread, Multiple Devices
Example 2: One Device per Process or Thread
Example 3: Multiple Devices per Thread
Example 4: Multiple communicators per device
Communication Examples
Example 1: One Device per Process or Thread
Example 2: Multiple Devices per Thread
NCCL and MPI
API
Using multiple devices per process
ReduceScatter operation
Send and Receive counts
Other collectives and point-to-point operations
In-place operations
Using NCCL within an MPI Program
MPI Progress
Inter-GPU Communication with CUDA-aware MPI
Environment Variables
NCCL_P2P_DISABLE
Values accepted
NCCL_P2P_LEVEL
Values accepted
Integer Values (Legacy)
NCCL_P2P_DIRECT_DISABLE
Values accepted
NCCL_SHM_DISABLE
Values accepted
NCCL_SOCKET_IFNAME
Values accepted
NCCL_SOCKET_FAMILY
Values accepted
NCCL_SOCKET_NTHREADS
Values accepted
NCCL_NSOCKS_PERTHREAD
Values accepted
NCCL_DEBUG
Values accepted
NCCL_BUFFSIZE
Values accepted
NCCL_NTHREADS
Values accepted
NCCL_MAX_NCHANNELS
Values accepted
NCCL_MIN_NCHANNELS
Values accepted
NCCL_CROSS_NIC
Values accepted
NCCL_CHECKS_DISABLE
Values accepted
NCCL_CHECK_POINTERS
Values accepted
NCCL_LAUNCH_MODE
Values accepted
NCCL_IB_DISABLE
Values accepted
NCCL_IB_HCA
Values accepted
NCCL_IB_TIMEOUT
Values accepted
NCCL_IB_RETRY_CNT
Values accepted
NCCL_IB_GID_INDEX
Values accepted
NCCL_IB_SL
Values accepted
NCCL_IB_TC
Values accepted
NCCL_IB_AR_THRESHOLD
Values accepted
NCCL_IB_CUDA_SUPPORT
Values accepted
NCCL_IB_QPS_PER_CONNECTION
Values accepted
NCCL_IB_SPLIT_DATA_ON_QPS
Values accepted
NCCL_IB_PCI_RELAXED_ORDERING
Values accepted
NCCL_IB_ADAPTIVE_ROUTING
Values accepted
NCCL_MEM_SYNC_DOMAIN
Values accepted
NCCL_CUMEM_ENABLE
Values accepted
NCCL_NET
Values accepted
NCCL_NET_PLUGIN
Values accepted
NCCL_NET_GDR_LEVEL (formerly NCCL_IB_GDR_LEVEL)
Values accepted
Integer Values (Legacy)
NCCL_NET_GDR_READ
Values accepted
NCCL_NET_SHARED_BUFFERS
Value accepted
NCCL_NET_SHARED_COMMS
Value accepted
NCCL_SINGLE_RING_THRESHOLD
Values accepted
NCCL_LL_THRESHOLD
Values accepted
NCCL_TREE_THRESHOLD
Values accepted
NCCL_ALGO
Values accepted
NCCL_PROTO
Values accepted
NCCL_IGNORE_CPU_AFFINITY
Values accepted
NCCL_DEBUG_FILE
Values accepted
NCCL_DEBUG_SUBSYS
Values accepted
NCCL_COLLNET_ENABLE
Value accepted
NCCL_COLLNET_NODE_THRESHOLD
Value accepted
NCCL_TOPO_FILE
Value accepted
NCCL_TOPO_DUMP_FILE
Value accepted
NCCL_NVB_DISABLE
Value accepted
NCCL_PXN_DISABLE
Value accepted
NCCL_P2P_PXN_LEVEL
Value accepted
NCCL_GRAPH_REGISTER
Value accepted
NCCL_LOCAL_REGISTER
Value accepted
NCCL_SET_STACK_SIZE
Value accepted
NCCL_SET_THREAD_NAME
Value accepted
NCCL_GRAPH_MIXING_SUPPORT
Value accepted
NCCL_DMABUF_ENABLE
Value accepted
NCCL_P2P_NET_CHUNKSIZE
Values accepted
NCCL_P2P_LL_THRESHOLD
Values accepted
NCCL_ALLOC_P2P_NET_LL_BUFFERS
Values accepted
NCCL_COMM_BLOCKING
Values accepted
NCCL_CGA_CLUSTER_SIZE
Values accepted
NCCL_MAX_CTAS
Values accepted
NCCL_MIN_CTAS
Values accepted
NCCL_NVLS_ENABLE
Values accepted
NCCL_IB_MERGE_NICS
Values accepted
Troubleshooting
Errors
GPU Direct
GPU-to-GPU communication
GPU-to-NIC communication
PCI Access Control Services (ACS)
Topology detection
Shared memory
Docker
Systemd
Networking issues
IP Network Interfaces
IP Ports
InfiniBand
Index
B | C | M | N | S
B
blocking (C macro)
C
cgaClusterSize (C macro)
M
maxCTAs (C macro)
minCTAs (C macro)
N
NCCL_CONFIG_INITIALIZER (C macro)
ncclAllGather (C function)
ncclAllReduce (C function)
ncclAvg (C macro)
ncclBcast (C function)
ncclBfloat16 (C macro)
ncclBroadcast (C function)
ncclChar (C macro)
ncclComm_t (C type)
ncclCommAbort (C function)
ncclCommCount (C function)
ncclCommCuDevice (C function)
ncclCommDeregister (C function)
ncclCommDestroy (C function)
ncclCommFinalize (C function)
ncclCommGetAsyncError (C function)
ncclCommInitAll (C function)
ncclCommInitRank (C function)
ncclCommInitRankConfig (C function)
ncclCommRegister (C function)
ncclCommSplit (C function)
ncclCommUserRank (C function)
ncclConfig_t (C type)
ncclDataType_t (C type)
ncclDouble (C macro)
ncclFloat (C macro)
ncclFloat16 (C macro)
ncclFloat32 (C macro)
ncclFloat64 (C macro)
ncclGetErrorString (C function)
ncclGetLastError (C function)
ncclGetUniqueId (C function)
ncclGetVersion (C function)
ncclGroupEnd (C function)
ncclGroupStart (C function)
ncclHalf (C macro)
ncclInProgress (C macro)
ncclInt (C macro)
ncclInt32 (C macro)
ncclInt64 (C macro)
ncclInt8 (C macro)
ncclInternalError (C macro)
ncclInvalidArgument (C macro)
ncclInvalidUsage (C macro)
ncclMax (C macro)
ncclMemAlloc (C function)
ncclMemFree (C function)
ncclMin (C macro)
ncclProd (C macro)
ncclRecv (C function)
ncclRedOp_t (C type)
ncclRedOpCreatePreMulSum (C function)
ncclRedOpDestroy (C function)
ncclReduce (C function)
ncclReduceScatter (C function)
ncclRemoteError (C macro)
ncclResult_t (C type)
ncclScalarDevice (C macro)
ncclScalarHostImmediate (C macro)
ncclScalarResidence_t (C type)
ncclSend (C function)
ncclSuccess (C macro)
ncclSum (C macro)
ncclSystemError (C macro)
ncclUint32 (C macro)
ncclUint64 (C macro)
ncclUint8 (C macro)
ncclUnhandledCudaError (C macro)
netName (C macro)
S
splitShare (C macro)