NCCL 2.30

Python Bindings (NCCL4Py)

  • Communicator
    • Communicator Class
    • Creation and Lifecycle Methods
    • Collective Communication Methods
    • Point-to-Point and Signal Methods
    • Memory Registration Methods
    • Device Communicator Setup
    • Status and Utility Methods
  • Configuration
    • NCCLConfig
    • NCCLDevCommRequirements
    • CTAPolicy
  • Group Operations
    • group
    • group_start
    • group_end
    • GroupSimInfo
  • Memory Management
    • mem_alloc
    • mem_free
  • Communicator Resources
    • CommResource
    • RegisteredBufferHandle
    • RegisteredWindowHandle
    • CustomRedOp
    • DevCommResource
  • Types and Constants
    • Data type
    • Reduction operator
    • Exceptions
  • Versions
    • show_versions
    • get_version
    • VersionInfo
    • LibraryInfo
  • Framework Interop
    • CuPy
    • PyTorch

© Copyright 2020-2026, NVIDIA Corporation.
