Environment Variables

Standard options

NVSHMEM_VERSION
Type: bool
Default: false

Print library version at startup.

NVSHMEM_INFO
Type: bool
Default: false

Print environment variable options at startup.

NVSHMEM_SYMMETRIC_SIZE
Type: size
Default: 1073741824

Specifies the size (in bytes) of the symmetric heap memory per PE. The resulting size is implementation-defined and must be at least as large as the integer ceiling of the product of the numeric prefix and the scaling factor. The allowed character suffixes for the scaling factor are as follows:

  • k or K multiplies by 2^10 (kibibytes)
  • m or M multiplies by 2^20 (mebibytes)
  • g or G multiplies by 2^30 (gibibytes)
  • t or T multiplies by 2^40 (tebibytes)

For example, string ‘20m’ is equivalent to the integer value 20971520, or 20 mebibytes. Similarly the string ‘3.1M’ is equivalent to the integer value 3250586. Only one multiplier is recognized and any characters following the multiplier are ignored, so ‘20kk’ will not produce the same result as ‘20m’. Usage of string ‘.5m’ will yield the same result as the string ‘0.5m’. An invalid value for NVSHMEM_SYMMETRIC_SIZE is an error, which the NVSHMEM library shall report by either returning a nonzero value from nvshmem_init_thread or causing program termination.
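The suffix rules above can be sketched in a few lines of Python. This is only an illustration of the documented arithmetic, not the library's parser; the function name parse_symmetric_size is hypothetical, and rejecting trailing characters when no multiplier is present is an assumption.

```python
import math
import re
from decimal import Decimal

# Scaling factors for the documented suffixes (case-insensitive).
SCALE = {"k": 2**10, "m": 2**20, "g": 2**30, "t": 2**40}

# Numeric prefix, then at most one multiplier; later characters are ignored.
SIZE_PATTERN = re.compile(r"(\d+\.?\d*|\.\d+)([kKmMgGtT]?)")


def parse_symmetric_size(text: str) -> int:
    """Return the minimum heap size in bytes implied by a size string."""
    text = text.strip()
    match = SIZE_PATTERN.match(text)
    if match is None:
        raise ValueError(f"invalid size: {text!r}")
    number, suffix = match.groups()
    # Assumption: without a multiplier, trailing characters are an error.
    if not suffix and match.end() != len(text):
        raise ValueError(f"invalid size: {text!r}")
    scale = SCALE[suffix.lower()] if suffix else 1
    # Decimal avoids binary floating-point error before taking the ceiling.
    return math.ceil(Decimal(number) * scale)
```

With this sketch, parse_symmetric_size("20kk") stops at the first k and yields 20480, which is why it differs from the result for "20m".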

NVSHMEM_DEBUG
Type: string
Default: “”

Set to enable debugging messages. Optional values: VERSION, WARN, INFO, ABORT, TRACE

Bootstrap options

NVSHMEM_BOOTSTRAP
Type: string
Default: “PMI”

Name of the default bootstrap that should be used to initialize NVSHMEM. Allowed values: PMI, MPI, SHMEM, plugin

NVSHMEM_BOOTSTRAP_PMI
Type: string
Default: “PMI”

Name of the PMI bootstrap that should be used to initialize NVSHMEM. Allowed values: PMI, PMI-2, PMIX

NVSHMEM_BOOTSTRAP_PLUGIN
Type: string
Default: “”

Name of the bootstrap plugin file to load when NVSHMEM_BOOTSTRAP=plugin is specified.

NVSHMEM_BOOTSTRAP_MPI_PLUGIN
Type: string
Default: “nvshmem_bootstrap_mpi.so”

Name of the MPI bootstrap plugin file.

NVSHMEM_BOOTSTRAP_SHMEM_PLUGIN
Type: string
Default: “nvshmem_bootstrap_shmem.so”

Name of the SHMEM bootstrap plugin file.

NVSHMEM_BOOTSTRAP_PMI_PLUGIN
Type: string
Default: “nvshmem_bootstrap_pmi.so”

Name of the PMI bootstrap plugin file.

NVSHMEM_BOOTSTRAP_PMI2_PLUGIN
Type: string
Default: “nvshmem_bootstrap_pmi2.so”

Name of the PMI-2 bootstrap plugin file.

NVSHMEM_BOOTSTRAP_PMIX_PLUGIN
Type: string
Default: “nvshmem_bootstrap_pmix.so”

Name of the PMIx bootstrap plugin file.

NVSHMEM_BOOTSTRAP_UID_PLUGIN
Type: string
Default: “nvshmem_bootstrap_uid.so”

Name of the UID bootstrap plugin file.

NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME
Type: string
Default: “”

Set to a list of prefixes to filter the interfaces used by NVSHMEM. With the ^ symbol, NVSHMEM excludes interfaces starting with any prefix in the list. To match (or exclude) an exact interface name instead of a prefix, prefix the string with the = character.

Examples:

  • eth : Use all interfaces starting with eth, e.g. eth0, eth1, …
  • =eth0 : Use only interface eth0.
  • ^docker : Do not use any interface starting with docker.
  • ^=docker0 : Do not use interface docker0.

Note: By default, the loopback interface (lo) and docker interfaces (docker*) are not selected unless no other interfaces are available. To use lo or docker* in preference to other interfaces, select them explicitly with NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME. The default algorithm also favors interfaces starting with ib over others. Setting NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME bypasses the automatic interface selection algorithm and may use all interfaces matching the manual selection.
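The matching rules for NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME can be illustrated with a short Python sketch. It models only the prefix, ^ and = rules from the examples; the function name filter_interfaces is hypothetical, and the use of a comma as the list separator is an assumption, since the text does not name the separator.

```python
def filter_interfaces(spec: str, interfaces: list[str]) -> list[str]:
    """Apply a prefix/exact/exclude filter to a list of interface names."""
    exclude = spec.startswith("^")   # ^ inverts the selection
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")     # = switches from prefix to exact match
    if exact:
        spec = spec[1:]
    # Assumption: list entries are separated by commas.
    patterns = [p for p in spec.split(",") if p]

    def matches(name: str) -> bool:
        if exact:
            return name in patterns
        return any(name.startswith(p) for p in patterns)

    return [name for name in interfaces if matches(name) != exclude]
```

For instance, given the interfaces lo, eth0, eth1, docker0 and ib0, the value ^docker keeps everything except docker0, while =eth0 keeps only eth0.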

NVSHMEM_BOOTSTRAP_UID_SOCK_FAMILY
Type: string
Default: “AF_INET”

Name of the socket family that interface belongs to. Allowed values: AF_INET6, AF_INET.

NVSHMEM_BOOTSTRAP_UID_SESSION_ID
Type: string
Default: “”

Name of the UID session identifier, as specified by a combination of <ipv4>:<TCP port> or [<ipv6>]:<TCP port> or <hostname>:<TCP port>.
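As a rough illustration of the three accepted forms, the following Python sketch splits a session identifier into host and port. The function name parse_session_id is hypothetical, and the port range check is an assumption based on TCP port numbering rather than anything stated here.

```python
def parse_session_id(session_id: str) -> tuple[str, int]:
    """Split '<host>:<port>' or '[<ipv6>]:<port>' into (host, port)."""
    if session_id.startswith("["):
        # IPv6 literal: the address is bracketed so its colons are unambiguous.
        host, sep, port = session_id[1:].partition("]:")
    else:
        # IPv4 address or hostname: the port follows the last colon.
        host, sep, port = session_id.rpartition(":")
    if not (sep and host and port.isdecimal()):
        raise ValueError(f"invalid session id: {session_id!r}")
    port_number = int(port)
    # Assumption: valid TCP ports are 1 through 65535.
    if not 0 < port_number < 65536:
        raise ValueError(f"invalid TCP port: {port_number}")
    return host, port_number
```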

NVSHMEM_BOOTSTRAP_SHMEM_MODE
Type: string
Default: “auto”

Select the OpenSHMEM bootstrap collective path. Allowed values: auto, legacy, teams.

Additional options

NVSHMEM_DEBUG_FILE
Type: string
Default: “”

Debugging output filename. May contain %h for the hostname and %p for the PID.
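The placeholder expansion can be pictured with the Python sketch below. It only mimics the documented %h and %p substitutions; the function name expand_debug_filename is hypothetical and the library's own expansion may differ in details.

```python
import os
import socket


def expand_debug_filename(template, hostname=None, pid=None):
    """Replace %h with the hostname and %p with the process ID."""
    hostname = socket.gethostname() if hostname is None else hostname
    pid = os.getpid() if pid is None else pid
    return template.replace("%h", hostname).replace("%p", str(pid))
```

A template such as nvshmem.%h.%p.log then gives each process on each host its own log file.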

NVSHMEM_DEBUG_ATTACH_DELAY
Type: int
Default: 0

Delay, in seconds, during the first NVSHMEM initialization call to allow for attaching a debugger.

NVSHMEM_MAX_TEAMS
Type: long
Default: 128

Maximum number of simultaneous teams allowed. This limit includes both user-visible teams and internal teams created by NVSHMEM. For multi-CTA collectives, NVSHMEM creates internal teams corresponding to reserved and user-created teams. With NVLS enabled, each newly created team can require up to 48 internal teams. Set NVSHMEM_MAX_TEAMS large enough to accommodate these internal teams; NVSHMEM reports an error at runtime if the limit is insufficient.

NVSHMEM_MAX_MEMORY_PER_GPU
Type: size
Default: 137438953472

Maximum memory per GPU.

NVSHMEM_DISABLE_CUDA_VMM
Type: bool
Default: false

Disable use of CUDA VMM for P2P memory mapping. By default, CUDA VMM is enabled on x86 and disabled on P9. The CUDA VMM feature in NVSHMEM requires the CUDA runtime and CUDA driver versions to be 11.3 or later.

NVSHMEM_DISABLE_P2P
Type: bool
Default: false

Disable P2P connectivity of GPUs even when available.

NVSHMEM_DISABLE_MNNVL
Type: bool
Default: false

Disable MNNVL connectivity of GPUs even when available.

NVSHMEM_DISABLE_NVLS
Type: bool
Default: false

Disable NVLINK SHARP collectives for P2P connected GPUs over NVSwitch even when available.

NVSHMEM_ENABLE_LOGICAL_ENDPOINT
Type: bool
Default: false

Enable logical endpoint support.

NVSHMEM_CUMEM_GRANULARITY
Type: size
Default: 536870912

Granularity for cuMemAlloc/cuMemCreate.

NVSHMEM_CUDA_LIMIT_STACK_SIZE
Type: size
Default: 0

Specify limit on stack size of each GPU thread on P9.

NVSHMEM_CUDA_PATH
Type: string
Default: “”

Path to directory containing libcuda.so for use when not in the default location.

NVSHMEM_PROXY_REQUEST_BATCH_MAX
Type: int
Default: 32

Maximum number of requests that the proxy thread processes in a single iteration of the progress loop.

NVSHMEM_G_BUF_SIZE
Type: int
Default: 4194304

Size of the g_buf used to perform shmem_g operations in parallel. Must be a multiple of 16 bytes.

NVSHMEM_G_COALESCING_BUF_SIZE
Type: int
Default: 67108864

Size of the buffer used for coalescing shmem_g operations. Must be a multiple of 256 bytes. NVSHMEM requires its value to be NVSHMEM_G_BUF_SIZE * 16.
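The constraints on the two g buffers can be checked before launch with a small Python sketch like the one below. The function name check_g_buffers is hypothetical; the checks simply restate the documented rules, and the defaults are the documented default values.

```python
def check_g_buffers(g_buf_size=4194304, g_coalescing_buf_size=67108864):
    """Return a list of violated constraints (empty when the pair is valid)."""
    problems = []
    if g_buf_size % 16 != 0:
        problems.append("NVSHMEM_G_BUF_SIZE must be a multiple of 16 bytes")
    if g_coalescing_buf_size % 256 != 0:
        problems.append("NVSHMEM_G_COALESCING_BUF_SIZE must be a multiple of 256 bytes")
    if g_coalescing_buf_size != g_buf_size * 16:
        problems.append("NVSHMEM_G_COALESCING_BUF_SIZE must equal NVSHMEM_G_BUF_SIZE * 16")
    return problems
```

The defaults satisfy all three rules, since 4194304 * 16 = 67108864.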

NVSHMEM_MAX_PEER_STREAMS
Type: int
Default: 16

Maximum number of CUDA streams per node.

NVSHMEM_CPU_AFFINITY
Type: string
Default: “AUTO”

Controls NUMA-aware CPU affinity pinning during initialization. Allowed values: AUTO enables automatic NUMA-local pinning; OFF disables pinning. AUTO only narrows current affinity and does not overwrite existing settings.

Collectives options

NVSHMEM_DISABLE_NCCL
Type: bool
Default: false

Disable use of NCCL for collective operations.

NVSHMEM_BARRIER_DISSEM_KVAL
Type: int
Default: 2

Radix of the dissemination algorithm used for barriers.

NVSHMEM_BARRIER_TG_DISSEM_KVAL
Type: int
Default: 2

Radix of the dissemination algorithm used for thread group barriers.
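To show what the radix changes, here is a Python sketch of the textbook k-ary dissemination pattern. It is an assumption that NVSHMEM follows this exact schedule; the function names are hypothetical and the code only models which peers a PE would signal in each round.

```python
def dissemination_rounds(num_pes: int, radix: int = 2) -> int:
    """Number of rounds: the smallest r with radix**r >= num_pes."""
    if num_pes < 1 or radix < 2:
        raise ValueError("need num_pes >= 1 and radix >= 2")
    rounds, reach = 0, 1
    while reach < num_pes:
        reach *= radix
        rounds += 1
    return rounds


def dissemination_peers(pe: int, num_pes: int, radix: int = 2):
    """Peers signaled by `pe` in each round of a radix-k dissemination."""
    schedule = []
    distance = 1
    while distance < num_pes:
        # In each round a PE signals up to radix - 1 peers at multiples of distance.
        peers = [(pe + j * distance) % num_pes
                 for j in range(1, radix) if j * distance < num_pes]
        schedule.append(peers)
        distance *= radix
    return schedule
```

In this model a larger radix trades fewer rounds for more signals per round: 8 PEs need 3 rounds at radix 2 but only 2 rounds at radix 4.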

NVSHMEM_FCOLLECT_LL_THRESHOLD
Type: size
Default: 2048

Message size threshold up to which the fcollect LL algorithm is used.

NVSHMEM_BCAST_ALGO
Type: int
Default: 0

Broadcast algorithm to be used.

  • 0 - use default algorithm selection strategy

NVSHMEM_REDMAXLOC_ALGO
Type: int
Default: 1

Reduction algorithm to be used.

  • 1 - default, flag alltoall algorithm
  • 2 - flat reduce + flat bcast
  • 3 - topo-aware two-level reduce + topo-aware bcast

NVSHMEM_REDUCE_SCRATCH_SIZE
Type: size_t
Default: 524288

Amount of symmetric heap memory (minimum 16B, multiple of 8B) reserved by runtime for every team to implement reduce and reducescatter collectives.

Transport options

NVSHMEM_REMOTE_TRANSPORT
Type: string
Default: “ibrc”

Selected transport for remote operations: ibrc, ucx, libfabric, ibdevx, gpunetio, none.

NVSHMEM_DISABLE_IB_NATIVE_ATOMICS
Type: bool
Default: false

Disable use of InfiniBand native atomics.

NVSHMEM_DISABLE_GDRCOPY
Type: bool
Default: false

Disable use of GDRCopy in IB RC Transport.

NVSHMEM_ENABLE_NIC_PE_MAPPING
Type: bool
Default: false

When not set or set to 0, a PE is assigned to the NIC on the node that is closest to it by distance. When set to 1, NVSHMEM either assigns NICs to PEs on a round-robin basis or uses NVSHMEM_HCA_PE_MAPPING or NVSHMEM_HCA_LIST when they are specified.

NVSHMEM_NETDEVS_POLICY
Type: string
Default: “AUTO”

Policy for automatic NIC assignment when NVSHMEM_ENABLE_NIC_PE_MAPPING is 0. AUTO preserves the default NVSHMEM behavior and balances over local NVSHMEM PEs. EXTERNAL_SHARING_PCIE_SWITCH_NIC_EXCLUSIVE balances over all node-local GPUs to avoid external NVSHMEM instances sharing PCIe-switch-local NICs when topology permits; NICs may still be shared when there are fewer NICs than GPUs.

NVSHMEM_TRANSPORT_BATCH_MAX_OPS
Type: int
Default: 16

Maximum number of consecutive proxy requests to mark with a transport batching hint before ending the current transport-level batch. Applies only to transports and operation types that implement batching hints.

NVSHMEM_IB_GID_INDEX
Type: int
Default: -1

Source GID index for ROCE. By default, NVSHMEM dynamically discovers the GID supported by the NIC.

NVSHMEM_IB_TRAFFIC_CLASS
Type: int
Default: 0

Traffic class for ROCE.

NVSHMEM_IB_SL
Type: int
Default: 0

Service level to use over IB/ROCE.

NVSHMEM_IB_ADDR_FAMILY
Type: string
Default: “AF_INET”

IP address family associated with the GID that NVSHMEM selects dynamically when NVSHMEM_IB_GID_INDEX is left unset.

NVSHMEM_IB_ADDR_RANGE
Type: string
Default: “::/0”

Defines the range of valid GIDs dynamically selected by NVSHMEM when NVSHMEM_IB_GID_INDEX is left unset.

NVSHMEM_IB_ROCE_VERSION_NUM
Type: int
Default: 2

ROCE version associated with the IB GID that NVSHMEM selects dynamically when NVSHMEM_IB_GID_INDEX is left unset.

NVSHMEM_IB_TIMEOUT
Type: int
Default: 20

QP acknowledgement timeout for IB transports. Valid range: 0-31.
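The timeout value is an exponent rather than a duration. Assuming NVSHMEM passes it through unchanged to the verbs QP timeout attribute, the verbs documentation defines the wait as 4.096 microseconds times 2 to the power of the value, with 0 meaning an unbounded wait. The Python sketch below does that conversion; the function name ack_timeout_seconds is hypothetical.

```python
def ack_timeout_seconds(exponent: int = 20):
    """Convert a QP timeout exponent (0-31) to seconds; None means no limit."""
    if not 0 <= exponent <= 31:
        raise ValueError("timeout must be in the range 0-31")
    if exponent == 0:
        # In the verbs interface, 0 means wait indefinitely for the ACK.
        return None
    return 4.096e-6 * 2**exponent
```

Under that assumption the default of 20 corresponds to roughly 4.3 seconds per attempt, and each increment doubles the wait.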

NVSHMEM_IB_RETRY_CNT
Type: int
Default: 7

QP retry count for IB transports. Valid range: 0-7.

NVSHMEM_IB_PKEY_INDEX
Type: int
Default: 0

Partition key (pkey) index to use for InfiniBand transport queue pairs. The default is 0, the default partition.

NVSHMEM_IB_ENABLE_RELAXED_ORDERING
Type: bool
Default: true

Enable PCIe relaxed ordering on transports over IB/ROCE, such as IBRC, IBGDA, and IBDEVX.

NVSHMEM_IB_NUM_RC_PER_DEVICE
Type: int
Default: 1

Number of RC QPs to create per device in the IB proxy-based transports. A device is each enumerated IB device, either a full HCA or a single port of a multi-port HCA.

NVSHMEM_HCA_PREFIX
Type: string
Default: “mlx5”

Prefix of HCA interface names. Examples: mlx5, ibp.

NVSHMEM_HCA_LIST
Type: string
Default: “”

Comma-separated list of HCAs to use in the NVSHMEM application. Entries are of the form hca_name:port (for example, mlx5_1:1,mlx5_2:2); entries prefixed by ^ are excluded. NVSHMEM_ENABLE_NIC_PE_MAPPING must be set to 1 for this variable to be effective.
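The entry format for NVSHMEM_HCA_LIST can be illustrated with the Python sketch below, which separates included entries from excluded ones. The function name parse_hca_list is hypothetical, and accepting an entry without a :port part is an assumption made only so the sketch does not fail on such input.

```python
def parse_hca_list(value: str):
    """Split an HCA list into (included, excluded) lists of (name, port)."""
    included, excluded = [], []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        target = included
        if entry.startswith("^"):
            # A leading ^ marks the entry as excluded.
            target = excluded
            entry = entry[1:]
        name, sep, port = entry.partition(":")
        # Assumption: an entry without ':port' is kept with port set to None.
        target.append((name, int(port) if sep else None))
    return included, excluded
```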

NVSHMEM_HCA_PE_MAPPING
Type: string
Default: “”

Specifies the mapping of HCAs to PEs as a comma-separated list. Each entry in the list is of the form hca_name:port:count. For example, mlx5_0:1:2,mlx5_0:2:2 indicates that PE0 and PE1 are mapped to port 1 of mlx5_0, and PE2 and PE3 are mapped to port 2 of mlx5_0. NVSHMEM_ENABLE_NIC_PE_MAPPING must be set to 1 for this variable to be effective.
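The hca_name:port:count form assigns consecutive PEs to each entry in turn. The Python sketch below expands a mapping string into a per-PE table that reproduces the example; the function name expand_hca_pe_mapping is hypothetical, and numbering PEs from 0 in list order is taken from the example rather than from a formal rule.

```python
def expand_hca_pe_mapping(value: str) -> dict:
    """Expand 'hca:port:count,...' into {pe_index: (hca_name, port)}."""
    mapping = {}
    next_pe = 0
    for entry in value.split(","):
        # Split from the right so only the last two fields are port and count.
        name, port, count = entry.strip().rsplit(":", 2)
        for _ in range(int(count)):
            mapping[next_pe] = (name, int(port))
            next_pe += 1
    return mapping
```

For mlx5_0:1:2,mlx5_0:2:2 this yields PEs 0 and 1 on port 1 of mlx5_0 and PEs 2 and 3 on port 2.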

NVSHMEM_DISABLE_LOCAL_ONLY_PROXY
Type: bool
Default: false

When running on an NVLink-only configuration (no IB, no UCX), completely disable the proxy thread. This also disables device-side global exit and device-side wait timeout polling (enabled by the NVSHMEM_TIMEOUT_DEVICE_POLLING build-time variable), because both are processed by the proxy thread.

NVSHMEM_TMA_POLICY
Type: string
Default: “DISABLE”

Controls TMA usage for device-side point-to-point operations over NVLink. Valid values are DISABLE, ENABLE, and FORCE. DISABLE prevents NVSHMEM from using TMA. ENABLE allows NVSHMEM to use TMA when the GPU architecture, registered CTA shared memory, topology, and transfer shape support it. FORCE requires TMA support to be available during initialization and fails on unsupported devices.

NVSHMEM_LIBFABRIC_PROVIDER
Type: string
Default: “cxi”

Set the feature set provider for the libfabric transport. Allowed values: cxi, efa, verbs.

NVSHMEM_LIBFABRIC_MAX_NIC_PER_PE
Type: int
Default: 16

Set the maximum number of NICs per PE for use in the libfabric provider.

NVSHMEM_LIBFABRIC_PROXY_REQUEST_BATCH_MAX
Type: int
Default: 32

Maximum number of requests that the libfabric transport processes per queue in a single iteration of the progress loop.

NVSHMEM_LIBFABRIC_DISABLE_BATCH_RMA
Type: bool
Default: false

Disable support for batched RMA with FI_MORE.

NVSHMEM_LIBFABRIC_SIGNAL_WAIT_SPIN_COUNT
Type: int
Default: 1024

Number of polling iterations for pending signal-delivery work in the libfabric transport before sleeping. Increasing this can reduce latency for signal-heavy operations at the cost of CPU usage; 0 sleeps immediately. The value must be non-negative.

NVSHMEM_IBGDA_NUM_DCT
Type: int
Default: 2

Number of DCT QPs used in GPU-initiated communication transport.

NVSHMEM_IBGDA_NUM_DCI
Type: int
Default: 1

Total number of DCI QPs used in GPU-initiated communication transport. Set to 0 or a negative number to use automatic configuration.

NVSHMEM_IBGDA_NUM_SHARED_DCI
Type: int
Default: 1

Number of DCI QPs in the shared pool. The remaining DCI QPs (NVSHMEM_IBGDA_NUM_DCI - NVSHMEM_IBGDA_NUM_SHARED_DCI) are exclusively assigned. Valid values: [1, NVSHMEM_IBGDA_NUM_DCI].
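The relationship between the two DCI counts is simple arithmetic, sketched below in Python. The function name split_dci is hypothetical, and the sketch does not model the automatic configuration that applies when NVSHMEM_IBGDA_NUM_DCI is 0 or negative.

```python
def split_dci(num_dci: int = 1, num_shared_dci: int = 1):
    """Return (shared, exclusive) DCI QP counts for explicit settings."""
    if not 1 <= num_shared_dci <= num_dci:
        raise ValueError("NVSHMEM_IBGDA_NUM_SHARED_DCI must be in [1, NVSHMEM_IBGDA_NUM_DCI]")
    # Whatever is not in the shared pool is exclusively assigned.
    return num_shared_dci, num_dci - num_shared_dci
```

With the defaults (1 and 1) there are no exclusive DCI QPs; setting 8 total and 2 shared leaves 6 exclusive.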

NVSHMEM_IBGDA_DCI_MAP_BY
Type: string
Default: “cta”

Specifies how exclusive DCI QPs are assigned. Choices are: cta, sm, warp, dct.

  • cta: round-robin by CTA ID (default).
  • sm: round-robin by SM ID.
  • warp: round-robin by Warp ID.
  • dct: round-robin by DCT ID.

NVSHMEM_IBGDA_NUM_RC_PER_PE
Type: int
Default: 2

Number of RC QPs per peer PE used in GPU-initiated communication transport. Set to 0 to disable RC QPs (default 2). If set to a positive number, DCI will be used for enforcing consistency only.

NVSHMEM_IBGDA_RC_MAP_BY
Type: string
Default: “cta”

Specifies how RC QPs are assigned. Choices are: cta, sm, warp.

  • cta: round-robin by CTA ID (default).
  • sm: round-robin by SM ID.
  • warp: round-robin by Warp ID.

NVSHMEM_IBGDA_FORCE_NIC_BUF_MEMTYPE
Type: string
Default: “gpumem”

Force NIC buffer memory type. Valid choices are: gpumem (default), hostmem. For other values, use auto discovery.

NVSHMEM_IBGDA_NUM_REQUESTS_IN_BATCH
Type: int
Default: 32

Number of requests to be batched before submitting to the NIC. It will be rounded up to the nearest power of 2. Set to 1 for aggressive submission.
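Rounding up to the nearest power of 2 means that a value such as 33 takes effect as 64. A minimal Python sketch of that rounding follows; the function name round_up_to_power_of_two is hypothetical and the code is not taken from the library.

```python
def round_up_to_power_of_two(value: int) -> int:
    """Smallest power of 2 that is greater than or equal to value."""
    if value < 1:
        raise ValueError("value must be positive")
    # (value - 1).bit_length() is the exponent of the next power of 2.
    return 1 << (value - 1).bit_length()
```

Values that are already powers of 2, including the default of 32 and the aggressive setting of 1, are left unchanged.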

NVSHMEM_IBGDA_NUM_FETCH_SLOTS_PER_DCI
Type: int
Default: 1024

Number of internal buffer slots for fetch operations for each DCI QP. It will be rounded up to the nearest power of 2.

NVSHMEM_IBGDA_NUM_FETCH_SLOTS_PER_RC
Type: int
Default: 1024

Number of internal buffer slots for fetch operations for each RC QP. It will be rounded up to the nearest power of 2.

NVSHMEM_IB_ENABLE_IBGDA
Type: bool
Default: false

Set to enable GPU-initiated communication transport.

NVSHMEM_IBGDA_NIC_HANDLER
Type: string
Default: “auto”

Selects the processor used for ringing the NIC’s doorbell. Choices are auto, gpu, cpu, cpu_cuda_memory.

  • auto: Use GPU SMs and fall back to CPU if it is not supported (default).
  • gpu: Use GPU SMs.
  • cpu: Use CPU proxy thread.
  • cpu_cuda_memory: Use CPU with CUDA memory.

NVSHMEM_IB_DISABLE_DMABUF
Type: bool
Default: false

Set to disable DMAbuf in any IB based remote transport.

NVSHMEM_DISABLE_DATA_DIRECT
Type: bool
Default: false

Disable use of DirectNIC in the IB transport.

NVSHMEM_IBGDA_ENABLE_MULTI_PORT
Type: bool
Default: false

Set to enable multiple NICs per PE if available.

NVSHMEM_GPUNETIO_ENABLE_GDAKI
Type: bool
Default: false

Set to enable GPU-initiated communication transport via GPUNetIO. When set to 1, NVSHMEM_REMOTE_TRANSPORT must be set to gpunetio.

NVSHMEM_GPUNETIO_NIC_HANDLER
Type: string
Default: “auto”

Specifies the processor used for ringing the NIC’s doorbell. Choices are auto, gpu, gpu_sm_bf, cpu.

  • auto: Use GPU SMs and fall back to CPU if it is not supported.
  • gpu: Use GPU SMs, regular doorbell.
  • gpu_sm_bf: Use GPU SMs, BlueFlame doorbell.
  • cpu: Use CPU.

NVSHMEM_GPUNETIO_NUM_RC_PER_PE_GPU
Type: int
Default: 2

Number of GPU-data-path RC QPs per peer PE in the GPUNetIO transport. This only takes effect when NVSHMEM_GPUNETIO_ENABLE_GDAKI is set to 1. Otherwise, the GPU data path is disabled and this value is ignored.

NVSHMEM_GPUNETIO_NUM_RC_PER_PE_CPU
Type: int
Default: 2

Number of CPU-data-path RC QPs per peer PE in the GPUNetIO transport. The CPU data path is always active and does not require GDAKI. This value must be greater than 0.

NVSHMEM_GPUNETIO_NUM_REQUESTS_IN_BATCH
Type: int
Default: 32

Number of requests to be batched before submitting to the NIC when using GDAKI. It will be rounded up to the nearest power of 2. Set to 1 for aggressive submission. Only takes effect when NVSHMEM_GPUNETIO_ENABLE_GDAKI=1. This value must be positive and must not be larger than the QP depth.

NVSHMEM_GPUNETIO_NUM_FETCH_SLOTS_PER_RC
Type: int
Default: 1024

Number of internal buffer slots for fetch operations for each RC QP when using GDAKI. It will be rounded up to the nearest power of 2. Only takes effect when NVSHMEM_GPUNETIO_ENABLE_GDAKI=1.

NVSHMEM_GPUNETIO_ENABLE_ORDERING_SEMANTIC
Type: bool
Default: false

Set to enable ordering semantic for DDP (Direct Data Placement) mode for GPUNetIO QPs. This depends on the DOCA SDK provided via DOCA_SDK_LIB_PATH to GPUNetIO.

NVTX options

NVSHMEM_NVTX
Type: string
Default: “off”

Set to enable NVTX instrumentation. Accepts a comma-separated list of instrumentation groups. By default, NVTX instrumentation is disabled.

init                : library setup
alloc               : memory management
launch              : kernel launch routines
coll                : collective communications
wait                : blocking point-to-point synchronization
wait_on_stream      : point-to-point synchronization (on stream)
test                : non-blocking point-to-point synchronization
memorder            : memory ordering (quiet, fence)
quiet_on_stream     : nvshmemx_quiet_on_stream
atomic_fetch        : fetching atomic memory operations
atomic_set          : non-fetching atomic memory operations
rma_blocking        : blocking remote memory access operations
rma_nonblocking     : non-blocking remote memory access operations
proxy               : activity of the proxy thread
common              : init,alloc,launch,coll,memorder,wait,atomic_fetch,rma_blocking,proxy
all                 : all groups
off                 : disable all NVTX instrumentation
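The group list above can be expanded mechanically. The Python sketch below resolves a comma-separated NVSHMEM_NVTX value into individual groups, using the documented contents of common and all. The function name expand_nvtx_groups is hypothetical; treating off as contributing no groups and rejecting unknown names are assumptions, since the behavior for mixed or invalid values is not specified here.

```python
# Individual instrumentation groups, in the order they are documented.
GROUPS = [
    "init", "alloc", "launch", "coll", "wait", "wait_on_stream", "test",
    "memorder", "quiet_on_stream", "atomic_fetch", "atomic_set",
    "rma_blocking", "rma_nonblocking", "proxy",
]

# Members of the documented 'common' shorthand.
COMMON = {
    "init", "alloc", "launch", "coll", "memorder", "wait",
    "atomic_fetch", "rma_blocking", "proxy",
}


def expand_nvtx_groups(value: str) -> list:
    """Resolve a comma-separated group list into individual group names."""
    enabled = set()
    for token in value.split(","):
        token = token.strip()
        if token in ("", "off"):
            continue  # Assumption: 'off' simply adds no groups.
        if token == "all":
            enabled.update(GROUPS)
        elif token == "common":
            enabled.update(COMMON)
        elif token in GROUPS:
            enabled.add(token)
        else:
            # Assumption: unknown group names are treated as errors.
            raise ValueError(f"unknown NVTX group: {token!r}")
    return [group for group in GROUPS if group in enabled]
```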