Environment Variables¶
Standard options¶
-
NVSHMEM_VERSION
¶
Print library version at startup
-
NVSHMEM_INFO
¶
Print environment variable options at startup
-
NVSHMEM_SYMMETRIC_SIZE
¶
Specifies the size (in bytes) of the symmetric heap memory per PE. The resulting size is implementation-defined and must be at least as large as the integer ceiling of the product of the numeric prefix and the scaling factor. The allowed character suffixes for the scaling factor are as follows:
- k or K multiplies by 2^10 (kibibytes)
- m or M multiplies by 2^20 (mebibytes)
- g or G multiplies by 2^30 (gibibytes)
- t or T multiplies by 2^40 (tebibytes)
For example, the string ‘20m’ is equivalent to the integer value 20971520, or 20
mebibytes. Similarly, the string ‘3.1M’ is equivalent to the integer value
3250586, the integer ceiling of 3.1 × 2^20 = 3250585.6. Only one multiplier is
recognized and any characters following the multiplier are ignored, so ‘20kk’
will not produce the same result as ‘20m’.
The string ‘.5m’ yields the same result as the string ‘0.5m’.
An invalid value for NVSHMEM_SYMMETRIC_SIZE is an error, which the NVSHMEM
library shall report by either returning a nonzero value from
nvshmem_init_thread or causing program termination.
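The size rules above can be modeled with a short Python sketch. This is only an illustration of the documented arithmetic, not the parser NVSHMEM uses; the helper name parse_symmetric_size is hypothetical, and rejecting trailing characters when no multiplier is present is an assumption.

```python
import math
import re
from decimal import Decimal

# Binary scaling factors for the documented suffixes.
_SCALE = {"k": 2**10, "m": 2**20, "g": 2**30, "t": 2**40}

# Numeric prefix (e.g. "20", "3.1", ".5") followed by an optional multiplier.
_SIZE_RE = re.compile(r"(\d+(?:\.\d*)?|\.\d+)([kKmMgGtT])?")


def parse_symmetric_size(text):
    """Hypothetical helper: minimum heap size in bytes implied by `text`."""
    text = text.strip()
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"invalid size string: {text!r}")
    number, suffix = match.groups()
    if suffix is None and match.end() != len(text):
        # Assumption: trailing characters are tolerated only after a multiplier.
        raise ValueError(f"invalid size string: {text!r}")
    factor = _SCALE[suffix.lower()] if suffix else 1
    # Decimal keeps the product exact before taking the integer ceiling.
    return math.ceil(Decimal(number) * factor)
```

With this model, ‘20kk’ is read as 20 × 2^10 because the second ‘k’ is ignored, which is why it differs from ‘20m’.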
-
NVSHMEM_DEBUG
¶
Set to enable debugging messages. Optional values: VERSION, WARN, INFO, ABORT, TRACE
Bootstrap options¶
-
NVSHMEM_BOOTSTRAP
¶
Name of the default bootstrap that should be used to initialize NVSHMEM. Allowed values: PMI, MPI, SHMEM, plugin
-
NVSHMEM_BOOTSTRAP_PMI
¶
Name of the PMI bootstrap that should be used to initialize NVSHMEM. Allowed values: PMI, PMI-2, PMIX
-
NVSHMEM_BOOTSTRAP_PLUGIN
¶
Name of the bootstrap plugin file to load
Additional options¶
-
NVSHMEM_DEBUG_FILE
¶
Debugging output filename. May contain %h for the hostname and %p for the process ID.
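As an illustration of the placeholders, the following Python sketch expands %h and %p in a filename pattern. It is a model of the documented behavior only; the helper name expand_debug_filename is hypothetical, and the exact expansion performed by the library may differ.

```python
import os
import socket


def expand_debug_filename(pattern, hostname=None, pid=None):
    """Hypothetical helper: substitute %h (hostname) and %p (pid) in `pattern`."""
    hostname = socket.gethostname() if hostname is None else hostname
    pid = os.getpid() if pid is None else pid
    return pattern.replace("%h", hostname).replace("%p", str(pid))
```

A pattern such as nvshmem.%h.%p.log gives every process its own file, which keeps output from different PEs from interleaving.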
-
NVSHMEM_MAX_TEAMS
¶
Maximum number of simultaneous teams allowed
-
NVSHMEM_MAX_P2P_GPUS
¶
Maximum number of P2P GPUs
-
NVSHMEM_MAX_MEMORY_PER_GPU
¶
Maximum memory per GPU
-
NVSHMEM_DISABLE_CUDA_VMM
¶
Disable use of CUDA VMM for P2P memory mapping. By default, CUDA VMM is enabled on x86 and disabled on P9. The CUDA VMM feature in NVSHMEM requires both the CUDA runtime version and the CUDA driver version to be greater than or equal to 11.3.
-
NVSHMEM_DISABLE_P2P
¶
Disable P2P connectivity of GPUs even when available
-
NVSHMEM_CUMEM_GRANULARITY
¶
Granularity for cuMemAlloc/cuMemCreate
-
NVSHMEM_PROXY_REQUEST_BATCH_MAX
¶
Maximum number of requests that the proxy thread processes in a single iteration of the progress loop.
Collectives options¶
-
NVSHMEM_DISABLE_NCCL
¶
Disable use of NCCL for collective operations
-
NVSHMEM_BARRIER_DISSEM_KVAL
¶
Radix of the dissemination algorithm used for barriers
-
NVSHMEM_BARRIER_TG_DISSEM_KVAL
¶
Radix of the dissemination algorithm used for thread group barriers
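To show what the radix controls, here is a sketch of a generic radix-k dissemination barrier schedule in Python. It follows the textbook form of the algorithm and is not taken from NVSHMEM's implementation; the function names are hypothetical.

```python
def dissemination_rounds(n_pes, radix):
    """Number of rounds needed: the smallest r with radix**r >= n_pes."""
    if n_pes < 1 or radix < 2:
        raise ValueError("need n_pes >= 1 and radix >= 2")
    rounds, reach = 0, 1
    while reach < n_pes:
        reach *= radix
        rounds += 1
    return rounds


def round_targets(pe, n_pes, radix, rnd):
    """PEs that `pe` signals in round `rnd` (0-based) of the schedule."""
    stride = radix ** rnd
    # Offsets that would wrap past the full ring are skipped.
    return [(pe + j * stride) % n_pes
            for j in range(1, radix) if j * stride < n_pes]
```

A larger radix reduces the number of rounds but increases the number of signals each PE sends per round.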
-
NVSHMEM_FCOLLECT_LL_THRESHOLD
¶
Message size threshold up to which the fcollect LL algorithm will be used
-
NVSHMEM_BCAST_LL_THRESHOLD
¶
Message size threshold up to which the broadcast LL algorithm will be used
Transport options¶
-
NVSHMEM_REMOTE_TRANSPORT
¶
Selected transport for remote operations: ibrc, ucx, libfabric, ibdevx, none
-
NVSHMEM_DISABLE_IB_NATIVE_ATOMICS
¶
Disable use of InfiniBand native atomics
-
NVSHMEM_DISABLE_GDRCOPY
¶
Disable use of GDRCopy in IB RC Transport
-
NVSHMEM_ENABLE_NIC_PE_MAPPING
¶
When not set or set to 0, a PE is assigned the NIC on the node that is closest
to it by distance. When set to 1, NVSHMEM either assigns NICs to PEs on a
round-robin basis or uses NVSHMEM_HCA_PE_MAPPING or NVSHMEM_HCA_LIST when they
are specified.
-
NVSHMEM_IB_GID_INDEX
¶
Source GID index for RoCE
-
NVSHMEM_IB_TRAFFIC_CLASS
¶
Traffic class for RoCE
-
NVSHMEM_IB_SL
¶
Service level to use over IB/RoCE
-
NVSHMEM_HCA_LIST
¶
Comma-separated list of HCAs to use in the NVSHMEM application. Entries are of
the form hca_name:port, e.g. mlx5_1:1,mlx5_2:2, and entries prefixed by ^ are
excluded. NVSHMEM_ENABLE_NIC_PE_MAPPING must be set to 1 for this variable to be
effective.
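The entry syntax can be illustrated with a small Python sketch that splits a value into included and excluded (name, port) pairs. The helper name parse_hca_list is hypothetical, and the sketch assumes every entry carries an explicit port; it does not describe how NVSHMEM itself parses the variable.

```python
def parse_hca_list(value):
    """Hypothetical helper: split an HCA list into (included, excluded) entries."""
    included, excluded = [], []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # A leading ^ marks an entry to exclude.
        target = excluded if entry.startswith("^") else included
        name, _, port = entry.lstrip("^").partition(":")
        # Assumption: each entry has the form hca_name:port.
        target.append((name, int(port)))
    return included, excluded
```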
-
NVSHMEM_HCA_PE_MAPPING
¶
Specifies the mapping of HCAs to PEs as a comma-separated list. Each entry in the
list is of the form hca_name:port:count. For example, mlx5_0:1:2,mlx5_0:2:2
indicates that PE0 and PE1 are mapped to port 1 of mlx5_0, and PE2 and PE3 are
mapped to port 2 of mlx5_0. NVSHMEM_ENABLE_NIC_PE_MAPPING must be set to 1 for
this variable to be effective.
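The mapping example can be reproduced with a short Python sketch that expands each hca_name:port:count entry into consecutive PE assignments. The helper name expand_hca_pe_mapping is hypothetical, and assigning PEs in the order the entries appear is an assumption drawn from the example.

```python
def expand_hca_pe_mapping(value):
    """Hypothetical helper: list of (hca_name, port), indexed by PE number."""
    assignment = []
    for entry in value.split(","):
        name, port, count = entry.strip().split(":")
        # Each entry claims the next `count` PEs in order.
        assignment.extend([(name, int(port))] * int(count))
    return assignment
```

For mlx5_0:1:2,mlx5_0:2:2 the result has four elements: indices 0 and 1 map to port 1, and indices 2 and 3 map to port 2.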
-
NVSHMEM_DISABLE_LOCAL_ONLY_PROXY
¶
When running on an NVLink-only configuration (No-IB, No-UCX), completely
disable the proxy thread. This will disable device-side global exit and
device-side wait timeout polling (enabled by the NVSHMEM_TIMEOUT_DEVICE_POLLING
build-time variable) because these are processed by the proxy thread.
-
NVSHMEM_IB_GPUINITIATED_NUM_DCT
¶
Number of DCT QPs used in GPU-initiated communication transport.
-
NVSHMEM_IB_GPUINITIATED_NUM_DCI
¶
Total number of DCI QPs used in GPU-initiated communication transport. Set to 0 or a negative number to use automatic configuration.
-
NVSHMEM_IB_GPUINITIATED_NUM_SHARED_DCI
¶
Number of DCI QPs in the shared pool. The remaining DCI QPs (NVSHMEM_IB_GPUINITIATED_NUM_DCI - NVSHMEM_IB_GPUINITIATED_NUM_SHARED_DCI) are exclusively assigned. Valid range: [1, NVSHMEM_IB_GPUINITIATED_NUM_DCI].
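The relationship between the two DCI counts can be written as a small Python check. This is a sketch of the stated arithmetic only; the helper name split_dci_pool is hypothetical, and it assumes the total has already been resolved to a positive number (automatic configuration is not modeled).

```python
def split_dci_pool(num_dci, num_shared_dci):
    """Hypothetical helper: return (shared, exclusive) DCI QP counts."""
    # The documented valid range for the shared pool is [1, num_dci].
    if not 1 <= num_shared_dci <= num_dci:
        raise ValueError("num_shared_dci must be in [1, num_dci]")
    return num_shared_dci, num_dci - num_shared_dci
```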
-
NVSHMEM_IB_GPUINITIATED_DCI_MAP_BY
¶
Specifies how exclusive DCI QPs are assigned. Choices are: cta, sm, warp, dct.
- cta: round-robin by CTA ID (default)
- sm: round-robin by SM ID
- warp: round-robin by Warp ID
- dct: round-robin by DCT ID
-
NVSHMEM_IB_GPUINITIATED_FORCE_NIC_BUF_MEMTYPE
¶
Force NIC buffer memory type. Valid choices are: gpumem, hostmem. For other values, use auto discovery (default).
-
NVSHMEM_IB_GPUINITIATED_NUM_REQUESTS_IN_BATCH
¶
Number of requests to be batched before submitting to the NIC. It will be rounded up to the nearest power of 2. Set to 1 for aggressive submission.
-
NVSHMEM_IB_GPUINITIATED_NUM_FETCH_SLOTS_PER_DCI
¶
Number of internal buffer slots for fetch operations. It will be rounded up to the nearest power of 2.
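Both NVSHMEM_IB_GPUINITIATED_NUM_REQUESTS_IN_BATCH and NVSHMEM_IB_GPUINITIATED_NUM_FETCH_SLOTS_PER_DCI are rounded up to a power of 2. The sketch below shows one common way to compute that rounding in Python; the helper name round_up_pow2 is hypothetical and the library's own rounding code is not shown in this document.

```python
def round_up_pow2(n):
    """Smallest power of 2 that is greater than or equal to n (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # (n - 1).bit_length() is the exponent of the next power of 2.
    return 1 << (n - 1).bit_length()
```

Under this rounding, a requested value of 5 behaves like 8, while values that are already powers of 2 are unchanged.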
-
NVSHMEM_IB_ENABLE_GPUINITIATED
¶
Set to enable GPU-initiated communication transport.
NVTX options¶
-
NVSHMEM_NVTX
¶
Set to enable NVTX instrumentation. Accepts a comma-separated list of instrumentation groups. By default, NVTX instrumentation is disabled.
init : library setup
alloc : memory management
launch : kernel launch routines
coll : collective communications
wait : blocking point-to-point synchronization
wait_on_stream : point-to-point synchronization (on stream)
test : non-blocking point-to-point synchronization
memorder : memory ordering (quiet, fence)
quiet_on_stream : nvshmemx_quiet_on_stream
atomic_fetch : fetching atomic memory operations
atomic_set : non-fetching atomic memory operations
rma_blocking : blocking remote memory access operations
rma_nonblocking : non-blocking remote memory access operations
proxy : activity of the proxy thread
common : init,alloc,launch,coll,memorder,wait,atomic_fetch,rma_blocking,proxy
all : all groups
off : disable all NVTX instrumentation
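The group names above, including the common and all aliases, can be expanded with a short Python sketch. The helper name expand_nvtx_groups is hypothetical; treating a list as the union of its entries and letting off override everything else are assumptions, not documented behavior.

```python
NVTX_GROUPS = [
    "init", "alloc", "launch", "coll", "wait", "wait_on_stream", "test",
    "memorder", "quiet_on_stream", "atomic_fetch", "atomic_set",
    "rma_blocking", "rma_nonblocking", "proxy",
]

# Members of the "common" alias, as listed in the table above.
NVTX_COMMON = [
    "init", "alloc", "launch", "coll", "memorder", "wait",
    "atomic_fetch", "rma_blocking", "proxy",
]


def expand_nvtx_groups(value):
    """Hypothetical helper: set of groups enabled by an NVSHMEM_NVTX value."""
    enabled = set()
    for token in value.split(","):
        token = token.strip()
        if token == "off":
            # Assumption: "off" disables everything regardless of other entries.
            return set()
        if token == "all":
            enabled.update(NVTX_GROUPS)
        elif token == "common":
            enabled.update(NVTX_COMMON)
        elif token in NVTX_GROUPS:
            enabled.add(token)
        elif token:
            raise ValueError(f"unknown NVTX group: {token!r}")
    return enabled
```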