Environment Variables

Standard options

NVSHMEM_VERSION
Type: bool
Default: false

Print library version at startup

NVSHMEM_INFO
Type: bool
Default: false

Print environment variable options at startup

NVSHMEM_SYMMETRIC_SIZE
Type: size
Default: 1073741824

Specifies the size (in bytes) of the symmetric heap memory per PE. The resulting size is implementation-defined and must be at least as large as the integer ceiling of the product of the numeric prefix and the scaling factor. The allowed character suffixes for the scaling factor are as follows:

  • k or K multiplies by 2^10 (kibibytes)
  • m or M multiplies by 2^20 (mebibytes)
  • g or G multiplies by 2^30 (gibibytes)
  • t or T multiplies by 2^40 (tebibytes)

For example, the string ‘20m’ is equivalent to the integer value 20971520, or 20 mebibytes. Similarly, the string ‘3.1M’ is equivalent to the integer value 3250586. Only one multiplier is recognized and any characters following the multiplier are ignored, so ‘20kk’ will not produce the same result as ‘20m’. The string ‘.5m’ yields the same result as the string ‘0.5m’. An invalid value for NVSHMEM_SYMMETRIC_SIZE is an error, which the NVSHMEM library shall report by either returning a nonzero value from nvshmem_init_thread or causing program termination.
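The parsing rules above can be sketched in a few lines of Python. This is an illustrative model only, not the library's parser: the function name parse_size is hypothetical, and rejecting trailing characters when no multiplier is present is an assumption (the text only says that characters after a multiplier are ignored).

```python
import re
from decimal import Decimal, ROUND_CEILING

# Binary scaling factors for the recognized suffixes (case-insensitive).
_SCALE = {"k": 2**10, "m": 2**20, "g": 2**30, "t": 2**40}

# Numeric prefix, at most one multiplier, then any trailing characters.
_SIZE_RE = re.compile(r"\s*(\d+(?:\.\d*)?|\.\d+)\s*([kKmMgGtT]?)(.*)", re.DOTALL)


def parse_size(text):
    """Return the minimum size in bytes implied by a size string (hypothetical helper)."""
    match = _SIZE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"invalid size: {text!r}")
    number, suffix, rest = match.groups()
    # Assumption: trailing characters are tolerated only after a multiplier.
    if not suffix and rest.strip():
        raise ValueError(f"invalid size: {text!r}")
    scaled = Decimal(number) * _SCALE.get(suffix.lower(), 1)
    # Integer ceiling of (numeric prefix x scaling factor).
    return int(scaled.to_integral_value(rounding=ROUND_CEILING))
```

Decimal arithmetic is used instead of floats so that a value such as ‘3.1M’ is rounded up from its exact product rather than from a binary approximation.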

NVSHMEM_DEBUG
Type: string
Default: “”

Set to enable debugging messages. Optional values: VERSION, WARN, INFO, ABORT, TRACE

Bootstrap options

NVSHMEM_BOOTSTRAP
Type: string
Default: “PMI”

Name of the default bootstrap that should be used to initialize NVSHMEM. Allowed values: PMI, MPI, SHMEM, plugin

NVSHMEM_BOOTSTRAP_PMI
Type: string
Default: “PMI”

Name of the PMI bootstrap that should be used to initialize NVSHMEM. Allowed values: PMI, PMI-2, PMIX

NVSHMEM_BOOTSTRAP_PLUGIN
Type: string
Default: “”

Name of the bootstrap plugin file to load

Additional options

NVSHMEM_DEBUG_FILE
Type: string
Default: “”

Debugging output filename. May contain %h for the hostname and %p for the PID.
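As a rough illustration of the placeholder substitution, the sketch below expands %h and %p in a filename template. The helper name expand_debug_filename and the example filename are hypothetical, and how the library itself treats other % sequences is not specified here.

```python
import os
import socket


def expand_debug_filename(template, hostname=None, pid=None):
    """Substitute %h and %p in a debug filename template (hypothetical helper)."""
    # Fall back to the current host and process when no values are given.
    hostname = socket.gethostname() if hostname is None else hostname
    pid = os.getpid() if pid is None else pid
    return template.replace("%h", hostname).replace("%p", str(pid))
```

Including both placeholders gives each process its own file, which keeps output from different PEs and nodes from being interleaved.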

NVSHMEM_MAX_TEAMS
Type: long
Default: 20

Maximum number of simultaneous teams allowed

NVSHMEM_MAX_P2P_GPUS
Type: int
Default: 128

Maximum number of P2P GPUs

NVSHMEM_MAX_MEMORY_PER_GPU
Type: size
Default: 137438953472

Maximum memory per GPU

NVSHMEM_DISABLE_CUDA_VMM
Type: bool
Default: false

Disable use of CUDA VMM for P2P memory mapping. By default, CUDA VMM is enabled on x86 and disabled on P9. The CUDA VMM feature in NVSHMEM requires both the CUDA runtime version and the CUDA driver version to be greater than or equal to 11.3.

NVSHMEM_CUMEM_GRANULARITY
Type: size
Default: 536870912

Granularity for cuMemAlloc/cuMemCreate

NVSHMEM_PROXY_REQUEST_BATCH_MAX
Type: int
Default: 32

Maximum number of requests that the proxy thread processes in a single iteration of the progress loop.

Collectives options

NVSHMEM_DISABLE_NCCL
Type: bool
Default: false

Disable use of NCCL for collective operations

NVSHMEM_BARRIER_DISSEM_KVAL
Type: int
Default: 2

Radix of the dissemination algorithm used for barriers
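To show what the radix controls, the sketch below lists the peers a PE would signal in each round of a textbook radix-k dissemination barrier. It assumes the standard pattern (in round r, a PE signals the PEs at offsets j * k^r for j = 1..k-1); NVSHMEM's internal implementation is not described here and may differ. The function name dissemination_peers is hypothetical.

```python
def dissemination_peers(pe, n_pes, radix=2):
    """List, per round, the PEs that `pe` signals in a radix-k dissemination barrier."""
    if radix < 2:
        raise ValueError("radix must be at least 2")
    rounds = []
    stride = 1  # equals radix ** round_index
    while stride < n_pes:
        # Offsets that would wrap past a full circle carry no new information.
        peers = [(pe + j * stride) % n_pes
                 for j in range(1, radix) if j * stride < n_pes]
        rounds.append(peers)
        stride *= radix
    return rounds
```

With 8 PEs, a radix of 2 takes three rounds with one signal each, while a radix of 4 takes two rounds with more signals in the first. A larger radix trades fewer rounds for more messages per round.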

NVSHMEM_BARRIER_TG_DISSEM_KVAL
Type: int
Default: 2

Radix of the dissemination algorithm used for thread group barriers

Transport options

NVSHMEM_REMOTE_TRANSPORT
Type: string
Default: “ibrc”

Selected transport for remote operations: ibrc, ucx, libfabric, ibdevx, none

NVSHMEM_DISABLE_IB_NATIVE_ATOMICS
Type: bool
Default: false

Disable use of InfiniBand native atomics

NVSHMEM_DISABLE_GDRCOPY
Type: bool
Default: false

Disable use of GDRCopy in IB RC Transport

NVSHMEM_ENABLE_NIC_PE_MAPPING
Type: bool
Default: false

When not set or set to 0, a PE is assigned the NIC on the node that is closest to it by distance. When set to 1, NVSHMEM either assigns NICs to PEs on a round-robin basis or uses NVSHMEM_HCA_PE_MAPPING or NVSHMEM_HCA_LIST when they are specified.

NVSHMEM_IB_GID_INDEX
Type: int
Default: 0

Source GID index for RoCE

NVSHMEM_IB_TRAFFIC_CLASS
Type: int
Default: 0

Traffic class for RoCE

NVSHMEM_IB_SL
Type: int
Default: 0

Service level to use over IB/RoCE

NVSHMEM_HCA_LIST
Type: string
Default: “”

Comma-separated list of HCAs to use in the NVSHMEM application. Entries are of the form hca_name:port, e.g. mlx5_1:1,mlx5_2:2 and entries prefixed by ^ are excluded. NVSHMEM_ENABLE_NIC_PE_MAPPING must be set to 1 for this variable to be effective.
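The entry format can be illustrated with a small parser. This is a sketch under assumptions, not the library's code: the name parse_hca_list is hypothetical, and recording a missing port as None (rather than rejecting the entry) is a guess, since the text only shows entries with a port.

```python
def parse_hca_list(value):
    """Split an HCA list into (included, excluded) lists of (name, port) tuples."""
    included, excluded = [], []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        target = included
        if entry.startswith("^"):
            # A leading ^ marks the entry as excluded.
            target = excluded
            entry = entry[1:]
        name, sep, port = entry.partition(":")
        # Assumption: a missing port is recorded as None rather than rejected.
        target.append((name, int(port) if sep else None))
    return included, excluded
```

For the example above, parse_hca_list("mlx5_1:1,mlx5_2:2") yields two included entries and no excluded ones.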

NVSHMEM_HCA_PE_MAPPING
Type: string
Default: “”

Specifies the mapping of HCAs to PEs as a comma-separated list. Each entry is of the form hca_name:port:count, where count is the number of consecutive PEs assigned to that port. For example, mlx5_0:1:2,mlx5_0:2:2 indicates that PE0 and PE1 are mapped to port 1 of mlx5_0, and PE2 and PE3 are mapped to port 2 of mlx5_0. NVSHMEM_ENABLE_NIC_PE_MAPPING must be set to 1 for this variable to be effective.
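The mapping in the example can be reproduced with a short sketch that expands each hca_name:port:count entry into consecutive PE indices. The function name expand_hca_pe_mapping is hypothetical, and the numbering simply follows the example above (PEs are counted from 0 in list order).

```python
def expand_hca_pe_mapping(value):
    """Return {pe_index: (hca_name, port)} for a mapping string (hypothetical helper)."""
    assignment = {}
    next_pe = 0
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, port, count = entry.split(":")
        # Each entry claims the next `count` PEs in order.
        for _ in range(int(count)):
            assignment[next_pe] = (name, int(port))
            next_pe += 1
    return assignment
```

An entry that does not have exactly three colon-separated fields raises ValueError in this sketch; how the library reports malformed entries is not stated here.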

NVSHMEM_DISABLE_LOCAL_ONLY_PROXY
Type: bool
Default: false

When running on an NVLink-only configuration (no IB, no UCX), completely disable the proxy thread. This also disables device-side global exit and device-side wait timeout polling (enabled by the NVSHMEM_TIMEOUT_DEVICE_POLLING build-time variable), because both are processed by the proxy thread.

NVSHMEM_IB_GPUINITIATED_NUM_DCT
Type: int
Default: 2

Number of DCT QPs used in GPU-initiated communication transport.

NVSHMEM_IB_GPUINITIATED_NUM_DCI
Type: int
Default: 0

Total number of DCI QPs used in GPU-initiated communication transport. Set to 0 or a negative number to use automatic configuration.

NVSHMEM_IB_GPUINITIATED_NUM_DCI_PER_SM
Type: int
Default: 1

Number of exclusive DCI QPs assigned to each SM.

NVSHMEM_IB_GPUINITIATED_FORCE_NIC_BUF_MEMTYPE
Type: string
Default: “auto”

Force NIC buffer memory type. Valid choices are: gpumem, hostmem. For other values, use auto discovery (default).

NVSHMEM_IB_ENABLE_GPUINITIATED
Type: bool
Default: false

Set to enable GPU-initiated communication transport.

NVTX options

NVSHMEM_NVTX
Type: string
Default: “off”

Set to enable NVTX instrumentation. Accepts a comma-separated list of instrumentation groups. By default, NVTX instrumentation is disabled. The available groups are:

init                : library setup
alloc               : memory management
launch              : kernel launch routines
coll                : collective communications
wait                : blocking point-to-point synchronization
wait_on_stream      : point-to-point synchronization (on stream)
test                : non-blocking point-to-point synchronization
memorder            : memory ordering (quiet, fence)
quiet_on_stream     : nvshmemx_quiet_on_stream
atomic_fetch        : fetching atomic memory operations
atomic_set          : non-fetching atomic memory operations
rma_blocking        : blocking remote memory access operations
rma_nonblocking     : non-blocking remote memory access operations
proxy               : activity of the proxy thread
common              : init,alloc,launch,coll,memorder,wait,atomic_fetch,rma_blocking,proxy
all                 : all groups
off                 : disable all NVTX instrumentation
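The group list above can be modeled as a small expansion routine, which makes the meaning of the common and all aliases concrete. This is only a sketch: the function name expand_nvtx_groups is hypothetical, and two behaviors are assumptions not stated in the text, namely that off anywhere in the list disables everything and that an unknown group name is an error.

```python
NVTX_GROUPS = (
    "init", "alloc", "launch", "coll", "wait", "wait_on_stream", "test",
    "memorder", "quiet_on_stream", "atomic_fetch", "atomic_set",
    "rma_blocking", "rma_nonblocking", "proxy",
)

# Members of the "common" alias, as listed in the table above.
COMMON_GROUPS = (
    "init", "alloc", "launch", "coll", "memorder", "wait",
    "atomic_fetch", "rma_blocking", "proxy",
)


def expand_nvtx_groups(value):
    """Return the set of individual groups enabled by an NVSHMEM_NVTX-style string."""
    enabled = set()
    for token in value.split(","):
        token = token.strip()
        if not token:
            continue
        if token == "off":
            # Assumption: "off" disables everything, regardless of other entries.
            return set()
        if token == "all":
            enabled.update(NVTX_GROUPS)
        elif token == "common":
            enabled.update(COMMON_GROUPS)
        elif token in NVTX_GROUPS:
            enabled.add(token)
        else:
            # Assumption: unknown names are rejected rather than ignored.
            raise ValueError(f"unknown NVTX group: {token!r}")
    return enabled
```

Note that common leaves out wait_on_stream, test, quiet_on_stream, atomic_set, and rma_nonblocking, so a value such as common,test adds exactly one group to the common set.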