Environment Variables¶

Standard options¶

NVSHMEM_VERSION¶

Type: bool

Default: false

Print library version at startup.

NVSHMEM_INFO¶

Type: bool

Default: false

Print environment variable options at startup.

NVSHMEM_SYMMETRIC_SIZE¶

Type: size

Default: 1073741824

Specifies the size (in bytes) of the symmetric heap memory per PE. The resulting size is implementation-defined and must be at least as large as the integer ceiling of the product of the numeric prefix and the scaling factor. The allowed character suffixes for the scaling factor are as follows:

k or K multiplies by 2^10 (kibibytes)

m or M multiplies by 2^20 (mebibytes)

g or G multiplies by 2^30 (gibibytes)

t or T multiplies by 2^40 (tebibytes)

For example, the string ‘20m’ is equivalent to the integer value 20971520, or 20 mebibytes. Similarly, the string ‘3.1M’ is equivalent to the integer value 3250586. Only one multiplier is recognized and any characters following the multiplier are ignored, so ‘20kk’ will not produce the same result as ‘20m’. The string ‘.5m’ yields the same result as the string ‘0.5m’. An invalid value for NVSHMEM_SYMMETRIC_SIZE is an error, which the NVSHMEM library shall report by either returning a nonzero value from nvshmem_init_thread or causing program termination.
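The suffix rules above can be sketched in a few lines of Python. This is only an illustration of the documented arithmetic, not the parser NVSHMEM uses; the helper name parse_size is hypothetical, and the handling of trailing characters when no multiplier is present is an assumption.

```python
import re
from decimal import Decimal, ROUND_CEILING

# Scaling factors taken from the suffix table above.
_SCALE = {"k": 2**10, "m": 2**20, "g": 2**30, "t": 2**40}

# Numeric prefix, then at most one multiplier; anything after it is ignored.
# Assumption: trailing characters are also ignored when no multiplier is given.
_SIZE_RE = re.compile(r"(\d+\.?\d*|\.\d+)([kKmMgGtT]?)")


def parse_size(text):
    """Return the minimum byte count implied by a size string (hypothetical helper)."""
    match = _SIZE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"invalid size string: {text!r}")
    number, suffix = match.groups()
    scale = _SCALE.get(suffix.lower(), 1)
    # Integer ceiling of (numeric prefix x scaling factor), computed exactly.
    product = Decimal(number) * scale
    return int(product.to_integral_value(rounding=ROUND_CEILING))


print(parse_size("20m"))   # 20971520
print(parse_size("3.1M"))  # 3250586
print(parse_size("20kk"))  # 20480, not the same as "20m"
print(parse_size(".5m"))   # 524288, same as "0.5m"
```

Decimal arithmetic is used here so that the ceiling is taken on the exact product rather than on a binary floating-point approximation.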

NVSHMEM_DEBUG¶

Type: string

Default: “”

Set to enable debugging messages. Optional values: VERSION, WARN, INFO, ABORT, TRACE

Bootstrap options¶

NVSHMEM_BOOTSTRAP¶

Type: string

Default: “PMI”

Name of the default bootstrap that should be used to initialize NVSHMEM. Allowed values: PMI, MPI, SHMEM, plugin

NVSHMEM_BOOTSTRAP_PMI¶

Type: string

Default: “PMI”

Name of the PMI bootstrap that should be used to initialize NVSHMEM. Allowed values: PMI, PMI-2, PMIX

NVSHMEM_BOOTSTRAP_PLUGIN¶

Type: string

Default: “”

Name of the bootstrap plugin file to load when NVSHMEM_BOOTSTRAP=plugin is specified.

NVSHMEM_BOOTSTRAP_MPI_PLUGIN¶

Type: string

Default: “nvshmem_bootstrap_mpi.so”

Name of the MPI bootstrap plugin file.

NVSHMEM_BOOTSTRAP_SHMEM_PLUGIN¶

Type: string

Default: “nvshmem_bootstrap_shmem.so”

Name of the SHMEM bootstrap plugin file.

NVSHMEM_BOOTSTRAP_PMI_PLUGIN¶

Type: string

Default: “nvshmem_bootstrap_pmi.so”

Name of the PMI bootstrap plugin file.

NVSHMEM_BOOTSTRAP_PMI2_PLUGIN¶

Type: string

Default: “nvshmem_bootstrap_pmi2.so”

Name of the PMI-2 bootstrap plugin file.

NVSHMEM_BOOTSTRAP_PMIX_PLUGIN¶

Type: string

Default: “nvshmem_bootstrap_pmix.so”

Name of the PMIx bootstrap plugin file.

NVSHMEM_BOOTSTRAP_UID_PLUGIN¶

Type: string

Default: “nvshmem_bootstrap_uid.so”

Name of the UID bootstrap plugin file.

NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME¶

Type: string

Default: “”

Set to a list of prefixes that filters the interfaces NVSHMEM may use. With the ^ symbol, NVSHMEM excludes interfaces starting with any prefix in that list. To match (or exclude) an exact interface name instead of a prefix, prefix the string with the = character.

Examples:

eth : Use all interfaces starting with eth, e.g. eth0, eth1, …
=eth0 : Use only interface eth0.
^docker : Do not use any interface starting with docker.
^=docker0 : Do not use interface docker0.

Note: By default, the loopback interface (lo) and docker interfaces (docker*) are not selected unless no other interfaces are available. To use lo or docker* in preference to other interfaces, select them explicitly with NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME. The default algorithm also favors interfaces starting with ib over others. Setting NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME bypasses the automatic interface selection algorithm and may use all interfaces matching the manual selection.
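The matching rules for NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME can be illustrated with a short Python sketch. It models only the prefix, exact-match, and exclusion semantics described above; the function name select_interfaces is hypothetical, the comma separator between list entries is an assumption, and the sketch does not reproduce the default selection algorithm.

```python
def select_interfaces(spec, available):
    """Filter interface names by an IFNAME-style spec (hypothetical helper)."""
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")
    if exact:
        spec = spec[1:]
    # Assumption: list entries are separated by commas.
    patterns = [p for p in spec.split(",") if p]

    def matches(name):
        if exact:
            return name in patterns
        return any(name.startswith(p) for p in patterns)

    # With ^ the match is inverted: keep only the names that do not match.
    return [name for name in available if matches(name) != exclude]


names = ["lo", "eth0", "eth1", "docker0", "docker01"]
print(select_interfaces("eth", names))        # ['eth0', 'eth1']
print(select_interfaces("=eth0", names))      # ['eth0']
print(select_interfaces("^docker", names))    # ['lo', 'eth0', 'eth1']
print(select_interfaces("^=docker0", names))  # ['lo', 'eth0', 'eth1', 'docker01']
```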

NVSHMEM_BOOTSTRAP_UID_SOCK_FAMILY¶

Type: string

Default: “AF_INET”

Name of the socket family that the interface belongs to. Allowed values: AF_INET6, AF_INET.

NVSHMEM_BOOTSTRAP_UID_SESSION_ID¶

Type: string

Default: “”

Name of the UID session identifier, as specified by a combination of <ipv4>:<TCP port> or [<ipv6>]:<TCP port> or <hostname>:<TCP port>.

Additional options¶

NVSHMEM_DEBUG_FILE¶

Type: string

Default: “”

Debugging output filename, may contain %h for hostname and %p for pid.
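As a rough illustration of the %h and %p placeholders, the following Python sketch expands a filename template. The helper name expand_debug_filename and the example template are hypothetical; the library performs its own substitution internally.

```python
import os
import socket


def expand_debug_filename(template, hostname=None, pid=None):
    """Expand %h (hostname) and %p (pid) in a filename template (hypothetical helper)."""
    hostname = socket.gethostname() if hostname is None else hostname
    pid = os.getpid() if pid is None else pid
    return template.replace("%h", hostname).replace("%p", str(pid))


# Explicit values are passed here so the result is reproducible.
print(expand_debug_filename("nvshmem.%h.%p.log", hostname="node01", pid=4242))
# nvshmem.node01.4242.log
```

Including both placeholders gives each process its own file, which avoids interleaved output when several PEs run on one node.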

NVSHMEM_MAX_TEAMS¶

Type: long

Default: 32

Maximum number of simultaneous teams allowed.

NVSHMEM_MAX_MEMORY_PER_GPU¶

Type: size

Default: 137438953472

Maximum memory per GPU.

NVSHMEM_DISABLE_CUDA_VMM¶

Type: bool

Default: false

Disable use of CUDA VMM for P2P memory mapping. By default, CUDA VMM is enabled on x86 and disabled on P9. The CUDA VMM feature in NVSHMEM requires both the CUDA runtime version and the CUDA driver version to be 11.3 or later.

NVSHMEM_DISABLE_P2P¶

Type: bool

Default: false

Disable P2P connectivity of GPUs even when available.

NVSHMEM_DISABLE_NVLS¶

Type: bool

Default: false

Disable NVLINK SHARP collectives for P2P connected GPUs over NVSwitch even when available.

NVSHMEM_CUMEM_GRANULARITY¶

Type: size

Default: 536870912

Granularity for cuMemAlloc/cuMemCreate.

NVSHMEM_CUDA_LIMIT_STACK_SIZE¶

Type: size

Default: 0

Specify limit on stack size of each GPU thread on P9.

NVSHMEM_PROXY_REQUEST_BATCH_MAX¶

Type: int

Default: 32

Maximum number of requests that the proxy thread processes in a single iteration of the progress loop.

Collectives options¶

NVSHMEM_DISABLE_NCCL¶

Type: bool

Default: false

Disable use of NCCL for collective operations.

NVSHMEM_BARRIER_DISSEM_KVAL¶

Type: int

Default: 2

Radix of the dissemination algorithm used for barriers.

NVSHMEM_BARRIER_TG_DISSEM_KVAL¶

Type: int

Default: 2

Radix of the dissemination algorithm used for thread group barriers.

NVSHMEM_FCOLLECT_LL_THRESHOLD¶

Type: size

Default: 2048

Message size threshold up to which the fcollect LL algorithm is used.

NVSHMEM_BCAST_ALGO¶

Type: int

Default: 0

Broadcast algorithm to be used.

0 - use default algorithm selection strategy

NVSHMEM_REDMAXLOC_ALGO¶

Type: int

Default: 1

Reduction algorithm to be used.

1 - default, flag alltoall algorithm
2 - flat reduce + flat bcast
3 - topo-aware two-level reduce + topo-aware bcast

NVSHMEM_REDUCE_SCRATCH_SIZE¶

Type: size_t

Default: 524288

Amount of symmetric heap memory (minimum 16B, multiple of 8B) reserved by runtime for every team to implement reduce and reducescatter collectives.
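The two constraints on NVSHMEM_REDUCE_SCRATCH_SIZE (at least 16 bytes, and a multiple of 8 bytes) can be checked before launching a job. The Python sketch below is a hypothetical pre-flight check, not part of NVSHMEM, and how the library treats a nonconforming value is not stated here.

```python
def is_valid_reduce_scratch_size(num_bytes):
    """Check the documented constraints: minimum 16 B and a multiple of 8 B."""
    return num_bytes >= 16 and num_bytes % 8 == 0


print(is_valid_reduce_scratch_size(524288))  # True (the default)
print(is_valid_reduce_scratch_size(20))      # False (not a multiple of 8)
print(is_valid_reduce_scratch_size(8))       # False (below the 16-byte minimum)
```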

Transport options¶

NVSHMEM_REMOTE_TRANSPORT¶

Type: string

Default: “ibrc”

Selected transport for remote operations: ibrc, ucx, libfabric, ibdevx, none.

NVSHMEM_DISABLE_IB_NATIVE_ATOMICS¶

Type: bool

Default: false

Disable use of InfiniBand native atomics.

NVSHMEM_DISABLE_GDRCOPY¶

Type: bool

Default: false

Disable use of GDRCopy in IB RC Transport.

NVSHMEM_ENABLE_NIC_PE_MAPPING¶

Type: bool

Default: false

When not set or set to 0, a PE is assigned to the NIC on the node that is closest to it by distance. When set to 1, NVSHMEM either assigns NICs to PEs on a round-robin basis or uses NVSHMEM_HCA_PE_MAPPING or NVSHMEM_HCA_LIST when they are specified.

NVSHMEM_IB_GID_INDEX¶

Type: int

Default: -1

Source GID index for ROCE. By default, NVSHMEM dynamically discovers the GID supported by the NIC.

NVSHMEM_IB_TRAFFIC_CLASS¶

Type: int

Default: 0

Traffic class for ROCE.

NVSHMEM_IB_SL¶

Type: int

Default: 0

Service level to use over IB/ROCE.

NVSHMEM_IB_ADDR_FAMILY¶

Type: string

Default: AF_INET

IP address family associated with the GID dynamically selected by NVSHMEM when NVSHMEM_IB_GID_INDEX is left unset.

NVSHMEM_IB_ADDR_RANGE¶

Type: string

Default: ::/0

Defines the range of valid GIDs dynamically selected by NVSHMEM when NVSHMEM_IB_GID_INDEX is left unset.

NVSHMEM_IB_ROCE_VERSION_NUM¶

Type: int

Default: 2

ROCE version associated with the IB GID dynamically selected by NVSHMEM when NVSHMEM_IB_GID_INDEX is left unset.

NVSHMEM_HCA_LIST¶

Type: string

Default: “”

Comma-separated list of HCAs to use in the NVSHMEM application. Entries are of the form hca_name:port, e.g. mlx5_1:1,mlx5_2:2 and entries prefixed by ^ are excluded. NVSHMEM_ENABLE_NIC_PE_MAPPING must be set to 1 for this variable to be effective.
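To make the entry format concrete, here is a minimal Python sketch that splits an NVSHMEM_HCA_LIST value into included and excluded entries. The function name parse_hca_list is hypothetical, and treating the port as optional is an assumption, since the text above only shows the hca_name:port form.

```python
def parse_hca_list(value):
    """Split an HCA list into (included, excluded) entries (hypothetical helper)."""
    included, excluded = [], []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # Entries prefixed by ^ are excluded.
        target = excluded if entry.startswith("^") else included
        name, _, port = entry.lstrip("^").partition(":")
        # Assumption: a missing port is recorded as None.
        target.append((name, int(port) if port else None))
    return included, excluded


print(parse_hca_list("mlx5_1:1,mlx5_2:2"))
# ([('mlx5_1', 1), ('mlx5_2', 2)], [])
print(parse_hca_list("^mlx5_0:1,mlx5_1:1"))
# ([('mlx5_1', 1)], [('mlx5_0', 1)])
```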

NVSHMEM_HCA_PE_MAPPING¶

Type: string

Default: “”

Specifies the mapping of HCAs to PEs as a comma-separated list. Each entry in the list has the form hca_name:port:count. For example, mlx5_0:1:2,mlx5_0:2:2 indicates that PE0 and PE1 are mapped to port 1 of mlx5_0, and PE2 and PE3 are mapped to port 2 of mlx5_0. NVSHMEM_ENABLE_NIC_PE_MAPPING must be set to 1 for this variable to be effective.
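The hca_name:port:count form can be read as "assign the next count PEs to this port". A small Python sketch of that reading follows; parse_hca_pe_mapping is a hypothetical name, and the sketch assumes PEs are numbered consecutively from 0 in the order the entries appear, as in the example above.

```python
def parse_hca_pe_mapping(value):
    """Map PE index -> (hca_name, port) from an hca_name:port:count list (hypothetical helper)."""
    mapping = {}
    next_pe = 0
    for entry in value.split(","):
        name, port, count = entry.strip().split(":")
        # Each entry claims the next `count` PEs, in order.
        for _ in range(int(count)):
            mapping[next_pe] = (name, int(port))
            next_pe += 1
    return mapping


for pe, (hca, port) in parse_hca_pe_mapping("mlx5_0:1:2,mlx5_0:2:2").items():
    print(f"PE{pe} -> {hca} port {port}")
# PE0 -> mlx5_0 port 1
# PE1 -> mlx5_0 port 1
# PE2 -> mlx5_0 port 2
# PE3 -> mlx5_0 port 2
```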

NVSHMEM_DISABLE_LOCAL_ONLY_PROXY¶

Type: bool

Default: false

When running on an NVLink-only configuration (no IB, no UCX), completely disable the proxy thread. This also disables device-side global exit and device-side wait timeout polling (enabled by the NVSHMEM_TIMEOUT_DEVICE_POLLING build-time variable), because these are processed by the proxy thread.

NVSHMEM_LIBFABRIC_PROVIDER¶

Type: string

Default: “cxi”

Set the feature set provider for the libfabric transport: cxi, efa, verbs

NVSHMEM_IBGDA_NUM_DCT¶

Type: int

Default: 2

Number of DCT QPs used in GPU-initiated communication transport.

NVSHMEM_IBGDA_NUM_DCI¶

Type: int

Default: 1

Total number of DCI QPs used in GPU-initiated communication transport. Set to 0 or a negative number to use automatic configuration.

NVSHMEM_IBGDA_NUM_SHARED_DCI¶

Type: int

Default: 1

Number of DCI QPs in the shared pool. The rest of DCI QPs (NVSHMEM_IBGDA_NUM_DCI - NVSHMEM_IBGDA_NUM_SHARED_DCI) are exclusively assigned. Valid value: [1, NVSHMEM_IBGDA_NUM_DCI].
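The relationship between the two DCI settings is simple arithmetic, sketched below in Python. The helper name split_dci is hypothetical, and raising an error for an out-of-range value is an assumption made for illustration; it does not describe how NVSHMEM reacts to an invalid setting.

```python
def split_dci(num_dci, num_shared_dci):
    """Return (shared, exclusive) DCI QP counts (hypothetical helper)."""
    # Valid range from the text above: [1, NVSHMEM_IBGDA_NUM_DCI].
    if not 1 <= num_shared_dci <= num_dci:
        raise ValueError("NUM_SHARED_DCI must be in [1, NUM_DCI]")
    return num_shared_dci, num_dci - num_shared_dci


print(split_dci(8, 2))  # (2, 6): 2 shared, 6 exclusively assigned
print(split_dci(1, 1))  # (1, 0): with the defaults, no exclusive DCI QPs remain
```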

NVSHMEM_IBGDA_DCI_MAP_BY¶

Type: string

Default: “cta”

Specifies how exclusive DCI QPs are assigned. Choices are: cta, sm, warp, dct.

cta: round-robin by CTA ID (default).
sm: round-robin by SM ID.
warp: round-robin by Warp ID.
dct: round-robin by DCT ID.

NVSHMEM_IBGDA_NUM_RC_PER_PE¶

Type: int

Default: 2

Number of RC QPs per peer PE used in GPU-initiated communication transport. Set to 0 to disable RC QPs (default 2). If set to a positive number, DCI will be used for enforcing consistency only.

NVSHMEM_IBGDA_RC_MAP_BY¶

Type: string

Default: “cta”

Specifies how RC QPs are assigned. Choices are: cta, sm, warp.

cta: round-robin by CTA ID (default).
sm: round-robin by SM ID.
warp: round-robin by Warp ID.

NVSHMEM_IBGDA_FORCE_NIC_BUF_MEMTYPE¶

Type: string

Default: “gpumem”

Force the NIC buffer memory type. Valid choices are: gpumem (default), hostmem. Any other value selects auto discovery.

NVSHMEM_IBGDA_NUM_REQUESTS_IN_BATCH¶

Type: int

Default: 32

Number of requests to be batched before submitting to the NIC. It will be rounded up to the nearest power of 2. Set to 1 for aggressive submission.
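The rounding described above can be previewed with a short Python sketch. It assumes that a value which is already a power of 2 is left unchanged; round_up_pow2 is a hypothetical helper and not an NVSHMEM function.

```python
def round_up_pow2(n):
    """Round a positive integer up to the nearest power of 2 (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # (n - 1).bit_length() is the exponent of the smallest power of 2 that is >= n.
    return 1 << (n - 1).bit_length()


print(round_up_pow2(1))     # 1
print(round_up_pow2(32))    # 32
print(round_up_pow2(33))    # 64
print(round_up_pow2(1000))  # 1024
```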

NVSHMEM_IBGDA_NUM_FETCH_SLOTS_PER_DCI¶

Type: int

Default: 1024

Number of internal buffer slots for fetch operations for each DCI QP. It will be rounded up to the nearest power of 2.

NVSHMEM_IBGDA_NUM_FETCH_SLOTS_PER_RC¶

Type: int

Default: 1024

Number of internal buffer slots for fetch operations for each RC QP. It will be rounded up to the nearest power of 2.

NVSHMEM_IB_ENABLE_IBGDA¶

Type: bool

Default: false

Set to enable GPU-initiated communication transport.

NVSHMEM_IBGDA_NIC_HANDLER¶

Type: string

Default: auto

Selects the processor used for ringing the NIC’s doorbell. Choices are: auto, gpu, cpu.

auto: Use GPU SMs and fall back to CPU if that is not supported (default).
gpu: Use GPU SMs.
cpu: Use CPU proxy thread.

NVSHMEM_IB_DISABLE_DMABUF¶

Type: bool

Default: false

Set to disable DMAbuf in any IB based remote transport.

NVSHMEM_IBGDA_ENABLE_MULTI_PORT¶

Type: bool

Default: false

Set to enable multiple NICs per PE if available.

NVTX options¶

NVSHMEM_NVTX¶

Type: string

Default: “off”

Set to enable NVTX instrumentation. Accepts a comma separated list of instrumentation groups. By default the NVTX instrumentation is disabled.

init                : library setup
alloc               : memory management
launch              : kernel launch routines
coll                : collective communications
wait                : blocking point-to-point synchronization
wait_on_stream      : point-to-point synchronization (on stream)
test                : non-blocking point-to-point synchronization
memorder            : memory ordering (quiet, fence)
quiet_on_stream     : nvshmemx_quiet_on_stream
atomic_fetch        : fetching atomic memory operations
atomic_set          : non-fetching atomic memory operations
rma_blocking        : blocking remote memory access operations
rma_nonblocking     : non-blocking remote memory access operations
proxy               : activity of the proxy thread
common              : init,alloc,launch,coll,memorder,wait,atomic_fetch,rma_blocking,proxy
all                 : all groups
off                 : disable all NVTX instrumentation
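To show how the group list composes, the Python sketch below expands an NVSHMEM_NVTX value into the set of individual groups from the table above. The function name expand_nvtx_groups is hypothetical, and two behaviors are assumptions: off anywhere in the list disables everything, and unknown names raise an error.

```python
# Individual instrumentation groups from the table above.
NVTX_GROUPS = [
    "init", "alloc", "launch", "coll", "wait", "wait_on_stream", "test",
    "memorder", "quiet_on_stream", "atomic_fetch", "atomic_set",
    "rma_blocking", "rma_nonblocking", "proxy",
]

# Members of the "common" shorthand, as listed in the table.
COMMON = [
    "init", "alloc", "launch", "coll", "memorder", "wait",
    "atomic_fetch", "rma_blocking", "proxy",
]


def expand_nvtx_groups(value):
    """Expand a comma-separated NVTX group list into a set of groups (hypothetical helper)."""
    enabled = set()
    for token in value.split(","):
        token = token.strip()
        if token == "off":
            return set()  # assumption: "off" overrides every other entry
        if token == "all":
            enabled.update(NVTX_GROUPS)
        elif token == "common":
            enabled.update(COMMON)
        elif token in NVTX_GROUPS:
            enabled.add(token)
        elif token:
            raise ValueError(f"unknown NVTX group: {token!r}")
    return enabled


print(sorted(expand_nvtx_groups("init,test")))  # ['init', 'test']
print(len(expand_nvtx_groups("common")))        # 9
print(len(expand_nvtx_groups("all")))           # 14
```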