Environment Variables

This document describes the environment variables used by Transformer Engine. They provide an alternate method to alter Transformer Engine’s behavior during build and runtime, but are less rigorously maintained compared to the API and may be subject to change.

Build-Time Environment Variables

These environment variables control the build and compilation process of Transformer Engine.

Build Configuration

NVTE_BUILD_DEBUG
Type:

int (0 or 1)

Default:

0

Description:

Enable debug build mode. When set to 1, the build includes debug symbols (-g) and disables optimizations.

NVTE_BUILD_MAX_JOBS
Type:

int

Default:

Maximum available

Description:

Number of parallel jobs to use during the build process. If not set, the system will use the maximum available parallel jobs. Also respects the standard MAX_JOBS environment variable.

NVTE_BUILD_THREADS_PER_JOB
Type:

int

Default:

1

Description:

Number of threads to use per parallel build job. This is passed to the CUDA compiler via the --threads flag.

NVTE_FRAMEWORK
Type:

str

Default:

Auto-detected

Description:

Comma-separated list of frameworks to build support for (pytorch, jax, all, or none). If not specified, automatically detects installed frameworks.

NVTE_USE_CCACHE
Type:

int (0 or 1)

Default:

0

Description:

Enable ccache for faster recompilation. When set to 1, uses ccache as a compiler launcher for both C++ and CUDA compilation.

NVTE_CCACHE_BIN
Type:

str

Default:

ccache

Description:

Path to the ccache binary. Only used when NVTE_USE_CCACHE is set to 1.

NVTE_CMAKE_BUILD_DIR
Type:

str

Default:

None

Description:

Path to the CMake build directory for incremental builds. If set, CMake will use this directory for build artifacts.

NVTE_RELEASE_BUILD
Type:

int (0 or 1)

Default:

0

Description:

Enable release build mode. When set to 1, prepares the build for distribution (e.g., PyPI wheel). This affects library installation paths and build tool management.

NVTE_PROJECT_BUILDING
Type:

int (0 or 1)

Default:

Not set

Description:

Internal flag set to 1 during the build process to indicate that the project is being built. Not intended for external use.

Optional Dependencies

NVTE_UB_WITH_MPI
Type:

int (0 or 1)

Default:

0

Description:

Enable MPI support for userbuffers. When set to 1, requires MPI_HOME to be set to the MPI installation directory.

NVTE_ENABLE_NVSHMEM
Type:

int (0 or 1)

Default:

0

Description:

Enable NVSHMEM support. When set to 1, requires NVSHMEM_HOME to be set to the NVSHMEM installation directory.

NVTE_BUILD_ACTIVATION_WITH_FAST_MATH
Type:

CMake option

Default:

OFF

Description:

Compile activation kernels (GELU, ReLU, SwiGLU) with the --use_fast_math CUDA compiler flag for improved performance at the cost of some precision.

CUDA Configuration

NVTE_CUDA_ARCHS
Type:

str

Default:

Auto-detected based on CUDA version

Description:

Semicolon-separated list of CUDA compute architectures to compile for (e.g., "80;90" for A100 and H100, or "75;80;89;90"). If not set, automatically determined based on the installed CUDA Toolkit version. CUDA 13.0+ defaults to "75;80;89;90;100;120", CUDA 12.8+ defaults to "70;80;89;90;100;120", and earlier versions default to "70;80;89;90". Setting this can significantly reduce build time and binary size by targeting only the GPU architectures you need.

NVTE_CUDA_INCLUDE_DIR
Type:

str

Default:

Auto-detected

Description:

Path to CUDA include directory containing cuda_runtime.h. If not set, Transformer Engine searches in common locations (CUDA_HOME, CUDA_DIR, /usr/local/cuda). This is used for NVRTC kernel compilation.

Runtime Environment Variables

These environment variables control the behavior of Transformer Engine during execution.

Attention Backend Selection

NVTE_FLASH_ATTN
Type:

int (0 or 1)

Default:

1

Description:

Enable or disable FlashAttention backend for DotProductAttention. When set to 0, FlashAttention will not be used.

NVTE_FUSED_ATTN
Type:

int (0 or 1)

Default:

1

Description:

Enable or disable FusedAttention backend (cuDNN-based) for DotProductAttention. When set to 0, FusedAttention will not be used.

NVTE_UNFUSED_ATTN
Type:

int (0 or 1)

Default:

1

Description:

Enable or disable UnfusedDotProductAttention backend (native PyTorch). When set to 0, UnfusedDotProductAttention will not be used.

NVTE_FUSED_ATTN_BACKEND
Type:

int (0, 1, or 2)

Default:

Auto-selected

Description:

Force a specific FusedAttention backend. 0 = F16_max512_seqlen (cuDNN, ≤512 seq len), 1 = F16_arbitrary_seqlen (cuDNN, any seq len), 2 = FP8 backend. If not set, the backend is automatically selected based on the input configuration.

NVTE_FUSED_ATTN_FORCE_WORKSPACE_OPT
Type:

int (0 or 1)

Default:

Auto-determined

Description:

Control workspace-related optimizations in FusedAttention. 0 disables optimizations, 1 enables them. These optimizations trade memory for performance. When unset, Transformer Engine determines the code path based on internal logic. For deterministic behavior with cuDNN ≥8.9.5 and <9.0.0, this is automatically set to 1.

NVTE_FUSED_ATTN_USE_FAv2_BWD
Type:

int (0 or 1)

Default:

0

Description:

When using FusedAttention, use FlashAttention-2 implementation for the backward pass instead of the cuDNN implementation. This can be useful due to performance differences between various versions of flash-attn and FusedAttention.

NVTE_ALLOW_NONDETERMINISTIC_ALGO
Type:

int (0 or 1)

Default:

1

Description:

Allow non-deterministic algorithms for Transformer Engine execution. When set to 0, only deterministic algorithms are allowed. This is relevant for both PyTorch and JAX attention implementations.

NVTE_FUSED_RING_ATTENTION_USE_SCAN
Type:

int (0 or 1)

Default:

0

Description:

(JAX only) Use scan loop for ring attention implementation. When set to 1, the fused ring attention will use a scan-based iteration approach.

NVTE_APPLY_QK_LAYER_SCALING
Type:

int (0 or 1)

Default:

0

Description:

Apply QK layer scaling in UnfusedDotProductAttention. This is an FP16 training trick required for certain GPT-like models. When set to 1 and a layer number is provided, the softmax scale is divided by the layer number, and the layer number is used as the softmax scale during the softmax operation. Only effective when using FP16 dtype and when the layer number is specified.

Context Parallelism

NVTE_BATCH_MHA_P2P_COMM
Type:

int (0 or 1)

Default:

0 (or auto-enabled for pre-Blackwell GPUs with CP size 2)

Description:

Use batched P2P communication (batch_isend_irecv) for KV exchange in context parallel MultiheadAttention. When enabled, send and receive operations are batched together, which can improve communication efficiency. This is automatically enabled for devices with compute capability < 10.0 (pre-Blackwell GPUs) when context parallel size is 2. Setting this to 1 forces batched P2P communication regardless of device architecture.

FP8 Configuration

NVTE_UNFUSED_FP8_UPDATE
Type:

int (0 or 1)

Default:

0

Description:

Use unfused kernel for FP8 amax and scale updates. When set to 1, amax and scale updates are computed using separate unfused kernels instead of fused operations.

NVTE_FP8_DPA_BWD
Type:

int (0 or 1)

Default:

1

Description:

Enable FP8 in the backward pass of DotProductAttention. 1 = FP8 forward and backward, 0 = FP8 forward and FP16/BF16 backward.

NVTE_DPA_FP8CS_O_in_F16
Type:

int (0 or 1)

Default:

1

Description:

For Float8CurrentScaling in DotProductAttention, use FP16/BF16 for the output tensor in the backward pass. 1 = use F16/BF16 output in backward, 0 = use FP8 output in backward.

NVTE_DPA_FP8_RECIPE
Type:

str

Default:

Empty (use same as linear layers)

Description:

Override FP8 recipe for DotProductAttention layers. Valid values: "F16" (disable FP8), "DelayedScaling", or "Float8CurrentScaling". This allows using different FP8 recipes for attention vs. linear layers.

NVTE_DPA_FP8_RECIPE_DPA
Type:

int (0 or 1)

Default:

0

Description:

Enable FP8 in DotProductAttention when using NVTE_DPA_FP8_RECIPE. When set to 1, the DotProductAttention layer will use the FP8 recipe specified by NVTE_DPA_FP8_RECIPE. This provides fine-grained control over which attention components use FP8.

NVTE_DPA_FP8_RECIPE_MHA
Type:

int (0 or 1)

Default:

0

Description:

Enable FP8 in MultiheadAttention (MHA) when using NVTE_DPA_FP8_RECIPE. When set to 1, the MultiheadAttention QKV and output projection layers will use the FP8 recipe specified by NVTE_DPA_FP8_RECIPE. This provides fine-grained control over which attention components use FP8.

NVTE_DPA_FP8_FORMAT
Type:

str

Default:

"HYBRID"

Description:

FP8 format for DotProductAttention when switching recipes. Valid values: "HYBRID", "E4M3", "E5M2". Only used when NVTE_DPA_FP8_RECIPE is set.

NVTE_DPA_FP8DS_AMAX_ALGO
Type:

str

Default:

"most_recent"

Description:

Amax computation algorithm for DelayedScaling recipe in DotProductAttention. Valid values: "most_recent", "max". Only used when NVTE_DPA_FP8_RECIPE is set to "DelayedScaling".

NVTE_DPA_FP8DS_AMAX_HISTLEN
Type:

int

Default:

1

Description:

Amax history length for DelayedScaling recipe in DotProductAttention. Only used when NVTE_DPA_FP8_RECIPE is set to "DelayedScaling".

NVTE_DPA_FP8DS_REDUCE_AMAX
Type:

int (0 or 1)

Default:

1

Description:

Reduce amax across distributed ranks for DelayedScaling recipe in DotProductAttention. Only used when NVTE_DPA_FP8_RECIPE is set to "DelayedScaling".

NVTE_UnfusedDPA_Emulate_FP8
Type:

int (0 or 1)

Default:

0

Description:

Allow FP8 emulation in UnfusedDotProductAttention. When set to 1, UnfusedDotProductAttention can emulate FP8 operations using FP16/BF16 computation.

Kernel Configuration

NVTE_USE_FAST_MATH
Type:

int (0 or 1)

Default:

0

Description:

Enable fast math optimizations in runtime-compiled (NVRTC) kernels. This trades numerical accuracy for performance. These optimizations are experimental and inconsistently implemented.

NVTE_DISABLE_NVRTC
Type:

int (0 or 1)

Default:

0

Description:

Disable NVRTC (CUDA Runtime Compilation) support. When set to 1, runtime kernel compilation is disabled. This can be useful in environments where NVRTC is not available or not desired.

NVTE_USE_CUTLASS_GROUPED_GEMM
Type:

int (0 or 1)

Default:

0

Description:

Use CUTLASS implementation for grouped GEMM operations instead of cuBLAS. When set to 1, enables CUTLASS grouped GEMM kernels, which may provide better performance for certain workloads on Hopper (SM90) GPUs.

NVTE_CUTLASS_GROUPED_GEMM_WARN_FALLBACK
Type:

int (0 or 1)

Default:

0

Description:

Emit a warning when falling back from CUTLASS to cuBLAS for grouped GEMM operations.

Torch Compilation and Fusion

NVTE_TORCH_COMPILE
Type:

int (0 or 1)

Default:

1

Description:

Enable PyTorch 2.x torch.compile support for compatible Transformer Engine operations. When set to 0, disables compilation support and uses regular PyTorch eager mode.

NVTE_BIAS_GELU_NVFUSION
Type:

int (0 or 1)

Default:

1

Description:

Enable GELU fusion with bias using NVFusion in PyTorch. When set to 0, uses separate bias and GELU operations.

NVTE_BIAS_DROPOUT_FUSION
Type:

int (0 or 1)

Default:

1

Description:

Enable fusion of bias and dropout operations. When set to 0, bias and dropout are computed separately.

LayerNorm/RMSNorm SM Margins

NVTE_FWD_LAYERNORM_SM_MARGIN
Type:

int

Default:

0

Description:

Number of SMs (Streaming Multiprocessors) to reserve (not use) during forward LayerNorm/RMSNorm operations. This can be used to control resource allocation and overlap computation with communication.

NVTE_BWD_LAYERNORM_SM_MARGIN
Type:

int

Default:

0

Description:

Number of SMs to reserve during backward LayerNorm/RMSNorm operations.

NVTE_INF_LAYERNORM_SM_MARGIN
Type:

int

Default:

0

Description:

Number of SMs to reserve during inference LayerNorm/RMSNorm operations.

GEMM Configuration

NVTE_EXT_MARGIN_SM
Type:

int

Default:

Total SM count

Description:

External SM margin for GEMM operations. Specifies the number of SMs to use for GEMM operations. The actual number of SMs used is sm_count - NVTE_EXT_MARGIN_SM.

NVTE_AG_P2P_MULTI_ATOMIC
Type:

int (0 or 1)

Default:

0

Description:

Enable multi-atomic mode for AllGather with atomic GEMM using P2P communication. When set to 1, uses userbuffers_sendrecv_multiatomic for communication during atomic GEMM overlap with AllGather operations. This disables copy engine (CE) usage and enables push mode for userbuffers. This is an advanced optimization for tensor-parallel communication-computation overlap.

CPU Offloading

NVTE_CPU_OFFLOAD_V1
Type:

int (0 or 1)

Default:

0

Description:

Enable legacy version of CPU offloading implementation.

Debugging and Profiling

NVTE_DEBUG
Type:

int (0 or 1)

Default:

0

Description:

Enable debug mode. When set to 1, enables verbose debug output and additional checks in attention operations.

NVTE_DEBUG_LEVEL
Type:

int (0, 1, or 2)

Default:

0

Description:

Debug verbosity level. Higher values enable more verbose debug output. Only effective when NVTE_DEBUG is set to 1.

NVTE_PRINT_LAYER_NUMBER
Type:

int

Default:

1

Description:

Layer number to print debug information for during attention operations.

NVTE_PRINT_RANK
Type:

int

Default:

0

Description:

Distributed rank to print debug information for during attention operations.

NVTE_NVTX_ENABLED
Type:

int (0 or 1)

Default:

0

Description:

Enable NVTX (NVIDIA Tools Extension) range profiling for Transformer Engine operations. When set to 1, NVTX markers are added to operations for profiling with NVIDIA Nsight Systems.

NVTE_DEBUG_NUMERICS
Type:

int (0 or 1)

Default:

0

Description:

(JAX only) Enable verbose printing of tensor numerics for debugging purposes.

Testing

NVTE_TEST_NVINSPECT_ENABLED
Type:

int (0 or 1)

Default:

0

Description:

Enable NVInspect integration for testing. When set to 1, enables the NVInspect debugging API for numerical analysis during tests.

NVTE_TEST_NVINSPECT_CONFIG_FILE
Type:

str

Default:

None

Description:

Path to NVInspect configuration file. Required when NVTE_TEST_NVINSPECT_ENABLED is set to 1.

NVTE_TEST_NVINSPECT_FEATURE_DIRS
Type:

str

Default:

None

Description:

Comma-separated list of directories containing NVInspect features. Required when NVTE_TEST_NVINSPECT_ENABLED is set to 1.

NVTE_TEST_ARTIFACTS_DIR
Type:

str

Default:

System temp directory

Description:

Directory for storing test artifacts (e.g., generated ONNX models).

ONNX Export

NVTE_ONNX_KVCACHE_MAX_SEQ_LEN
Type:

int

Default:

128

Description:

Maximum sequence length for KV cache during ONNX export. This is used for attention masking in exported ONNX models.

JAX-Specific Variables

NVTE_JAX_CUSTOM_CALLS
Type:

str

Default:

None

Description:

Control which JAX custom call primitives are enabled or disabled. Format: "true" (enable all), "false" (disable all), or comma-separated key-value pairs like "GemmPrimitive=false,DBiasQuantizePrimitive=true". This provides fine-grained control over which operations use custom CUDA kernels vs. JAX native implementations.

NVTE_JAX_CUSTOM_CALLS_RE
Type:

str

Default:

None

Description:

Deprecated (use NVTE_JAX_CUSTOM_CALLS instead). Regex pattern to match primitive names for enabling/disabling. Example: "DBiasQuantizePrimitive" or "^(?!DBiasQuantizePrimitive$).+$".

NVTE_JAX_UNITTEST_LEVEL
Type:

str

Default:

None

Description:

Test level for JAX unit tests ("L0", "L1", "L2"). Used internally by the test suite.

Examples

Building with Debug Symbols

export NVTE_BUILD_DEBUG=1
export NVTE_USE_CCACHE=1
pip install -e .

Using Specific Attention Backend

# Use only FlashAttention, disable FusedAttention
export NVTE_FLASH_ATTN=1
export NVTE_FUSED_ATTN=0
python train.py

Configuring FP8 for Attention

# Use DelayedScaling for attention, CurrentScaling for linear layers
export NVTE_DPA_FP8_RECIPE="DelayedScaling"
export NVTE_DPA_FP8_FORMAT="HYBRID"
export NVTE_DPA_FP8DS_AMAX_ALGO="most_recent"
export NVTE_DPA_FP8DS_AMAX_HISTLEN=1024
python train.py

Enable Profiling

# Enable NVTX markers for profiling
export NVTE_NVTX_ENABLED=1
nsys profile --trace=nvtx,cuda python train.py

JAX Custom Calls Control

# Disable all custom calls
export NVTE_JAX_CUSTOM_CALLS="false"
python train_jax.py

# Disable specific primitives
export NVTE_JAX_CUSTOM_CALLS="GemmPrimitive=false,DBiasQuantizePrimitive=false"
python train_jax.py