Environment Variables

This document describes the environment variables used by Transformer Engine. They provide an alternate method to alter Transformer Engine’s behavior during build and runtime, but are less rigorously maintained compared to the API and may be subject to change.

Build-Time Environment Variables

These environment variables control the build and compilation process of Transformer Engine.

Build Configuration

NVTE_BUILD_DEBUG

Type:: int (0 or 1)
Default:: 0
Description:: Enable debug build mode. When set to 1, the build includes debug symbols (-g) and disables optimizations.

NVTE_BUILD_MAX_JOBS

Type:: int
Default:: Maximum available
Description:: Number of parallel jobs to use during the build process. If not set, the system will use the maximum available parallel jobs. Also respects the standard MAX_JOBS environment variable.

NVTE_BUILD_THREADS_PER_JOB

Type:: int
Default:: 1
Description:: Number of threads to use per parallel build job. This is passed to the CUDA compiler via the --threads flag.

NVTE_FRAMEWORK

Type:: str
Default:: Auto-detected
Description:: Comma-separated list of frameworks to build support for (pytorch, jax, all, or none). If not specified, automatically detects installed frameworks.

NVTE_USE_CCACHE

Type:: int (0 or 1)
Default:: 0
Description:: Enable ccache for faster recompilation. When set to 1, uses ccache as a compiler launcher for both C++ and CUDA compilation.

NVTE_CCACHE_BIN

Type:: str
Default:: ccache
Description:: Path to the ccache binary. Only used when NVTE_USE_CCACHE is set to 1.

NVTE_CMAKE_BUILD_DIR

Type:: str
Default:: None
Description:: Path to the CMake build directory for incremental builds. If set, CMake will use this directory for build artifacts.

NVTE_RELEASE_BUILD

Type:: int (0 or 1)
Default:: 0
Description:: Enable release build mode. When set to 1, prepares the build for distribution (e.g., PyPI wheel). This affects library installation paths and build tool management.

NVTE_PROJECT_BUILDING

Type:: int (0 or 1)
Default:: Not set
Description:: Internal flag set to 1 during the build process to indicate that the project is being built. Not intended for external use.

NVTE_BUILD_NUM_PHILOX_ROUNDS

Type:: int (positive integer)
Default:: 10
Description:: Number of Philox4x32 rounds used by stochastic rounding kernels. Must be a positive integer.

Optional Dependencies

NVTE_UB_WITH_MPI

Type:: int (0 or 1)
Default:: 0
Description:: Enable MPI support for userbuffers. When set to 1, requires MPI_HOME to be set to the MPI installation directory.

NVTE_ENABLE_NVSHMEM

Type:: int (0 or 1)
Default:: 0
Description:: Enable NVSHMEM support. When set to 1, requires NVSHMEM_HOME to be set to the NVSHMEM installation directory.

NVTE_BUILD_ACTIVATION_WITH_FAST_MATH

Type:: CMake option
Default:: OFF
Description:: Compile activation kernels (GELU, ReLU, SwiGLU) with the --use_fast_math CUDA compiler flag for improved performance at the cost of some precision.

CUDA Configuration

NVTE_CUDA_ARCHS

Type:: str
Default:: Auto-detected based on CUDA version
Description:: Semicolon-separated list of CUDA compute architectures to compile for (e.g., "80;90" for A100 and H100, or "75;80;89;90"). If not set, automatically determined based on the installed CUDA Toolkit version. CUDA 13.0+ defaults to "75;80;89;90;100;120", CUDA 12.8+ defaults to "70;80;89;90;100;120", and earlier versions default to "70;80;89;90". Setting this can significantly reduce build time and binary size by targeting only the GPU architectures you need.

NVTE_CUDA_INCLUDE_DIR

Type:: str
Default:: Auto-detected
Description:: Path to CUDA include directory containing cuda_runtime.h. If not set, Transformer Engine searches in common locations (CUDA_HOME, CUDA_DIR, /usr/local/cuda). This is used for NVRTC kernel compilation.

Runtime Environment Variables

These environment variables control the behavior of Transformer Engine during execution.

Attention Backend Selection

NVTE_FLASH_ATTN

Type:: int (0 or 1)
Default:: 1
Description:: Enable or disable FlashAttention backend for DotProductAttention. When set to 0, FlashAttention will not be used.

NVTE_FUSED_ATTN

Type:: int (0 or 1)
Default:: 1
Description:: Enable or disable FusedAttention backend (cuDNN-based) for DotProductAttention. When set to 0, FusedAttention will not be used.

NVTE_UNFUSED_ATTN

Type:: int (0 or 1)
Default:: 1
Description:: Enable or disable UnfusedDotProductAttention backend (native PyTorch). When set to 0, UnfusedDotProductAttention will not be used.

NVTE_FUSED_ATTN_BACKEND

Type:: int (0, 1, or 2)
Default:: Auto-selected
Description:: Force a specific FusedAttention backend. 0 = F16_max512_seqlen (cuDNN, ≤512 seq len), 1 = F16_arbitrary_seqlen (cuDNN, any seq len), 2 = FP8 backend. If not set, the backend is automatically selected based on the input configuration.

NVTE_FUSED_ATTN_FORCE_WORKSPACE_OPT

Type:: int (0 or 1)
Default:: Auto-determined
Description:: Control workspace-related optimizations in FusedAttention. 0 disables optimizations, 1 enables them. These optimizations trade memory for performance. When unset, Transformer Engine determines the code path based on internal logic. For deterministic behavior with cuDNN ≥8.9.5 and <9.0.0, this is automatically set to 1.

NVTE_FUSED_ATTN_USE_FAv2_BWD

Type:: int (0 or 1)
Default:: 0
Description:: When using FusedAttention, use FlashAttention-2 implementation for the backward pass instead of the cuDNN implementation. This can be useful due to performance differences between various versions of flash-attn and FusedAttention.

NVTE_ALLOW_NONDETERMINISTIC_ALGO

Type:: int (0 or 1)
Default:: 1
Description:: Allow non-deterministic algorithms for Transformer Engine execution. When set to 0, only deterministic algorithms are allowed. This is relevant for both PyTorch and JAX attention implementations.

NVTE_FUSED_RING_ATTENTION_USE_SCAN

Type:: int (0 or 1)
Default:: 0
Description:: (JAX only) Use scan loop for ring attention implementation. When set to 1, the fused ring attention will use a scan-based iteration approach.

NVTE_APPLY_QK_LAYER_SCALING

Type:: int (0 or 1)
Default:: 0
Description:: Apply QK layer scaling in UnfusedDotProductAttention. This is an FP16 training trick required for certain GPT-like models. When set to 1 and a layer number is provided, the softmax scale is divided by the layer number, and the layer number is used as the softmax scale during the softmax operation. Only effective when using FP16 dtype and when the layer number is specified.

Context Parallelism

NVTE_BATCH_MHA_P2P_COMM

Type:: int (0 or 1)
Default:: 0 (or auto-enabled for pre-Blackwell GPUs with CP size 2)
Description:: Use batched P2P communication (batch_isend_irecv) for KV exchange in context parallel MultiheadAttention. When enabled, send and receive operations are batched together, which can improve communication efficiency. This is automatically enabled for devices with compute capability < 10.0 (pre-Blackwell GPUs) when context parallel size is 2. Setting this to 1 forces batched P2P communication regardless of device architecture.

FP8 Configuration

NVTE_UNFUSED_FP8_UPDATE

Type:: int (0 or 1)
Default:: 0
Description:: Use unfused kernel for FP8 amax and scale updates. When set to 1, amax and scale updates are computed using separate unfused kernels instead of fused operations.

NVTE_FP8_DPA_BWD

Type:: int (0 or 1)
Default:: 1
Description:: Enable FP8 in the backward pass of DotProductAttention. 1 = FP8 forward and backward, 0 = FP8 forward and FP16/BF16 backward.

NVTE_DPA_FP8CS_O_in_F16

Type:: int (0 or 1)
Default:: 1
Description:: For Float8CurrentScaling in DotProductAttention, use FP16/BF16 for the output tensor in the backward pass. 1 = use F16/BF16 output in backward, 0 = use FP8 output in backward.

NVTE_DPA_FP8_RECIPE

Type:: str
Default:: Empty (use same as linear layers)
Description:: Override FP8 recipe for DotProductAttention layers. Valid values: "F16" (disable FP8), "DelayedScaling", or "Float8CurrentScaling". This allows using different FP8 recipes for attention vs. linear layers.

NVTE_DPA_FP8_RECIPE_DPA

Type:: int (0 or 1)
Default:: 0
Description:: Enable FP8 in DotProductAttention when using NVTE_DPA_FP8_RECIPE. When set to 1, the DotProductAttention layer will use the FP8 recipe specified by NVTE_DPA_FP8_RECIPE. This provides fine-grained control over which attention components use FP8.

NVTE_DPA_FP8_RECIPE_MHA

Type:: int (0 or 1)
Default:: 0
Description:: Enable FP8 in MultiheadAttention (MHA) when using NVTE_DPA_FP8_RECIPE. When set to 1, the MultiheadAttention QKV and output projection layers will use the FP8 recipe specified by NVTE_DPA_FP8_RECIPE. This provides fine-grained control over which attention components use FP8.

NVTE_DPA_FP8_FORMAT

Type:: str
Default:: "HYBRID"
Description:: FP8 format for DotProductAttention when switching recipes. Valid values: "HYBRID", "E4M3", "E5M2". Only used when NVTE_DPA_FP8_RECIPE is set.

NVTE_DPA_FP8DS_AMAX_ALGO

Type:: str
Default:: "most_recent"
Description:: Amax computation algorithm for DelayedScaling recipe in DotProductAttention. Valid values: "most_recent", "max". Only used when NVTE_DPA_FP8_RECIPE is set to "DelayedScaling".

NVTE_DPA_FP8DS_AMAX_HISTLEN

Type:: int
Default:: 1
Description:: Amax history length for DelayedScaling recipe in DotProductAttention. Only used when NVTE_DPA_FP8_RECIPE is set to "DelayedScaling".

NVTE_DPA_FP8DS_REDUCE_AMAX

Type:: int (0 or 1)
Default:: 1
Description:: Reduce amax across distributed ranks for DelayedScaling recipe in DotProductAttention. Only used when NVTE_DPA_FP8_RECIPE is set to "DelayedScaling".

NVTE_UnfusedDPA_Emulate_FP8

Type:: int (0 or 1)
Default:: 0
Description:: Allow FP8 emulation in UnfusedDotProductAttention. When set to 1, UnfusedDotProductAttention can emulate FP8 operations using FP16/BF16 computation.

Kernel Configuration

NVTE_USE_FAST_MATH

Type:: int (0 or 1)
Default:: 0
Description:: Enable fast math optimizations in runtime-compiled (NVRTC) kernels. This trades numerical accuracy for performance. These optimizations are experimental and inconsistently implemented.

NVTE_DISABLE_NVRTC

Type:: int (0 or 1)
Default:: 0
Description:: Disable NVRTC (CUDA Runtime Compilation) support. When set to 1, runtime kernel compilation is disabled. This can be useful in environments where NVRTC is not available or not desired.

NVTE_USE_CUTLASS_GROUPED_GEMM

Type:: int (0 or 1)
Default:: 0
Description:: Use CUTLASS implementation for grouped GEMM operations instead of cuBLAS. When set to 1, enables CUTLASS grouped GEMM kernels, which may provide better performance for certain workloads on Hopper (SM90) GPUs.

NVTE_CUTLASS_GROUPED_GEMM_WARN_FALLBACK

Type:: int (0 or 1)
Default:: 0
Description:: Emit a warning when falling back from CUTLASS to cuBLAS for grouped GEMM operations.

Torch Compilation and Fusion

NVTE_TORCH_COMPILE

Type:: int (0 or 1)
Default:: 1
Description:: Enable PyTorch 2.x torch.compile support for compatible Transformer Engine operations. When set to 0, disables compilation support and uses regular PyTorch eager mode.

NVTE_BIAS_GELU_NVFUSION

Type:: int (0 or 1)
Default:: 1
Description:: Enable GELU fusion with bias using NVFusion in PyTorch. When set to 0, uses separate bias and GELU operations.

NVTE_BIAS_DROPOUT_FUSION

Type:: int (0 or 1)
Default:: 1
Description:: Enable fusion of bias and dropout operations. When set to 0, bias and dropout are computed separately.

LayerNorm/RMSNorm SM Margins

NVTE_FWD_LAYERNORM_SM_MARGIN

Type:: int
Default:: 0
Description:: Number of SMs (Streaming Multiprocessors) to reserve (not use) during forward LayerNorm/RMSNorm operations. This can be used to control resource allocation and overlap computation with communication.

NVTE_BWD_LAYERNORM_SM_MARGIN

Type:: int
Default:: 0
Description:: Number of SMs to reserve during backward LayerNorm/RMSNorm operations.

NVTE_INF_LAYERNORM_SM_MARGIN

Type:: int
Default:: 0
Description:: Number of SMs to reserve during inference LayerNorm/RMSNorm operations.

GEMM Configuration

NVTE_EXT_MARGIN_SM

Type:: int
Default:: Total SM count
Description:: External SM margin for GEMM operations. Specifies the number of SMs to use for GEMM operations. The actual number of SMs used is sm_count - NVTE_EXT_MARGIN_SM.

NVTE_AG_P2P_MULTI_ATOMIC

Type:: int (0 or 1)
Default:: 0
Description:: Enable multi-atomic mode for AllGather with atomic GEMM using P2P communication. When set to 1, uses userbuffers_sendrecv_multiatomic for communication during atomic GEMM overlap with AllGather operations. This disables copy engine (CE) usage and enables push mode for userbuffers. This is an advanced optimization for tensor-parallel communication-computation overlap.

CPU Offloading

NVTE_CPU_OFFLOAD_V1

Type:: int (0 or 1)
Default:: 0
Description:: Enable legacy version of CPU offloading implementation.

Debugging and Profiling

NVTE_DEBUG

Type:: int (0 or 1)
Default:: 0
Description:: Enable debug mode. When set to 1, enables verbose debug output and additional checks in attention operations.

NVTE_DEBUG_LEVEL

Type:: int (0, 1, or 2)
Default:: 0
Description:: Debug verbosity level. Higher values enable more verbose debug output. Only effective when NVTE_DEBUG is set to 1.

NVTE_PRINT_LAYER_NUMBER

Type:: int
Default:: 1
Description:: Layer number to print debug information for during attention operations.

NVTE_PRINT_RANK

Type:: int
Default:: 0
Description:: Distributed rank to print debug information for during attention operations.

NVTE_NVTX_ENABLED

Type:: int (0 or 1)
Default:: 0
Description:: Enable NVTX (NVIDIA Tools Extension) range profiling for Transformer Engine operations. When set to 1, NVTX markers are added to operations for profiling with NVIDIA Nsight Systems.

NVTE_DEBUG_NUMERICS

Type:: int (0 or 1)
Default:: 0
Description:: (JAX only) Enable verbose printing of tensor numerics for debugging purposes.

Testing

NVTE_TEST_NVINSPECT_ENABLED

Type:: int (0 or 1)
Default:: 0
Description:: Enable NVInspect integration for testing. When set to 1, enables the NVInspect debugging API for numerical analysis during tests.

NVTE_TEST_NVINSPECT_CONFIG_FILE

Type:: str
Default:: None
Description:: Path to NVInspect configuration file. Required when NVTE_TEST_NVINSPECT_ENABLED is set to 1.

NVTE_TEST_NVINSPECT_FEATURE_DIRS

Type:: str
Default:: None
Description:: Comma-separated list of directories containing NVInspect features. Required when NVTE_TEST_NVINSPECT_ENABLED is set to 1.

NVTE_TEST_ARTIFACTS_DIR

Type:: str
Default:: System temp directory
Description:: Directory for storing test artifacts (e.g., generated ONNX models).

ONNX Export

NVTE_ONNX_KVCACHE_MAX_SEQ_LEN

Type:: int
Default:: 128
Description:: Maximum sequence length for KV cache during ONNX export. This is used for attention masking in exported ONNX models.

JAX-Specific Variables

NVTE_JAX_CUSTOM_CALLS

Type:: str
Default:: None
Description:: Control which JAX custom call primitives are enabled or disabled. Format: "true" (enable all), "false" (disable all), or comma-separated key-value pairs like "GemmPrimitive=false,DBiasQuantizePrimitive=true". This provides fine-grained control over which operations use custom CUDA kernels vs. JAX native implementations.

NVTE_JAX_CUSTOM_CALLS_RE

Type:: str
Default:: None
Description:: Deprecated (use NVTE_JAX_CUSTOM_CALLS instead). Regex pattern to match primitive names for enabling/disabling. Example: "DBiasQuantizePrimitive" or "^(?!DBiasQuantizePrimitive$).+$".

NVTE_JAX_UNITTEST_LEVEL

Type:: str
Default:: None
Description:: Test level for JAX unit tests ("L0", "L1", "L2"). Used internally by the test suite.

JAX Triton Extensions

NVTE_USE_PYTORCH_TRITON

Type:: int (0 or 1)
Default:: 0
Description:: Explicitly acknowledge using pytorch-triton for JAX Triton kernels. When both JAX and PyTorch are installed in the same environment, PyTorch’s pytorch-triton package may be imported instead of the standard triton package from OpenAI. Setting this to 1 suppresses the compatibility warning emitted in that situation. pytorch-triton (the real package from PyTorch’s package index, not the placeholder on PyPI) is compatible with JAX Triton kernels.

NVTE_JAX_ENFORCE_TRITON_AUTOTUNING

Type:: int (0 or 1)
Default:: 0
Description:: Raise a RuntimeError when the installed JAX is too old to safely run TritonAutotunedKernelCall (jax-ml/jax#35218) instead of silently falling back to non-autotuned dispatch. Useful for CI or debugging to ensure Triton autotuning is active. When set to 0 (default), old JAX versions silently fall back to single-config (non-autotuned) kernel dispatch for compatibility.

Examples

Building with Debug Symbols

export NVTE_BUILD_DEBUG=1
export NVTE_USE_CCACHE=1
pip install -e .

Using Specific Attention Backend

# Use only FlashAttention, disable FusedAttention
export NVTE_FLASH_ATTN=1
export NVTE_FUSED_ATTN=0
python train.py

Configuring FP8 for Attention

# Use DelayedScaling for attention, CurrentScaling for linear layers
export NVTE_DPA_FP8_RECIPE="DelayedScaling"
export NVTE_DPA_FP8_FORMAT="HYBRID"
export NVTE_DPA_FP8DS_AMAX_ALGO="most_recent"
export NVTE_DPA_FP8DS_AMAX_HISTLEN=1024
python train.py

Enable Profiling

# Enable NVTX markers for profiling
export NVTE_NVTX_ENABLED=1
nsys profile --trace=nvtx,cuda python train.py

JAX Custom Calls Control

# Disable all custom calls
export NVTE_JAX_CUSTOM_CALLS="false"
python train_jax.py

# Disable specific primitives
export NVTE_JAX_CUSTOM_CALLS="GemmPrimitive=false,DBiasQuantizePrimitive=false"
python train_jax.py