Deterministic Training and Reproducibility#
Reproducible training—obtaining bit-for-bit identical results across repeated runs with the same seed, data, and hardware—is important for debugging, regression testing, and scientific comparison of experiments. By default, several GPU kernels used in deep learning (cuDNN convolution algorithms, fused attention backward passes, and the custom MultiScaleDeformableAttention CUDA op) are non-deterministic: they use atomic accumulation or auto-tuned algorithm selection, so gradients can differ slightly between runs even with a fixed seed.
Starting in TAO 7.0.1, the train command exposes a set of flags that route these
operations through deterministic code paths, yielding reproducible gradients—at some cost
in training speed and memory.
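The run-to-run differences come from floating-point addition not being associative: when atomic operations let threads add their contributions in whatever order they happen to arrive, the same values can sum to slightly different totals. The following is a minimal pure-Python illustration of that effect, not TAO or CUDA code; the `accumulate` helper and the sample values are invented for the demonstration.

```python
def accumulate(values, order):
    """Sum `values` in the given index order, mimicking one possible
    arrival order of atomic adds."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Illustrative "gradient contributions" with very different magnitudes.
contributions = [1e16, 1.0, -1e16, 1.0]

in_order = accumulate(contributions, [0, 1, 2, 3])
reordered = accumulate(contributions, [0, 2, 1, 3])

# Same inputs, different accumulation order, different totals.
print(in_order, reordered, in_order == reordered)
```

A deterministic kernel avoids this by fixing the accumulation order, which is also why it tends to be slower than an atomic-based one.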
The primary control, train.cudnn.deterministic, defaults to True in
TAO 7.0.1, so cuDNN determinism and the deterministic SDPA math backend are active out of
the box; you do not need to set anything to get them. The per-model
model.precise_msda flag defaults to False and must be set explicitly to make
MultiScaleDeformableAttention deterministic. To turn cuDNN determinism off for maximum
speed, set train.cudnn.deterministic=False.
Overview of the Deterministic Controls#
TAO 7.0.1 introduces three layered controls:
- cuDNN determinism (train.cudnn.deterministic, default True): a common training flag, available for every PyTorch model. When enabled, TAO enables cuDNN deterministic algorithms, calls torch.use_deterministic_algorithms(True, warn_only=True), sets the cuBLAS workspace configuration, and forces the deterministic math backend for scaled dot-product attention (SDPA).
- Deterministic SDPA (math backend): enabled automatically as part of train.cudnn.deterministic. The flash and memory-efficient attention backends have non-deterministic backward kernels; only the math backend is deterministic. TAO disables the former two and forces the math backend so attention-heavy backbones (for example, C-RADIO ViT) are reproducible.
- Deterministic MSDeformAttn (model.precise_msda): a per-model flag for networks that use MultiScaleDeformableAttention. When set, TAO routes attention through a deterministic pure-PyTorch implementation instead of the custom CUDA op, whose atomicAdd backward has no deterministic kernel.
For fully reproducible training of a MSDeformAttn-based model, set both
train.cudnn.deterministic=True and model.precise_msda=True. You should also
fix the seed via train.seed (the default is 1234; a value below 0 disables the
fixed seed).
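The seed rule can be sketched in a few lines. This is an illustration using only Python's standard `random` module; `apply_seed` and `draw` are hypothetical helpers, not TAO functions, and an actual training run would also need to seed the framework's own generators.

```python
import random


def apply_seed(seed: int) -> bool:
    """Seed Python's RNG unless `seed` is negative.

    Hypothetical helper: mirrors the documented rule that a value below 0
    disables the fixed seed. Returns True when a fixed seed was applied.
    """
    if seed < 0:
        return False  # leave the RNG in whatever state it already has
    random.seed(seed)
    return True


def draw(seed: int, n: int = 3) -> list:
    """Apply the seed rule, then draw n pseudo-random numbers."""
    apply_seed(seed)
    return [random.random() for _ in range(n)]


first = draw(1234)
second = draw(1234)
print(first == second)  # the same non-negative seed repeats the sequence
```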
Enabling Deterministic Training#
cuDNN Determinism and Deterministic SDPA#
The train.cudnn configuration is common to all TAO PyTorch models:
train:
  seed: 1234
  cudnn:
    benchmark: false     # leave false for determinism (default)
    deterministic: true  # enable deterministic cuDNN + SDPA math backend
When train.cudnn.deterministic is True, TAO performs the following at the
start of training:
- Sets torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False.
- Sets CUBLAS_WORKSPACE_CONFIG=:4096:8 (via os.environ.setdefault) so cuBLAS GEMM kernels are deterministic. Because setdefault is used, a value you have already exported for CUBLAS_WORKSPACE_CONFIG is left untouched rather than overridden.
- Calls torch.use_deterministic_algorithms(True, warn_only=True). The warn_only=True setting means that any op without a deterministic CUDA kernel (for example, the bilinear F.interpolate backward used by several TAO models) will emit a warning rather than raise an error, so existing training runs continue to work.
- Disables the flash and memory-efficient SDPA backends and forces the deterministic math SDPA backend.
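For readers who want to reproduce these settings in a standalone PyTorch script, the four steps listed above can be approximated as follows. This is a sketch under assumptions, not TAO's actual implementation: `enable_determinism` is a hypothetical helper name, and it assumes a PyTorch 2.x build where the `torch.backends.cuda.enable_*_sdp` toggles are available.

```python
import os

import torch


def enable_determinism() -> None:
    """Hypothetical helper approximating the four documented steps."""
    # 1. cuDNN: deterministic algorithms, no auto-tuned algorithm selection.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # 2. cuBLAS: setdefault keeps any value the user already exported.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")

    # 3. Prefer deterministic kernels; warn instead of raising where an op
    #    has no deterministic implementation.
    torch.use_deterministic_algorithms(True, warn_only=True)

    # 4. SDPA: switch off flash and memory-efficient, keep the math backend.
    torch.backends.cuda.enable_flash_sdp(False)
    torch.backends.cuda.enable_mem_efficient_sdp(False)
    torch.backends.cuda.enable_math_sdp(True)


enable_determinism()
```

The environment variable is set before any CUDA work is issued in this sketch, since cuBLAS reads it when it initializes.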
Note
train.cudnn.deterministic defaults to True and
train.cudnn.benchmark defaults to False in TAO 7.0.1, so the YAML above
simply restates the defaults. Keep benchmark set to False when you need
reproducibility, because cuDNN benchmarking auto-selects algorithms that can vary between
runs.
Deterministic MSDeformAttn (precise_msda)#
Models that use MultiScaleDeformableAttention rely on a fused custom CUDA op whose backward
pass accumulates gradients with atomicAdd, which is non-deterministic. The
model.precise_msda flag (default False) opts in to a numerically equivalent
pure-PyTorch implementation that samples with index_select (whose backward,
index_add, has a deterministic CUDA kernel under
torch.use_deterministic_algorithms):
model:
  precise_msda: true
The flag is read once at model-build time. On CPU—or when precise_msda is enabled
on CUDA—attention always runs through the deterministic path; otherwise the fused CUDA op
is used (the default, fastest behavior).
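The path-selection rule described above amounts to a small truth table. The function below is a hypothetical restatement of that rule for illustration only; its name and the assumption that the device is identified by the strings "cpu" or "cuda" are not taken from TAO.

```python
def uses_deterministic_msda(device_type: str, precise_msda: bool) -> bool:
    """Return True when attention would take the pure-PyTorch path.

    Hypothetical helper; `device_type` is assumed to be "cpu" or "cuda".
    """
    if device_type == "cpu":
        return True       # CPU always takes the deterministic path
    return precise_msda   # on CUDA, only when explicitly requested


for device in ("cpu", "cuda"):
    for flag in (False, True):
        print(device, flag, uses_deterministic_msda(device, flag))
```

Only the combination of CUDA with precise_msda left at its default selects the fused op.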
Supported Models#
The train.cudnn.deterministic flag and the deterministic SDPA math backend apply to
all TAO PyTorch models that train through the common training flow.
The model.precise_msda flag is available for the models that use
MultiScaleDeformableAttention:
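| Model / task | Config field | Notes |
|---|---|---|
| deformable_detr | model.precise_msda | Deformable DETR detection head. |
| grounding_dino | model.precise_msda | Includes |
| visual_changenet | model.precise_msda | Applies to the C-RADIO ViT-Adapter’s MSDeformAttn. |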
Note
The DINO detection head also uses MultiScaleDeformableAttention, but DINO does not
expose a model.precise_msda field in TAO 7.0.1—so its MSDeformAttn op cannot be
forced onto the deterministic path. cuDNN determinism and the deterministic SDPA math
backend still apply to DINO (they are model-independent), but full MSDeformAttn
determinism is unavailable for DINO. To confirm whether a given task exposes the flag,
inspect its generated spec schema, for example:
<task> train --help 2>&1 | grep precise_msda
If the field is listed (as it is for deformable_detr, grounding_dino, and
visual_changenet in the table above), the model supports deterministic MSDeformAttn.
Caveats and Performance Trade-offs#
- Slower and more memory-hungry. Deterministic kernels are generally slower than their auto-tuned or fused counterparts. The math SDPA backend in particular uses more memory and is slower than flash / memory-efficient attention, and the pure-PyTorch MSDeformAttn path costs additional speed and memory versus the fused CUDA op. Each cost is incurred only while the corresponding flag is enabled; because train.cudnn.deterministic defaults to True, the cuDNN and SDPA costs apply unless you turn that flag off.
- Not all ops have deterministic kernels. Because TAO uses warn_only=True, some operations (for example, certain bilinear interpolation backward passes) fall back to non-deterministic kernels and emit a warning. Full bit-for-bit reproducibility cannot be guaranteed for every model; the controls minimize non-determinism rather than eliminate it universally.
- Reproducibility is per-hardware. Identical results are expected on the same GPU architecture, library versions, and GPU count. Changing the GPU model, CUDA/cuDNN version, or the number of GPUs can still alter results.
- Set the seed. Determinism flags do not fix initialization randomness on their own. Use train.seed (default 1234) to obtain repeatable initialization and data ordering; a seed below 0 disables the fixed seed.
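Because reproducibility is not guaranteed for every model, it can help to check it empirically by running the same experiment twice and comparing the logged losses bit-for-bit. The sketch below assumes you have already collected per-step loss values from two runs as Python floats; `first_divergence` is a hypothetical helper and the loss values are made up for the example.

```python
from typing import List, Optional


def first_divergence(run_a: List[float], run_b: List[float]) -> Optional[int]:
    """Return the first step index where two loss logs differ bit-for-bit,
    or None if they are identical."""
    for step, (a, b) in enumerate(zip(run_a, run_b)):
        # float.hex() exposes the exact value, so numbers that round to the
        # same printed string but differ in the last bit are still caught.
        if a.hex() != b.hex():
            return step
    if len(run_a) != len(run_b):
        return min(len(run_a), len(run_b))  # one log ended early
    return None


# Made-up loss logs: the third value differs only in the last digits.
losses_a = [0.6931, 0.5012, 0.4127]
losses_b = [0.6931, 0.5012, 0.4127000000000001]

print(first_divergence(losses_a, losses_a))  # identical logs
print(first_divergence(losses_a, losses_b))  # index of the first mismatch
```

A mismatch at an early step usually points at a remaining non-deterministic op or an unseeded source of randomness rather than at accumulated drift.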
Example#
The following override (shown for grounding_dino, but identical in form for
deformable_detr and visual_changenet) enables fully deterministic training from the
command line:
grounding_dino train \
    -e /path/to/experiment.yaml \
    results_dir=/path/to/results \
    train.seed=1234 \
    train.cudnn.deterministic=True \
    train.cudnn.benchmark=False \
    model.precise_msda=True
Equivalently, in the experiment spec:
train:
  seed: 1234
  cudnn:
    benchmark: false
    deterministic: true
model:
  precise_msda: true
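To make the equivalence between the dotted command-line overrides and the nested spec concrete, the sketch below expands `key.path=value` strings into a nested dictionary. It is an illustration only and not TAO's actual override parser; `parse_value` and `apply_overrides` are hypothetical helpers that handle just the booleans, integers, and strings used in this example.

```python
def parse_value(raw: str):
    """Convert an override value to bool or int when it looks like one."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return int(raw)
    except ValueError:
        return raw  # keep anything else as a plain string


def apply_overrides(spec: dict, overrides: list) -> dict:
    """Write each 'a.b.c=value' override into the nested dict `spec`."""
    for item in overrides:
        dotted_key, raw = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = spec
        for name in parents:
            node = node.setdefault(name, {})  # create missing levels
        node[leaf] = parse_value(raw)
    return spec


spec = apply_overrides({}, [
    "train.seed=1234",
    "train.cudnn.deterministic=True",
    "train.cudnn.benchmark=False",
    "model.precise_msda=True",
])
print(spec)
```

The resulting dictionary has the same shape as the experiment spec shown above, with `seed` and `cudnn` nested under `train` and `precise_msda` under `model`.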