Deterministic Training and Reproducibility

Reproducible training—obtaining bit-for-bit identical results across repeated runs with the same seed, data, and hardware—is important for debugging, regression testing, and scientific comparison of experiments. By default, several GPU kernels used in deep learning (cuDNN convolution algorithms, fused attention backward passes, and the custom MultiScaleDeformableAttention CUDA op) are non-deterministic: they use atomic accumulation or auto-tuned algorithm selection, so gradients can differ slightly between runs even with a fixed seed.
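
The root cause of the run-to-run differences described above is that floating-point addition is not associative, so the order in which partial sums are accumulated changes the low-order bits. The following is a minimal CPU-only sketch of that effect in plain Python. It does not use any GPU kernel, and the shuffled list merely stands in for atomic additions that land in a different order on each run.

```python
import random


def accumulate(values):
    """Add values strictly left to right, i.e. in one fixed accumulation order."""
    total = 0.0
    for value in values:
        total += value
    return total


# Floating-point addition is not associative: regrouping changes the rounding.
grouped_left = (0.1 + 0.2) + 0.3
grouped_right = 0.1 + (0.2 + 0.3)
print(grouped_left == grouped_right)  # False

# The same numbers summed in two different orders, standing in for GPU threads
# whose atomic additions arrive in a different order on each run.
rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) for _ in range(100_000)]
reordered = values[:]
rng.shuffle(reordered)
print(accumulate(values) - accumulate(reordered))  # typically tiny, often nonzero
```

A deterministic kernel avoids this by fixing the accumulation order, which is also why it tends to be slower.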

Starting in TAO 7.0.1, the train command exposes a set of flags that route these operations through deterministic code paths, yielding reproducible gradients—at some cost in training speed and memory.

The primary control, train.cudnn.deterministic, defaults to True in TAO 7.0.1, so cuDNN determinism and the deterministic SDPA math backend are active out of the box; you do not need to set anything to get them. The per-model model.precise_msda flag defaults to False and must be set explicitly to make MultiScaleDeformableAttention deterministic. To turn cuDNN determinism off for maximum speed, set train.cudnn.deterministic=False.

Overview of the Deterministic Controls

TAO 7.0.1 introduces three layered controls:

  • cuDNN determinism (train.cudnn.deterministic, default True) — a common training flag, available for every PyTorch model. When enabled, TAO enables cuDNN deterministic algorithms, calls torch.use_deterministic_algorithms(True, warn_only=True), sets the cuBLAS workspace configuration, and forces the deterministic math backend for scaled dot-product attention (SDPA).

  • Deterministic SDPA (math backend) — enabled automatically as part of train.cudnn.deterministic. The flash and memory-efficient attention backends have non-deterministic backward kernels; only the math backend is deterministic. TAO disables the former two and forces the math backend so attention-heavy backbones (for example, C-RADIO ViT) are reproducible.

  • Deterministic MSDeformAttn (model.precise_msda) — a per-model flag for networks that use MultiScaleDeformableAttention. When set, TAO routes attention through a deterministic pure-PyTorch implementation instead of the custom CUDA op, whose atomicAdd backward has no deterministic kernel.

For fully reproducible training of an MSDeformAttn-based model, set both train.cudnn.deterministic=True and model.precise_msda=True. You should also fix the seed via train.seed (the default is 1234; a value below 0 disables the fixed seed).
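
The seed convention described above (a non-negative value fixes the seed, a negative value leaves it unfixed) can be sketched as follows. The helper name apply_seed is hypothetical and is not part of TAO. The sketch seeds only Python's built-in random module, whereas a real training flow would presumably also seed NumPy and PyTorch.

```python
import random


def apply_seed(seed: int) -> bool:
    """Seed Python's RNG when seed >= 0; return whether a seed was applied.

    Hypothetical helper for illustration only, not TAO's implementation.
    """
    if seed < 0:
        # A negative value means "do not fix the seed".
        return False
    random.seed(seed)
    return True


apply_seed(1234)
first_draw = random.random()
apply_seed(1234)
second_draw = random.random()
print(first_draw == second_draw)  # True: same seed, same sequence
```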

Enabling Deterministic Training

cuDNN Determinism and Deterministic SDPA

The train.cudnn configuration is common to all TAO PyTorch models:

train:
  seed: 1234
  cudnn:
    benchmark: false        # leave false for determinism (default)
    deterministic: true     # enable deterministic cuDNN + SDPA math backend

When train.cudnn.deterministic is True, TAO performs the following at the start of training:

  • Sets torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False.

  • Sets CUBLAS_WORKSPACE_CONFIG=:4096:8 (via os.environ.setdefault) so cuBLAS GEMM kernels are deterministic. Because setdefault is used, a value you have already exported for CUBLAS_WORKSPACE_CONFIG is left untouched rather than overridden.

  • Calls torch.use_deterministic_algorithms(True, warn_only=True). The warn_only=True setting means that any op without a deterministic CUDA kernel (for example, the bilinear F.interpolate backward used by several TAO models) will emit a warning rather than raise an error, so existing training runs continue to work.

  • Disables the flash and memory-efficient SDPA backends and forces the deterministic math SDPA backend.
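
For readers who want to see what the steps above amount to in plain PyTorch, here is a minimal sketch. It is not TAO's source code: the function names configure_cublas_workspace and enable_torch_determinism are hypothetical, and the sketch assumes a PyTorch 2.x build that provides the torch.backends.cuda SDPA toggles.

```python
import os


def configure_cublas_workspace() -> str:
    """Set the cuBLAS workspace config unless the user already exported one."""
    # setdefault writes the value only when the variable is unset, so an
    # existing CUBLAS_WORKSPACE_CONFIG is returned unchanged.
    return os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")


def enable_torch_determinism() -> None:
    """Apply the PyTorch-side settings listed above (illustrative only)."""
    # Imported lazily so this sketch can be loaded without PyTorch installed.
    import torch

    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # warn_only=True: ops without a deterministic kernel warn instead of raising.
    torch.use_deterministic_algorithms(True, warn_only=True)
    # Leave only the math backend enabled for scaled dot-product attention.
    torch.backends.cuda.enable_flash_sdp(False)
    torch.backends.cuda.enable_mem_efficient_sdp(False)
    torch.backends.cuda.enable_math_sdp(True)
```

In a script of your own, call configure_cublas_workspace() before any CUDA work starts, since cuBLAS reads the variable when its handle is created, and then call enable_torch_determinism().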

Note

train.cudnn.deterministic defaults to True and train.cudnn.benchmark defaults to False in TAO 7.0.1, so the YAML above simply restates the defaults. Keep benchmark set to False when you need reproducibility, because cuDNN benchmarking auto-selects algorithms that can vary between runs.

Deterministic MSDeformAttn (precise_msda)

Models that use MultiScaleDeformableAttention rely on a fused custom CUDA op whose backward pass accumulates gradients with atomicAdd, which is non-deterministic. The model.precise_msda flag (default False) opts in to a numerically equivalent pure-PyTorch implementation that samples with index_select (whose backward, index_add, has a deterministic CUDA kernel under torch.use_deterministic_algorithms):

model:
  precise_msda: true

The flag is read once at model-build time. Attention runs through the deterministic path on CPU, and on CUDA when precise_msda is enabled; otherwise the fused CUDA op is used (the default and fastest behavior).
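
The path selection just described reduces to a small decision rule. The function below is a hypothetical illustration of that rule, not TAO's code, and it covers only the "cpu" and "cuda" device types that the text mentions.

```python
def uses_deterministic_msda(device_type: str, precise_msda: bool) -> bool:
    """Return True when attention would take the pure-PyTorch path.

    Hypothetical helper mirroring the documented rule:
    CPU always uses the deterministic path; CUDA uses it only on opt-in.
    """
    if device_type == "cpu":
        return True
    if device_type == "cuda":
        return precise_msda
    raise ValueError(f"unsupported device type in this sketch: {device_type!r}")


for device in ("cpu", "cuda"):
    for flag in (False, True):
        path = "pure-PyTorch" if uses_deterministic_msda(device, flag) else "fused CUDA op"
        print(f"device={device} precise_msda={flag} -> {path}")
```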

Supported Models

The train.cudnn.deterministic flag and the deterministic SDPA math backend apply to all TAO PyTorch models that train through the common training flow.

The model.precise_msda flag is available for the models that use MultiScaleDeformableAttention:

Model / task        Config field          Notes
deformable_detr     model.precise_msda    Deformable DETR detection head.
grounding_dino      model.precise_msda    Includes mask_grounding_dino.
visual_changenet    model.precise_msda    Applies to the C-RADIO ViT-Adapter’s MSDeformAttn.

Note

The DINO detection head also uses MultiScaleDeformableAttention, but DINO does not expose a model.precise_msda field in TAO 7.0.1—so its MSDeformAttn op cannot be forced onto the deterministic path. cuDNN determinism and the deterministic SDPA math backend still apply to DINO (they are model-independent), but full MSDeformAttn determinism is unavailable for DINO. To confirm whether a given task exposes the flag, inspect its generated spec schema, for example:

<task> train --help 2>&1 | grep precise_msda

If the field is listed (as it is for deformable_detr, grounding_dino, and visual_changenet in the table above), the model supports deterministic MSDeformAttn.

Caveats and Performance Trade-offs

  • Slower and more memory-hungry. Deterministic kernels are generally slower than their auto-tuned or fused counterparts. The math SDPA backend in particular uses more memory and is slower than flash / memory-efficient attention, and the pure-PyTorch MSDeformAttn path costs additional speed and memory versus the fused CUDA op. Because train.cudnn.deterministic defaults to True, the cuDNN and SDPA costs apply unless you turn it off; the MSDeformAttn cost applies only when you set model.precise_msda=True.

  • Not all ops have deterministic kernels. With warn_only=True, an operation that has no deterministic kernel (for example, certain bilinear interpolation backward passes) still runs its non-deterministic kernel and emits a warning instead of raising an error. Full bit-for-bit reproducibility therefore cannot be guaranteed for every model; the controls reduce non-determinism rather than eliminate it universally.

  • Reproducibility is per-hardware. Identical results are expected on the same GPU architecture, library versions, and GPU count. Changing the GPU model, CUDA/cuDNN version, or the number of GPUs can still alter results.

  • Set the seed. Determinism flags do not fix initialization randomness on their own. Use train.seed (default 1234) to obtain repeatable initialization and data ordering; a seed below 0 disables the fixed seed.
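
One way to check whether two runs really are bit-for-bit identical is to compare a cryptographic hash of an artifact from each run. The sketch below uses only the Python standard library. The helper names are hypothetical, and it assumes the compared files contain no run-specific metadata such as timestamps. If they do, compare the stored tensors instead of the raw bytes.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def runs_match(first: Path, second: Path) -> bool:
    """True when the two files are byte-for-byte identical (by digest)."""
    return file_sha256(first) == file_sha256(second)
```

Run the same experiment twice with the same seed, hardware, and library versions, then pass one output file from each run to runs_match.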

Example

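The following override (shown for grounding_dino, but identical in form for deformable_detr and visual_changenet) turns on every deterministic control from the command line. The train.seed and train.cudnn values restate the defaults and are listed only for explicitness; model.precise_msda=True is the one setting that changes default behavior: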

grounding_dino train \
  -e /path/to/experiment.yaml \
  results_dir=/path/to/results \
  train.seed=1234 \
  train.cudnn.deterministic=True \
  train.cudnn.benchmark=False \
  model.precise_msda=True

Equivalently, in the experiment spec:

train:
  seed: 1234
  cudnn:
    benchmark: false
    deterministic: true
model:
  precise_msda: true
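
To make the correspondence between the dotted command-line overrides and the nested spec concrete, the following sketch expands key=value overrides into a nested dictionary. It is only an illustration: TAO's own configuration system performs the real parsing, and the helper names and the simplified value parsing here are assumptions.

```python
from typing import Any


def parse_value(text: str) -> Any:
    """Interpret booleans and integers; leave everything else as a string."""
    lowered = text.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return int(text)
    except ValueError:
        return text


def apply_overrides(overrides: list[str]) -> dict[str, Any]:
    """Expand dotted key=value overrides into a nested dictionary."""
    spec: dict[str, Any] = {}
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        *parents, leaf = dotted_key.split(".")
        node = spec
        for name in parents:
            # Create intermediate levels on demand.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw_value)
    return spec


spec = apply_overrides([
    "train.seed=1234",
    "train.cudnn.deterministic=True",
    "train.cudnn.benchmark=False",
    "model.precise_msda=True",
])
print(spec)
```

The resulting dictionary has the same shape as the YAML spec above: a train section holding seed and cudnn, and a model section holding precise_msda.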