Distributed Setup (Python API)
When you use the YAML recipes or the automodel CLI, distributed training is
configured from the distributed: section of your config (see
Configuration). When you call the NeMoAutoModel*
loaders directly from Python, you describe the same topology and execution
policies with a single typed object: DistributedSetup.
Quick start
Build a DistributedSetup and pass it to from_pretrained (or from_config):
The same distributed_setup= keyword works on every NeMo AutoModel loader,
including NeMoAutoModelForImageTextToText,
NeMoAutoModelForSequenceClassification, and
NeMoAutoModelForTokenClassification.
DistributedSetup.build
DistributedSetup.build resolves a device mesh and the execution policies from
your requested parallelism sizes. It is intentionally forgiving about input
types — strategy accepts a string or a config object, and the pipeline / MoE
configs accept either a dataclass or a plain dict.
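The "dataclass or plain dict" leniency can be pictured with a small normalization helper. This is a minimal sketch, not NeMo's implementation: `PipelineConfigSketch`, its fields, and `coerce_config` are hypothetical stand-ins invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional, Type, TypeVar

T = TypeVar("T")


@dataclass
class PipelineConfigSketch:
    """Hypothetical stand-in config; the field names are illustrative only."""

    num_stages: int = 2
    schedule: str = "default"


def coerce_config(value: Any, cls: Type[T]) -> Optional[T]:
    """Return `value` as an instance of `cls`, accepting an instance or a mapping."""
    if value is None:
        return None
    if isinstance(value, cls):
        return value  # already the typed object, pass it through unchanged
    if isinstance(value, Mapping):
        return cls(**value)  # unknown keys surface as TypeError from the constructor
    raise TypeError(f"expected {cls.__name__} or a mapping, got {type(value).__name__}")


# Both spellings normalize to the same typed object.
from_dict = coerce_config({"num_stages": 4}, PipelineConfigSketch)
from_obj = coerce_config(PipelineConfigSketch(num_stages=4), PipelineConfigSketch)
```

Normalizing once at the boundary means the rest of the code only ever sees the typed form.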
ParallelismSizes records your durable intent: the sizes you requested. The
resolved runtime topology lives on DistributedSetup.mesh_context (a
MeshContext), which derives its sizes from the live DeviceMesh after build.
Validation happens at construction time, so invalid combinations fail fast
instead of deep inside training. For example, passing a pipeline_config
without pp_size > 1, or a moe_parallel_config without ep_size > 1, raises
a ValueError.
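The two invalid combinations named above can be expressed as a small fail-fast check. The sketch below only mirrors the documented rule. `SizesSketch` and `check_policy_sizes` are hypothetical names, and the real checks inside DistributedSetup.build may differ in wording and scope.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class SizesSketch:
    """Hypothetical stand-in for the requested parallelism sizes."""

    pp_size: int = 1
    ep_size: int = 1


def check_policy_sizes(
    sizes: SizesSketch,
    pipeline_config: Optional[Mapping[str, Any]] = None,
    moe_parallel_config: Optional[Mapping[str, Any]] = None,
) -> None:
    """Raise ValueError when a policy is supplied without the matching size."""
    if pipeline_config is not None and sizes.pp_size <= 1:
        raise ValueError("pipeline_config requires pp_size > 1")
    if moe_parallel_config is not None and sizes.ep_size <= 1:
        raise ValueError("moe_parallel_config requires ep_size > 1")


# A consistent request passes silently.
check_policy_sizes(SizesSketch(pp_size=2), pipeline_config={})
```

Running the check at construction time is what lets a misconfigured job stop before any process group or model weights are touched.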
Plain device mesh shortcut
If you only need a topology and no NeMo-specific policies, you can pass a
pre-created Hugging Face-style DeviceMesh directly as device_mesh=. NeMo
wraps it in a topology-only DistributedSetup internally:
Pass either distributed_setup or device_mesh, not both. Use
distributed_setup whenever you need strategy, pipeline, MoE, or
activation-checkpointing policies.
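The "one or the other" rule amounts to a mutual-exclusion check on two optional arguments. The following is a hedged sketch of that pattern only. `resolve_distributed_argument` is a hypothetical helper, and the exception type is an assumption, since the text above does not say which error the loaders raise.

```python
from typing import Any, Dict, Optional


def resolve_distributed_argument(
    distributed_setup: Optional[Any] = None,
    device_mesh: Optional[Any] = None,
) -> Optional[Dict[str, Any]]:
    """Pick the single distributed argument that was supplied, if any."""
    if distributed_setup is not None and device_mesh is not None:
        # Assumed error type; the real loaders may raise something else.
        raise ValueError("pass either distributed_setup or device_mesh, not both")
    if distributed_setup is not None:
        return {"source": "distributed_setup", "value": distributed_setup}
    if device_mesh is not None:
        # A bare mesh carries topology only, so no policies are attached.
        return {"source": "device_mesh", "value": device_mesh, "policies": None}
    return None  # no distributed argument was given
```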
Migrating from the per-keyword API
Earlier releases accepted a flat set of distributed keywords on
from_pretrained / from_config. These are now consolidated into the single
distributed_setup object.
Before:
After:
The following keywords are no longer accepted directly on the loaders and
raise a TypeError if passed: distributed_config, moe_config, moe_mesh,
pipeline_config, tp_plan, and activation_checkpointing. Move them onto
DistributedSetup.build. Note the rename moe_config → moe_parallel_config,
and that flat size keywords (e.g. tp_size) now live on ParallelismSizes.
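When migrating a large codebase, it can help to scan loader keyword arguments for the removed names before the loader rejects them. The helper below is a hypothetical pre-flight check written for this guide, not part of NeMo AutoModel. It encodes only the removed keywords and the single rename listed above.

```python
from typing import Any, List, Mapping

# Keywords the text above lists as no longer accepted on the loaders.
REMOVED_LOADER_KEYWORDS = (
    "distributed_config",
    "moe_config",
    "moe_mesh",
    "pipeline_config",
    "tp_plan",
    "activation_checkpointing",
)

# The only rename stated in the text above.
RENAMED_KEYWORDS = {"moe_config": "moe_parallel_config"}


def find_legacy_keywords(loader_kwargs: Mapping[str, Any]) -> List[str]:
    """Return one migration hint per removed keyword found in `loader_kwargs`."""
    hints = []
    for name in REMOVED_LOADER_KEYWORDS:
        if name in loader_kwargs:
            target = RENAMED_KEYWORDS.get(name, name)
            hints.append(f"{name}: move to DistributedSetup.build as {target}")
    return hints


# Example: an old-style call site that still passes two removed keywords.
hints = find_legacy_keywords({"moe_config": {}, "tp_plan": None, "torch_dtype": "bf16"})
```

An empty result means the call site no longer uses any of the removed keywords.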
See also
- Configuration: the equivalent distributed: YAML section used by recipes and the CLI.
- Pipeline Parallelism: the lower-level AutoPipeline building blocks for custom training loops.
- Gradient Checkpointing: full and selective activation checkpointing.
- DistributedSetup source.