Distributed Setup (Python API)


When you use the YAML recipes or the automodel CLI, distributed training is configured from the distributed: section of your config (see Configuration). When you call the NeMoAutoModel* loaders directly from Python, you describe the same topology and execution policies with a single typed object: DistributedSetup.

Quick start

Build a DistributedSetup and pass it to from_pretrained (or from_config):

import torch
from nemo_automodel import NeMoAutoModelForCausalLM
from nemo_automodel.components.distributed import (
    DistributedSetup,
    FSDP2Config,
    ParallelismSizes,
    initialize_distributed,
)

# Initialize the process group with the NCCL backend.
dist_env = initialize_distributed("nccl")

# Describe the topology and execution policies in one object.
distributed_setup = DistributedSetup.build(
    strategy=FSDP2Config(sequence_parallel=True),
    parallelism_sizes=ParallelismSizes(tp_size=2),
    activation_checkpointing=True,
    world_size=dist_env.world_size,
)

model = NeMoAutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    distributed_setup=distributed_setup,
)

The same distributed_setup= keyword works on every NeMo AutoModel loader, including NeMoAutoModelForImageTextToText, NeMoAutoModelForSequenceClassification, and NeMoAutoModelForTokenClassification.

DistributedSetup.build

DistributedSetup.build resolves a device mesh and the execution policies from your requested parallelism sizes. It is intentionally forgiving about input types: strategy accepts a string or a config object, and the pipeline and MoE configs each accept either a dataclass or a plain dict.

| Argument | Type | Default | Purpose |
| --- | --- | --- | --- |
| strategy | str \| DistributedStrategyConfig | "fsdp2" | Sharding strategy: "fsdp2", "ddp", or "megatron_fsdp" (or a config object such as FSDP2Config/DDPConfig). |
| parallelism_sizes | ParallelismSizes \| None | None | Requested parallel dimensions (tp_size, pp_size, cp_size, ep_size, dp_size, dp_replicate_size). dp_replicate_size is FSDP2-only. |
| pipeline_config | PipelineConfig \| dict \| None | None | Pipeline-parallel options; requires pp_size > 1. |
| moe_parallel_config | MoEParallelizerConfig \| dict \| None | None | Expert-parallel options; requires ep_size > 1. |
| activation_checkpointing | bool \| "full" \| "selective" | False | True or "full" enables full activation checkpointing; "selective" enables selective AC for FSDP2 or DDP. |
| world_size | int \| None | None | Total ranks; auto-detected from the process group when omitted. |

ParallelismSizes records durable user intent: the sizes you requested. The resolved runtime topology lives on DistributedSetup.mesh_context (a MeshContext), which derives its sizes from the live DeviceMesh after build.
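To make the difference between requested and resolved sizes concrete, the sketch below infers a data-parallel size from a world size and the other requested dimensions. The helper name infer_dp_size and the rule dp = world_size / (tp * pp * cp) are assumptions for illustration, based on the common convention for composing parallel dimensions. The page does not describe how DistributedSetup.build resolves sizes, so treat mesh_context as the authority at runtime.

```python
def infer_dp_size(world_size: int, tp_size: int = 1, pp_size: int = 1, cp_size: int = 1) -> int:
    """Hypothetical helper: data-parallel size left over after TP, PP, and CP.

    Assumes the conventional factorization world_size = dp * tp * pp * cp.
    This is not the library's resolution logic.
    """
    model_parallel = tp_size * pp_size * cp_size
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel


# 8 ranks with tp_size=2 leave 4 data-parallel replicas under this assumption.
print(infer_dp_size(world_size=8, tp_size=2))  # 4
```

Under the same assumption, a request that does not divide the world size evenly (for example tp_size=3 on 8 ranks) cannot be laid out.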

Validation happens at construction time, so invalid combinations fail fast instead of deep inside training. For example, passing a pipeline_config without pp_size > 1, or a moe_parallel_config without ep_size > 1, raises a ValueError.
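The two rules named above can be sketched in plain Python without the library installed. RequestedSizes and validate_request are hypothetical stand-ins written for this illustration, not the library's classes or its actual validation code, and the dict key used below is a placeholder rather than a real pipeline option.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class RequestedSizes:
    """Hypothetical stand-in for ParallelismSizes (only the fields used here)."""

    tp_size: int = 1
    pp_size: int = 1
    ep_size: int = 1


def validate_request(
    sizes: RequestedSizes,
    pipeline_config: Optional[Any] = None,
    moe_parallel_config: Optional[Any] = None,
) -> None:
    """Reject config/size combinations up front instead of during training."""
    if pipeline_config is not None and sizes.pp_size <= 1:
        raise ValueError("pipeline_config requires pp_size > 1")
    if moe_parallel_config is not None and sizes.ep_size <= 1:
        raise ValueError("moe_parallel_config requires ep_size > 1")


# A pipeline config without pipeline parallelism is rejected immediately.
try:
    validate_request(RequestedSizes(tp_size=2), pipeline_config={"placeholder": True})
except ValueError as err:
    print(f"rejected: {err}")

# The same config is accepted once pp_size > 1 is requested.
validate_request(RequestedSizes(pp_size=2), pipeline_config={"placeholder": True})
```

Checking at construction time means a misconfigured job fails before any model weights are loaded or sharded.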

Plain device mesh shortcut

If you only need a topology and no NeMo-specific policies, you can pass a pre-created Hugging Face-style DeviceMesh directly as device_mesh=. NeMo wraps it in a topology-only DistributedSetup internally:

from torch.distributed.device_mesh import init_device_mesh

from nemo_automodel import NeMoAutoModelForCausalLM

# One-dimensional mesh with a single tensor-parallel axis of size 2.
mesh = init_device_mesh("cuda", mesh_shape=(2,), mesh_dim_names=("tp",))

model = NeMoAutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    device_mesh=mesh,
)

Pass either distributed_setup or device_mesh, not both. Use distributed_setup whenever you need strategy, pipeline, MoE, or activation-checkpointing policies.

Migrating from the per-keyword API

Earlier releases accepted a flat set of distributed keywords on from_pretrained / from_config. These are now consolidated into the single distributed_setup object.

Before:

model = NeMoAutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    distributed_config=FSDP2Config(activation_checkpointing=True),
    tp_size=2,
    pipeline_config=pp_cfg,
    moe_config=moe_cfg,
    moe_mesh=moe_mesh,
    activation_checkpointing=True,
)

After:

distributed_setup = DistributedSetup.build(
    strategy=FSDP2Config(),
    parallelism_sizes=ParallelismSizes(tp_size=2),
    pipeline_config=pp_cfg,
    moe_parallel_config=moe_cfg,
    activation_checkpointing=True,
)

model = NeMoAutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    distributed_setup=distributed_setup,
)

The following keywords are no longer accepted directly on the loaders and raise a TypeError if passed: distributed_config, moe_config, moe_mesh, pipeline_config, tp_plan, and activation_checkpointing. Move them onto DistributedSetup.build. Note the rename moe_config → moe_parallel_config, and that flat size keywords (e.g. tp_size) now live on ParallelismSizes.
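When many call sites still use the old keywords, a small helper can sort them into their new homes. The sketch below works on plain dicts only and does not import the library. split_legacy_kwargs is a hypothetical helper, and it rests on three assumptions: distributed_config maps to strategy (inferred from the Before/After example), every size on ParallelismSizes was previously a flat keyword (the source only gives tp_size as an example), and moe_mesh and tp_plan need manual review because no matching DistributedSetup.build argument is shown on this page.

```python
from typing import Any, Dict, List, Tuple

# Old loader keyword -> DistributedSetup.build keyword (inferred from the example above).
RENAMED = {
    "distributed_config": "strategy",
    "moe_config": "moe_parallel_config",
    "pipeline_config": "pipeline_config",
    "activation_checkpointing": "activation_checkpointing",
}
# Flat size keywords that now belong on ParallelismSizes (assumed to cover all fields).
SIZE_KEYS = ("tp_size", "pp_size", "cp_size", "ep_size", "dp_size", "dp_replicate_size")
# Removed keywords with no build argument shown on this page; flag them for a human.
NEEDS_REVIEW = ("moe_mesh", "tp_plan")


def split_legacy_kwargs(
    legacy: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any], Dict[str, Any], List[str]]:
    """Return (build_kwargs, size_kwargs, loader_kwargs, needs_review)."""
    build_kwargs: Dict[str, Any] = {}
    size_kwargs: Dict[str, Any] = {}
    loader_kwargs: Dict[str, Any] = {}
    needs_review: List[str] = []
    for key, value in legacy.items():
        if key in RENAMED:
            build_kwargs[RENAMED[key]] = value
        elif key in SIZE_KEYS:
            size_kwargs[key] = value
        elif key in NEEDS_REVIEW:
            needs_review.append(key)
        else:
            # Anything else stays on from_pretrained / from_config.
            loader_kwargs[key] = value
    return build_kwargs, size_kwargs, loader_kwargs, needs_review


build_kwargs, size_kwargs, loader_kwargs, needs_review = split_legacy_kwargs(
    {"moe_config": "moe_cfg", "tp_size": 2, "moe_mesh": "moe_mesh", "torch_dtype": "bfloat16"}
)
print(build_kwargs)   # keywords for DistributedSetup.build
print(size_kwargs)    # keywords for ParallelismSizes
print(loader_kwargs)  # keywords that remain on the loader
print(needs_review)   # keywords to migrate by hand
```

The helper only moves keywords. Options nested inside a config object, such as activation_checkpointing=True inside FSDP2Config in the Before example, still have to be moved by hand.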

See also