nemo_automodel.distributed.fsdp2#

Module Contents#

Classes#

FSDP2Manager

Manager for setting up and parallelizing models using FSDP2 with TP, DP, CP sharding.

API#

class nemo_automodel.distributed.fsdp2.FSDP2Manager[source]#

Manager for setting up and parallelizing models using FSDP2 with TP, DP, CP sharding.

This manager initializes the torch.distributed process group, infers the group sizes for data parallelism (DP) and tensor parallelism (TP), builds the device mesh for distributed operations, and applies parallelization to the model using a prescribed TP sharding plan. It also supports mixed precision and CPU offloading options.
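
The following is a minimal usage sketch, not taken from the library's documentation: it assumes the dataclass fields listed below can be passed as keyword arguments and that the process is launched with torchrun (so RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT are set); the toy model is purely illustrative.

.. code-block:: python

    import torch.nn as nn

    from nemo_automodel.distributed.fsdp2 import FSDP2Manager

    # __post_init__ initializes torch.distributed, so create the manager once
    # per rank inside a torchrun/SLURM launch.
    manager = FSDP2Manager(
        dp_size=None,            # None -> inferred from WORLD_SIZE
        tp_size=1,               # tensor-parallel group size
        cp_size=1,               # context-parallel group size
        sequence_parallel=False,
        backend="nccl",          # use "gloo" for CPU-only runs
    )

    # Hypothetical model; any nn.Module can be passed to parallelize().
    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

    # Apply FSDP2 (and TP, when tp_size > 1) sharding to the model.
    model = manager.parallelize(model)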

.. attribute:: dp_size

Data-parallel group size. If None or non-positive, it is inferred from WORLD_SIZE.

Type:

Optional[int]

.. attribute:: tp_size

Tensor-parallel group size. Defaults to 1 if zero/None.

Type:

Optional[int]

.. attribute:: cp_size

Context-parallel group size, used to shard activations along the sequence (context) dimension.

Type:

int

.. attribute:: sequence_parallel

Enables sequence parallelism in the TP plan when True.

Type:

bool

.. attribute:: mp_policy

Defines the mixed precision policy for parameters, reductions, and outputs; a construction sketch follows this attribute list.

Type:

MixedPrecisionPolicy

.. attribute:: offload_policy

Policy to offload parameters or optimizer states to CPU, if specified.

Type:

CPUOffloadPolicy

.. attribute:: backend

Distributed backend to use (e.g., ‘nccl’ for GPUs or ‘gloo’ for CPUs).

Type:

str

.. attribute:: world_size

Total number of processes.

Type:

int
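
A sketch of how the mp_policy and offload_policy attributes might be constructed, using the torch.distributed.fsdp types referenced in the field annotations below; the dtype choices are illustrative, and the manager must still be created under a distributed launch as in the usage sketch above.

.. code-block:: python

    import torch
    from torch.distributed.fsdp import CPUOffloadPolicy, MixedPrecisionPolicy

    from nemo_automodel.distributed.fsdp2 import FSDP2Manager

    # bf16 parameters/compute with fp32 gradient reductions.
    mp_policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.float32,
    )

    # Keep sharded parameters in pinned CPU memory between uses.
    offload_policy = CPUOffloadPolicy(pin_memory=True)

    manager = FSDP2Manager(
        tp_size=2,
        mp_policy=mp_policy,
        offload_policy=offload_policy,
    )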

.. method:: __post_init__()

Automatically sets up the distributed environment after initialization.

.. method:: _setup_distributed()

Initializes the torch.distributed process group, infers parallel sizes, builds the device mesh, and registers a destroy handler.

.. method:: parallelize(model)

Applies FSDP2 and Tensor-Parallel sharding strategies to the given model.

dp_size: Optional[int]#

‘field(…)’

tp_size: Optional[int]#

‘field(…)’

cp_size: Optional[int]#

‘field(…)’

sequence_parallel: Optional[bool]#

‘field(…)’

mp_policy: Optional[torch.distributed.fsdp.MixedPrecisionPolicy]#

‘field(…)’

offload_policy: Optional[torch.distributed.fsdp.CPUOffloadPolicy]#

‘field(…)’

backend: Optional[str]#

‘field(…)’

world_size: Optional[int]#

‘field(…)’

__post_init__()[source]#

Post-initialization hook that sets up the distributed environment.

_setup_distributed()[source]#

Initializes the distributed environment.

  • Checks availability and initialization of torch.distributed.

  • Infers data-parallel and tensor-parallel sizes if not provided.

  • Builds a device mesh based on the specified mesh shape and dimension names.

  • Flattens data and context dimensions if context parallelism is enabled.

Requires the environment variables: RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.

Raises:

RuntimeError – If torch.distributed is not available or not initialized.

Returns:

The manager instance with the device mesh configured.

Return type:

FSDP2Manager
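
For orientation, a hedged sketch of the kind of mesh this step produces, using the public init_device_mesh API; the actual dimension names, their ordering, and the flattening logic inside _setup_distributed are assumptions.

.. code-block:: python

    import os

    from torch.distributed.device_mesh import init_device_mesh

    # torchrun sets these per process, e.g. `torchrun --nproc-per-node 8 train.py`.
    required = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
    missing = [v for v in required if v not in os.environ]
    assert not missing, f"missing environment variables: {missing}"

    world_size = int(os.environ["WORLD_SIZE"])
    tp_size, cp_size = 2, 2                      # illustrative values
    dp_size = world_size // (tp_size * cp_size)  # inferred data-parallel size

    # A 3-D mesh with named dimensions; when cp_size > 1 the manager also
    # flattens the data and context dimensions into a single submesh for
    # FSDP sharding.
    mesh = init_device_mesh(
        "cuda",
        (dp_size, cp_size, tp_size),
        mesh_dim_names=("dp", "cp", "tp"),
    )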

parallelize(model, use_hf_tp_plan=False)[source]#

Parallelizes the given model using FSDP2 and TP sharding strategies.

This method must be called after the distributed environment has been set up. It selects a TP sharding plan (currently the Hugging Face TP plan, obtained via get_hf_tp_shard_plan) and applies the FSDP2 parallelization strategy.

Parameters:
  • model (nn.Module) – The model to be parallelized.

  • use_hf_tp_plan (bool) – If True, attempts to retrieve the TP sharding plan from the model via get_hf_tp_shard_plan.

Returns:

The parallelized model.

Raises:

NotImplementedError – If the required TP sharding plan is not supported.
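
This is not the library's implementation; the sketch below only illustrates the combination of parallelize_module and fully_shard that such a method typically applies, assuming a recent PyTorch where fully_shard is exported from torch.distributed.fsdp. The placeholder TP plan, the module names fc1/fc2, and the mesh dimension names stand in for get_hf_tp_shard_plan and are assumptions.

.. code-block:: python

    import torch.nn as nn
    from torch.distributed.fsdp import fully_shard
    from torch.distributed.tensor.parallel import (
        ColwiseParallel,
        RowwiseParallel,
        parallelize_module,
    )

    def parallelize_sketch(model: nn.Module, mesh) -> nn.Module:
        # 1) Tensor parallelism: shard selected linear layers over the "tp" dim.
        #    A generic placeholder plan, not the Hugging Face plan the manager uses.
        if mesh["tp"].size() > 1:
            tp_plan = {"fc1": ColwiseParallel(), "fc2": RowwiseParallel()}
            parallelize_module(model, mesh["tp"], tp_plan)

        # 2) FSDP2: shard parameters and gradients over the data-parallel dim.
        fully_shard(model, mesh=mesh["dp"])
        return model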