nemo_automodel.distributed.fsdp2#
Module Contents#
Classes#
FSDP2Manager – Manager for setting up and parallelizing models using FSDP2 with TP, DP, CP sharding.
API#
- class nemo_automodel.distributed.fsdp2.FSDP2Manager[source]#
Manager for setting up and parallelizing models using FSDP2 with TP, DP, CP sharding.
This manager initializes the torch.distributed process group, infers the group sizes for data parallelism (DP) and tensor parallelism (TP), builds the device mesh for distributed operations, and applies parallelization to the model using a prescribed TP sharding plan. It also supports mixed precision and CPU offloading options.
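A minimal usage sketch (the stand-in model, the argument values, and the `torchrun` launch are illustrative assumptions, not requirements):
```python
import torch.nn as nn

from nemo_automodel.distributed.fsdp2 import FSDP2Manager

# Assumes a launcher such as torchrun has already set RANK, WORLD_SIZE,
# MASTER_ADDR, and MASTER_PORT for every process.
model = nn.TransformerEncoderLayer(d_model=1024, nhead=16)  # stand-in nn.Module

# The dataclass fields documented below are passed as keyword arguments;
# the post-init hook sets up torch.distributed and the device mesh automatically.
manager = FSDP2Manager(dp_size=None, tp_size=1, cp_size=1, backend="nccl")
model = manager.parallelize(model)
```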
- Attributes:
dp_size (Optional[int]) – Data-parallel group size. If None or non-positive, it is inferred from WORLD_SIZE.
tp_size (Optional[int]) – Tensor-parallel group size. Defaults to 1 if zero/None.
cp_size (int) – Context-parallel group size, used to shard activations along the sequence (context) dimension.
sequence_parallel (bool) – Enables sequence parallelism in the TP plan when True.
mp_policy (MixedPrecisionPolicy) – Defines the mixed precision policy for parameters, reductions, and outputs (see the sketch after this list).
offload_policy (CPUOffloadPolicy) – Policy to offload parameters or optimizer states to CPU, if specified.
backend (str) – Distributed backend to use (e.g., 'nccl' for GPUs or 'gloo' for CPUs).
world_size (int) – Total number of processes.
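A sketch of configuring mixed precision and CPU offloading, assuming the FSDP2 policy classes referenced by the field annotations below (`torch.distributed.fsdp.MixedPrecisionPolicy` and `CPUOffloadPolicy`); the dtypes chosen are illustrative:
```python
import torch
from torch.distributed.fsdp import CPUOffloadPolicy, MixedPrecisionPolicy

from nemo_automodel.distributed.fsdp2 import FSDP2Manager

# bf16 parameters with fp32 gradient reductions, plus CPU offload of parameters.
mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
offload_policy = CPUOffloadPolicy(pin_memory=True)

manager = FSDP2Manager(
    tp_size=1,
    mp_policy=mp_policy,
    offload_policy=offload_policy,
    backend="nccl",
)
```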
- Methods:
post_init() – Automatically sets up the distributed environment after initialization.
_setup_distributed() – Initializes the torch.distributed process group, infers parallel sizes, builds the device mesh, and registers a destroy handler.
parallelize(model) – Applies FSDP2 and tensor-parallel sharding strategies to the given model.
- dp_size: Optional[int]#
‘field(…)’
- tp_size: Optional[int]#
‘field(…)’
- cp_size: Optional[int]#
‘field(…)’
- sequence_parallel: Optional[bool]#
‘field(…)’
- mp_policy: Optional[torch.distributed.fsdp.MixedPrecisionPolicy]#
‘field(…)’
- offload_policy: Optional[torch.distributed.fsdp.CPUOffloadPolicy]#
‘field(…)’
- backend: Optional[str]#
‘field(…)’
- world_size: Optional[int]#
‘field(…)’
- _setup_distributed()[source]#
Initializes the distributed environment. Specifically, it:
- Checks availability and initialization of torch.distributed.
- Infers data-parallel and tensor-parallel sizes if not provided.
- Builds a device mesh based on the specified mesh shape and dimension names.
- Flattens data and context dimensions if context parallelism is enabled.
Requires the environment variables RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT; see the illustrative check below.
- Raises:
RuntimeError – If torch.distributed is not available or not initialized.
- Returns:
Instance with the device mesh configured.
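Because _setup_distributed runs automatically from the post-init hook, the required variables must already be in the environment before the manager is constructed. A purely illustrative check mirroring that expectation:
```python
import os

# Normally torchrun or the cluster scheduler sets these; this check only
# illustrates what _setup_distributed() expects to find.
required = ("RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
missing = [name for name in required if name not in os.environ]
if missing:
    raise RuntimeError(f"Missing distributed environment variables: {missing}")
```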
- parallelize(model, use_hf_tp_plan=False)[source]#
Parallelizes the given model using FSDP2 and TP sharding strategies.
This method must be called after the distributed environment has been set up. It selects a TP sharding plan (currently supporting Hugging Face TP plan via get_hf_tp_shard_plan) and applies the FSDP2 parallelization strategy.
- Parameters:
model (nn.Module) – The model to be parallelized.
use_hf_tp_plan (bool) – If True, attempts to get the TP plan from the model.
- Returns:
The parallelized model.
- Raises:
NotImplementedError – If the required TP sharding plan is not supported.
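A hedged sketch of requesting the Hugging Face TP plan; the checkpoint name is illustrative, and the snippet assumes transformers is installed and the process-group prerequisites described above are met:
```python
from transformers import AutoModelForCausalLM

from nemo_automodel.distributed.fsdp2 import FSDP2Manager

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")  # illustrative checkpoint
manager = FSDP2Manager(tp_size=2, backend="nccl")

# use_hf_tp_plan=True asks parallelize() to derive the TP sharding plan from the
# Hugging Face model (via get_hf_tp_shard_plan); NotImplementedError is raised
# if no supported plan is available.
model = manager.parallelize(model, use_hf_tp_plan=True)
```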