nemo_automodel.components.distributed.parallelizer#
Module Contents#
Functions#
- apply_fsdp_sharding_recursively – Recursively apply FSDP sharding to modules, with optimizations for ModuleList.
- get_hf_tp_shard_plan – Get the Hugging Face tensor parallel plan from the model.
- import_class_from_path – Import a class from a string path (e.g. ‘torch.optim.AdamW’).
- import_classes_from_paths – Helper function to import classes from string paths.
- translate_to_torch_parallel_style – Translates string descriptions to parallelism plans.
- fsdp2_strategy_parallelize – Apply parallelisms and activation checkpointing to the model.
- nvfsdp_strategy_parallelize – Apply tensor/data parallelism (nvFSDP) and optional activation checkpointing to the model.
- Destroy the process group.
- Explicitly unshard and then reshard the FSDP2 modules. Useful for logprob inference.
Data#
API#
- nemo_automodel.components.distributed.parallelizer.HAVE_NVFSDP#
False
- nemo_automodel.components.distributed.parallelizer.apply_fsdp_sharding_recursively(
- module: torch.nn.Module,
- mesh: torch.distributed.device_mesh.DeviceMesh,
- mp_policy: Optional[torch.distributed.fsdp.MixedPrecisionPolicy],
- offload_policy: Optional[torch.distributed.fsdp.CPUOffloadPolicy] = None,
Recursively apply FSDP sharding to modules, with optimizations for ModuleList.
This utility function traverses a model hierarchy and applies FSDP sharding to each module. For ModuleList instances (commonly used for transformer layers), it applies an optimization where the last layer doesn’t reshard after forward since FSDP will prefetch it immediately.
- Parameters:
module (nn.Module) – The module to apply FSDP sharding to.
mesh (DeviceMesh) – The device mesh for FSDP sharding.
mp_policy (Optional[MixedPrecisionPolicy]) – Mixed precision policy for FSDP.
offload_policy (Optional[CPUOffloadPolicy]) – CPU offload policy for FSDP. Defaults to None.
Note: This function modifies the module in-place by replacing modules with their FSDP-wrapped versions.
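A minimal usage sketch follows, assuming a torchrun launch with an NCCL process group; the TinyStack class below is an illustrative placeholder model whose ModuleList stands in for the transformer layers this utility is optimized for.

```python
# Hedged sketch: shard the layers of a toy ModuleList-based model with FSDP2.
# Assumes a torchrun launch; TinyStack is an illustrative placeholder model.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import MixedPrecisionPolicy

from nemo_automodel.components.distributed.parallelizer import apply_fsdp_sharding_recursively


class TinyStack(nn.Module):
    """Toy model; real use targets transformer decoder layers kept in a ModuleList."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(256, 256) for _ in range(4))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x


dist.init_process_group("nccl")
mesh = init_device_mesh("cuda", (dist.get_world_size(),))
mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)

model = TinyStack()
apply_fsdp_sharding_recursively(model, mesh, mp_policy)  # wraps submodules in place
```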
- nemo_automodel.components.distributed.parallelizer.get_hf_tp_shard_plan(model)[source]#
Get the Hugging Face tensor parallel plan from the model.
This function:
1. Retrieves TP strategies from the model class, instance, and inner model levels.
2. Handles special cases for embed_tokens and lm_head for speed.
3. Converts string-based parallel styles to DTensor parallelization strategies.
Taken and modified from: https://github.com/NVIDIA/NeMo/blob/6c6169db01bcca73ae8ad3ac35242fadbb9a78ba/nemo/lightning/pytorch/strategies/utils.py#L532
- Parameters:
model – A Hugging Face model instance
- Returns:
A dictionary mapping model component paths to their parallelization strategies
- Return type:
dict
- Raises:
AssertionError – If no TP plan is found
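For illustration, a hedged sketch of retrieving and inspecting the plan; the checkpoint name is an assumption, and any Hugging Face model that ships a tensor parallel plan should work.

```python
# Sketch: derive and inspect the TP shard plan of a Hugging Face model.
# The checkpoint name is illustrative; the call raises AssertionError if no plan is found.
from transformers import AutoModelForCausalLM

from nemo_automodel.components.distributed.parallelizer import get_hf_tp_shard_plan

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
tp_plan = get_hf_tp_shard_plan(model)
for module_path, style in tp_plan.items():
    print(module_path, type(style).__name__)
```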
- nemo_automodel.components.distributed.parallelizer.import_class_from_path(name: str) Any [source]#
Import a class from a string path (e.g. ‘torch.optim.AdamW’).
- Parameters:
name (str) – Full path to the class, including the module path and class name
- Returns:
The imported class object
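For example, resolving the optimizer class named in the docstring; the module and learning rate below are placeholders.

```python
import torch.nn as nn

from nemo_automodel.components.distributed.parallelizer import import_class_from_path

optim_cls = import_class_from_path("torch.optim.AdamW")
model = nn.Linear(8, 8)  # placeholder module
optimizer = optim_cls(model.parameters(), lr=1e-4)
```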
- nemo_automodel.components.distributed.parallelizer.import_classes_from_paths(class_paths: List[str])[source]#
Helper function to import classes from string paths.
- Parameters:
class_paths (List[str]) – The list of string paths to the classes.
- Returns:
List of imported classes.
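A small usage sketch with ordinary torch class paths:

```python
from nemo_automodel.components.distributed.parallelizer import import_classes_from_paths

linear_cls, embedding_cls = import_classes_from_paths(["torch.nn.Linear", "torch.nn.Embedding"])
assert linear_cls.__name__ == "Linear" and embedding_cls.__name__ == "Embedding"
```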
- nemo_automodel.components.distributed.parallelizer.translate_to_torch_parallel_style(style: str)[source]#
Translates string descriptions to parallelism plans.
In model configurations, we use a neutral type (string) to specify parallel styles; here we translate them into torch.distributed tensor-parallel types.
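The exact set of accepted strings is defined by the implementation; the sketch below assumes the common "colwise"/"rowwise" descriptors map to their torch.distributed.tensor.parallel counterparts.

```python
from nemo_automodel.components.distributed.parallelizer import translate_to_torch_parallel_style

# Assumed descriptors; the implementation defines the exact accepted strings.
colwise_style = translate_to_torch_parallel_style("colwise")
rowwise_style = translate_to_torch_parallel_style("rowwise")
print(type(colwise_style).__name__, type(rowwise_style).__name__)
```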
- nemo_automodel.components.distributed.parallelizer.fsdp2_strategy_parallelize(
- model,
- device_mesh: torch.distributed.device_mesh.DeviceMesh,
- param_dtype: torch.dtype = torch.bfloat16,
- mp_policy: Optional[torch.distributed.fsdp.MixedPrecisionPolicy] = None,
- offload_policy: Optional[torch.distributed.fsdp.CPUOffloadPolicy] = None,
- sequence_parallel: bool = False,
- activation_checkpointing: bool = False,
- cpu_offload: bool = False,
- tp_shard_plan: Optional[Union[Dict[str, torch.distributed.tensor.parallel.ParallelStyle], str]] = None,
- dp_mesh_name: str = 'data_parallel',
- tp_mesh_name: str = 'tensor_parallel',
- dp_cp_mesh_name: str = 'dp_cp',
Apply parallelisms and activation checkpointing to the model.
Enhanced version that incorporates advanced features from nemo-rl’s _parallelize_model:
- Automatic parallel plan generation based on model type
- Custom parallel plan support (dict or string path)
- Sequence parallel support
- Activation checkpointing for MLP layers
- Model validation (attention heads divisible by TP size)
- Better fallback logic
- Parameters:
model – The model to be parallelized.
device_mesh (DeviceMesh) – The device mesh for distributed training.
param_dtype (torch.dtype) – Data type for model parameters. Defaults to torch.bfloat16.
mp_policy (Optional[MixedPrecisionPolicy]) – Mixed precision policy for model parallelism.
offload_policy (Optional[CPUOffloadPolicy]) – The offload policy for FSDP.
sequence_parallel (bool) – Whether to use sequence parallelism. Defaults to False.
activation_checkpointing (bool) – Whether to use activation checkpointing. Defaults to False.
cpu_offload (bool) – Whether to enable cpu offloading for FSDP. Defaults to False.
tp_shard_plan (Optional[Union[Dict[str, ParallelStyle], str]]) – Custom tensor parallel plan for the model. Can be:
- A dictionary mapping module names to parallel styles.
- A string path to a dictionary or function that returns a dictionary.
If provided, this takes precedence over automatic plan generation.
dp_mesh_name (str) – Key name for the data parallel mesh in device_mesh. Defaults to “data_parallel”.
dp_cp_mesh_name (str) – Key name for the data parallel + context parallel mesh in device_mesh. Used when context parallelism is enabled. Defaults to “dp_cp”.
tp_mesh_name (str) – Key name for the tensor parallel mesh in device_mesh. Defaults to “tensor_parallel”.
- Returns:
The parallelized model.
NOTE: The passed-in model should preferably be on the meta device. Otherwise, the model must fit in GPU or CPU memory.
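A hedged end-to-end sketch follows, assuming a torchrun launch and a 2-D mesh that uses the default mesh dim names. The checkpoint, world-size split, and flag choices are illustrative, and the model is loaded directly here rather than on the meta device for brevity.

```python
# Sketch: FSDP2 + tensor parallelism over a 2-D device mesh with default dim names.
# Launch with torchrun; world-size split, checkpoint, and flags are illustrative.
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from transformers import AutoModelForCausalLM

from nemo_automodel.components.distributed.parallelizer import fsdp2_strategy_parallelize

dist.init_process_group("nccl")
tp_size = 2
dp_size = dist.get_world_size() // tp_size
mesh = init_device_mesh(
    "cuda",
    (dp_size, tp_size),
    mesh_dim_names=("data_parallel", "tensor_parallel"),
)

# For large models, the note above recommends constructing the model on the meta device first.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

model = fsdp2_strategy_parallelize(
    model,
    device_mesh=mesh,
    param_dtype=torch.bfloat16,
    activation_checkpointing=True,
)
```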
- nemo_automodel.components.distributed.parallelizer.nvfsdp_strategy_parallelize(
- model,
- device_mesh: torch.distributed.device_mesh.DeviceMesh,
- optimizer=None,
- nvfsdp_unit_modules: Optional[List[str]] = None,
- tp_shard_plan: Optional[Dict[str, Union[torch.distributed.tensor.parallel.RowwiseParallel, torch.distributed.tensor.parallel.ColwiseParallel, torch.distributed.tensor.parallel.SequenceParallel]]] = None,
- data_parallel_sharding_strategy: str = 'optim_grads_params',
- init_nvfsdp_with_meta_device: bool = False,
- grad_reduce_in_fp32: bool = False,
- preserve_fp32_weights: bool = False,
- overlap_grad_reduce: bool = True,
- overlap_param_gather: bool = True,
- check_for_nan_in_grad: bool = True,
- average_in_collective: bool = False,
- disable_bucketing: bool = False,
- calculate_per_token_loss: bool = False,
- keep_fp8_transpose_cache_when_using_custom_fsdp: bool = False,
- nccl_ub: bool = False,
- fsdp_double_buffer: bool = False,
- dp_mesh_name: str = 'data_parallel',
- cp_mesh_name: str = 'context_parallel',
- tp_mesh_name: str = 'tensor_parallel',
Apply tensor/data parallelism (nvFSDP) and optional activation-checkpointing to the model.
- Parameters:
model – The model to be parallelized.
device_mesh (DeviceMesh) – The device mesh describing the physical devices used for distributed training.
nvfsdp_unit_modules (Optional[List[str]]) – Names of sub-modules that should become individual nvFSDP units. If None, the full model is wrapped as a single unit.
tp_shard_plan (Optional[Dict[str, Union[RowwiseParallel, ColwiseParallel, SequenceParallel]]]) – A tensor-parallel sharding plan. Keys are module names; values specify the parallel style to apply (e.g., RowwiseParallel, ColwiseParallel, SequenceParallel).
data_parallel_sharding_strategy (str) – Strategy for sharding parameters, gradients, and optimizer states across data-parallel ranks. Valid options include “params”, “grads_params”, and “optim_grads_params” (default).
init_nvfsdp_with_meta_device (bool) – If True, construct the model on a meta device first and materialize weights lazily to reduce memory fragmentation.
grad_reduce_in_fp32 (bool) – Reduce gradients in FP32 irrespective of the parameter precision to improve numerical stability.
preserve_fp32_weights (bool) – Keep a master FP32 copy of weights when training in reduced precision (e.g., FP16/BF16).
overlap_grad_reduce (bool) – If True, overlap gradient reduction with backward computation.
overlap_param_gather (bool) – If True, overlap parameter gathering with forward computation.
check_for_nan_in_grad (bool) – Whether to check gradients for NaNs/Infs before applying the optimizer step.
average_in_collective (bool) – Perform gradient averaging inside the collective operation instead of dividing afterward.
disable_bucketing (bool) – Disable gradient bucketing; gradients are reduced immediately as they are produced.
calculate_per_token_loss (bool) – Compute loss normalized by the number of tokens instead of the number of sequences.
keep_fp8_transpose_cache_when_using_custom_fsdp (bool) – Retain the FP8 transpose cache when using a custom nvFSDP wrapper.
nccl_ub (bool) – Enable NCCL user-buffer API (experimental) for reduced latency on some networks.
fsdp_double_buffer (bool) – Enable double buffering of parameters to overlap communication and computation in nvFSDP.
dp_mesh_name (str) – Key name for the data parallel mesh in device_mesh. Defaults to “data_parallel”.
cp_mesh_name (str) – Key name for the context parallel mesh in device_mesh. Defaults to “context_parallel”.
tp_mesh_name (str) – Key name for the tensor parallel mesh in device_mesh. Defaults to “tensor_parallel”.
NOTE: The passed-in model should preferably reside on the meta device. Otherwise, ensure the model fits into available GPU or CPU memory.
NOTE: The user must ensure that the provided tp_shard_plan is compatible with the model architecture.
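A heavily hedged sketch, assuming nvFSDP is installed (HAVE_NVFSDP is True), a torchrun launch, and a 3-D mesh whose dim names match the defaults. The checkpoint, sizes, and unit-module path are illustrative assumptions, and the return shape is not documented above, so the assignment below is also an assumption.

```python
# Heavily hedged sketch: nvFSDP data/tensor parallelism over a 3-D device mesh.
# Requires nvFSDP to be installed (HAVE_NVFSDP) and a torchrun launch; sizes,
# checkpoint, and the unit-module path are illustrative assumptions.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from transformers import AutoModelForCausalLM

from nemo_automodel.components.distributed.parallelizer import nvfsdp_strategy_parallelize

dist.init_process_group("nccl")
tp_size, cp_size = 2, 1
dp_size = dist.get_world_size() // (tp_size * cp_size)
mesh = init_device_mesh(
    "cuda",
    (dp_size, cp_size, tp_size),
    mesh_dim_names=("data_parallel", "context_parallel", "tensor_parallel"),
)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")

parallelized = nvfsdp_strategy_parallelize(
    model,
    device_mesh=mesh,
    nvfsdp_unit_modules=["transformers.models.qwen2.modeling_qwen2.Qwen2DecoderLayer"],
    data_parallel_sharding_strategy="optim_grads_params",
)  # return shape (model only, or model plus wrapped optimizer) is assumed here
```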