nemo_rl.models.dtensor.parallelize#
Module Contents#
Classes#
- RotaryEmbedParallel – Custom SequenceParallel class for Qwen2 / Gemma3 rotary embeddings because the input is a tuple.
Functions#
- _parallelize_gemma3 – Parallelizes a Gemma3ForCausalLM model across data and tensor parallel dimensions.
- _parallelize_llama – Parallelizes a LlamaForCausalLM model across data and tensor parallel dimensions.
- _parallelize_qwen – Parallelizes a Qwen2ForCausalLM model across data and tensor parallel dimensions.
- translate_parallel_style – Translate parallel style str to parallel type.
- get_hf_tp_plan – Get the Hugging Face tensor parallel plan from the model.
- _parallelize_nm5_h – Parallelize a NemotronHForCausalLM model across data and tensor parallel dimensions.
- _parallelize_model – Parallelize a model using DTensor.
- to_local_if_dtensor – Returns the local shard of the given tensor if it is a DTensor.
- clip_grad_by_total_norm_ – Clips gradient of an iterable of parameters by total norm.
- get_grad_norm – Calculate the norm of gradients.
Data#
API#
- class nemo_rl.models.dtensor.parallelize.RotaryEmbedParallel#
Bases: torch.distributed.tensor.parallel.SequenceParallel

Custom SequenceParallel class for Qwen2 / Gemma3 rotary embeddings because the input is a tuple.
- static _prepare_input_fn(sequence_sharding, mod, inputs, device_mesh)#
- static _prepare_output_fn(use_local_output, mod, outputs, device_mesh)#
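A minimal sketch of how this style might appear in a tensor parallel plan. The module path `model.rotary_emb` and the `use_local_output` keyword (inherited from SequenceParallel) are assumptions; in practice the plans are built by the `_parallelize_*` helpers below.

```python
# Sketch (assumed module path and constructor arguments): place the
# rotary-embedding module under the tuple-aware SequenceParallel variant.
from torch.distributed.tensor.parallel import parallelize_module
from nemo_rl.models.dtensor.parallelize import RotaryEmbedParallel

tp_plan = {
    # Hypothetical path; real paths depend on the model architecture.
    "model.rotary_emb": RotaryEmbedParallel(use_local_output=True),
}
# parallelize_module applies the plan over the tensor-parallel device mesh:
# model = parallelize_module(model, tp_mesh, tp_plan)
```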
- nemo_rl.models.dtensor.parallelize._parallelize_gemma3(
- model: Union[transformers.models.gemma3.modeling_gemma3.Gemma3ForCausalLM, transformers.models.gemma3.modeling_gemma3.Gemma3ForConditionalGeneration],
- sequence_parallel: bool = False,
)#
Parallelizes a Gemma3ForCausalLM model across data and tensor parallel dimensions.
- nemo_rl.models.dtensor.parallelize._parallelize_llama(
- model: transformers.models.llama.modeling_llama.LlamaForCausalLM,
- sequence_parallel: bool = False,
)#
Parallelizes a LlamaForCausalLM model across data and tensor parallel dimensions.
- nemo_rl.models.dtensor.parallelize._parallelize_qwen(
- model: Union[transformers.models.qwen2.modeling_qwen2.Qwen2ForCausalLM, transformers.models.qwen3.modeling_qwen3.Qwen3ForCausalLM],
- sequence_parallel: bool = False,
)#
Parallelizes a Qwen2ForCausalLM model across data and tensor parallel dimensions.
- nemo_rl.models.dtensor.parallelize.PARALLIZE_FUNCTIONS: dict[type[torch.nn.Module], Callable[..., dict[str, torch.distributed.tensor.parallel.ParallelStyle]]]#
None
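A hedged sketch of how this registry could be used to dispatch to the model-specific plan builder. The assumption that each registered callable accepts `(model, sequence_parallel)` follows the signatures shown above.

```python
# Sketch: look up the plan builder registered for the model's class.
# Assumes values follow the (model, sequence_parallel=...) signatures above.
from nemo_rl.models.dtensor.parallelize import PARALLIZE_FUNCTIONS


def build_tp_plan(model, sequence_parallel=False):
    plan_fn = PARALLIZE_FUNCTIONS.get(type(model))
    if plan_fn is None:
        raise ValueError(f"No parallel plan registered for {type(model).__name__}")
    return plan_fn(model, sequence_parallel=sequence_parallel)
```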
- nemo_rl.models.dtensor.parallelize.translate_parallel_style(style: str)#
Translate parallel style str to parallel type.
Taken and modified from: https://github.com/NVIDIA/NeMo/blob/6c6169db01bcca73ae8ad3ac35242fadbb9a78ba/nemo/lightning/pytorch/strategies/utils.py#L547
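A small usage sketch. The accepted strings (for example "colwise" and "rowwise") are an assumption based on the referenced NeMo utility, not an exhaustive list.

```python
# Sketch: string styles from a Hugging Face tp_plan become DTensor
# ParallelStyle objects. The accepted strings here are assumptions.
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel
from nemo_rl.models.dtensor.parallelize import translate_parallel_style

colwise = translate_parallel_style("colwise")
rowwise = translate_parallel_style("rowwise")
assert isinstance(colwise, ColwiseParallel) and isinstance(rowwise, RowwiseParallel)
```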
- nemo_rl.models.dtensor.parallelize.get_hf_tp_plan(model: transformers.modeling_utils.PreTrainedModel)#
Get the Hugging Face tensor parallel plan from the model.
This function:

- Retrieves TP strategies from the model class, instance, and inner model levels.
- Handles special cases for embed_tokens and lm_head for speedup.
- Converts string-based parallel styles to DTensor parallelization strategies.
Taken and modified from: https://github.com/NVIDIA/NeMo/blob/6c6169db01bcca73ae8ad3ac35242fadbb9a78ba/nemo/lightning/pytorch/strategies/utils.py#L532
- Parameters:
model – A Hugging Face model instance
- Returns:
A dictionary mapping model component paths to their parallelization strategies
- Return type:
dict
- Raises:
AssertionError – If no TP plan is found
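A hedged usage sketch. The checkpoint name is illustrative, and the exact keys of the returned plan depend on the model architecture.

```python
# Sketch: inspect the TP plan derived from a Hugging Face model.
from transformers import AutoModelForCausalLM
from nemo_rl.models.dtensor.parallelize import get_hf_tp_plan

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")  # illustrative
tp_plan = get_hf_tp_plan(model)
for module_path, parallel_style in tp_plan.items():
    print(module_path, type(parallel_style).__name__)
```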
- nemo_rl.models.dtensor.parallelize._parallelize_nm5_h(
- model,
- dp_mesh: torch.distributed.device_mesh.DeviceMesh,
- tp_mesh: torch.distributed.device_mesh.DeviceMesh,
- param_dtype: torch.dtype,
- sequence_parallel: bool = False,
- activation_checkpointing: bool = False,
- cpu_offload: bool = False,
- custom_parallel_plan: Optional[Union[dict, str]] = None,
)#
Parallelize a NemotronHForCausalLM model across data and tensor parallel dimensions.
- nemo_rl.models.dtensor.parallelize._parallelize_model(
- model: Union[transformers.models.qwen2.modeling_qwen2.Qwen2ForCausalLM, transformers.models.qwen3.modeling_qwen3.Qwen3ForCausalLM, transformers.models.llama.modeling_llama.LlamaForCausalLM, transformers.models.gemma3.modeling_gemma3.Gemma3ForCausalLM, transformers.models.gemma3.modeling_gemma3.Gemma3ForConditionalGeneration],
- dp_mesh: torch.distributed.device_mesh.DeviceMesh,
- tp_mesh: torch.distributed.device_mesh.DeviceMesh,
- param_dtype: torch.dtype,
- sequence_parallel: bool = False,
- activation_checkpointing: bool = False,
- cpu_offload: bool = False,
- custom_parallel_plan: Optional[Union[dict, str]] = None,
)#
Parallelize a model using DTensor.
- Parameters:
model – The model to parallelize.
dp_mesh – Device mesh for data parallelism.
tp_mesh – Device mesh for tensor parallelism.
param_dtype – Data type for model parameters.
sequence_parallel – Whether to use sequence parallelism. Defaults to False.
activation_checkpointing – Whether to use activation checkpointing. Defaults to False.
cpu_offload – Whether to enable cpu offloading for FSDP. Defaults to False.
custom_parallel_plan – Custom parallel plan for the model. Defaults to None. If it is a dict, it is used as the parallel plan directly. If it is a string, it must be a path that points to a dict or to a function that returns a dict. For a usage example, see docs/design-docs/fsdp2-parallel-plan.md.
- Returns:
The parallelized model.
- Raises:
ValueError – If the model type is not supported for parallelization.
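A hedged end-to-end sketch, assuming an 8-GPU node split 2-way data parallel by 4-way tensor parallel and an illustrative checkpoint. As an underscore-prefixed helper, this function is typically invoked internally rather than by user code.

```python
# Sketch: 2-way data parallel x 4-way tensor parallel on one 8-GPU node.
# Mesh layout, dim names, and the checkpoint are illustrative assumptions.
import torch
from torch.distributed.device_mesh import init_device_mesh
from transformers import AutoModelForCausalLM

from nemo_rl.models.dtensor.parallelize import _parallelize_model

mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
model = _parallelize_model(
    model,
    dp_mesh=mesh["dp"],
    tp_mesh=mesh["tp"],
    param_dtype=torch.bfloat16,
    sequence_parallel=False,
    activation_checkpointing=True,
)
```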
- nemo_rl.models.dtensor.parallelize.to_local_if_dtensor(
- tensor: Union[torch.Tensor, torch.distributed.tensor.DTensor],
)#
Returns the local shard of the given tensor if it is a DTensor.
Taken and modified from: https://github.com/NVIDIA/Megatron-LM/blob/605f618f237cda8fa80132bc2ccff933512d5a0d/megatron/core/utils.py#L746
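A short sketch, assuming `model` has been parallelized as in the example above, so some parameters are DTensors while others are regular tensors.

```python
# Sketch: normalize parameters to local shards before plain torch ops.
# Regular tensors pass through unchanged.
from nemo_rl.models.dtensor.parallelize import to_local_if_dtensor

for name, param in model.named_parameters():
    local_param = to_local_if_dtensor(param)
    print(name, tuple(local_param.shape))
```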
- nemo_rl.models.dtensor.parallelize.clip_grad_by_total_norm_(
- parameters: Union[list[Union[torch.Tensor, torch.distributed.tensor.DTensor]], Union[torch.Tensor, torch.distributed.tensor.DTensor]],
- max_grad_norm: Union[int, float],
- total_norm: float,
)#
Clips gradient of an iterable of parameters by total norm.
Taken and modified from: https://github.com/NVIDIA/Megatron-LM/blob/a695b2bd2a0ca9ca63385a48c41a1c5a033cdd1e/megatron/core/optimizer/clip_grads.py#L138
Note that the gradients are modified in place.
- Parameters:
parameters (Union[list[Union[torch.Tensor, DTensor]], Union[torch.Tensor, DTensor]]) – An iterable of Tensors or DTensors, or a single Tensor or DTensor that will have gradients normalized.
max_grad_norm (Union[float, int]) – Maximum norm of the gradients.
total_norm (float) – The pre-computed total norm of the gradients to use for scaling.
- nemo_rl.models.dtensor.parallelize.get_grad_norm(
- parameters: Union[list[Union[torch.Tensor, torch.distributed.tensor.DTensor]], Union[torch.Tensor, torch.distributed.tensor.DTensor]],
- dp_cp_group: torch.distributed.ProcessGroup,
- tp_group: torch.distributed.ProcessGroup,
- norm_type: Union[int, float] = 2,
- dtype: torch.dtype = torch.float32,
)#
Calculate the norm of gradients.
Taken and modified from: https://github.com/NVIDIA/Megatron-LM/blob/a695b2bd2a0ca9ca63385a48c41a1c5a033cdd1e/megatron/core/optimizer/clip_grads.py#L51
- Parameters:
parameters (Union[list[Union[torch.Tensor, DTensor]], Union[torch.Tensor, DTensor]]) – An iterable of Tensors or DTensors, or a single Tensor or DTensor that will have gradient norm calculated.
dp_cp_group (torch.distributed.ProcessGroup) – Process group for data parallel and context parallel communication.
tp_group (torch.distributed.ProcessGroup) – Process group for tensor parallel communication.
norm_type (Union[int, float]) – Type of the used p-norm. Can be 'inf' for infinity norm.
- Returns:
Total norm of the gradients (viewed as a single vector)
- Return type:
float
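A hedged sketch combining this function with clip_grad_by_total_norm_ above. It assumes the dp/tp device mesh from the earlier example with no separate context-parallel dimension, so the dp group is passed as dp_cp_group.

```python
# Sketch: compute the distributed grad norm, then clip in place with it.
# Assumes the dp/tp mesh from the earlier example; the dp group is used as
# dp_cp_group because no context-parallel dimension exists in that mesh.
import torch

from nemo_rl.models.dtensor.parallelize import clip_grad_by_total_norm_, get_grad_norm

params = [p for p in model.parameters() if p.grad is not None]
total_norm = get_grad_norm(
    params,
    dp_cp_group=mesh["dp"].get_group(),
    tp_group=mesh["tp"].get_group(),
    norm_type=2,
    dtype=torch.float32,
)
clip_grad_by_total_norm_(params, max_grad_norm=1.0, total_norm=total_norm)
```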