nemo_automodel.components.models.deepseek_v4.fsdp#
Module Contents#
Functions#
_hca_param_sync_group_from_1d_mesh | Return the 1D PyTorch FSDP2 group used for HCA graph alignment. |
fully_shard_deepseek_v4 | Apply FSDP2 to DeepSeek-V4 without mixing fp32 and bf16 params in one unit. |
Data#
API#
- nemo_automodel.components.models.deepseek_v4.fsdp._DSV4_CLASS_NAMES#
None
- nemo_automodel.components.models.deepseek_v4.fsdp._DSV4_FP32_MODULE_SUFFIXES#
('attn_hc', 'ffn_hc', 'hc_head', 'lm_head', 'self_attn.sinks_param', 'self_attn.compressor.wkv', 'se…
- nemo_automodel.components.models.deepseek_v4.fsdp._hca_param_sync_group_from_1d_mesh(mesh)#
Return the 1D PyTorch FSDP2 group used for HCA graph alignment.
HCA graph alignment is an FSDP/FSDP2 parameter-sync invariant: ranks that synchronize the same sharded HCA parameters must agree on whether the HCA compressor path participates in backward. This DeepSeek-V4 wrapper gets that domain from its 1D PyTorch FSDP2 mesh. The mesh may be named or unnamed; multi-dimensional meshes need an explicit owner dimension to avoid reducing across unrelated parallel groups. Until that is available, disable HCA graph alignment instead of using a broader or wrong group.
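The rule described above can be sketched in a few lines. This is an illustration under stated assumptions, not the module's actual implementation: `hca_sync_group_sketch` and `FakeMesh` are hypothetical names, and returning `None` to mean "alignment disabled" (including for a missing mesh) is assumed. The only behavior borrowed from PyTorch is that a `DeviceMesh` exposes `ndim` and that `get_group()` on a 1D mesh returns its single process group.

```python
from typing import Any, Optional


def hca_sync_group_sketch(mesh: Any) -> Optional[Any]:
    """Hypothetical stand-in for the 1D-mesh group lookup described above."""
    if mesh is None:
        # Assumption: no mesh means there is no sync domain to align against.
        return None
    if getattr(mesh, "ndim", None) != 1:
        # Multi-dimensional mesh: no explicit owner dimension is known, so
        # disable alignment rather than reduce across unrelated groups.
        return None
    # A 1D mesh (named or unnamed) has exactly one process group.
    return mesh.get_group()


class FakeMesh:
    """Minimal duck-typed mesh so the sketch runs without torch.distributed."""

    def __init__(self, ndim: int, group: str = "group-0"):
        self.ndim = ndim
        self._group = group

    def get_group(self) -> str:
        return self._group
```

With these stand-ins, a 1D mesh yields its group, while a 2D mesh yields `None`, which a caller would treat as "skip HCA graph alignment".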
- nemo_automodel.components.models.deepseek_v4.fsdp._matches_suffix(name: str, suffix: str) bool#
- nemo_automodel.components.models.deepseek_v4.fsdp._has_fsdp_state(module: torch.nn.Module) bool#
- nemo_automodel.components.models.deepseek_v4.fsdp._module_config_model_type(module: torch.nn.Module) str | None#
- nemo_automodel.components.models.deepseek_v4.fsdp._is_deepseek_v4_module(module: torch.nn.Module) bool#
- nemo_automodel.components.models.deepseek_v4.fsdp._floating_param_dtypes(module: torch.nn.Module) set[torch.dtype]#
- nemo_automodel.components.models.deepseek_v4.fsdp._fp32_mp_policy(mp_policy)#
- nemo_automodel.components.models.deepseek_v4.fsdp._fsdp_kwargs_for_module(module: torch.nn.Module, fsdp_kwargs: dict)#
- nemo_automodel.components.models.deepseek_v4.fsdp._fully_shard_once(module: torch.nn.Module, *, mesh, mp_policy, offload_policy, fp32_policy: bool = False, **fsdp_kwargs)#
- nemo_automodel.components.models.deepseek_v4.fsdp._iter_dsv4_fp32_modules(module: torch.nn.Module)#
- nemo_automodel.components.models.deepseek_v4.fsdp._attach_hca_param_sync_group(module: torch.nn.Module, mesh) None#
- nemo_automodel.components.models.deepseek_v4.fsdp.fully_shard_deepseek_v4(module: torch.nn.Module, mesh, mp_policy, offload_policy=None, **fsdp_kwargs)#
Apply FSDP2 to DeepSeek-V4 without mixing fp32 and bf16 params in one unit.
This is intentionally model-specific. DeepSeek-V4 keeps a small set of reference-sensitive tensors in fp32, while the existing DeepEP path expects the transformer block itself to remain the main FSDP unit.
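To make the fp32/bf16 separation concrete, the sketch below partitions qualified module names into fp32-sensitive entries and everything else. It is a rough illustration only: the function names are hypothetical, the suffix tuple contains just the entries visible in the truncated listing above, and the matching rule (whole dotted components) is an assumption about how `_matches_suffix` behaves. In the real wrapper, the fp32 entries would presumably be sharded as their own units with an fp32 policy so they never share a unit with bf16 parameters.

```python
from typing import Iterable, List, Tuple

# Only the suffixes visible in the truncated listing; the real tuple is longer.
VISIBLE_FP32_SUFFIXES = (
    "attn_hc",
    "ffn_hc",
    "hc_head",
    "lm_head",
    "self_attn.sinks_param",
    "self_attn.compressor.wkv",
)


def matches_suffix_sketch(name: str, suffix: str) -> bool:
    # Assumed rule: match on whole dotted components, so "my_attn_hc"
    # does not count as a match for "attn_hc".
    return name == suffix or name.endswith("." + suffix)


def split_fp32_names(
    names: Iterable[str],
    suffixes: Tuple[str, ...] = VISIBLE_FP32_SUFFIXES,
) -> Tuple[List[str], List[str]]:
    """Partition names into (fp32-sensitive, everything else), keeping order."""
    fp32: List[str] = []
    rest: List[str] = []
    for name in names:
        if any(matches_suffix_sketch(name, s) for s in suffixes):
            fp32.append(name)
        else:
            rest.append(name)
    return fp32, rest
```

For example, `split_fp32_names(["model.layers.0.attn_hc", "model.layers.0.self_attn.q_proj"])` places the first name in the fp32 list and the second in the remainder.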