# Training Backends
NeMo RL supports multiple training backends to accommodate different model sizes and hardware configurations.
## Available Backends
- **DTensor (FSDP2)**: PyTorch's next-generation distributed training with improved memory efficiency
- **Megatron**: NVIDIA's high-performance training framework for scaling to large models (>100B parameters)
## Backend Selection
The training backend is automatically determined based on your YAML configuration settings. Here’s how to configure each backend.
### Megatron Backend
To enable Megatron-based training:
1. Add the `megatron_cfg` key to your policy configuration.
2. Set `policy.megatron_cfg.enabled=True`.
3. Refer to `examples/configs/grpo_math_1B_megatron.yaml` for a complete configuration example.
**Note:** When using Megatron, the optimizer and learning rate schedule are configured through `policy.megatron_cfg.optimizer` and `policy.megatron_cfg.scheduler`, respectively.
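The steps above can be sketched as a YAML fragment. Only the `enabled`, `optimizer`, and `scheduler` keys are documented here; the fields inside `optimizer` and `scheduler` are illustrative assumptions, so check the referenced example config for the real names:

```yaml
policy:
  megatron_cfg:
    enabled: true        # selects the Megatron backend
    # Optimizer and LR schedule live under megatron_cfg, not at the policy level.
    # Field names below are illustrative placeholders, not verified keys.
    optimizer:
      lr: 3.0e-4
    scheduler:
      decay_style: cosine
```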
### DTensor Backend
To enable DTensor (FSDP2) training:
1. Set `policy.dtensor_config.enabled=True`.
2. Refer to `examples/configs/grpo_math_1B.yaml` for a configuration example.
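In YAML, the single flag above looks like this (a minimal sketch; any other keys your policy section needs are unchanged):

```yaml
policy:
  dtensor_config:
    enabled: true   # selects the DTensor (FSDP2) backend
```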
## Backend Priority
Megatron takes precedence over DTensor. If both backends are enabled simultaneously (`policy.megatron_cfg.enabled=True` and `policy.dtensor_config.enabled=True`), the Megatron backend is used.
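For example, with both flags set in the same configuration, the Megatron backend wins:

```yaml
policy:
  megatron_cfg:
    enabled: true      # takes precedence
  dtensor_config:
    enabled: true      # ignored because megatron_cfg is enabled
```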
## Configuration Examples
For comprehensive examples of each algorithm and backend, see the `examples/configs/recipes/llm` folder. This directory contains ready-to-use configurations for various supported combinations.