# Tips and Tricks

## Missing Submodules Error

If you forget to initialize the NeMo and Megatron submodules when cloning the NeMo-RL repository, you may run into an error like this:

```
ModuleNotFoundError: No module named 'megatron'
```

If you see this error, the virtual environments were likely built without the submodule sources. To fix this, first initialize the submodules:

```sh
git submodule update --init --recursive
```

and then force a rebuild of the virtual environments by setting `NRL_FORCE_REBUILD_VENVS=true` the next time you launch a run:

```sh
NRL_FORCE_REBUILD_VENVS=true uv run examples/run_grpo.py ...
```
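
To confirm that the rebuilt environments can see the submodules, a quick import check is usually enough. This is a sketch: run it with `uv run python` so it executes inside the project's virtual environment, and note that the `nemo` module name (unlike `megatron`, which appears in the error above) is an assumption about what the submodules provide:

```python
# Hypothetical check; run with: uv run python check_submodules.py
import importlib.util

# Both modules should resolve once the submodules are initialized
# and the virtual environments have been rebuilt.
for module in ("megatron", "nemo"):
    spec = importlib.util.find_spec(module)
    print(f"{module}: {'found' if spec else 'MISSING'}")
```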

## Memory Fragmentation

Significant memory fragmentation can occur when running models without FlashAttention2 support. If out-of-memory (OOM) errors appear after a few training iterations, it may help to tune the allocator settings to reduce fragmentation. To do so, set `max_split_size_mb` in either of the following places (a sketch for confirming that fragmentation is the cause follows the list):

1. Launch training with:

   ```sh
   # This applies globally to all Ray actors
   PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:64 uv run python examples/run_dpo.py ...
   ```

2. Make the change permanent by setting the environment variable in the training configuration:

   ```yaml
   policy:
     # ...
     dtensor_cfg:
       env_vars:
         PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:64"
   ```
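
If you want to confirm that fragmentation, rather than raw memory use, is the problem, you can watch the gap between the bytes the CUDA caching allocator has reserved and the bytes actually allocated to tensors; a large, growing gap across iterations is the typical symptom. Below is a minimal sketch, assuming it is called periodically from a training step on a CUDA worker; the function name is illustrative and not part of NeMo-RL:

```python
import torch

def log_fragmentation(tag: str = "") -> None:
    """Print a rough fragmentation signal from the CUDA caching allocator."""
    stats = torch.cuda.memory_stats()
    allocated = stats["allocated_bytes.all.current"]  # bytes currently handed to tensors
    reserved = stats["reserved_bytes.all.current"]  # bytes currently held by the allocator
    # Memory the allocator holds but cannot reuse at the requested block
    # sizes shows up as a persistent gap between reserved and allocated.
    gib = 2**30
    print(
        f"[{tag}] allocated={allocated / gib:.2f} GiB, "
        f"reserved={reserved / gib:.2f} GiB, "
        f"gap={(reserved - allocated) / gib:.2f} GiB"
    )
```

If the gap keeps growing while allocated memory stays roughly flat, lowering `max_split_size_mb` as shown above is a reasonable next step.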