bridge.recipes.run_plugins#

This module contains plugins built on NeMo-Run’s run.Plugin API. A plugin operates on a configured task and its executor at the same time, and is specific to NeMo-Run. These plugins work by modifying the ConfigContainer configuration overrides.

For run.Script tasks, each plugin supports custom argument conversion via the script_args_converter_fn parameter. This allows users to specify their own conversion function if their training scripts don’t use hydra-style overrides.

Example usage with custom converter:

from typing import List

from megatron.bridge.recipes.run_plugins import (
    PreemptionPlugin,
    PreemptionPluginScriptArgs,
)

# Define a custom converter for argparse-style arguments
def argparse_preemption_converter(args: PreemptionPluginScriptArgs) -> List[str]:
    result = []
    if args.enable_exit_handler:
        result.append("--enable-exit-handler")
    if args.enable_exit_handler_for_data_loader:
        result.append("--enable-exit-handler-dataloader")
    return result

# Use the plugin with the custom converter
plugin = PreemptionPlugin(
    preempt_time=120,
    enable_exit_handler=True,
    script_args_converter_fn=argparse_preemption_converter,
)

If no converter is provided, the plugin will use the default hydra-style converter.

Module Contents#

Classes#

PreemptionPluginScriptArgs

Arguments for PreemptionPlugin to pass to run.Script.

PreemptionPlugin

A plugin for setting up preemption handling and signals.

FaultTolerancePluginScriptArgs

Arguments for FaultTolerancePlugin to pass to run.Script.

FaultTolerancePlugin

A plugin for setting up fault tolerance configuration. This plugin enables workload hang detection, automatic calculation of timeouts used for hang detection, detection of rank(s) terminated due to an error and workload respawning in case of a failure.

NsysPluginScriptArgs

Arguments for NsysPlugin to pass to run.Script.

NsysPlugin

A plugin for nsys profiling configuration.

PyTorchProfilerPluginScriptArgs

Arguments for PyTorchProfilerPlugin to pass to run.Script.

PyTorchProfilerPlugin

A plugin for PyTorch profiler configuration.

WandbPluginScriptArgs

Arguments for WandbPlugin to pass to run.Script.

WandbPlugin

A plugin for setting up Weights & Biases configuration.

PerfEnvPluginScriptArgs

Arguments for PerfEnvPlugin to pass to run.Script.

PerfEnvPlugin

A plugin for setting up performance optimized environments.

Functions#

_format_list_for_override

Render a Python list into a Hydra/CLI-safe list string without spaces.

_default_preemption_converter

Default converter for PreemptionPlugin that generates hydra-style overrides.

_default_fault_tolerance_converter

Default converter for FaultTolerancePlugin that generates hydra-style overrides.

_default_nsys_converter

Default converter for NsysPlugin that generates hydra-style overrides.

_default_pytorch_profiler_converter

Default converter for PyTorchProfilerPlugin that generates hydra-style overrides.

_default_wandb_converter

Default converter for WandbPlugin that generates hydra-style overrides.

_default_perf_env_converter

Default converter for PerfEnvPlugin that generates hydra-style overrides.

Data#

API#

bridge.recipes.run_plugins.logger: logging.Logger#

‘getLogger(…)’

bridge.recipes.run_plugins._format_list_for_override(values: List | int)#

Render a Python list into a Hydra/CLI-safe list string without spaces.

Example: [0, 3] -> “[0,3]”
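A minimal sketch of the rendering described above, not the module's actual implementation. The function name format_list_for_override and the choice to treat a bare int as a one-element list are assumptions made for illustration.

```python
from typing import List, Union


def format_list_for_override(values: Union[List, int]) -> str:
    """Render values as a bracketed, comma-separated string with no spaces."""
    # Assumption: a bare int is treated as a one-element list.
    if isinstance(values, int):
        values = [values]
    return "[" + ",".join(str(v) for v in values) + "]"


print(format_list_for_override([0, 3]))  # [0,3]
```

Omitting the spaces keeps the whole list in a single shell token, so it reaches the training script as one override value.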

class bridge.recipes.run_plugins.PreemptionPluginScriptArgs#

Arguments for PreemptionPlugin to pass to run.Script.

enable_exit_handler: bool#

None

enable_exit_handler_for_data_loader: bool#

None

bridge.recipes.run_plugins._default_preemption_converter(
args: bridge.recipes.run_plugins.PreemptionPluginScriptArgs,
) → List[str]#

Default converter for PreemptionPlugin that generates hydra-style overrides.

class bridge.recipes.run_plugins.PreemptionPlugin#

Bases: nemo_run.Plugin

A plugin for setting up preemption handling and signals.

Parameters:
  • preempt_time (int) – The time, in seconds, before the task’s time limit at which the executor will send a SIGTERM preemption signal. This allows tasks to be gracefully stopped before reaching their time limit, reducing waste and promoting fair resource usage. The default value is 60 seconds (1 minute). This is only supported for run.SlurmExecutor.

  • enable_exit_handler (bool) – Whether to enable the exit signal handler in training config.

  • enable_exit_handler_for_data_loader (bool) – Whether to enable the exit signal handler for data loader.

  • script_args_converter_fn (Optional[Callable]) – A function that takes PreemptionPluginScriptArgs and returns a list of CLI arguments. If not provided, uses the default hydra-style converter.
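On Slurm, a warning signal ahead of the time limit is usually requested through sbatch's --signal option, whose value has the form SIGNAL@SECONDS. The sketch below only builds that string. The helper name slurm_signal_spec is hypothetical, and whether the plugin configures run.SlurmExecutor in exactly this way is an assumption.

```python
def slurm_signal_spec(preempt_time: int = 60, signal_name: str = "TERM") -> str:
    """Build the value for sbatch's --signal option, e.g. 'TERM@60'."""
    if preempt_time < 0:
        raise ValueError("preempt_time must be non-negative")
    # Slurm delivers <signal_name> roughly <preempt_time> seconds before the limit.
    return f"{signal_name}@{preempt_time}"


print(slurm_signal_spec(120))  # TERM@120
```

With preempt_time=120, the job would receive SIGTERM about two minutes before its time limit, leaving the exit handler time to stop the run cleanly.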

preempt_time: int#

60

enable_exit_handler: bool#

True

enable_exit_handler_for_data_loader: bool#

False

script_args_converter_fn: Optional[Callable[[bridge.recipes.run_plugins.PreemptionPluginScriptArgs], List[str]]]#

None

setup(
task: Union[nemo_run.Partial, nemo_run.Script],
executor: nemo_run.Executor,
)#
class bridge.recipes.run_plugins.FaultTolerancePluginScriptArgs#

Arguments for FaultTolerancePlugin to pass to run.Script.

enable_ft_package: bool#

None

calc_ft_timeouts: bool#

None

bridge.recipes.run_plugins._default_fault_tolerance_converter(
args: bridge.recipes.run_plugins.FaultTolerancePluginScriptArgs,
) → List[str]#

Default converter for FaultTolerancePlugin that generates hydra-style overrides.
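To illustrate what "hydra-style overrides" look like, here is a hedged sketch of a converter with the same shape. FaultToleranceArgs is a local stand-in that mirrors the two documented fields, and the "ft." key prefix is an assumption for illustration; the real override keys depend on the training script's config schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FaultToleranceArgs:
    """Stand-in mirroring the documented FaultTolerancePluginScriptArgs fields."""

    enable_ft_package: bool
    calc_ft_timeouts: bool


def hydra_style_ft_converter(args: FaultToleranceArgs) -> List[str]:
    # Assumption: the "ft." key prefix is illustrative, not taken from the source.
    # Booleans are lowercased so they parse as YAML-style true/false.
    return [
        f"ft.enable_ft_package={str(args.enable_ft_package).lower()}",
        f"ft.calc_ft_timeouts={str(args.calc_ft_timeouts).lower()}",
    ]


print(hydra_style_ft_converter(FaultToleranceArgs(True, False)))
# ['ft.enable_ft_package=true', 'ft.calc_ft_timeouts=false']
```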

class bridge.recipes.run_plugins.FaultTolerancePlugin#

Bases: nemo_run.Plugin

A plugin for setting up fault tolerance configuration. This plugin enables workload hang detection, automatic calculation of timeouts used for hang detection, detection of rank(s) terminated due to an error and workload respawning in case of a failure.

Parameters:
  • enable_ft_package (bool) – Enable the fault tolerance package. Default is True.

  • calc_ft_timeouts (bool) – Automatically compute timeouts. Default is True.

  • num_in_job_restarts (int) – Max number of restarts on failure, within the same job. Default is 3.

  • num_job_retries_on_failure (int) – Max number of new job restarts on failure. Default is 2.

  • initial_rank_heartbeat_timeout (int) – Timeouts are time intervals used by a rank monitor to detect that a rank is not alive. This is the max timeout for the initial heartbeat. Default is 1800.

  • rank_heartbeat_timeout (int) – This is the timeout for subsequent heartbeats after the initial heartbeat. Default is 300.

  • script_args_converter_fn (Optional[Callable]) – A function that takes FaultTolerancePluginScriptArgs and returns a list of CLI arguments. If not provided, uses the default hydra-style converter.

enable_ft_package: bool#

True

calc_ft_timeouts: bool#

True

num_in_job_restarts: int#

3

num_job_retries_on_failure: int#

2

initial_rank_heartbeat_timeout: int#

1800

rank_heartbeat_timeout: int#

300

script_args_converter_fn: Optional[Callable[[bridge.recipes.run_plugins.FaultTolerancePluginScriptArgs], List[str]]]#

None

setup(
task: Union[nemo_run.Partial, nemo_run.Script],
executor: nemo_run.Executor,
)#
class bridge.recipes.run_plugins.NsysPluginScriptArgs#

Arguments for NsysPlugin to pass to run.Script.

profile_step_start: int#

None

profile_step_end: int#

None

profile_ranks: List[int]#

None

record_shapes: bool#

None

bridge.recipes.run_plugins._default_nsys_converter(
args: bridge.recipes.run_plugins.NsysPluginScriptArgs,
) → List[str]#

Default converter for NsysPlugin that generates hydra-style overrides.

class bridge.recipes.run_plugins.NsysPlugin#

Bases: nemo_run.Plugin

A plugin for nsys profiling configuration.

The NsysPlugin allows you to profile your run using nsys. You can specify when to start and end the profiling, on which ranks to run the profiling, and what to trace during profiling.

Parameters:
  • profile_step_start (int) – The step at which to start the nsys profiling.

  • profile_step_end (int) – The step at which to end the nsys profiling.

  • profile_ranks (Optional[list[int]]) – The ranks on which to run the nsys profiling. If not specified, profiling will be run on rank 0.

  • nsys_trace (Optional[list[str]]) – The events to trace during profiling. If not specified, ‘nvtx’ and ‘cuda’ events will be traced.

  • record_shapes (bool) – Whether to record tensor shapes. Default is False.

  • nsys_gpu_metrics (bool) – Whether to enable GPU metrics collection. Default is False.

  • script_args_converter_fn (Optional[Callable]) – A function that takes NsysPluginScriptArgs and returns a list of CLI arguments. If not provided, uses the default hydra-style converter.
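For a training script that parses arguments with argparse, a custom converter might look like the sketch below. NsysArgs is a local stand-in mirroring the documented NsysPluginScriptArgs fields, and every flag name is hypothetical. The example also builds a matching parser to check that the generated arguments round-trip.

```python
import argparse
from dataclasses import dataclass
from typing import List


@dataclass
class NsysArgs:
    """Stand-in mirroring the documented NsysPluginScriptArgs fields."""

    profile_step_start: int
    profile_step_end: int
    profile_ranks: List[int]
    record_shapes: bool


def argparse_nsys_converter(args: NsysArgs) -> List[str]:
    # All flag names below are hypothetical; match them to your script.
    result = [
        "--profile-step-start", str(args.profile_step_start),
        "--profile-step-end", str(args.profile_step_end),
    ]
    if args.profile_ranks:
        # One token per rank, to suit an nargs="+" argument.
        result += ["--profile-ranks", *[str(r) for r in args.profile_ranks]]
    if args.record_shapes:
        result.append("--record-shapes")
    return result


# A matching parser, as the training script might define it.
parser = argparse.ArgumentParser()
parser.add_argument("--profile-step-start", type=int)
parser.add_argument("--profile-step-end", type=int)
parser.add_argument("--profile-ranks", type=int, nargs="+")
parser.add_argument("--record-shapes", action="store_true")

cli = argparse_nsys_converter(NsysArgs(10, 12, [0, 3], False))
parsed = parser.parse_args(cli)
print(parsed.profile_ranks)  # [0, 3]
```

A function of this shape could be passed as script_args_converter_fn, since it only reads attributes that the documented argument class also provides.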

profile_step_start: int#

None

profile_step_end: int#

None

profile_ranks: Optional[list[int]]#

None

nsys_trace: Optional[list[str]]#

None

record_shapes: bool#

False

nsys_gpu_metrics: bool#

False

script_args_converter_fn: Optional[Callable[[bridge.recipes.run_plugins.NsysPluginScriptArgs], List[str]]]#

None

setup(
task: Union[nemo_run.Partial, nemo_run.Script],
executor: nemo_run.Executor,
)#
class bridge.recipes.run_plugins.PyTorchProfilerPluginScriptArgs#

Arguments for PyTorchProfilerPlugin to pass to run.Script.

profile_step_start: int#

None

profile_step_end: int#

None

profile_ranks: List[int]#

None

record_memory_history: bool#

None

memory_snapshot_path: str#

None

record_shapes: bool#

None

bridge.recipes.run_plugins._default_pytorch_profiler_converter(
args: bridge.recipes.run_plugins.PyTorchProfilerPluginScriptArgs,
) → List[str]#

Default converter for PyTorchProfilerPlugin that generates hydra-style overrides.

class bridge.recipes.run_plugins.PyTorchProfilerPlugin#

Bases: nemo_run.Plugin

A plugin for PyTorch profiler configuration.

The PyTorchProfilerPlugin allows you to use the built-in PyTorch profiler, whose traces can be viewed in TensorBoard.

Parameters:
  • profile_step_start (int) – The step at which to start profiling.

  • profile_step_end (int) – The step at which to end profiling.

  • profile_ranks (Optional[list[int]]) – The ranks on which to run the profiling. If not specified, profiling will be run on rank 0.

  • record_memory_history (bool) – Whether to record memory history. Default is False.

  • memory_snapshot_path (str) – Path to save memory snapshots. Default is “snapshot.pickle”.

  • record_shapes (bool) – Whether to record tensor shapes. Default is False.

  • script_args_converter_fn (Optional[Callable]) – A function that takes PyTorchProfilerPluginScriptArgs and returns a list of CLI arguments. If not provided, uses the default hydra-style converter.

profile_step_start: int#

None

profile_step_end: int#

None

profile_ranks: Optional[list[int]]#

None

record_memory_history: bool#

False

memory_snapshot_path: str#

‘snapshot.pickle’

record_shapes: bool#

False

script_args_converter_fn: Optional[Callable[[bridge.recipes.run_plugins.PyTorchProfilerPluginScriptArgs], List[str]]]#

None

setup(
task: Union[nemo_run.Partial, nemo_run.Script],
executor: nemo_run.Executor,
)#
class bridge.recipes.run_plugins.WandbPluginScriptArgs#

Arguments for WandbPlugin to pass to run.Script.

project: str#

None

entity: Optional[str]#

None

name: Optional[str]#

None

save_dir: str#

None

bridge.recipes.run_plugins._default_wandb_converter(
args: bridge.recipes.run_plugins.WandbPluginScriptArgs,
) → List[str]#

Default converter for WandbPlugin that generates hydra-style overrides.

class bridge.recipes.run_plugins.WandbPlugin#

Bases: nemo_run.Plugin

A plugin for setting up Weights & Biases configuration.

This plugin sets up Weights & Biases logging configuration. It is only activated if the WANDB_API_KEY environment variable is set, in which case WANDB_API_KEY is also added to the executor’s environment variables. Follow https://docs.wandb.ai/quickstart to retrieve your WANDB_API_KEY.
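The activation rule can be sketched as a small, self-contained helper. The function name wandb_env_for_executor is hypothetical and this is not the plugin's code; it only illustrates gating on WANDB_API_KEY and forwarding the key.

```python
import os
from typing import Dict, Mapping, Optional


def wandb_env_for_executor(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return env vars to forward, or an empty dict if the key is not set."""
    environ = os.environ if environ is None else environ
    api_key = environ.get("WANDB_API_KEY")
    if not api_key:
        # No key: the plugin would stay inactive, so nothing is forwarded.
        return {}
    return {"WANDB_API_KEY": api_key}


print(wandb_env_for_executor({"WANDB_API_KEY": "abc123"}))  # {'WANDB_API_KEY': 'abc123'}
print(wandb_env_for_executor({}))  # {}
```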

Parameters:
  • project (str) – The Weights & Biases project name.

  • name (Optional[str]) – The name for the Weights & Biases run. If not provided, uses experiment name.

  • entity (Optional[str]) – The Weights & Biases entity name.

  • save_dir (str) – Directory to save wandb logs. Default is “/nemo_run/wandb”.

  • log_task_config (bool, optional) – Whether to log the task configuration to wandb. Defaults to True.

  • script_args_converter_fn (Optional[Callable]) – A function that takes WandbPluginScriptArgs and returns a list of CLI arguments. If not provided, uses the default hydra-style converter.

project: str#

None

name: Optional[str]#

None

entity: Optional[str]#

None

save_dir: str#

‘/nemo_run/wandb’

log_task_config: bool#

True

script_args_converter_fn: Optional[Callable[[bridge.recipes.run_plugins.WandbPluginScriptArgs], List[str]]]#

None

setup(
task: Union[nemo_run.Partial, nemo_run.Script],
executor: nemo_run.Executor,
)#
class bridge.recipes.run_plugins.PerfEnvPluginScriptArgs#

Arguments for PerfEnvPlugin to pass to run.Script.

enable_manual_gc: bool#

None

manual_gc_interval: int#

None

bridge.recipes.run_plugins._default_perf_env_converter(
args: bridge.recipes.run_plugins.PerfEnvPluginScriptArgs,
) → List[str]#

Default converter for PerfEnvPlugin that generates hydra-style overrides.

class bridge.recipes.run_plugins.PerfEnvPlugin#

Bases: nemo_run.Plugin

A plugin for setting up performance optimized environments.

Attributes:
  • enable_layernorm_sm_margin (bool) – Set an SM margin for TransformerEngine’s Layernorm so that it does not block DP-level communication overlap.

  • layernorm_sm_margin (int) – The SM margin for TransformerEngine Layernorm.

  • enable_vboost (bool) – Whether to steer more power towards tensor cores via sudo nvidia-smi boost-slider --vboost 1. May not work on all systems.

  • nccl_pp_comm_chunksize (Optional[int]) – Chunk size for P2P communications.

  • gpu_sm100_or_newer (bool) – Whether the GPU is SM100 or a newer architecture.

  • enable_manual_gc (bool) – Enable manual garbage collection for better performance.

  • manual_gc_interval (int) – Interval for manual garbage collection. Default is 100.

  • tp_size (int) – Tensor parallelism size. Default is 1.

  • cp_size (int) – Context parallelism size. Default is 1.

  • pp_size (int) – Pipeline parallelism size. Default is 1.

  • script_args_converter_fn (Optional[Callable]) – A function that takes PerfEnvPluginScriptArgs and returns a list of CLI arguments. If not provided, uses the default hydra-style converter.
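As a rough illustration of how some of the attributes above could translate into environment variables, consider the sketch below. The helper perf_env_vars is hypothetical, and the mapping is an assumption rather than the plugin's confirmed behavior. The variable names used are TransformerEngine's NVTE_FWD_LAYERNORM_SM_MARGIN and NVTE_BWD_LAYERNORM_SM_MARGIN and NCCL's NCCL_P2P_NET_CHUNKSIZE.

```python
from typing import Dict, Optional


def perf_env_vars(
    enable_layernorm_sm_margin: bool = True,
    layernorm_sm_margin: int = 16,
    nccl_pp_comm_chunksize: Optional[int] = None,
) -> Dict[str, str]:
    """Collect environment variables for the settings described above."""
    env: Dict[str, str] = {}
    if enable_layernorm_sm_margin:
        # Assumption: the same margin applies to forward and backward passes.
        env["NVTE_FWD_LAYERNORM_SM_MARGIN"] = str(layernorm_sm_margin)
        env["NVTE_BWD_LAYERNORM_SM_MARGIN"] = str(layernorm_sm_margin)
    if nccl_pp_comm_chunksize is not None:
        # Assumption: the chunk size is passed to NCCL unchanged.
        env["NCCL_P2P_NET_CHUNKSIZE"] = str(nccl_pp_comm_chunksize)
    return env


print(perf_env_vars())
# {'NVTE_FWD_LAYERNORM_SM_MARGIN': '16', 'NVTE_BWD_LAYERNORM_SM_MARGIN': '16'}
```

Environment variable values are strings, so the integer settings are converted before being added to the mapping.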

enable_layernorm_sm_margin: bool#

True

layernorm_sm_margin: int#

16

enable_vboost: bool#

False

nccl_pp_comm_chunksize: Optional[int]#

None

gpu_sm100_or_newer: bool#

False

enable_manual_gc: bool#

True

manual_gc_interval: int#

100

tp_size: int#

1

cp_size: int#

1

pp_size: int#

1

script_args_converter_fn: Optional[Callable[[bridge.recipes.run_plugins.PerfEnvPluginScriptArgs], List[str]]]#

None

get_vboost_srun_cmd(nodes, job_dir)#

Create the vboost command (sudo nvidia-smi boost-slider --vboost 1).

setup(
task: Union[nemo_run.Partial, nemo_run.Script],
executor: nemo_run.Executor,
)#

Enable the performance environment settings.