bridge.recipes.run_plugins#

Module Contents#

Classes#

PreemptionPlugin

A plugin for setting up preemption handling and signals.

FaultTolerancePlugin

A plugin for setting up fault tolerance configuration. This plugin enables workload hang detection, automatic calculation of the timeouts used for hang detection, detection of rank(s) terminated due to an error, and workload respawning after a failure.

NsysPlugin

A plugin for nsys profiling configuration.

PyTorchProfilerPlugin

A plugin for PyTorch profiler configuration.

WandbPlugin

A plugin for setting up Weights & Biases configuration.

PerfEnvPlugin

A plugin for setting up performance optimized environments.

Functions#

_format_list_for_override

Render a Python list into a Hydra/CLI-safe list string without spaces.

Data#

API#

bridge.recipes.run_plugins.logger: logging.Logger#

‘getLogger(…)’

bridge.recipes.run_plugins._format_list_for_override(values: List | int)#

Render a Python list into a Hydra/CLI-safe list string without spaces.

Example: [0, 3] -> “[0,3]”
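
The rendering described above can be approximated in a few lines. This is a minimal sketch, not the module's actual implementation: the name `format_list_for_override` (without the leading underscore) is a stand-in, and treating a bare int as a one-element list is an assumption based only on the `List | int` signature.

```python
from typing import List, Union


def format_list_for_override(values: Union[List, int]) -> str:
    """Render values as a bracketed, comma-separated string with no spaces."""
    # Assumption: a bare int is treated as a one-element list.
    if isinstance(values, int):
        values = [values]
    return "[" + ",".join(str(v) for v in values) + "]"


print(format_list_for_override([0, 3]))  # [0,3]
```

Omitting the spaces keeps the value a single shell token, so it can be passed as one CLI override argument without extra quoting.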

class bridge.recipes.run_plugins.PreemptionPlugin#

Bases: nemo_run.Plugin

A plugin for setting up preemption handling and signals.

Parameters:
  • preempt_time (int) – The time, in seconds, before the task’s time limit at which the executor will send a SIGTERM preemption signal. This allows tasks to be gracefully stopped before reaching their time limit, reducing waste and promoting fair resource usage. The default value is 60 seconds (1 minute). This is only supported for run.SlurmExecutor.

  • enable_exit_handler (bool) – Whether to enable the exit signal handler in the training config.
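
The timing implied by preempt_time is simple arithmetic. The helper below is hypothetical and only illustrates when the SIGTERM would arrive relative to the task's time limit; it does not call nemo_run or Slurm.

```python
def sigterm_offset_seconds(time_limit_s: int, preempt_time: int = 60) -> int:
    """Seconds after task start at which the SIGTERM is expected."""
    if preempt_time < 0:
        raise ValueError("preempt_time must be non-negative")
    # The signal is sent preempt_time seconds before the time limit;
    # clamp at 0 for tasks shorter than preempt_time.
    return max(time_limit_s - preempt_time, 0)


# A 4-hour task with the default preempt_time of 60 seconds.
print(sigterm_offset_seconds(4 * 3600))  # 14340
```

A larger preempt_time leaves more room to write a checkpoint before the time limit, at the cost of less training time per allocation.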

preempt_time: int#

60

enable_exit_handler: bool#

True

enable_exit_handler_for_data_loader: bool#

False

setup(
task: Union[nemo_run.Partial, nemo_run.Script],
executor: nemo_run.Executor,
)#
class bridge.recipes.run_plugins.FaultTolerancePlugin#

Bases: nemo_run.Plugin

A plugin for setting up fault tolerance configuration. This plugin enables workload hang detection, automatic calculation of the timeouts used for hang detection, detection of rank(s) terminated due to an error, and workload respawning after a failure.

Parameters:
  • enable_ft_package (bool) – Enable the fault tolerance package. Default is True.

  • calc_ft_timeouts (bool) – Automatically compute timeouts. Default is True.

  • num_in_job_restarts (int) – Max number of restarts on failure, within the same job. Default is 3.

  • num_job_retries_on_failure (int) – Max number of new job restarts on failure. Default is 2.

  • initial_rank_heartbeat_timeout (int) – Timeouts are time intervals used by a rank monitor to detect that a rank is not alive. This is the max timeout for the initial heartbeat. Default is 1800.

  • rank_heartbeat_timeout (int) – The timeout for subsequent heartbeats after the initial heartbeat. Default is 300.
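
The relationship between the two heartbeat timeouts can be sketched as follows. This is an illustrative model only, not the fault tolerance package's implementation: `RankHeartbeatState` and `is_rank_hung` are hypothetical names, and the timeouts are assumed to be measured in seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RankHeartbeatState:
    """Hypothetical per-rank bookkeeping kept by a rank monitor."""

    started_at: float
    last_heartbeat_at: Optional[float] = None  # None until the first heartbeat


def is_rank_hung(
    state: RankHeartbeatState,
    now: float,
    initial_rank_heartbeat_timeout: int = 1800,
    rank_heartbeat_timeout: int = 300,
) -> bool:
    """Return True if the rank has exceeded the applicable timeout."""
    if state.last_heartbeat_at is None:
        # Still waiting for the first heartbeat: the longer initial timeout applies.
        return now - state.started_at > initial_rank_heartbeat_timeout
    # After the first heartbeat, the shorter timeout applies.
    return now - state.last_heartbeat_at > rank_heartbeat_timeout


starting = RankHeartbeatState(started_at=0.0)
print(is_rank_hung(starting, now=1000.0))  # False

running = RankHeartbeatState(started_at=0.0, last_heartbeat_at=2000.0)
print(is_rank_hung(running, now=2400.0))  # True
```

The initial timeout is larger because the first heartbeat typically has to wait for startup work such as model initialization and checkpoint loading.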

enable_ft_package: bool#

True

calc_ft_timeouts: bool#

True

num_in_job_restarts: int#

3

num_job_retries_on_failure: int#

2

initial_rank_heartbeat_timeout: int#

1800

rank_heartbeat_timeout: int#

300

setup(
task: Union[nemo_run.Partial, nemo_run.Script],
executor: nemo_run.Executor,
)#
class bridge.recipes.run_plugins.NsysPlugin#

Bases: nemo_run.Plugin

A plugin for nsys profiling configuration.

The NsysPlugin allows you to profile your run using nsys. You can specify when to start and end the profiling, on which ranks to run the profiling, and what to trace during profiling.

Parameters:
  • profile_step_start (int) – The step at which to start the nsys profiling.

  • profile_step_end (int) – The step at which to end the nsys profiling.

  • profile_ranks (Optional[list[int]]) – The ranks on which to run the nsys profiling. If not specified, profiling will be run on rank 0.

  • nsys_trace (Optional[list[str]]) – The events to trace during profiling. If not specified, ‘nvtx’ and ‘cuda’ events will be traced.

  • record_shapes (bool) – Whether to record tensor shapes. Default is False.
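
The way the step window and rank list select what gets profiled can be sketched as a small predicate. `should_profile` is a hypothetical helper, and treating profile_step_end as exclusive is an assumption; this page does not state whether the end step is included.

```python
from typing import List, Optional


def should_profile(
    step: int,
    rank: int,
    profile_step_start: int,
    profile_step_end: int,
    profile_ranks: Optional[List[int]] = None,
) -> bool:
    """Return True if this rank should be profiled at this step."""
    # Documented default: profile rank 0 when no ranks are given.
    ranks = profile_ranks if profile_ranks is not None else [0]
    # Assumption: the start step is inclusive and the end step is exclusive.
    return rank in ranks and profile_step_start <= step < profile_step_end


print(should_profile(step=10, rank=0, profile_step_start=10, profile_step_end=12))  # True
print(should_profile(step=10, rank=1, profile_step_start=10, profile_step_end=12))  # False
```

Keeping the window short and the rank list small limits both the profiling overhead and the size of the resulting report files.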

profile_step_start: int#

None

profile_step_end: int#

None

profile_ranks: Optional[list[int]]#

None

nsys_trace: Optional[list[str]]#

None

record_shapes: bool#

False

nsys_gpu_metrics: bool#

False

setup(
task: Union[nemo_run.Partial, nemo_run.Script],
executor: nemo_run.Executor,
)#
class bridge.recipes.run_plugins.PyTorchProfilerPlugin#

Bases: nemo_run.Plugin

A plugin for PyTorch profiler configuration.

The PyTorchProfilerPlugin allows you to use the built-in PyTorch profiler, whose traces can be viewed in TensorBoard.

Parameters:
  • profile_step_start (int) – The step at which to start profiling.

  • profile_step_end (int) – The step at which to end profiling.

  • profile_ranks (Optional[list[int]]) – The ranks on which to run the profiling. If not specified, profiling will be run on rank 0.

  • record_memory_history (bool) – Whether to record memory history. Default is False.

  • memory_snapshot_path (str) – Path to save memory snapshots. Default is “snapshot.pickle”.

  • record_shapes (bool) – Whether to record tensor shapes. Default is False.
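
One plausible way to turn a step window into arguments for `torch.profiler.schedule` (which accepts `wait`, `warmup`, `active`, and `repeat`) is sketched below. The mapping itself is an assumption and is not confirmed by this page; `profiler_schedule_kwargs` is a hypothetical helper, and torch is deliberately not imported so the snippet runs anywhere.

```python
from typing import Dict


def profiler_schedule_kwargs(profile_step_start: int, profile_step_end: int) -> Dict[str, int]:
    """Translate a [start, end) step window into profiler schedule arguments."""
    if profile_step_end <= profile_step_start:
        raise ValueError("profile_step_end must be greater than profile_step_start")
    # Assumption: spend one step warming up just before the window, if possible.
    warmup = 1 if profile_step_start > 0 else 0
    return {
        "wait": max(profile_step_start - warmup, 0),  # idle steps before warmup
        "warmup": warmup,
        "active": profile_step_end - profile_step_start,  # recorded steps
        "repeat": 1,  # record the window once
    }


print(profiler_schedule_kwargs(10, 12))
# {'wait': 9, 'warmup': 1, 'active': 2, 'repeat': 1}
```

With torch installed, the dictionary could be passed as `torch.profiler.schedule(**kwargs)`.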

profile_step_start: int#

None

profile_step_end: int#

None

profile_ranks: Optional[list[int]]#

None

record_memory_history: bool#

False

memory_snapshot_path: str#

‘snapshot.pickle’

record_shapes: bool#

False

setup(
task: Union[nemo_run.Partial, nemo_run.Script],
executor: nemo_run.Executor,
)#
class bridge.recipes.run_plugins.WandbPlugin#

Bases: nemo_run.Plugin

A plugin for setting up Weights & Biases configuration.

This plugin sets up Weights & Biases logging configuration. The plugin is only activated if the WANDB_API_KEY environment variable is set. The WANDB_API_KEY value is also added to the executor’s environment variables. Follow https://docs.wandb.ai/quickstart to retrieve your WANDB_API_KEY.

Parameters:
  • project (str) – The Weights & Biases project name.

  • name (Optional[str]) – The name for the Weights & Biases run. If not provided, the experiment name is used.

  • entity (Optional[str]) – The Weights & Biases entity name.

  • save_dir (str) – Directory to save wandb logs. Default is “/nemo_run/wandb”.

  • log_task_config (bool, optional) – Whether to log the task configuration to wandb. Defaults to True.
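
The activation rule and the name fallback described above can be sketched as follows. `resolve_wandb_settings` and the shape of the returned dictionary are hypothetical; the real plugin configures the task and executor directly.

```python
from typing import Dict, Mapping, Optional


def resolve_wandb_settings(
    env: Mapping[str, str],
    project: str,
    experiment_name: str,
    name: Optional[str] = None,
    entity: Optional[str] = None,
    save_dir: str = "/nemo_run/wandb",
) -> Optional[Dict[str, object]]:
    """Return logging settings, or None when WANDB_API_KEY is not set."""
    api_key = env.get("WANDB_API_KEY")
    if not api_key:
        # Without the key the plugin is documented as inactive.
        return None
    return {
        "executor_env": {"WANDB_API_KEY": api_key},  # forwarded to the executor
        "project": project,
        "name": name or experiment_name,  # fall back to the experiment name
        "entity": entity,
        "save_dir": save_dir,
    }


print(resolve_wandb_settings({}, project="demo", experiment_name="exp-1"))  # None
settings = resolve_wandb_settings(
    {"WANDB_API_KEY": "dummy-key"}, project="demo", experiment_name="exp-1"
)
print(settings["name"])  # exp-1
```

In practice `os.environ` would be passed as `env`; a plain dictionary is used here so the example is deterministic.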

project: str#

None

name: Optional[str]#

None

entity: Optional[str]#

None

save_dir: str#

‘/nemo_run/wandb’

log_task_config: bool#

True

setup(
task: Union[nemo_run.Partial, nemo_run.Script],
executor: nemo_run.Executor,
)#
class bridge.recipes.run_plugins.PerfEnvPlugin#

Bases: nemo_run.Plugin

A plugin for setting up performance optimized environments.

Attributes:
  • enable_layernorm_sm_margin (bool) – Whether to set an SM margin for TransformerEngine’s LayerNorm so that it does not block DP-level communication overlap.

  • layernorm_sm_margin (int) – The SM margin for TransformerEngine’s LayerNorm.

  • enable_vboost (bool) – Whether to steer more power towards tensor cores via sudo nvidia-smi boost-slider --vboost 1. May not work on all systems.

  • nccl_pp_comm_chunksize (Optional[int]) – Chunk size for P2P communications.

  • gpu_sm100_or_newer (bool) – Whether the GPU is an SM100 or newer architecture.

  • enable_manual_gc (bool) – Whether to enable manual garbage collection for better performance.

  • manual_gc_interval (int) – Interval for manual garbage collection. Default is 100.
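
The idea behind the manual garbage collection settings can be shown with the standard `gc` module. This is a generic sketch of the technique, not the plugin's code: `run_steps` is a hypothetical loop, and both disabling automatic collection and measuring the interval in training steps are assumptions.

```python
import gc


def run_steps(num_steps: int, manual_gc_interval: int = 100) -> int:
    """Run a dummy loop, collecting garbage only every manual_gc_interval steps."""
    collections = 0
    gc_was_enabled = gc.isenabled()
    gc.disable()  # avoid automatic collections pausing steps at unpredictable points
    try:
        for step in range(1, num_steps + 1):
            # ... one training step would run here ...
            if step % manual_gc_interval == 0:
                gc.collect()
                collections += 1
    finally:
        # Restore the interpreter's previous setting.
        if gc_was_enabled:
            gc.enable()
    return collections


print(run_steps(250))  # 2
```

Collecting at the same step on every rank keeps the pauses aligned, so one rank's collection does not stall the others at a collective operation.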

enable_layernorm_sm_margin: bool#

True

layernorm_sm_margin: int#

16

enable_vboost: bool#

False

nccl_pp_comm_chunksize: Optional[int]#

None

gpu_sm100_or_newer: bool#

False

enable_manual_gc: bool#

True

manual_gc_interval: int#

100

tp_size: int#

1

cp_size: int#

1

pp_size: int#

1

get_vboost_srun_cmd(nodes, job_dir)#

Create the srun command that enables vboost by running sudo nvidia-smi boost-slider --vboost 1.
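
A command of this kind could be assembled as shown below. This is a sketch under stated assumptions, not the method's actual output: `build_vboost_srun_cmd` is a hypothetical name, running one task per node is assumed, and the `vboost.out` and `vboost.err` log file names are invented for illustration.

```python
import os
import shlex


def build_vboost_srun_cmd(nodes: int, job_dir: str) -> str:
    """Build an srun command line that enables vboost once on every node."""
    vboost = "sudo nvidia-smi boost-slider --vboost 1"
    parts = [
        "srun",
        f"--nodes={nodes}",
        "--ntasks-per-node=1",  # assumption: one task per node is sufficient
        "--output",
        shlex.quote(os.path.join(job_dir, "vboost.out")),  # hypothetical log name
        "--error",
        shlex.quote(os.path.join(job_dir, "vboost.err")),  # hypothetical log name
        "bash",
        "-c",
        shlex.quote(vboost),  # keep the whole command as one argument to bash -c
    ]
    return " ".join(parts)


print(build_vboost_srun_cmd(2, "/tmp/job"))
```

Quoting the inner command with `shlex.quote` ensures the shell passes it to `bash -c` as a single argument.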

setup(
task: Union[nemo_run.Partial, nemo_run.Script],
executor: nemo_run.Executor,
)#

Enable the performance environment settings.