bridge.recipes.run_plugins#
This file contains plugins based on NeMo-Run’s run.Plugin API. Plugins are specific to NeMo-Run and operate on both a configured task and an executor at the same time. These plugins work by modifying the ConfigContainer configuration overrides.
For run.Script tasks, each plugin supports custom argument conversion via the script_args_converter_fn
parameter. This allows users to specify their own conversion function if their training scripts don’t
use hydra-style overrides.
Example usage with custom converter:
from typing import List

from megatron.bridge.recipes.run_plugins import (
    PreemptionPlugin,
    PreemptionPluginScriptArgs,
)

# Define a custom converter for argparse-style arguments
def argparse_preemption_converter(args: PreemptionPluginScriptArgs) -> List[str]:
    result = []
    if args.enable_exit_handler:
        result.append("--enable-exit-handler")
    if args.enable_exit_handler_for_data_loader:
        result.append("--enable-exit-handler-dataloader")
    return result

# Use the plugin with the custom converter
plugin = PreemptionPlugin(
    preempt_time=120,
    enable_exit_handler=True,
    script_args_converter_fn=argparse_preemption_converter,
)
If no converter is provided, the plugin will use the default hydra-style converter.
Module Contents#
Classes#
| PreemptionPluginScriptArgs | Arguments for PreemptionPlugin to pass to run.Script. |
| PreemptionPlugin | A plugin for setting up preemption handling and signals. |
| FaultTolerancePluginScriptArgs | Arguments for FaultTolerancePlugin to pass to run.Script. |
| FaultTolerancePlugin | A plugin for setting up fault tolerance configuration. This plugin enables workload hang detection, automatic calculation of timeouts used for hang detection, detection of rank(s) terminated due to an error and workload respawning in case of a failure. |
| NsysPluginScriptArgs | Arguments for NsysPlugin to pass to run.Script. |
| NsysPlugin | A plugin for nsys profiling configuration. |
| PyTorchProfilerPluginScriptArgs | Arguments for PyTorchProfilerPlugin to pass to run.Script. |
| PyTorchProfilerPlugin | A plugin for PyTorch profiler configuration. |
| WandbPluginScriptArgs | Arguments for WandbPlugin to pass to run.Script. |
| WandbPlugin | A plugin for setting up Weights & Biases configuration. |
| PerfEnvPluginScriptArgs | Arguments for PerfEnvPlugin to pass to run.Script. |
| PerfEnvPlugin | A plugin for setting up performance optimized environments. |
Functions#
| _format_list_for_override | Render a Python list into a Hydra/CLI-safe list string without spaces. |
| _default_preemption_converter | Default converter for PreemptionPlugin that generates hydra-style overrides. |
| _default_fault_tolerance_converter | Default converter for FaultTolerancePlugin that generates hydra-style overrides. |
| _default_nsys_converter | Default converter for NsysPlugin that generates hydra-style overrides. |
| _default_pytorch_profiler_converter | Default converter for PyTorchProfilerPlugin that generates hydra-style overrides. |
| _default_wandb_converter | Default converter for WandbPlugin that generates hydra-style overrides. |
| _default_perf_env_converter | Default converter for PerfEnvPlugin that generates hydra-style overrides. |
Data#
API#
- bridge.recipes.run_plugins.logger: logging.Logger#
‘getLogger(…)’
- bridge.recipes.run_plugins._format_list_for_override(values: List | int)#
Render a Python list into a Hydra/CLI-safe list string without spaces.
Example: [0, 3] -> “[0,3]”
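The space-free rendering matters because an unquoted override containing a space would be split into two separate arguments by the shell. A minimal sketch of such a formatter follows; the function name is a stand-in, and treating a bare int as a one-element list is an assumption, since the source only shows the list case.

```python
from typing import List, Union


def format_list_for_override(values: Union[List, int]) -> str:
    """Render values as a bracketed, comma-separated string with no spaces."""
    # Assumption: a bare int is treated as a one-element list.
    if isinstance(values, int):
        values = [values]
    return "[" + ",".join(str(v) for v in values) + "]"


print(format_list_for_override([0, 3]))  # [0,3]
```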
- class bridge.recipes.run_plugins.PreemptionPluginScriptArgs#
Arguments for PreemptionPlugin to pass to run.Script.
- enable_exit_handler: bool#
None
- enable_exit_handler_for_data_loader: bool#
None
- bridge.recipes.run_plugins._default_preemption_converter( ) → List[str]#
Default converter for PreemptionPlugin that generates hydra-style overrides.
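For orientation, a hydra-style converter returns tokens of the form dotted.key=value instead of argparse flags. The sketch below uses a stand-in dataclass with the same fields as PreemptionPluginScriptArgs; the override keys are assumed for illustration and may not match the keys the real default converter emits.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreemptionArgs:
    """Stand-in mirroring the fields of PreemptionPluginScriptArgs."""

    enable_exit_handler: bool
    enable_exit_handler_for_data_loader: bool


def hydra_style_preemption_converter(args: PreemptionArgs) -> List[str]:
    """Turn the script args into `key=value` override tokens."""
    # Assumed keys, chosen for illustration only.
    handler = str(args.enable_exit_handler).lower()
    loader = str(args.enable_exit_handler_for_data_loader).lower()
    return [
        f"train.exit_signal_handler={handler}",
        f"train.exit_signal_handler_for_dataloader={loader}",
    ]
```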
- class bridge.recipes.run_plugins.PreemptionPlugin#
Bases:
nemo_run.Plugin
A plugin for setting up preemption handling and signals.
- Parameters:
preempt_time (int) – The time, in seconds, before the task’s time limit at which the executor sends a SIGTERM preemption signal. This lets tasks stop gracefully before reaching their time limit, reducing waste and promoting fair resource usage. Default is 60 seconds (1 minute). Only supported for run.SlurmExecutor.
enable_exit_handler (bool) – Whether to enable the exit signal handler in the training config.
enable_exit_handler_for_data_loader (bool) – Whether to enable the exit signal handler for the data loader.
script_args_converter_fn (Optional[Callable]) – A function that takes PreemptionPluginScriptArgs and returns a list of CLI arguments. If not provided, uses the default hydra-style converter.
- preempt_time: int#
60
- enable_exit_handler: bool#
True
- enable_exit_handler_for_data_loader: bool#
False
- script_args_converter_fn: Optional[Callable[[bridge.recipes.run_plugins.PreemptionPluginScriptArgs], List[str]]]#
None
- setup(task: Union[nemo_run.Partial, nemo_run.Script], executor: nemo_run.Executor)#
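The SIGTERM sent preempt_time seconds before the time limit is only useful if the training process reacts to it. The sketch below shows the general pattern with the standard signal module: a handler records the request and the loop stops at the next safe point. GracefulExit and train are hypothetical names; this is not the exit handler the training config enables, only an illustration of the idea.

```python
import signal


class GracefulExit:
    """Records that SIGTERM arrived so a loop can stop at a safe point."""

    def __init__(self) -> None:
        self.requested = False
        # Must be called from the main thread; keep the old handler for restore().
        self._previous = signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame) -> None:
        self.requested = True

    def restore(self) -> None:
        """Reinstall whatever handler was active before this object was created."""
        signal.signal(signal.SIGTERM, self._previous)


def train(max_steps: int, exit_flag: GracefulExit) -> int:
    """Run up to max_steps, stopping early once SIGTERM has been seen."""
    step = 0
    while step < max_steps and not exit_flag.requested:
        step += 1  # placeholder for one training step
    return step
```

In a real job the point where the loop exits is where a checkpoint would be saved, so that the requeued job can resume from it.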
- class bridge.recipes.run_plugins.FaultTolerancePluginScriptArgs#
Arguments for FaultTolerancePlugin to pass to run.Script.
- enable_ft_package: bool#
None
- calc_ft_timeouts: bool#
None
- bridge.recipes.run_plugins._default_fault_tolerance_converter( ) → List[str]#
Default converter for FaultTolerancePlugin that generates hydra-style overrides.
- class bridge.recipes.run_plugins.FaultTolerancePlugin#
Bases:
nemo_run.Plugin
A plugin for setting up fault tolerance configuration. This plugin enables workload hang detection, automatic calculation of timeouts used for hang detection, detection of rank(s) terminated due to an error and workload respawning in case of a failure.
- Parameters:
enable_ft_package (bool) – Enable the fault tolerance package. Default is True.
calc_ft_timeouts (bool) – Automatically compute timeouts. Default is True.
num_in_job_restarts (int) – Max number of restarts on failure, within the same job. Default is 3.
num_job_retries_on_failure (int) – Max number of new job restarts on failure. Default is 2.
initial_rank_heartbeat_timeout (int) – Timeouts are time intervals used by a rank monitor to detect that a rank is not alive. This is the max timeout for the initial heartbeat. Default is 1800.
rank_heartbeat_timeout (int) – This is the timeout for subsequent heartbeats after the initial heartbeat. Default is 300.
script_args_converter_fn (Optional[Callable]) – A function that takes FaultTolerancePluginScriptArgs and returns a list of CLI arguments. If not provided, uses the default hydra-style converter.
- enable_ft_package: bool#
True
- calc_ft_timeouts: bool#
True
- num_in_job_restarts: int#
3
- num_job_retries_on_failure: int#
2
- initial_rank_heartbeat_timeout: int#
1800
- rank_heartbeat_timeout: int#
300
- script_args_converter_fn: Optional[Callable[[bridge.recipes.run_plugins.FaultTolerancePluginScriptArgs], List[str]]]#
None
- setup(task: Union[nemo_run.Partial, nemo_run.Script], executor: nemo_run.Executor)#
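The two heartbeat timeouts can be read as follows: a rank gets a generous window to send its first heartbeat, and a shorter window between later ones. The sketch below is a simplified model of that check, not the fault tolerance package's rank monitor; the class and function names are hypothetical and the time unit is assumed to be seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatTimeouts:
    """Defaults copied from the FaultTolerancePlugin parameters above."""

    initial_rank_heartbeat_timeout: float = 1800.0
    rank_heartbeat_timeout: float = 300.0


def rank_looks_hung(
    timeouts: HeartbeatTimeouts,
    started_at: float,
    last_heartbeat_at: Optional[float],
    now: float,
) -> bool:
    """Return True if the rank has been silent for longer than allowed."""
    if last_heartbeat_at is None:
        # No heartbeat yet: apply the (longer) initial timeout.
        return now - started_at > timeouts.initial_rank_heartbeat_timeout
    # At least one heartbeat seen: apply the subsequent-heartbeat timeout.
    return now - last_heartbeat_at > timeouts.rank_heartbeat_timeout
```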
- class bridge.recipes.run_plugins.NsysPluginScriptArgs#
Arguments for NsysPlugin to pass to run.Script.
- profile_step_start: int#
None
- profile_step_end: int#
None
- profile_ranks: List[int]#
None
- record_shapes: bool#
None
- bridge.recipes.run_plugins._default_nsys_converter( ) → List[str]#
Default converter for NsysPlugin that generates hydra-style overrides.
- class bridge.recipes.run_plugins.NsysPlugin#
Bases:
nemo_run.Plugin
A plugin for nsys profiling configuration.
The NsysPlugin allows you to profile your run using nsys. You can specify when to start and end the profiling, on which ranks to run the profiling, and what to trace during profiling.
- Parameters:
profile_step_start (int) – The step at which to start the nsys profiling.
profile_step_end (int) – The step at which to end the nsys profiling.
profile_ranks (Optional[list[int]]) – The ranks on which to run the nsys profiling. If not specified, profiling will be run on rank 0.
nsys_trace (Optional[list[str]]) – The events to trace during profiling. If not specified, ‘nvtx’ and ‘cuda’ events will be traced.
record_shapes (bool) – Whether to record tensor shapes. Default is False.
nsys_gpu_metrics (bool) – Whether to enable GPU metrics collection. Default is False.
script_args_converter_fn (Optional[Callable]) – A function that takes NsysPluginScriptArgs and returns a list of CLI arguments. If not provided, uses the default hydra-style converter.
- profile_step_start: int#
None
- profile_step_end: int#
None
- profile_ranks: Optional[list[int]]#
None
- nsys_trace: Optional[list[str]]#
None
- record_shapes: bool#
False
- nsys_gpu_metrics: bool#
False
- script_args_converter_fn: Optional[Callable[[bridge.recipes.run_plugins.NsysPluginScriptArgs], List[str]]]#
None
- setup(task: Union[nemo_run.Partial, nemo_run.Script], executor: nemo_run.Executor)#
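When the training script uses argparse, a custom converter has to render list-valued fields such as profile_ranks in a form the parser accepts. The sketch below pairs a converter with a matching parser to show the round trip. NsysArgs is a stand-in with the same fields as NsysPluginScriptArgs, and the flag names are hypothetical and would need to match whatever the real script defines.

```python
import argparse
from dataclasses import dataclass
from typing import List


@dataclass
class NsysArgs:
    """Stand-in mirroring the fields of NsysPluginScriptArgs."""

    profile_step_start: int
    profile_step_end: int
    profile_ranks: List[int]
    record_shapes: bool


def argparse_nsys_converter(args: NsysArgs) -> List[str]:
    """Render the args as argparse-style tokens (hypothetical flag names)."""
    result = [
        "--profile-step-start", str(args.profile_step_start),
        "--profile-step-end", str(args.profile_step_end),
    ]
    if args.profile_ranks:
        # nargs="+" expects one token per rank after the flag.
        result += ["--profile-ranks", *[str(r) for r in args.profile_ranks]]
    if args.record_shapes:
        result.append("--record-shapes")
    return result


def build_parser() -> argparse.ArgumentParser:
    """Parser a training script might define to consume those tokens."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--profile-step-start", type=int)
    parser.add_argument("--profile-step-end", type=int)
    parser.add_argument("--profile-ranks", type=int, nargs="+", default=[0])
    parser.add_argument("--record-shapes", action="store_true")
    return parser
```

Omitting the flag when the rank list is empty lets the parser fall back to rank 0, matching the documented default.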
- class bridge.recipes.run_plugins.PyTorchProfilerPluginScriptArgs#
Arguments for PyTorchProfilerPlugin to pass to run.Script.
- profile_step_start: int#
None
- profile_step_end: int#
None
- profile_ranks: List[int]#
None
- record_memory_history: bool#
None
- memory_snapshot_path: str#
None
- record_shapes: bool#
None
- bridge.recipes.run_plugins._default_pytorch_profiler_converter( ) → List[str]#
Default converter for PyTorchProfilerPlugin that generates hydra-style overrides.
- class bridge.recipes.run_plugins.PyTorchProfilerPlugin#
Bases:
nemo_run.Plugin
A plugin for PyTorch profiler configuration.
The PyTorchProfilerPlugin allows you to use the built-in PyTorch profiler, whose output can be viewed in TensorBoard.
- Parameters:
profile_step_start (int) – The step at which to start profiling.
profile_step_end (int) – The step at which to end profiling.
profile_ranks (Optional[list[int]]) – The ranks on which to run the profiling. If not specified, profiling will be run on rank 0.
record_memory_history (bool) – Whether to record memory history. Default is False.
memory_snapshot_path (str) – Path to save memory snapshots. Default is “snapshot.pickle”.
record_shapes (bool) – Whether to record tensor shapes. Default is False.
script_args_converter_fn (Optional[Callable]) – A function that takes PyTorchProfilerPluginScriptArgs and returns a list of CLI arguments. If not provided, uses the default hydra-style converter.
- profile_step_start: int#
None
- profile_step_end: int#
None
- profile_ranks: Optional[list[int]]#
None
- record_memory_history: bool#
False
- memory_snapshot_path: str#
‘snapshot.pickle’
- record_shapes: bool#
False
- script_args_converter_fn: Optional[Callable[[bridge.recipes.run_plugins.PyTorchProfilerPluginScriptArgs], List[str]]]#
None
- setup(task: Union[nemo_run.Partial, nemo_run.Script], executor: nemo_run.Executor)#
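The start step, end step, and rank list together define a profiling window. A minimal sketch of how a training loop might gate profiling on them is shown below; should_profile is a hypothetical helper, and treating profile_step_end as inclusive is an assumption the source does not confirm.

```python
from typing import List, Optional


def should_profile(
    step: int,
    rank: int,
    profile_step_start: int,
    profile_step_end: int,
    profile_ranks: Optional[List[int]] = None,
) -> bool:
    """Return True if this rank should be profiling at this step."""
    # Documented default: profile rank 0 when no ranks are given.
    ranks = profile_ranks if profile_ranks is not None else [0]
    # Assumption: the end step is part of the window.
    return rank in ranks and profile_step_start <= step <= profile_step_end
```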
- class bridge.recipes.run_plugins.WandbPluginScriptArgs#
Arguments for WandbPlugin to pass to run.Script.
- project: str#
None
- entity: Optional[str]#
None
- name: Optional[str]#
None
- save_dir: str#
None
- bridge.recipes.run_plugins._default_wandb_converter( ) → List[str]#
Default converter for WandbPlugin that generates hydra-style overrides.
- class bridge.recipes.run_plugins.WandbPlugin#
Bases:
nemo_run.Plugin
A plugin for setting up Weights & Biases configuration.
This plugin sets up Weights & Biases logging configuration. The plugin is only activated if the WANDB_API_KEY environment variable is set. The WANDB_API_KEY environment variable is also set in the executor’s environment variables. Follow https://docs.wandb.ai/quickstart to retrieve your WANDB_API_KEY.
- Parameters:
project (str) – The Weights & Biases project name.
name (Optional[str]) – The name for the Weights & Biases run. If not provided, uses experiment name.
entity (Optional[str]) – The Weights & Biases entity name.
save_dir (str) – Directory to save wandb logs. Default is “/nemo_run/wandb”.
log_task_config (bool, optional) – Whether to log the task configuration to wandb. Defaults to True.
script_args_converter_fn (Optional[Callable]) – A function that takes WandbPluginScriptArgs and returns a list of CLI arguments. If not provided, uses the default hydra-style converter.
- project: str#
None
- name: Optional[str]#
None
- entity: Optional[str]#
None
- save_dir: str#
‘/nemo_run/wandb’
- log_task_config: bool#
True
- script_args_converter_fn: Optional[Callable[[bridge.recipes.run_plugins.WandbPluginScriptArgs], List[str]]]#
None
- setup(task: Union[nemo_run.Partial, nemo_run.Script], executor: nemo_run.Executor)#
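The activation rule described for WandbPlugin (do nothing unless WANDB_API_KEY is set, otherwise forward the key to the executor) can be sketched with plain mappings. The function name is hypothetical and executor_env is a plain dict standing in for the executor's environment variables; the real plugin's setup logic may differ.

```python
from typing import Dict, Mapping


def propagate_wandb_key(environ: Mapping[str, str], executor_env: Dict[str, str]) -> bool:
    """Copy WANDB_API_KEY into executor_env; return whether the plugin is active."""
    api_key = environ.get("WANDB_API_KEY")
    if not api_key:
        # Key missing or empty: leave the executor environment untouched.
        return False
    executor_env["WANDB_API_KEY"] = api_key
    return True
```

In practice the first argument would be os.environ from the launching process.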
- class bridge.recipes.run_plugins.PerfEnvPluginScriptArgs#
Arguments for PerfEnvPlugin to pass to run.Script.
- enable_manual_gc: bool#
None
- manual_gc_interval: int#
None
- bridge.recipes.run_plugins._default_perf_env_converter( ) → List[str]#
Default converter for PerfEnvPlugin that generates hydra-style overrides.
- class bridge.recipes.run_plugins.PerfEnvPlugin#
Bases:
nemo_run.Plugin
A plugin for setting up performance optimized environments.
- Attributes:
enable_layernorm_sm_margin (bool) – Whether to set an SM margin for TransformerEngine’s Layernorm so that it does not block DP-level communication overlap.
layernorm_sm_margin (int) – The SM margin for TransformerEngine’s Layernorm.
enable_vboost (bool) – Whether to steer more power towards tensor cores via sudo nvidia-smi boost-slider --vboost 1. May not work on all systems.
nccl_pp_comm_chunksize (Optional[int]) – Chunk size for P2P communications.
gpu_sm100_or_newer (bool) – Whether the GPU is SM100 or a newer architecture.
enable_manual_gc (bool) – Whether to enable manual garbage collection for better performance.
manual_gc_interval (int) – Interval for manual garbage collection. Default is 100.
tp_size (int) – Tensor parallelism size. Default is 1.
cp_size (int) – Context parallelism size. Default is 1.
pp_size (int) – Pipeline parallelism size. Default is 1.
script_args_converter_fn (Optional[Callable]) – A function that takes PerfEnvPluginScriptArgs and returns a list of CLI arguments. If not provided, uses the default hydra-style converter.
- enable_layernorm_sm_margin: bool#
True
- layernorm_sm_margin: int#
16
- enable_vboost: bool#
False
- nccl_pp_comm_chunksize: Optional[int]#
None
- gpu_sm100_or_newer: bool#
False
- enable_manual_gc: bool#
True
- manual_gc_interval: int#
100
- tp_size: int#
1
- cp_size: int#
1
- pp_size: int#
1
- script_args_converter_fn: Optional[Callable[[bridge.recipes.run_plugins.PerfEnvPluginScriptArgs], List[str]]]#
None
- get_vboost_srun_cmd(nodes, job_dir)#
Create the vboost command (sudo nvidia-smi boost-slider --vboost 1).
- setup(task: Union[nemo_run.Partial, nemo_run.Script], executor: nemo_run.Executor)#
Enable the performance environment settings.
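One common way to implement manual garbage collection in a training loop is to switch off Python's automatic collector and collect at a fixed step interval instead, so that pauses happen at predictable points. The sketch below illustrates that pattern with the standard gc module. ManualGC is a hypothetical helper, and both the idea that the interval is counted in training steps and the idea that the configured training loop works this way are assumptions.

```python
import gc


class ManualGC:
    """Disable automatic GC and collect every `interval` steps instead."""

    def __init__(self, interval: int = 100) -> None:
        if interval <= 0:
            raise ValueError("interval must be positive")
        self.interval = interval
        self.collections = 0
        gc.disable()  # stop the automatic, allocation-triggered collector

    def step(self, step: int) -> None:
        """Call once per training step with the current step number."""
        if step % self.interval == 0:
            gc.collect()
            self.collections += 1

    def close(self) -> None:
        """Re-enable automatic collection once training is done."""
        gc.enable()
```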