bridge.recipes.run_plugins#
Module Contents#
Classes#
PreemptionPlugin | A plugin for setting up preemption handling and signals.
FaultTolerancePlugin | A plugin for setting up fault tolerance configuration. This plugin enables workload hang detection, automatic calculation of timeouts used for hang detection, detection of rank(s) terminated due to an error and workload respawning in case of a failure.
NsysPlugin | A plugin for nsys profiling configuration.
PyTorchProfilerPlugin | A plugin for PyTorch profiler configuration.
WandbPlugin | A plugin for setting up Weights & Biases configuration.
PerfEnvPlugin | A plugin for setting up performance optimized environments.
Functions#
_format_list_for_override | Render a Python list into a Hydra/CLI-safe list string without spaces.
Data#
API#
- bridge.recipes.run_plugins.logger: logging.Logger#
‘getLogger(…)’
- bridge.recipes.run_plugins._format_list_for_override(values: List | int)#
Render a Python list into a Hydra/CLI-safe list string without spaces.
Example: [0, 3] -> “[0,3]”
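The rendering described above can be sketched as follows. This is a minimal stand-in, not the module's actual implementation: the name `format_list_for_override` is hypothetical, and treating a bare `int` as a one-element list is an assumption based only on the `List | int` signature.

```python
from typing import List, Union


def format_list_for_override(values: Union[List, int]) -> str:
    """Hypothetical stand-in for _format_list_for_override."""
    # Assumption: a bare int is treated as a one-element list.
    if isinstance(values, int):
        values = [values]
    # Join with "," only, so the rendered string contains no spaces
    # that a shell or Hydra override parser could split on.
    return "[" + ",".join(str(v) for v in values) + "]"


print(format_list_for_override([0, 3]))  # prints [0,3]
```

Python's default `str([0, 3])` yields `[0, 3]` with a space, which is why the list is joined by hand.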
- class bridge.recipes.run_plugins.PreemptionPlugin#
Bases: nemo_run.Plugin
A plugin for setting up preemption handling and signals.
- Parameters:
preempt_time (int) – The time, in seconds, before the task’s time limit at which the executor sends a SIGTERM preemption signal. This lets tasks stop gracefully before reaching their time limit, reducing waste and promoting fair resource usage. Default is 60 seconds (1 minute). Only supported for run.SlurmExecutor.
enable_exit_handler (bool) – Whether to enable the exit signal handler in the training config.
- preempt_time: int#
60
- enable_exit_handler: bool#
True
- enable_exit_handler_for_data_loader: bool#
False
- setup(task: Union[nemo_run.Partial, nemo_run.Script], executor: nemo_run.Executor)#
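To make the meaning of `preempt_time` concrete, the sketch below computes when the SIGTERM would arrive relative to job start. Both helper names are hypothetical. The `TERM@<seconds>` string follows the syntax of Slurm's `--signal` option; whether the plugin passes the value to Slurm in exactly this form is an assumption.

```python
def preemption_signal_offset(time_limit_s: int, preempt_time: int = 60) -> int:
    """Seconds after job start at which the SIGTERM would be sent."""
    if not 0 <= preempt_time <= time_limit_s:
        raise ValueError("preempt_time must lie within the job's time limit")
    return time_limit_s - preempt_time


def slurm_signal_spec(preempt_time: int = 60) -> str:
    """Render a value in the form accepted by Slurm's --signal option."""
    # Assumption: the executor forwards the value as TERM@<seconds>.
    return f"TERM@{preempt_time}"


# A 4-hour job with the default preempt_time receives SIGTERM at 14340 s.
print(preemption_signal_offset(4 * 3600))  # prints 14340
print(slurm_signal_spec())                 # prints TERM@60
```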
- class bridge.recipes.run_plugins.FaultTolerancePlugin#
Bases: nemo_run.Plugin
A plugin for setting up fault tolerance configuration. This plugin enables workload hang detection, automatic calculation of timeouts used for hang detection, detection of rank(s) terminated due to an error and workload respawning in case of a failure.
- Parameters:
enable_ft_package (bool) – Enable the fault tolerance package. Default is True.
calc_ft_timeouts (bool) – Automatically compute timeouts. Default is True.
num_in_job_restarts (int) – Max number of restarts on failure, within the same job. Default is 3.
num_job_retries_on_failure (int) – Max number of new job restarts on failure. Default is 2.
initial_rank_heartbeat_timeout (int) – Timeouts are time intervals used by a rank monitor to detect that a rank is not alive. This is the max timeout for the initial heartbeat. Default is 1800.
rank_heartbeat_timeout (int) – The timeout for subsequent heartbeats after the initial heartbeat. Default is 300.
- enable_ft_package: bool#
True
- calc_ft_timeouts: bool#
True
- num_in_job_restarts: int#
3
- num_job_retries_on_failure: int#
2
- initial_rank_heartbeat_timeout: int#
1800
- rank_heartbeat_timeout: int#
300
- setup(task: Union[nemo_run.Partial, nemo_run.Script], executor: nemo_run.Executor)#
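The two heartbeat timeouts can be read as a simple liveness check: a rank gets a longer grace period before its first heartbeat and a shorter one afterwards. The sketch below illustrates that reading only. The function `rank_is_hung` and its arguments are hypothetical, and the fault tolerance package's actual detection logic may differ.

```python
from typing import Optional


def rank_is_hung(
    now: float,
    started_at: float,
    last_heartbeat: Optional[float],
    initial_rank_heartbeat_timeout: int = 1800,
    rank_heartbeat_timeout: int = 300,
) -> bool:
    """Return True if a rank should be considered not alive (illustrative)."""
    if last_heartbeat is None:
        # No heartbeat received yet: apply the initial timeout.
        return now - started_at > initial_rank_heartbeat_timeout
    # At least one heartbeat seen: apply the subsequent-heartbeat timeout.
    return now - last_heartbeat > rank_heartbeat_timeout


# 1000 s after start with no heartbeat is still within the initial timeout.
print(rank_is_hung(now=1000, started_at=0, last_heartbeat=None))  # prints False
# 400 s of silence after a heartbeat exceeds the 300 s timeout.
print(rank_is_hung(now=2000, started_at=0, last_heartbeat=1600))  # prints True
```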
- class bridge.recipes.run_plugins.NsysPlugin#
Bases: nemo_run.Plugin
A plugin for nsys profiling configuration.
The NsysPlugin allows you to profile your run using nsys. You can specify when to start and end the profiling, on which ranks to run the profiling, and what to trace during profiling.
- Parameters:
profile_step_start (int) – The step at which to start the nsys profiling.
profile_step_end (int) – The step at which to end the nsys profiling.
profile_ranks (Optional[list[int]]) – The ranks on which to run the nsys profiling. If not specified, profiling will be run on rank 0.
nsys_trace (Optional[list[str]]) – The events to trace during profiling. If not specified, ‘nvtx’ and ‘cuda’ events will be traced.
record_shapes (bool) – Whether to record tensor shapes. Default is False.
- profile_step_start: int#
None
- profile_step_end: int#
None
- profile_ranks: Optional[list[int]]#
None
- nsys_trace: Optional[list[str]]#
None
- record_shapes: bool#
False
- nsys_gpu_metrics: bool#
False
- setup(task: Union[nemo_run.Partial, nemo_run.Script], executor: nemo_run.Executor)#
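The documented fallbacks for `profile_ranks` and `nsys_trace` can be sketched as below. The helper names are hypothetical. Rendering the trace list as a `--trace=` argument mirrors the nsys CLI option of that name, but whether the plugin assembles the flag this way is an assumption.

```python
from typing import List, Optional, Tuple


def resolve_nsys_defaults(
    profile_ranks: Optional[List[int]] = None,
    nsys_trace: Optional[List[str]] = None,
) -> Tuple[List[int], List[str]]:
    """Apply the documented defaults when a value is left as None."""
    ranks = profile_ranks if profile_ranks is not None else [0]
    trace = nsys_trace if nsys_trace is not None else ["nvtx", "cuda"]
    return ranks, trace


def nsys_trace_flag(trace: List[str]) -> str:
    """Render the trace list as a comma-separated nsys --trace argument."""
    return "--trace=" + ",".join(trace)


ranks, trace = resolve_nsys_defaults()
print(ranks)                   # prints [0]
print(nsys_trace_flag(trace))  # prints --trace=nvtx,cuda
```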
- class bridge.recipes.run_plugins.PyTorchProfilerPlugin#
Bases: nemo_run.Plugin
A plugin for PyTorch profiler configuration.
The PyTorchProfilerPlugin allows you to use the built-in PyTorch profiler which can be viewed in TensorBoard.
- Parameters:
profile_step_start (int) – The step at which to start profiling.
profile_step_end (int) – The step at which to end profiling.
profile_ranks (Optional[list[int]]) – The ranks on which to run the profiling. If not specified, profiling will be run on rank 0.
record_memory_history (bool) – Whether to record memory history. Default is False.
memory_snapshot_path (str) – Path to save memory snapshots. Default is “snapshot.pickle”.
record_shapes (bool) – Whether to record tensor shapes. Default is False.
- profile_step_start: int#
None
- profile_step_end: int#
None
- profile_ranks: Optional[list[int]]#
None
- record_memory_history: bool#
False
- memory_snapshot_path: str#
‘snapshot.pickle’
- record_shapes: bool#
False
- setup(task: Union[nemo_run.Partial, nemo_run.Script], executor: nemo_run.Executor)#
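The step window and rank selection can be pictured as a predicate evaluated at every training step. The sketch below is illustrative only: `should_profile` is a hypothetical name, and treating the start step as inclusive and the end step as exclusive is an assumption that this reference does not confirm.

```python
from typing import List, Optional


def should_profile(
    step: int,
    rank: int,
    profile_step_start: int,
    profile_step_end: int,
    profile_ranks: Optional[List[int]] = None,
) -> bool:
    """Return True if this rank would be profiled at this step (illustrative)."""
    # Documented default: profile rank 0 when no ranks are given.
    ranks = profile_ranks if profile_ranks is not None else [0]
    # Assumption: half-open window [profile_step_start, profile_step_end).
    return rank in ranks and profile_step_start <= step < profile_step_end


print(should_profile(step=10, rank=0, profile_step_start=10, profile_step_end=12))  # prints True
print(should_profile(step=10, rank=1, profile_step_start=10, profile_step_end=12))  # prints False
```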
- class bridge.recipes.run_plugins.WandbPlugin#
Bases: nemo_run.Plugin
A plugin for setting up Weights & Biases configuration.
This plugin sets up Weights & Biases logging configuration. The plugin is only activated if the WANDB_API_KEY environment variable is set. The WANDB_API_KEY environment variable will also be set in the executor’s environment variables. Follow https://docs.wandb.ai/quickstart to retrieve your WANDB_API_KEY.
- Parameters:
project (str) – The Weights & Biases project name.
name (Optional[str]) – The name for the Weights & Biases run. If not provided, uses experiment name.
entity (Optional[str]) – The Weights & Biases entity name.
save_dir (str) – Directory to save wandb logs. Default is “/nemo_run/wandb”.
log_task_config (bool, optional) – Whether to log the task configuration to wandb. Defaults to True.
- project: str#
None
- name: Optional[str]#
None
- entity: Optional[str]#
None
- save_dir: str#
‘/nemo_run/wandb’
- log_task_config: bool#
True
- setup(task: Union[nemo_run.Partial, nemo_run.Script], executor: nemo_run.Executor)#
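The activation rule (do nothing unless `WANDB_API_KEY` is set, otherwise forward the key to the executor's environment) can be sketched as below. The helper `wandb_env_vars` is hypothetical, and treating an empty string as "not set" is an assumption.

```python
import os
from typing import Dict, Mapping, Optional


def wandb_env_vars(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the variables to forward to the executor, or {} if inactive."""
    environ = os.environ if environ is None else environ
    key = environ.get("WANDB_API_KEY")
    # Assumption: an empty value counts as unset, so the plugin stays inactive.
    if not key:
        return {}
    return {"WANDB_API_KEY": key}


print(wandb_env_vars({}))                            # prints {}
print(wandb_env_vars({"WANDB_API_KEY": "example"}))  # prints {'WANDB_API_KEY': 'example'}
```

Passing the environment as an argument keeps the check easy to test without modifying `os.environ`.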
- class bridge.recipes.run_plugins.PerfEnvPlugin#
Bases: nemo_run.Plugin
A plugin for setting up performance optimized environments.
- Attributes:
enable_layernorm_sm_margin (bool) – Set an SM margin for TransformerEngine’s Layernorm so that it does not block DP-level communication overlap.
layernorm_sm_margin (int) – The SM margin for TransformerEngine Layernorm.
enable_vboost (bool) – Whether to steer more power towards tensor cores via sudo nvidia-smi boost-slider --vboost 1. May not work on all systems.
nccl_pp_comm_chunksize (Optional[int]) – Chunk size for P2P communications.
gpu_sm100_or_newer (bool) – Whether the GPU is SM100 or a newer architecture.
enable_manual_gc (bool) – Enable manual garbage collection for better performance.
manual_gc_interval (int) – Interval for manual garbage collection. Default is 100.
- enable_layernorm_sm_margin: bool#
True
- layernorm_sm_margin: int#
16
- enable_vboost: bool#
False
- nccl_pp_comm_chunksize: Optional[int]#
None
- gpu_sm100_or_newer: bool#
False
- enable_manual_gc: bool#
True
- manual_gc_interval: int#
100
- tp_size: int#
1
- cp_size: int#
1
- pp_size: int#
1
- get_vboost_srun_cmd(nodes, job_dir)#
Create the vboost sudo nvidia-smi boost-slider --vboost 1 command.
- setup(task: Union[nemo_run.Partial, nemo_run.Script], executor: nemo_run.Executor)#
Enable the performance environment settings.
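As an illustration of what `enable_manual_gc` and `manual_gc_interval` refer to, the sketch below disables Python's automatic garbage collector and collects explicitly at a fixed interval. It shows the general technique only. The function name is hypothetical, interpreting the interval as a number of training steps is an assumption, and the way the plugin wires this into the training config is not shown in this reference.

```python
import gc
from typing import List


def run_steps_with_manual_gc(num_steps: int, manual_gc_interval: int = 100) -> List[int]:
    """Run a dummy loop, collecting garbage only every manual_gc_interval steps."""
    collected_at = []
    gc.disable()  # stop automatic collections from pausing arbitrary steps
    try:
        for step in range(1, num_steps + 1):
            # ... a real training step would run here ...
            if step % manual_gc_interval == 0:
                gc.collect()  # collect at a predictable point instead
                collected_at.append(step)
    finally:
        gc.enable()  # restore normal behaviour afterwards
    return collected_at


print(run_steps_with_manual_gc(250))  # prints [100, 200]
```

Collecting at the same step on every rank keeps garbage collection pauses aligned, so that one rank does not stall the others at an arbitrary point.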