bridge.training.profiling

Profiling utilities for the training loop.

Module Contents

Functions

should_profile_rank

Check if current rank should be profiled.

handle_profiling_step

Handle profiling logic for a single training step.

handle_profiling_stop

Handle profiling cleanup at designated stop iteration.

initialize_pytorch_profiler

Initialize PyTorch profiler with config settings.

start_nsys_profiler

Start CUDA profiler for nsys profiling.

stop_nsys_profiler

Stop CUDA profiler for nsys profiling.

Data

API

bridge.training.profiling.TNvtxContext

None

bridge.training.profiling.should_profile_rank(
config: Optional[megatron.bridge.training.config.ProfilingConfig],
rank: int,
) → bool

Check if current rank should be profiled.

Parameters:
  • config – Profiling configuration

  • rank – Current process rank

Returns:

True if this rank should be profiled
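As an illustration, a rank check of this kind could be sketched as follows. The `ProfilingConfigSketch` dataclass and its `profile_ranks` field are stand-ins assumed for this example, and returning False for a missing config is also an assumption; the real `ProfilingConfig` and its handling may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProfilingConfigSketch:
    """Hypothetical stand-in for ProfilingConfig; the field name is assumed."""

    profile_ranks: List[int] = field(default_factory=lambda: [0])


def should_profile_rank_sketch(
    config: Optional[ProfilingConfigSketch], rank: int
) -> bool:
    """Return True only when profiling is configured and `rank` is selected."""
    if config is None:
        # Assumption: no config means profiling is disabled on every rank.
        return False
    return rank in config.profile_ranks
```

With this sketch, a config listing ranks 1 and 3 enables profiling on those two processes only, so the other ranks skip the profiler overhead.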

bridge.training.profiling.handle_profiling_step(
config: Optional[megatron.bridge.training.config.ProfilingConfig],
iteration: int,
rank: int,
pytorch_prof: Optional[torch.profiler.profile],
) → Optional[bridge.training.profiling.TNvtxContext]

Handle profiling logic for a single training step.

Parameters:
  • config – Profiling configuration

  • iteration – Current training iteration

  • rank – Current process rank

  • pytorch_prof – PyTorch profiler instance (if using PyTorch profiler)

Returns:

NVTX context if nsys profiling was started at this step, None otherwise

bridge.training.profiling.handle_profiling_stop(
config: Optional[megatron.bridge.training.config.ProfilingConfig],
iteration: int,
rank: int,
pytorch_prof: Optional[torch.profiler.profile],
nsys_nvtx_context: Optional[bridge.training.profiling.TNvtxContext] = None,
) → None

Handle profiling cleanup at designated stop iteration.

Parameters:
  • config – Profiling configuration

  • iteration – Current training iteration

  • rank – Current process rank

  • pytorch_prof – PyTorch profiler instance (if using PyTorch profiler)

  • nsys_nvtx_context – NVTX context from handle_profiling_step (if using nsys profiler)
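Because the step handler returns a context only on the iteration where nsys profiling starts, the caller has to keep that value until the stop iteration. The sketch below shows one way a loop might do this. The two `*_stub` functions, the string token, and the `START_STEP`/`STOP_STEP` window are hypothetical stand-ins so the example runs without the library; they are not the real handlers.

```python
from typing import List, Optional

events: List[str] = []

START_STEP, STOP_STEP = 2, 4  # assumed profiling window, for illustration only


def handle_profiling_step_stub(iteration: int) -> Optional[str]:
    """Stand-in: returns a context token only on the step where profiling starts."""
    if iteration == START_STEP:
        events.append(f"start@{iteration}")
        return "nvtx-context"
    return None


def handle_profiling_stop_stub(
    iteration: int, nsys_nvtx_context: Optional[str]
) -> None:
    """Stand-in: cleans up only on the designated stop iteration."""
    if iteration == STOP_STEP and nsys_nvtx_context is not None:
        events.append(f"stop@{iteration}")


def run_loop(num_iterations: int) -> None:
    nvtx_context: Optional[str] = None
    for iteration in range(num_iterations):
        maybe_context = handle_profiling_step_stub(iteration)
        if maybe_context is not None:
            # Keep the context; later steps return None and must not overwrite it.
            nvtx_context = maybe_context
        # ... forward/backward/optimizer step would run here ...
        handle_profiling_stop_stub(iteration, nvtx_context)


run_loop(6)
```

Overwriting the stored context with each step's return value would discard it on the very next iteration, leaving nothing to hand to the stop handler.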

bridge.training.profiling.initialize_pytorch_profiler(
config: megatron.bridge.training.config.ProfilingConfig,
tensorboard_dir: str,
) → torch.profiler.profile

Initialize PyTorch profiler with config settings.

Parameters:
  • config – Profiling configuration

  • tensorboard_dir – Directory for tensorboard outputs

Returns:

Initialized (but not started) PyTorch profiler
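One piece of such an initialization is turning a step window into a profiler schedule. The helper below is a sketch under stated assumptions: the `step_start`/`step_end` parameters and the one-step warmup are illustrative choices, not settings confirmed by this page. The returned keys match the keyword arguments of `torch.profiler.schedule`, so the dictionary could be passed to it as `torch.profiler.schedule(**kwargs)`.

```python
from typing import Dict


def profiler_schedule_kwargs(step_start: int, step_end: int) -> Dict[str, int]:
    """Map a [step_start, step_end) window to wait/warmup/active counts."""
    if step_start < 0 or step_end <= step_start:
        raise ValueError("expected 0 <= step_start < step_end")
    # Assumption: one warmup step right before the window, when there is room.
    warmup = 1 if step_start > 0 else 0
    return {
        "wait": max(step_start - 1, 0),  # idle steps before warmup
        "warmup": warmup,                # steps traced but discarded
        "active": step_end - step_start, # steps actually recorded
        "repeat": 1,                     # record the window once
    }
```

For a window of steps 10 and 11, this yields nine idle steps, one warmup step, and two recorded steps. A profiler built with such a schedule still needs its `start()` method called (and `step()` once per iteration) before it records anything, which matches the "initialized but not started" return value described above.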

bridge.training.profiling.start_nsys_profiler(
config: megatron.bridge.training.config.ProfilingConfig,
) → bridge.training.profiling.TNvtxContext

Start CUDA profiler for nsys profiling.

Parameters:
  • config – Profiling configuration

Returns:

NVTX context manager that must be passed to stop_nsys_profiler

bridge.training.profiling.stop_nsys_profiler(
nvtx_context: Optional[bridge.training.profiling.TNvtxContext],
) → None

Stop CUDA profiler for nsys profiling.

Parameters:
  • nvtx_context – NVTX context manager returned from start_nsys_profiler
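The start/stop pair hands a context manager across function calls instead of using a `with` block, since the profiled region spans several training iterations. The sketch below illustrates that pattern with the standard library only. `fake_nvtx_range` and both `*_sketch` functions are hypothetical stand-ins; the real functions presumably also start and stop the CUDA profiler, which is omitted here.

```python
from contextlib import contextmanager
from typing import ContextManager, Iterator, List, Optional

log: List[str] = []


@contextmanager
def fake_nvtx_range() -> Iterator[None]:
    """Stand-in for an NVTX-emitting context manager."""
    log.append("enter")
    try:
        yield
    finally:
        log.append("exit")


def start_profiler_sketch() -> ContextManager[None]:
    """Enter the context manually and return it so a later call can exit it."""
    ctx = fake_nvtx_range()
    ctx.__enter__()
    return ctx


def stop_profiler_sketch(ctx: Optional[ContextManager[None]]) -> None:
    """Exit the context if one was started; a None argument is a no-op."""
    if ctx is not None:
        ctx.__exit__(None, None, None)


ctx = start_profiler_sketch()
log.append("train")         # profiled iterations would run here
stop_profiler_sketch(ctx)
stop_profiler_sketch(None)  # safe when profiling never started
```

Accepting `None` in the stop function mirrors the `Optional` parameter in the signature above and lets callers invoke it unconditionally.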