bridge.training.profiling#
Profiling utilities for the training loop.
Module Contents#
Functions#
should_profile_rank | Check if current rank should be profiled.
handle_profiling_step | Handle profiling logic for a single training step.
handle_profiling_stop | Handle profiling cleanup at designated stop iteration.
initialize_pytorch_profiler | Initialize PyTorch profiler with config settings.
start_nsys_profiler | Start CUDA profiler for nsys profiling.
stop_nsys_profiler | Stop CUDA profiler for nsys profiling.
Data#
TNvtxContext
API#
- bridge.training.profiling.TNvtxContext#
None
- bridge.training.profiling.should_profile_rank(
- config: Optional[megatron.bridge.training.config.ProfilingConfig],
- rank: int,
- )
Check if current rank should be profiled.
- Parameters:
config – Profiling configuration
rank – Current process rank
- Returns:
True if this rank should be profiled
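The rank check can be pictured with a minimal sketch. It does not import `megatron.bridge`: `SketchProfilingConfig` and its `profile_ranks` field are hypothetical stand-ins for `ProfilingConfig`, and treating a `None` config as "do not profile" is an assumption suggested by the `Optional` type in the signature.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchProfilingConfig:
    """Hypothetical stand-in for ProfilingConfig (the field name is an assumption)."""

    profile_ranks: List[int] = field(default_factory=lambda: [0])


def sketch_should_profile_rank(
    config: Optional[SketchProfilingConfig], rank: int
) -> bool:
    """Return True only when a config exists and lists this rank."""
    if config is None:
        # Assumption: no config means profiling is disabled on every rank.
        return False
    return rank in config.profile_ranks


cfg = SketchProfilingConfig(profile_ranks=[0, 3])
decisions = {rank: sketch_should_profile_rank(cfg, rank) for rank in range(4)}
print(decisions)
```

Restricting profiling to a few ranks keeps trace volume manageable in multi-process jobs, which is presumably why the check is exposed as its own helper.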
- bridge.training.profiling.handle_profiling_step(
- config: Optional[megatron.bridge.training.config.ProfilingConfig],
- iteration: int,
- rank: int,
- pytorch_prof: Optional[torch.profiler.profile],
- )
Handle profiling logic for a single training step.
- Parameters:
config – Profiling configuration
iteration – Current training iteration
rank – Current process rank
pytorch_prof – PyTorch profiler instance (if using PyTorch profiler)
- Returns:
NVTX context if nsys profiling was started at this step, None otherwise
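A rough sketch of the per-step dispatch described above, written with stand-ins so it runs without PyTorch or Megatron Bridge. Every field name on `SketchConfig` is an assumption, `FakeTorchProfiler` only mimics the `step()` method that `torch.profiler.profile` provides, and the string `"nvtx-context"` is a placeholder for the real NVTX context object.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SketchConfig:
    """Hypothetical stand-in for ProfilingConfig; all field names are assumptions."""

    use_pytorch_profiler: bool = False
    use_nsys_profiler: bool = False
    profile_step_start: int = 2
    profile_ranks: Tuple[int, ...] = (0,)


class FakeTorchProfiler:
    """Stand-in exposing step(), which torch.profiler.profile uses to advance."""

    def __init__(self) -> None:
        self.steps = 0

    def step(self) -> None:
        self.steps += 1


def sketch_handle_step(
    config: Optional[SketchConfig],
    iteration: int,
    rank: int,
    pytorch_prof: Optional[FakeTorchProfiler],
) -> Optional[str]:
    """Return a context token when nsys capture starts at this step, else None."""
    if config is None or rank not in config.profile_ranks:
        return None
    if config.use_pytorch_profiler and pytorch_prof is not None:
        pytorch_prof.step()  # advance the profiler once per training step
    if config.use_nsys_profiler and iteration == config.profile_step_start:
        return "nvtx-context"  # placeholder for the real NVTX context
    return None


nsys_cfg = SketchConfig(use_nsys_profiler=True, profile_step_start=2)
returned = [sketch_handle_step(nsys_cfg, it, 0, None) for it in range(4)]
print(returned)

torch_cfg = SketchConfig(use_pytorch_profiler=True)
prof = FakeTorchProfiler()
for it in range(3):
    sketch_handle_step(torch_cfg, it, 0, prof)
print(prof.steps)
```

In this sketch only the start iteration yields a non-`None` value, which matches the documented return contract; the caller has to hold on to that value for the later stop call.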
- bridge.training.profiling.handle_profiling_stop(
- config: Optional[megatron.bridge.training.config.ProfilingConfig],
- iteration: int,
- rank: int,
- pytorch_prof: Optional[torch.profiler.profile],
- nsys_nvtx_context: Optional[bridge.training.profiling.TNvtxContext] = None,
- )
Handle profiling cleanup at designated stop iteration.
- Parameters:
config – Profiling configuration
iteration – Current training iteration
rank – Current process rank
pytorch_prof – PyTorch profiler instance (if using PyTorch profiler)
nsys_nvtx_context – NVTX context from handle_profiling_step (if using nsys profiler)
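The sketch below illustrates how a training loop might thread the context returned by `handle_profiling_step` into `handle_profiling_stop`. `fake_step` and `fake_stop` are hypothetical stand-ins for those two functions, the window `START_STEP`/`STOP_STEP` is invented for illustration, and the call order inside the loop is an assumption rather than the library's documented loop.

```python
from typing import List, Optional

events: List[str] = []
START_STEP, STOP_STEP = 2, 4  # hypothetical profiling window


def fake_step(iteration: int) -> Optional[str]:
    """Stand-in for handle_profiling_step: returns a context only at the start step."""
    if iteration == START_STEP:
        events.append(f"start@{iteration}")
        return "nvtx-context"
    return None


def fake_stop(iteration: int, nsys_nvtx_context: Optional[str]) -> None:
    """Stand-in for handle_profiling_stop: acts only at the stop step."""
    if iteration == STOP_STEP:
        events.append(f"stop@{iteration} with {nsys_nvtx_context}")


saved_context: Optional[str] = None
for iteration in range(6):
    context = fake_step(iteration)
    if context is not None:
        # Keep the context; later steps return None and must not overwrite it.
        saved_context = context
    # ... forward pass, backward pass, and optimizer step would run here ...
    fake_stop(iteration, saved_context)

print(events)
```

The point of the pattern is that the step call returns `None` on every iteration except the one that starts capture, so the loop must store the non-`None` value rather than reassigning it each step.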
- bridge.training.profiling.initialize_pytorch_profiler(
- config: megatron.bridge.training.config.ProfilingConfig,
- tensorboard_dir: str,
- )
Initialize PyTorch profiler with config settings.
- Parameters:
config – Profiling configuration
tensorboard_dir – Directory for tensorboard outputs
- Returns:
Initialized (but not started) PyTorch profiler
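As an illustration of what "config settings" could mean here, the sketch below shows one plausible way to turn a start/end step window into keyword arguments for `torch.profiler.schedule`. The `wait`, `warmup`, `active`, and `repeat` keywords are real parameters of that function, but the mapping itself and the `profile_step_start`/`profile_step_end` names are assumptions, not the confirmed behavior of `initialize_pytorch_profiler`.

```python
from typing import Dict


def sketch_schedule_args(profile_step_start: int, profile_step_end: int) -> Dict[str, int]:
    """Map a [start, end) step window onto torch.profiler.schedule keyword arguments."""
    if profile_step_end <= profile_step_start:
        raise ValueError("profile_step_end must be greater than profile_step_start")
    # Use one warmup step when there is room for it before the window.
    warmup = 1 if profile_step_start > 0 else 0
    return {
        # wait + warmup equals profile_step_start in this sketch.
        "wait": max(profile_step_start - warmup, 0),
        "warmup": warmup,
        "active": profile_step_end - profile_step_start,
        "repeat": 1,
    }


args = sketch_schedule_args(profile_step_start=10, profile_step_end=12)
print(args)
```

Because the documented return value is not yet started, the caller would still need to call the profiler's `start()` method before the first training step and advance it with `step()` once per iteration for such a schedule to line up with iteration numbers.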
- bridge.training.profiling.start_nsys_profiler(
- config: megatron.bridge.training.config.ProfilingConfig,
- )
Start CUDA profiler for nsys profiling.
- Parameters:
config – Profiling configuration
- Returns:
NVTX context manager that must be passed to stop_nsys_profiler
- bridge.training.profiling.stop_nsys_profiler(
- nvtx_context: Optional[bridge.training.profiling.TNvtxContext],
- )
Stop CUDA profiler for nsys profiling.
- Parameters:
nvtx_context – NVTX context manager returned from start_nsys_profiler
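The start/stop pairing can be sketched as a context manager that is entered in one function and exited in another. `FakeNvtxContext`, `sketch_start`, and `sketch_stop` are hypothetical stand-ins; whether the real functions enter and exit the context by hand is an assumption, and tolerating `None` in the stop call is inferred from the `Optional` parameter type.

```python
from typing import List, Optional

log: List[str] = []


class FakeNvtxContext:
    """Hypothetical stand-in for an NVTX-emitting context manager."""

    def __enter__(self) -> "FakeNvtxContext":
        log.append("enter")
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        log.append("exit")
        return False  # do not suppress exceptions


def sketch_start() -> FakeNvtxContext:
    """Enter the context by hand so it stays open after this function returns."""
    context = FakeNvtxContext()
    context.__enter__()
    return context


def sketch_stop(nvtx_context: Optional[FakeNvtxContext]) -> None:
    """Exit the context if one was opened; do nothing for None."""
    if nvtx_context is not None:
        nvtx_context.__exit__(None, None, None)


handle = sketch_start()
log.append("profiled work")
sketch_stop(handle)
sketch_stop(None)  # no-op: nothing was started
print(log)
```

A `with` block cannot span several training iterations that live in different function calls, which is why a design like this returns the open context to the caller instead.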