Profiling

NeMo Framework has built-in support for profiling training jobs with performance analysis tools, including NVIDIA Nsight Systems (Nsys) for workflow optimization and PyTorch-based memory profiling for tracking memory usage during training.

Nsys Profiling

Note

Nsys profiling cannot be used with the FaultTolerancePlugin due to implementation conflicts.

NVIDIA Nsight Systems (Nsys) is a system-wide performance analysis tool for tuning and optimizing CUDA applications. NeMo Framework integrates with Nsys so that you can profile specific steps of a training job and collect detailed performance data without manual instrumentation.

Key features of Nsys profiling in NeMo 2.0 include:

  • Profile specific training steps, instead of the entire run.

  • Target specific GPU ranks for profiling.

  • Optionally include input shapes for deeper kernel analysis.

  • Configure profiling programmatically via NsysPlugin.
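
The first two features amount to a simple gate: capture opens at one step, closes at another, and only on selected ranks. The sketch below illustrates that idea in plain Python. It is not NeMo's implementation: `ProfileWindow`, `on_step_start`, `on_step_end`, and `simulate` are hypothetical names, the `start` and `stop` hooks stand in for calls such as `torch.cuda.profiler.start()` and `torch.cuda.profiler.stop()`, and treating `end_step` as inclusive is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProfileWindow:
    """Hypothetical gate: capture steps start_step..end_step on chosen ranks."""

    start_step: int
    end_step: int
    ranks: List[int]
    start: Callable[[], None]  # stand-in for e.g. torch.cuda.profiler.start
    stop: Callable[[], None]   # stand-in for e.g. torch.cuda.profiler.stop
    active: bool = field(default=False, init=False)

    def __post_init__(self) -> None:
        if self.end_step < self.start_step:
            raise ValueError("end_step must be >= start_step")

    def on_step_start(self, step: int, rank: int) -> None:
        # Open the capture window only on the selected ranks.
        if rank in self.ranks and step == self.start_step and not self.active:
            self.start()
            self.active = True

    def on_step_end(self, step: int, rank: int) -> None:
        # Close the window once the last profiled step has finished
        # (assumption: end_step is inclusive).
        if self.active and step == self.end_step:
            self.stop()
            self.active = False


def simulate(rank: int, num_steps: int = 20) -> List[str]:
    """Run a fake training loop and record when capture starts and stops."""
    events: List[str] = []
    window = ProfileWindow(
        start_step=10,
        end_step=15,
        ranks=[0, 1],
        start=lambda: events.append("start"),
        stop=lambda: events.append("stop"),
    )
    for step in range(num_steps):
        window.on_step_start(step, rank)
        events.append(f"step {step}")  # placeholder for the real training step
        window.on_step_end(step, rank)
    return events
```

Calling `simulate(rank=0)` records a start event just before step 10 and a stop event just after step 15, while `simulate(rank=2)` records neither because rank 2 is not in the selected list.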

Use Nsys Profiling in NeMo 2.0

Nsys profiling can be configured with either the NsysPlugin or the NsysCallback, depending on the environment you are running in.

Example 1: Use the NeMo-Run Plugin

In NeMo 2.0, you can enable Nsys profiling using the NsysPlugin when creating an experiment with nemo_run:

import nemo_run as run
from nemo.lightning.run.plugins import NsysPlugin
from nemo.collections.llm.recipes.llama3_8b import pretrain_recipe

# Create your recipe
recipe = pretrain_recipe(...)

# Create the executor
executor = run.SlurmExecutor(...)  # or other executor types

# Add NsysPlugin to the plugins list
plugins = []
plugins.append(NsysPlugin(
    start_step=10,
    end_step=15,
    ranks=[0, 1],  # Profile first two ranks
    nsys_trace=["nvtx", "cuda"]  # Optional: specify trace events
))

# Create and run experiment
with run.Experiment("llama3_8b_nsys_profiling") as exp:
    exp.add(
        recipe,
        executor=executor,
        plugins=plugins,
    )
    exp.run()

Example 2: Use the Lightning Callback

If you are training on a local machine or in an interactive session, use the NsysCallback instead of the NsysPlugin.

Add the callback to your Trainer and run the recipe locally. The example below does this by calling run.run with direct=True:

import nemo_run as run
from nemo.lightning.pytorch.callbacks import NsysCallback
from nemo.collections.llm.recipes.llama3_8b import pretrain_recipe

# Create your recipe
recipe = pretrain_recipe(...)

# Create the NsysCallback
nsys_callback = NsysCallback(
    start_step=10,
    end_step=15,
)

# Add the callback to the recipe
recipe.trainer.callbacks.append(nsys_callback)

run.run(recipe, direct=True)

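Then launch the script under nsys. The --capture-range=cudaProfilerApi option delays collection until the application calls cudaProfilerStart, so only the steps selected by the callback are recorded: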

nsys profile -s none -o <profile filepath> -t cuda,nvtx --force-overwrite true --capture-range=cudaProfilerApi --capture-range-end=stop python <path_to_script>
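
If you launch profiled scripts from several places, it can help to assemble this command programmatically. The following sketch builds the same argument list with the Python standard library. `build_nsys_command` and the example paths are hypothetical; the flags are exactly those in the command above.

```python
import shlex
from typing import List, Sequence


def build_nsys_command(
    script: str,
    output: str,
    trace: Sequence[str] = ("cuda", "nvtx"),
) -> List[str]:
    """Return the nsys invocation shown above as an argument list."""
    return [
        "nsys", "profile",
        "-s", "none",                       # disable CPU sampling
        "-o", output,                       # report file path
        "-t", ",".join(trace),              # APIs to trace
        "--force-overwrite", "true",        # replace an existing report
        "--capture-range=cudaProfilerApi",  # collect only after cudaProfilerStart
        "--capture-range-end=stop",         # stop collecting at cudaProfilerStop
        "python", script,
    ]


# Hypothetical script and report paths, for illustration only.
cmd = build_nsys_command("train.py", "profiles/llama3_8b")
print(shlex.join(cmd))
# To launch it from Python: subprocess.run(cmd, check=True)
```

Returning a list rather than a single string avoids shell-quoting problems when a path contains spaces; `shlex.join` is used only to print a copy-pasteable form.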

Analyze Nsys Profiling Results

Once the profiling run completes, it produces Nsys profile files that you can open in the Nsys GUI. Follow these steps to analyze the results:

  1. Install NVIDIA Nsight Systems from the NVIDIA Developer website.

  2. Open the generated .nsys-rep file in the Nsys GUI.

  3. Use the timeline view to examine the performance of your training job.

CUDA Memory Profiling

NeMo Framework also provides built-in support for CUDA memory profiling through the MemoryProfileCallback. This callback lets you track and analyze GPU memory allocation and consumption during training.

More information about the generated memory profiles can be found in the PyTorch documentation on understanding CUDA memory usage.

Use CUDA Memory Profiling in NeMo 2.0

To enable CUDA memory profiling, you can use the MemoryProfileCallback in your training script:

from nemo.lightning.pytorch.callbacks import MemoryProfileCallback
from nemo.collections.llm.recipes.llama3_8b import pretrain_recipe

# Create your recipe
recipe = pretrain_recipe(...)

# Create the MemoryProfileCallback
memory_profile_callback = MemoryProfileCallback(dir="/path/to/save/memory/traces", ranks=[0, 1])

# Add the callback to the recipe
recipe.trainer.callbacks.append(memory_profile_callback)

Analyze CUDA Memory Profiling Results

Once the run completes, the specified directory will contain memory snapshots for each specified rank. These traces can be loaded with the PyTorch Memory Viz tool to plot memory usage over time.
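
For a quick look without opening the visualizer, a snapshot can also be inspected in Python. The sketch below assumes each trace is a pickle of the dictionary produced by PyTorch's memory snapshot API, with a `segments` list whose entries carry `total_size` and `allocated_size` in bytes. The file name and the `summarize_snapshot` helper are hypothetical, and the layout NeMo writes may differ.

```python
import pickle
from pathlib import Path
from typing import Dict, Union


def summarize_snapshot(path: Union[str, Path]) -> Dict[str, int]:
    """Sum reserved and allocated bytes over all segments in one snapshot."""
    # Only unpickle files you produced yourself: pickle can execute code.
    with open(path, "rb") as f:
        snapshot = pickle.load(f)

    # Assumption: the snapshot is a dict with a "segments" list.
    segments = snapshot.get("segments", [])
    return {
        "num_segments": len(segments),
        "reserved_bytes": sum(seg["total_size"] for seg in segments),
        "allocated_bytes": sum(seg["allocated_size"] for seg in segments),
    }


if __name__ == "__main__":
    # Hypothetical file name; pick a file from the directory given to the callback.
    trace = Path("/path/to/save/memory/traces/rank0.pickle")
    if trace.exists():
        print(summarize_snapshot(trace))
```

A large gap between reserved and allocated bytes can point to fragmentation in the caching allocator, which is easier to confirm in the timeline view of the Memory Viz tool.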

Note

  • Profiling adds some overhead, so measured timings may be slightly higher than in normal operation.

  • For accurate profiling, disable other intensive operations like frequent checkpointing during profiled steps.