The Link Between Lightning and Megatron Core

In PyTorch Lightning, a Strategy is responsible for managing the distributed execution of a model during training, validation, and testing. Strategies typically wrap the user-defined model with a class that can handle distributed execution. For instance, the standard DDPStrategy (Distributed Data Parallel Strategy) wraps the model with PyTorch’s DistributedDataParallel class. This wrapper handles the distribution of data across multiple GPUs or nodes, synchronizes gradients during the backward pass, and ensures that model parameters remain consistent across all processes. Strategies in Lightning abstract away much of the complexity of distributed training, allowing users to focus on their model architecture and training logic while the framework handles the intricacies of distributed execution.
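
For comparison, selecting the standard DDP strategy in a vanilla Lightning script looks like this (a minimal sketch; the import path is lightning.pytorch in Lightning 2.x and pytorch_lightning in older releases):

import lightning.pytorch as pl
from lightning.pytorch.strategies import DDPStrategy

# DDPStrategy wraps the LightningModule in torch's DistributedDataParallel,
# replicating the full model on every GPU and all-reducing gradients each step.
trainer = pl.Trainer(strategy=DDPStrategy(), devices=4, accelerator="gpu")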

MegatronStrategy

The MegatronStrategy is a PyTorch Lightning strategy that enables distributed training of large language models using NVIDIA’s Megatron Core library. It’s designed to handle models that exceed the memory capacity of a single GPU by implementing various forms of model parallelism.

To use the MegatronStrategy, you initialize it with parameters that define the parallelism setup:

from nemo import lightning as nl

strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=2,               # shard each layer's weights across 2 GPUs
    pipeline_model_parallel_size=2,             # split the layer stack into 2 sequential stages
    virtual_pipeline_model_parallel_size=None,  # interleaved pipeline schedule disabled
    context_parallel_size=1,                    # no sharding along the sequence dimension
    sequence_parallel=False,                    # sequence parallelism (pairs with tensor parallelism) off
    expert_model_parallel_size=1,               # no expert parallelism (relevant for MoE models)
)

These parameters determine how the model will be split across available GPUs. The strategy then sets up the necessary distributed environment, initializing process groups for each type of parallelism.
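
The product of the model-parallel sizes is the number of GPUs that hold one replica of the model; whatever remains becomes the data-parallel dimension. A quick sanity check for the configuration above:

total_gpus = 8                   # devices * num_nodes
model_parallel_gpus = 2 * 2 * 1  # tensor * pipeline * context parallel sizes
data_parallel_size = total_gpus // model_parallel_gpus
print(data_parallel_size)        # 2 -> two data-parallel replicas of the partitioned model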

The strategy is also responsible for configuring the checkpoint IO interface that handles saving and loading checkpoints. For a full list of options that can be configured via MegatronStrategy, refer to the documentation.
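
For example, checkpoint behavior can be tuned directly on the strategy. The parameter names below are assumptions based on recent NeMo releases and may differ in your version; check the MegatronStrategy API reference:

strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=2,
    ckpt_async_save=True,      # assumed option: write checkpoints from a background worker
    ckpt_parallel_save=True,   # assumed option: let ranks write their own shards in parallel
)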

When you create your PyTorch Lightning Trainer, you pass this strategy:

trainer = nl.Trainer(strategy=strategy, devices=8, accelerator="gpu")

The MegatronStrategy utilizes Megatron’s distributed checkpointing system for model I/O. This system efficiently manages checkpoints for models partitioned across multiple GPUs, maintaining consistency across various parallelism configurations. It enables correct model reconstruction even when GPU setups differ between saving and loading.

The MegatronStrategy wraps the user-defined training_step, validation_step, and test_step methods to make them compatible with Megatron’s forward-backward pass implementation. This wrapping process allows these steps to be executed within the context of Megatron’s distributed execution framework, ensuring that all forms of parallelism are properly handled during each phase of the training loop. By doing this, the strategy maintains the familiar PyTorch Lightning interface for users while seamlessly integrating the complex distributed operations required for large-scale model training.

The MegatronStrategy employs the MegatronParallel class to manage the distributed execution of the user-defined model. This class breaks down the execution process into three key steps:

  1. Data Step: Prepares and distributes the input data across the model parallel groups.

  2. Forward Step: Executes the forward pass across the partitioned model.

  3. Loss Reduction: Computes and reduces the loss across the distributed setup.

MegatronParallel utilizes these steps to perform the forward-backward pass, which is derived from the user-defined training_step, validation_step, and test_step methods. It orchestrates the flow of data and gradients through the partitioned model, manages inter-GPU communication, and ensures proper gradient synchronization. This approach enables efficient execution across multiple GPUs while preserving the logical structure of the user’s Lightning module.
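
Conceptually, the strategy never calls training_step directly; it hands the three callables to Megatron Core's pipeline schedule, which invokes them once per microbatch. The sketch below illustrates that hand-off and is a simplification, not the actual NeMo source:

from megatron.core.pipeline_parallel import get_forward_backward_func

def run_iteration(megatron_parallel, data_iterator, num_microbatches, seq_length, micro_batch_size):
    # Megatron Core selects the appropriate schedule (non-pipelined, pipelined,
    # or interleaved) based on the active parallelism configuration.
    forward_backward_func = get_forward_backward_func()
    return forward_backward_func(
        forward_step_func=megatron_parallel.forward_step,  # derived from the user's step methods
        data_iterator=data_iterator,
        model=megatron_parallel,
        num_microbatches=num_microbatches,
        seq_length=seq_length,
        micro_batch_size=micro_batch_size,
    )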

MegatronParallel

The MegatronParallel class is the core component that implements distributed model parallelism on top of the Megatron Core library. It manages the execution of the model across multiple GPUs, breaking the process down into three key steps (a simplified sketch of all three steps follows this list):

  1. Data Step: This step prepares and distributes the input data across the model parallel groups. For the GPT model, it uses the gpt_data_step function:

    def data_step(self, dataloader_iter):
        return gpt_data_step(dataloader_iter)
    

    This function handles:

    1. Fetching a batch from the dataloader

    2. Moving required tensors to CUDA

    3. Slicing the batch for context parallelism using get_batch_on_this_context_parallel_rank

    4. Preparing packed sequence parameters if necessary

  2. Forward Step: This step executes the forward pass across the partitioned model. For the GPT model, it uses the gpt_forward_step function:

    def forward_step(self, model, batch):
        return gpt_forward_step(model, batch)
    

    This function:

    1. Prepares the forward arguments from the batch

    2. Calls the model’s forward method with these arguments

    3. Handles both standard and packed sequence inputs

  3. Loss Reduction: After the forward pass, this step computes and reduces the loss across the distributed setup. The GPT model uses MaskedTokenLossReduction:

    def loss_reduction(self, model):
        return model.training_loss_reduction()
    

    For validation:

    def validation_loss_reduction(self, model):
        return model.validation_loss_reduction()
    

    These methods handle:

    1. Calculating the loss using masked token loss

    2. Reducing the loss across data parallel groups

    3. Handling special cases for validation (e.g., not dropping the last batch)
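
Putting the three pieces together, a heavily simplified version of what these GPT-specific functions do might look like the following. This is illustrative only; the real implementations also handle context parallelism, packed sequences, and pipeline-stage boundaries:

def data_step(dataloader_iter):
    # 1. Fetch the next batch and move only the tensors the model needs to the GPU.
    batch = next(dataloader_iter)
    keys = ("tokens", "labels", "loss_mask", "position_ids", "attention_mask")
    return {k: v.cuda(non_blocking=True) for k, v in batch.items() if k in keys}

def forward_step(model, batch):
    # 2. Unpack the batch into the model's forward arguments.
    return model(
        input_ids=batch["tokens"],
        position_ids=batch["position_ids"],
        attention_mask=batch.get("attention_mask"),
        labels=batch["labels"],
    )

def loss_reduction(losses, loss_mask):
    # 3. Average the per-token loss over unmasked tokens; the result is then
    #    reduced across the data-parallel group.
    return losses.sum() / loss_mask.sum().clamp(min=1)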

The MegatronParallel class orchestrates these steps to perform the complete forward-backward pass.

By using these model-specific functions, MegatronParallel allows the GPT model to define its own data processing, forward pass, and loss calculation logic while still benefiting from the distributed execution framework. This approach enables researchers and engineers to work with large language models using familiar PyTorch Lightning interfaces, while the underlying distributed execution is handled transparently.

MegatronMixedPrecision

The MegatronMixedPrecision class is a specialized precision plugin for models built on the Megatron Core library in PyTorch Lightning. It extends Lightning's standard MixedPrecision plugin to handle the specific requirements of large language models trained with the Megatron Core library.

from nemo import lightning as nl

precision = nl.MegatronMixedPrecision(precision="bf16-mixed")
trainer = nl.Trainer(strategy=strategy, plugins=precision)
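
Other precision recipes follow the same pattern. For example, FP16 with dynamic loss scaling instead of BF16 (the precision strings mirror Lightning's conventions):

precision = nl.MegatronMixedPrecision(precision="16-mixed")
trainer = nl.Trainer(strategy=strategy, plugins=precision, devices=8, accelerator="gpu")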
