# NVIDIA NeMo Framework – End‑to‑End Guide for LLM, Multimodal, and Speech AI Development  

This documentation collection provides a complete reference for using NVIDIA NeMo 2.0 to set up environments, curate data, and train, fine‑tune, or deploy large language models, multimodal (audio‑vision‑language) models, and speech AI systems. It covers core concepts, migration from NeMo 1.0, advanced parallelism (Megatron Core, tensor/pipeline/context parallelism), experiment management, logging, resiliency, quantization, and deployment via TensorRT‑LLM or Triton. Detailed recipes and best‑practice playbooks are included for specific models such as GPT‑OSS, Gemma 2/3, Mixtral, T5, Phi‑3, AVLM, speaker‑recognition/diarization, and TTS, helping engineers and researchers accelerate development and production on NVIDIA GPUs.

## Overview
- [Read the overview when you need to set up, train, fine‑tune, or deploy LLM, multimodal, or speech AI models with NeMo, or when you are migrating from NeMo 1.0 to NeMo 2.0.](docs.nvidia.com/nemo-framework/user-guide/latest/overview.html.md)
- [Read this page when an engineer is prototyping a domain‑specific ASR or conversational AI model on NVIDIA GPUs with reusable neural modules, or when a research team needs to scale LLM or multimodal training across multi‑GPU, multi‑node clusters with mixed precision and advanced parallelism.](docs.nvidia.com/nemo-framework/user-guide/latest/why-nemo.html.md)
- [Consult the 24.07 overview for version‑pinned, scenario‑driven guidance, such as transferring a model checkpoint to a Brev GPU, launching a training job, retrieving logs or evaluation results, or exporting a model back to local storage.](docs.nvidia.com/nemo-framework/user-guide/24.07/overview.html.md)

## Getting Started & Playbooks
- [Read the page when you’re about to start a NeMo project—whether you need to spin up the Docker container, curate a dataset with NeMo‑Curator, train or fine‑tune an LLM with NeMo‑Run, or run a speech‑AI demo on Colab—so you can follow the step‑by‑step tutorials and best‑practice recipes provided.](docs.nvidia.com/nemo-framework/user-guide/latest/playbooks/index.html.md)
- [Read this page when you’re about to start a new generative‑AI, speech‑recognition, or multimodal project that uses NeMo—specifically to check the Python, PyTorch, and GPU requirements, install the framework, and run the quick‑start Audio‑Translation example or other tutorials. This ensures you have the correct environment set up before fine‑tuning, training, or deploying any NeMo model.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/starthere/intro.html.md)
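
A quick way to validate a freshly installed environment is a minimal inference run. The sketch below is a hypothetical smoke test, assuming `nemo_toolkit[asr]` is installed; the pretrained model name and audio path are placeholders to adapt:

```python
# Minimal environment smoke test: download a pretrained ASR model from NGC
# and transcribe a local WAV file. Model name and audio path are illustrative.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="stt_en_conformer_ctc_large"  # any NGC ASR model name works
)
transcripts = asr_model.transcribe(["/path/to/sample.wav"])
print(transcripts[0])
```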

## Migration from NeMo 1.0 to 2.0
- [When upgrading a NeMo 1.0 training pipeline to NeMo 2.0, read this page to migrate precision settings, especially the FP8 options, into the new MegatronMixedPrecision plugin and adjust the Trainer accordingly (see the configuration sketch after this list).](docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/migration/precision.html.md)
- [Read this page when updating a NeMo 1.0 SFT pipeline to NeMo 2.0: for example, replacing the separate megatron_gpt_finetuning.py script with a FineTuningDataModule, using model.import_ckpt(...) with trainer.fit(...) to avoid the “Failure to acquire lock” error, and aligning the fine‑tuning workflow with the new unified pre‑training API.](docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/migration/sft.html.md)
- [Read this page when migrating a NeMo project from 1.0 to 2.0 to update tokenizer configuration from YAML to Python, or when creating a new training recipe that requires a custom SentencePiece, Hugging Face, or Megatron tokenizer; it shows how to instantiate and modify tokenizers programmatically for accurate tokenization in large‑model training.](docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/migration/tokenizer.html.md)
- [When migrating a NeMo 1.0 experiment to Python‑based configs, writing a custom training script, or launching multi‑GPU LLM training with the NeMo CLI or NeMo‑Run on Slurm or other clusters.](docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/index.html.md)
- [When upgrading a NeMo 1.0 training pipeline to NeMo 2.0 and replacing the YAML‑based `exp_manager` with `NeMoLogger` and `AutoResume` objects, adding callbacks like `TimingCallback`, and updating checkpoint and logging settings.](docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/migration/exp-manager.html.md)
- [When migrating a NeMo 1.0 project to NeMo 2.0, read this page to translate distributed‑checkpoint YAML options into `MegatronStrategy` arguments and to run the conversion script that rewrites checkpoints from the old `model_name.nemo` tarball format to the new folder‑based structure.](docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/migration/checkpointing.html.md)
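
For the precision migration above, the change amounts to replacing YAML `trainer.precision` and FP8 fields with an explicit plugin object. A minimal sketch, assuming NeMo 2.0's `nemo.lightning` namespace; the FP8 values are illustrative, so check the linked page for the options your Transformer Engine version supports:

```python
import nemo.lightning as nl

# NeMo 1.0 set precision via YAML (trainer.precision, model.fp8, ...).
# In NeMo 2.0 the same options move into the MegatronMixedPrecision plugin.
precision = nl.MegatronMixedPrecision(
    precision="bf16-mixed",     # replaces trainer.precision
    fp8="hybrid",               # illustrative FP8 recipe; omit to disable FP8
    fp8_amax_history_len=1024,
    fp8_amax_compute_algo="max",
)

trainer = nl.Trainer(
    devices=8,
    strategy=nl.MegatronStrategy(tensor_model_parallel_size=2),
    plugins=precision,
)
```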

## Core Toolkit & Core Concepts
- [A developer or researcher should read this page when they are building, fine‑tuning, or deploying a conversational AI model with NVIDIA NeMo—particularly if they need to set up Hydra/YAML configurations, leverage PyTorch Lightning training hooks, restore or push models to Hugging Face, or profile GPU/CPU usage during training.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/core/core.html.md)
- [Read this page whenever you are adding or customizing adapters in a NeMo model—for example, when selecting the adapter type, deciding where to insert it in the architecture, or implementing a new adapter strategy such as residual addition or stochastic depth. It’s also useful when you need to understand how to extend AdapterModuleUtil or integrate adapters into multi‑module (encoder/decoder) setups.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/core/adapters/components.html.md)
- [Use this page whenever you’re building, saving, or loading NeMo models—especially if you need to register artifacts, nest submodules, or export a .nemo file. It’s also essential when you’re converting a .nemo archive to a checkpoint, or integrating a NeMo model with Hugging Face or other deployment pipelines.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/core/api.html.md)
- [Use this page when you’re training a language model with NVIDIA NeMo and need to track perplexity during distributed training, so you can update the metric with logits or probability tensors, synchronize state across workers, and compute the average perplexity for model evaluation.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/common/metrics.html.md)
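
For the metrics page above, usage follows the standard torchmetrics update/compute cycle. A minimal sketch, assuming the `Perplexity` metric accepts a `logits` keyword as the page describes; the tensor shapes are illustrative:

```python
import torch
from nemo.collections.common.metrics import Perplexity

perplexity = Perplexity()  # torchmetrics-style: state syncs across workers

# Illustrative shapes: [batch, seq_len, vocab] logits from a forward pass.
logits = torch.randn(4, 128, 32000)
perplexity.update(logits=logits)

print(perplexity.compute())  # average perplexity over all updates so far
perplexity.reset()
```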

## Data Curation & Preparation
- [API reference for NeMo Curator's distributed data classifiers, used to score and filter documents during dataset curation.](docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/api/classifiers.html.md)
- [API reference for NeMo Curator's downstream‑task decontamination utilities, which remove benchmark overlap from training corpora.](docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/api/decontamination.html.md)
- [API reference for miscellaneous NeMo Curator helpers that support the data‑curation pipeline.](docs.nvidia.com/nemo-framework/user-guide/latest/datacuration/api/misc.html.md)
- [When you’re building or debugging the data pipeline for an LLM in NeMo 2.0—such as preparing a token‑indexed pre‑training corpus, configuring distributed Megatron data parallelism, or setting up a JSONL‑based fine‑tuning workflow—you should read this page to pick the right `PreTrainingDataModule` or `FineTuningDataModule` and use its built‑in features (memory‑mapped loading, split handling, validation checks, etc.).](docs.nvidia.com/nemo-framework/user-guide/latest/data/index.html.md)
- [Fine‑tune a large language model on a custom dataset—when you need to build or tweak a DataModule, specify a prompt template, or enable packed‑sequence training; or when you want to supply pre‑preprocessed training, validation, and test jsonl files to a NeMo 2 LLM training pipeline.](docs.nvidia.com/nemo-framework/user-guide/latest/data/finetune_data.html.md)
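
To make the fine‑tuning data setup concrete, here is a minimal sketch; it assumes a dataset directory containing pre‑processed `training.jsonl`, `validation.jsonl`, and `test.jsonl` files, with batch sizes and sequence length as placeholders:

```python
from nemo.collections.llm.gpt.data import FineTuningDataModule

# Assumes dataset_root holds pre-processed training/validation/test .jsonl
# files; sequence length and batch sizes are illustrative.
data = FineTuningDataModule(
    dataset_root="/data/my_sft_dataset",
    seq_length=2048,
    micro_batch_size=1,
    global_batch_size=128,
)
# Pass `data` to trainer.fit(model, data) in a NeMo 2.0 training script.
```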

## Model Parallelism & Distributed Training
- [When training a GPT‑style model that exceeds a single GPU's memory and you want to leverage Megatron Core's tensor, pipeline, or context parallelism within a PyTorch Lightning workflow, especially when you need distributed checkpointing and BF16‑mixed precision support (see the strategy sketch after this list).](docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/megatron.html.md)
- [Use this guide when training LLMs that exceed device memory because of long sequence lengths or large micro‑batches, and you need to checkpoint activations to reduce peak memory. Enable transformer‑layer recomputation (`recompute_method=full`, `block`, or `uniform`) or self‑attention recomputation (`recompute_granularity=selective`, auto‑enabled with FlashAttention) to balance memory savings against the extra forward‑pass cost, especially when using pipeline or virtual pipelining and specifying `recompute_num_layers` per stage.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/optimizations/activation_recomputation.html.md)
- [Read the ramp‑up batch‑size guide when setting up distributed NeMo training and you want the global batch size to grow gradually from a smaller starting value, so the job fits within memory on a few GPUs before expanding to more nodes.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/rampup_batch_size.html.md)
- [When planning to train SpeechLM2 on a multi‑node SLURM cluster or scaling a large model with FSDP2/Tensor Parallelism, consult this guide for job‑submission scripts, random‑seed handling, parallelism configuration, and local debugging with torchrun. Use it when customizing training scripts, adjusting batch sizes, or configuring parallel strategies to fit GPU memory.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/speechlm2/training_and_scaling.html.md)
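
The parallelism and recomputation options described above translate into a strategy object plus model‑config fields. A minimal sketch, assuming a GPT‑style NeMo 2.0 config; the parallel sizes and recompute values are placeholders to tune for your cluster:

```python
import nemo.lightning as nl
from nemo.collections import llm

# Tensor/pipeline/context parallelism live on the strategy (Megatron Core).
strategy = nl.MegatronStrategy(
    tensor_model_parallel_size=4,
    pipeline_model_parallel_size=2,
    context_parallel_size=1,
    sequence_parallel=True,
)

# Activation recomputation is configured on the model config; these fields
# mirror the options described on the recomputation page.
config = llm.GPTConfig126M(
    recompute_granularity="full",
    recompute_method="block",
    recompute_num_layers=4,
)

trainer = nl.Trainer(
    devices=8,
    num_nodes=2,
    strategy=strategy,
    plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
)
```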

## Experiment Management & Logging
- [Use this page when you are configuring or debugging NeMo experiments that involve Hydra multi‑run, setting up checkpoints, loggers (such as WandB, MLflow, or Neptune), enabling exponential‑moving‑average, or managing experiment resumption and disk‑space cleanup.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/core/exp_manager.html.md)
- [When you’re preparing or launching a NeMo 2.0 training job that requires organized logs, checkpoints, or automated resume logic—especially if you need to integrate TensorBoard or WandB, enable asynchronous checkpointing, or transfer checkpoints between machines—consult this guide to configure the correct `NeMoLogger`, `ModelCheckpoint`, and `AutoResume` settings.](docs.nvidia.com/nemo-framework/user-guide/latest/nemo-2.0/features/logging.html.md)
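
In code, the logging pieces compose as follows; a minimal sketch assuming NeMo 2.0's `nemo.lightning` API, with directories, monitor keys, and retention settings as placeholders:

```python
import nemo.lightning as nl

# Keep the best two checkpoints by validation loss, plus the latest one.
checkpoint = nl.ModelCheckpoint(
    save_last=True,
    save_top_k=2,
    monitor="val_loss",
)

# NeMoLogger organizes logs and checkpoints under one experiment directory;
# TensorBoard or WandB logger objects can be attached via the corresponding
# keyword arguments.
logger = nl.NeMoLogger(
    name="my_experiment",
    dir="/results",
    ckpt=checkpoint,
)

# AutoResume restarts from the latest checkpoint when the job relaunches.
resume = nl.AutoResume(resume_if_exists=True, resume_ignore_no_checkpoint=True)

# Pass these to llm.train(..., log=logger, resume=resume) or your own script.
```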

## PEFT & Fine‑Tuning Techniques
- [Use this page when you’re preparing to fine‑tune a large language model with NeMo and need to choose a PEFT method that fits your memory, training‑speed, or inference‑latency constraints—e.g., deciding between LoRA, DoRA, QLoRA, adapters, or IA3 and customizing their bottleneck dimensions or quantization settings. It’s also the go‑to reference when upgrading from NeMo 1.0 to 2.0 to see which techniques are supported, which are legacy, and how to enable tensor or pipeline parallelism for DoRA.](docs.nvidia.com/nemo-framework/user-guide/latest/sft_peft/supported_methods.html.md)
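
As an illustration of how a method choice shows up in code, here is a LoRA sketch for NeMo 2.0; the target‑module names and dimensions are illustrative, and the `finetune_recipe` entry point is one of several ways to attach PEFT:

```python
from nemo.collections import llm
from nemo.collections.llm.peft import LoRA

# Bottleneck dimension and target modules are the main memory/quality knobs.
lora = LoRA(
    dim=16,                # adapter bottleneck dimension
    alpha=32,              # scaling factor
    target_modules=["linear_qkv", "linear_proj"],  # illustrative names
)

# One way to attach PEFT: ask a prebuilt recipe for a LoRA variant.
recipe = llm.llama3_8b.finetune_recipe(
    name="llama3_lora", dir="/results", peft_scheme="lora"
)
```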

## Specific Model Guides (Language‑Only)
- [Use this guide when you are building, finetuning, or deploying a GPT‑OSS 20B/120B model with NeMo 2.0—e.g., converting an OpenAI checkpoint to NeMo format, running LoRA or full‑model finetuning, launching inference scripts, exporting the checkpoint to Hugging Face, or building a TensorRT‑LLM/Triton inference pipeline.](docs.nvidia.com/nemo-framework/user-guide/latest/llms/gpt_oss.html.md)
- [If you’re about to fine‑tune a Gemma 2 (2B, 9B, or 27B) model with NeMo‑Run—importing the HF checkpoint, selecting the recipe, swapping out the SquadDataModule for a custom dataset, or toggling LoRA/full‑finetune and sequence packing—read this page for the exact configuration steps.](docs.nvidia.com/nemo-framework/user-guide/latest/llms/gemma2.html.md)
- [Use this page whenever you need to import a Gemma 3 checkpoint from Hugging Face into NeMo 2.0, run inference or fine‑tune it on local or multi‑node clusters, or adapt the provided recipes to a custom data pipeline.](docs.nvidia.com/nemo-framework/user-guide/latest/vlms/gemma3.html.md)
- [Use this page when you’re preparing a pre‑training or fine‑tuning job for Phi‑3 (or related Llama‑3 models) with NeMo 2.0, need to swap the default data modules, choose LoRA or full‑model training, or run the recipe via NeMo‑Run locally or on a cluster.](docs.nvidia.com/nemo-framework/user-guide/latest/llms/phi3.html.md)
- [Use this guide whenever you are setting up or running Mixtral‑8x7B or Mixtral‑8x22B training with NeMo 2.0—e.g., to import a Hugging Face checkpoint into NeMo, to replace the default `MockDataModule`/`SquadDataModule` with a custom dataset, to launch pre‑training or fine‑tuning recipes via NeMo‑Run, or to check which recipe versions (16k, 64k, FP8) are currently supported.](docs.nvidia.com/nemo-framework/user-guide/latest/llms/mixtral.html.md)
- [Use this guide when you want to pre‑train or fine‑tune a T5 model (220 M, 3 B, or 11 B) with NeMo 2.0, need to override the default data modules, configure LoRA or full‑model finetuning, and run the recipe via NeMo‑Run. It’s also useful when you need to understand how to launch the training locally or in a distributed executor and manage checkpoints.](docs.nvidia.com/nemo-framework/user-guide/latest/llms/t5.html.md)
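
The model guides above share one NeMo‑Run workflow: import the Hugging Face checkpoint, configure a prebuilt recipe, and launch it through an executor. A minimal local‑launch sketch using Gemma 2 as the example; the model, recipe, and executor settings are illustrative and mirror the linked guides:

```python
import nemo_run as run
from nemo.collections import llm

# One-time: convert the Hugging Face checkpoint into NeMo 2.0 format.
llm.import_ckpt(
    model=llm.Gemma2Model(llm.Gemma2Config9B()),
    source="hf://google/gemma-2-9b",
)

# Build a prebuilt fine-tuning recipe and launch it on the local node;
# swap the executor for a SlurmExecutor on a cluster.
recipe = llm.gemma2_9b.finetune_recipe(
    name="gemma2_finetune", dir="/results",
    num_nodes=1, num_gpus_per_node=8,
)
run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=8, launcher="torchrun"))
```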

## Multimodal & Vision‑Language Models
- [Read it when you’re building or fine‑tuning an Audio‑Vision Language Model in NeMo—e.g., migrating from NeVA, configuring the multimodal encoders, aligning vision/audio embeddings, or running the pre‑training/fine‑tuning recipes (including the Energon data module) for tasks like video captioning or image‑audio QA.](docs.nvidia.com/nemo-framework/user-guide/latest/vlms/avlm.html.md)
- [Use this guide whenever you’re setting up or running a LLaVA‑Next training pipeline—for example, converting a Hugging Face LLM checkpoint to NeMo format, swapping the task encoder to LlavaNextTaskEncoder, or configuring the Energon dataloader for high‑resolution images. It’s also useful before launching pretraining or fine‑tuning jobs with NeMo‑Run to ensure the recipes and data modules are correctly set for your GPU cluster.](docs.nvidia.com/nemo-framework/user-guide/latest/vlms/llavanext.html.md)

## Speech & ASR
- [When you need to generate speaker embeddings or verify whether two or more audio files belong to the same speaker, such as adding a new speaker to a verification dataset, evaluating a batch of verification pairs, or integrating speaker identity checks into a downstream application (a minimal verification sketch follows this list).](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/speaker_recognition/api.html.md)
- [Use this page whenever you need to set up a speaker‑recognition experiment in NeMo, such as selecting TitaNet, SpeakerNet, or ECAPA‑TDNN, locating the correct configuration files, loading a pretrained checkpoint, or understanding the architectural details for fine‑tuning or benchmarking.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/speaker_recognition/models.html.md)
- [Use this page when you are preparing to train or fine‑tune a Sortformer or MSDD speaker diarizer, configuring dataset manifests, training hyper‑parameters, or post‑processing settings, or when setting up a complete inference pipeline that includes VAD, embeddings, clustering, and optional ASR for a new audio domain. It is also the go‑to reference for troubleshooting or customizing any of the YAML config blocks in NeMo’s speaker diarization toolkit.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/speaker_diarization/configs.html.md)
- [When designing or fine‑tuning a multi‑speaker ASR pipeline with NeMo, consult this page to decide between an end‑to‑end Sortformer diarizer and the cascaded VAD/embedding/MSDD approach, and to understand the associated loss functions and multi‑scale weighting. It is also useful when integrating these models into a production workflow, debugging diarization accuracy on long recordings, or evaluating new SortLoss or attention‑based scale‑weighting options.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/speaker_diarization/models.html.md)
- [When you need to create or preprocess manifests, configure Hydra YAML, or prepare datasets for end‑to‑end or cascaded (MS‑DD) speaker diarization with NeMo, you should read this page.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/speaker_diarization/datasets.html.md)
- [Use this page when you need to integrate NVIDIA NeMo’s speaker diarization into a project, such as adding speaker‑labeling to a real‑time transcription pipeline, converting audio files into RTTM format for downstream analysis, or customizing a model’s diarization workflow by subclassing the `SpkDiarizationMixin` and overriding the required methods.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/speaker_diarization/api.html.md)
- [Use this page when you are starting an ASR project in NeMo and need to run the SSL‑for‑ASR or ASR‑with‑NeMo notebooks, locate pretrained checkpoints, or run dataset preprocessing scripts. It is also useful for quickly finding model architecture details or adjusting SSL configuration files for training.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/ssl/resources.html.md)
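
For the speaker‑recognition API referenced at the top of this list, the common embedding and verification calls look like this; a minimal sketch assuming a pretrained TitaNet checkpoint from NGC and placeholder audio paths:

```python
from nemo.collections.asr.models import EncDecSpeakerLabelModel

# Pretrained TitaNet-Large from NGC; any speaker model name works here.
speaker_model = EncDecSpeakerLabelModel.from_pretrained("titanet_large")

# Extract a fixed-size speaker embedding for one utterance.
embedding = speaker_model.get_embedding("/path/to/speaker1.wav")

# Decision: do two recordings come from the same speaker?
same_speaker = speaker_model.verify_speakers(
    "/path/to/speaker1.wav", "/path/to/speaker2.wav"
)
print(same_speaker)
```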

## Text‑to‑Speech (TTS)
- [Use this guide whenever you are designing or fine‑tuning a NeMo TTS pipeline and need to compare model options, review pretrained checkpoints, or consult architecture details for FastPitch, Mixer‑TTS, RAD‑TTS, VITS, or vocoders such as HiFi‑GAN. It’s also handy when you want to implement a new alignment strategy or deploy audio codecs like SoundStream.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/tts/models.html.md)
- [Use this page whenever you need to load a local or NGC‑hosted TTS checkpoint for evaluation, fine‑tuning, or inference, or when you must programmatically enumerate available NeMo TTS models and their download URLs to set up a training pipeline. It’s also the reference for attaching a vocoder to a cascaded FastPitch model or resuming an unfinished training run with the Experiment Manager’s resume_if_exists=True flag.](docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/tts/checkpoints.html.md)
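
Attaching a vocoder to a cascaded FastPitch model, as mentioned above, follows a two‑stage pattern. A minimal inference sketch, assuming the NGC checkpoint names shown and a 22.05 kHz output rate for these particular models:

```python
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# Stage 1: text -> mel spectrogram; Stage 2: spectrogram -> waveform.
spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")
vocoder = HifiGanModel.from_pretrained("tts_en_hifigan")

tokens = spec_generator.parse("NeMo makes speech synthesis straightforward.")
spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# 22.05 kHz is the sample rate of these particular checkpoints.
sf.write("output.wav", audio.detach().cpu().numpy()[0], samplerate=22050)
```

Per the checkpoints page, `FastPitchModel.list_available_models()` is the programmatic way to enumerate the NGC‑hosted models and their download URLs.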

## Deployment & Optimization
- [Use this guide when you’re preparing a FP16/BF16 checkpoint for inference with TensorRT‑LLM and need to choose and run a post‑training quantization (FP8, INT8 SmoothQuant, or INT4 AWQ). It tells you which models support each method, the required calibration steps, and the NeMo CLI or PTQ‑script commands to export a quantized “qnemo” checkpoint ready for deployment.](docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/quantization/quantization.html.md)

## Infrastructure & Cluster Management
- [When planning a new pre‑train or finetune job on a GPU cluster, an LLM engineer should read this page to auto‑select the best model size, parallelism, and hyper‑parameters that fit the available GPUs and time budget. After training, the guide is useful again to compare throughput results and pick the fastest configuration.](docs.nvidia.com/nemo-framework/user-guide/latest/usingautoconfigurator.html.md)
- [Use this page when you are running NeMo training on a Slurm cluster and need to enable fault‑tolerance, auto‑resume, straggler detection, local checkpointing, or graceful preemption (e.g., training a LLaMA3‑8B recipe with NeMo‑Run or managing long‑running jobs that risk time‑limit or node failures).](docs.nvidia.com/nemo-framework/user-guide/latest/resiliency.html.md)
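
In NeMo‑Run terms, the resiliency features attach as launch plugins. The sketch below is an assumption‑laden illustration: it takes the plugin names and module path to be `nemo.lightning.run.plugins.FaultTolerancePlugin` and `PreemptionPlugin`, so verify them against the linked page for your NeMo version:

```python
import nemo_run as run
from nemo.collections import llm
from nemo.lightning.run.plugins import FaultTolerancePlugin, PreemptionPlugin

# A prebuilt recipe; name, directory, and cluster shape are placeholders.
recipe = llm.llama3_8b.pretrain_recipe(
    name="llama3_resilient", dir="/results",
    num_nodes=4, num_gpus_per_node=8,
)

# Fault tolerance restarts hung ranks; preemption saves a checkpoint on
# SIGTERM so Slurm time limits do not lose work. On a cluster, swap the
# local executor for a run.SlurmExecutor configured for your site.
run.run(
    recipe,
    plugins=[FaultTolerancePlugin(), PreemptionPlugin()],
    executor=run.LocalExecutor(ntasks_per_node=8),
)
```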

## Libraries & APIs
- [When an LLM developer needs to set up GPU‑accelerated training or fine‑tuning of Hugging Face models, consult this page to use NeMo AutoModel and Megatron Core for efficient scaling; when preparing data for a new LLM, evaluating model performance, or deploying the trained model to production via TensorRT‑LLM or NVIDIA Triton, it provides the required APIs, deployment options, and evaluation harnesses.](docs.nvidia.com/nemo-framework/user-guide/latest/libraries/index.html.md)

## Best Practices & Troubleshooting
- [When adding optional dependencies, debugging missing imports, accessing gated Hugging Face models, or writing scripts that use NeMo 2.0’s multiprocessing backend, an LLM developer should read this page.](docs.nvidia.com/nemo-framework/user-guide/latest/best-practices.html.md)

## Release Notes / Changelog
- [When you’re updating, deploying, or troubleshooting NeMo‑based workloads—such as moving to a newer container, adding a new LLM or multimodal model, enabling FP8 or nvFSDP optimizations, or addressing a security patch or deprecation—consult the changelog to verify feature availability, required code changes, or compatibility notes.](docs.nvidia.com/nemo-framework/user-guide/latest/changelog.html.md)