Important
NeMo 2.0 is an experimental feature and is currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Changelog
This section identifies the major changes in each version of the NVIDIA NeMo™ framework released to date.
NeMo Framework 24.07
Training
Features and Model Architectures
PEFT: QLoRA support; LoRA/QLoRA for the Mixture-of-Experts (MoE) dense layer (see the LoRA sketch after this list)
State Space Models & Hybrid Architecture support (Mamba2 and NV-Mamba2-hybrid)
Support for Nemotron, Minitron, Gemma 2, Qwen, and RAG
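Because LoRA and QLoRA appear throughout this changelog, a minimal generic PyTorch sketch of the low-rank update may be useful. The class and parameter names below are illustrative, not NeMo's PEFT implementation; QLoRA differs mainly in storing the frozen base weight in a 4-bit quantized format.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA wrapper: y = base(x) + (alpha / r) * x A^T B^T.

    The pretrained weight stays frozen; only the low-rank factors
    A (r x in_features) and B (out_features x r) are trained.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T


# Usage: wrap a dense layer and train only the adapter parameters.
layer = LoRALinear(nn.Linear(1024, 1024))
out = layer(torch.randn(2, 1024))
```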
Multimodal
NeVA: Added SOTA LLM backbone support (Mixtral/Llama 3) and a suite of model parallelism options (PP/EP)
Support Language Instructed Temporal-Localization Assistant (LITA) on top of video NeVA
Custom Tokenizer training in NeMo
Update the Auto-Configurator for EP, CP and FSDP
ASR
SpeechLM and SALM
Adapters for Canary Customization
PyTorch allocator in PyTorch 2.2 improves training speed by up to 30% for all ASR models
CUDA Graphs for Transducer Inference
Replaced webdataset with Lhotse, giving up to a 2x speedup
Transcription Improvements - Speedup and QoL Changes
ASR Prompt Formatter for multimodal Canary
Aligner
Speed up Aligner RLHF by 7x with TRT-LLM
Reward Preference Optimization (RPO)
Identity Preference Optimization (IPO)
SteerLM2
Llama 3 performance and convergence example
Constitutional AI algorithm (RLAIF)
Curator
Semantic Deduplication (see the sketch after this list)
Resiliparse for Text Extraction
Improved Distributed Data Classification - the domain classifier is 1.55x faster through intelligent batching
Synthetic data generation for fine-tuning
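As a rough illustration of the semantic deduplication item above (NeMo Curator's actual pipeline is GPU-accelerated and clustering-based; the random embeddings here are a stand-in for real document embeddings), documents whose embeddings are nearly identical to an already-kept document are dropped:

```python
import numpy as np

def semantic_dedup(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Return indices of documents kept after greedy semantic dedup.

    embeddings: (num_docs, dim) array of document embeddings, e.g. from a
    sentence encoder. A document is dropped if its cosine similarity to any
    already-kept document exceeds `threshold`.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if all(vec @ normed[j] < threshold for j in kept):
            kept.append(i)
    return kept

# Toy usage with random vectors standing in for real document embeddings.
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 384))
print(len(semantic_dedup(docs)))
```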
Export & Deploy
In-framework PyTriton deployment (see the sketch after this list) with backends:
PyTorch
vLLM
TRT-LLM update to 0.10
TRT-LLM C++ runtime
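The in-framework deployment path can be sketched roughly as follows. The module paths, class names, and arguments are assumptions based on the NeMo export/deploy documentation of this period and may differ between container releases; treat this as an outline rather than a verified recipe.

```python
# Hedged sketch: export a .nemo checkpoint to a TensorRT-LLM engine and serve
# it with PyTriton. Names below are assumptions from the NeMo deploy docs of
# this era (the import may be nemo.export.tensorrt_llm in some releases).
from nemo.deploy import DeployPyTriton
from nemo.export import TensorRTLLM

exporter = TensorRTLLM(model_dir="/tmp/trt_llm_engine")
exporter.export(
    nemo_checkpoint_path="/models/llama3-8b.nemo",  # hypothetical path
    model_type="llama",
    n_gpus=1,
)

server = DeployPyTriton(model=exporter, triton_model_name="llama3")
server.deploy()
server.serve()  # blocks; query via the Triton HTTP/gRPC endpoints
```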
NeMo Framework 24.05
NeMo Framework now supports Large Language Models (LLM), Multimodal (MM), Automatic Speech Recognition (ASR), and Text-to-Speech (TTS) in a single consolidated container.
LLM and MM
Models
Megatron Core RETRO
Pre-training
Zero-shot Evaluation
Pretraining, conversion, evaluation, SFT, and PEFT for:
Mixtral 8x22B
Llama 3
SpaceGemma
Embedding Models Fine Tuning
Mistral
BERT
BERT models
Distributed checkpoint
Video capabilities with NeVA
Performance
Distributed Checkpointing
Torch native backend
Parallel read/write
Async write
Multimodal LLM (LLaVA/NeVA)
Pipeline Parallelism support
Sequence packing support
Export
Integration of Export & Deploy Modules into NeMo Framework container
Upgrade to TRT-LLM 0.9
Curator
SFT/PEFT (LoRA and p-tuning) Data Curation Pipeline and Example
Dataset Blending Tool
Domain Classifier
Aligner
LoRA techniques with:
PPO Actor
DPO
SFT/SteerLM
Stable Diffusion models
Speech (ASR & TTS)
Models
AED Multi-Task Models (Canary) - Multi-Task, Multilingual Speech Recognition / Speech Translation model
Multimodal Domain - Speech LLM supporting SALM Model
Parakeet-tdt_ctc-1.1b Model - RTFx of > 1500 (can transcribe 1500 seconds of audio in 1 second)
Audio Codec 16kHz Small - NeMo Neural Audio Codec for discretizing speech for use in LLMs
mel_codec_22khz_medium
mel_codec_44khz_medium
Perf Improvements
transcribe() upgrade - enables one-line transcription with files, tensors, and data loaders (see the sketch after this list)
Frame-looping algorithm for faster RNNT decoding - improves Real Time Factor (RTF) by 2-3x
CUDA Graphs + label-looping algorithm for RNN-T and TDT decoding - transducer greedy decoding at over 1500x RTFx, on par with CTC non-autoregressive models
Semi-sorted batching support - external user contribution that speeds up training by 15-30%
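A minimal usage sketch of the upgraded transcribe() call; the checkpoint name matches the Parakeet model announced above, and the audio path is a placeholder:

```python
# One-line transcription with the upgraded transcribe() API.
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt_ctc-1.1b")
transcriptions = asr_model.transcribe(["/data/sample.wav"])  # files, tensors, or data loaders
print(transcriptions[0])
```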
Customization
Context biasing for CTC word stamping - improves accuracy for custom vocabulary and pronunciation
Longform Inference
Longform inference support for AED models
Transcription of multi-channel audio for AED models
Misc
Upgraded webdataset - Speech and LLM / Multimodal unified container
NeMo Framework 24.03.01
Issues Fixed
GPT memory leak in the loss function
Eval script issue for Mixtral PEFT
Llama 7B Out-of-memory issue when using 1TB system memory
Enabled pipeline parallelism support for LoRA merge
Multi-node Llama training on Kubernetes while saving checkpoint
NeMo Framework 24.03
Fully Sharded Data Parallel (FSDP) support for GPT (see the generic PyTorch sketch after this list)
Post-Training Quantization (PTQ) with AMMO library (0.7.4) for Llama
Support Expert Parallelism for all MoE models, e.g., Mixtral
Pipeline parallel for p-tuning
Updated PEFT metrics for all popular community models
Upgraded PyTorch Lightning to 2.2
Upgraded base container to PyTorch 24.02
Consolidated the StarCoder2- and Gemma-specific containers with the previous Framework GA container
Customizable distributed data classification tool in Curator
GPU-accelerated quality classification model code in Curator
GPU-accelerated domain classification model code in Curator
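FSDP (noted above for GPT) is a PyTorch-native mechanism. The following minimal sketch shows the generic wrapping step in plain PyTorch rather than NeMo's integration, and assumes it is launched with torchrun so the process-group environment variables are set.

```python
# Generic PyTorch FSDP sketch (not NeMo's integration): shard a toy model's
# parameters, gradients, and optimizer state across ranks.
# Launch with: torchrun --nproc_per_node=<gpus> fsdp_demo.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main() -> None:
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
    model = FSDP(model)  # parameters are sharded across data-parallel ranks
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss = model(torch.randn(8, 1024, device="cuda")).pow(2).mean()
    loss.backward()
    optim.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```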
NeMo Framework 24.01.01
Added Mixture of Experts parameter passing for MCore
PP/TP support for Mixture of Experts
SFT / PEFT support for Gemma model
Training / SFT / PEFT / Evaluation support for the Baichuan and CodeLlama models
Fixed SFT/PEFT nemo-launcher configs for Mistral and Mixtral (edited configs with correct values)
Documentation refactor and landing page added
NeMo Framework developer docs added
NeMo Framework 24.01
New end-to-end support (pretraining, conversion, evaluation, SFT, PEFT) for community models, featuring:
Support for community model Falcon
Support for community model Mixtral (expert parallelism coming in a future release)
Support for community model Mistral
Support for community model Code Llama
General availability release of NeMo Multimodal, featuring:
Support for vision-language foundation models: CLIP
Support for text-to-image foundation models: Stable Diffusion and Imagen
Support for text-to-image customization: SD-LoRA, SD-ControlNet, SD-instruct pix2pix
Support for multimodal LLM: NeVA and LLaVA
Support for text-to-NeRF: DreamFusion++
Support for NSFW
New performance features and key optimization:
Support PyTorch Fully Sharded Data Parallel training (FSDP) with tensor-parallelism
Support CPU offloading and prefetch of activations and weights
Support Context Parallelism for performant long-sequence-length LLM training
Support framework-level FP8 precision that reduces memory usage and training step time (see the Transformer Engine sketch after this list)
Transformer layer granularity re-computation with FP8 LLM training
Support pipelined tensor-parallel communication overlap with GEMM for all LLMs
Support LLM fine-tuning with packed sequences
Support fused RoPE and SwiGLU for Llama 2-like models
Device memory bug fix; removed FP8 cast/transpose duplicates in FP8 training
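Framework-level FP8 training (see the item above) builds on NVIDIA Transformer Engine. The following minimal sketch shows the underlying FP8 autocast mechanism in isolation, outside of NeMo, assuming a GPU with FP8 support (e.g. Hopper):

```python
# Generic Transformer Engine FP8 sketch (not NeMo's integration). Linear
# layers inside fp8_autocast run their GEMMs in FP8 with delayed scaling.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)
layer = te.Linear(1024, 1024, bias=True).cuda()
x = torch.randn(16, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()
```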
New features for NeMo Aligner:
Support for MultiEpoch
Added PPO: custom end strings + memory optimizations
Added SFT: LoRA and custom validation metrics
New features for NeMo Curator:
Multi-node multi-GPU fuzzy document-level deduplication supported within the launcher.
Added new Personally Identifiable Information (PII) Removal module
Task decontamination for SFT and PEFT (e.g., LoRA, p-tuning, adapters, etc.) datasets supported within the launcher
Code data filtering heuristics from StarCoder
NeMo Framework 23.11
Open source release of NeMo-Aligner. NeMo-Aligner is a one-stop shop for efficient model alignment algorithms, featuring:
Support for the full Reinforcement Learning from Human Feedback (RLHF) pipeline, including SFT, Reward Model Training, and Reinforcement Learning
Support for the SteerLM technique
Support for Direct Preference Optimization (DPO; see the sketch after this list)
Support for all Megatron Core GPT models, such as Llama 2 70B
Improved user experience
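For background on the Direct Preference Optimization item above: the published DPO objective compares policy and reference log-probabilities on chosen versus rejected responses. A minimal PyTorch rendering of that loss (not NeMo-Aligner's implementation) is:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each input is the summed log-probability of a full response under the
    trainable policy or the frozen reference model.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with random log-probabilities standing in for real model outputs.
lp = torch.randn(4)
print(dpo_loss(lp + 0.5, lp - 0.5, lp, lp))
```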
NeMo Framework 23.10
General announcement of the NeMo Framework Inference container, featuring:
Deployment support for distributed checkpoints (Megatron Core) for NeMotron 8B and Llama 2 (BF16 only)
Deployment support for fine-tuned (SFT, RLHF, SteerLM) NeMotron 8B (BF16 only)
Deployment support for P-tuned Llama 2 on a single GPU (BF16 only)
Support for serving GPT and Llama 2 models using PyTriton on Triton Inference Server
Support for serving GPT and Llama 2 models using TensorRT-LLM C++ backend on Triton Inference Server
Support in-flight batching for TensorRT-LLM C++ backend on Triton Inference Server
NeMo Framework 23.08.03
Enabled PEFT to work with Llama-2 models
Addressed an issue that occurred when resuming Supervised Fine-Tuning with constant learning rate scheduler
Fixed model parallelism bug in SFT and PEFT
Included P-tuning state dictionary handling for distributed checkpoints
Fixed bug that occurred when using the save_best_model flag
Fixed bug where progress bar would show the wrong number of steps
NeMo Framework 23.08.02
Fixed container paths in Hydra configurations
NeMo Framework 23.08.01
Fixed checkpoint search for distributed checkpoints
NeMo Framework 23.08
Added the Distributed Checkpoint Format to NeMo and Megatron Core for GPT
New GPT transformer from Megatron Core which enables training of improved LLM configs
When training 175B GPT with FP8, use tensor parallelism TP=8 and micro batch size MBS = 2 to ensure the model-parallel partitioning fits GPU memory
New GPT transformer from Megatron Core which enables Grouped-Query and Multi-Query Attention for models like Llama 2
Support Llama 1 and Llama 2 pre-training with Megatron Core
Customize LLMs for Llama 1 and Llama 2 models with techniques like SFT, PEFT (p-tuning, adapters, IA3)
Added examples and documentation for Kubernetes training
NeMo Data Curator: added downstream task decontamination support
NeMo Framework 23.07
Added Low-Rank Adaptation (LoRA) Support for T5 and mT5
Added Batch Size Ramp-up Support for GPT
NeMo Framework 23.05
Low-Rank Adaptation (LoRA) Support for GPT
LDDL (Language Datasets and Data Loaders) for BERT, resulting in a 30% performance speedup on the 100B model
Unify dataset and model classes for all PEFT (p-tuning, adapters, IA3) with SFT model class as parent for GPT
Converter from Interleaved PP to non-Interleaved PP
Dialog dataset guidance for SFT to help create better chat models
Support Dynamic Sequence Length Batches with GPT SFT
Data parallelism enabled for RLHF servers, providing a 2x end-to-end speedup in most jobs
NeMo Framework 23.04.1
Addressed issue in RLHF which prevented some jobs from running in Slurm clusters
Corrections related to the renaming of NeMo Megatron to NeMo Framework
Modified run.name in the *_improved configuration files to match the correct parameter count
NeMo Framework 23.04
Supports NeMo Data Curator, a scalable Python library for curating the large-scale datasets required for training large language foundation models
Enables continued training for P-tuning
Switches to Megatron Core for Model Parallelism
Extends the Data Validation Tool to provide P-tuning GPU runtime estimates
Supports tensor and pipeline parallelism conversion for GPT and T5 models
Supports supervised fine-tuning for GPT
Adds Reinforcement Learning from Human Feedback (RLHF) for GPT models
Adds four GPT model sizes based on new and improved model configurations:
400M_improved
1B_improved
7B_improved
40B_improved
Following is a list of GPT model configuration changes:
Configuration | Previous | New
---|---|---
Activation | GeLU | Fast-SwiGLU
Position Embedding | Learned Absolute | RoPE
Dropout | 0.1 | 0
Embeddings and Output Layer | Tied | Untied
Bias terms | Yes | No
Normalization | LayerNorm | LayerNorm1p
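The Fast-SwiGLU activation in the table above replaces GeLU in the improved configurations. Below is a minimal, unfused PyTorch sketch of the SwiGLU form; the "Fast" variant is understood here as an optimized implementation of the same function.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated MLP block: down( silu(x @ W_gate) * (x @ W_up) )."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


# Usage on a batch of token embeddings.
y = SwiGLU(d_model=512, d_ff=2048)(torch.randn(2, 16, 512))
```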
NeMo Framework 23.03
Adds a per microbatch data loader for GPT and BERT models
Supports SquaredReLU and SwiGLU activation functions for GPT and T5 models
Supports Rotary Position Embedding (RoPE) for GPT and RETRO
Supports early stopping when P‑tuning or prompt tuning GPT, T5, and mT5 models
Implements refactored adapter learning to mimic the parameter-efficient transfer learning for NLP approach
Adds flash attention for GPT models in Transformer Engine
NeMo Framework 23.01
Supports BERT models with tensor parallelism (training only)
Supports BERT models with pipeline parallelism (training only)
Supports sequence parallelism and selective activation checkpointing for BERT (training only)
Supports interleaved pipeline scheduling for BERT models
Adds Distributed Adam Optimizer for BERT models
Supports AutoConfigurator for BERT models
Adds 110M, 4B, 20B, and 100B BERT training configurations
Supports mixture-of-experts for T5 models (no expert parallelism, training only)
Improves performance for GPT P‑tuning (20%−25% speed-up)
Adds ALiBi position embeddings for T5 and mT5 (training only)
Logs total model size (across model parallel ranks) for GPT, T5, mT5, and BERT models
NeMo Framework 22.11
Adds interleaved pipeline scheduling for GPT models (training only)
Supports FP8 using Transformer Engine (training only)
Adds Distributed Adam Optimizer for T5 and mT5 models
Supports P‑tuning and prompt tuning for GPT models with sequence parallelism
Improves training configuration throughput by 7.9% (5B GPT), 9.6% (3B T5), 4.3% (11B T5), 52.4% (23B T5), and 26.6% (41B T5)
NeMo Framework 22.09
Supports NeMo Framework training and inference containers on OCI; for details on orchestration scripts, reach out to oci_nm@nvidia.com
Supports P‑tuning and prompt tuning for T5 and mT5 models with pipeline parallelism (training only)
Supports adapter learning for GPT and T5 with tensor parallelism and pipeline parallelism (training only)
Supports IA3 learning for GPT and T5 with tensor parallelism and pipeline parallelism (training only)
Adds AutoConfigurator to find the highest-throughput configurations for training on Base Command Platform
Adds AutoConfigurator for parallel inference hyperparameter search for GPT on Base Command Manager
NeMo Framework 22.08.01
Supports Amazon Web Services as a cloud service provider (performance validated up to 20
p4d.24xlarge
instances)Adds switched orchestration for cloud service providers from Azure CycleCloud to NVIDIA Nephele for Microsoft Azure
NeMo Framework 22.08
Adds distributed Adam Optimizer for GPT models
Adds asymmetric encoder-decoder configuration for T5 and mT5 models
Supports untying embeddings from the classifier layer for T5 and mT5 models
Supports relative position embeddings for T5 and mT5 models (pipeline parallelism ≥3)
Supports P‑tuning and prompt tuning for T5 and mT5 models with tensor parallelism (training only)
Refactors code to yield improved consistency and readability of configurations and logs
Supports SQuAD fine-tuning and evaluation for T5 models with pipeline parallelism ≤2
Supports XQuAD fine-tuning and evaluation for mT5 models with pipeline parallelism ≤2
NeMo Framework 22.06-hotfix.01
Fixes AutoConfigurator for T5 and mT5 models
Fixes Evaluation harness in GPT models
Fixes Prompt learning in GPT models
Fixes “out of memory” condition when pretraining GPT models with sequence parallelism
NeMo Framework 22.06
Supports sequence parallelism and selective activation checkpointing for GPT
Supports relative position embeddings for T5
NVIDIA used the mC4 dataset (24 Languages) for pretraining the mT5 models, and verified the results on KNLI, KorQuAD, KLUE-STS, and XNLI tasks.
Updates AutoConfigurator with sequence parallelism and selective activation checkpointing for GPT models
Adds AutoConfigurator support for DGX A100 40GB configurations for GPT, T5, and mT5 models
Supports P‑tuning and prompt tuning for GPT with pipeline parallelism (training only)
Supports operation fusions for higher training throughput (2%-7% speed-up)
Changes default GPT configurations to include sequence parallelism and selective activation checkpointing: 20B (speed-up: 14%), 40B (speed-up: 9%), and 175B (speed-up: 15%)
NeMo Framework 22.05.01
Adds cloud service provider support for Microsoft Azure (performance validated up to 36 Standard_ND96amsr_A100_v4 instances)
Adds cluster validation tools (DGMI, NCCL)
Improves performance of 20B GPT training configuration by 2.7%
NeMo Framework 22.05
Supports asynchronous gradient all-reduce for GPT, T5, mT5 models with pipeline parallel size equal to 1
Supports P‑tuning and prompt tuning for GPT with tensor parallelism (training only)
Adds AutoConfigurator to find the highest-throughput configurations for training and inference on Base Command Manager
Supports custom tokenizers (training only)
Supports GPT models with pipeline parallelism on Base Command Manager (inference)
Supports new hyperparameters for text generation: top-p, top-k, and temperature (see the sketch below)
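A compact sketch of how these three sampling hyperparameters interact (generic logits processing, not NeMo's text generation pipeline):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                      top_k: int = 0, top_p: float = 1.0) -> int:
    """Apply temperature, top-k, and top-p (nucleus) filtering, then sample.

    logits: unnormalized scores over the vocabulary for the next token.
    """
    logits = logits / max(temperature, 1e-5)          # temperature scaling
    if top_k > 0:                                     # keep only the k best tokens
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")
    if top_p < 1.0:                                   # nucleus filtering
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumprobs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        cutoff = cumprobs > top_p
        cutoff[1:] = cutoff[:-1].clone()              # always keep the top token
        cutoff[0] = False
        logits[sorted_idx[cutoff]] = float("-inf")
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# Toy usage over a random 32k-entry vocabulary.
token_id = sample_next_token(torch.randn(32000), temperature=0.8, top_k=50, top_p=0.9)
```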
NeMo Framework 22.04
Supports T5 models with pipeline parallelism (training only)
Switches from GeLU to GeGLU as activation function for T5
Supports mT5 with tensor parallelism and pipeline parallelism (training only)
Adds 11B, 23B, and 41B T5 model training configurations
Adds 170M, 390M, and 3B mT5 model training configurations
Adds automatic and configurable Non-Uniform Memory Access (NUMA) mapping
NeMo Framework 22.03
Adds tensor parallelism support for T5 models (optimized for <20B parameters, training only)
Adds 220M and 3B T5 model training configurations
Supports GLUE fine-tuning and evaluation for T5 models
NeMo Framework 22.02
Supports GPT models with pipeline parallelism (training only)
Adds 40B and 175B GPT model training configurations
NeMo Framework 22.01
Supports GPT with tensor parallelism on Base Command Platform
Supports O2-style AMP (accelerated training of larger models)
Includes a chatbot sample application using your trained GPT model
Supports training metric monitoring and visualization with Weights & Biases