Changelog

This section identifies the major changes in each version of the NVIDIA NeMo™ framework released to date.

Issues Fixed

  • GPT memory leak in the loss function

  • Eval script issue for Mixtral PEFT

  • Llama 7B Out-of-memory issue when using 1TB system memory

  • Enabled pipeline parallelism support for LoRA merge

  • Multi-node Llama training on Kubernetes while saving checkpoint

  • Fully Sharded Data Parallel (FSDP) support for GPT

  • Post-Training Quantization (PTQ) with AMMO library (0.7.4) for Llama

  • Support for expert parallelism on all MoE models (e.g., Mixtral)

  • Pipeline parallel for p-tuning

  • Updated PEFT metrics for all popular community models (see the support matrix)

  • Upgraded PyTorch Lightning to 2.2

  • Upgraded base container to PyTorch 24.02

  • Consolidated the StarCoder2- and Gemma-specific containers with the previous NeMo Framework GA container

  • Customizable distributed data classification tool in Curator

  • GPU-accelerated quality classification model code in Curator

  • GPU-accelerated domain classification model code in Curator

  • Added Mixture of Experts parameter passing for MCore

  • PP/TP support for Mixture of Experts

  • SFT / PEFT support for Gemma model

  • Training / SFT / PEFT / Evaluation support for:

    • Baichuan model

    • CodeLlama model

  • Fixed the nemo-launcher SFT/PEFT configs for Mistral and Mixtral (updated the configs with correct values)

  • Documentation refactor and landing page added

  • NeMo Framework developer docs added

  • New end-to-end support (pretraining, conversion, evaluation, SFT, PEFT) for community models, featuring:

    • Support for community model Falcon

    • Support for community model Mixtral (expert parallelism coming in future release)

    • Support for community model Mistral

    • Support for community model Code Llama

  • General availability release of NeMo Multimodal, featuring:

    • Support for vision-language foundation models: CLIP

    • Support for text-2-image foundation models: Stable Diffusion and Imagen

    • Support for text-2-image customization: SD-LoRA, SD-ControlNet, SD-instruct pix2pix

    • Support for multimodal LLMs: NeVA and LLaVA

    • Support for text-2-NeRF: DreamFusion++

    • Support for NSFW content filtering

  • New performance features and key optimizations:

    • Support PyTorch Fully Sharded Data Parallel training (FSDP) with tensor-parallelism

    • Support CPU offloading and prefetch of activations and weights

    • Support Context Parallelism for performant long-sequence-length LLM training

    • Support framework-level FP8 precision that reduces memory usage and training step time

    • Transformer layer granularity re-computation with FP8 LLM training

    • Support pipelined tensor-parallel communication overlap with GEMM for all LLMs

    • Support LLM fine-tuning with packed sequences

    • Support fused RoPE and SwiGLU for Llama 2-like models

    • Device memory bug fix: removed duplicate FP8 cast/transpose operations in FP8 training

  • New features for NeMo Aligner:

    • Support for MultiEpoch

    • Added PPO support for custom end strings and memory optimizations

    • Added SFT support for LoRA and custom validation metrics

  • New features for NeMo Curator:

    • Multi-node multi-GPU fuzzy document-level deduplication supported within the launcher.

    • Added a new Personally Identifiable Information (PII) removal module

    • Task decontamination for SFT and PEFT (e.g., LoRA, p-tuning, adapters, etc.) datasets supported within the launcher

    • Code data filtering heuristics from StarCoder

  • Open-source release of NeMo-Aligner, a one-stop shop for efficient model alignment algorithms, featuring:

    • Support for the full Reinforcement Learning from Human Feedback (RLHF) pipeline, including SFT, reward model training, and reinforcement learning

    • Support for the SteerLM technique

    • Support for Direct Preference Optimization

    • Support for all Megatron Core GPT models, such as Llama 2 70B

    • Improved user experience

  • General announcement of the NeMo Framework Inference container, featuring:

    • Deployment support for distributed checkpoints (Megatron Core) for NeMotron 8B and Llama 2 (BF16 only)

    • Deployment support for fine-tuned (SFT, RLHF, SteerLM) NeMotron 8B (BF16 only)

    • Deployment support for P-tuned Llama 2 on a single GPU (BF16 only)

    • Support for serving GPT and Llama 2 models using PyTriton on Triton Inference Server

    • Support for serving GPT and Llama 2 models using TensorRT-LLM C++ backend on Triton Inference Server

    • Support for in-flight batching with the TensorRT-LLM C++ backend on Triton Inference Server

  • Enabled PEFT to work with Llama-2 models

  • Addressed an issue that occurred when resuming Supervised Fine-Tuning with constant learning rate scheduler

  • Fixed model parallelism bug in SFT and PEFT

  • Included P-tuning state dictionary handling for distributed checkpoints

  • Fixed bug that occurred when using the save_best_model flag

  • Fixed bug where progress bar would show the wrong number of steps

  • Fixed container paths in Hydra configurations

  • Fixed checkpoint search for distributed checkpoints

  • Added the Distributed Checkpoint Format to NeMo and Megatron Core for GPT

  • New GPT transformer from Megatron Core which enables training of improved LLM configs

  • When training the 175B GPT model with FP8, use tensor parallelism TP=8 and micro batch size MBS=2 to ensure that the model-parallel partitioning fits in GPU memory
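
    A back-of-envelope sketch of the memory arithmetic behind this recommendation; the pipeline-parallel size, bytes-per-parameter figure, and 80 GB GPU capacity below are illustrative assumptions, not values from the release notes:

        # Rough memory estimate for one model-parallel shard of a 175B-parameter GPT model.
        # Every constant below is an illustrative assumption for this sketch.
        total_params = 175e9
        tp = 8                      # tensor-parallel size recommended above
        pp = 16                     # hypothetical pipeline-parallel size
        params_per_gpu = total_params / (tp * pp)

        # Assume roughly 18 bytes per parameter for weights, gradients, and Adam optimizer
        # states under mixed-precision training without a distributed optimizer.
        bytes_per_param = 18
        static_gb = params_per_gpu * bytes_per_param / 1e9

        print(f"parameters per GPU: {params_per_gpu / 1e9:.2f}B")   # ~1.37B
        print(f"static memory:      {static_gb:.0f} GB per 80 GB GPU")
        # Keeping the micro batch size at 2 bounds the remaining activation memory.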

  • New GPT transformer from Megatron Core which enables grouped-query and multi-query attention for models like Llama 2

  • Support Llama 1 and Llama 2 pre-training with Megatron Core

  • Customize LLMs for Llama 1 and Llama 2 models with techniques like SFT, PEFT (p-tuning, adapters, IA3)

  • Added examples and documentation for Kubernetes training

  • NeMo Data Curator: added downstream task decontamination support

  • Added Low-Rank Adaptation (LoRA) Support for T5 and mT5

  • Added Batch Size Ramp-up Support for GPT

  • Low-Rank Adaptation (LoRA) Support for GPT
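
    As an illustration of what LoRA adds to a model, the following is a minimal, framework-agnostic PyTorch sketch of a low-rank adapter wrapped around a frozen linear layer; it is not NeMo's implementation, and the rank and scaling values are arbitrary:

        import torch
        import torch.nn as nn

        class LoRALinear(nn.Module):
            """A frozen base linear layer plus a trainable low-rank update."""

            def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
                super().__init__()
                self.base = base
                for p in self.base.parameters():
                    p.requires_grad = False            # only the adapter is trained
                self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
                self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
                self.scaling = alpha / rank

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

        # Usage: wrap an attention projection, then train only the adapter parameters.
        layer = LoRALinear(nn.Linear(1024, 1024))
        out = layer(torch.randn(2, 16, 1024))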

  • LDDL (Language Datasets and Data Loaders) for the 100B BERT model, resulting in a 30% performance speed-up

  • Unified dataset and model classes for all PEFT methods (p-tuning, adapters, IA3), with the SFT model class as the parent class for GPT

  • Converter from Interleaved PP to non-Interleaved PP

  • Dialog dataset guidance for SFT to help create better chat models

  • Support Dynamic Sequence Length Batches with GPT SFT

  • Data parallelism enabled for RLHF servers, providing a 2x end-to-end speedup in most jobs

  • Addressed issue in RLHF which prevented some jobs from running in Slurm clusters

  • Corrections related to the renaming of NeMo Megatron to NeMo Framework

  • Modified run.name in the *_improved configuration files to match the correct parameter count

  • Supports NeMo Data Curator, a scalable Python library for curating the large-scale datasets required for training large language foundation models

  • Enables continued training for P-tuning

  • Switches to Megatron Core for Model Parallelism

  • Extends the Data Validation Tool to provide P-tuning GPU runtime estimates

  • Supports tensor and pipeline parallelism conversion for GPT and T5 models

  • Supports supervised fine-tuning for GPT

  • Adds Reinforcement Learning from Human Feedback (RLHF) for GPT models

  • Adds four GPT model sizes based on new and improved model configurations:

    • 400M_improved

    • 1B_improved

    • 7B_improved

    • 40B_improved

Following is a list of GPT model configuration changes:

Configuration                Previous           New
---------------------------  -----------------  -----------
Activation                   GeLU               Fast-SwiGLU
Position Embedding           Learned Absolute   RoPE
Dropout                      0.1                0
Embeddings and Output Layer  Tied               Untied
Bias terms                   Yes                No
Normalization                LayerNorm          LayerNorm1p
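
    As a reference for the activation change above, here is a minimal PyTorch sketch of a SwiGLU-style gated MLP with the bias terms removed, as in the new configurations; the class name and layer sizes are illustrative, not NeMo code:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class SwiGLUMLP(nn.Module):
            """Gated MLP: down(silu(gate(x)) * up(x)), with no bias terms."""

            def __init__(self, hidden: int, ffn: int):
                super().__init__()
                self.gate = nn.Linear(hidden, ffn, bias=False)
                self.up = nn.Linear(hidden, ffn, bias=False)
                self.down = nn.Linear(ffn, hidden, bias=False)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                return self.down(F.silu(self.gate(x)) * self.up(x))

        mlp = SwiGLUMLP(hidden=1024, ffn=2730)   # ffn is often ~8/3 * hidden in SwiGLU variants
        print(mlp(torch.randn(2, 8, 1024)).shape)
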
  • Adds a per microbatch data loader for GPT and BERT models

  • Supports SquaredReLU and SwiGLU activation functions for GPT and T5 models

  • Supports Rotary Position Embedding (RoPE) for GPT and RETRO

  • Supports early stopping when P‑tuning or prompt tuning GPT, T5, and mT5 models
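
    Conceptually this maps to the standard PyTorch Lightning early-stopping callback; a minimal sketch, assuming a LightningModule `model`, a datamodule `dm`, and a validation metric logged as `val_loss` (all placeholder names):

        from pytorch_lightning import Trainer
        from pytorch_lightning.callbacks import EarlyStopping

        # Stop when the monitored validation metric has not improved for `patience` checks.
        early_stop = EarlyStopping(monitor="val_loss", mode="min", patience=3, min_delta=0.0)

        trainer = Trainer(max_steps=10000, callbacks=[early_stop])
        # trainer.fit(model, datamodule=dm)   # model / dm are placeholder names for this sketch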

  • Implements refactored adapter learning to mimic the "Parameter-Efficient Transfer Learning for NLP" approach

  • Adds flash attention for GPT models in Transformer Engine

  • Supports BERT models with tensor parallelism (training only)

  • Supports BERT models with pipeline parallelism (training only)

  • Supports sequence parallelism and selective activation checkpointing for BERT (training only)

  • Supports interleaved pipeline scheduling for BERT models

  • Adds Distributed Adam Optimizer for BERT models

  • Supports AutoConfigurator for BERT models

  • Adds 110M, 4B, 20B, and 100B BERT training configurations

  • Supports mixture-of-experts for T5 models (no expert parallelism, training only)

  • Improves performance for GPT P‑tuning (20%−25% speed-up)

  • Adds ALiBi position embeddings for T5 and mT5 (training only)

  • Logs total model size (across model-parallel ranks) for GPT, T5, mT5, and BERT models

  • Adds interleaved pipeline scheduling for GPT models (training only)

  • Supports FP8 using Transformer Engine (training only)

  • Adds Distributed Adam Optimizer for T5 and mT5 models

  • Supports P‑tuning and prompt tuning for GPT models with sequence parallelism

  • Improves the throughput of training configurations by 7.9% (5B GPT), 9.6% (3B T5), 4.3% (11B T5), 52.4% (23B T5), and 26.6% (41B T5)

  • Supports NeMo framework training and inference containers on OCI; for details on orchestration scripts, reach out to oci_nm@nvidia.com

  • Supports P‑tuning and prompt tuning for T5 and mT5 models with pipeline parallelism (training only)

  • Supports adapter learning for GPT and T5 with tensor parallelism and pipeline parallelism (training only)

  • Supports IA3 learning for GPT and T5 with tensor parallelism and pipeline parallelism (training only)

  • Adds AutoConfigurator to find the highest-throughput configurations for training on Base Command Platform

  • Adds AutoConfigurator for parallel inference hyperparameter search for GPT on Base Command Manager

  • Supports Amazon Web Services as a cloud service provider (performance validated up to 20 p4d.24xlarge instances)

  • Switches orchestration for Microsoft Azure from Azure CycleCloud to NVIDIA Nephele

  • Adds distributed Adam Optimizer for GPT models

  • Adds asymmetric encoder-decoder configuration for T5 and mT5 models

  • Supports untying embeddings from the classifier layer for T5 and mT5 models

  • Supports relative position embeddings for T5 and mT5 models (pipeline parallelism ≥3)

  • Supports P‑tuning and prompt tuning for T5 and mT5 models with tensor parallelism (training only)

  • Refactors code to yield improved consistency and readability of configurations and logs

  • Supports SQuAD fine-tuning and evaluation for T5 models with pipeline parallelism ≤2

  • Supports XQuAD fine-tuning and evaluation for mT5 models with pipeline parallelism ≤2

  • Fixes AutoConfigurator for T5 and mT5 models

  • Fixes Evaluation harness in GPT models

  • Fixes Prompt learning in GPT models

  • Fixes “out of memory” condition when pretraining GPT models with sequence parallelism

  • Supports sequence parallelism and selective activation checkpointing for GPT

  • Supports relative position embeddings for T5

    NVIDIA used the mC4 dataset (24 Languages) for pretraining the mT5 models, and verified the results on KNLI, KorQuAD, KLUE-STS, and XNLI tasks.

  • Updates AutoConfigurator with sequence parallelism and selective activation checkpointing for GPT models

  • Adds AutoConfigurator support for DGX A100 40GB configurations for GPT, T5, and mT5 models

  • Supports P‑tuning and prompt tuning for GPT with pipeline parallelism (training only)

  • Supports operation fusions for higher training throughput (2%-7% speed-up)

  • Changes default GPT configurations to include sequence parallelism and selective activation checkpointing: 20B (speed-up: 14%), 40B (speed-up: 9%), and 175B (speed-up: 15%)
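
    Selective activation recomputation trades compute for memory by checkpointing only the memory-heavy part of each transformer layer rather than the whole layer; the following generic PyTorch sketch illustrates the idea (it is not the Megatron Core implementation, and the layer shapes are arbitrary):

        import torch
        import torch.nn as nn
        from torch.utils.checkpoint import checkpoint

        class Layer(nn.Module):
            def __init__(self, hidden: int = 512):
                super().__init__()
                self.attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
                self.mlp = nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(),
                                         nn.Linear(4 * hidden, hidden))

            def forward(self, x):
                # Recompute only the attention block during the backward pass;
                # the MLP keeps its activations resident as usual.
                def attn_block(y):
                    out, _ = self.attn(y, y, y)
                    return out

                x = x + checkpoint(attn_block, x, use_reentrant=False)
                return x + self.mlp(x)

        layer = Layer()
        layer(torch.randn(2, 16, 512)).sum().backward()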

  • Adds cloud service provider support for Microsoft Azure (performance validated up to 36 Standard_ND96amsr_A100_v4 instances)

  • Adds cluster validation tools (DCGMI, NCCL)

  • Improves performance of 20B GPT training configuration by 2.7%

  • Supports asynchronous gradient all-reduce for GPT, T5, mT5 models with pipeline parallel size equal to 1

  • Supports P‑tuning and prompt tuning for GPT with tensor parallelism (training only)

  • Adds AutoConfigurator to find the highest-throughput configurations for training and inference on Base Command Manager

  • Supports custom tokenizers (training only)

  • Supports GPT models with pipeline parallelism on Base Command Manager (inference)

  • Supports new hyperparameters for text generation: top-p, top-k, and temperature
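
    A minimal PyTorch sketch of how these three hyperparameters shape the next-token distribution; this is generic sampling logic for illustration, not the NeMo inference code path:

        import torch

        def sample_next_token(logits: torch.Tensor, temperature: float = 1.0,
                              top_k: int = 0, top_p: float = 1.0) -> int:
            """Sample one token id from a 1-D logits vector using temperature, top-k, and top-p."""
            logits = logits / max(temperature, 1e-5)          # temperature scaling

            if top_k > 0:                                     # keep only the k largest logits
                kth = torch.topk(logits, top_k).values[-1]
                logits[logits < kth] = float("-inf")

            if top_p < 1.0:                                   # nucleus (top-p) filtering
                sorted_logits, sorted_idx = torch.sort(logits, descending=True)
                cumprobs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
                cutoff = cumprobs > top_p
                cutoff[1:] = cutoff[:-1].clone()              # always keep the top token
                cutoff[0] = False
                logits[sorted_idx[cutoff]] = float("-inf")

            probs = torch.softmax(logits, dim=-1)
            return int(torch.multinomial(probs, num_samples=1))

        print(sample_next_token(torch.randn(32000), temperature=0.7, top_k=50, top_p=0.9))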

  • Supports T5 models with pipeline parallelism (training only)

  • Switches from GeLU to GeGLU as the activation function for T5

  • Supports mT5 with tensor parallelism and pipeline parallelism (training only)

  • Adds 11B, 23B, and 41B T5 model training configurations

  • Adds 170M, 390M, and 3B mT5 model training configurations

  • Adds automatic and configurable Non-Uniform Memory Access (NUMA) mapping

  • Adds tensor parallelism support for T5 models (optimized for <20B parameters, training only)

  • Adds 220M and 3B T5 model training configurations

  • Supports GLUE fine-tuning and evaluation for T5 models

  • Supports GPT models with pipeline parallelism (training only)

  • Adds 40B and 175B GPT model training configurations

  • Supports GPT with tensor parallelism on Base Command Platform

  • Supports O2-style AMP (accelerated training of larger models)

  • Includes a chatbot sample application using your trained GPT model

  • Supports training metric monitoring and visualization with Weights & Biases
