NVIDIA NeMo framework is a scalable and cloud-native generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech). It enables users to efficiently create, customize, and deploy new generative AI models by leveraging existing code and pretrained model checkpoints.
NeMo framework provides end-to-end model development functionalities for LLMs and Multimodal models that can be used on-premises or on a cloud provider of your choice:
- Data Curation
  NeMo Curator is a Python library that consists of a collection of GPU-optimized, scalable data-mining modules for curating natural language data for training large language models (LLMs). The modules within NeMo Curator enable NLP researchers to mine high-quality text at scale from massive raw web corpora.
- Model Training and Customization
  NeMo framework includes all the tooling required to efficiently train and customize LLMs and Multimodal models. This includes setting up the compute cluster, downloading data, and selecting model hyperparameters. The default configurations for each model and task are tested regularly, and every configuration can be modified to train on new datasets or test new model hyperparameters. For customization, in addition to fully supervised finetuning (SFT), the framework also supports a variety of parameter-efficient finetuning (PEFT) techniques such as P-tuning, LoRA, Adapters, and IA3, which can usually reach nearly the same accuracy at a fraction of the computational cost of SFT.
- Model Alignment
  Part of the framework, NeMo-Aligner is a scalable toolkit for efficient model alignment. The toolkit supports Supervised Finetuning (SFT) and other state-of-the-art (SOTA) model alignment algorithms such as SteerLM, DPO, and Reinforcement Learning from Human Feedback (RLHF). These algorithms enable users to align language models to be safer and more helpful.
- Launcher
  The NeMo Launcher streamlines your experience with the NeMo framework, offering a user-friendly interface to build end-to-end workflows for efficient management and organization of experiments across various environments. Built upon the Hydra framework, it lets users compose and adjust hierarchical configurations using configuration files and command-line arguments. You can get started with large-scale training, customization, or alignment jobs locally (with single-node support), on NVIDIA Base Command Manager (Slurm), or on cloud providers (AWS, Azure, and Oracle Cloud Infrastructure (OCI)) using launcher scripts, without having to write any code.
- Model Inference
  NeMo framework integrates seamlessly with enterprise-grade model deployment tooling using NVIDIA NIM, powered by NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server for optimized and scalable inference.
- Model Support
  End-to-end model development workflows for popular community models (Gemma, Starcoder 2, Llama 1/2, Baichuan 2, Falcon, Mixtral, Mistral, and more) and NVIDIA Nemotron models are supported in the NeMo framework.
Developing conversational AI models is a complex process that entails defining, building, and training models in specific domains. This often involves multiple iterations to achieve high accuracy, fine-tuning on various tasks and domain-specific data, ensuring training performance, and preparing models for inference deployment.
NeMo framework supports training and customizing speech AI models for tasks such as automatic speech recognition (ASR) and text-to-speech synthesis (TTS), with a seamless path to enterprise-grade production deployment with NVIDIA Riva. To further aid developers and researchers, it comes with state-of-the-art (SOTA) pre-trained checkpoints, tooling for reproducible speech data processing, and tooling for interactive exploration and analysis of speech datasets. Parts of NeMo framework for Speech include:
- Training and Customization
  NeMo Framework contains everything needed to train and customize speech models (ASR, Speech Classification, Speaker Recognition, Speaker Diarization, and TTS) in a reproducible manner.
- SOTA Pre-trained Models
  Recipes and pre-trained checkpoints of several ASR and TTS models, as well as instructions on how to load them, are provided.
- Speech Tools
  A set of tools useful for developing ASR and TTS models, including:
  - NeMo Forced Aligner (NFA) for generating token-, word- and segment-level timestamps of speech in audio using NeMo’s CTC-based Automatic Speech Recognition models.
  - Speech Data Processor (SDP), a toolkit for simplifying speech data processing. It allows you to represent data processing operations in a config file, minimizing boilerplate code, and allowing reproducibility and shareability.
  - Speech Data Explorer (SDE), a Dash-based web application for interactive exploration and analysis of speech datasets.
  - Dataset creation tool which provides functionality to align long audio files with the corresponding transcripts and split them into shorter fragments that are suitable for Automatic Speech Recognition (ASR) model training.
  - Comparison Tool for ASR Models to compare predictions of different ASR models at word accuracy and utterance level.
  - ASR Evaluator for evaluating the performance of ASR models and other features such as Voice Activity Detection.
- Path to Deployment
  NeMo models trained or customized using the framework can be optimized and deployed with NVIDIA Riva, which includes containers, and helm charts designed to automate the steps for push-button deployment.
Example Scripts for Pretraining and Fine-tuning
These scripts run a recommended config for GPT3, LLAMA2, NeMo Pretraining, and Fine-tuning for various model sizes on A100 and H100 GPUs. For example, for GPT3 pretraining, the following folders provide sample scripts:
A100 : Scripts to run GPT pretraining on NVIDIA A100, in bf16 data type
H100 : Scripts to run GPT pretraining for NVIDIA H100, in fp8 data type
Setup
To run these scripts, you must have access to the NeMo Framework Container. Please sign in at NGC (user = ea-bignlp/ga-participants) to access the catalog.
Update the following bash variables in the example run scripts:
- NEMO_MEGATRON_LAUNCHER_DIR: the directory where this repository is located
- DATA_DIR: the directory of the dataset used for pretraining; by default this is NEMO_MEGATRON_LAUNCHER_DIR/data
Enter your cluster environment settings in config.yaml.
For bcm-type clusters, update the job name, partition, and account in bcm.yaml.
For testing performance with synthetic data on an interactive node, you need to add the following options to your bash script:
    cluster_type=interactive \
    ++training.cluster_type=BCP \
    training.model.data.data_impl="mock" \
    training.model.data.data_prefix=[]
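The options above are Hydra-style overrides: a bare `key=value` changes a key that already exists in the configuration, while the `++` prefix adds the key if it is missing and overrides it otherwise. The following is a minimal pure-Python sketch of that behaviour on a nested dictionary. It is not Hydra's implementation: the `apply_override` and `parse_value` helpers and the baseline configuration values are hypothetical, and only the bare and `++` forms are handled.

```python
def parse_value(raw: str):
    """Parse the small subset of values used here: an empty list or a string."""
    if raw == "[]":
        return []
    return raw.strip('"')


def apply_override(config: dict, override: str) -> None:
    """Apply one dotted override (bare or '++'-prefixed) to config in place."""
    key, _, raw = override.partition("=")
    force = key.startswith("++")  # '++' means: add the key or override it
    if force:
        key = key[2:]
    *parents, leaf = key.split(".")
    node = config
    for name in parents:
        if name not in node:
            if not force:
                raise KeyError(f"{key!r} is not in the config; use '++' to add it")
            node[name] = {}
        node = node[name]
    if leaf not in node and not force:
        raise KeyError(f"{key!r} is not in the config; use '++' to add it")
    node[leaf] = parse_value(raw)


# Hypothetical baseline; the real launcher configuration has many more keys.
config = {
    "cluster_type": "bcm",
    "training": {
        "model": {"data": {"data_impl": "mmap", "data_prefix": ["<path-to-data>"]}}
    },
}

overrides = [
    "cluster_type=interactive",
    "++training.cluster_type=BCP",  # key is absent in the baseline, so '++' is needed
    'training.model.data.data_impl="mock"',
    "training.model.data.data_prefix=[]",
]
for item in overrides:
    apply_override(config, item)

print(config)
```

In this sketch, dropping the `++` from the second override raises a `KeyError`, because the hypothetical baseline has no `training.cluster_type` key.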
For further details see General Configuration
Collect Results
For performance, the “step_time_per_sec” variable in the console output provides a quick way to read the performance of a workload.
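As a rough sketch, a step time can be turned into a per-GPU token throughput comparable to the benchmark tables in this document. This assumes the reported value is the wall-clock time of one training step in seconds and that every step processes global batch size times sequence length tokens; the function name and the 11.06 s step time are hypothetical, chosen only for illustration.

```python
def tokens_per_sec_per_gpu(step_time_s, global_batch_size, seq_len, num_gpus):
    """Convert the duration of one training step into per-GPU token throughput."""
    tokens_per_step = global_batch_size * seq_len
    return tokens_per_step / (step_time_s * num_gpus)


# GBS, sequence length, and GPU count follow the GPT3-175B benchmark row;
# the step time is a made-up value (assumption).
throughput = tokens_per_sec_per_gpu(
    step_time_s=11.06, global_batch_size=2048, seq_len=2048, num_gpus=512
)
print(f"{throughput:.0f} tokens / sec / GPU")
```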
For more details and graphics, you can use TensorBoard or Weights and Biases. To do so, use the results stored at NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>, which has the following structure:
- NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<experiment_name>.yaml: The config of the pretrained model
- NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<jobname>_<experiment_name>.sh: The autogenerated .sh file that was run
- NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/results/: Directory containing per-rank logs and TensorBoard data
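The results layout described above can be sketched with `pathlib`. The helper name, the launcher directory, and the experiment and job names below are hypothetical placeholders; only the path pattern comes from the description above.

```python
from pathlib import Path


def expected_result_paths(launcher_dir, experiment_name, job_name):
    """Build the paths a finished run is expected to leave behind."""
    results_dir = Path(launcher_dir) / "results" / experiment_name
    return {
        "config": results_dir / f"{experiment_name}.yaml",
        "script": results_dir / f"{job_name}_{experiment_name}.sh",
        "logs": results_dir / "results",  # per-rank logs and TensorBoard data
    }


# Placeholder values (assumptions), not names produced by the launcher.
paths = expected_result_paths("/opt/launcher", "my_experiment", "my_job")
for label, path in paths.items():
    print(f"{label}: {path} (exists: {path.exists()})")
```

Under these assumptions, the `logs` directory is the one to pass to TensorBoard.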
For further details see Interpreting the Results
Benchmark Results
Large Language Models Pretraining
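The results in the table below show pre-training performance of various models on DGX H100, with FP8. GBS and MBS are the global and micro batch sizes; TP and PP are the tensor-parallel and pipeline-parallel sizes.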
Please refer to MLCommons Training results for performance of GPT3-175B pre-training on large scale H100 systems.
To calculate Model TFLOPs, please see Appendix A in paper.
Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs) |
---|---|---|---|---|---|---|---|---|---|
GPT3-175B | 512 | 2048 | 1 | 2048 | 4 | 8 | 741 | 797 | 153 |
GPT3-5B | 64 | 2048 | 4 | 2048 | 1 | 1 | 23574 | 746 | 5 |
GPT3-20B | 64 | 256 | 2 | 2048 | 2 | 1 | 5528 | 708 | 20 |
LLAMA2-7B | 8 | 128 | 1 | 4096 | 1 | 1 | 16290 | 751 | 7 |
LLAMA2-13B | 16 | 128 | 1 | 4096 | 4 | 1 | 8317 | 725 | 14 |
LLAMA2-70B | 64 | 128 | 1 | 4096 | 4 | 4 | 1725 | 767 | 66 |
Nemotron-8B | 8 | 32 | 2 | 4096 | 2 | 1 | 11538 | 593 | 10 |
Nemotron-22B | 16 | 32 | 2 | 4096 | 1 | 4 | 3828 | 499 | 30 |
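The last column of the table above can be approximated from the measured throughput. This is a minimal sketch, assuming throughput scales linearly to the full cluster and that "1K GPUs" means 1024 GPUs, an assumption that reproduces the listed values after rounding; the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def est_days_to_train(tokens_per_sec_per_gpu, total_tokens=10e12, num_gpus=1024):
    """Estimate training time in days, assuming linear scaling across GPUs."""
    cluster_tokens_per_sec = tokens_per_sec_per_gpu * num_gpus
    return total_tokens / cluster_tokens_per_sec / SECONDS_PER_DAY


# Tokens / sec / GPU copied from the table above.
throughput = {
    "GPT3-175B": 741,
    "GPT3-5B": 23574,
    "GPT3-20B": 5528,
    "LLAMA2-7B": 16290,
    "LLAMA2-13B": 8317,
    "LLAMA2-70B": 1725,
    "Nemotron-8B": 11538,
    "Nemotron-22B": 3828,
}
estimates = {model: est_days_to_train(tps) for model, tps in throughput.items()}
for model, days in estimates.items():
    print(f"{model}: {days:.1f} days")
```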
Large Language Models Fine-tuning
The following table provides performance benchmarking of LLAMA2 models with SFT (supervised fine-tuning) and LoRA (Low-Rank Adaptation) on DGX H100, with FP8.
For fine-tuning, we use the SQuAD-v1.1 dataset, and the inputs are packed to 4096 tokens.
To calculate Model TFLOPs, please see Appendix A in paper.
Model | Mode | #-GPUs | GBS | MBS | Sequence Length | TP | PP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to complete in mins (10M tokens) |
---|---|---|---|---|---|---|---|---|---|---|
LLAMA2-7B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 14761 | 591 | 1.4 |
LLAMA2-13B | SFT | 8 | 32 | 1 | 4096 | 1 | 4 | 8989 | 698 | 2.3 |
LLAMA2-70B | SFT | 16 | 32 | 1 | 4096 | 4 | 4 | 1470 | 609 | 7.1 |
LLAMA2-7B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 20750 | 556 | 1.0 |
LLAMA2-13B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 12584 | 654 | 1.7 |
LLAMA2-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 2279 | 631 | 9.1 |
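The completion times in the table above follow from the throughput and the GPU count of each row. This is a minimal sketch, assuming the estimate is simply 10M tokens divided by the aggregate throughput of the GPUs listed in that row; the function name is hypothetical.

```python
def est_minutes_to_complete(tokens_per_sec_per_gpu, num_gpus, total_tokens=10e6):
    """Estimate the minutes needed to process total_tokens on num_gpus GPUs."""
    return total_tokens / (tokens_per_sec_per_gpu * num_gpus) / 60


# (model, mode, #-GPUs, tokens / sec / GPU) copied from the table above.
rows = [
    ("LLAMA2-7B", "SFT", 8, 14761),
    ("LLAMA2-13B", "SFT", 8, 8989),
    ("LLAMA2-70B", "SFT", 16, 1470),
    ("LLAMA2-7B", "LoRA", 8, 20750),
    ("LLAMA2-13B", "LoRA", 8, 12584),
    ("LLAMA2-70B", "LoRA", 8, 2279),
]
minutes = [est_minutes_to_complete(tps, gpus) for _, _, gpus, tps in rows]
for (model, mode, _, _), value in zip(rows, minutes):
    print(f"{model} {mode}: {value:.1f} min")
```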