Overview

NVIDIA NeMo framework is a scalable and cloud-native generative AI framework built for researchers and developers working on Large Language Models (LLMs), Multimodal models, and Speech AI (Automatic Speech Recognition and Text-to-Speech). It enables users to efficiently create, customize, and deploy new generative AI models by leveraging existing code and pretrained model checkpoints.

[Figure: NeMo framework stack for LLM and Multimodal development (nemo-llm-mm-stack.png)]

NeMo framework provides end-to-end model development functionalities for LLMs and Multimodal models, which can be used on-premises or on a cloud provider of your choice:

Data Curation [1]

NeMo Curator is a Python library that consists of a collection of GPU-optimized scalable data-mining modules for curating natural language data for training large language models (LLMs). The modules within NeMo Curator enable NLP researchers to mine high-quality text at scale from massive raw web corpora.
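
As a rough illustration of the kind of filtering such modules perform, the sketch below keeps only documents that pass simple length and character-quality heuristics. It is a toy conceptual example, not the NeMo Curator API; the thresholds and heuristics are arbitrary assumptions.

    # Toy quality filter over raw web text -- conceptual only, not the NeMo Curator API.
    def word_count(doc: str) -> int:
        return len(doc.split())

    def alpha_ratio(doc: str) -> float:
        # Fraction of alphabetic characters; low values often indicate markup or noise.
        return sum(c.isalpha() for c in doc) / max(len(doc), 1)

    def curate(corpus):
        """Yield documents that pass simple length and character-quality heuristics."""
        for doc in corpus:
            if 50 <= word_count(doc) <= 100_000 and alpha_ratio(doc) > 0.6:
                yield doc

    raw_docs = ["too short", "language models " * 60]
    print(len(list(curate(raw_docs))))  # -> 1 document survives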

Model Training and Customization

NeMo framework includes all the tooling required to efficiently train and customize LLMs and Multimodal models. This includes setting up the compute cluster, downloading data, and selecting model hyperparameters. The default configurations for each model and task are tested regularly, and every configuration can be modified to train on new datasets or test new model hyperparameters. For customization, in addition to full supervised fine-tuning (SFT), the framework also supports a variety of parameter-efficient fine-tuning (PEFT) techniques such as p-tuning, LoRA, adapters, and IA3, which can usually reach nearly the same accuracy as SFT at a fraction of the computational cost.
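
To see why PEFT can be so much cheaper than full SFT, the sketch below implements the LoRA idea in plain PyTorch: the pretrained weight is frozen and only a small low-rank update is trained. This is a conceptual sketch, not the NeMo implementation; the layer size and rank are arbitrary examples.

    import torch.nn as nn

    class LoRALinear(nn.Module):
        """A frozen Linear layer plus a trainable low-rank update: W x + (alpha/r) * B(A(x))."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False          # freeze the pretrained weights
            self.lora_a = nn.Linear(base.in_features, r, bias=False)
            self.lora_b = nn.Linear(r, base.out_features, bias=False)
            nn.init.zeros_(self.lora_b.weight)   # the update starts as an exact no-op
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

    layer = LoRALinear(nn.Linear(1024, 1024))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable}")  # ~16K adapter weights vs ~1M in the frozen base layer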

Model Alignment [1]

NeMo-Aligner, part of the framework, is a scalable toolkit for efficient model alignment. It supports Supervised Finetuning (SFT) and other state-of-the-art (SOTA) model alignment algorithms such as SteerLM, DPO, and Reinforcement Learning from Human Feedback (RLHF). These algorithms enable users to align language models to be safer and more helpful.
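
As a small illustration of one of these algorithms, the sketch below implements the standard DPO objective in plain PyTorch, given per-sequence log-probabilities from the policy and a frozen reference model. It is a conceptual sketch, not NeMo-Aligner code; the beta value and log-probabilities are placeholder examples.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps: torch.Tensor,
                 policy_rejected_logps: torch.Tensor,
                 ref_chosen_logps: torch.Tensor,
                 ref_rejected_logps: torch.Tensor,
                 beta: float = 0.1) -> torch.Tensor:
        """DPO objective: prefer the chosen response over the rejected one,
        measured relative to a frozen reference model."""
        chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()

    # Example with a batch of two preference pairs (log-probs are placeholders):
    loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-14.0, -13.0]),
                    torch.tensor([-13.0, -15.5]), torch.tensor([-13.5, -13.2]))
    print(loss)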

Launcher

The NeMo Launcher streamlines your experience with the NeMo framework, offering a user-friendly interface to build end-to-end workflows for efficient management and organization of experiments across various environments. Built on the Hydra framework, it lets users compose and adjust hierarchical configurations using configuration files and command-line arguments. With the launcher scripts, you can start large-scale training, customization, or alignment jobs locally (with single-node support), on NVIDIA Base Command Manager (Slurm), or on cloud providers such as AWS, Azure, and Oracle Cloud Infrastructure (OCI), without having to write any code.
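
To illustrate the hierarchical-configuration idea, the sketch below composes a base config with command-line style overrides using OmegaConf, the library that Hydra builds on. The keys mirror the training.* override style used later on this page but are illustrative, not the exact NeMo Launcher schema.

    from omegaconf import OmegaConf

    # Base (file-style) configuration.
    base = OmegaConf.create({
        "cluster_type": "bcm",
        "training": {
            "trainer": {"num_nodes": 4, "devices": 8},
            "model": {"data": {"data_impl": "mmap"}},
        },
    })

    # Command-line style dot-list overrides, as you would pass to a launcher script.
    overrides = OmegaConf.from_dotlist([
        "cluster_type=interactive",
        "training.trainer.num_nodes=1",
        "training.model.data.data_impl=mock",
    ])

    cfg = OmegaConf.merge(base, overrides)   # overrides win over the base values
    print(OmegaConf.to_yaml(cfg))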

Model Inference

NeMo framework integrates seamlessly with enterprise-grade model deployment tooling using NVIDIA NIM, powered by NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server for optimized and scalable inference.
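
As a rough sketch of what querying such a deployment can look like, the snippet below sends a request to a Triton HTTP endpoint with the tritonclient package. The model name ("ensemble") and the "text_input"/"text_output" tensor names depend on how the engine was exported and are assumptions here, not guaranteed by NeMo or NIM.

    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")
    assert client.is_server_ready()

    # Tensor and model names below are assumptions; match them to your deployed engine.
    prompt = np.array([["Write a haiku about GPUs."]], dtype=object)
    text_input = httpclient.InferInput("text_input", list(prompt.shape), "BYTES")
    text_input.set_data_from_numpy(prompt)

    result = client.infer(
        model_name="ensemble",
        inputs=[text_input],
        outputs=[httpclient.InferRequestedOutput("text_output")],
    )
    print(result.as_numpy("text_output"))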

Model Support

End-to-end model development workflows for popular community models (Gemma, Starcoder 2, Llama 1/2, Baichuan 2, Falcon, Mixtral, Mistral, and more) and NVIDIA Nemotron models are supported in the NeMo framework.

Developing conversational AI models is a complex process that entails defining, building, and training models in specific domains. This often involves multiple iterations to achieve high accuracy, fine-tuning on various tasks and domain-specific data, ensuring training performance, and preparing models for inference deployment.

[Figure: NeMo framework Speech AI workflow (nemo-speech-ai.png)]

NeMo framework supports training and customizing of speech AI models for tasks such as automatic speech recognition (ASR) and text-to-speech synthesis (TTS), with a seamless path to enterprise-grade production deployment with NVIDIA Riva. To further aid developers and researchers, it comes with state-of-the-art (SOTA) pre-trained checkpoints, tooling for reproducible speech data processing, and interactive exploration and analysis of speech datasets. Parts of NeMo framework for Speech include:

Training and Customization

NeMo Framework contains everything needed to train and customize speech models (ASR, Speech Classification, Speaker Recognition, Speaker Diarization, and TTS) in a reproducible manner.

SOTA Pre-trained Models

Recipes and pre-trained checkpoints of several ASR and TTS models, as well as instructions on how to load them, are provided.
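
For example, loading one of the pre-trained ASR checkpoints and transcribing an audio file typically looks like the sketch below. The checkpoint name and audio path are examples; consult the NeMo ASR documentation for the current list of models and the exact API of your installed version.

    import nemo.collections.asr as nemo_asr

    # Download a pre-trained checkpoint from NGC by name and run transcription.
    asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_large")
    transcripts = asr_model.transcribe(["sample_audio.wav"])   # path is an example; 16 kHz mono WAV
    print(transcripts[0])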

Speech Tools

A set of tools useful for developing ASR and TTS models, including:

  • NeMo Forced Aligner (NFA) for generating token-, word- and segment-level timestamps of speech in audio using NeMo’s CTC-based Automatic Speech Recognition models.

  • Speech Data Processor (SDP), a toolkit for simplifying speech data processing. It allows you to represent data processing operations in a config file, minimizing boilerplate code, and allowing reproducibility and shareability.

  • Speech Data Explorer (SDE), a Dash-based web application for interactive exploration and analysis of speech datasets.

  • Dataset creation tool which provides functionality to align long audio files with the corresponding transcripts and split them into shorter fragments that are suitable for Automatic Speech Recognition (ASR) model training.

  • Comparison Tool for ASR Models to compare predictions of different ASR models at the word-accuracy and utterance level (a minimal word-error-rate sketch follows this list).

  • ASR Evaluator for evaluating the performance of ASR models and other features such as Voice Activity Detection.
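
For context on the word-level comparison mentioned above, the sketch below computes a plain word error rate between a reference transcript and two hypothetical model outputs. It is a conceptual illustration of the metric, not the Comparison Tool itself.

    def wer(ref: str, hyp: str) -> float:
        """Word error rate via edit distance over whitespace-tokenized words."""
        r, h = ref.split(), hyp.split()
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i
        for j in range(len(h) + 1):
            d[0][j] = j
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                cost = 0 if r[i - 1] == h[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
        return d[len(r)][len(h)] / max(len(r), 1)

    reference = "the quick brown fox jumps over the lazy dog"
    print(wer(reference, "the quick brown fox jumps over a lazy dog"))   # model A: ~0.11
    print(wer(reference, "the quick brown fox jumped over lazy dog"))    # model B: ~0.22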

Path to Deployment

NeMo models trained or customized using the framework can be optimized and deployed with NVIDIA Riva, which includes containers and Helm charts designed to automate the steps for push-button deployment.

Example Scripts for Pretraining and Fine-tuning

These scripts run recommended configurations for GPT3 and LLAMA2 NeMo pretraining and fine-tuning at various model sizes on A100 and H100 GPUs. For example, the following folders provide sample scripts for GPT3 pretraining:

  • A100 : Scripts to run GPT pretraining on NVIDIA A100, in bf16 data type

  • H100 : Scripts to run GPT pretraining for NVIDIA H100, in fp8 data type

Setup

  1. To run these scripts, you must have access to the NeMo Framework Container. Sign in at NGC (user = ea-bignlp/ga-participants) to access the catalog.

  2. Update the following bash variables in the example run scripts:

    • NEMO_MEGATRON_LAUNCHER_DIR : the directory where this repository is located

    • DATA_DIR : the directory of the dataset used for pretraining; by default this is NEMO_MEGATRON_LAUNCHER_DIR/data

  3. Enter your cluster environment settings in config.yaml.

    For bcm-type clusters, update the job name, partition, and account in bcm.yaml.

  4. For testing performance with synthetic data on an interactive node, you need to add the following options to your bash script:

    cluster_type=interactive \
    ++training.cluster_type=BCP \
    training.model.data.data_impl="mock" \
    training.model.data.data_prefix=[]

For further details, see General Configuration.

Collect Results

The "step_time_per_sec" value printed to the console provides a quick way to gauge the performance of a workload.
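
If you have saved the console output to a file, a few lines of Python can summarize the metric. The log path and the exact line format matched by the pattern are assumptions here; adjust them to your training output.

    import re
    import statistics

    values = []
    with open("results/my_experiment/console.log") as f:   # hypothetical log path
        for line in f:
            match = re.search(r"step_time_per_sec[=:\s]+([0-9.]+)", line)
            if match:
                values.append(float(match.group(1)))

    if values:
        print(f"mean step_time_per_sec: {statistics.mean(values):.3f}")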

For more detail and visualizations, you can use TensorBoard or Weights & Biases with the results stored at NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>, which has the following structure:

  • NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<experiment_name>.yaml : The config of the pretrained model

  • NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<jobname>_<experiment_name>.sh : The autogenerated .sh file that was run

  • NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/results/ : Directory containing per-rank logs and TensorBoard data.

For further details, see Interpreting the Results.

Benchmark Results

Large Language Models Pretraining

  • The table below shows pre-training performance of various models on DGX H100, with FP8.

  • Please refer to MLCommons Training results for performance of GPT3-175B pre-training on large scale H100 systems.

  • To calculate Model TFLOPs, please see Appendix A in the paper.

Model        | #-GPUs | GBS  | MBS | Sequence Length | TP | PP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs)
-------------|--------|------|-----|-----------------|----|----|--------------------|-------------------------|--------------------------------------------------
GPT3-175B    | 512    | 2048 | 1   | 2048            | 4  | 8  | 741                | 797                     | 153
GPT3-5B      | 64     | 2048 | 4   | 2048            | 1  | 1  | 23574              | 746                     | 5
GPT3-20B     | 64     | 256  | 2   | 2048            | 2  | 1  | 5528               | 708                     | 20
LLAMA2-7B    | 8      | 128  | 1   | 4096            | 1  | 1  | 16290              | 751                     | 7
LLAMA2-13B   | 16     | 128  | 1   | 4096            | 4  | 1  | 8317               | 725                     | 14
LLAMA2-70B   | 64     | 128  | 1   | 4096            | 4  | 4  | 1725               | 767                     | 66
Nemotron-8B  | 8      | 32   | 2   | 4096            | 2  | 1  | 11538              | 593                     | 10
Nemotron-22B | 16     | 32   | 2   | 4096            | 1  | 4  | 3828               | 499                     | 30

Large Language Models Fine-tuning

  • The following table provides performance benchmarks of LLAMA2 models with SFT (supervised fine-tuning) and LoRA (low-rank adapters) on DGX H100, with FP8.

  • For fine-tuning, we use the SQuAD-v1.1 dataset, and the inputs are packed to 4096 tokens.

  • To calculate Model TFLOPs, please see Appendix A in the paper.

Model      | Mode | #-GPUs | GBS | MBS | Sequence Length | TP | PP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to complete in mins (10M tokens)
-----------|------|--------|-----|-----|-----------------|----|----|--------------------|-------------------------|--------------------------------------------
LLAMA2-7B  | SFT  | 8      | 32  | 1   | 4096            | 1  | 1  | 14761              | 591                     | 1.4
LLAMA2-13B | SFT  | 8      | 32  | 1   | 4096            | 1  | 4  | 8989               | 698                     | 2.3
LLAMA2-70B | SFT  | 16     | 32  | 1   | 4096            | 4  | 4  | 1470               | 609                     | 7.1
LLAMA2-7B  | LoRA | 8      | 32  | 1   | 4096            | 1  | 1  | 20750              | 556                     | 1.0
LLAMA2-13B | LoRA | 8      | 32  | 1   | 4096            | 1  | 1  | 12584              | 654                     | 1.7
LLAMA2-70B | LoRA | 8      | 32  | 1   | 4096            | 2  | 4  | 2279               | 631                     | 9.1
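
The estimated-time columns in both tables follow directly from the measured per-GPU throughput: total tokens divided by the aggregate tokens per second. A quick sanity check, assuming "1K GPUs" means 1024 and using rows copied from the tables above:

    def days_to_train(total_tokens, tokens_per_sec_per_gpu, num_gpus):
        return total_tokens / (tokens_per_sec_per_gpu * num_gpus) / 86_400

    def minutes_to_finetune(total_tokens, tokens_per_sec_per_gpu, num_gpus):
        return total_tokens / (tokens_per_sec_per_gpu * num_gpus) / 60

    # GPT3-175B pre-training: 10T tokens, 1024 GPUs, 741 tokens/sec/GPU -> ~153 days
    print(round(days_to_train(10e12, 741, 1024)))

    # LLAMA2-7B SFT: 10M tokens, 8 GPUs, 14761 tokens/sec/GPU -> ~1.4 minutes
    print(round(minutes_to_finetune(10e6, 14761, 8), 1))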

Footnotes

[1] Currently, NeMo Curator and NeMo Aligner support for Multimodal models is a work in progress, and will be available very soon.
