Overview

NVIDIA NeMo Framework is a scalable and cloud-native generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (e.g. Automatic Speech Recognition and Text-to-Speech). It enables users to efficiently create, customize, and deploy new generative AI models by leveraging existing code and pre-trained model checkpoints.

Large Language Models and Multimodal Models

_images/nemo-llm-mm-stack.png

NeMo Framework offers comprehensive functionality for developing both Large Language Models (LLMs) and Multimodal Models (MMs), covering the entire model development process. You have the flexibility to use this framework either on-premises or with your preferred cloud provider.

Data Curation [1]

NeMo Curator is a Python library that includes a suite of data-mining modules. These modules are optimized for GPUs and designed to scale, making them ideal for curating natural language data to train LLMs. With NeMo Curator, researchers in Natural Language Processing (NLP) can efficiently extract high-quality text from extensive raw web data sources.
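As a rough illustration of how these modules compose, the sketch below filters a set of JSONL shards by document length. It follows the patterns in public NeMo Curator examples, but the exact module paths, class names, and arguments are assumptions that may differ between releases.

```python
# Illustrative sketch only; class names and arguments follow public NeMo Curator
# examples and are assumptions that may change between releases.
from nemo_curator import ScoreFilter, Sequential
from nemo_curator.datasets import DocumentDataset
from nemo_curator.filters import WordCountFilter

# Load raw JSONL shards of web text into a (Dask-backed) DocumentDataset.
dataset = DocumentDataset.read_json("raw_web_data/*.jsonl", add_filename=True)

# Compose curation steps; here we simply drop very short documents.
pipeline = Sequential([
    ScoreFilter(WordCountFilter(min_words=80), text_field="text"),
])

curated = pipeline(dataset)
curated.to_json("curated_web_data/", write_to_filename=True)
```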

Model Training and Customization

NeMo Framework provides a comprehensive set of tools for the efficient training and customization of LLMs and Multimodal models. This includes setting up the compute cluster, downloading data, and selecting model hyperparameters. Each model and task comes with a default configuration that is regularly tested, and these configurations can be adjusted to train on new datasets or to test new model hyperparameters. For customization, NeMo Framework supports not only full Supervised Fine-Tuning (SFT), but also a range of Parameter-Efficient Fine-Tuning (PEFT) techniques, including p-tuning, LoRA, Adapters, and IA3. These techniques typically achieve nearly the same accuracy as SFT at a fraction of the computational cost.
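To make the cost difference concrete, the PyTorch sketch below wraps a frozen linear layer with a trainable low-rank (LoRA-style) update, so only a small adapter is optimized while the pretrained weight stays fixed. This is a conceptual illustration of the technique, not NeMo's implementation.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Conceptual LoRA wrapper: y = W x + (alpha / r) * B(A(x)), with W frozen."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # trainable
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # trainable
        nn.init.zeros_(self.lora_b.weight)          # start as a no-op update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Only the low-rank A/B matrices are trained: for a 4096x4096 layer with r=8,
# that is ~65K trainable parameters instead of ~16.8M.
layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
```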

Model Alignment [1]

NeMo-Aligner, a component of NeMo Framework, is a scalable toolkit for efficient model alignment. It supports Supervised Fine-Tuning (SFT) and state-of-the-art (SOTA) model alignment algorithms such as SteerLM, DPO, and Reinforcement Learning from Human Feedback (RLHF). These algorithms enable users to align language models to be safer and more helpful.
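For intuition about what these algorithms optimize, the sketch below implements the core Direct Preference Optimization (DPO) loss on per-sequence log-probabilities. It is a textbook-style illustration of the objective, not NeMo-Aligner's API.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """DPO objective on summed per-sequence log-probs (conceptual sketch)."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Push the policy to prefer chosen over rejected responses more strongly
    # than the frozen reference model does, scaled by beta.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```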

Launcher

NeMo Launcher streamlines your experience with the NeMo Framework by providing an intuitive interface for constructing comprehensive workflows, allowing effective organization and management of experiments across different environments. Built on the Hydra framework, NeMo Launcher lets users easily create and modify hierarchical configurations using both configuration files and command-line arguments. It simplifies launching large-scale training, customization, or alignment tasks, which can be run locally (on a single node), on NVIDIA Base Command Manager (Slurm), or on cloud providers such as AWS, Azure, and Oracle Cloud Infrastructure (OCI), all through Launcher scripts and without writing any code.
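Because the Launcher is built on Hydra, every job is described by hierarchical YAML defaults that can be overridden from the command line. The snippet below shows the generic OmegaConf merge pattern behind that behavior; the configuration keys are invented for illustration and are not actual Launcher config names.

```python
from omegaconf import OmegaConf

# Base configuration, as it might appear in a YAML defaults file (keys are illustrative).
base = OmegaConf.create({
    "cluster": {"type": "slurm", "nodes": 1},
    "training": {"model": "gpt3_5b", "global_batch_size": 2048},
})

# Command-line style dot-list overrides, e.g. "cluster.nodes=8 training.global_batch_size=256".
overrides = OmegaConf.from_dotlist(["cluster.nodes=8", "training.global_batch_size=256"])

cfg = OmegaConf.merge(base, overrides)
print(OmegaConf.to_yaml(cfg))
```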

Model Inference

NeMo Framework seamlessly integrates with enterprise-level model deployment tools through NVIDIA NIM. This integration is powered by NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server, ensuring optimized and scalable inference.
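As a rough sketch of that path, the snippet below exports a trained .nemo checkpoint to a TensorRT-LLM engine and runs a local sanity check before serving it with Triton or NIM. The class and argument names follow public NeMo deployment examples and should be treated as assumptions; the checkpoint path is hypothetical.

```python
# Sketch based on public NeMo deployment examples; the exact exporter API may vary by release.
from nemo.export import TensorRTLLM

exporter = TensorRTLLM(model_dir="/tmp/trt_llm_engine")
exporter.export(
    nemo_checkpoint_path="llama2-7b.nemo",  # hypothetical checkpoint
    model_type="llama",
)

# Quick local check of the built engine before serving it with Triton / NIM.
print(exporter.forward(["What is the capital of France?"]))
```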

Model Support

NeMo Framework supports end-to-end model development workflows for a variety of models, including popular community models such as Gemma, Starcoder 2, Llama 1/2, Baichuan 2, Falcon, Mixtral, Mistral, and others, as well as NVIDIA Nemotron models. The support extends to both pretraining and fine-tuning for all language and multimodal models. More specific details can be found in the support matrices below.

_images/LLMSupportMatrix.png _images/MMSupportMatrix.png

Speech AI

Developing conversational AI models is a complex process that involves defining, constructing, and training models within particular domains. It often requires multiple iterations to achieve high accuracy, fine-tuning on various tasks and domain-specific data, ensuring training performance, and preparing models for inference deployment.

_images/nemo-speech-ai.png

NeMo Framework provides support for the training and customization of Speech AI models. This includes tasks like Automatic Speech Recognition (ASR) and Text-To-Speech (TTS) synthesis. It offers a smooth transition to enterprise-level production deployment with NVIDIA Riva. To assist developers and researchers, NeMo Framework includes state-of-the-art pre-trained checkpoints, tools for reproducible speech data processing, and features for interactive exploration and analysis of speech datasets. The components of the NeMo Framework for Speech AI are as follows:

Training and Customization

NeMo Framework contains everything needed to train and customize speech models (ASR, Speech Classification, Speaker Recognition, Speaker Diarization, and TTS) in a reproducible manner.

SOTA Pre-trained Models

NeMo Framework provides state-of-the-art recipes and pre-trained checkpoints of several ASR and TTS models, as well as instructions on how to load them.
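For example, a pre-trained ASR checkpoint can be pulled from NGC and used for transcription in a few lines; this is a minimal sketch, and the audio file name is a placeholder.

```python
import nemo.collections.asr as nemo_asr

# Download a pre-trained English Conformer-CTC checkpoint and transcribe a local file.
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_large")
transcripts = asr_model.transcribe(["sample_audio.wav"])  # placeholder audio path
print(transcripts[0])
```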

Speech Tools

NeMo Framework provides a set of tools useful for developing ASR and TTS models, including:

  • NeMo Forced Aligner (NFA) for generating token-, word- and segment-level timestamps of speech in audio using NeMo’s CTC-based Automatic Speech Recognition models.

  • Speech Data Processor (SDP), a toolkit that simplifies speech data processing. It lets you represent data processing operations in a config file, minimizing boilerplate code and making pipelines reproducible and shareable.

  • Speech Data Explorer (SDE), a Dash-based web application for interactive exploration and analysis of speech datasets.

  • Dataset Creation Tool, which aligns long audio files with their corresponding transcripts and splits them into shorter fragments suitable for Automatic Speech Recognition (ASR) model training.

  • Comparison Tool for ASR Models, for comparing the predictions of different ASR models at the word-accuracy and utterance level.

  • ASR Evaluator for evaluating the performance of ASR models and other features such as Voice Activity Detection.

  • Text Normalization Tool for converting text from the written form to the spoken form and vice versa (e.g. “31st” vs “thirty first”).
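As a small example of the last item, the sketch below normalizes written-form text to spoken form using the nemo_text_processing package; the import path and arguments follow public examples and are assumptions that may differ by version.

```python
# Sketch based on public nemo_text_processing examples; APIs may differ by version.
from nemo_text_processing.text_normalization.normalize import Normalizer

normalizer = Normalizer(input_case="cased", lang="en")
print(normalizer.normalize("The meeting is on the 31st."))
# Expected spoken form, roughly: "The meeting is on the thirty first."
```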

Path to Deployment

NeMo models that have been trained or customized using the NeMo Framework can be optimized and deployed with NVIDIA Riva. Riva provides containers and Helm charts specifically designed to automate the steps for push-button deployment.

Programming Languages and Frameworks

  • Python

  • PyTorch

  • Bash

Performance Benchmarks

Large Language Models

Pretraining

  • The results in the table below show pre-training performance (measured with NeMo Framework 24.05) of various models on DGX H100 with FP8 precision.

  • Please refer to the MLCommons Training results for the performance of GPT3-175B pre-training on large-scale H100 systems.

  • Abbreviations: GBS = global batch size, MBS = micro batch size, TP = tensor parallel size, PP = pipeline parallel size, CP = context parallel size.
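The estimated training time in the last column follows directly from the measured per-GPU throughput. Assuming "1K GPUs" means 1,024 GPUs, a quick check for the GPT3-5B row:

```python
# Estimated days to train on 10T tokens with 1,024 GPUs, from measured throughput.
tokens_total = 10e12
num_gpus = 1024                 # assuming "1K GPUs" = 1,024
tokens_per_sec_per_gpu = 23117  # GPT3-5B row in the table below

days = tokens_total / (tokens_per_sec_per_gpu * num_gpus * 86_400)
print(round(days, 1))  # ~4.9, reported as 5 in the table
```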

| Model         | #-GPUs | GBS  | MBS | Sequence Length | TP | PP | CP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs) |
|---------------|--------|------|-----|-----------------|----|----|----|--------------------|-------------------------|--------------------------------------------------|
| GPT3-5B       | 64     | 2048 | 4   | 2048            | 1  | 1  | 1  | 23117              | 755                     | 5                                                |
| GPT3-20B      | 64     | 256  | 2   | 2048            | 2  | 1  | 1  | 5611               | 719                     | 20                                               |
| LLAMA2-7B     | 8      | 128  | 1   | 4096            | 1  | 1  | 1  | 16154              | 744                     | 7                                                |
| LLAMA2-13B    | 16     | 128  | 1   | 4096            | 1  | 4  | 1  | 8344               | 727                     | 14                                               |
| LLAMA2-70B    | 64     | 128  | 1   | 4096            | 4  | 4  | 1  | 1659               | 737                     | 68                                               |
| Nemotron-8B   | 64     | 256  | 4   | 4096            | 2  | 1  | 1  | 11753              | 604                     | 10                                               |
| Nemotron-22B  | 64     | 256  | 2   | 4096            | 2  | 4  | 1  | 4113               | 536                     | 27                                               |
| Nemotron-340B | 128    | 32   | 1   | 4096            | 8  | 8  | 1  | 295                | 621                     | 383                                              |
| LLAMA3-8B     | 8      | 128  | 1   | 8192            | 1  | 1  | 2  | 11879              | 688                     | 10                                               |
| LLAMA3-70B    | 64     | 128  | 1   | 8192            | 4  | 4  | 2  | 1444               | 695                     | 78                                               |

Finetuning

  • The following table provides performance benchmarks of LLAMA2 models with SFT (supervised fine-tuning) and LoRA (Low-Rank Adaptation) on DGX H100 with FP8 precision.

  • For fine-tuning, we use the SQuAD-v1.1 dataset, with inputs packed to 4096 tokens.

| Model      | Mode | #-GPUs | GBS | MBS | Sequence Length | TP | PP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to complete in mins (10M tokens) |
|------------|------|--------|-----|-----|-----------------|----|----|--------------------|-------------------------|--------------------------------------------|
| LLAMA2-7B  | SFT  | 8      | 32  | 1   | 4096            | 1  | 1  | 16891              | 673                     | 1.2                                        |
| LLAMA2-13B | SFT  | 8      | 32  | 1   | 4096            | 1  | 4  | 9384               | 726                     | 2.2                                        |
| LLAMA2-70B | SFT  | 16     | 32  | 1   | 4096            | 4  | 4  | 1739               | 717                     | 6.0                                        |
| LLAMA2-7B  | LoRA | 8      | 32  | 1   | 4096            | 1  | 1  | 23711              | 633                     | 0.9                                        |
| LLAMA2-13B | LoRA | 8      | 32  | 1   | 4096            | 1  | 1  | 14499              | 751                     | 1.4                                        |
| LLAMA2-70B | LoRA | 8      | 32  | 1   | 4096            | 2  | 4  | 2470               | 681                     | 8.4                                        |

Resources

GitHub repos

Where to get help

Licenses

Footnotes

[1]

Support for multimodal models in NeMo Curator and NeMo-Aligner is currently a work in progress and will be available soon.