Getting Started

Welcome to the NeMo Framework Getting Started Guide. This guide is designed to help you understand some fundamental concepts related to the various components of the framework and point you to some resources to kickstart your journey in using it to build generative AI applications.

Get Access to NeMo Framework

NeMo Framework now supports Large Language Models (LLMs), Multimodal Models (MMs), Automatic Speech Recognition (ASR), and Text-to-Speech (TTS) in a single consolidated Docker container.

To get access to the container, log in to the NVIDIA GPU Cloud (NGC) platform or create a free NGC account here: NVIDIA NGC. Once you have logged in, you can get the container here: NVIDIA NGC NeMo Framework.

Please use the latest tag in the form yy.mm.(patch), for example: docker pull nvcr.io/nvidia/nemo:24.05

Note

Speech AI containers dated prior to 24.01 do not have a .speech extension. For example, the container for the NeMo v1.22 release is nvcr.io/nvidia/nemo:23.10. Refer to container tags for a comprehensive list of previous container versions.

Large Language Models

Training

NeMo can train LLMs of any scale, ranging from small models with a few billion parameters to very large models with hundreds of billions to trillions of parameters. Training models that differ in size by orders of magnitude presents distinct challenges. NeMo addresses these challenges with optimized and scalable data loaders, model parallelism techniques, memory optimizations, and ready-to-use training recipes.

Two options are available for launching training in NeMo: with or without the NeMo Launcher.

  • Without NeMo Launcher: This method works well for simple setups involving small models. It can speed up your onboarding and exposes you directly to the NeMo training scripts. For a concrete step-by-step training tutorial, follow the GPT model training guide.

  • With NeMo Launcher: This method is the recommended way to launch model training jobs on large clusters, such as BCM (Slurm), BCP, AWS, Azure, and OCI. NeMo Launcher provides an interface for efficiently managing and organizing experiments across many nodes. You specify all job details by editing a YAML configuration file, which is then executed by a unified launching script. Because of this abstraction, you are not exposed directly to the underlying NeMo tools and scripts, but you gain an efficient way to handle the complexities of training jobs at scale. Follow the playbook for Foundation Model Pre-training using NeMo Framework to get started training with the Launcher.

Alignment

NeMo-Aligner is a scalable toolkit for efficient model alignment. The toolkit supports state-of-the-art model alignment algorithms such as SteerLM, DPO, and Reinforcement Learning from Human Feedback (RLHF). These algorithms enable users to align language models to be more helpful, harmless, and safe.

The NeMo-Aligner toolkit is built on the NeMo Framework, which enables scalable training on up to thousands of GPUs using tensor, data, and pipeline parallelism for all components of alignment. All of the checkpoints are cross-compatible with the NeMo ecosystem, allowing for further customization and inference deployment.

The recommended way to use NeMo-Aligner is through the NeMo Framework Docker container, as mentioned in Get Access to NeMo Framework.

For developers of NeMo Framework, it is also possible to install NeMo-Aligner from source or build a Docker container by following the instructions at the NeMo-Aligner GitHub repo.

Follow the tutorials provided in the Model Alignment documentation for a step-by-step workflow of end-to-end RLHF on a small GPT-2B model. We demonstrate all three phases of RLHF:

  • Supervised Fine-Tuning (SFT) on instruction-following data

  • Reward model training on human preference data

  • Reinforcement learning on the policy model with Proximal Policy Optimization (PPO)

In addition, we demonstrate NeMo support for two novel alignment methods:

  • SteerLM: a technique based on conditioned SFT that produces steerable outputs.

  • DPO: a lighter-weight alternative to RLHF that uses a simpler loss function (see the sketch below).
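
To make that contrast concrete, the core of the standard DPO objective can be written in a few lines of PyTorch. This is an illustrative sketch of the published DPO loss, not NeMo-Aligner's implementation; the tensor shapes and the beta value are assumptions.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Prefer the chosen response over the rejected one, measured relative
        # to a frozen reference model. beta controls deviation from the reference.
        policy_logratio = policy_chosen_logps - policy_rejected_logps
        ref_logratio = ref_chosen_logps - ref_rejected_logps
        return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()

    # Each tensor holds per-example sequence log-probabilities for a batch of preference pairs.
    loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))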

Customization

Model customization allows users to tailor a general pre-trained LLM to a specific use case or domain. It produces a tuned model that can leverage the vast amount of data available during pretraining, while at the same time producing outputs that are more accurate for the specific downstream task.

Model customization is achieved by fine-tuning the LLM in a supervised fashion. There are two popular categories of methods:

  • Full-Parameter Fine-Tuning, which is referred to as Supervised Fine-Tuning (SFT) in NeMo

  • Parameter-Efficient Fine-Tuning (PEFT)

In SFT, all of the model parameters are updated to adapt the model's outputs to the task. PEFT, on the other hand, tunes a much smaller number of parameters that are inserted into the base model at strategic locations. While SFT often produces the best possible result, PEFT methods can usually reach nearly the same accuracy at a fraction of the computational cost. As language models continue to grow in size, PEFT is gaining popularity because of its modest training-hardware requirements.
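
To illustrate the PEFT idea, the snippet below wraps a frozen linear layer with a LoRA-style low-rank adapter so that only the small adapter matrices are trained. This is a generic sketch, not NeMo's implementation; the layer size, rank, and scaling are arbitrary assumptions.

    import torch.nn as nn

    class LoRALinear(nn.Module):
        """A frozen base linear layer plus a trainable low-rank (LoRA-style) update."""
        def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False          # the base weights stay frozen
            self.lora_a = nn.Linear(base.in_features, rank, bias=False)
            self.lora_b = nn.Linear(rank, base.out_features, bias=False)
            nn.init.zeros_(self.lora_b.weight)   # start as a no-op update
            self.scaling = alpha / rank

        def forward(self, x):
            return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

    layer = LoRALinear(nn.Linear(4096, 4096))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"trainable params: {trainable:,} of {total:,}")  # a small fraction of the layer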

The NeMo Framework supports SFT as well as several PEFT techniques. You can find more information about PEFT in the NeMo developer documentation.

We also have examples of Llama 2 SFT and PEFT playbooks to help you get started. After going through the Llama 2 playbooks, you can also try customizing other models, such as Mistral, Mixtral, Nemotron, Falcon, or T5 (please see the playbooks). By simply changing config arguments, you can easily switch between the various PEFT methods supported in NeMo.

Inference

The NeMo Framework provides three distinct paths for LLM inference, catering to different deployment scenarios and performance needs. These paths include in-framework inference, exporting to TensorRT-LLM and deploying with Triton, and enterprise deployment with NVIDIA NIM. Each path offers unique advantages and is suited to different use cases.

In-Framework Inference

In-framework inference involves running LLM models directly within the NeMo Framework. This approach is straightforward and does not require exporting models to another format. It is ideal for development and testing phases, where ease of use and flexibility are paramount. The NeMo Framework supports multi-node and multi-GPU inference, while maximizing throughput. This method allows for quick iterations and testing directly within the NeMo environment.
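
As a rough sketch of what in-framework generation looks like (the checkpoint path, generation parameters, and single-GPU trainer setup here are assumptions, and exact arguments can vary between releases):

    from pytorch_lightning import Trainer
    from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
    from nemo.collections.nlp.parts.nlp_overrides import NLPDDPStrategy

    # Restore a .nemo checkpoint onto a single GPU and run greedy generation.
    trainer = Trainer(strategy=NLPDDPStrategy(), devices=1, accelerator="gpu")
    model = MegatronGPTModel.restore_from(restore_path="/path/to/model.nemo", trainer=trainer)
    model.freeze()

    length_params = {"max_length": 64, "min_length": 0}
    sampling_params = {
        "use_greedy": True, "temperature": 1.0, "top_k": 0, "top_p": 1.0,
        "repetition_penalty": 1.0, "add_BOS": True, "all_probs": False,
        "compute_logprob": False, "end_strings": ["<|endoftext|>"],
    }
    output = model.generate(inputs=["Deep learning is"],
                            length_params=length_params,
                            sampling_params=sampling_params)
    print(output["sentences"][0])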

Exporting to TensorRT-LLM

For scenarios requiring optimized performance, NeMo models can leverage TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs. This process involves converting NeMo models into a format compatible with TensorRT-LLM using the nemo.export module.

For an example, refer to Export an LLM model to TensorRT-LLM with NeMo APIs.
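
In outline, the export flow in that guide looks roughly like the sketch below; the checkpoint path, engine directory, and model type are placeholders, and argument names can differ between releases.

    from nemo.export import TensorRTLLM

    # Build TensorRT-LLM engines from a .nemo checkpoint (paths are placeholders).
    exporter = TensorRTLLM(model_dir="/opt/checkpoints/trt_llm_engine")
    exporter.export(
        nemo_checkpoint_path="/opt/checkpoints/llama-2-7b.nemo",
        model_type="llama",
        n_gpus=1,
    )

    # Quick sanity check: generate directly from the exported engine.
    print(exporter.forward(["What is the capital of France?"]))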

Once exported, these models can be deployed using NVIDIA Triton Inference Server, a platform for deploying trained AI models so that they are accessible over a network. Triton provides a scalable and efficient way to serve inference requests, supporting HTTP/REST and gRPC protocols. It can handle single-GPU, multi-GPU, and multi-node configurations, ensuring high throughput and low latency for LLM inference. You can see an example of how to deploy a TensorRT-LLM model with Triton Inference Server in this tutorial. The NeMo Framework also includes the nemo.deploy module, which simplifies this process to a few lines of code.
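
For instance, serving the engine exported above through Triton via nemo.deploy can be sketched as follows; the Triton model name is a placeholder and argument names may vary between releases.

    from nemo.deploy import DeployPyTriton

    # Reuse the TensorRT-LLM exporter from the previous step as the servable model.
    nm = DeployPyTriton(model=exporter, triton_model_name="my_llm")
    nm.deploy()   # register the model with a local Triton instance
    nm.serve()    # block and serve HTTP/gRPC inference requests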

Please note that the NeMo Framework Inference container is deprecated; all related documentation will be removed or updated in the next release.

Using NVIDIA NIM

Enterprises looking for a comprehensive solution that includes deployment on-premises or in the cloud can use NVIDIA Inference Microservices (NIM). This approach leverages the NVIDIA AI Enterprise suite, which includes support for NVIDIA NeMo, Triton Inference Server, TensorRT-LLM, and other NVIDIA AI software. This option is ideal for organizations that need a reliable and scalable solution for deploying generative AI models in production environments.

To learn more about NVIDIA NIM, visit the NVIDIA website.

Multimodal Models

NeMo Framework introduces support for MMs by providing optimized software to train and deploy state-of-the-art models across several categories: Multimodal Language Models, Vision-Language Foundation models, Text-to-Image models, and Beyond 2D Generation using Neural Radiance Fields (NeRF).

Each category is designed to cater to specific needs and advancements in the field, leveraging cutting-edge models to handle a wide range of data types, including text, images, and 3D models.

To get started, check out our latest tutorials:

  1. Multimodal Data Preparation

  2. NeVA (LLaVA) Pretraining, Fine-tuning, and In-Framework Inference

  3. Stable Diffusion Training and In-Framework Inference

  4. Dreambooth Training and In-Framework Inference

You can follow the tutorials in the latest NeMo Framework Training container, where the notebooks are located in /opt/NeMo/tutorials/multimodal.

Additionally, we support CLIP, Imagen, ControlNet, InstructPix2Pix, and NeRF.

Check our Multimodal Models documentation for more details.

Speech AI

NeMo Framework supports training and customizing speech AI models for tasks such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) synthesis.

For a quick setup of NeMo speech AI training and inference, we recommend using the NeMo Framework speech container, as indicated in Get Access to NeMo Framework.

You can try out quick inference of NeMo’s ASR and TTS models with the Speech AI Quickstart guide.
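
As a taste of what the quickstart covers, the sketch below downloads a pretrained ASR model and a FastPitch/HiFi-GAN TTS pair from NGC and runs them on sample inputs; the specific model names and the audio file path are assumptions, so check the quickstart for current recommendations.

    import soundfile as sf
    import nemo.collections.asr as nemo_asr
    from nemo.collections.tts.models import FastPitchModel, HifiGanModel

    # ASR: transcribe a local 16 kHz mono WAV file (the path is a placeholder).
    asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="stt_en_conformer_ctc_large")
    print(asr_model.transcribe(["sample.wav"]))

    # TTS: FastPitch predicts a mel spectrogram, HiFi-GAN turns it into audio.
    spec_generator = FastPitchModel.from_pretrained("tts_en_fastpitch")
    vocoder = HifiGanModel.from_pretrained("tts_en_hifigan")
    tokens = spec_generator.parse("Hello from NeMo text to speech.")
    spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
    audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
    sf.write("speech.wav", audio.squeeze().detach().cpu().numpy(), samplerate=22050)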

To learn more about Speech AI model training, please refer to the tutorial notebooks.