Getting Started

Welcome to the NeMo Framework Getting Started guide! This guide is designed to help you understand some fundamental concepts related to the various components of the framework, and point you to some resources to kickstart your journey in using it to build generative AI applications.

NeMo Framework is available as a Docker container. To access it, log in to NGC or create a free NGC account at NVIDIA NGC. Once you have logged in, you can get the container from NVIDIA NGC NeMo Framework.

Large Language and Multimodal models

Please use the latest tag in the form yy.mm.(patch).framework, for example: docker pull nvcr.io/nvidia/nemo:24.03.framework

Speech AI

Please use the latest tag in the form yy.mm.speech, for example: docker pull nvcr.io/nvidia/nemo:24.01.speech

Note

Speech AI containers dated prior to 24.01 do not have a .speech extension. For example, the container for the NeMo v1.22 release is nvcr.io/nvidia/nemo:23.10. Refer to container tags for a comprehensive list of previous container versions.
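The tag shapes described above (yy.mm, an optional patch number, and an optional .framework or .speech suffix) can be sketched as a small parser. This is only an illustration of the naming scheme: the helper names parse_tag and latest_tag are hypothetical, and any tag in the usage example that is not quoted in this guide is made up for the example.

```python
import re
from typing import NamedTuple, Optional

# yy.mm, then an optional numeric patch, then an optional flavor suffix.
TAG_PATTERN = re.compile(
    r"^(?P<year>\d{2})\.(?P<month>\d{2})"
    r"(?:\.(?P<patch>\d+))?"
    r"(?:\.(?P<flavor>framework|speech))?$"
)


class ContainerTag(NamedTuple):
    year: int
    month: int
    patch: int             # 0 when the tag carries no patch number
    flavor: Optional[str]  # None for older tags without a suffix


def parse_tag(tag: str) -> ContainerTag:
    """Split a container tag into its numeric parts and flavor."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag: {tag!r}")
    return ContainerTag(
        year=int(match["year"]),
        month=int(match["month"]),
        patch=int(match["patch"] or 0),
        flavor=match["flavor"],
    )


def latest_tag(tags, flavor):
    """Return the newest tag of the given flavor (raises if none match)."""
    candidates = [t for t in tags if parse_tag(t).flavor == flavor]
    return max(candidates, key=lambda t: parse_tag(t)[:3])
```

For example, latest_tag(["24.01.framework", "24.03.framework"], "framework") picks "24.03.framework", while parse_tag("23.10") reports no flavor, matching the older speech containers mentioned in the note.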

Training

NeMo can train LLMs of any scale, ranging from small models with a few billion parameters to very large models with hundreds of billions to trillions of parameters. Training models that differ in size by two orders of magnitude presents distinct challenges. NeMo addresses these challenges with optimized and scalable data loaders, model parallelism techniques, memory optimizations, and training recipes.
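To see why model size changes the training setup, it helps to estimate the memory taken by model state alone. The sketch below assumes mixed-precision training with the Adam optimizer, which is commonly counted as 16 bytes per parameter (2 for fp16 weights, 2 for fp16 gradients, and 12 for fp32 master weights and the two Adam moments). The function names and the 80 GiB GPU size are illustrative assumptions, and activations are ignored, so the results are rough lower bounds rather than NeMo measurements.

```python
import math


def training_state_gib(num_params: float, bytes_per_param: int = 16) -> float:
    """Approximate memory for weights, gradients, and optimizer state, in GiB."""
    return num_params * bytes_per_param / 2**30


def min_gpus_for_state(num_params: float, gpu_mem_gib: float = 80.0) -> int:
    """Fewest GPUs whose combined memory could hold the model state alone."""
    return math.ceil(training_state_gib(num_params) / gpu_mem_gib)


if __name__ == "__main__":
    for label, params in [("2B", 2e9), ("175B", 175e9)]:
        print(label, round(training_state_gib(params), 1), "GiB,",
              min_gpus_for_state(params), "GPU(s) minimum")
```

Under these assumptions, a model with a few billion parameters fits on a single large GPU, while a model with hundreds of billions of parameters has to be split across many devices, which is where model parallelism and memory optimizations come in.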

There are two ways to launch training in NeMo: with or without the NeMo Launcher.

  • Without NeMo Launcher: This works well with a simple setup involving small models on a single or a few nodes. It can get you to training more quickly and expose you directly to NeMo training scripts. For a concrete step-by-step training tutorial, follow the GPT model training guide.

  • With NeMo Launcher: This is the recommended way to launch model training jobs on big clusters, such as BCM (Slurm), BCP, AWS, Azure, and OCI. NeMo Launcher provides an interface for the efficient management and organization of experiments across many nodes. Users specify all job details by editing a YAML configuration file, which is executed by a unified launching script. Because of this abstraction, you are not exposed directly to the underlying NeMo tools and scripts, but it is a necessary and efficient way to handle the complexities of training jobs at scale. Follow the playbook for Foundation Model Pre-training using NeMo Framework to get started with training using the Launcher.

Alignment

NeMo-Aligner is a scalable toolkit for efficient model alignment. The toolkit supports state-of-the-art model alignment algorithms such as SteerLM, DPO, and Reinforcement Learning from Human Feedback (RLHF). These algorithms enable users to align language models to be safer, more harmless, and more helpful.

The NeMo-Aligner toolkit is built using the NeMo Framework, which allows training to scale to thousands of GPUs using tensor, data, and pipeline parallelism for all components of alignment. All of the checkpoints are cross-compatible with the NeMo ecosystem, allowing for further customization and inference deployment.
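As a rough illustration of how tensor, pipeline, and data parallelism combine: in Megatron-style 3D parallelism, the total GPU count is the product of the three parallel sizes, so the data-parallel size follows from the other two. The function below is a hypothetical helper written for this guide, not a NeMo API.

```python
def data_parallel_size(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Number of model replicas when world_size GPUs are split three ways.

    Each replica occupies tensor_parallel * pipeline_parallel GPUs, so the
    GPU count must be a multiple of that product.
    """
    gpus_per_replica = tensor_parallel * pipeline_parallel
    if world_size % gpus_per_replica != 0:
        raise ValueError(
            f"{world_size} GPUs cannot be divided into replicas of {gpus_per_replica} GPUs"
        )
    return world_size // gpus_per_replica


if __name__ == "__main__":
    # 1024 GPUs, each replica sharded 8 ways by tensor and 4 ways by pipeline
    print(data_parallel_size(1024, tensor_parallel=8, pipeline_parallel=4))
```

With 1024 GPUs, a tensor-parallel size of 8, and a pipeline-parallel size of 4, each replica spans 32 GPUs and there are 32 data-parallel replicas.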

The recommended way to use NeMo-Aligner is through the NeMo Framework docker container, as mentioned in Get access to NeMo Framework.

For developers of NeMo Framework, it is also possible to install NeMo-Aligner from source or build a docker container by following the instructions at the NeMo-Aligner GitHub repo.

Follow the tutorials in the Model Alignment documentation for a step-by-step workflow of end-to-end RLHF on a small GPT-2B model. We demonstrate all three phases of RLHF: supervised finetuning (SFT), reward model training, and policy training with PPO.

In addition, we demonstrate NeMo support for two novel alignment methods:

  • SteerLM: a technique based on conditioned-SFT, with steerable output.

  • DPO: a lightweight alignment algorithm compared to RLHF with a simpler loss function.
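To make the "simpler loss function" of DPO concrete, here is a minimal sketch of the per-example loss as defined in the DPO paper. It takes the summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model. The function and argument names are assumptions made for this example, and this is not the NeMo-Aligner implementation.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # Implicit rewards: how much more likely each response is under the
    # policy than under the reference model.
    chosen_reward = policy_chosen_logp - ref_chosen_logp
    rejected_reward = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy prefers the chosen response more than the reference does: lower loss.
    print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

Unlike RLHF with PPO, this objective needs no separate reward model or sampling loop during training, only log-probabilities from the policy and the reference model.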

Customization

Model customization allows users to tailor a general pre-trained LLM to a specific use case or domain. It produces a tuned model that can leverage the vast amount of data available during pretraining, while at the same time producing outputs that are more accurate for the specific downstream task.

Model customization is achieved by finetuning the LLM in a supervised fashion. There are two popular categories of methods:

  • Full-parameter finetuning, which is referred to as supervised finetuning (SFT) in NeMo

  • Parameter-efficient finetuning (PEFT)

In SFT, all of the model parameters are updated to produce outputs that are adapted to the task. On the other hand, PEFT tunes a much smaller number of parameters which are inserted into the base model at strategic locations. While SFT often produces the best possible result, PEFT methods can usually reach nearly the same degree of accuracy with a fraction of the computational cost. With language models becoming larger each day, PEFT is gaining popularity due to its lightweight requirement on training hardware.
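The difference in trainable parameters can be shown with a quick calculation. The sketch below uses LoRA as one example of a PEFT method (this section does not name a specific technique): LoRA adds two low-rank matrices of shapes d_in x r and r x d_out next to a frozen d_in x d_out weight. The layer size and rank are illustrative assumptions, and the function names are hypothetical.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated (SFT)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in the two low-rank adapter matrices."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in, d_out, rank = 4096, 4096, 8
    full = full_finetune_params(d_in, d_out)
    lora = lora_params(d_in, d_out, rank)
    print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")
```

For a 4096 x 4096 projection with rank 8, the adapter trains 65,536 parameters instead of 16,777,216, which is 1/256 of the layer. Savings of this kind are what make PEFT feasible on smaller training hardware.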

The NeMo framework supports SFT as well as several PEFT techniques. You can find more information about PEFT in the NeMo developer documentation.

We have examples of SFT and PEFT for Llama 2 available as playbooks for you to get started. After going through the Llama 2 playbooks, you can also try out customization of other models, such as Mistral, Mixtral, Nemotron, Falcon or T5 (please see playbooks). By simply changing the config arguments, you can also easily switch between the various supported PEFT methods in NeMo.

Inference

The NVIDIA NeMo Framework provides three distinct paths for LLM inference, catering to different deployment scenarios and performance needs. These paths include in-framework inference, exporting to TensorRT-LLM and deploying with Triton, and enterprise deployment with NVIDIA NIM. Each path offers unique advantages and is suited to different use cases.

In-Framework Inference

In-framework inference involves running LLMs directly within the NeMo Framework. This approach is straightforward and does not require exporting models to another format. It is ideal for development and testing phases, where ease of use and flexibility are paramount. The NeMo Framework supports multi-node and multi-GPU inference to maximize throughput. This method allows for quick iterations and testing directly within the NeMo environment.

Exporting to TensorRT-LLM

For scenarios requiring optimized performance, NeMo models can leverage TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs. This process involves converting NeMo models into a format compatible with TensorRT-LLM using the nemo.export module.

For an example, refer to Export an LLM model to TensorRT-LLM with NeMo APIs.

Once exported, these models can be deployed using NVIDIA Triton Inference Server, a platform for deploying trained AI models so that they are accessible over a network. Triton provides a scalable and efficient way to serve inference requests, supporting HTTP/REST and gRPC protocols. It can handle single-GPU, multi-GPU, and multi-node configurations, ensuring high throughput and low latency for LLM inference. You can see an example of how to deploy a TensorRT-LLM Model with Triton Inference Server in this tutorial. NeMo Framework also includes the nemo.deploy module to simplify this process to a few lines of code.
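As a sketch of what an HTTP/REST inference request to Triton looks like, the snippet below builds the URL and JSON body for Triton's v2 inference endpoint (/v2/models/<model_name>/infer) without sending anything. The host, model name, and the input tensor names "prompts" and "max_output_len" are assumptions for illustration. The actual tensor names, shapes, and datatypes depend on how the model was deployed, so check the deployed model's metadata before relying on them.

```python
import json


def build_infer_request(host: str, model_name: str, prompt: str,
                        max_tokens: int, port: int = 8000):
    """Return (url, json_body) for a Triton v2 HTTP inference request."""
    url = f"http://{host}:{port}/v2/models/{model_name}/infer"
    body = {
        "inputs": [
            # Hypothetical tensor names; a real deployment defines its own.
            {"name": "prompts", "shape": [1, 1],
             "datatype": "BYTES", "data": [prompt]},
            {"name": "max_output_len", "shape": [1, 1],
             "datatype": "INT64", "data": [max_tokens]},
        ]
    }
    return url, json.dumps(body)


if __name__ == "__main__":
    url, payload = build_infer_request("localhost", "my_llm", "Hello", 32)
    print(url)
    print(payload)
```

The resulting body can be sent as a POST request with a Content-Type of application/json, for example with urllib.request from the Python standard library or with Triton's own client libraries.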

Please note that the NeMo Framework Inference container is deprecated, and all related documentation will be removed or updated in the next release.

Using NVIDIA NIM

Enterprises looking for a comprehensive solution that includes deployment on-premises or in the cloud can use NVIDIA NIM. This approach leverages the NVIDIA AI Enterprise suite, which includes support for NVIDIA NeMo, Triton Inference Server, TensorRT-LLM, and other NVIDIA AI software. This option is ideal for organizations that need a reliable and scalable solution for deploying generative AI models in production environments.

To learn more about NVIDIA NIM, visit the NVIDIA website.

NeMo Framework introduces support for multimodal models by providing optimized software to train and deploy SOTA models across several categories: Multimodal Language Models, Vision-Language Foundations, Text-to-Image Models, and Beyond 2D Generation using NeRF.

Each category is designed to cater to specific needs and advancements in the field, leveraging cutting-edge models to handle a wide range of data types, including text, images, and 3D models.

To get started, check out our latest tutorials:

  1. Multimodal Data Preparation

  2. NeVA (LLaVA) Pretraining, Fine-tuning, and In-Framework Inference

  3. Stable Diffusion Training and In-Framework Inference

  4. Dreambooth Training and In-Framework Inference

You can follow the tutorials step by step in the latest NeMo Framework Training container, where the notebooks are located in /opt/NeMo/tutorials/multimodal.

Additionally, we support CLIP, Imagen, ControlNet, InstructPix2Pix, and NeRF.

Check our multimodal introduction page for more details.

NeMo Framework supports training and customizing speech AI models for tasks such as automatic speech recognition (ASR) and text-to-speech synthesis (TTS).

For a quick setup of NeMo speech AI training and inference, we recommend using the NeMo Framework speech container, as indicated in Get access to NeMo Framework.

You can try out quick inference of NeMo’s ASR and TTS models with the Speech AI Quickstart guide.

To learn more about Speech AI model training, please refer to the tutorial notebooks.

© Copyright 2023-2024, NVIDIA. Last updated on May 17, 2024.