Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Getting Started
Welcome to the NeMo Framework Getting Started Guide. This guide introduces the fundamental concepts behind the components of the NeMo Framework and points you to resources that will kickstart your journey in building generative AI applications with it.
Get Access to NeMo Framework
The NeMo Framework can be accessed in a variety of ways, depending on your needs.
Docker Containers
NeMo Framework now supports Large Language Models (LLMs), Multimodal Models (MMs), Automatic Speech Recognition (ASR), and Text-to-Speech (TTS) in a single consolidated Docker container.
This is the quickest way to get started with NeMo and is recommended for LLM and multimodal domains.
Conda / Pip
This is the recommended method for ASR and TTS domains.
When using an NVIDIA PyTorch container as the base, this is the recommended method for all domains.
NLP and MMs Dependencies
If working with the NLP or multimodal collections, NVIDIA Apex, NVIDIA Transformer Engine, and NVIDIA Megatron Core are required.
If working with NeMo 2.0 and the corresponding LLM collection, only Megatron Core is required. However, Apex and Transformer Engine are recommended for optimal performance.
NeMo containers are released alongside NeMo version updates. You can find additional information about released containers on the NeMo releases page.
To get access to the container, log in to the NVIDIA GPU Cloud (NGC) platform or create a free NGC account here: NVIDIA NGC. Once you have logged in, you can get the container here: NVIDIA NGC NeMo Framework.
To use a pre-built container, run the following code:
docker pull nvcr.io/nvidia/nemo:24.07
Please use the latest tag in the form yy.mm.(patch).
To build a NeMo container from a branch using its Dockerfile, run the following code:
DOCKER_BUILDKIT=1 docker build -f Dockerfile -t nemo:latest .
If you choose to work with the main branch, we recommend using NVIDIA’s PyTorch container version 23.10-py3 and then installing from GitHub.
docker run --gpus all -it --rm -v <nemo_github_folder>:/NeMo --shm-size=8g \
  -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 \
  --device=/dev/snd nvcr.io/nvidia/pytorch:23.10-py3
Install NeMo in a fresh Conda environment:
conda create --name nemo python==3.10.12
conda activate nemo
Install PyTorch using their configurator:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
The command to install PyTorch may depend on your system. Use the configurator linked above to find the right command for your system.
Then, install NeMo via Pip or from Source. We do not provide NeMo on the conda-forge or any other Conda channel.
Important
We strongly recommend that you start with a base NVIDIA PyTorch container:
nvcr.io/nvidia/pytorch:24.07-py3
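For example, you might launch this base container before running the install commands that follow. This is a minimal sketch; add the mounts, ports, and ulimits from the earlier docker run example as your setup requires.
# Minimal sketch: start the recommended base PyTorch container, then install nemo_toolkit inside it.
docker run --gpus all -it --rm --shm-size=8g nvcr.io/nvidia/pytorch:24.07-py3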
To install the nemo_toolkit, use the following installation method:
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install nemo_toolkit['all']
Pip from a Specific Domain
To install a specific domain of NeMo, you must first install the nemo_toolkit using the instructions listed above. Then, run the relevant domain-specific command:
pip install nemo_toolkit['asr']
pip install nemo_toolkit['nlp']
pip install nemo_toolkit['tts']
pip install nemo_toolkit['vision']
pip install nemo_toolkit['multimodal']
Note that to install the LLM domain, you should use the following command:
pip install nemo_toolkit['all']
Pip from a Source Branch
If you want to work with a specific version of NeMo from a particular GitHub branch (e.g., main), use the following installation method:
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
python -m pip install git+https://github.com/NVIDIA/NeMo.git@{BRANCH}#egg=nemo_toolkit[all]
Depending on the shell used, you may need to use the “nemo_toolkit[all]” specifier instead in the above command.
If you work with the NLP and MM domains, three additional dependencies are required: NVIDIA Apex, NVIDIA Transformer Engine, and NVIDIA Megatron Core. For instructions on how to install Apex, Transformer Engine, and Megatron Core, please refer to the NeMo GitHub README.
The same steps used to install NeMo 1.0 can be used to install NeMo 2.0. When installing NeMo 2.0, be sure to pip install the full NeMo toolkit (e.g. pip install nemo_toolkit['all']). Similar to NeMo 1.0, NVIDIA Megatron Core is a required dependency. However, unlike NeMo 1.0, NVIDIA Apex and NVIDIA Transformer Engine are both optional.
If you would like to run NeMo without Transformer Engine and Apex, you can do so inside a conda environment using the following steps:
conda create --name nemo python==3.10.12
conda activate nemo
Install PyTorch using their configurator:
conda install pytorch==2.2.0 torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
The command to install PyTorch may depend on your system. Use the configurator linked above to find the right command for your system.
You can then install NeMo and Megatron Core from source as follows:
pip install git+https://github.com/NVIDIA/NeMo.git@BRANCH#egg=nemo_toolkit[all]
pip install git+https://github.com/NVIDIA/Megatron-LM.git@BRANCH
Note
Though optional, Transformer Engine and Apex are recommended for optimal performance.
Note
RMSNorm is currently not supported when Apex is not installed. Non-Apex support for RMSNorm will be added in a future release.
Large Language Models
Training
NeMo can train LLMs of any scale, from small models with a few billion parameters to very large models with hundreds of billions to trillions of parameters. Training models that differ in size by two orders of magnitude presents distinct challenges. NeMo addresses these challenges with optimized and scalable data loaders, model parallelism techniques, memory optimizations, and ready-made training recipes.
Two options are available for launching training in NeMo: with or without the NeMo Launcher.
Without NeMo Launcher: This method works well for simple setups and smaller models. It gets you training quickly and exposes you directly to the NeMo training scripts (a minimal launch command for this path is sketched below). For a concrete step-by-step training tutorial, follow the GPT model training guide.
With NeMo Launcher: This method is the recommended way to launch model training jobs on big clusters, such as BCM (Slurm), BCP, AWS, Azure, and OCI. NeMo Launcher provides an interface for the efficient management and organization of experiments across many nodes. You specify all job details by editing a YAML configuration file, which a unified launching script then executes. As a result of this abstraction, you are not exposed directly to the underlying NeMo tools and scripts, but you gain an efficient way to handle the complexities of training jobs at scale. Follow the playbook for Foundation Model Pre-training using NeMo Framework to get started training using the Launcher.
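As a minimal sketch of launching training without the Launcher, you can invoke a NeMo training script directly inside the container. The script path follows the NeMo examples layout; the dataset prefix and the override values shown here are placeholders, and the GPT model training guide documents the full configuration.
# Minimal sketch: launch GPT pre-training without the Launcher (run inside the NeMo container).
# /data/my_corpus_text_document is a placeholder for your preprocessed dataset prefix.
python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
    --config-path=conf --config-name=megatron_gpt_config \
    trainer.devices=8 trainer.num_nodes=1 trainer.max_steps=1000 \
    model.data.data_prefix=[1.0,/data/my_corpus_text_document]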
Alignment
NeMo-Aligner is a scalable toolkit for efficient model alignment. The toolkit supports state-of-the-art model alignment algorithms such as SteerLM, DPO, and Reinforcement Learning from Human Feedback (RLHF). These algorithms enable users to align language models to be safer, more harmless, and more helpful.
The NeMo-Aligner toolkit is built on the NeMo Framework, which enables scalable training on up to thousands of GPUs using tensor, data, and pipeline parallelism for all components of alignment. All checkpoints are cross-compatible with the NeMo ecosystem, allowing for further customization and inference deployment.
The recommended way to use NeMo-Aligner is through the NeMo Framework Docker container, as mentioned in Get Access to NeMo Framework.
For developers of NeMo Framework, it is also possible to install NeMo-Aligner from source or build a Docker container by following the instructions at the NeMo-Aligner GitHub repo.
Follow the tutorials provided in the Model Alignment documentation for a step-by-step workflow of end-to-end RLHF on a small GPT-2B model. We demonstrate all three phases of RLHF: Supervised Fine-Tuning (SFT), Reward Model training, and RLHF training with Proximal Policy Optimization (PPO).
In addition, we demonstrate NeMo support for two novel alignment methods:
SteerLM: a technique based on conditioned-SFT, with steerable output.
DPO: a lightweight alignment algorithm compared to RLHF with a simpler loss function.
Customization
Model customization allows users to tailor a general pre-trained LLM to a specific use case or domain. The result is a tuned model that retains the broad knowledge acquired from the vast amount of data seen during pretraining, while producing outputs that are more accurate for the specific downstream task.
Model customization is achieved by fine-tuning the LLM in a supervised fashion. There are two popular categories of methods:
Full-Parameter Fine-Tuning, which is referred to as Supervised Fine-Tuning (SFT) in NeMo
Parameter-Efficient Fine-Tuning (PEFT)
In SFT, all of the model parameters are updated to produce outputs that are adapted to the task. On the other hand, PEFT tunes a much smaller number of parameters, which are inserted into the base model at strategic locations. While SFT often produces the best possible result, PEFT methods can usually reach nearly the same accuracy at a fraction of the computational cost. As language models grow ever larger, PEFT is gaining popularity due to its light demands on training hardware.
The NeMo Framework supports SFT as well as several PEFT techniques. You can find more information about PEFT in the NeMo developer documentation.
We also provide Llama 2 SFT and PEFT playbooks to help you get started. After going through the Llama 2 playbooks, you can also try customizing other models, such as Mistral, Mixtral, Nemotron, Falcon, or T5 (please see the playbooks). By simply changing the config arguments, you can also easily switch between the various PEFT methods supported in NeMo, as sketched below.
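As a minimal sketch, the PEFT scheme can be selected with a single override on the unified fine-tuning script. The script path and override names follow the NeMo 1.x tuning examples; the checkpoint and dataset paths are placeholders, and additional data and trainer settings are needed in practice (see the playbooks for complete commands).
# Minimal sketch: choose a PEFT scheme via config override; "none" runs full-parameter SFT instead.
python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    model.restore_from_path=/models/llama2-7b.nemo \
    model.data.train_ds.file_names=[/data/train.jsonl] \
    model.data.validation_ds.file_names=[/data/val.jsonl] \
    model.peft.peft_scheme=lora    # other options include ptuning, adapter, and ia3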
Inference
The NeMo Framework provides three distinct paths for LLM inference, catering to different deployment scenarios and performance needs. These paths include in-framework inference, exporting to TensorRT-LLM and deploying with Triton, and enterprise deployment with NVIDIA NIM. Each path offers unique advantages and is suited to different use cases.
In-Framework Inference
In-framework inference involves running LLM models directly within the NeMo Framework. This approach is straightforward and does not require exporting models to another format. It is ideal for development and testing phases, where ease of use and flexibility are paramount. The NeMo Framework supports multi-node and multi-GPU inference while maximizing throughput. This method allows for quick iterations and testing directly within the NeMo environment; a minimal generation command is sketched below.
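As a minimal sketch, assuming a trained GPT-style .nemo checkpoint, in-framework generation can be run with the evaluation script from the NeMo examples. The checkpoint path and prompt are placeholders.
# Minimal sketch: in-framework text generation from a .nemo checkpoint (single GPU).
python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_eval.py \
    gpt_model_file=/models/my_gpt_model.nemo \
    trainer.devices=1 trainer.num_nodes=1 \
    tensor_model_parallel_size=1 pipeline_model_parallel_size=1 \
    inference.greedy=True inference.tokens_to_generate=64 \
    "prompts=['Deep learning is']"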
Exporting to TensorRT-LLM
For scenarios requiring optimized performance, NeMo models can leverage TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs. This process involves converting NeMo models into a format compatible with TensorRT-LLM using the nemo.export module.
For an example, refer to Export an LLM model to TensorRT-LLM with NeMo APIs.
Once exported, these models can be deployed using the NVIDIA Triton Inference Server, a platform for serving trained AI models so that they are accessible over a network. Triton provides a scalable and efficient way to serve inference requests, supporting HTTP/REST and gRPC protocols. It can handle single-GPU, multi-GPU, and multi-node configurations, ensuring high throughput and low latency for LLM inference. You can see an example of how to deploy a TensorRT-LLM model with Triton Inference Server in this tutorial. The NeMo Framework also includes the nemo.deploy module to simplify this process to a few lines of code; a rough sketch of the scripted export-and-deploy path follows.
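The following is a rough sketch under the assumption that the export-and-deploy script shipped with the container follows the deployment tutorials; the script location, flag names, and paths are assumptions that may differ between releases, so verify them against the tutorial for your container version.
# Illustrative only: export a .nemo checkpoint to TensorRT-LLM and serve it with Triton.
# Script location and flag names are assumptions based on the NeMo deployment tutorials.
python /opt/NeMo/scripts/deploy/nlp/deploy_triton.py \
    --nemo_checkpoint /models/my_llama.nemo \
    --model_type llama \
    --triton_model_name my_llama \
    --triton_model_repository /tmp/trt_llm_model_dir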
Note
The NeMo Framework Inference container is deprecated and all related documentation will be removed or updated in the next release.
Using NVIDIA NIM
Enterprises looking for a comprehensive solution that includes deployment on-premises or in the cloud can use NVIDIA Inference Microservices (NIM). This approach leverages the NVIDIA AI Enterprise suite, which includes support for NVIDIA NeMo, Triton Inference Server, TensorRT-LLM, and other NVIDIA AI software. This option is ideal for organizations that need a reliable and scalable solution for deploying generative AI models in production environments.
To learn more about NVIDIA NIM, visit the NVIDIA website.
Multimodal Models
NeMo Framework introduces support for MMs by providing optimized software to train and deploy state-of-the-art models across several categories: Multimodal Language Models, Vision-Language Foundations, Text-to-Image models, and Beyond 2D Generation using Neural Radiance Fields (NeRF).
Each category is designed to cater to specific needs and advancements in the field, leveraging cutting-edge models to handle a wide range of data types, including text, images, and 3D models.
To get started, check out our latest tutorials:
NeVA (LLaVA) Pretraining, Fine-tuning, and In-Framework Inference
LITA Checkpoint conversion, Fine-tuning, and In-Framework Inference
You can follow the tutorials in the latest NeMo Framework Training container, where the notebooks are located in /opt/NeMo/tutorials/multimodal.
Additionally, we support CLIP, Imagen, ControlNet, InstructPix2Pix, and NeRF.
Check our Multimodal Models documentation for more details.
Speech AI
NeMo Framework supports training and customization of speech AI models for tasks such as Automatic Speech Recognition (ASR) and Text-to-Speech Synthesis (TTS).
For a quick setup of NeMo speech AI training and inference, we recommend using the NeMo Framework speech container, as indicated in Get Access to NeMo Framework.
You can try out quick inference of NeMo’s ASR and TTS models with the Speech AI Quickstart guide.
To learn more about Speech AI model training, please refer to the tutorial notebooks.