NVIDIA NeMo framework is a scalable and cloud-native generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech). It enables users to efficiently create, customize, and deploy new generative AI models by leveraging existing code and pretrained model checkpoints.


NeMo framework provides end-to-end model development functionalities for LLMs and Multimodal models that can be used on premises or on a cloud provider of your choice:

Data Curation [1]

NeMo Curator is a Python library that consists of a collection of GPU-optimized scalable data-mining modules for curating natural language data for training large language models (LLMs). The modules within NeMo Curator enable NLP researchers to mine high-quality text at scale from massive raw web corpora.

Model Training and Customization

NeMo framework includes all the tooling required to efficiently train and customize LLMs and Multimodal models, from setting up the compute cluster and downloading data to selecting model hyperparameters. The default configurations for each model and task are tested regularly, and every configuration can be modified to train on new datasets or to test new model hyperparameters. For customization, in addition to fully supervised fine-tuning (SFT), the framework supports a variety of parameter-efficient fine-tuning (PEFT) techniques such as P-tuning, LoRA, Adapters, and IA3, which can usually reach nearly the same accuracy as SFT at a fraction of the computational cost.
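To see why PEFT is so much cheaper, consider LoRA: it freezes the original weights and trains only a pair of low-rank matrices per adapted layer. A minimal sketch of the arithmetic (the layer size and rank below are illustrative, not NeMo defaults):

```python
# LoRA replaces the update to a frozen d_out x d_in weight matrix W with
# the product B @ A, where B is d_out x r, A is r x d_in, and r << min(d_out, d_in).
# Only A and B are trained, so the trainable-parameter count drops sharply.

def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters added by a LoRA adapter on one linear layer."""
    return d_out * rank + rank * d_in

# Example: one 4096 x 4096 projection (sizes chosen for illustration).
full = 4096 * 4096                                  # 16,777,216 params if fully fine-tuned
lora = lora_trainable_params(4096, 4096, rank=16)   # 131,072 adapter params
print(f"LoRA trains {lora / full:.2%} of the layer's parameters")  # → 0.78%
```

In this sketch a rank-16 adapter trains well under 1% of the layer's parameters, which is what makes PEFT tractable on modest hardware.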

Model Alignment [1]

NeMo-Aligner, part of the framework, is a scalable toolkit for efficient model alignment. The toolkit supports Supervised Fine-Tuning (SFT) and state-of-the-art (SOTA) model alignment algorithms such as SteerLM, Direct Preference Optimization (DPO), and Reinforcement Learning from Human Feedback (RLHF). These algorithms enable users to align language models to be safer and more helpful.


NeMo Launcher

The NeMo Launcher streamlines your experience with the NeMo framework, offering a user-friendly interface for building end-to-end workflows and for managing and organizing experiments across environments. Built on the Hydra framework, it lets users compose and adjust hierarchical configurations through configuration files and command-line arguments. With the launcher scripts, you can start large-scale training, customization, or alignment jobs locally (with single-node support), on NVIDIA Base Command Manager (Slurm), or on cloud providers such as AWS, Azure, and Oracle Cloud Infrastructure (OCI) without writing any code.
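Hydra's core idea is composing a hierarchical configuration from defaults and then overriding individual keys from the command line with dotted paths. The sketch below imitates that override mechanism in plain Python; it is not NeMo Launcher's actual code, and names like `training.trainer.max_steps` are illustrative:

```python
# A toy version of Hydra-style "key.subkey=value" command-line overrides
# applied on top of a nested default configuration.

import copy

def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with dotted-path overrides applied."""
    cfg = copy.deepcopy(config)
    for item in overrides:
        path, value = item.split("=", 1)
        node = cfg
        *parents, leaf = path.split(".")
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value
    return cfg

# Illustrative defaults; real launcher configs are defined in YAML files.
defaults = {"cluster": "bcm", "training": {"trainer": {"max_steps": "1000"}}}
cfg = apply_overrides(defaults, ["training.trainer.max_steps=5000"])
print(cfg["training"]["trainer"]["max_steps"])  # → 5000
```

The real launcher layers this on top of Hydra's config groups, so an entire model or cluster preset can be swapped with a single command-line argument.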

Model Inference

NeMo framework integrates seamlessly with enterprise-grade model deployment tooling using NVIDIA NIM, powered by NVIDIA TensorRT-LLM and NVIDIA Triton Inference Server for optimized and scalable inference.

Model Support

End-to-end model development workflows for popular community models (Gemma, Starcoder 2, Llama 1/2, Baichuan 2, Falcon, Mixtral, Mistral, and more) and NVIDIA Nemotron models are supported in the NeMo framework, including pretraining and fine-tuning for all of these language and multimodal models. Additional details are provided in the support matrix below.



Developing conversational AI models is a complex process that entails defining, building, and training models in specific domains. This often involves multiple iterations to achieve high accuracy, fine-tuning on various tasks and domain-specific data, ensuring training performance, and preparing models for inference deployment.


NeMo framework supports training and customization of speech AI models for tasks such as automatic speech recognition (ASR) and text-to-speech synthesis (TTS), with a seamless path to enterprise-grade production deployment with NVIDIA Riva. To further aid developers and researchers, it comes with state-of-the-art (SOTA) pre-trained checkpoints, tooling for reproducible speech data processing, and interactive exploration and analysis of speech datasets. The parts of NeMo framework for Speech include:

Training and Customization

NeMo Framework contains everything needed to train and customize speech models (ASR, Speech Classification, Speaker Recognition, Speaker Diarization, and TTS) in a reproducible manner.

SOTA Pre-trained Models

Recipes and pre-trained checkpoints of several ASR and TTS models, as well as instructions on how to load them, are provided.

Speech Tools

A set of tools useful for developing ASR and TTS models, including:

  • NeMo Forced Aligner (NFA) for generating token-, word- and segment-level timestamps of speech in audio using NeMo’s CTC-based Automatic Speech Recognition models.

  • Speech Data Processor (SDP), a toolkit for simplifying speech data processing. It allows you to represent data processing operations in a config file, minimizing boilerplate code, and allowing reproducibility and shareability.

  • Speech Data Explorer (SDE), a Dash-based web application for interactive exploration and analysis of speech datasets.

  • Dataset creation tool which provides functionality to align long audio files with the corresponding transcripts and split them into shorter fragments that are suitable for Automatic Speech Recognition (ASR) model training.

  • Comparison Tool for ASR Models to compare predictions of different ASR models at the word and utterance level.

  • ASR Evaluator for evaluating the performance of ASR models and other features such as Voice Activity Detection.
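The Speech Data Processor's config-driven approach can be pictured as a declarative list of named processing steps applied in order to each record of a dataset manifest. The sketch below is a generic illustration of that pattern, not SDP's actual API; the processor names and manifest fields are made up:

```python
# Config-driven processing: a declarative list of steps, each mapped to a
# small function, applied in order to every record in a manifest.

def lowercase_text(record):
    record["text"] = record["text"].lower()
    return record

def drop_short_audio(record):
    # Returning None drops the record from the dataset.
    return record if record["duration"] >= 1.0 else None

# Hypothetical registry; SDP's real processors and config schema differ.
PROCESSORS = {"lowercase_text": lowercase_text, "drop_short_audio": drop_short_audio}

def run_pipeline(records, pipeline_config):
    for step in pipeline_config:  # e.g. loaded from a YAML config file
        fn = PROCESSORS[step["name"]]
        records = [r for r in (fn(rec) for rec in records) if r is not None]
    return records

manifest = [
    {"text": "HELLO World", "duration": 2.5},
    {"text": "Too short", "duration": 0.4},
]
config = [{"name": "lowercase_text"}, {"name": "drop_short_audio"}]
print(run_pipeline(manifest, config))  # one record remains, text lowercased
```

Because the pipeline is described in a config file rather than ad-hoc scripts, the same processing run can be shared and reproduced exactly, which is the property SDP is built around.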

Path to Deployment

NeMo models trained or customized using the framework can be optimized and deployed with NVIDIA Riva, which includes containers and Helm charts designed to automate the steps for push-button deployment.





[1] Currently, NeMo Curator and NeMo Aligner support for Multimodal models is a work in progress and will be available soon.

© Copyright 2023-2024, NVIDIA. Last updated on May 17, 2024.