Deploy NeMo Framework Models

NVIDIA NeMo Framework offers various deployment paths for NeMo models, tailored to different domains such as Large Language Models (LLMs) and Multimodal Models (MMs). There are three primary deployment paths for NeMo models: enterprise-level deployment with NVIDIA Inference Microservice (NIM), optimized inference via exporting to another library and deploying with the NVIDIA Triton Inference Server, and in-framework inference. To begin serving your model on these three deployment paths, all you need is a NeMo checkpoint. You can find the support matrix for the different domains below.

Domain    NVIDIA NIM    Optimized    In-Framework
LLMs      Yes           Yes          N/A
MMs       N/A           N/A          N/A

While a number of deployment paths are currently available, others are still in development. As each unique deployment path becomes available, it will be added to this section.

The following sections describe the paths that are available to you today for working with LLMs. Support for MMs will be added in coming releases.

NVIDIA NIM for LLMs

Enterprises seeking a comprehensive solution that covers both on-premises and cloud deployment can use NVIDIA NIM. This approach leverages the NVIDIA AI Enterprise suite, which includes support for NVIDIA NeMo, Triton Inference Server, TensorRT-LLM, and other AI software.

This option is ideal for organizations requiring a reliable and scalable solution to deploy generative AI models in production environments. It also stands out as the fastest inference option, offering user-friendly scripts and APIs. Leveraging the TensorRT-LLM Triton backend, it achieves rapid inference using advanced batching algorithms, including in-flight batching. Note that this deployment path supports only selected LLM models.

To learn more about NVIDIA NIM, visit the NVIDIA website.

In-Framework Inference for LLMs using the NeMo Framework

In-framework inference involves running LLMs directly within the NeMo Framework. This approach is straightforward and eliminates the need to export models to another format. It is ideal for development and testing phases, where ease of use and flexibility are critical. The NeMo Framework supports multi-node and multi-GPU inference, maximizing throughput. This method allows for rapid iterations and direct testing within the NeMo environment. Although it is the slowest option, it supports all NeMo models.

This deployment path is still under development and this section will be updated when the in-framework deployment is released.

Optimized Inference for LLMs using TensorRT-LLM

For scenarios requiring optimized performance, NeMo models can leverage TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs. This process involves converting NeMo models into a format compatible with TensorRT-LLM using the nemo.export module. Unlike the NIM path for LLMs, this option does not include the advanced batching algorithms, such as in-flight batching with the TensorRT-LLM Triton backend, that deliver the fastest LLM inference. Note that this deployment path supports only selected LLM models.

As new information becomes available, this section will be updated for future releases.

Supported GPUs

TensorRT-LLM supports NVIDIA DGX H100 systems and GPUs based on the NVIDIA Hopper, NVIDIA Ada Lovelace, NVIDIA Ampere, NVIDIA Turing, and NVIDIA Volta architectures.

Supported LLMs

The following table shows the supported LLMs and their parameters in the distributed NeMo checkpoint format.

Model Name    Model Parameters    NeMo Precision    TensorRT-LLM Precision
GPT           2B, 8B, 43B         bfloat16          bfloat16
Nemotron      8B, 22B             bfloat16          bfloat16
Llama 2       7B, 13B, 70B        bfloat16          bfloat16
Llama 3       8B, 70B             bfloat16          bfloat16
Falcon        7B, 40B             bfloat16          bfloat16
Gemma         2B, 7B              bfloat16          bfloat16
StarCoder1    15B                 bfloat16          bfloat16
StarCoder2    3B, 7B, 15B         bfloat16          bfloat16
MISTRAL       7B                  bfloat16          bfloat16
MIXTRAL       8x7B                bfloat16          bfloat16

Only Megatron Core-based NeMo models with the distributed checkpoint format are supported. There are two types of NeMo checkpoint files: .nemo and .qnemo.

  • .nemo file:

    contains a YAML config file, a model weights folder, and the tokenizer (if not available online). Trained models are stored in this file format with bfloat16 precision for the weight values.

  • .qnemo file:

    contains a YAML config file, quantized model weights, and the tokenizer (if not available online). Quantized models are stored in this file format.
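
If you want to confirm what a checkpoint contains before exporting it, you can inspect the archive directly. The following is a minimal sketch, assuming the .nemo file is packaged as a tar archive (the usual NeMo packaging); the path refers to the Nemotron checkpoint used in the examples below.

    import tarfile

    # Minimal sketch: list the contents of a .nemo checkpoint.
    # Assumes the .nemo file is a tar archive, which is the usual NeMo packaging;
    # the path below is illustrative.
    checkpoint_path = "/opt/checkpoints/nemotron-3-8b-base-4k.nemo"

    with tarfile.open(checkpoint_path) as archive:
        for member in archive.getnames():
            print(member)  # e.g. the YAML config, model weights folder, tokenizer files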

Options for Running In-Framework and Optimized Inference using TensorRT-LLM for LLMs

The NeMo Framework provides several options for running in-framework and optimized inference, including scripts and Python APIs, which are described in the following sections. The examples in these sections demonstrate how to run optimized inference. In-framework inference is still under development, and the related documentation will be added in future releases.

Access the Models with a Hugging Face Token

If you want to run inference using the StarCoder1, StarCoder2, or Llama 3 models, you’ll need to generate a Hugging Face token that has access to these models. Visit Hugging Face for more information. After you have the token, perform one of the following steps.

  • Log in to Hugging Face:

    huggingface-cli login
    
  • Or, set the HF_TOKEN environment variable:

    export HF_TOKEN=your_token_here
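
Either way, you can verify that the token is picked up before starting an export or deployment job. The following is a minimal sketch using the huggingface_hub library; the gated model ID is only an example of a repository you might need access to.

    import os

    from huggingface_hub import HfApi

    # Minimal sketch: confirm that a Hugging Face token is available and valid.
    # Falls back to the cached huggingface-cli login if HF_TOKEN is not set.
    api = HfApi(token=os.environ.get("HF_TOKEN"))
    print("Logged in as:", api.whoami()["name"])  # raises if the token is missing or invalid

    # Optional: check that the token can see a gated repository (example model ID).
    print(api.model_info("meta-llama/Meta-Llama-3-8B").id)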
    

Export and Deploy a NeMo Checkpoint to TensorRT-LLM

This section provides an example of how to quickly and easily deploy a NeMo checkpoint to TensorRT-LLM. Nemotron will be used as an example model. The steps in this section work with most NVIDIA NeMo LLM models. Please consult the LLM model table above for a complete list of supported models.

  1. Download the nemotron-3-8b-base-4k model from the following link:

    https://developer.nvidia.com/nemotron-3-8b

  2. Fill in an application form to get access to the model.

    An approval email will be sent to you along with instructions.

  3. Follow the instructions to download the Nemotron checkpoint file from the NVIDIA GPU Cloud (NGC) registry.

  4. After downloading the Nemotron checkpoint file, pull down and run the Docker container image using the command shown below. Change the :vr tag to the version of the container you want to use:

    docker pull nvcr.io/nvidia/nemo:vr
    
    docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}/nemotron-3-8b-base-4k_v1.0/Nemotron-3-8B-Base-4k.nemo:/opt/checkpoints/nemotron-3-8b-base-4k.nemo -w /opt/NeMo nvcr.io/nvidia/nemo:vr
    
  5. Run the following deployment script to verify that everything is working correctly. The script exports the downloaded NeMo checkpoint to TensorRT-LLM and subsequently serves it on the Triton server:

    python scripts/deploy/nlp/deploy_triton.py --nemo_checkpoint /opt/checkpoints/nemotron-3-8b-base-4k.nemo --model_type gptnext --triton_model_name nemotron
    

    If you only want to export the NeMo checkpoint to TensorRT-LLM, use the scripts/export/export_to_trt.py script. Its parameters are similar to those of the scripts/deploy/nlp/deploy_triton.py script, but it excludes the deployment step.

  6. If the test yields a shared memory-related error, increase the shared memory size in the docker run command using --shm-size.

  7. In a separate terminal, run the following command to get the container ID of the running container. Look for the nvcr.io/nvidia/nemo:vr image to find the container ID.

    docker ps
    
  8. Access the running container as shown below, replacing container_id with the actual container ID from the previous command.

    docker exec -it container_id bash
    
  9. To send a query to the Triton server, run the following script:

    python scripts/deploy/nlp/query.py -mn nemotron -p "What is the color of a banana?" -mot 5
    
  10. To export and deploy a different model, such as Llama 3, Mixtral, or StarCoder, change the model_type and nemo_checkpoint arguments passed to the scripts/deploy/nlp/deploy_triton.py script.
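
At this point, the Triton server started in step 5 should be live. Before sending more queries or writing your own client, you can optionally confirm this with Triton's standard HTTP health endpoints. The following is a minimal sketch, assuming the default port mapping (-p 8000:8000) from the docker run command above and the triton_model_name nemotron used in the deployment script.

    import requests

    # Minimal sketch: check that the Triton server is live and the model is loaded.
    # Assumes the default HTTP port 8000 and the model name "nemotron" used above.
    base_url = "http://localhost:8000"

    server_ready = requests.get(f"{base_url}/v2/health/ready")
    print("server ready:", server_ready.status_code == 200)

    model_ready = requests.get(f"{base_url}/v2/models/nemotron/ready")
    print("model ready:", model_ready.status_code == 200)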

Use a Script to Run Inference on a Triton Server

You can deploy an LLM from a NeMo checkpoint on Triton using the provided script. The deployment options include in-framework inference and optimized inference with TensorRT-LLM. Currently, only optimized inference with TensorRT-LLM is supported, and the following steps pertain to that mode.

Export and Deploy an LLM Model to TensorRT-LLM

When you execute the script with the optimized inference option selected, it exports the model to TensorRT-LLM and then starts the service on Triton.

  1. Start the container using the steps described in the previous section.

  2. To begin serving the downloaded model, run the following script:

    python scripts/deploy/nlp/deploy_triton.py --nemo_checkpoint /opt/checkpoints/nemotron-3-8b-base-4k.nemo --model_type gptnext --triton_model_name nemotron
    

    The following parameters are defined in the deploy_triton.py script:

    • nemo_checkpoint - path of the .nemo or .qnemo checkpoint file.

    • model_type - type of the model. choices=["gptnext", "gpt", "llama", "falcon", "starcoder", "mixtral", "gemma"].

    • triton_model_name - name of the model on Triton.

    • triton_model_version - version of the model. Default is 1.

    • triton_port - port for the Triton server to listen for requests. Default is 8000.

    • triton_http_address - HTTP address for the Triton server. Default is 0.0.0.0.

    • triton_model_repository - folder for the exported TensorRT-LLM model files. Default is /tmp/trt_llm_model_dir/.

    • num_gpus - number of GPUs to use for inference. Large models require multi-gpu export.

    • dtype - data type of the model on TensorRT-LLM. Default is "bfloat16". Currently, only "bfloat16" is supported.

    • max_input_len - maximum input length of the model.

    • max_output_len - maximum output length of the model.

    • max_batch_size - maximum batch size of the model.

    • ptuning_nemo_checkpoint - source .nemo file for prompt embeddings table.

    • task_ids - unique task names for the prompt embedding.

    • max_prompt_embedding_table_size - max prompt embedding table size.

    • lora_ckpt - a list of LoRA checkpoints containing the LoRA weights.

    • use_lora_plugin - activates the LoRA plugin, which enables embedding sharing.

    • lora_target_modules - specifies the modules to which LoRA is added. Only effective when use_lora_plugin is enabled.

    • max_lora_rank - maximum LoRA rank across the LoRA modules; used to compute the workspace size of the LoRA plugin.

    • no_paged_kv_cache - disables the paged KV cache in TensorRT-LLM.

    • disable_remove_input_padding - disables the remove-input-padding option of TensorRT-LLM.

    Note

    The parameters described here are generalized and should be compatible with any NeMo checkpoint. It is important, however, that you check the LLM model table above for optimized inference model compatibility. We are actively working on extending support to other checkpoints.

    Each time the script is executed, it initiates the service by exporting the NeMo checkpoint to TensorRT-LLM. To avoid re-exporting on every run, you can export the checkpoint to a dedicated folder once by passing an initially empty directory with --triton_model_repository, and then load the exported model directly from that folder on subsequent runs, as shown in the following steps.

  3. To export and deploy a different model, such as Llama 3, Mixtral, or StarCoder, change the model_type argument passed to the scripts/deploy/nlp/deploy_triton.py script. See the table below to find which model_type to use for each LLM.

    Model Name    model_type
    GPT           gpt
    Nemotron      gpt
    Llama 2       llama
    Llama 3       llama
    Falcon        falcon
    Gemma         gemma
    StarCoder1    starcoder
    StarCoder2    starcoder
    MISTRAL       llama
    MIXTRAL       mixtral

  4. Stop the running container, and then run the following commands to create an empty directory and export the checkpoint into it:

    mkdir tmp_triton_model_repository
    
    docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}:/opt/checkpoints/ -w /opt/NeMo nvcr.io/nvidia/nemo:vr
    
    python scripts/deploy/nlp/deploy_triton.py --nemo_checkpoint /opt/checkpoints/nemotron-3-8b-base-4k_v1.0/Nemotron-3-8B-Base-4k.nemo --model_type="gptnext" --triton_model_name nemotron --triton_model_repository /opt/checkpoints/tmp_triton_model_repository
    

    After you execute the script above, the checkpoint is exported to the specified folder.

  5. To load the exported model directly, run the following script within the container:

    python scripts/deploy/nlp/deploy_triton.py --triton_model_name nemotron --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --model_type="gptnext"
    

Use Prompt Embedding Tables

You can use learned virtual tokens to perform a downstream task during inference. Once the virtual tokens are learned using the NeMo Framework training container, they are saved in a .nemo file. You can feed this file into the script as shown in the following commands. Since no NeMo checkpoint for virtual tokens is available on NVIDIA NGC or Hugging Face, you’ll need to find or generate one yourself.

  1. Assuming there is a checkpoint for the prompt embedding table, run the following command:

    python scripts/deploy/nlp/deploy_triton.py --nemo_checkpoint /opt/checkpoints/nemotron-3-8b-base-4k_v1.0/Nemotron-3-8B-Base-4k.nemo --model_type="gptnext" --triton_model_name nemotron --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --max_prompt_embedding_table_size 1024 --ptuning_nemo_checkpoint /opt/checkpoints/my_ptuning_table.nemo --task_ids "task 1"
    

    The max_prompt_embedding_table_size parameter should be set to the total number of virtual tokens across all of the downstream tasks.

  2. To pass multiple NeMo checkpoints, run the following command:

    python scripts/deploy/nlp/deploy_triton.py --nemo_checkpoint /opt/checkpoints/nemotron-3-8b-base-4k_v1.0/Nemotron-3-8B-Base-4k.nemo --model_type="gptnext" --triton_model_name nemotron --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --max_prompt_embedding_table_size 1024 --ptuning_nemo_checkpoint /opt/checkpoints/my_ptuning_table-1.nemo /opt/checkpoints/my_ptuning_table-2.nemo --task_ids "task 1" "task 2"
    

    Please make sure that the combined number of virtual tokens in my_ptuning_table-1.nemo and my_ptuning_table-2.nemo does not exceed the max_prompt_embedding_table_size parameter.

Send a Query

After you start the service using the scripts from the previous sections, it waits for incoming requests. You can send a query to this service in several ways.

  • Use the Query Script: Execute the query script within the currently running container.

  • PyTriton: Utilize PyTriton to send requests directly.

  • HTTP Requests: Make HTTP requests using various tools or libraries (see the sketch after the steps below).

The following example shows how to execute the query script within the currently running container.

  1. To use a query script, run the following command:

    python scripts/deploy/nlp/query.py --url "http://localhost:8000" --model_name nemotron --prompt "What is the capital of United States?"
    
  2. Change the url and model_name arguments to match your server address and the name of your deployed model. The code in the script can also be used as a basis for your own client code.

  3. If there is a prompt embedding table, run the following command to send a query:

    python scripts/deploy/nlp/query.py --url "http://localhost:8000" --model_name nemotron --prompt "What is the capital of United States?" --task_id "task 1"
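
For the HTTP option mentioned earlier, the exact input and output tensor names depend on how the model was deployed, so a practical first step is to read them from Triton's standard model metadata endpoint and build your inference request accordingly. The following is a minimal sketch, assuming the default port 8000 and the model name nemotron used in the previous steps.

    import json

    import requests

    # Minimal sketch: inspect the deployed model over Triton's HTTP (KServe v2) API.
    # The reported inputs and outputs tell you which tensors a request to
    # /v2/models/nemotron/infer must provide.
    base_url = "http://localhost:8000"
    metadata = requests.get(f"{base_url}/v2/models/nemotron").json()
    print(json.dumps(metadata, indent=2))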
    

Use NeMo Export and Deploy Module APIs to Run Inference

Up until now, we’ve used scripts for exporting and deploying LLM models. However, NeMo’s Deploy and Export modules offer straightforward APIs for deploying models to Triton and exporting NeMo checkpoints to TensorRT-LLM.

Export an LLM Model to TensorRT-LLM

You can use the APIs in the export module to export a NeMo checkpoint to TensorRT-LLM. The following code example assumes the Nemotron-3-8B-Base-4k.nemo checkpoint has already been downloaded and mounted to the /opt/checkpoints/ path. Additionally, the /opt/checkpoints/tmp_triton_model_repository path used as the export directory is assumed to exist.

  1. Run the following command:

    from nemo.export import TensorRTLLM
    
    trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_triton_model_repository/")
    trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/nemotron-3-8b-base-4k_v1.0/Nemotron-3-8B-Base-4k.nemo", model_type="gptnext", n_gpus=1)
    output = trt_llm_exporter.forward(["What is the best city in the world?"], max_output_token=15, top_k=1, top_p=0.0, temperature=1.0)
    print("output: ", output)
    
  2. Be sure to check the TensorRTLLM class docstrings for details.

Deploy an LLM Model to TensorRT-LLM

You can use the APIs in the deploy module to deploy a TensorRT-LLM model to Triton. The following code example assumes the Nemotron-3-8B-Base-4k.nemo checkpoint has already been downloaded and mounted to the /opt/checkpoints/ path. Additionally, the /opt/checkpoints/tmp_triton_model_repository path used as the export directory is assumed to exist.

  1. Run the following command:

    from nemo.export import TensorRTLLM
    from nemo.deploy import DeployPyTriton
    
    trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_triton_model_repository/")
    trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/nemotron-3-8b-base-4k_v1.0/Nemotron-3-8B-Base-4k.nemo", model_type="gptnext", n_gpus=1)
    
    nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="nemotron", port=8000)
    nm.deploy()
    nm.serve()
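
The nm.serve() call blocks while the server handles requests. If you prefer to start the server, send a test query, and shut it down from the same Python process, recent NeMo releases also expose non-blocking controls on DeployPyTriton. The following is a sketch under that assumption; check the DeployPyTriton docstrings for your version, and note that trt_llm_exporter is the exporter created in the example above.

    # Sketch: serve in the background, query, then stop, all in one process.
    # Assumes DeployPyTriton provides run() and stop(); verify against the
    # docstrings of your NeMo version. trt_llm_exporter comes from the previous example.
    from nemo.deploy import DeployPyTriton
    from nemo.deploy.nlp import NemoQueryLLM

    nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="nemotron", port=8000)
    nm.deploy()
    nm.run()  # non-blocking, unlike serve()

    nq = NemoQueryLLM(url="localhost:8000", model_name="nemotron")
    print(nq.query_llm(prompts=["What is the color of a banana?"], max_output_token=5, top_k=1, top_p=0.0, temperature=1.0))

    nm.stop()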
    

Send a Query

The NeMo Framework provides NemoQueryLLM APIs to send a query to the Triton server for convenience. These APIs are only accessible from the NeMo Framework container.

  1. To run the request example using NeMo APIs, run the following command:

    from nemo.deploy.nlp import NemoQueryLLM
    
    nq = NemoQueryLLM(url="localhost:8000", model_name="nemotron")
    output = nq.query_llm(prompts=["What is the capital of United States?"], max_output_token=10, top_k=1, top_p=0.0, temperature=1.0)
    print(output)
    
  2. Change the url and model_name arguments to match your server address and the name of your deployed model. Please check the NemoQueryLLM class docstrings for details.

  3. If there is a prompt embedding table, run the following command to send a query:

    output = nq.query_llm(prompts=["What is the capital of United States?"], max_output_token=10, top_k=1, top_p=0.0, temperature=1.0, task_id="0")