Deploy NeMo Framework Models

NVIDIA NeMo Framework offers various deployment paths for NeMo models, tailored to different domains such as Large Language Models (LLMs) and Multimodal Models (MMs). There are three primary deployment paths for NeMo models: enterprise-level deployment with NVIDIA Inference Microservice (NIM), optimized inference via exporting to another library and deploying with the NVIDIA Triton Inference Server, and in-framework inference. To begin serving your model on these three deployment paths, all you need is a NeMo checkpoint. You can find the support matrix for the different domains below.

Domain    NVIDIA NIM    Optimized    In-Framework
LLMs      Yes           Yes          N/A
MMs       N/A           N/A          N/A

While a number of deployment paths are currently available, others are still in development. As each unique deployment path becomes available, it will be added to this section.

The following sections describe the paths that are available to you today for working with LLMs. Support for MMs will be added in coming releases.

NVIDIA NIM for LLMs

Enterprises seeking a comprehensive solution that covers both on-premises and cloud deployment can use NVIDIA NIM. This approach leverages the NVIDIA AI Enterprise suite, which includes support for NVIDIA NeMo, Triton Inference Server, TensorRT-LLM, and other AI software.

This option is ideal for organizations requiring a reliable and scalable solution to deploy generative AI models in production environments. It also stands out as the fastest inference option, offering user-friendly scripts and APIs. Leveraging the TensorRT-LLM Triton backend, it achieves rapid inference using advanced batching algorithms, including in-flight batching. Note that this deployment path supports only selected LLM models.

To learn more about NVIDIA NIM, visit the NVIDIA website.

In-Framework Inference for LLMs using the NeMo Framework

In-framework inference involves running LLMs directly within the NeMo Framework. This approach is straightforward and eliminates the need to export models to another format. It is ideal for development and testing phases, where ease of use and flexibility are critical. The NeMo Framework supports multi-node and multi-GPU inference, maximizing throughput. This method allows for rapid iterations and direct testing within the NeMo environment. Although it is the slowest option, it supports all NeMo models.

This deployment path is still under development and this section will be updated when the in-framework deployment is released.

Optimized Inference for LLMs using TensorRT-LLM

For scenarios requiring optimized performance, NeMo models can leverage TensorRT-LLM, a specialized library for accelerating and optimizing LLM inference on NVIDIA GPUs. This process involves converting NeMo models into a format compatible with TensorRT-LLM using the nemo.export module. Unlike the NIM path for LLMs, this option does not include the advanced batching algorithms, such as in-flight batching with the TensorRT-LLM Triton backend, that deliver the fastest LLM inference. Note that this deployment path supports only selected LLM models.

As new information becomes available, this section will be updated for future releases.

Supported GPUs

TensorRT-LLM supports NVIDIA DGX H100 systems and GPUs based on the NVIDIA Hopper, NVIDIA Ada Lovelace, NVIDIA Ampere, NVIDIA Turing, and NVIDIA Volta architectures.

Supported LLMs

The following table shows the supported LLMs and their parameters in the distributed NeMo checkpoint format.

Model Name    Model Parameters    NeMo Precision    TensorRT-LLM Precision
GPT           2B, 8B, 43B         bfloat16          bfloat16
Nemotron      8B, 22B             bfloat16          bfloat16
Llama 2       7B, 13B, 70B        bfloat16          bfloat16
Llama 3       8B, 70B             bfloat16          bfloat16
Falcon        7B, 40B             bfloat16          bfloat16
Gemma         2B, 7B              bfloat16          bfloat16
StarCoder1    15B                 bfloat16          bfloat16
StarCoder2    3B, 7B, 15B         bfloat16          bfloat16
MISTRAL       7B                  bfloat16          bfloat16
MIXTRAL       8x7B                bfloat16          bfloat16

Only Megatron Core-based NeMo models with the distributed checkpoint format are supported. There are two types of NeMo checkpoint files: .nemo and .qnemo.

  • .nemo file:

    contains a YAML config file, a model weights folder, and the tokenizer (if not available online). Trained models are stored in this file format with bfloat16 precision for the weight values.

  • .qnemo file:

    contains a YAML config file, quantized model weights, and the tokenizer (if not available online). Quantized models are stored in this file format.
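
If you want to confirm what a checkpoint contains before exporting it, you can inspect the archive directly. The following is a minimal sketch, assuming the .nemo file is packaged as a tar archive (the usual NeMo packaging); the path refers to the Nemotron checkpoint used in the examples below.

    import tarfile

    # Minimal sketch: list the contents of a .nemo checkpoint.
    # Assumes the .nemo file is a tar archive, which is the usual NeMo packaging;
    # the path below is illustrative.
    checkpoint_path = "/opt/checkpoints/nemotron-3-8b-base-4k.nemo"

    with tarfile.open(checkpoint_path) as archive:
        for member in archive.getnames():
            print(member)  # e.g. the YAML config, model weights folder, tokenizer files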

Options for Running In-Framework and Optimized Inference using TensorRT-LLM for LLMs

The NeMo Framework provides several options for running in-framework and optimized inference, including scripts and Python APIs, which are described in the following sections. The examples in these sections demonstrate how to run optimized inference. In-framework inference is still under development, and the related documentation will be added in future releases.

Access the Models with a Hugging Face Token

If you want to run inference using the StarCoder1, StarCoder2, or Llama 3 models, you’ll need to generate a Hugging Face token that has access to these models. Visit Hugging Face for more information. After you have the token, perform one of the following steps.

  • Log in to Hugging Face:

    huggingface-cli login
    
  • Or, set the HF_TOKEN environment variable:

    export HF_TOKEN=your_token_here
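
Either way, you can verify that the token is picked up before starting an export or deployment job. The following is a minimal sketch using the huggingface_hub library; the gated model ID is only an example of a repository you might need access to.

    import os

    from huggingface_hub import HfApi

    # Minimal sketch: confirm that a Hugging Face token is available and valid.
    # Falls back to the cached huggingface-cli login if HF_TOKEN is not set.
    api = HfApi(token=os.environ.get("HF_TOKEN"))
    print("Logged in as:", api.whoami()["name"])  # raises if the token is missing or invalid

    # Optional: check that the token can see a gated repository (example model ID).
    print(api.model_info("meta-llama/Meta-Llama-3-8B").id)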
    

Export and Deploy a NeMo Checkpoint to TensorRT-LLM

This section provides an example of how to quickly and easily deploy a NeMo checkpoint to TensorRT-LLM. Nemotron will be used as an example model. The steps in this section work with most NVIDIA NeMo LLM models. Please consult the LLM model table above for a complete list of supported models.

  1. Download the nemotron-3-8b-base-4k model from the following link:

    https://developer.nvidia.com/nemotron-3-8b

  2. Fill in an application form to get access to the model.

    An approval email will be sent to you along with instructions.

  3. Follow the instructions to download the Nemotron checkpoint file from the NVIDIA GPU Cloud (NGC) registry.

  4. After downloading the Nemotron checkpoint file, pull down and run the Docker container image using the command shown below. Change the :vr tag to the version of the container you want to use:

    docker pull nvcr.io/nvidia/nemo:vr
    
    docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}/nemotron-3-8b-base-4k_v1.0/Nemotron-3-8B-Base-4k.nemo:/opt/checkpoints/nemotron-3-8b-base-4k.nemo -w /opt/NeMo nvcr.io/nvidia/nemo:vr
    
  5. Run the following deployment script to verify that everything is working correctly. The script exports the downloaded NeMo checkpoint to TensorRT-LLM and subsequently serves it on the Triton server:

    python scripts/deploy/nlp/deploy_triton.py --nemo_checkpoint /opt/checkpoints/nemotron-3-8b-base-4k.nemo --model_type gptnext --triton_model_name nemotron
    

    If you only want to export the NeMo checkpoint to TensorRT-LLM, use the scripts/export/export_to_trt.py script. Its parameters are similar to those of the scripts/deploy/nlp/deploy_triton.py script, but it excludes the deployment step.

  6. If the test yields a shared memory-related error, increase the shared memory size in the docker run command using --shm-size.

  7. In a separate terminal, run the following command to get the container ID of the running container. Look for the nvcr.io/nvidia/nemo:vr image to find the container ID.

    docker ps
    
  8. Access the running container as shown below, replacing container_id with the actual container ID from the previous command.

    docker exec -it container_id bash
    
  9. To send a query to the Triton server, run the following script:

    python scripts/deploy/nlp/query.py -mn nemotron -p "What is the color of a banana?" -mot 5
    
  10. To export and deploy a different model, such as Llama 3, Mixtral, or StarCoder, change the model_type and nemo_checkpoint arguments passed to the scripts/deploy/nlp/deploy_triton.py script.
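
At this point, the Triton server started in step 5 should be live. Before sending more queries or writing your own client, you can optionally confirm this with Triton's standard HTTP health endpoints. The following is a minimal sketch, assuming the default port mapping (-p 8000:8000) from the docker run command above and the triton_model_name nemotron used in the deployment script.

    import requests

    # Minimal sketch: check that the Triton server is live and the model is loaded.
    # Assumes the default HTTP port 8000 and the model name "nemotron" used above.
    base_url = "http://localhost:8000"

    server_ready = requests.get(f"{base_url}/v2/health/ready")
    print("server ready:", server_ready.status_code == 200)

    model_ready = requests.get(f"{base_url}/v2/models/nemotron/ready")
    print("model ready:", model_ready.status_code == 200)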

Use a Script to Run Inference on a Triton Server

You can deploy an LLM from a NeMo checkpoint on Triton using the provided script. The deployment options include in-framework inference and optimized inference with TensorRT-LLM. Currently, only optimized inference with TensorRT-LLM is supported, and the following steps pertain to that mode.

Export and Deploy an LLM Model to TensorRT-LLM

When you execute the script with the optimized inference option selected, it exports the model to TensorRT-LLM and then starts the service on Triton.

  1. Start the container using the steps described in the previous section.

  2. To begin serving the downloaded model, run the following script:

    python scripts/deploy/nlp/deploy_triton.py --nemo_checkpoint /opt/checkpoints/nemotron-3-8b-base-4k.nemo --model_type gptnext --triton_model_name nemotron
    

    The following parameters are defined in the deploy_triton.py script:

    • nemo_checkpoint - path of the .nemo or .qnemo checkpoint file.

    • model_type - type of the model. choices=["gptnext", "gpt", "llama", "falcon", "starcoder", "mixtral", "gemma"].

    • triton_model_name - name of the model on Triton.

    • triton_model_version - version of the model. Default is 1.

    • triton_port - port for the Triton server to listen for requests. Default is 8000.

    • triton_http_address - HTTP address for the Triton server. Default is 0.0.0.0.

    • triton_model_repository - folder for the exported TensorRT-LLM model files. Default is /tmp/trt_llm_model_dir/.

    • num_gpus - number of GPUs to use for inference. Large models require multi-gpu export.

    • dtype - data type of the model on TensorRT-LLM. Default is "bfloat16". Currently, only "bfloat16" is supported.

    • max_input_len - maximum input length of the model.

    • max_output_len - maximum output length of the model.

    • max_batch_size - maximum batch size of the model.

    • ptuning_nemo_checkpoint - source .nemo file for prompt embeddings table.

    • task_ids - unique task names for the prompt embedding.

    • max_prompt_embedding_table_size - max prompt embedding table size.

    • lora_ckpt - a list of LoRA checkpoints containing the LoRA weights.

    • use_lora_plugin - activates the LoRA plugin, which enables embedding sharing.

    • lora_target_modules - specifies the modules to which LoRA is added. Only effective when use_lora_plugin is enabled.

    • max_lora_rank - maximum LoRA rank across the LoRA modules; used to compute the workspace size of the LoRA plugin.

    • no_paged_kv_cache - disables the paged KV cache in TensorRT-LLM.

    • disable_remove_input_padding - disables the remove-input-padding option of TensorRT-LLM.

    Note

    The parameters described here are generalized and should be compatible with any NeMo checkpoint. It is important, however, that you check the LLM model table above for optimized inference model compatibility. We are actively working on extending support to other checkpoints.

    Each time the script is executed, it initiates the service by exporting the NeMo checkpoint to TensorRT-LLM. To avoid re-exporting on every run, you can export the checkpoint to a dedicated folder once by passing an initially empty directory with --triton_model_repository, and then load the exported model directly from that folder on subsequent runs, as shown in the following steps.

  3. To export and deploy a different model, such as Llama 3, Mixtral, or StarCoder, change the model_type argument passed to the scripts/deploy/nlp/deploy_triton.py script. See the table below to find which model_type to use for each LLM.

    Model Name    model_type
    GPT           gpt
    Nemotron      gpt
    Llama 2       llama
    Llama 3       llama
    Falcon        falcon
    Gemma         gemma
    StarCoder1    starcoder
    StarCoder2    starcoder
    MISTRAL       llama
    MIXTRAL       mixtral

  4. Stop the running container, and then run the following commands to create an empty directory and export the checkpoint into it:

    mkdir tmp_triton_model_repository
    
    docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}:/opt/checkpoints/ -w /opt/NeMo nvcr.io/nvidia/nemo:vr
    
    python scripts/deploy/nlp/deploy_triton.py --nemo_checkpoint /opt/checkpoints/nemotron-3-8b-base-4k_v1.0/Nemotron-3-8B-Base-4k.nemo --model_type="gptnext" --triton_model_name nemotron --triton_model_repository /opt/checkpoints/tmp_triton_model_repository
    

    After you execute the script above, the checkpoint is exported to the specified folder.

  5. To load the exported model directly, run the following script within the container:

    python scripts/deploy/nlp/deploy_triton.py --triton_model_name nemotron --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --model_type="gptnext"
    

Use Prompt Embedding Tables

You can use learned virtual tokens to perform a downstream task during inference. Once the virtual tokens are learned using the NeMo Framework training container, they are saved in a .nemo file. You can feed this file into the script as shown in the following commands. Since no NeMo checkpoint for virtual tokens is available on NVIDIA NGC or Hugging Face, you’ll need to find or generate one yourself.

  1. Assuming there is a checkpoint for the prompt embedding table, run the following command:

    python scripts/deploy/nlp/deploy_triton.py --nemo_checkpoint /opt/checkpoints/nemotron-3-8b-base-4k_v1.0/Nemotron-3-8B-Base-4k.nemo --model_type="gptnext" --triton_model_name nemotron --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --max_prompt_embedding_table_size 1024 --ptuning_nemo_checkpoint /opt/checkpoints/my_ptuning_table.nemo --task_ids "task 1"
    

    The max_prompt_embedding_table_size parameter should be set to the total number of virtual tokens across all of the downstream tasks.

  2. To pass multiple NeMo checkpoints, run the following command:

    python scripts/deploy/nlp/deploy_triton.py --nemo_checkpoint /opt/checkpoints/nemotron-3-8b-base-4k_v1.0/Nemotron-3-8B-Base-4k.nemo --model_type="gptnext" --triton_model_name nemotron --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --max_prompt_embedding_table_size 1024 --ptuning_nemo_checkpoint /opt/checkpoints/my_ptuning_table-1.nemo /opt/checkpoints/my_ptuning_table-2.nemo --task_ids "task 1" "task 2"
    

    Please make sure that the combined number of virtual tokens in my_ptuning_table-1.nemo and my_ptuning_table-2.nemo does not exceed the max_prompt_embedding_table_size parameter.

Send a Query

After you start the service using the scripts from the previous sections, it waits for incoming requests. You can send a query to this service in several ways.

  • Use the Query Script: Execute the query script within the currently running container.

  • PyTriton: Utilize PyTriton to send requests directly.

  • HTTP Requests: Make HTTP requests using various tools or libraries (see the sketch after the steps below).

The following example shows how to execute the query script within the currently running container.

  1. To use a query script, run the following command:

    python scripts/deploy/nlp/query.py --url "http://localhost:8000" --model_name nemotron --prompt "What is the capital of United States?"
    
  2. Change the url and model_name arguments to match your server address and the name of your deployed model. The code in the script can also be used as a basis for your own client code.

  3. If there is a prompt embedding table, run the following command to send a query:

    python scripts/deploy/nlp/query.py --url "http://localhost:8000" --model_name nemotron --prompt "What is the capital of United States?" --task_id "task 1"
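
For the HTTP option mentioned earlier, the exact input and output tensor names depend on how the model was deployed, so a practical first step is to read them from Triton's standard model metadata endpoint and build your inference request accordingly. The following is a minimal sketch, assuming the default port 8000 and the model name nemotron used in the previous steps.

    import json

    import requests

    # Minimal sketch: inspect the deployed model over Triton's HTTP (KServe v2) API.
    # The reported inputs and outputs tell you which tensors a request to
    # /v2/models/nemotron/infer must provide.
    base_url = "http://localhost:8000"
    metadata = requests.get(f"{base_url}/v2/models/nemotron").json()
    print(json.dumps(metadata, indent=2))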
    

Use NeMo Export and Deploy Module APIs to Run Inference

Up until now, we’ve used scripts for exporting and deploying LLM models. However, NeMo’s Deploy and Export modules offer straightforward APIs for deploying models to Triton and exporting NeMo checkpoints to TensorRT-LLM.

Export an LLM Model to TensorRT-LLM

You can use the APIs in the export module to export a NeMo checkpoint to TensorRT-LLM. The following code example assumes the Nemotron-3-8B-Base-4k.nemo checkpoint has already been downloaded and mounted to the /opt/checkpoints/ path. Additionally, the /opt/checkpoints/tmp_triton_model_repository path used as the export directory is assumed to exist.

  1. Run the following command:

    from nemo.export import TensorRTLLM
    
    trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_triton_model_repository/")
    trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/nemotron-3-8b-base-4k_v1.0/Nemotron-3-8B-Base-4k.nemo", model_type="gptnext", n_gpus=1)
    output = trt_llm_exporter.forward(["What is the best city in the world?"], max_output_token=15, top_k=1, top_p=0.0, temperature=1.0)
    print("output: ", output)
    
  2. Be sure to check the TensorRTLLM class docstrings for details.

Deploy an LLM Model to TensorRT-LLM

You can use the APIs in the deploy module to deploy a TensorRT-LLM model to Triton. The following code example assumes the Nemotron-3-8B-Base-4k.nemo checkpoint has already been downloaded and mounted to the /opt/checkpoints/ path. Additionally, the /opt/checkpoints/tmp_triton_model_repository path used as the export directory is assumed to exist.

  1. Run the following command:

    from nemo.export import TensorRTLLM
    from nemo.deploy import DeployPyTriton
    
    trt_llm_exporter = TensorRTLLM(model_dir="/opt/checkpoints/tmp_triton_model_repository/")
    trt_llm_exporter.export(nemo_checkpoint_path="/opt/checkpoints/nemotron-3-8b-base-4k_v1.0/Nemotron-3-8B-Base-4k.nemo", model_type="gptnext", n_gpus=1)
    
    nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="nemotron", port=8000)
    nm.deploy()
    nm.serve()
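
The nm.serve() call blocks while the server handles requests. If you prefer to start the server, send a test query, and shut it down from the same Python process, recent NeMo releases also expose non-blocking controls on DeployPyTriton. The following is a sketch under that assumption; check the DeployPyTriton docstrings for your version, and note that trt_llm_exporter is the exporter created in the example above.

    # Sketch: serve in the background, query, then stop, all in one process.
    # Assumes DeployPyTriton provides run() and stop(); verify against the
    # docstrings of your NeMo version. trt_llm_exporter comes from the previous example.
    from nemo.deploy import DeployPyTriton
    from nemo.deploy.nlp import NemoQueryLLM

    nm = DeployPyTriton(model=trt_llm_exporter, triton_model_name="nemotron", port=8000)
    nm.deploy()
    nm.run()  # non-blocking, unlike serve()

    nq = NemoQueryLLM(url="localhost:8000", model_name="nemotron")
    print(nq.query_llm(prompts=["What is the color of a banana?"], max_output_token=5, top_k=1, top_p=0.0, temperature=1.0))

    nm.stop()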
    

Send a Query

The NeMo Framework provides NemoQueryLLM APIs to send a query to the Triton server for convenience. These APIs are only accessible from the NeMo Framework container.

  1. To run the request example using NeMo APIs, run the following command:

    from nemo.deploy.nlp import NemoQueryLLM
    
    nq = NemoQueryLLM(url="localhost:8000", model_name="nemotron")
    output = nq.query_llm(prompts=["What is the capital of United States?"], max_output_token=10, top_k=1, top_p=0.0, temperature=1.0)
    print(output)
    
  2. Change the url and model_name arguments to match your server address and the name of your deployed model. Please check the NemoQueryLLM class docstrings for details.

  3. If there is a prompt embedding table, run the following command to send a query:

    output = nq.query_llm(prompts=["What is the capital of United States?"], max_output_token=10, top_k=1, top_p=0.0, temperature=1.0, task_id="0")