Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.

Deploy NeMo Large Language Models

The NeMo Framework provides scripts and APIs to deploy NeMo LLMs to the NVIDIA Triton Inference Server. Before deploying to the inference server, you can optionally export a NeMo LLM to TensorRT-LLM or vLLM for optimized inference. Note that this optimized deployment path supports only selected LLM models.
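
For illustration, below is a minimal sketch of the TensorRT-LLM export-and-deploy path, assuming the TensorRTLLM exporter and DeployPyTriton classes from the nemo.export and nemo.deploy modules. Module paths and argument names can differ between NeMo versions, and all file paths here are placeholders.

    from nemo.deploy import DeployPyTriton
    from nemo.export import TensorRTLLM

    # Export the NeMo checkpoint to a TensorRT-LLM engine.
    # All paths are placeholders; adjust them to your environment.
    exporter = TensorRTLLM(model_dir="/opt/checkpoints/nemotron_trt_llm")
    exporter.export(
        nemo_checkpoint_path="/opt/checkpoints/Nemotron-3-8B-Base-4k.nemo",
        model_type="gptnext",  # model family of the checkpoint
        n_gpus=1,
    )

    # Serve the exported engine with Triton Inference Server.
    nm = DeployPyTriton(model=exporter, triton_model_name="nemotron", port=8000)
    nm.deploy()
    nm.serve()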

The sections below describe how to deploy inference-optimized models and PyTorch-level (in-framework) models, and how to send queries to the deployed models.
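
As an example of querying a deployed model, here is a minimal sketch using the NemoQueryLLM helper from nemo.deploy.nlp. The URL, model name, and prompt are placeholders, and the generation-length argument name can vary across NeMo versions.

    from nemo.deploy.nlp import NemoQueryLLM

    # Connect to the Triton server started during deployment.
    nq = NemoQueryLLM(url="localhost:8000", model_name="nemotron")

    # Send a prompt and print the generated text.
    output = nq.query_llm(prompts=["What is the capital of France?"], max_output_len=32)
    print(output)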

Supported GPUs

TensorRT-LLM supports GPUs based on the NVIDIA Hopper, NVIDIA Ada Lovelace, NVIDIA Ampere, NVIDIA Turing, and NVIDIA Volta architectures, such as the NVIDIA DGX H100 and NVIDIA H100.

Supported NeMo Checkpoint Formats

The NeMo Framework saves models as a .nemo file, which includes data related to the model, such as the weights and the configuration file. The format of the .nemo file has undergone several changes in the past. The framework supports the deployment of Megatron Core-based NeMo models that use the distributed checkpoint format. If you saved the model using one of the most recent NeMo Framework containers, you should not encounter any issues. Otherwise, you will get an error message regarding the .nemo file format.

NeMo checkpoint files come in two types: .nemo and .qnemo. Both file types are supported for deployment.

The .nemo file includes the model weights with default precision. It consists of a YAML config file, a folder for model weights, and the tokenizer (provided it is not available online). The models are trained and stored in this format, with weight values in bfloat16 precision.
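
Because a .nemo file is a tar archive, you can inspect its contents with standard tooling. Here is a minimal sketch, assuming a local checkpoint named Nemotron-3-8B-Base-4k.nemo:

    import tarfile

    # A .nemo checkpoint is a tar archive. Listing its members reveals the
    # YAML config file, the model weights folder, and any bundled tokenizer.
    with tarfile.open("Nemotron-3-8B-Base-4k.nemo", "r:*") as ckpt:
        for name in ckpt.getnames():
            print(name)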

The .qnemo file contains the quantized weights. Similar to the .nemo file, it includes a YAML config file and the tokenizer (if not available online), but the model weights are quantized. For detailed information about .qnemo, please refer to Post-Training Quantization.

Nemotron LLMs

Nemotron LLM is a foundation model with 8 billion parameters. It enables customization, including parameter-efficient fine-tuning and continuous pre-training for domain-adapted LLMs. All of the LLM deployment examples use a Nemotron LLM. Please follow the steps below to download a Nemotron NeMo checkpoint.

  1. Download the nemotron-3-8b-base-4k model from the following link:

    https://huggingface.co/nvidia/nemotron-3-8b-base-4k

    You can find the Nemotron-3-8B-Base-4k.nemo file on the Files and Versions tab (see the sketch after this list for a programmatic download). Alternatively, you can use the following link:

    https://developer.nvidia.com/nemotron-3-8b

  2. If you are using the second link, fill in an application form to get access to the model.

    An approval email will be sent to you along with instructions.

  3. Follow the instructions to download the Nemotron checkpoint file from the NVIDIA GPU Cloud (NGC) registry.
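
If you use the first link, the checkpoint can also be fetched programmatically. Below is a minimal sketch using the huggingface_hub client, assuming you have accepted the model license on Hugging Face and are authenticated (for example, via huggingface-cli login); the destination directory is a placeholder.

    from huggingface_hub import hf_hub_download

    # Download the .nemo checkpoint from the gated Hugging Face repository.
    # Requires prior license acceptance and authentication.
    checkpoint_path = hf_hub_download(
        repo_id="nvidia/nemotron-3-8b-base-4k",
        filename="Nemotron-3-8B-Base-4k.nemo",
        local_dir="/opt/checkpoints",  # placeholder destination
    )
    print(checkpoint_path)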