Important
You are viewing the NeMo 2.0 documentation. This release introduces significant changes to the API and a new library, NeMo Run. We are currently porting all features from NeMo 1.0 to 2.0. For documentation on previous versions or features not yet available in 2.0, please refer to the NeMo 24.07 documentation.
Deploy NeMo Large Language Models#
The NeMo Framework provides scripts and APIs to deploy NeMo LLMs to the NVIDIA Triton Inference Server. You can optionally export a NeMo LLM to TensorRT-LLM or vLLM for optimized inference before deploying it to the inference server. Note that this optimized deployment path supports only selected LLM models.
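As an illustration of the optimized path, the export and deploy steps can be driven from Python. The sketch below is a minimal, non-authoritative example assuming a recent NeMo Framework container; the module paths, the model_type and tensor_parallelism_size arguments, and all file paths are assumptions that may differ between releases.

```python
# Minimal sketch: export a .nemo checkpoint to TensorRT-LLM and serve it with
# Triton. Checkpoint/engine paths, model type, and parallelism are placeholders.
from nemo.deploy import DeployPyTriton
from nemo.export.tensorrt_llm import TensorRTLLM

exporter = TensorRTLLM(model_dir="/opt/checkpoints/trt_llm_engine")
exporter.export(
    nemo_checkpoint_path="/opt/checkpoints/Nemotron-3-8B-Base-4k.nemo",
    model_type="gptnext",       # model family of the checkpoint
    tensor_parallelism_size=1,  # number of GPUs used to build the engine
)

nm = DeployPyTriton(model=exporter, triton_model_name="nemotron", port=8000)
nm.deploy()  # register the exported model with the Triton server
nm.serve()   # block and serve inference requests
```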
Below, you can find details on how to deploy inference-optimized models and PyTorch-level (in-framework) models, how to apply quantization, and how to send queries to deployed models.
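Once a model is serving on Triton, queries can also be sent from Python. The following is a minimal sketch assuming the NemoQueryLLM helper from the NeMo deploy utilities and the default HTTP port used above; parameter names such as max_output_len may vary across NeMo versions.

```python
# Minimal sketch: query an LLM that is already deployed on Triton.
from nemo.deploy.nlp import NemoQueryLLM

nq = NemoQueryLLM(url="localhost:8000", model_name="nemotron")
output = nq.query_llm(
    prompts=["What is the fastest land animal?"],
    max_output_len=64,  # cap on the number of generated tokens
)
print(output)
```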
Supported NeMo Checkpoint Formats#
The NeMo Framework saves models as .nemo files, which include data related to the model, such as weights and configuration files. The format of these .nemo files has evolved over time. The framework supports deploying Megatron Core-based NeMo models using the distributed checkpoint format. If you saved the model using one of the latest NeMo Framework containers, you should not encounter any issues. Otherwise, you may receive an error message regarding the .nemo file format.
NeMo checkpoints are either of the .nemo or qnemo type. Both file types are supported for deployment.
The .nemo checkpoint includes the model weights with default precision. It consists of a YAML config file, a folder for model weights, and the tokenizer (if it is not available online). Models are trained and stored in this format, with weight values in FP16 or BF16 precision. Additionally, for .nemo models trained in FP8 precision using NVIDIA Transformer Engine, it is possible to directly export them to an inference framework that supports FP8. Such models already come with scaling factors for low-precision GEMMs and do not require any extra calibration.
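A .nemo file is typically a tar archive, so a quick way to confirm what a given checkpoint bundles (the YAML config, the model weights folder, and any tokenizer files) is to list its members. The checkpoint path below is a placeholder.

```python
import tarfile

# List the members of a .nemo archive; expect a YAML config, a model weights
# directory, and tokenizer files if they are bundled. The path is a placeholder.
with tarfile.open("/opt/checkpoints/Nemotron-3-8B-Base-4k.nemo", "r") as archive:
    for member in archive.getnames():
        print(member)
```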
The qnemo checkpoint contains the quantized weights and scaling factors. It follows the TensorRT-LLM checkpoint format (https://nvidia.github.io/TensorRT-LLM/architecture/checkpoint.html), with the addition of a tokenizer. It is derived from a corresponding .nemo model. For detailed information on how to produce a qnemo checkpoint, please refer to the Quantization manual.
Supported GPUs#
TensorRT-LLM supports NVIDIA GPUs based on the Hopper, Ada Lovelace, Ampere, Turing, and Volta architectures, such as the NVIDIA DGX H100 and NVIDIA H100. Certain specialized deployment paths, for example FP8-quantized models, require hardware with FP8 data type support, such as NVIDIA H100 GPUs.
Download a Nemotron NeMo Checkpoint#
Nemotron LLM is a foundational model with 8 billion parameters. It enables customization, including parameter-efficient fine-tuning and continuous pre-training for domain-adapted LLMs. All the LLM deployment examples will use a Nemotron LLM. Please follow the steps below to download a Nemotron NeMo checkpoint.
Download the nemotron-3-8b-base-4k model from the following link:
https://huggingface.co/nvidia/nemotron-3-8b-base-4k
Please find the Nemotron-3-8B-Base-4k.nemo file on the Files and Versions tab (a scripted way to download it is sketched after these steps). Alternatively, you can download the model from the NVIDIA GPU Cloud (NGC) registry.
If you use the NGC registry, fill in an application form to get access to the model.
An approval email will be sent to you along with instructions.
Follow the instructions to download the Nemotron checkpoint file from the NVIDIA GPU Cloud (NGC) registry.
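If you prefer to script the Hugging Face download referenced above, the following sketch uses the huggingface_hub client; it assumes your access request for the gated repository has been approved and that you are authenticated (for example, via huggingface-cli login).

```python
from huggingface_hub import hf_hub_download

# Download the .nemo checkpoint from the gated Hugging Face repository.
# Requires prior access approval and an authenticated Hugging Face token.
checkpoint_path = hf_hub_download(
    repo_id="nvidia/nemotron-3-8b-base-4k",
    filename="Nemotron-3-8B-Base-4k.nemo",
)
print(checkpoint_path)  # local path to the downloaded checkpoint
```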