Deploy NeMo Large Language Models#

The NeMo Framework provides scripts and APIs to deploy NeMo LLMs to the NVIDIA Triton Inference Server. You have the option to export a NeMo LLM to TensorRT-LLM or vLLM for optimized inference before deploying to the inference server. Note that this optimized deployment path supports only selected LLM models.

The sections below describe how to deploy inference-optimized models and in-framework (PyTorch-level) models, how to apply quantization, and how to send queries to deployed models.
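
As a quick orientation, the sketch below shows the general shape of the TensorRT-LLM path: export the checkpoint with the TensorRTLLM exporter, then serve it with DeployPyTriton. The checkpoint path, engine directory, and model name are placeholders, and exact argument names can vary between NeMo Framework releases.

    from nemo.deploy import DeployPyTriton
    from nemo.export.tensorrt_llm import TensorRTLLM

    # Build a TensorRT-LLM engine from a NeMo checkpoint (paths are placeholders).
    exporter = TensorRTLLM(model_dir="/opt/checkpoints/llama_trt_llm_engine")
    exporter.export(
        nemo_checkpoint_path="/opt/checkpoints/hf_llama31_8B_nemo2.nemo",
        model_type="llama",
    )

    # Serve the exported engine on the NVIDIA Triton Inference Server.
    nm = DeployPyTriton(model=exporter, triton_model_name="llama", port=8000)
    nm.deploy()
    nm.serve()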

Supported NeMo Checkpoint Formats#

The NeMo Framework saves models as .nemo files, which include model data such as weights and configuration files. The format of these .nemo files has evolved over time. The framework supports deploying Megatron Core-based NeMo models using the distributed checkpoint format. If you saved the model using one of the latest NeMo Framework containers, you should not encounter any issues. Otherwise, you may receive an error message regarding the .nemo file format.

NeMo 1.0 checkpoints are either of the .nemo or qnemo type. Both are supported for deployment. NeMo 2.0 saves all model-related files into a folder.

The NeMo 1.0 .nemo checkpoint and the NeMo 2.0 model folder include the model weights in the default precision. Each consists of a YAML config file, a folder for model weights, and the tokenizer (if it is not available online). Models are trained and stored in this format with weight values in FP16 or BF16 precision. Additionally, .nemo models trained in FP8 precision with the NVIDIA Transformer Engine can be exported directly to an inference framework that supports FP8. Such models already include scaling factors for low-precision GEMMs and do not require any extra calibration.
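
For reference, a NeMo 1.0 .nemo file is a tar archive, so its contents (the YAML config, the weights folder, and any packaged tokenizer files) can be listed directly, while a NeMo 2.0 checkpoint is a regular directory. The paths below are placeholders.

    import os
    import tarfile

    # A NeMo 1.0 checkpoint is a tar archive; list its members to see the
    # YAML config, the model weights, and any packaged tokenizer files.
    with tarfile.open("/opt/checkpoints/model.nemo") as archive:
        print(archive.getnames())

    # A NeMo 2.0 checkpoint is a plain folder, so its layout is visible directly.
    print(os.listdir("/opt/checkpoints/hf_llama31_8B_nemo2.nemo"))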

The qnemo checkpoint contains the quantized weights and scaling factors. It follows the TensorRT-LLM checkpoint format with the addition of a tokenizer and is derived from a corresponding .nemo model. For detailed information on how to produce a qnemo checkpoint, refer to the Quantization documentation.
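
As a rough sanity check, a qnemo checkpoint can be inspected for the TensorRT-LLM style artifacts (a config.json describing the quantization settings and per-rank weight files) alongside the tokenizer. The directory name below is a placeholder, and exact file names may vary by version.

    import os

    # Placeholder path to a qnemo checkpoint directory.
    qnemo_dir = "/opt/checkpoints/llama_qnemo"
    files = os.listdir(qnemo_dir)
    print(files)

    # A TensorRT-LLM style checkpoint is expected to include a config.json;
    # tokenizer files ship alongside it in the qnemo format.
    assert "config.json" in files, "missing TensorRT-LLM checkpoint config"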

The NeMo 1.0 and NeMo 2.0 checkpoint formats are supported by the TensorRT-LLM and vLLM inference-optimized deployment options, whereas in-framework (PyTorch-level) deployment supports only NeMo 2.0 models.

Supported GPUs#

TensorRT-LLM supports NVIDIA GPUs based on the Hopper, Ada Lovelace, Ampere, Turing, and Volta architectures, including NVIDIA DGX H100 systems and NVIDIA H100 GPUs. Certain specialized deployment paths, such as FP8-quantized models, require hardware with FP8 data type support, such as NVIDIA H100 GPUs.
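
If you are unsure whether a GPU can run the FP8 path, a quick check is the CUDA compute capability reported by PyTorch: FP8 support is available starting with Ada Lovelace (SM 8.9) and Hopper (SM 9.0). This snippet is a convenience check, not part of the NeMo API.

    import torch

    # FP8 Tensor Cores are available on Ada Lovelace (SM 8.9), Hopper (SM 9.0), and newer GPUs.
    major, minor = torch.cuda.get_device_capability()
    supports_fp8 = (major, minor) >= (8, 9)
    print(f"Compute capability {major}.{minor}; FP8 supported: {supports_fp8}")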

Generate a NeMo 2.0 Checkpoint#

Please follow the steps below to generate a NeMo 2.0 checkpoint and use it to test the export and deployment paths for NeMo models. A brief query sketch follows the steps.

  1. If you need to get access to the Llama models, please visit the Llama 3.1 Hugging Face page <https://huggingface.co/meta-llama/Llama-3.1-8B>.

  2. Pull down and run the Docker container image using the commands shown below. Replace the :vr tag with the version of the container you want to use:

    docker pull nvcr.io/nvidia/nemo:vr
    
    docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}/:/opt/checkpoints/ -w /opt/NeMo nvcr.io/nvidia/nemo:vr
    
  3. Run the following command in the terminal and enter your Hugging Face access token to log in to Hugging Face.

    huggingface-cli login
    
  4. Run the following Python code to generate the NeMo 2.0 checkpoint.

    from pathlib import Path

    from nemo.collections.llm import import_ckpt
    from nemo.collections.llm.gpt.model.llama import Llama31Config8B, LlamaModel

    if __name__ == "__main__":
        # Convert the Hugging Face Llama 3.1 8B weights into the NeMo 2.0
        # checkpoint format and write the result to the mounted checkpoint folder.
        import_ckpt(
            model=LlamaModel(Llama31Config8B(seq_length=16384)),
            source='hf://meta-llama/Llama-3.1-8B',
            output_path=Path('/opt/checkpoints/hf_llama31_8B_nemo2.nemo')
        )
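
Once the checkpoint has been generated and deployed (for example, via the TensorRT-LLM path sketched earlier), you can send queries to the Triton server. The snippet below is a minimal sketch using the NemoQueryLLM helper; the model name must match the name used at deployment time, and the prompt is a placeholder.

    from nemo.deploy.nlp import NemoQueryLLM

    # Query a model served by Triton on localhost:8000. The model_name must match
    # the triton_model_name that was used when the model was deployed.
    nq = NemoQueryLLM(url="localhost:8000", model_name="llama")
    output = nq.query_llm(
        prompts=["What is the color of a banana?"],
        max_output_len=32,
    )
    print(output)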