
Deploy NeMo Models Using NIM Containers#

NVIDIA NIM is a containerized solution for deploying PyTorch-level NeMo Large Language Models (LLMs) with the TensorRT-LLM backend. This section demonstrates how to deploy these models using a NIM container. Currently, using NeMo 2.0 checkpoints with NIM requires the latest NeMo sources, which you can obtain by cloning the NeMo GitHub repository and mounting them into the NIM container. In future releases, NIM will support NeMo 2.0 checkpoints natively.

An alternative way to deploy NeMo models in NIM is to export the model to a Hugging Face format and use the standard NIM support for the Hugging Face framework. This approach requires that a corresponding Hugging Face implementation of the NeMo model exists. Example workflows that convert NeMo models to their equivalent Hugging Face formats include Llama Embedding and PEFT in NeMo 2.0. Visit these tutorials for detailed instructions on using the export_ckpt API in NeMo.

The instructions below present the direct path of exporting and deploying a NeMo model in a NIM container.

Quickstart#

  1. Follow the steps in the Deploy NeMo LLM main page to generate an example NeMo checkpoint. The checkpoint is typically produced by another workload, such as Supervised Fine-Tuning (SFT).

    In the following example, we will use the Llama 3.1 8B Instruct model and assume it is available at /opt/checkpoints/llama-3.1-8b-instruct-nemo2.

  2. There are two options for producing the TensorRT-LLM engine to be used by NIM:

    • (Recommended) Running the export_to_trt_llm.py script within the NIM container.

    • Running this script using the NeMo Framework container.

    Note

    Using a TensorRT-LLM engine built with the NeMo Framework container and then running it in a NIM container may cause compatibility issues between the TensorRT and/or TensorRT-LLM library versions. The recommended way to deploy NeMo models is therefore to build the engine with the NIM container itself. The subsequent instructions focus on this approach.

  3. To use the latest NeMo codebase and access the export_to_trt_llm.py script, clone the NeMo Framework repository:

    git clone https://github.com/NVIDIA/NeMo.git /opt/NeMo
    

    As mentioned, in future releases the updated NeMo codebase will be shipped within the NIM container.

  4. Go to the NIM NGC container collection and pick a recent NIM image. In this example, we use nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.0-RTX, which ships with TensorRT-LLM 0.17.1. Note that the latest nemo.export module currently requires TensorRT-LLM version 0.16.0 or higher.
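
    Pulling NIM images from nvcr.io requires authentication with an NGC API key. If you are not already logged in, a typical login looks as follows; the NGC_API_KEY environment variable is a placeholder assumed to hold your own key:

    # Log in to the NVIDIA container registry before pulling the image in the next step.
    # The username is the literal string $oauthtoken; NGC_API_KEY is assumed to hold your NGC API key.
    echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin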

  5. Run the container, mounting both the cloned NeMo sources and the folder containing your checkpoint:

    docker run --gpus all -it --rm --shm-size=4g \
       -v /opt/NeMo:/opt/NeMo \
       -v /opt/checkpoints:/opt/checkpoints \
       -w /workspace nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.0-RTX bash
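
    Optionally, once inside the container, you can confirm the TensorRT-LLM version shipped with the image. This sketch assumes that python3 and the tensorrt_llm package are on the container's default path, which is typically the case for NIM LLM images:

    # Print the TensorRT-LLM version bundled in the NIM image (should be 0.16.0 or higher)
    python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"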
    
  6. Run the following commands to build the engine. This example uses a tensor parallelism size of two:

    export PYTHONPATH=/opt/NeMo  # To use the latest code for NeMo 2.0 checkpoint support
    export NIM_MODEL_NAME=/workspace/llama-3.1-8b-instruct-nim  # To inform NIM scripts where the model resides
    
    mkdir $NIM_MODEL_NAME
    
    python /opt/NeMo/scripts/export/export_to_trt_llm.py \
       --nemo_checkpoint /opt/checkpoints/llama-3.1-8b-instruct-nemo2 \
       --model_repository $NIM_MODEL_NAME \
       --tensor_parallelism_size 2
    

    Wait for the command to finish building the TensorRT-LLM engine. It produces a model repository in the format expected by the NIM container.
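
    You can list the options supported by the export script with --help and do a quick sanity check of the generated model repository. The exact repository layout depends on the NIM and TensorRT-LLM versions:

    # Show all options supported by the export script
    python /opt/NeMo/scripts/export/export_to_trt_llm.py --help

    # List the generated model repository
    ls -R $NIM_MODEL_NAME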

  7. Run the following script to start the server. The script is included in the NIM container and typically brings the server up within several minutes, depending on the model size.

    export NIM_SERVED_MODEL_NAME=test-model  # Model name used to query it
    
    /opt/nim/start_server.sh > server.log 2>&1 &
    

    Wait until the server starts. You can inspect the server.log file until it reports that the server is running on the default port 8000.
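
    If you prefer polling the server rather than watching the log, NIM LLM containers typically expose an OpenAI-style readiness endpoint. The sketch below assumes it is available at /v1/health/ready and that curl is installed (the next step installs it):

    # Poll the readiness endpoint until the server responds successfully
    until curl -sf http://0.0.0.0:8000/v1/health/ready > /dev/null; do
      echo "Server not ready yet, retrying in 10 seconds..."
      sleep 10
    done
    echo "Server is ready."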

  8. Once the server is up, you can query it with an example prompt using curl:

    apt-get update && apt-get install -y curl
    
    curl -X 'POST' \
      'http://0.0.0.0:8000/v1/completions' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "'"$NIM_SERVED_MODEL_NAME"'",
        "prompt": "hello world!",
        "top_p": 1,
        "n": 1,
        "max_tokens": 15,
        "stream": false,
        "frequency_penalty": 1.0,
        "stop": ["hello"]
      }'
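
    To confirm the name under which the model is served, you can also query the model list. This assumes the standard OpenAI-compatible /v1/models endpoint, which NIM containers typically provide:

    # List the models registered with the server; the returned id should match $NIM_SERVED_MODEL_NAME
    curl -s 'http://0.0.0.0:8000/v1/models' -H 'accept: application/json'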