Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the Migration Guide for information on getting started.

Deploy NeMo Models in the Framework

This section demonstrates how to deploy PyTorch-level NeMo LLMs within the framework (referred to as ‘In-Framework’) using the NVIDIA Triton Inference Server.

Quick Example

  1. Follow the steps in the Deploy NeMo LLM main page to download the nemotron-3-8b-base-4k model.

  2. Pull down and run the Docker container image using the commands shown below. Change the :vr tag to the version of the container that you want to use:

    docker pull nvcr.io/nvidia/nemo:vr
    
    docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}/Nemotron-3-8B-Base-4k.nemo:/opt/checkpoints/Nemotron-3-8B-Base-4k.nemo -w /opt/NeMo nvcr.io/nvidia/nemo:vr
    
  3. Run the following deployment script to verify that everything is working correctly. The script directly serves the .nemo model on the Triton server:

    python scripts/deploy/nlp/deploy_inframework_triton.py --nemo_checkpoint /opt/checkpoints/Nemotron-3-8B-Base-4k.nemo --triton_model_name nemotron
    
  4. If the test yields a shared memory-related error, increase the shared memory size passed to docker run with the --shm-size option.

  5. In a separate terminal, run the following command to list the running containers. Locate the container that uses the nvcr.io/nvidia/nemo:vr image and note its container ID.

    docker ps
    
  6. Access the running container, replacing container_id with the actual container ID:

    docker exec -it container_id bash
    
  7. To send a query to the Triton server, run the following script:

    python scripts/deploy/nlp/query_inframework.py -mn nemotron -p "What is the color of a banana?" -mol 5
    

Any model that is compatible with the MegatronGPTModel class in collections/nlp/models/language_modeling should work with this script.
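
If you prefer to send the query from Python instead of the query_inframework.py script, you can use the NemoQueryLLM client from the Deploy module inside the container. The snippet below is a minimal sketch; the max_output_len parameter name is an assumption, so check scripts/deploy/nlp/query_inframework.py for the exact options supported by your NeMo version.

    from nemo.deploy.nlp import NemoQueryLLM

    # Connect to the Triton server started by deploy_inframework_triton.py
    nq = NemoQueryLLM(url="localhost:8000", model_name="nemotron")

    # Send a single prompt; max_output_len is assumed here, verify against query_inframework.py
    output = nq.query_llm(prompts=["What is the color of a banana?"], max_output_len=5)
    print(output)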

Use a Script to Deploy NeMo LLMs on a Triton Server

You can deploy an LLM from a NeMo checkpoint on Triton using the provided script.

Deploy a NeMo LLM Model

Executing the script will directly deploy the in-framework (.nemo) model and initiate the service on Triton.

  1. Start the container using the steps described in the Quick Example section.

  2. To begin serving the downloaded model, run the following script:

    python scripts/deploy/nlp/deploy_inframework_triton.py --nemo_checkpoint /opt/checkpoints/Nemotron-3-8B-Base-4k.nemo --triton_model_name nemotron
    

    The following parameters are defined in the deploy_inframework_triton.py script:

    • --nemo_checkpoint: path of the .nemo or .qnemo checkpoint file.

    • --triton_model_name: name of the model on Triton.

    • --triton_model_version: version of the model. Default is 1.

    • --triton_port: port for the Triton server to listen for requests. Default is 8000.

    • --triton_http_address: HTTP address for the Triton server. Default is 0.0.0.0.

    • --num_gpus: number of GPUs to use for inference. Large models require multi-GPU deployment. This parameter is deprecated.

    • --max_batch_size: maximum batch size of the model. Default is 8.

    • --debug_mode: enables additional debug logging messages from the script.

  3. To deploy a different model, change the --nemo_checkpoint argument passed to the scripts/deploy/nlp/deploy_inframework_triton.py script. Any model that is compatible with NeMo's MegatronLLMDeployable class can be used here. Nemotron, Llama2, and Mistral have been tested and confirmed to work.

  4. Access the models with a Hugging Face token.

    If you want to run inference using the StarCoder1, StarCoder2, or Llama3 models, you’ll need to generate a Hugging Face token that has access to these models. Visit Hugging Face for more information. After you have the token, perform one of the following steps.

    • Log in to Hugging Face:

      huggingface-cli login
      
    • Or, set the HF_TOKEN environment variable:

      export HF_TOKEN=your_token_here
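
    Alternatively, you can authenticate from Python. The snippet below is a minimal sketch; it assumes the huggingface_hub package is available in the container, and your_token_here stands in for your actual token.

      # Option 1: set the token for the current process only
      import os
      os.environ["HF_TOKEN"] = "your_token_here"

      # Option 2: log in with huggingface_hub (assumes the package is available)
      from huggingface_hub import login
      login(token="your_token_here")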
      

Use NeMo Deploy Module APIs to Run Inference

So far, we have used scripts to deploy LLMs. NeMo’s Deploy module also offers straightforward APIs for deploying models to Triton.

Deploy an In-Framework LLM Model

You can use the APIs in the deploy module to deploy an in-framework model to Triton. The following code example assumes the Nemotron-3-8B-Base-4k.nemo checkpoint has already been downloaded and mounted to the /opt/checkpoints/ path.

  1. Run the following code:

    from nemo.deploy.nlp import MegatronLLMDeployable
    from nemo.deploy import DeployPyTriton
    
    # Load the in-framework (.nemo) checkpoint as a deployable model
    megatron_deployable = MegatronLLMDeployable("/opt/checkpoints/Nemotron-3-8B-Base-4k.nemo")
    
    # Wrap the model for Triton and expose it under the name "nemotron" on port 8000
    nm = DeployPyTriton(model=megatron_deployable, triton_model_name="nemotron", port=8000)
    
    nm.deploy()
    nm.serve()
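
    The nm.serve() call keeps the server running in the foreground. If you want to deploy and query the model from the same Python session, a pattern like the following can be used. This is a minimal sketch that assumes DeployPyTriton also provides non-blocking run() and stop() methods and that the NemoQueryLLM client is available in nemo.deploy.nlp; verify the exact API, including the max_output_len parameter name, against your NeMo version.

    from nemo.deploy import DeployPyTriton
    from nemo.deploy.nlp import MegatronLLMDeployable, NemoQueryLLM
    
    megatron_deployable = MegatronLLMDeployable("/opt/checkpoints/Nemotron-3-8B-Base-4k.nemo")
    nm = DeployPyTriton(model=megatron_deployable, triton_model_name="nemotron", port=8000)
    
    nm.deploy()
    nm.run()  # assumed non-blocking alternative to nm.serve()
    
    # Query the model from the same process
    nq = NemoQueryLLM(url="localhost:8000", model_name="nemotron")
    output = nq.query_llm(prompts=["What is the color of a banana?"], max_output_len=5)
    print(output)
    
    nm.stop()  # shut down the Triton server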