Deploy NeMo Models by Exporting to vLLM#
This section shows how to use scripts and APIs to export a NeMo LLM to vLLM and deploy it with the NVIDIA Triton Inference Server.
Quick Example#
Follow the steps in the Deploy NeMo LLM main page to generate a NeMo 2.0 Llama checkpoint.
In a terminal, go to the folder where the hf_llama31_8B_nemo2.nemo file is located. Pull down and run the Docker container image using the commands shown below. Change the :vr tag to the version of the container you want to use:

docker pull nvcr.io/nvidia/nemo:vr

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 \
    -v ${PWD}/hf_llama31_8B_nemo2.nemo:/opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    -w /opt/NeMo \
    --name nemo-fw \
    nvcr.io/nvidia/nemo:vr
Install vLLM by executing the following command inside the container:
cd /opt/Export-Deploy
uv sync --inexact --link-mode symlink --locked --extra vllm $(cat /opt/uv_args.txt)
Run the following deployment script to verify that everything is working correctly. The script exports the Llama NeMo checkpoint to vLLM and subsequently serves it on the Triton server:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
    --model_path_id /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --triton_model_name llama \
    --tensor_parallelism_size 1
If the test yields a shared-memory-related error, increase the shared memory size using the --shm-size Docker option (for example, in 50% increments).

In a separate terminal, access the running container as follows:
docker exec -it nemo-fw bash
To send a query to the Triton server, run the following script:
python /opt/Export-Deploy/scripts/deploy/nlp/query.py -mn llama -p "The capital of Canada is" -mol 50
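You can also send the query from Python instead of using the script. The snippet below is a minimal sketch that assumes the query helper is exposed as NemoQueryLLM under nemo_deploy.nlp, as in earlier NeMo releases; the module path and argument names may differ, so check the imports in query.py if it fails.

from nemo_deploy.nlp import NemoQueryLLM  # assumed import path; see query.py for the exact one

# Connect to the Triton server started by deploy_vllm_triton.py.
nq = NemoQueryLLM(url="localhost:8000", model_name="llama")

# Send a prompt and cap the generation length, mirroring the query.py flags above.
output = nq.query_llm(prompts=["The capital of Canada is"], max_output_len=50)
print(output)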
To export and deploy a different model, such as Llama 3, Mixtral, or StarCoder, point the script at that checkpoint; it automatically detects the model type. See the table below for the list of supported model types.
Use a Script to Deploy NeMo LLMs on a Triton Server#
You can deploy an LLM from a NeMo checkpoint on Triton using the provided script.
Export and Deploy an LLM Model#
The script exports the model to vLLM and then starts the service on Triton.
Start the container using the steps described in the Quick Example section.
To begin serving the downloaded model, run the following script:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_vllm_triton.py \
    --model_path_id /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --triton_model_name llama \
    --tensor_parallelism_size 1
The following parameters are defined in the deploy_vllm_triton.py script:

--model_path_id: Path of the .nemo checkpoint folder, or a Hugging Face model ID or path.
--tokenizer: Tokenizer file, if it is not provided in the checkpoint.
--lora_ckpt: List of LoRA checkpoints in Hugging Face format.
--triton_model_name: Name of the model on Triton.
--triton_model_version: Version of the model. Default is 1.
--triton_port: Port for the Triton server to listen for requests. Default is 8000.
--triton_http_address: HTTP address for the Triton server. Default is 0.0.0.0.
--tensor_parallelism_size: Number of GPUs to split the tensors for tensor parallelism. Default is 1.
--dtype: Data type of the deployed model. Choices are "auto", "bfloat16", "float16", and "float32". Default is "auto".
--quantization: Quantization method for vLLM. Choices are "awq", "gptq", and "fp8". Default is None.
--seed: Random seed for reproducibility. Default is 0.
--gpu_memory_utilization: GPU memory utilization fraction for vLLM. Default is 0.9.
--swap_space: Size (GiB) of CPU memory per GPU to use as swap space. Default is 4.
--cpu_offload_gb: Size (GiB) of CPU memory to use for offloading the model weights. Default is 0.
--enforce_eager: Whether to enforce eager execution. Default is False.
--max_seq_len_to_capture: Maximum sequence length covered by CUDA graphs. Default is 8192.
--max_batch_size: Maximum batch size of the model. Default is 8.
--debug_mode: Enables more verbose output.
Note: The parameters described here are generalized and should be compatible with any NeMo checkpoint. It is important, however, that you check the LLM model table on the main Deploy NeMo LLM page for optimized inference model compatibility. We are actively working on extending support to other checkpoints.
The script automatically detects the model type from the checkpoint. Please see the table below to learn more about which models are supported.
Model Name  | Support Status
----------- | --------------
Llama 2     | Supported
Llama 3     | Supported
Gemma       | Supported
StarCoder2  | Supported
Mistral     | Supported
Mixtral     | Supported
Access the models with a Hugging Face token.

If you want to run inference using the StarCoder1, StarCoder2, or Llama 3 models, you'll need to generate a Hugging Face token that has access to these models. Visit Hugging Face (https://huggingface.co/) for more information. After you have the token, perform one of the following steps.

Log in to Hugging Face:
huggingface-cli login
Or, set the HF_TOKEN environment variable:
export HF_TOKEN=your_token_here
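If you prefer to authenticate from Python, for example inside a notebook running in the container, the huggingface_hub library offers an equivalent login helper. The sketch below assumes the token is already stored in the HF_TOKEN environment variable.

import os

from huggingface_hub import login

# Reuse the token from the environment; calling login() without a token prompts interactively instead.
login(token=os.environ["HF_TOKEN"])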
Use NeMo Export and Deploy Module APIs to Run Inference#
Up until now, we have used scripts for exporting and deploying LLM models. However, NeMo’s deploy and export modules offer straightforward APIs for deploying models to Triton and exporting NeMo checkpoints to vLLM.
Export an LLM Model to vLLM#
You can use the APIs in the export module to export a NeMo checkpoint to vLLM. The following code example assumes the hf_llama31_8B_nemo2.nemo checkpoint has already been downloaded and mounted to the /opt/checkpoints/ path.
from nemo_export.vllm_exporter import vLLMExporter
checkpoint_file = "/opt/checkpoints/hf_llama31_8B_nemo2.nemo"
exporter = vLLMExporter()
exporter.export(
model_path_id=checkpoint_file,
tensor_parallel_size=1,
dtype="auto",
)
output = exporter.forward(["What is the best city in the world?"], max_output_len=50, top_k=1, top_p=0.0, temperature=1.0)
print("output: ", output)
Be sure to check the vLLMExporter class docstrings for details.
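The script flags listed above (for example --dtype, --quantization, and --gpu_memory_utilization) suggest that export() also accepts the corresponding vLLM engine options as keyword arguments. The snippet below is only an assumption based on those flags, not a confirmed signature; verify the exact keyword names in the vLLMExporter docstrings before relying on it.

from nemo_export.vllm_exporter import vLLMExporter

exporter = vLLMExporter()
# Hypothetical keyword arguments mirroring the deploy_vllm_triton.py flags;
# confirm the exact names against the vLLMExporter.export docstring.
exporter.export(
    model_path_id="/opt/checkpoints/hf_llama31_8B_nemo2.nemo",
    tensor_parallel_size=2,        # split tensors across two GPUs
    dtype="bfloat16",              # instead of the default "auto"
    gpu_memory_utilization=0.85,   # leave some headroom on each GPU
)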
Deploy an LLM Model on the Triton Server using vLLM#
You can use the APIs in the deploy module to deploy a vLLM model to Triton. First, run the export example above to export the model to vLLM, omitting the forward and print calls at the end. Then initialize the Triton server and serve the model:
from nemo_deploy import DeployPyTriton
nm = DeployPyTriton(model=exporter, triton_model_name="llama", http_port=8000)
nm.deploy()
nm.serve()
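Note that serve() blocks the current process while Triton handles requests; you can query the model from a separate terminal using the query.py script shown earlier. If you need to keep the Python process interactive, a non-blocking pattern along the following lines may be useful. It assumes DeployPyTriton also exposes run() and stop() methods, as in earlier NeMo releases; check the class docstrings to confirm.

from nemo_deploy import DeployPyTriton

nm = DeployPyTriton(model=exporter, triton_model_name="llama", http_port=8000)
nm.deploy()

# Assumed non-blocking alternative to serve(); verify against the DeployPyTriton docstrings.
nm.run()

# ... send requests to the model while this process stays interactive ...

# Shut the Triton server down when finished.
nm.stop()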