Deploy NeMo Models by Exporting vLLM#
This section shows how to use scripts and APIs to export a NeMo LLM to vLLM and deploy it with the NVIDIA Triton Inference Server.
Quick Example#
Download a supported model checkpoint in NeMo format from NGC. This example uses the Gemma 7B Base model. An NGC account is required to run the following command; visit NGC to create one if needed, then run:
ngc registry model download-version "nvidia/nemo/gemma_7b_base:1.1"
Pull down and run the Docker container image using the commands shown below. Change the :vr tag to the version of the container you want to use:

docker pull nvcr.io/nvidia/nemo:vr

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}/gemma_7b_base_v1.1:/opt/checkpoints/gemma_7b_base_v1.1 -w /opt/NeMo nvcr.io/nvidia/nemo:vr
In the container, activate the virtual environment (venv) that contains the vLLM installation.
source /opt/venv/bin/activate
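Optionally, you can confirm that vLLM is available in the activated environment with a quick Python check (a minimal sanity check, not required for deployment):

# Optional sanity check: confirm vLLM is importable in the activated venv.
import vllm

print(vllm.__version__)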
Run the following deployment script to verify that everything is working correctly. The script exports the Gemma NeMo checkpoint to vLLM and subsequently serves it on the Triton server:
python scripts/deploy/nlp/deploy_vllm_triton.py --nemo_checkpoint /opt/checkpoints/gemma_7b_base_v1.1/pytorch-7b-pt.nemo --model_type gemma --triton_model_name gemma --tensor_parallelism_size 1
In a separate terminal, run the following command to get the container ID of the running nvcr.io/nvidia/nemo:vr container:

docker ps
Access the running container, replacing container_id in the command below with the actual container ID:

docker exec -it container_id bash
To send a query to the Triton server, run the following script:
python scripts/deploy/nlp/query.py -mn gemma -p "The capital of Canada is" -mol 50
To export and deploy a different model such as Llama3, Mixtral, or Starcoder, change the model_type argument passed to the scripts/deploy/nlp/deploy_vllm_triton.py script. See the list of supported model types below.
Use a Script to Deploy NeMo LLMs on a Triton Server#
You can deploy an LLM from a NeMo checkpoint on Triton using the provided script.
Export and Deploy an LLM Model#
The script exports the model to vLLM and then starts the service on Triton.
Start the container using the steps described in the Quick Example section.
To begin serving the downloaded model, run the following script:
python scripts/deploy/nlp/deploy_vllm_triton.py --nemo_checkpoint /opt/checkpoints/gemma_7b_base_v1.1/pytorch-7b-pt.nemo --model_type gemma --triton_model_name gemma --tensor_parallelism_size 1
The following parameters are defined in the deploy_vllm_triton.py script:

--nemo_checkpoint: path of the .nemo or .qnemo checkpoint file.
--model_type: type of the model. Can be "llama", "mistral", "mixtral", "starcoder2", or "gemma".
--triton_model_name: name of the model on Triton.
--triton_model_version: version of the model. Default is 1.
--triton_port: port for the Triton server to listen for requests. Default is 8000.
--triton_http_address: HTTP address for the Triton server. Default is 0.0.0.0.
--triton_model_repository: temporary folder for weight conversion. Default is a new folder in /tmp/.
--tensor_parallelism_size: number of GPUs to split the tensors across for tensor parallelism. Default is 1.
--dtype: data type of the deployed model. Default is "bfloat16".
--max_model_len: maximum input + output length of the model. Default is 512.
--max_batch_size: maximum batch size of the model. Default is 8.
--debug_mode: enables more verbose output.
--weight_storage: strategy for storing converted weights for vLLM. Can be "auto", "cache", "file", or "memory". Use --help for more information.
Note
The parameters described here are generalized and should be compatible with any NeMo checkpoint. However, be sure to check the LLM model table on the main Deploy NeMo LLM page for optimized inference model compatibility. We are actively working on extending support to other checkpoints.
To export and deploy a different model such as Llama3, Mixtral, or Starcoder, change the model_type parameter passed to the scripts/deploy/nlp/deploy_vllm_triton.py script. See the table below to learn which model_type to use for each LLM.
Model Name | model_type
Llama 2    | llama
Llama 3    | llama
Gemma      | gemma
StarCoder2 | starcoder2
Mistral    | mistral
Mixtral    | mixtral
Export faster by caching weights.
Whenever the deployment script is executed, it initiates the service by exporting the NeMo checkpoint to vLLM, which includes converting weights to a compatible format. By default, for a single-GPU use case, the conversion happens in-memory and is quick. For multiple GPUs, the conversion happens through a temporary file, and there is an option to keep that file between runs for quicker deployment. To do that, you’ll need to create an empty directory and make it available to the deployment script.
Stop the running container and then run the following command to specify an empty directory:
mkdir tmp_triton_model_repository

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}:/opt/checkpoints/ -w /opt/NeMo nvcr.io/nvidia/nemo:vr python scripts/deploy/nlp/deploy_vllm_triton.py --nemo_checkpoint /opt/checkpoints/gemma_7b_base_v1.1/pytorch-7b-pt.nemo --model_type=gemma --triton_model_name gemma --triton_model_repository /opt/checkpoints/tmp_triton_model_repository --weight_storage cache --tensor_parallelism_size 1

The --weight_storage cache setting indicates that weights will be converted through a file in the directory specified by --triton_model_repository. This file is only overwritten if it is older than the input .nemo file.
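For illustration only, the staleness check amounts to comparing modification times, roughly as in the sketch below. This is not the actual implementation, and the cached file name is hypothetical:

import os

nemo_file = "/opt/checkpoints/gemma_7b_base_v1.1/pytorch-7b-pt.nemo"
cached_weights = "/opt/checkpoints/tmp_triton_model_repository/converted_weights.bin"  # hypothetical file name

# Reuse the cached conversion only if it is newer than the input .nemo checkpoint;
# otherwise the exporter regenerates it.
reuse_cache = (
    os.path.exists(cached_weights)
    and os.path.getmtime(cached_weights) >= os.path.getmtime(nemo_file)
)
print("reuse cached weights:", reuse_cache)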
Access the models with a Hugging Face token.
If you want to run inference using the StarCoder1, StarCoder2, or Llama3 models, you'll need to generate a Hugging Face token that has access to these models. Visit Hugging Face for more information. After you have the token, perform one of the following steps.
Log in to Hugging Face:
huggingface-cli login
Or, set the HF_TOKEN environment variable:
export HF_TOKEN=your_token_here
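Alternatively, you can authenticate from Python. A minimal sketch, assuming the huggingface_hub package is available in the container:

from huggingface_hub import login

# Authenticate with Hugging Face; replace the placeholder with your token.
login(token="your_token_here")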
Use NeMo Export and Deploy Module APIs to Run Inference#
So far, we have used scripts to export and deploy LLMs. NeMo's Deploy and Export modules also offer straightforward APIs for exporting NeMo checkpoints to vLLM and deploying them to Triton.
Export an LLM Model to vLLM#
You can use the APIs in the export module to export a NeMo checkpoint to vLLM. The following code example assumes the gemma_7b_base_v1.1/pytorch-7b-pt.nemo checkpoint has already been downloaded and mounted to the /opt/checkpoints/ path.
from nemo.export.vllm_exporter import vLLMExporter
import os

checkpoint_file = "/opt/checkpoints/gemma_7b_base_v1.1/pytorch-7b-pt.nemo"
model_dir = "/opt/checkpoints/gemma_7b_base_v1.1/vllm_export"

# Ensure that the export directory exists
os.makedirs(model_dir, exist_ok=True)
# Export the checkpoint to vLLM, prepare for inference
exporter = vLLMExporter()
exporter.export(
nemo_checkpoint=checkpoint_file,
model_dir=model_dir,
model_type="gemma",
)
# Run inference and print the output
output = exporter.forward(["What is the best city in the world?"], max_output_len=50, top_k=1, top_p=0.0, temperature=1.0)
print("output: ", output)
Be sure to check the vLLMExporter class docstrings for details.
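If you need to set export options programmatically (for example, tensor parallelism or the maximum sequence length), vLLMExporter.export accepts additional keyword arguments. The keyword names in the sketch below are assumptions that mirror the script flags; confirm the exact names against the vLLMExporter docstrings before use.

# Hedged sketch: the keyword names below are assumed counterparts of the script flags;
# verify them against the vLLMExporter docstrings.
exporter = vLLMExporter()
exporter.export(
    nemo_checkpoint=checkpoint_file,
    model_dir=model_dir,
    model_type="gemma",
    tensor_parallel_size=1,   # assumed counterpart of --tensor_parallelism_size
    max_model_len=512,        # assumed counterpart of --max_model_len
    dtype="bfloat16",         # assumed counterpart of --dtype
)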
Deploy an LLM Model on the Triton Server using vLLM#
You can use the APIs in the deploy module to deploy a vLLM model to Triton. First, export the model to vLLM as shown in the example above, dropping the forward and print calls at the end. Then initialize the Triton server and serve the model:
from nemo.deploy import DeployPyTriton
nm = DeployPyTriton(model=exporter, triton_model_name="gemma", port=8000)
nm.deploy()
nm.serve()
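Once nm.serve() is called, the server blocks and waits for requests. From a separate process, you can query the deployed model. The sketch below assumes NeMo's NemoQueryLLM client is available in the container; the query.py script from the Quick Example works as well.

from nemo.deploy.nlp import NemoQueryLLM

# Connect to the Triton server started above and send a prompt.
nq = NemoQueryLLM(url="localhost:8000", model_name="gemma")
output = nq.query_llm(prompts=["The capital of Canada is"], max_output_len=50)
print(output)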