Deploy NeMo Models in the Framework#
This section demonstrates how to deploy PyTorch-level NeMo LLMs within the framework (referred to as ‘In-Framework’) using the NVIDIA Triton Inference Server.
Quick Example#
Follow the steps in the Deploy NeMo LLM main page to download the nemotron-3-8b-base-4k model.
Pull down and run the Docker container image using the command shown below. Change the :vr tag to the version of the container you want to use:

docker pull nvcr.io/nvidia/nemo:vr

docker run --gpus all -it --rm --shm-size=4g -p 8000:8000 -v ${PWD}/Nemotron-3-8B-Base-4k.nemo:/opt/checkpoints/Nemotron-3-8B-Base-4k.nemo -w /opt/NeMo nvcr.io/nvidia/nemo:vr
Run the following deployment script to verify that everything is working correctly. The script directly serves the .nemo model on the Triton server:
python scripts/deploy/nlp/deploy_inframework_triton.py --nemo_checkpoint /opt/checkpoints/Nemotron-3-8B-Base-4k.nemo --triton_model_name nemotron
If the test yields a shared memory-related error, increase the shared memory size passed to docker run via --shm-size (gradually, by 50% for example).

In a separate terminal, run the following command to get the container ID of the running container. Look for the nvcr.io/nvidia/nemo:vr image to find the container ID:

docker ps
Access the running container and replace container_id with the actual container ID as follows:

docker exec -it container_id bash
To send a query to the Triton server, run the following script:
python scripts/deploy/nlp/query_inframework.py -mn nemotron -p "What is the color of a banana?" -mol 5
Any model that is compatible with the MegatronGPTModel class in collections/nlp/models/language_modeling should work with this script.
Use a Script to Deploy NeMo LLMs on a Triton Server#
You can deploy an LLM from a NeMo checkpoint on Triton using the provided script.
Deploy a NeMo LLM Model#
Executing the script will directly deploy the in-framework (.nemo) model and initiate the service on Triton.
Start the container using the steps described in the Quick Example section.
To begin serving the downloaded model, run the following script:
python scripts/deploy/nlp/deploy_inframework_triton.py --nemo_checkpoint /opt/checkpoints/Nemotron-3-8B-Base-4k.nemo --triton_model_name nemotron
The following parameters are defined in the deploy_inframework_triton.py script:

--nemo_checkpoint: path of the .nemo or .qnemo checkpoint file.

--triton_model_name: name of the model on Triton.

--triton_model_version: version of the model. Default is 1.

--triton_port: port for the Triton server to listen for requests. Default is 8000.

--triton_http_address: HTTP address for the Triton server. Default is 0.0.0.0.

--num_gpus: number of GPUs to use for inference. Large models require multi-GPU export. This parameter is deprecated.

--max_batch_size: maximum batch size of the model. Default is 8.

--debug_mode: enables additional debug logging messages from the script.
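For example, to serve the same checkpoint on a different Triton port with a larger batch size, you can override the corresponding defaults. The port and batch size values below are illustrative only:

python scripts/deploy/nlp/deploy_inframework_triton.py --nemo_checkpoint /opt/checkpoints/Nemotron-3-8B-Base-4k.nemo --triton_model_name nemotron --triton_port 8080 --max_batch_size 16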
To deploy a different model, just change the --nemo_checkpoint argument passed to the scripts/deploy/nlp/deploy_inframework_triton.py script. Any model that is compatible with NeMo's MegatronLLMDeployable class can be used here. Nemotron, Llama2, and Mistral have been tested and confirmed to work.

Access the models with a Hugging Face token.

If you want to run inference using the StarCoder1, StarCoder2, or Llama3 models, you'll need to generate a Hugging Face token that has access to these models. Visit Hugging Face for more information. After you have the token, perform one of the following steps.
Log in to Hugging Face:
huggingface-cli login
Or, set the HF_TOKEN environment variable:
export HF_TOKEN=your_token_here
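If you prefer to authenticate from Python instead of the shell, the huggingface_hub package provides an equivalent login helper. This is an optional alternative to the two steps above and assumes huggingface_hub is installed in the container:

# Optional alternative to huggingface-cli login / HF_TOKEN.
# Assumes the huggingface_hub package is available in the container.
from huggingface_hub import login

login(token="your_token_here")  # replace with your actual Hugging Face token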
Use NeMo Deploy Module APIs to Run Inference#
Up until now, we have used scripts to deploy LLMs. However, NeMo's deploy module also offers straightforward APIs for deploying models to Triton.
Deploy an In-Framework LLM to Triton#
You can use the APIs in the deploy module to deploy an in-framework model to Triton. The following code example assumes the Nemotron-3-8B-Base-4k.nemo checkpoint has already been downloaded and mounted to the /opt/checkpoints/ path.
Run the following command:
from nemo.deploy.nlp import MegatronLLMDeployable
from nemo.deploy import DeployPyTriton

megatron_deployable = MegatronLLMDeployable("/opt/checkpoints/Nemotron-3-8B-Base-4k.nemo")
nm = DeployPyTriton(model=megatron_deployable, triton_model_name="nemotron", port=8000)
nm.deploy()
nm.serve()
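Note that nm.serve() blocks the current process while the Triton server is running. To send a request from another process or terminal, you can use the query script from the Quick Example, or query the server through NeMo's Python query helpers in nemo.deploy.nlp. The snippet below is a minimal sketch that assumes the NemoQueryLLMPyTorch class and these argument names; both can vary between NeMo releases, so check scripts/deploy/nlp/query_inframework.py for the usage that matches your container.

# Minimal query sketch (assumptions: the server above is listening on
# localhost:8000 and the model was registered under the name "nemotron").
from nemo.deploy.nlp import NemoQueryLLMPyTorch

nq = NemoQueryLLMPyTorch(url="localhost:8000", model_name="nemotron")
# max_length caps the number of generated tokens; the parameter name may
# differ in your NeMo release.
output = nq.query_llm(prompts=["What is the color of a banana?"], max_length=5)
print(output)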