Deploy NeMo Models using Ray#
This section demonstrates how to deploy NeMo LLM models using Ray Serve (referred to as ‘Ray for NeMo Models’). Ray deployment provides scalable and flexible serving for NeMo models, including automatic scaling, load balancing, and multi-replica deployment with support for advanced parallelism strategies.
Note: Currently, only single-node deployment is supported.
Quick Example#
Follow the steps in the Deploy NeMo LLM main page to generate a NeMo 2.0 checkpoint.
Pull and run the Docker container image. Replace :vr with your desired version:

docker pull nvcr.io/nvidia/nemo:vr
docker run --gpus all -it --rm \
    --shm-size=4g \
    -p 1024:1024 \
    -v ${PWD}/:/opt/checkpoints/ \
    -w /opt/Export-Deploy \
    --name nemo-fw \
    nvcr.io/nvidia/nemo:vr
Deploy the NeMo model to Ray:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --model_id llama \
    --num_replicas 1 \
    --num_gpus 1 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0"
In a separate terminal, access the running container as follows:
docker exec -it nemo-fw bash
Test the deployed model:
python scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama \
    --host 0.0.0.0 \
    --port 1024
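You can also confirm that the server is up before running the query script. Assuming the default host and port used in the deployment command above, a direct health check might look like this:

curl http://localhost:1024/v1/health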
Detailed Deployment Guide#
Deploy a NeMo LLM Model#
Follow these steps to deploy your NeMo model on Ray Serve:
Start the container as shown in the Quick Example section.
Deploy your model:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --model_id llama \
    --num_replicas 1 \
    --num_gpus 2 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0,1"
Available Parameters:
--nemo_checkpoint: Path to the .nemo checkpoint file (required).
--num_gpus: Number of GPUs to use per node. Default is 1.
--tensor_model_parallel_size: Size of the tensor model parallelism. Default is 1.
--pipeline_model_parallel_size: Size of the pipeline model parallelism. Default is 1.
--expert_model_parallel_size: Size of the expert model parallelism. Default is 1.
--context_parallel_size: Size of the context parallelism. Default is 1.
--model_id: Identifier for the model in the API responses. Default is “nemo-model”.
--host: Host address to bind the Ray Serve server to. Default is 0.0.0.0.
--port: Port number to use for the Ray Serve server. Default is 1024.
--num_cpus: Number of CPUs to allocate for the Ray cluster. If None, all available CPUs are used.
--num_cpus_per_replica: Number of CPUs per model replica. Default is 8.
--include_dashboard: Whether to include the Ray dashboard for monitoring.
--cuda_visible_devices: Comma-separated list of CUDA visible devices. Default is “0,1”.
--enable_cuda_graphs: Whether to enable CUDA graphs for faster inference.
--enable_flash_decode: Whether to enable Flash Attention decode.
--num_replicas: Number of replicas for the deployment. Default is 1.
--legacy_ckpt: Whether to use the legacy checkpoint format.
To use a different model, set --nemo_checkpoint to the path of your .nemo checkpoint file.
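For example, assuming a hypothetical checkpoint stored at /opt/checkpoints/my_model_nemo2.nemo (adjust the path and model identifier to match your setup), the deployment command might look like this:

python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint /opt/checkpoints/my_model_nemo2.nemo \
    --model_id my_model \
    --num_replicas 1 \
    --num_gpus 1 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0"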
Configure Model Parallelism#
NeMo models support advanced parallelism strategies for large model deployment:
Tensor Model Parallelism: Splits the weights of each model layer across multiple GPUs:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --model_id large_llama \
    --num_gpus 4 \
    --tensor_model_parallel_size 4 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0,1,2,3"
Pipeline Model Parallelism: Distributes model layers sequentially across GPUs:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --model_id large_llama \
    --num_gpus 4 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 4 \
    --cuda_visible_devices "0,1,2,3"
Combined Parallelism: Uses both tensor and pipeline parallelism:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --model_id large_llama \
    --num_gpus 8 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 4 \
    --cuda_visible_devices "0,1,2,3,4,5,6,7"
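In this combined setup, each replica spans --tensor_model_parallel_size × --pipeline_model_parallel_size = 2 × 4 = 8 GPUs, which matches --num_gpus 8 and the eight devices listed in --cuda_visible_devices.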
Deploy Multiple Replicas#
Deploy multiple replicas of your NeMo model for increased throughput:
Single GPU per replica:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --model_id llama \
    --num_replicas 4 \
    --num_gpus 4 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0,1,2,3"
Multiple GPUs per replica with tensor parallelism:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --model_id large_llama \
    --num_replicas 2 \
    --num_gpus 8 \
    --tensor_model_parallel_size 4 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0,1,2,3,4,5,6,7"
Important GPU Configuration Notes:
GPUs per replica = Total GPUs ÷ --num_replicas.
Each replica needs --tensor_model_parallel_size × --pipeline_model_parallel_size × --context_parallel_size GPUs (a worked example follows this list).
Ensure --cuda_visible_devices lists all GPUs that will be used.
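As a worked example of these rules (a minimal sketch, assuming a hypothetical deployment with 8 total GPUs, 2 replicas, and tensor parallelism of 4), the arithmetic can be checked as follows:

# Hypothetical values; substitute the numbers from your own deployment command
TOTAL_GPUS=8        # --num_gpus
NUM_REPLICAS=2      # --num_replicas
TP=4; PP=1; CP=1    # --tensor/pipeline/context parallel sizes

GPUS_PER_REPLICA=$((TOTAL_GPUS / NUM_REPLICAS))  # 8 / 2 = 4 GPUs available per replica
NEEDED_PER_REPLICA=$((TP * PP * CP))             # 4 * 1 * 1 = 4 GPUs required per replica

if [ "$GPUS_PER_REPLICA" -eq "$NEEDED_PER_REPLICA" ]; then
    echo "Parallelism configuration is consistent"
else
    echo "Mismatch: adjust --num_replicas or the parallel sizes"
fi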
Optimize Performance#
Enable performance optimizations for faster inference:
CUDA Graphs: Reduces kernel launch overhead:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --model_id llama \
    --enable_cuda_graphs \
    --num_gpus 2 \
    --tensor_model_parallel_size 2 \
    --cuda_visible_devices "0,1"
Flash Attention Decode: Optimizes attention computation:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --model_id llama \
    --enable_flash_decode \
    --num_gpus 2 \
    --tensor_model_parallel_size 2 \
    --cuda_visible_devices "0,1"
Combined Optimizations:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --model_id llama \
    --enable_cuda_graphs \
    --enable_flash_decode \
    --num_gpus 4 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2 \
    --cuda_visible_devices "0,1,2,3"
Test Ray Deployment#
Use the query_ray_deployment.py script to test your deployed NeMo model:
Basic testing:
python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama \
    --host 0.0.0.0 \
    --port 1024
The script will test multiple endpoints:
Health check endpoint: /v1/health
Models list endpoint: /v1/models
Text completions endpoint: /v1/completions/
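You can also query these endpoints directly. Assuming the default host and port, listing the registered models might look like this:

curl http://localhost:1024/v1/models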
Available parameters for testing:
--host: Host address of the Ray Serve server. Default is 0.0.0.0.
--port: Port number of the Ray Serve server. Default is 1024.
--model_id: Identifier for the model in the API responses. Default is “nemo-model”.
Configure Advanced Deployments#
For more advanced deployment scenarios:
Custom Resource Allocation:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 4 \
    --tensor_model_parallel_size 2 \
    --num_cpus 32 \
    --num_cpus_per_replica 16 \
    --cuda_visible_devices "0,1,2,3"
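In this example, the CPU budget works out to --num_replicas 2 × --num_cpus_per_replica 16 = 32 CPUs, matching --num_cpus 32.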
Legacy Checkpoint Support:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --nemo_checkpoint /opt/checkpoints/hf_llama31_8B_nemo2.nemo \
    --model_id llama \
    --legacy_ckpt \
    --num_gpus 2 \
    --tensor_model_parallel_size 2 \
    --cuda_visible_devices "0,1"
API Endpoints#
Once deployed, your NeMo model will be available through OpenAI-compatible API endpoints:
Health Check: GET /v1/health
List Models: GET /v1/models
Text Completions: POST /v1/completions/
Chat Completions: POST /v1/chat/completions/
Example API request:
curl -X POST http://localhost:1024/v1/completions/ \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"prompt": "The capital of France is",
"max_tokens": 50,
"temperature": 0.7
}'
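The chat endpoint accepts OpenAI-style chat requests. A hypothetical request (assuming the same model identifier and the standard OpenAI chat schema for the request body) might look like this:

curl -X POST http://localhost:1024/v1/chat/completions/ \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 50,
    "temperature": 0.7
  }'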
Troubleshooting#
Out of Memory Errors: Reduce --num_replicas or adjust parallelism settings.
Port Already in Use: Change the --port parameter.
Ray Cluster Issues: Ensure no other Ray processes are running (run ray stop; see the snippet after this list).
GPU Allocation: Verify --cuda_visible_devices matches your available GPUs.
Parallelism Configuration Errors: Ensure the total parallelism per replica matches the available GPUs per replica.
CUDA Device Mismatch: Make sure the number of devices in --cuda_visible_devices equals the total number of GPUs.
Checkpoint Loading Issues: Verify that the .nemo checkpoint path is correct and accessible.
Legacy Checkpoint: Use the --legacy_ckpt flag for older checkpoint formats.
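For the Ray cluster and GPU allocation issues above, a quick cleanup-and-check sequence (a sketch; adjust to your environment) might look like this:

# Stop any stray Ray processes left over from a previous deployment
ray stop
# Confirm which GPUs are visible and how much memory they are currently using
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv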
Note: Only NeMo 2.0 checkpoints are supported by default. For older checkpoints, use the --legacy_ckpt flag.
For more information on Ray Serve, visit the Ray Serve documentation.