Deploy NeMo AutoModel LLM Models using Ray#
This section demonstrates how to deploy NeMo AutoModel LLM models using Ray Serve (referred to as ‘Ray for AutoModel LLM’). To support single-node, multi-instance deployment, Ray is now offered as an alternative to Triton. Ray Serve provides a scalable and flexible platform for deploying machine learning models, offering features such as automatic scaling, load balancing, and multi-replica deployment.
Quick Example#
If you need access to the Llama-3.2-1B model, visit the Llama 3.2 Hugging Face page to request access.
Pull and run the Docker container image. Replace :vr with your desired version:

docker pull nvcr.io/nvidia/nemo:vr

docker run --gpus all -it --rm \
    --shm-size=4g \
    -p 1024:1024 \
    -v ${PWD}/:/opt/checkpoints/ \
    -w /opt/Export-Deploy \
    --name nemo-fw \
    nvcr.io/nvidia/nemo:vr
Log in to Hugging Face with your access token:
huggingface-cli login
Deploy the model to Ray:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 2 \
    --num_gpus_per_replica 1 \
    --cuda_visible_devices "0,1"
Note: If you encounter shared memory errors, increase the --shm-size value (for example, by 50% at a time).

In a separate terminal, access the running container as follows:
docker exec -it nemo-fw bash
Test the deployed model:
python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama \
    --host 0.0.0.0 \
    --port 1024
Detailed Deployment Guide#
Deploy a NeMo AutoModel LLM Model#
Follow these steps to deploy your model on Ray Serve:
Start the container as shown in the Quick Example section.
Deploy your model:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 2 \
    --num_gpus_per_replica 1 \
    --cuda_visible_devices "0,1"
Available Parameters:

--model_path: Path to a local Hugging Face model directory or model ID from the Hugging Face Hub.
--task: Task type for the Hugging Face model (currently only 'text-generation' is supported).
--device_map: Device mapping strategy for model placement (e.g., 'auto', 'sequential', etc.).
--trust_remote_code: Allow loading remote code from the Hugging Face Hub.
--model_id: Identifier for the model in the API responses.
--host: Host address to bind the Ray Serve server to. Default is 0.0.0.0.
--port: Port number to use for the Ray Serve server. Default is 1024.
--num_cpus: Number of CPUs to allocate for the Ray cluster. If None, all available CPUs are used.
--num_gpus: Number of GPUs to allocate for the Ray cluster. Default is 1.
--include_dashboard: Whether to include the Ray dashboard for monitoring.
--num_replicas: Number of model replicas to deploy. Default is 1.
--num_gpus_per_replica: Number of GPUs per model replica. Default is 1.
--num_cpus_per_replica: Number of CPUs per model replica. Default is 8.
--cuda_visible_devices: Comma-separated list of CUDA visible devices. Default is "0,1".
--max_memory: Maximum memory allocation when using the balanced device map.
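Most of these parameters have defaults, so a minimal single-replica deployment can omit them. The command below is a sketch that relies on the defaults listed above (one replica, one GPU, port 1024):

# Minimal deployment relying on the documented defaults (1 replica, 1 GPU, port 1024).
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama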
To use a different model, modify the --model_path parameter. You can specify either a local path or a Hugging Face model ID.

For models requiring authentication (e.g., StarCoder1, StarCoder2, Llama 3):
Option 1 - Log in via CLI:
huggingface-cli login
Option 2 - Set environment variable:
export HF_TOKEN=your_token_here
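If you set HF_TOKEN on the host before starting the container, you can forward it with Docker's -e flag so downloads inside the container authenticate automatically. The command below is a sketch that reuses the docker run options from the Quick Example; adjust the image tag and mounts to your setup:

# Forward the Hugging Face token from the host into the container (sketch).
docker run --gpus all -it --rm \
    --shm-size=4g \
    -p 1024:1024 \
    -e HF_TOKEN=${HF_TOKEN} \
    -v ${PWD}/:/opt/checkpoints/ \
    -w /opt/Export-Deploy \
    --name nemo-fw \
    nvcr.io/nvidia/nemo:vr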
Deploy Multiple Replicas#
Ray Serve excels at single-node multi-instance deployment. This allows you to deploy multiple instances of the same model to handle increased load:
Deploy multiple replicas using the --num_replicas parameter:

python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 4 \
    --num_gpus 4 \
    --num_gpus_per_replica 1 \
    --cuda_visible_devices "0,1,2,3"
For models that require multiple GPUs per replica:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 4 \
    --num_gpus_per_replica 2 \
    --cuda_visible_devices "0,1,2,3"
Ray automatically handles load balancing across replicas, distributing incoming requests to available instances.
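To see the load balancing in action, you can send several requests to the completions endpoint concurrently. The loop below is a minimal sketch assuming the deployment above (model_id llama, port 1024):

# Fire a batch of requests in parallel; Ray Serve spreads them across the replicas.
for i in $(seq 1 8); do
  curl -s -X POST http://localhost:1024/v1/completions/ \
    -H "Content-Type: application/json" \
    -d '{"model": "llama", "prompt": "Hello", "max_tokens": 16}' &
done
wait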
Important GPU Configuration Notes:

--num_gpus should equal --num_replicas × --num_gpus_per_replica.
--cuda_visible_devices should list all GPUs that will be used.
Ensure the number of devices in --cuda_visible_devices matches --num_gpus.
Test Ray Deployment#
Use the query_ray_deployment.py script to test your deployed model:
Basic testing:
python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama \
    --host 0.0.0.0 \
    --port 1024
The script will test multiple endpoints:
Health check endpoint: /v1/health
Models list endpoint: /v1/models
Text completions endpoint: /v1/completions/
Available parameters for testing:
--host: Host address of the Ray Serve server. Default is 0.0.0.0.
--port: Port number of the Ray Serve server. Default is 1024.
--model_id: Identifier for the model in the API responses. Default is "nemo-model".
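You can also query the endpoints by hand. The calls below are a sketch assuming the default host and port used throughout this guide:

# Quick manual checks against the endpoints the test script exercises.
curl http://localhost:1024/v1/health
curl http://localhost:1024/v1/models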
Configure Advanced Deployments#
For more advanced deployment scenarios:
Custom Resource Allocation:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 3 \
    --num_gpus 3 \
    --num_gpus_per_replica 1 \
    --num_cpus 48 \
    --num_cpus_per_replica 16 \
    --cuda_visible_devices "0,1,2"
Memory Management:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 2 \
    --num_gpus_per_replica 1 \
    --device_map balanced \
    --max_memory 75GiB \
    --cuda_visible_devices "0,1"
API Endpoints#
Once deployed, your model will be available through OpenAI-compatible API endpoints:
Health Check: GET /v1/health
List Models: GET /v1/models
Text Completions: POST /v1/completions/
Chat Completions: POST /v1/chat/completions/
Example API request:
curl -X POST http://localhost:1024/v1/completions/ \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"prompt": "The capital of France is",
"max_tokens": 50,
"temperature": 0.7
}'
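The chat completions endpoint accepts the OpenAI-style messages array. The request below is a sketch assuming the same deployment; the exact response fields may differ slightly from OpenAI's:

curl -X POST http://localhost:1024/v1/chat/completions/ \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 50,
        "temperature": 0.7
    }'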
Troubleshooting#
Out of Memory Errors: Reduce --num_replicas or --num_gpus_per_replica.
Port Already in Use: Change the --port parameter.
Ray Cluster Issues: Ensure no other Ray processes are running (ray stop).
GPU Allocation: Verify --cuda_visible_devices matches your available GPUs.
GPU Configuration Errors: Ensure --num_gpus = --num_replicas × --num_gpus_per_replica.
CUDA Device Mismatch: Make sure the number of devices in --cuda_visible_devices equals --num_gpus.
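As a quick sanity check before redeploying, you can stop any stale Ray processes and confirm which GPUs the host exposes. The commands below sketch the checks listed above:

# Stop any leftover Ray processes from a previous deployment.
ray stop
# List the GPUs visible on the host to cross-check --cuda_visible_devices and --num_gpus.
nvidia-smi --query-gpu=index,name,memory.used --format=csv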
For more information on Ray Serve, visit the Ray Serve documentation.