Deploy NeMo AutoModel LLM Models using Ray#
This section demonstrates how to deploy NeMo AutoModel LLM models using Ray Serve (referred to as ‘Ray for AutoModel LLM’). To support single-node, multi-instance deployment, Ray is now offered as an alternative to Triton. Ray Serve provides a scalable and flexible platform for deploying machine learning models, offering features such as automatic scaling, load balancing, and multi-replica deployment.
Quick Example#
If you need access to the Llama-3.2-1B model, visit the Llama 3.2 Hugging Face page to request access.
Pull and run the Docker container image. Replace :vr with your desired version:

docker pull nvcr.io/nvidia/nemo:vr

docker run --gpus all -it --rm \
    --shm-size=4g \
    -p 1024:1024 \
    -v ${PWD}/:/opt/checkpoints/ \
    -w /opt/Export-Deploy \
    --name nemo-fw \
    nvcr.io/nvidia/nemo:vr
Log in to Hugging Face with your access token:
huggingface-cli login
Deploy the model to Ray:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 2 \
    --num_gpus_per_replica 1 \
    --cuda_visible_devices "0,1"
Note: If you encounter shared memory errors, increase --shm-size gradually (in increments of roughly 50%).

In a separate terminal, access the running container as follows:
docker exec -it nemo-fw bash
Test the deployed model:
python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama \
    --host 0.0.0.0 \
    --port 1024
Detailed Deployment Guide#
Deploy a NeMo AutoModel LLM Model#
Follow these steps to deploy your model on Ray Serve:
Start the container as shown in the Quick Example section.
Deploy your model:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 2 \
    --num_gpus_per_replica 1 \
    --cuda_visible_devices "0,1"
Available Parameters:
- --model_path: Path to a local Hugging Face model directory or a model ID from the Hugging Face Hub.
- --task: Task type for the Hugging Face model (currently only 'text-generation' is supported).
- --device_map: Device mapping strategy for model placement (e.g., 'auto', 'sequential').
- --trust_remote_code: Allow loading remote code from the Hugging Face Hub.
- --model_id: Identifier for the model in the API responses.
- --host: Host address to bind the Ray Serve server to. Default is 0.0.0.0.
- --port: Port number to use for the Ray Serve server. Default is 1024.
- --num_cpus: Number of CPUs to allocate for the Ray cluster. If None, all available CPUs are used.
- --num_gpus: Number of GPUs to allocate for the Ray cluster. Default is 1.
- --include_dashboard: Whether to include the Ray dashboard for monitoring.
- --num_replicas: Number of model replicas to deploy. Default is 1.
- --num_gpus_per_replica: Number of GPUs per model replica. Default is 1.
- --num_cpus_per_replica: Number of CPUs per model replica. Default is 8.
- --cuda_visible_devices: Comma-separated list of CUDA visible devices. Default is "0,1".
- --max_memory: Maximum memory allocation when using the balanced device map.
To use a different model, modify the --model_path parameter. You can specify either a local path or a Hugging Face model ID.

For models requiring authentication (e.g., StarCoder1, StarCoder2, Llama 3):
Option 1 - Log in via CLI:
huggingface-cli login
Option 2 - Set environment variable:
export HF_TOKEN=your_token_here
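If you prefer to authenticate from Python instead of the shell, here is a minimal sketch using the huggingface_hub package (assumed to be installed in the container) that reads the token from the HF_TOKEN variable set above:

import os

from huggingface_hub import login

# Read the access token from the environment instead of hard-coding it.
login(token=os.environ["HF_TOKEN"])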
Deploy Multiple Replicas#
Ray Serve excels at single-node multi-instance deployment. This allows you to deploy multiple instances of the same model to handle increased load:
Deploy multiple replicas using the --num_replicas parameter:

python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 4 \
    --num_gpus 4 \
    --num_gpus_per_replica 1 \
    --cuda_visible_devices "0,1,2,3"
For models that require multiple GPUs per replica:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 4 \
    --num_gpus_per_replica 2 \
    --cuda_visible_devices "0,1,2,3"
Ray automatically handles load balancing across replicas, distributing incoming requests to available instances.
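To see this in practice, you can fire several requests at once and let Ray Serve spread them over the replicas. Here is a minimal sketch using the requests library (assumed to be available) against the /v1/completions/ endpoint described later on this page, with the host, port, and model_id from the deployment above:

import concurrent.futures

import requests

URL = "http://0.0.0.0:1024/v1/completions/"
PAYLOAD = {"model": "llama", "prompt": "The capital of France is", "max_tokens": 16}

def send_request(i):
    # Ray Serve routes each request to whichever replica is free.
    response = requests.post(URL, json=PAYLOAD, timeout=120)
    return i, response.status_code

# Eight concurrent requests; with four replicas, several are served in parallel.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for i, status in pool.map(send_request, range(8)):
        print(f"request {i}: HTTP {status}")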
Important GPU Configuration Notes:
- --num_gpus should equal --num_replicas × --num_gpus_per_replica.
- --cuda_visible_devices should list all GPUs that will be used.
- Ensure the number of devices in --cuda_visible_devices matches --num_gpus.
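Before launching, you can sanity-check these values with a few lines of Python; in this small sketch the variables mirror the first multi-replica example above:

num_replicas = 4
num_gpus_per_replica = 1
num_gpus = 4
cuda_visible_devices = "0,1,2,3"

# The total GPU count must cover all replicas...
assert num_gpus == num_replicas * num_gpus_per_replica
# ...and the visible-device list must contain exactly that many entries.
assert len(cuda_visible_devices.split(",")) == num_gpus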
Test Ray Deployment#
Use the query_ray_deployment.py script to test your deployed model:
Basic testing:
python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama \
    --host 0.0.0.0 \
    --port 1024
The script will test multiple endpoints:
- Health check endpoint: /v1/health
- Models list endpoint: /v1/models
- Text completions endpoint: /v1/completions/
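You can also query these endpoints directly rather than through the script; here is a minimal sketch using the requests library (assumed to be available) with the default host and port from the examples above:

import requests

BASE_URL = "http://0.0.0.0:1024"

# The health check returns HTTP 200 once the replicas are up.
print(requests.get(f"{BASE_URL}/v1/health").status_code)

# The models list should include the --model_id used at deployment (e.g., "llama").
print(requests.get(f"{BASE_URL}/v1/models").json())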
Available parameters for testing:
- --host: Host address of the Ray Serve server. Default is 0.0.0.0.
- --port: Port number of the Ray Serve server. Default is 1024.
- --model_id: Identifier for the model in the API responses. Default is nemo-model.
Configure Advanced Deployments#
For more advanced deployment scenarios:
Custom Resource Allocation:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 3 \
    --num_gpus 3 \
    --num_gpus_per_replica 1 \
    --num_cpus 48 \
    --num_cpus_per_replica 16 \
    --cuda_visible_devices "0,1,2"
Memory Management:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 2 \
    --num_gpus_per_replica 1 \
    --device_map balanced \
    --max_memory 75GiB \
    --cuda_visible_devices "0,1"
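The device_map and max_memory options follow the Hugging Face Transformers/Accelerate conventions for splitting a model across devices. As a rough illustration only (not the deployment script's actual code), a direct model load with the same settings would look roughly like this:

from transformers import AutoModelForCausalLM

# "balanced" spreads layers evenly over the visible GPUs;
# max_memory caps how much memory each device may receive.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    device_map="balanced",
    max_memory={0: "75GiB", 1: "75GiB"},
)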
API Endpoints#
Once deployed, your model will be available through OpenAI-compatible API endpoints:
- Health Check: GET /v1/health
- List Models: GET /v1/models
- Text Completions: POST /v1/completions/
- Chat Completions: POST /v1/chat/completions/
Example API request:
curl -X POST http://localhost:1024/v1/completions/ \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"prompt": "The capital of France is",
"max_tokens": 50,
"temperature": 0.7
}'
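The same endpoints can be called from Python; here is a minimal sketch using the requests library (assumed to be available), where the chat request follows the usual OpenAI-style messages schema:

import requests

# Completions request, equivalent to the curl example above.
completion = requests.post(
    "http://localhost:1024/v1/completions/",
    json={
        "model": "llama",
        "prompt": "The capital of France is",
        "max_tokens": 50,
        "temperature": 0.7,
    },
)
print(completion.json())

# Chat completions request against the chat endpoint listed above.
chat = requests.post(
    "http://localhost:1024/v1/chat/completions/",
    json={
        "model": "llama",
        "messages": [{"role": "user", "content": "Name the capital of France."}],
    },
)
print(chat.json())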
Troubleshooting#
- Out of Memory Errors: Reduce --num_replicas or --num_gpus_per_replica.
- Port Already in Use: Change the --port parameter.
- Ray Cluster Issues: Ensure no other Ray processes are running (run ray stop to clear any stale cluster).
- GPU Allocation: Verify --cuda_visible_devices matches your available GPUs.
- GPU Configuration Errors: Ensure --num_gpus = --num_replicas × --num_gpus_per_replica.
- CUDA Device Mismatch: Make sure the number of devices in --cuda_visible_devices equals --num_gpus.
For more information on Ray Serve, visit the Ray Serve documentation.
Multi-node on SLURM using ray.sub#
Use scripts/deploy/utils/ray.sub to bring up a Ray cluster across multiple SLURM nodes and run your AutoModel deployment automatically. This script starts a Ray head and workers, manages ports, and launches a driver command when the cluster is ready.
- Script location: scripts/deploy/utils/ray.sub
- Upstream reference: See the NeMo RL cluster setup doc for background on this pattern: NVIDIA-NeMo RL cluster guide.
Prerequisites#
- SLURM with container support for srun --container-image and --container-mounts.
- A container image that includes Export-Deploy at /opt/Export-Deploy.
- Any model access/auth if required (e.g., huggingface-cli login or HF_TOKEN).
Quick start (2 nodes, 16 GPUs total)#
Set the environment variables used by ray.sub:
export CONTAINER=nvcr.io/nvidia/nemo:vr
export MOUNTS="${PWD}/:/opt/checkpoints/"
export GPUS_PER_NODE=8
# Driver command to run after the cluster is ready (multi-node AutoModel deployment)
export COMMAND="python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py --model_path meta-llama/Llama-3.2-1B --model_id llama --num_replicas 16 --num_gpus 16 --num_gpus_per_replica 1"
Submit the job:
sbatch --nodes=2 --account <ACCOUNT> --partition <PARTITION> \
--job-name automodel-ray --time 01:00:00 \
/opt/Export-Deploy/scripts/deploy/utils/ray.sub
The script will:
- Start a Ray head on node 0 and one Ray worker per remaining node.
- Wait until all nodes register their resources.
- Launch the COMMAND on the head node (driver) once the cluster is healthy.
Attaching and monitoring:
- Logs: $SLURM_SUBMIT_DIR/<jobid>-logs/ contains ray-head.log and ray-worker-<n>.log.
- Interactive shell: the job creates <jobid>-attach.sh. For the head node: bash <jobid>-attach.sh. For worker i: bash <jobid>-attach.sh i.
- Ray status: once attached to the head container, run ray status.
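Besides ray status, you can inspect the cluster programmatically once attached to the head container; here is a minimal sketch using the Ray Python API, assuming the Python environment can reach the cluster started by ray.sub:

import ray

# Attach to the existing cluster rather than starting a new one.
ray.init(address="auto")

# Aggregate CPU/GPU resources registered across all nodes.
print(ray.cluster_resources())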
Query the deployment (from within the head container):
python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
--model_id llama --host 0.0.0.0 --port 1024
Notes#
- Set --num_gpus in the deploy command to the total GPUs across all nodes; ensure --num_gpus = --num_replicas × --num_gpus_per_replica.
- If your cluster uses GRES, ray.sub auto-detects and sets --gres=gpu:<GPUS_PER_NODE>; ensure GPUS_PER_NODE matches the node GPU count.
- You usually do not need to set --cuda_visible_devices for multi-node; Ray workers handle per-node visibility.