Deploy NeMo AutoModel LLM Models using Ray#
This section demonstrates how to deploy NeMo AutoModel LLM models using Ray Serve (referred to as ‘Ray for AutoModel LLM’). To support single-node, multi-instance deployment, Ray is now offered as an alternative to Triton. Ray Serve provides a scalable and flexible platform for deploying machine learning models, offering features such as automatic scaling, load balancing, and multi-replica deployment.
Quick Example#
If you need access to the Llama-3.2-1B model, visit the Llama 3.2 Hugging Face page to request access.
Pull and run the Docker container image. Replace :vr with your desired version:

docker pull nvcr.io/nvidia/nemo:vr

docker run --gpus all -it --rm \
    --shm-size=4g \
    -p 1024:1024 \
    -v ${PWD}/:/opt/checkpoints/ \
    -w /opt/Export-Deploy \
    --name nemo-fw \
    nvcr.io/nvidia/nemo:vr
Log in to Hugging Face with your access token:
huggingface-cli login

Deploy the model to Ray:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 2 \
    --num_gpus_per_replica 1 \
    --cuda_visible_devices "0,1"
Note: If you encounter shared memory errors, increase --shm-size gradually by 50%.

In a separate terminal, access the running container as follows:
docker exec -it nemo-fw bash
Test the deployed model:
python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama \
    --host 0.0.0.0 \
    --port 1024
Detailed Deployment Guide#
Deploy a NeMo AutoModel LLM Model#
Follow these steps to deploy your model on Ray Serve:
Start the container as shown in the Quick Example section.
Deploy your model:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 2 \
    --num_gpus_per_replica 1 \
    --cuda_visible_devices "0,1"
Available Parameters:
- --model_path: Path to a local Hugging Face model directory or model ID from the Hugging Face Hub.
- --task: Task type for the Hugging Face model (currently only 'text-generation' is supported).
- --device_map: Device mapping strategy for model placement (e.g., 'auto', 'sequential').
- --trust_remote_code: Allow loading remote code from the Hugging Face Hub.
- --model_id: Identifier for the model in the API responses.
- --host: Host address to bind the Ray Serve server to. Default is 0.0.0.0.
- --port: Port number to use for the Ray Serve server. Default is 1024.
- --num_cpus: Number of CPUs to allocate for the Ray cluster. If None, all available CPUs are used.
- --num_gpus: Number of GPUs to allocate for the Ray cluster. Default is 1.
- --include_dashboard: Whether to include the Ray dashboard for monitoring.
- --num_replicas: Number of model replicas to deploy. Default is 1.
- --num_gpus_per_replica: Number of GPUs per model replica. Default is 1.
- --num_cpus_per_replica: Number of CPUs per model replica. Default is 8.
- --cuda_visible_devices: Comma-separated list of CUDA visible devices. Default is "0,1".
- --max_memory: Maximum memory allocation when using the balanced device map.
To use a different model, modify the --model_path parameter. You can specify either a local path or a Hugging Face model ID.

For models requiring authentication (e.g., StarCoder1, StarCoder2, Llama 3):
Option 1 - Log in via CLI:

huggingface-cli login

Option 2 - Set environment variable:
export HF_TOKEN=your_token_here
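Alternatively, you can authenticate programmatically. The following is a minimal sketch, assuming the huggingface_hub Python package is available in the container:

```python
# Minimal sketch: authenticate to the Hugging Face Hub from Python instead of
# the CLI. Assumes the huggingface_hub package is installed in the container.
from huggingface_hub import login

login(token="your_token_here")  # or rely on the HF_TOKEN environment variable
```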
Deploy Multiple Replicas#
Ray Serve excels at single-node multi-instance deployment. This allows you to deploy multiple instances of the same model to handle increased load:
Deploy multiple replicas using the --num_replicas parameter:

python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 4 \
    --num_gpus 4 \
    --num_gpus_per_replica 1 \
    --cuda_visible_devices "0,1,2,3"
For models that require multiple GPUs per replica:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 4 \
    --num_gpus_per_replica 2 \
    --cuda_visible_devices "0,1,2,3"
Ray automatically handles load balancing across replicas, distributing incoming requests to available instances.
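To exercise the load balancing, here is a minimal sketch that sends several requests concurrently. It assumes the host, port, and model_id from the examples above, and the OpenAI-compatible /v1/completions/ request format shown later in this guide:

```python
# Minimal sketch: send several completion requests concurrently so Ray Serve
# can distribute them across the deployed replicas. Host, port, and model_id
# are assumptions taken from the deployment examples above; the request body
# follows the OpenAI-compatible /v1/completions/ format shown in this guide.
import concurrent.futures

import requests

URL = "http://0.0.0.0:1024/v1/completions/"


def send_request(i: int) -> dict:
    payload = {
        "model": "llama",
        "prompt": f"Request {i}: The capital of France is",
        "max_tokens": 20,
    }
    response = requests.post(URL, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()


# Fire eight requests in parallel; Ray Serve load-balances them across replicas.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for result in pool.map(send_request, range(8)):
        print(result)
```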
Important GPU Configuration Notes:

- --num_gpus should equal --num_replicas × --num_gpus_per_replica.
- --cuda_visible_devices should list all GPUs that will be used.
- Ensure the number of devices in --cuda_visible_devices matches --num_gpus (see the sketch below).
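These constraints can be sanity-checked before launching the deployment script. The values in this minimal sketch mirror the multi-GPU example above and are only illustrative:

```python
# Minimal sketch: validate the GPU math before launching deploy_ray_hf.py.
# The values below mirror the multi-GPU example above and are assumptions.
num_replicas = 2
num_gpus_per_replica = 2
num_gpus = 4
cuda_visible_devices = "0,1,2,3"

assert num_gpus == num_replicas * num_gpus_per_replica, (
    "--num_gpus must equal --num_replicas x --num_gpus_per_replica"
)
assert len(cuda_visible_devices.split(",")) == num_gpus, (
    "--cuda_visible_devices must list exactly --num_gpus devices"
)
print("GPU configuration is consistent")
```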
Test Ray Deployment#
Use the query_ray_deployment.py script to test your deployed model:
Basic testing:
python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama \
    --host 0.0.0.0 \
    --port 1024
The script will test multiple endpoints:
- Health check endpoint: /v1/health
- Models list endpoint: /v1/models
- Text completions endpoint: /v1/completions/
Available parameters for testing:
- --host: Host address of the Ray Serve server. Default is 0.0.0.0.
- --port: Port number of the Ray Serve server. Default is 1024.
- --model_id: Identifier for the model in the API responses. Default is "nemo-model".
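If you prefer to probe the endpoints directly rather than use the script, the following is a minimal sketch with the requests library; the host and port are assumptions matching the deployment examples above:

```python
# Minimal sketch: probe the health and model-listing endpoints directly with
# requests, as an alternative to query_ray_deployment.py. Host and port are
# assumptions matching the deployment examples above.
import requests

BASE_URL = "http://0.0.0.0:1024"

health = requests.get(f"{BASE_URL}/v1/health", timeout=30)
print("health:", health.status_code, health.text)

models = requests.get(f"{BASE_URL}/v1/models", timeout=30)
print("models:", models.status_code, models.json())
```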
Configure Advanced Deployments#
For more advanced deployment scenarios:
Custom Resource Allocation:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 3 \
    --num_gpus 3 \
    --num_gpus_per_replica 1 \
    --num_cpus 48 \
    --num_cpus_per_replica 16 \
    --cuda_visible_devices "0,1,2"
Memory Management:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 2 \
    --num_gpus_per_replica 1 \
    --device_map balanced \
    --max_memory 75GiB \
    --cuda_visible_devices "0,1"
API Endpoints#
Once deployed, your model will be available through OpenAI-compatible API endpoints:
- Health Check: GET /v1/health
- List Models: GET /v1/models
- Text Completions: POST /v1/completions/
- Chat Completions: POST /v1/chat/completions/
Example API request:
curl -X POST http://localhost:1024/v1/completions/ \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"prompt": "The capital of France is",
"max_tokens": 50,
"temperature": 0.7
}'
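A chat completions request can be sent the same way. The sketch below assumes the OpenAI-compatible message schema and the host, port, and model_id from the examples above:

```python
# Minimal sketch of a chat completions request; the message format follows the
# OpenAI-compatible schema, and host, port, and model_id are assumptions taken
# from the examples above.
import requests

response = requests.post(
    "http://localhost:1024/v1/chat/completions/",
    json={
        "model": "llama",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 50,
        "temperature": 0.7,
    },
    timeout=120,
)
print(response.json())
```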
Troubleshooting#
- Out of Memory Errors: Reduce --num_replicas or --num_gpus_per_replica.
- Port Already in Use: Change the --port parameter.
- Ray Cluster Issues: Ensure no other Ray processes are running: ray stop.
- GPU Allocation: Verify --cuda_visible_devices matches your available GPUs (see the diagnostic sketch below).
- GPU Configuration Errors: Ensure --num_gpus = --num_replicas × --num_gpus_per_replica.
- CUDA Device Mismatch: Make sure the number of devices in --cuda_visible_devices equals --num_gpus.
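The port and GPU checks above can be scripted. The following is a minimal diagnostic sketch; the port value is an assumption taken from the earlier examples:

```python
# Minimal diagnostic sketch: check whether the chosen port is free and list the
# GPUs the driver can see before deploying. The port value is an assumption.
import socket
import subprocess

port = 1024
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    in_use = sock.connect_ex(("127.0.0.1", port)) == 0
print(f"port {port} already in use: {in_use}")

# Compare the listed GPUs against --cuda_visible_devices and --num_gpus.
print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)
```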
For more information on Ray Serve, visit the Ray Serve documentation.