Deploy NeMo AutoModel LLM Models using Ray#
This section demonstrates how to deploy NeMo AutoModel LLM models using Ray Serve (referred to as ‘Ray for AutoModel LLM’). To support single-node, multi-instance deployment, Ray is now offered as an alternative to Triton. Ray Serve provides a scalable and flexible platform for deploying machine learning models, offering features such as automatic scaling, load balancing, and multi-replica deployment.
Quick Example#
If you need access to the Llama-3.2-1B model, visit the Llama 3.2 Hugging Face page to request access.
Pull and run the Docker container image. Replace :vr with your desired version:

docker pull nvcr.io/nvidia/nemo:vr

docker run --gpus all -it --rm \
    --shm-size=4g \
    -p 1024:1024 \
    -v ${PWD}/:/opt/checkpoints/ \
    -w /opt/Export-Deploy \
    --name nemo-fw \
    nvcr.io/nvidia/nemo:vr
Log in to Hugging Face with your access token:
huggingface-cli login

Deploy the model to Ray:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 2 \
    --num_gpus_per_replica 1 \
    --cuda_visible_devices "0,1"
Note: If you encounter shared memory errors, increase --shm-size gradually by 50%.

In a separate terminal, access the running container as follows:
docker exec -it nemo-fw bash
Test the deployed model:
python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama \
    --host 0.0.0.0 \
    --port 1024
Detailed Deployment Guide#
Deploy a NeMo AutoModel LLM Model#
Follow these steps to deploy your model on Ray Serve:
Start the container as shown in the Quick Example section.
Deploy your model:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 2 \
    --num_gpus_per_replica 1 \
    --cuda_visible_devices "0,1"
Available Parameters:
- --model_path: Path to a local Hugging Face model directory or model ID from the Hugging Face Hub.
- --task: Task type for the Hugging Face model (currently only 'text-generation' is supported).
- --device_map: Device mapping strategy for model placement (e.g., 'auto', 'sequential').
- --trust_remote_code: Allow loading remote code from the Hugging Face Hub.
- --model_id: Identifier for the model in the API responses.
- --host: Host address to bind the Ray Serve server to. Default is 0.0.0.0.
- --port: Port number to use for the Ray Serve server. Default is 1024.
- --num_cpus: Number of CPUs to allocate for the Ray cluster. If None, all available CPUs are used.
- --num_gpus: Number of GPUs to allocate for the Ray cluster. Default is 1.
- --include_dashboard: Whether to include the Ray dashboard for monitoring.
- --num_replicas: Number of model replicas to deploy. Default is 1.
- --num_gpus_per_replica: Number of GPUs per model replica. Default is 1.
- --num_cpus_per_replica: Number of CPUs per model replica. Default is 8.
- --cuda_visible_devices: Comma-separated list of CUDA visible devices. Default is "0,1".
- --max_memory: Maximum memory allocation when using the balanced device map.
To use a different model, modify the --model_path parameter. You can specify either a local path or a Hugging Face model ID.

For models requiring authentication (e.g., StarCoder1, StarCoder2, Llama 3):
Option 1 - Log in via CLI:

huggingface-cli login

Option 2 - Set environment variable:
export HF_TOKEN=your_token_here
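Alternatively, you can authenticate programmatically. The following is a minimal sketch, assuming the huggingface_hub Python package is available in the container:

```python
# Minimal sketch: authenticate to the Hugging Face Hub from Python instead of
# the CLI. Assumes the huggingface_hub package is installed in the container.
from huggingface_hub import login

login(token="your_token_here")  # or rely on the HF_TOKEN environment variable
```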
Deploy Multiple Replicas#
Ray Serve excels at single-node multi-instance deployment. This allows you to deploy multiple instances of the same model to handle increased load:
Deploy multiple replicas using the --num_replicas parameter:

python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 4 \
    --num_gpus 4 \
    --num_gpus_per_replica 1 \
    --cuda_visible_devices "0,1,2,3"
For models that require multiple GPUs per replica:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 4 \
    --num_gpus_per_replica 2 \
    --cuda_visible_devices "0,1,2,3"
Ray automatically handles load balancing across replicas, distributing incoming requests to available instances.
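To exercise the load balancing, here is a minimal sketch that sends several requests concurrently. It assumes the host, port, and model_id from the examples above, and the OpenAI-compatible /v1/completions/ request format shown later in this guide:

```python
# Minimal sketch: send several completion requests concurrently so Ray Serve
# can distribute them across the deployed replicas. Host, port, and model_id
# are assumptions taken from the deployment examples above; the request body
# follows the OpenAI-compatible /v1/completions/ format shown in this guide.
import concurrent.futures

import requests

URL = "http://0.0.0.0:1024/v1/completions/"


def send_request(i: int) -> dict:
    payload = {
        "model": "llama",
        "prompt": f"Request {i}: The capital of France is",
        "max_tokens": 20,
    }
    response = requests.post(URL, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()


# Fire eight requests in parallel; Ray Serve load-balances them across replicas.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    for result in pool.map(send_request, range(8)):
        print(result)
```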
Important GPU Configuration Notes:

- --num_gpus should equal --num_replicas × --num_gpus_per_replica.
- --cuda_visible_devices should list all GPUs that will be used.
- Ensure the number of devices in --cuda_visible_devices matches --num_gpus (see the sketch below).
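These constraints can be sanity-checked before launching the deployment script. The values in this minimal sketch mirror the multi-GPU example above and are only illustrative:

```python
# Minimal sketch: validate the GPU math before launching deploy_ray_hf.py.
# The values below mirror the multi-GPU example above and are assumptions.
num_replicas = 2
num_gpus_per_replica = 2
num_gpus = 4
cuda_visible_devices = "0,1,2,3"

assert num_gpus == num_replicas * num_gpus_per_replica, (
    "--num_gpus must equal --num_replicas x --num_gpus_per_replica"
)
assert len(cuda_visible_devices.split(",")) == num_gpus, (
    "--cuda_visible_devices must list exactly --num_gpus devices"
)
print("GPU configuration is consistent")
```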
Test Ray Deployment#
Use the query_ray_deployment.py script to test your deployed model:
Basic testing:
python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama \
    --host 0.0.0.0 \
    --port 1024
The script will test multiple endpoints:
- Health check endpoint: /v1/health
- Models list endpoint: /v1/models
- Text completions endpoint: /v1/completions/
Available parameters for testing:
- --host: Host address of the Ray Serve server. Default is 0.0.0.0.
- --port: Port number of the Ray Serve server. Default is 1024.
- --model_id: Identifier for the model in the API responses. Default is "nemo-model".
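If you prefer to probe the endpoints directly rather than use the script, the following is a minimal sketch with the requests library; the host and port are assumptions matching the deployment examples above:

```python
# Minimal sketch: probe the health and model-listing endpoints directly with
# requests, as an alternative to query_ray_deployment.py. Host and port are
# assumptions matching the deployment examples above.
import requests

BASE_URL = "http://0.0.0.0:1024"

health = requests.get(f"{BASE_URL}/v1/health", timeout=30)
print("health:", health.status_code, health.text)

models = requests.get(f"{BASE_URL}/v1/models", timeout=30)
print("models:", models.status_code, models.json())
```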
Configure Advanced Deployments#
For more advanced deployment scenarios:
Custom Resource Allocation:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 3 \
    --num_gpus 3 \
    --num_gpus_per_replica 1 \
    --num_cpus 48 \
    --num_cpus_per_replica 16 \
    --cuda_visible_devices "0,1,2"
Memory Management:
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_hf.py \
    --model_path meta-llama/Llama-3.2-1B \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 2 \
    --num_gpus_per_replica 1 \
    --device_map balanced \
    --max_memory 75GiB \
    --cuda_visible_devices "0,1"
API Endpoints#
Once deployed, your model will be available through OpenAI-compatible API endpoints:
- Health Check: GET /v1/health
- List Models: GET /v1/models
- Text Completions: POST /v1/completions/
- Chat Completions: POST /v1/chat/completions/
Example API request:
curl -X POST http://localhost:1024/v1/completions/ \
-H "Content-Type: application/json" \
-d '{
"model": "llama",
"prompt": "The capital of France is",
"max_tokens": 50,
"temperature": 0.7
}'
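A chat completions request can be sent the same way. The sketch below assumes the OpenAI-compatible message schema and the host, port, and model_id from the examples above:

```python
# Minimal sketch of a chat completions request; the message format follows the
# OpenAI-compatible schema, and host, port, and model_id are assumptions taken
# from the examples above.
import requests

response = requests.post(
    "http://localhost:1024/v1/chat/completions/",
    json={
        "model": "llama",
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 50,
        "temperature": 0.7,
    },
    timeout=120,
)
print(response.json())
```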
Troubleshooting#
- Out of Memory Errors: Reduce --num_replicas or --num_gpus_per_replica.
- Port Already in Use: Change the --port parameter.
- Ray Cluster Issues: Ensure no other Ray processes are running: ray stop.
- GPU Allocation: Verify --cuda_visible_devices matches your available GPUs (see the diagnostic sketch below).
- GPU Configuration Errors: Ensure --num_gpus = --num_replicas × --num_gpus_per_replica.
- CUDA Device Mismatch: Make sure the number of devices in --cuda_visible_devices equals --num_gpus.
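The port and GPU checks above can be scripted. The following is a minimal diagnostic sketch; the port value is an assumption taken from the earlier examples:

```python
# Minimal diagnostic sketch: check whether the chosen port is free and list the
# GPUs the driver can see before deploying. The port value is an assumption.
import socket
import subprocess

port = 1024
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    in_use = sock.connect_ex(("127.0.0.1", port)) == 0
print(f"port {port} already in use: {in_use}")

# Compare the listed GPUs against --cuda_visible_devices and --num_gpus.
print(subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout)
```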
For more information on Ray Serve, visit the Ray Serve documentation.