# Deploy Megatron-Bridge LLMs with Ray Serve
This section demonstrates how to deploy Megatron-Bridge LLM models using Ray Serve. Ray Serve provides scalable, flexible deployment for Megatron-Bridge LLMs, offering automatic scaling, load balancing, and multi-replica deployment with support for advanced parallelism strategies.
Note: Single-node examples are shown below. For multi-node clusters managed by SLURM, you can deploy across nodes using the ray.sub helper described in the section “Multi-node on SLURM using ray.sub”.
## Quick Example
Follow the steps on the Generate A Megatron-Bridge Checkpoint page to generate a Megatron-Bridge Llama checkpoint.
In a terminal, go to the folder where the hf_llama31_8B_mbridge checkpoint is located. Pull and run the Docker container image, replacing `vr` with your desired version:

```shell
docker pull nvcr.io/nvidia/nemo:vr
docker run --gpus all -it --rm \
    --shm-size=4g \
    -p 1024:1024 \
    -v ${PWD}/:/opt/checkpoints/ \
    -w /opt/Export-Deploy \
    --name nemo-fw \
    nvcr.io/nvidia/nemo:vr
```
Deploy the Megatron-Bridge LLM with Ray Serve:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id llama \
    --num_replicas 1 \
    --num_gpus 1 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0"
```
In a separate terminal, access the running container as follows:
```shell
docker exec -it nemo-fw bash
```
Test the deployed model:
```shell
python scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama \
    --host 0.0.0.0 \
    --port 1024
```
## Detailed Deployment Guide

### Deploy a Megatron-Bridge LLM
Follow these steps to deploy your Megatron-Bridge model on Ray Serve:
Start the container as shown in the Quick Example section.
Deploy your model:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id llama \
    --num_replicas 1 \
    --num_gpus 2 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0,1"
```
Available Parameters:
- `--megatron_checkpoint`: Path to the Megatron-Bridge checkpoint directory (required).
- `--num_gpus`: Number of GPUs to use per node. Default is 1.
- `--tensor_model_parallel_size`: Size of the tensor model parallelism. Default is 1.
- `--pipeline_model_parallel_size`: Size of the pipeline model parallelism. Default is 1.
- `--expert_model_parallel_size`: Size of the expert model parallelism. Default is 1.
- `--context_parallel_size`: Size of the context parallelism. Default is 1.
- `--model_id`: Identifier for the model in the API responses. Default is `nemo-model`.
- `--host`: Host address to bind the Ray Serve server to. Default is 0.0.0.0.
- `--port`: Port number to use for the Ray Serve server. Default is 1024.
- `--num_cpus`: Number of CPUs to allocate for the Ray cluster. If unset, all available CPUs are used.
- `--num_cpus_per_replica`: Number of CPUs per model replica. Default is 8.
- `--include_dashboard`: Whether to include the Ray dashboard for monitoring.
- `--cuda_visible_devices`: Comma-separated list of CUDA visible devices. Default is "0,1".
- `--enable_cuda_graphs`: Whether to enable CUDA graphs for faster inference.
- `--enable_flash_decode`: Whether to enable Flash Attention decode.
- `--num_replicas`: Number of replicas for the deployment. Default is 1.
To use a different model, modify the `--megatron_checkpoint` parameter with the path to your Megatron-Bridge checkpoint directory.
### Configure Model Parallelism
Megatron-Bridge models support advanced parallelism strategies for large model deployment:
Tensor Model Parallelism: Splits each layer's weights across multiple GPUs:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id large_llama \
    --num_gpus 4 \
    --tensor_model_parallel_size 4 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0,1,2,3"
```
Pipeline Model Parallelism: Distributes model layers sequentially across GPUs:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id large_llama \
    --num_gpus 4 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 4 \
    --cuda_visible_devices "0,1,2,3"
```
Combined Parallelism: Uses both tensor and pipeline parallelism:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id large_llama \
    --num_gpus 8 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 4 \
    --cuda_visible_devices "0,1,2,3,4,5,6,7"
```
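When combining strategies, the tensor and pipeline sizes must multiply to the GPU count used per replica. As an illustration (a hypothetical helper, not part of the deployment scripts), this sketch enumerates the (TP, PP) pairs that exactly use a given number of GPUs:

```python
# Illustrative helper (not part of Export-Deploy): list (tensor, pipeline)
# parallel sizes whose product exactly equals the available GPU count.
def parallelism_options(num_gpus: int) -> list[tuple[int, int]]:
    return [
        (tp, num_gpus // tp)
        for tp in range(1, num_gpus + 1)
        if num_gpus % tp == 0
    ]

# For 8 GPUs, valid splits include TP=2 x PP=4, as in the command above.
print(parallelism_options(8))  # [(1, 8), (2, 4), (4, 2), (8, 1)]
```

Which split performs best depends on the model and interconnect; tensor parallelism is typically preferred within a node, pipeline parallelism across nodes.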
### Deploy Multiple Replicas
Deploy multiple replicas of your Megatron-Bridge model for increased throughput:
Single GPU per replica:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id llama \
    --num_replicas 4 \
    --num_gpus 4 \
    --tensor_model_parallel_size 1 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0,1,2,3"
```
Multiple GPUs per replica with tensor parallelism:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id large_llama \
    --num_replicas 2 \
    --num_gpus 8 \
    --tensor_model_parallel_size 4 \
    --pipeline_model_parallel_size 1 \
    --cuda_visible_devices "0,1,2,3,4,5,6,7"
```
Important GPU Configuration Notes:

- GPUs per replica = total GPUs ÷ `--num_replicas`.
- Each replica needs `--tensor_model_parallel_size` × `--pipeline_model_parallel_size` × `--context_parallel_size` GPUs.
- Ensure `--cuda_visible_devices` lists all GPUs that will be used.
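This arithmetic can be sanity-checked before launching. The following is a minimal sketch with hypothetical helper names, not part of the deployment scripts:

```python
# Illustrative pre-flight check (hypothetical helpers, not part of
# Export-Deploy): each replica needs TP x PP x CP GPUs, and all replicas
# together must account for the total GPU count given via --num_gpus.
def gpus_per_replica(tp: int, pp: int, cp: int = 1) -> int:
    return tp * pp * cp

def fits(total_gpus: int, num_replicas: int, tp: int, pp: int, cp: int = 1) -> bool:
    return total_gpus == num_replicas * gpus_per_replica(tp, pp, cp)

# The two multi-replica examples above:
assert fits(total_gpus=4, num_replicas=4, tp=1, pp=1)  # 4 replicas, 1 GPU each
assert fits(total_gpus=8, num_replicas=2, tp=4, pp=1)  # 2 replicas, TP=4 each
```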
### Optimize Performance
Enable performance optimizations for faster inference:
Flash Attention Decode: Optimizes attention computation:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id llama \
    --enable_flash_decode \
    --num_gpus 2 \
    --tensor_model_parallel_size 2 \
    --cuda_visible_devices "0,1"
```
Flash Attention Decode and CUDA Graphs:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id llama \
    --enable_cuda_graphs \
    --enable_flash_decode \
    --num_gpus 4 \
    --tensor_model_parallel_size 2 \
    --pipeline_model_parallel_size 2 \
    --cuda_visible_devices "0,1,2,3"
```
### Test Ray Deployment
Use the query_ray_deployment.py script to test your deployed Megatron-Bridge model:
Basic testing:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama \
    --host 0.0.0.0 \
    --port 1024
```
The script will test multiple endpoints:

- Health check endpoint: `/v1/health`
- Models list endpoint: `/v1/models`
- Text completions endpoint: `/v1/completions/`
Available parameters for testing:
- `--host`: Host address of the Ray Serve server. Default is 0.0.0.0.
- `--port`: Port number of the Ray Serve server. Default is 1024.
- `--model_id`: Identifier for the model in the API responses. Default is `nemo-model`.
### Configure Advanced Deployments
For more advanced deployment scenarios:
Custom Resource Allocation:
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py \
    --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ \
    --model_id llama \
    --num_replicas 2 \
    --num_gpus 4 \
    --tensor_model_parallel_size 2 \
    --num_cpus 32 \
    --num_cpus_per_replica 16 \
    --cuda_visible_devices "0,1,2,3"
```
### API Endpoints
Once deployed, your Megatron-Bridge model will be available through OpenAI-compatible API endpoints:
- Health Check: `GET /v1/health`
- List Models: `GET /v1/models`
- Text Completions: `POST /v1/completions/`
- Chat Completions: `POST /v1/chat/completions/`
Example API request:
```shell
curl -X POST http://localhost:1024/v1/completions/ \
    -H "Content-Type: application/json" \
    -d '{
        "model": "llama",
        "prompt": "The capital of France is",
        "max_tokens": 50,
        "temperature": 0.7
    }'
```
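The same request can be sent from Python with only the standard library. This sketch assumes the deployment from this guide is listening on localhost:1024; adjust host and port to match your setup:

```python
import json
import urllib.request

# Same payload as the curl example above.
payload = {
    "model": "llama",
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:1024/v1/completions/",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

try:
    with urllib.request.urlopen(req, timeout=30) as resp:
        print(json.loads(resp.read()))
except OSError as exc:  # e.g. connection refused when no server is running
    print(f"Request failed: {exc}")
```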
### Troubleshooting
- Out of Memory Errors: Reduce `--num_replicas` or adjust parallelism settings.
- Port Already in Use: Change the `--port` parameter.
- Ray Cluster Issues: Ensure no other Ray processes are running: `ray stop`.
- GPU Allocation: Verify `--cuda_visible_devices` matches your available GPUs.
- Parallelism Configuration Errors: Ensure total parallelism per replica matches available GPUs per replica.
- CUDA Device Mismatch: Make sure the number of devices in `--cuda_visible_devices` equals the total number of GPUs.
- Checkpoint Loading Issues: Verify the Megatron-Bridge checkpoint directory path is correct and contains the expected `iter_*` subdirectory (e.g. `iter_0000000`).
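For the checkpoint issue in particular, a quick pre-flight check can confirm the expected layout. This is an illustrative helper (not part of Export-Deploy); the path shown is the one used in this guide's examples:

```python
from pathlib import Path

# Illustrative check (not part of Export-Deploy): a valid Megatron-Bridge
# checkpoint root should contain at least one iter_* subdirectory.
def find_iter_dirs(checkpoint_root: str) -> list[str]:
    root = Path(checkpoint_root)
    return sorted(p.name for p in root.glob("iter_*") if p.is_dir())

dirs = find_iter_dirs("/opt/checkpoints/hf_llama31_8B_mbridge")
print(dirs or "No iter_* subdirectories found -- check the checkpoint path.")
```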
For more information on Ray Serve, visit the Ray Serve documentation.
## Multi-node on SLURM using ray.sub
Use scripts/deploy/utils/ray.sub to bring up a Ray cluster across multiple SLURM nodes and run your in-framework Megatron-Bridge deployment automatically. This script configures the Ray head and workers, handles ports, and can optionally run a driver command once the cluster is online.
- Script location: `scripts/deploy/utils/ray.sub`
- Upstream reference: See the NeMo RL cluster setup doc for background on this pattern: NVIDIA-NeMo RL cluster guide
### Prerequisites
- A SLURM cluster with container support for `srun --container-image` and `--container-mounts`.
- A container image that includes Export-Deploy at `/opt/Export-Deploy` and the needed dependencies.
- A Megatron-Bridge checkpoint directory (e.g. `.../hf_llama31_8B_mbridge/iter_0000000/`) accessible on the cluster filesystem.
### Quick start (2 nodes, 16 GPUs total)
Set environment variables to parameterize `ray.sub` (these are read by the script at submission time):
```shell
export CONTAINER=nvcr.io/nvidia/nemo:vr
export MOUNTS="${PWD}/:/opt/checkpoints/"
# Optional tuning
export GPUS_PER_NODE=8  # default 8; set to your node GPU count
# Driver command to run after the cluster is ready (multi-node Megatron-Bridge deployment)
export COMMAND="python /opt/Export-Deploy/scripts/deploy/nlp/deploy_ray_inframework.py --megatron_checkpoint /opt/checkpoints/hf_llama31_8B_mbridge/iter_0000000/ --model_id llama --num_replicas 16 --num_gpus 16"
```
Submit the job (you can override SBATCH directives on the command line):
```shell
sbatch --nodes=2 --account <ACCOUNT> --partition <PARTITION> \
    --job-name nemo-ray --time 01:00:00 \
    /opt/Export-Deploy/scripts/deploy/utils/ray.sub
```
The script will:
- Start a Ray head on node 0 and one Ray worker per remaining node.
- Wait until all nodes register their resources.
- Launch the `COMMAND` on the head node (driver) once the cluster is healthy.
Attaching and monitoring:
- Logs: `$SLURM_SUBMIT_DIR/<jobid>-logs/` contains `ray-head.log`, `ray-worker-<n>.log`, and (if set) synced Ray logs.
- Interactive shell: the job creates `<jobid>-attach.sh`. For the head: `bash <jobid>-attach.sh`. For worker i: `bash <jobid>-attach.sh i`.
- Ray status: once attached to the head container, run `ray status`.
Query the deployment (from within the head container):
```shell
python /opt/Export-Deploy/scripts/deploy/nlp/query_ray_deployment.py \
    --model_id llama --host 0.0.0.0 --port 1024
```
### Notes
- Set `--num_gpus` in the deploy command to the total GPUs across all nodes; adjust `--num_replicas` and model parallel sizes per your topology.
- If your cluster uses GRES, `ray.sub` auto-detects and sets `--gres=gpu:<GPUS_PER_NODE>`; ensure `GPUS_PER_NODE` matches the node's GPU count.
- You can leave `--cuda_visible_devices` unset for multi-node runs; per-node visibility is managed by Ray workers.
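The first note is simple arithmetic. As an illustration (hypothetical helper, not part of the deployment scripts), the quick start's 2 nodes with 8 GPUs each give `--num_gpus 16`:

```python
# Illustrative arithmetic for the multi-node deploy command: --num_gpus should
# be nodes x GPUS_PER_NODE (the quick start uses 2 nodes x 8 GPUs = 16).
def total_gpus(num_nodes: int, gpus_per_node: int = 8) -> int:
    return num_nodes * gpus_per_node

assert total_gpus(2) == 16  # matches the 2-node, 16-GPU quick start
```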